In the development branch of webwork (58df67e), we're seeing student answers vanish in problems involving many (more than 32?) inputs (thus in problems involving, say, matrices)--or even in problems involving very long text inputs.
Examples of problems where this is happening:
many inputs: Library/TCNJ/TCNJ_MatrixInverse/problem14.pg
long text inputs: Library/Hope/Calc1/00-00-Essays/GQ_Limits_11.pg
This happens only when answers are loaded from the 'last_answer' field in the 'problem_user' table, which I suppose doesn't happen when answers are submitted, previewed or checked. Thus, the problem arises when a student returns to finish or look at a (long) problem they started earlier, at which stage any inputs beyond, for instance, the 32nd--or the latter parts of a long text answer--are no longer visible. Although the inputs remain in the problem_user table until they are overwritten, the students of course have to enter them anew if they intend to move forward.
The exact text of the warning is as follows:
Odd number of elements in hash assignment at /opt/webwork/webwork2/lib/WeBWorK/ContentGenerator/Problem.pm line 698.
where line 698 is:
my %oldAnswers = decodeAnswers($problem->last_answer);
Which is wrapped in the conditional:
if (not ($submitAnswers or $previewAnswers or $checkAnswers) and $will{showOldAnswers})
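For readers unfamiliar with the warning itself, here is a minimal sketch of how Perl produces it, independent of WeBWorK. The list @decoded is a hypothetical stand-in for whatever decodeAnswers returns when the stored data have lost their final value; it is not taken from the real code.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a decoded answer list that lost its final value.
my @decoded = ('AnSwEr0001', '42', 'AnSwEr0002');

my @warnings;
my %oldAnswers;
{
    # Collect warnings in an array instead of sending them to STDERR.
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    # An odd-length list assigned to a hash leaves the last key without a value.
    %oldAnswers = @decoded;
}

print "warning: $_" for @warnings;
print "AnSwEr0002 has a value? ",
    (defined $oldAnswers{AnSwEr0002} ? "yes" : "no"), "\n";
```

The last key still exists in the hash but its value is undef, which would be consistent with later inputs showing up empty.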
This has only been happening since a recent migration, and didn't afflict a much earlier revision of webwork--making me wonder if it isn't something in my db configuration.
In any case, I'll continue to try to understand the problem better, but would be very happy to hear if anyone else has seen--and overcome!--this strange behaviour.
Many thanks!
Andy
student answers vanishing with Odd number of elements in hash assignment warning
by Andy Fuchs - Number of replies: 4
In reply to Andy Fuchs
Re: student answers vanishing with Odd number of elements in hash assignment warning
by Andy Fuchs -
The problem relates to the use of utf8 in the default character set for text fields in my tables. I quote Jeremy Sylvester, who makes the following astute diagnosis in http://webwork.maa.org/moodle/mod/forum/discuss.php?d=3521&parent=9519:
*** begin quote ***
The Storable::nfreeze command in WeBWorK::Utils::encodeAnswers prefixes each field with a byte or two indicating the length of the data in the field. When the answer entered is <= 127 characters, this data length byte is in the ordinary ascii range and there's no problem. But as soon as the answer data > 127 characters, it spills over into the unicode range. For some reason the database doesn't like this at all, and truncates the last_answer data at the first occurrence of such a character.
On a test WW server, after issuing the following command the problem does not occur for the modified course.
MariaDB [webwork]> ALTER TABLE Testing_problem_user MODIFY last_answer TEXT CHARACTER SET 'binary' COLLATE 'binary';
I think that one should consider the serialized answer data obtained from the Storable::nfreeze command as binary data rather than text. Maybe WeBWorK::DB::Record::UserProblem should be changed so that last_answer has type BLOB instead of TEXT ?
*** end quote ***
Now, as an (unsatisfactory) workaround--so as to allow students to get ahead with their work--I've followed the above and changed the type of 'last_answer' to binary in the course in question. Will I need to do this in all courses? And will there be any side-effects? Perhaps there can indeed be some checking on the level of WeBWorK::Utils::decodeAnswers as suggested by J. Sylvester.
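Should the column change turn out to be needed course by course, the statements could be generated rather than typed by hand. The following is only a sketch: the helper blob_statements and the course names are hypothetical, the "<course>_problem_user" table naming is assumed from the quoted Testing_problem_user example, and BLOB is the type suggested in the quote rather than a tested recommendation.

```perl
use strict;
use warnings;

# Build one ALTER TABLE statement per course. The "<course>_problem_user"
# naming is an assumption based on the quoted example (Testing_problem_user).
sub blob_statements {
    my @courses = @_;
    return map { "ALTER TABLE `${_}_problem_user` MODIFY last_answer BLOB;" } @courses;
}

# Hypothetical course names, for illustration only.
my @statements = blob_statements('Testing', 'Calc1_Fall');
print "$_\n" for @statements;
```

The output can be reviewed and then pasted into the MariaDB client, ideally on a test server and after a backup.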
Many thanks for whatever thoughts and recommendations,
Andy
In reply to Andy Fuchs
Re: student answers vanishing with Odd number of elements in hash assignment warning
by Andy Fuchs -
Just to be clear, the data are truncated by Storable::thaw and not the database. I can see that the data are preserved faithfully under 'last_answer', but become corrupted only through retrieval in WeBWorK::Utils::decodeAnswers (and so Storable::thaw).
In reply to Andy Fuchs
Re: student answers vanishing with Odd number of elements in hash assignment warning
by Andy Fuchs -
A few remarks regarding the above. Again, in case you're seeing the 'Odd number of elements in hash assignment' warning in your Apache error log (via WeBWorK::Utils::decodeAnswers), data are being corrupted and students are likely losing their work.
The data corruption we were seeing in the last_answer field in the problem_user table relates to the encoding, as utf8, of serialized (binary) data in a text field (Latin 1, ISO 8859-1).
The routines Storable::freeze (using native byte order) and Storable::nfreeze (using network, or portable, byte order) operate recursively: they use byte sequences to indicate the length of each structure, further bytes to indicate the length of each constituent part, and so on. Since answers in WeBWorK are stored in a hash (an array of pairs) of the form AnSwErNNNN => 'answer string', Storable keeps the number of elements (i.e., twice the number of keys or values) in a short header, and precedes each key and value with its length, again in the form of a sequence of bytes.
Problems arise when the high bit is set in any of these length bytes. This can happen in problems involving more than 127/2 inputs, that is, 64 or more (which can easily happen in problems involving, say, matrices), or in problems where an input is longer than 127 characters (e.g., in free-form essay questions). The reason is that utf8 (and so DBD::mysql under the utf8 flag) maps code points beyond 127 to more than one byte: 'high-bit' bytes (in the range 128-255) are encoded as two-byte sequences, which confounds the encode/decode (Storable::nfreeze/Storable::thaw) routines used in WeBWorK.
Thus, whereas Storable is interested in byte sequences, the DBI and DBD::mysql are interested in code points and their encodings. The serialized data are binary--but stored in a text field--and DBD::mysql is doing the right thing.
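The effect can be imitated in isolation, without a database. The sketch below freezes a single answer pair whose value is longer than 127 characters, then treats every byte as a code point and encodes it as UTF-8. This only approximates what happens on the connection; the variable names are made up for illustration.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use Encode qw(encode);

# One answer pair whose value is longer than 127 characters.
my @answers = ('AnSwEr0001', 'x' x 200);
my $frozen  = nfreeze(\@answers);

# Count the bytes with the high bit set (128-255) in the serialized data.
my $high_bytes = () = $frozen =~ /[\x80-\xff]/g;

# Treat every byte as a code point and encode it as UTF-8: each
# high-bit byte becomes a two-byte sequence, so the string grows.
my $as_utf8 = encode('UTF-8', $frozen);

# thaw may die or return altered data; either way the round trip fails.
my $thawed = eval { thaw($as_utf8) };
my $intact = ref($thawed) eq 'ARRAY'
    && @$thawed == @answers
    && $thawed->[1] eq $answers[1];

printf "high-bit bytes: %d, length %d -> %d, round trip %s\n",
    $high_bytes, length($frozen), length($as_utf8),
    ($intact ? "intact" : "corrupted");
```

With a value of 127 characters or fewer and only a handful of pairs, the same script should find no high-bit bytes and the round trip should survive.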
This is the first server where we've seen this form of corruption. One workaround, as had already been suggested, is to re-cast the last_answer field as a BLOB. It seemed easier, though, to ask DBD::mysql not to encode text as utf8. This was done on the level of the connection--via the 'mysql_enable_utf8 => 0' attribute--in the file 'lib/WeBWorK/DB/Driver/SQL.pm'. We modified the (anonymous hash of) attributes in DBI->connect_cached as in the excerpt below:
################# lib/WeBWorK/DB/Driver/SQL.pm:
sub new($$$) {
    my ($proto, $source, $params) = @_;
    my $self = $proto->SUPER::new($source, $params);
    # add handle
    $self->{handle} = DBI->connect_cached(
        $source,
        $params->{username},
        $params->{password},
        {
            PrintError => 0,
            RaiseError => 1,
            # our modification:
            mysql_enable_utf8 => 0,
        },
    );
    die $DBI::errstr unless defined $self->{handle};
    # set trace level from debug param
    #$self->{handle}->trace($params->{debug}) if $params->{debug};
    return $self;
}
#################
Regarding the behaviour of the perl DBI and DBD::mysql in the handling of text fields under utf8 flags (in unicode systems)--and why setting the attribute 'mysql_enable_utf8 => 0' is not the same as NOT setting this attribute--please see the following:
https://github.com/perl5-dbi/DBD-mysql/issues/208
Again, the following are examples of problems that are susceptible to corruption in Unicode systems:
Many inputs: Library/TCNJ/TCNJ_MatrixInverse/problem14.pg
Long input: Library/Hope/Calc1/00-00-Essays/GQ_Limits_10.pg
In earlier versions of WeBWorK, answers were base64 encoded, and the history of inputs was stored in logs. In more recent revisions, answers are serialized (binary) in a text field, while the answer history--stored in the past_answer table in the form of simple text--seems to work without complications....
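As a point of comparison with the base64 approach of earlier versions, here is a sketch of wrapping the serialized data in base64 so that only ASCII reaches the text column. The functions encode_answers_b64 and decode_answers_b64 are hypothetical and are not the routines in WeBWorK::Utils; the sample answers are invented.

```perl
use strict;
use warnings;
use Storable qw(nfreeze thaw);
use MIME::Base64 qw(encode_base64 decode_base64);
use Encode qw(encode);

# Hypothetical helpers: serialize an array of key/value pairs and wrap
# the result in base64 (no line breaks), and the reverse.
sub encode_answers_b64 {
    my ($pairs) = @_;
    return encode_base64(nfreeze($pairs), '');
}

sub decode_answers_b64 {
    my ($text) = @_;
    return @{ thaw(decode_base64($text)) };
}

# 70 short answers plus one long one, enough to trigger both cases.
my @pairs = map { (sprintf('AnSwEr%04d', $_), "entry $_") } 1 .. 70;
push @pairs, 'AnSwEr0071', 'y' x 300;

my $text = encode_answers_b64(\@pairs);

# The text is pure ASCII, so encoding it as UTF-8 leaves it unchanged.
my $stored  = encode('UTF-8', $text);
my %answers = decode_answers_b64($stored);

print scalar(keys %answers), " answers recovered\n";
```

The trade-off is size: base64 output is roughly a third larger than the raw serialized data.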
In reply to Andy Fuchs
Re: student answers vanishing with Odd number of elements in hash assignment warning
by Michael Gage -
Thanks Andy for some really high quality detective work. I'll put your examples up on my development laptop this weekend and let you know what happens on my machine.
One thing I'm worrying about. In general we have been planning to move to utf8 (in fact utf8mb4) encoding so that we can accommodate scripts from all languages. (Even French accents sometimes don't render in the systems we were using.) Is it feasible to label the fields that store binary data using the Storable routines so that things work properly and still have utf8mb4 for user names, titles, problem text and so forth?
-- Mike