WeBWorK Main Forum

UTF8, LTI, and All That (2.15?)

Re: UTF8, LTI, and All That (2.15?)

by Nathan Wallach -
Number of replies: 0
Dear Wesley,

I cannot say that I have checked all the implementation details, but on my systems (running what was a pre-release version of 2.15 with various local patches) LTI authentication (and automatic account creation) works with UTF-8 (Hebrew) characters in student names (coming from our local Moodle system). I cannot think of any good reason that any other valid use of UTF-8 strings in LTI data would fail when it works for Hebrew.

As explained below, I suspect that the LTI data arriving on your WW server is not actually valid, in the sense that not all of the strings LTI/OAuth needs to process are encoded as valid UTF-8. If that is the case, then the problem is on the LMS side rather than on the WW side.

In my opinion, it would be wise to check what "raw" request data is arriving (before any UTF-8 decoding is done), and what changes are made to it along the way. That means "reporting" the raw form data very early in the process, before any decoding occurs. My first idea for capturing this "raw" data would be to add temporary debugging code to mutable_param() in lib/WeBWorK/Request.pm, just before and after the decode_utf8() call, to store the "raw" data somewhere for review. Careful debugging of that sort should indicate why the problems are occurring, and whether the issue is really on the WW side or the LMS is sending "flaky" data of some sort.
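To illustrate what such debugging could look for, here is a small self-contained sketch (not actual WW code) of a hypothetical helper that dumps the raw bytes of a form value, so you can see whether the LMS sent multi-byte UTF-8 or single-byte Latin-1 for an accented character:

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical debugging helper: print the raw bytes of a form value so you
# can see what the LMS actually sent before any decoding happens.  The byte
# pair "C3 BC" is the UTF-8 encoding of u-umlaut; a lone "FC" byte would be
# the Latin-1 encoding instead.
sub hex_dump {
    my ($octets) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $octets;
}

print hex_dump("\xC3\xBC"), "\n";   # UTF-8 encoded u-umlaut: "C3 BC"
print hex_dump("\xFC"), "\n";       # Latin-1 encoded u-umlaut: "FC"
```

Warn()-ing such a dump of each parameter before and after the decode step would show exactly which bytes arrived and what the decode did to them.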

Some history and guesses about what may be happening:

When I originally wrote the patch, my assumption was that the versions of WW before UTF-8 support was added were having trouble because the input was actually arriving as UTF-8 encoded data but was not being "decoded". Plain ASCII strings using only 7-bit characters are also valid UTF-8, which is why things work for "Latin1"-only names, etc.
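The ASCII-is-a-subset-of-UTF-8 point can be seen directly (an illustrative snippet, not WW code):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Plain 7-bit ASCII bytes are already valid UTF-8, so decoding them
# returns the same characters unchanged.
my $ascii   = "John Smith";
my $decoded = decode('UTF-8', $ascii);
print $decoded eq $ascii ? "unchanged\n" : "changed\n";   # prints "unchanged"
```

This is why the missing decode step went unnoticed for so long: for unaccented names the decoded and undecoded strings are indistinguishable.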

The patch I had proposed was intended to catch exactly the cases where the OAuth code would issue the multi-byte error message, and to run the "missing" decode step which was not in WW 2.14 and earlier.

Since things work without the patch in WW 2.15 (versions of WW with UTF-8 support added), the pull request https://github.com/openwebwork/webwork2/pull/931 was closed: these versions of WW seem to be able to process valid UTF-8 strings in the LTI data properly, and there was little interest in (or help with) testing the proposed patch for older versions of WW. My understanding when the patch was set aside was that in WW 2.15 things work properly because all relevant request data undergoes a "decode" from UTF-8 into Perl's internal character format. That apparently happens using Encode::decode_utf8() in the mutable_param() method defined in lib/WeBWorK/Request.pm. The Perl documentation (see https://perldoc.perl.org/Encode.html and search for "decode_utf8") notes that Encode::decode_utf8() is not an "ideal" approach to decoding UTF-8 text, as it can fail:

Equivalent to $string = decode("utf8", $octets [, CHECK]) . The sequence of octets represented by $octets is decoded from (loose, not strict) utf8 into a sequence of logical characters. Because not all sequences of octets are valid not strict utf8, it is quite possible for this function to fail. For CHECK, see Handling Malformed Data.

WARNING: do not use this function for data exchange as it can produce $string with not strict utf8 representation! For strictly valid UTF-8 $string representation use $string = decode("UTF-8", $octets [, CHECK]) .
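The difference between the loose and strict decode is easy to demonstrate with an invalid input byte (an illustrative sketch, not WW code; 0xFC is u-umlaut in Latin-1, but is not a valid UTF-8 sequence on its own):

```perl
use strict;
use warnings;
use Encode qw(decode decode_utf8);

my $latin1 = "\xFC";   # Latin-1 u-umlaut; NOT valid UTF-8

# Loose decode_utf8() with the default CHECK quietly substitutes the
# Unicode replacement character U+FFFD for the malformed byte:
my $loose = decode_utf8($latin1);
printf "loose:  U+%04X\n", ord($loose);   # U+FFFD

# A strict decode with FB_CROAK dies on the invalid input instead:
my $ok = eval { decode('UTF-8', $latin1, Encode::FB_CROAK); 1 };
print $ok ? "strict: decoded\n" : "strict: invalid UTF-8\n";
```

Either way, the decoded string no longer matches the bytes the LMS sent, which is why an invalid-input problem cannot be fixed on the decoding side.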

Since you are encountering problems, one likely possibility is that the call to Encode::decode_utf8() is misbehaving because the input received in the LTI request is not really a valid UTF-8 encoded string. Using the $string = decode("UTF-8", $octets [, CHECK]) approach, which is supposed to provide strictly valid UTF-8 output, is unlikely to work any better: the failure is related to invalid input, so the data decoded by the alternate approach would still not match what the LMS signed in the LTI data. A second possibility is that the input WW receives was encoded twice (and only decoded once).
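The double-encoding scenario can also be reproduced in a few lines (illustrative, not WW code):

```perl
use strict;
use warnings;
use Encode qw(encode decode_utf8);

my $name = "M\x{FC}ller";              # character string with u-umlaut

my $once  = encode('UTF-8', $name);    # what a correct LMS would send
my $twice = encode('UTF-8', $once);    # the mistake: encoding a second time

# WW decodes exactly once, so the doubly-encoded name does not round-trip
# back to the original characters (it comes out as mojibake):
my $seen = decode_utf8($twice);
print $seen eq $name ? "round-trips\n" : "does not round-trip\n";
```

In that case each accented character shows up as two garbage characters rather than triggering a decode failure, so the symptoms differ from the invalid-input case above.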

My guess is that the most likely cause is the LMS using an 8-bit character set to send the name, including characters with high-order bits set in a manner which cannot be decoded as valid UTF-8. That hits a case where Encode::decode_utf8() is "known to fail", and leads to Perl's OAuth code receiving a string which triggers the multi-byte character warning message (a string with high-order bits set but not marked as being utf8). If my suspicion is correct that the LMS is not sending valid UTF-8 but is instead sending 8-bit accented characters, then the patch I had once proposed would not help overcome the problem. In that case, the real problem is on the LMS side and not on the WW side.

Testing the patch I once wrote would determine whether it helps. If it fixes the problem for the student with an umlaut-accented character in her name (something of which I am now somewhat skeptical), and does not cause problems for other students, it can certainly be considered for inclusion in WW in the future. However, even if the patch helps with WW 2.15 and an "additional" decode fixes your problem, that still probably indicates that the LMS is not doing something properly.