WeBWorK Main Forum

NonASCII characters in name break emails

by Robert Mařík -
Number of replies: 6

Hello, I have the following problem.

I use Gmail to send emails from WeBWorK. If the name of the user is "Markéta Nováková", the "Email TA" button fails and the email is not sent.

If the name is "Marketa Novakova", emails work fine.

Markéta is the first name, Nováková the second name. Both names are common in my country. Could you check whether you see the same behavior? I have ww_version: 2.15 | pg_version 2.15

Any idea how to fix this problem? Thank you.


Robert Mařík

In reply to Robert Mařík

Re: NonASCII characters in name break emails

by Robert Mařík -
It is not only the name. If the email body contains the letter “á” or similar, the email also fails.
In reply to Robert Mařík

Re: NonASCII characters in name break emails

by Robert Mařík -

I experimented with various mail settings, such as the transfer encoding, without success. However, if the accented letters are removed before the message is sent, the emails work fine. So the following code, placed in the Feedback.pm file just before the line "$email->body_set(encode_utf8($msg));", seems to be a workaround until someone more skilled in email handling takes an interest in the problem. Add these lines and restart Apache.

    $msg =~ s/ě/e/g;
    $msg =~ s/š/s/g;
    $msg =~ s/č/c/g;
    $msg =~ s/ř/r/g;
    $msg =~ s/ž/z/g;
    $msg =~ s/ý/y/g;
    $msg =~ s/á/a/g;
    $msg =~ s/í/i/g;
    $msg =~ s/é/e/g;
    $msg =~ s/Ě/E/g;
    $msg =~ s/Š/S/g;
    $msg =~ s/Č/C/g;
    $msg =~ s/Ř/R/g;
    $msg =~ s/Ž/Z/g;
    $msg =~ s/Ý/Y/g;
    $msg =~ s/Á/A/g;
    $msg =~ s/Í/I/g;
    $msg =~ s/É/E/g;
    $msg =~ s/ú/u/g;
    $msg =~ s/ů/u/g;
    $msg =~ s/Ú/U/g;
    $msg =~ s/ď/d/g;
    $msg =~ s/ť/t/g;
    $msg =~ s/ä/a/g;
    $msg =~ s/ö/o/g;
    $msg =~ s/ü/u/g;
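
A more compact variant of the same workaround might decompose each letter into its base letter plus combining marks and then drop the marks, so that letters missing from the list above are also covered. This is only a sketch: the helper name strip_diacritics is made up for illustration, and it assumes the text is a decoded Perl character string (not raw UTF-8 bytes). Letters that have no decomposition (for example ł or ø) would pass through unchanged.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFD);

# Hypothetical helper (not part of WeBWorK): remove diacritics from a
# decoded character string by decomposing it (NFD) and deleting every
# nonspacing combining mark (Unicode general category Mn).
sub strip_diacritics {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

# "Markéta Nováková", written with escapes so that the script does not
# depend on the encoding of the source file.
my $name = "Mark\x{E9}ta Nov\x{E1}kov\x{E1}";
print strip_diacritics($name), "\n";
```

In Feedback.pm this would presumably be used as "$msg = strip_diacritics($msg);" in place of the individual substitutions, again assuming that $msg holds decoded characters at that point.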

In reply to Robert Mařík

Re: NonASCII characters in name break emails

by Nathan Wallach -
I added the current support for Unicode to the email code in https://github.com/openwebwork/webwork2/pull/973 . 

I'm focusing first on the issue of the accented characters in the email body. It seems likely that the student name as it appears in the body of the email (the Name line in the "Data about the user" section) is really the same problem as that for the rest of the message body.

On my system I can create feedback emails with the accented characters you mention above and they arrive as written in the destination email account. My impression is that something quite strange is happening with how these sorts of accented characters are handled on different systems / under different settings. It might relate to the "system" locale/encoding settings. I now suspect that the issue is a Perl issue related to the fine details of how Perl handles Unicode characters and how it internally stores strings. There is also some chance it has to do with the email transport or the like.

Are you able to send messages with other Unicode characters which are not expected in your local encoding (such as אבגד or 😀) in the text?

I suspect that the problem may be that I and others involved in adding the UTF-8 support to WeBWorK were not sufficiently expert in the complicated and confusing details about how some of Perl's "utf8" functions work, and that the decision to use "encode_utf8()" may be the culprit. That approach to "encoding" to utf8 was used in several places in the WW code (at least back then) but is apparently not robust enough for use in some places.  (Now I know more about the details of Perl's Unicode support than I did back then, and am more aware of the issues which can occur.)

Based on the documentation at https://perldoc.perl.org/Encode#encode_utf8 the encode_utf8() function is not really a proper conversion to UTF-8 (and should not be used for data exchange!) as it really makes use of Perl's internal (and lax) version of utf8. The email systems are certainly expecting proper UTF-8. I suspect that in certain cases, Perl's internal character representation may leave these sorts of accented characters as 8-bit characters, so they are not really being converted to UTF-8 as had been expected, while it seems that in other cases the same characters are properly converted. It seems likely that the difference may be related to what system encoding is set.

If this conjecture is in fact the root problem, then replacing 
       $email->body_set(encode_utf8($msg)); 
with 
       $email->body_set(encode("UTF-8", $msg));
might help overcome the problem.
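
To see what the two calls actually emit on a given system, a small standalone script along the following lines could be run outside WeBWorK. It is only a sketch: the helper hex_octets and the test strings are made up for illustration. As far as I understand the Encode documentation, the lax and strict encoders differ only for code points that are not valid in strict UTF-8, so both lines should show the same octets for ordinary accented letters. The third line shows what happens when the input already holds UTF-8 octets rather than characters: it gets encoded a second time.

```perl
use strict;
use warnings;
use Encode qw(encode encode_utf8);

# Hypothetical helper: render a byte string as space-separated hex octets.
sub hex_octets {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord } split //, $bytes;
}

# A character string with two code points: U+011B (ě) and U+0161 (š).
my $chars = "\x{11B}\x{161}";

my $lax    = hex_octets(encode_utf8($chars));
my $strict = hex_octets(encode('UTF-8', $chars));

# A byte string that already holds the UTF-8 octets of the same letters.
# Encoding it again treats every octet as a separate character.
my $already_encoded = "\xC4\x9B\xC5\xA1";
my $double = hex_octets(encode('UTF-8', $already_encoded));

print "encode_utf8:     $lax\n";
print "encode('UTF-8'): $strict\n";
print "encoded twice:   $double\n";
```

If the first two lines agree, any difference in behaviour between the two calls would more likely come from what $msg contains before it is encoded than from the encoder itself.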

Would you be willing to test this on your system (without the replacement code active)? (Since my server does manage to send mails with these accented characters, I don't see any value in trying the change on my system.)

Another possible difference is how the mail is being sent. My outgoing mail is being relayed via a local Unix system, which may handle things differently than Gmail does. If we rule out strangeness related to "encode_utf8", someone with more expertise in email matters may need to help debug this issue.

In reply to Nathan Wallach

Re: NonASCII characters in name break emails

by Robert Mařík -

Hello, thank you for working on this.

I tried your code. The message "Pokus ěščřžýáíéúůťďŘŽ" went through, but I got "Pokus Ä›Å¡Ä Å™Å¾Ã½Ã¡Ã­Ã©ÃºÅ¯Å¥Ä Å˜Å½" as output.

You may be right that Gmail needs some extra settings. Anyway, I can live with the substitution or set up a Unix mailer.

With best regards

Robert Marik

In reply to Robert Mařík

Re: NonASCII characters in name break emails

by Nathan Wallach -
When I put the string "ěščřžýáíéúůťďŘŽ" in a text file /tmp/11, and dump the bytes in hex:

od -tx1 /tmp/11

I get

0000000 c4 9b c5 a1 c4 8d c5 99 c5 be c3 bd c3 a1 c3 ad
0000020 c3 a9 c3 ba c5 af c5 a5 c4 8f c5 98 c5 bd 0a

which should be the correct UTF-8 octets for the letters. (I checked a few in https://www.utf8-chartable.de/unicode-utf8-table.pl ).

However, c4 in iso-8859-1 (Latin1) is Ä, c5 is Å, which explains the characters which appeared.

Thus, it seems that somewhere along the way the character set is getting messed up. I tend to suspect that the problem is on the Perl end, as we made progress by changing the command used to put the message body into UTF-8 encoding.
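
The garbling can be reproduced in a few lines of Perl, which may help when checking where along the way it happens. This is a sketch under assumptions: it uses only three of the letters, and it decodes with Windows-1252 (cp1252) rather than strict ISO-8859-1, because the › and ™ in the received text are the cp1252 readings of bytes 9b and 99. The variable names are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Three of the test letters as characters: ě (U+011B), š (U+0161), ř (U+0159).
my $original = "\x{11B}\x{161}\x{159}";

# Correct UTF-8 octets, as in the od dump: c4 9b c5 a1 c5 99.
my $octets = encode('UTF-8', $original);

# Misread those octets as Windows-1252: one character per byte.
my $misread = decode('cp1252', $octets);

# Show the resulting code points: Ä › Å ¡ Å ™
my $code_points = join ' ', map { sprintf 'U+%04X', ord } split //, $misread;
print "$code_points\n";

# Reversing the misreading recovers the original text, which suggests the
# octets themselves were intact and only the declared charset was wrong.
my $repaired = decode('UTF-8', encode('cp1252', $misread));
print $repaired eq $original ? "round trip ok\n" : "round trip failed\n";
```

If the received mail can be repaired this way, the body octets were most likely correct UTF-8 and were only labelled or interpreted with the wrong charset somewhere between Perl and the mail client.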

A Google search found https://stackoverflow.com/questions/30734516/send-utf-8-encoded-mail-with-emailsender where the issue reported is also about similar mojibake in emails. The last suggestion there is to "require Net::SMTP;" before using Email::Sender. If that is what is needed, then it probably needs to be added in lib/WeBWorK/ContentGenerator.pm where the transport is created. Another suggestion, a bit earlier in that thread, was to add additional MIME headers and not just the one about the charset.
In reply to Nathan Wallach

Re: NonASCII characters in name break emails

by Robert Mařík -

Hello, thank you for the detailed support. Unfortunately, this new suggestion did not help either.

Gmail should work somehow; we use the same account in another PHP-based project (a custom extension of Wolf CMS) without difficulties.

Anyway, I can live with ASCII replacements for now.

Thank you also for reporting the bug. I will continue to investigate as time permits, perhaps once the current online course finishes.

With best regards

Robert