Installation

MariaDB on Debian: key too long

Re: MariaDB on Debian: key too long

by Nathan Wallach -
Number of replies: 0
I'm guessing that the issue was triggered by the default use of the 4-byte utf8mb4 charset for MariaDB 10.1 in Debian 9.

The reason this causes problems is explained below.

Your solution of "downgrading" to "mySQL utf8" should be fine unless you try to put into your database table utf8 characters not allowed by "mySQL utf8" (a somewhat restricted subset of real UTF8).

Would you mind looking at the utf8 related settings in the configuration files in /etc/mysql/mariadb.conf.d/ on the machine (probably 50-server.cnf and 50-client.cnf). From what I can tell, Debian's MariaDB 10.1 is defaulting to using "utf8mb4" (see https://salsa.debian.org/mariadb-team/mariadb-10.1/commit/e6ade2be57856736e8bc8039d71b35f9ffcde48e for the commit where Debian seems to have made this change.)

If that is the issue, then we probably need to adjust the installation instructions for Debian 9 (and probably also for Ubuntu 18) to tell people how to bypass the problem for now.

The rest of the post explains the cause of the problem and why WW will need to determine how to better deal with the problem in the future.

Nathan

To the best of my recollection none of the standard WW branches are yet trying to use the "utf8mb4" charset. However, the discussions about the future internationalized versions are looking into moving to this charset (see PR 800 in the references below), and most mySQL/MariaDB experts recommend using utfumb4 and not utf8.

The reasons to use "utf8mb4" and not "plain utf8" is explained at https://mathiasbynens.be/notes/mysql-utf8mb4 and in various other web pages. PR 594 also had some discussion of the differences.

In short SQL's "plain utf8" is a subset with various limitations and issues, and only supports the characters from the "real utf8" which require 1-3 bytes to encode. That means that the character to byte conversion factor is only 3, which allows longer keys than a "real UTF8" does.

"utf8mb4" (a full UTF8 for SQL) supports all of Unicode, but as it includes the 4 byte encoded characters, requires a character to byte conversion factor of 4, and this reduced what 1000 bytes will allow from about 333 to about 250. Some versions of InnoDB tables allowed only 767 bytes, while MyIASM allowed 1000 bytes. The 767 byte limit seems to require cutting back to 191 characters with the older settings and utf8mb4. Newer DB versions/table types can apparently allow longer keys, as explained in some of the links below.

The additional byte reserved per character when using "utf8mb4" as the SQL charset means that some WW table keys end up being over the length limit (1000 bypes), while they are OK for the more restricted "utf8" charset.

At present, I'm very skeptical that you will run into trouble, as common letters is most alphabets need only 3 bytes in UTF8. Special characters, emoji, and other "strange' things do require the extra byte, so may trigger problems if somehow your WW system tries to pass them into the database.

The problem in changing to utf8mb4 is that various tables include keys which require over 1000 bytes (or 767 bytes) when utf8mb4 is used. This will be one of the challenges in getting full UTF8 support in WW, and due to that - I am somewhat aware of the issue and those of us involved in the internationalization efforts will need to consider how to deal with this issue in the near future.

There are some discussions of using INNODB tables and "innodb_large_prefix = on" with a COMPRESSED or DYNAMIC row formats and related settings to help avoid bypass issues.

Some references:
I mainly paid attention to this issue several months ago when looking at the OPL tables, in which several varchar(255) fields are included in database keys. Since you had the problem when creating the admin course, there seem to be standard course tables with the same sort of issues.

I had a look at the contents of the OPL tables a while back due to this issue, and the actual data does not currently use such long fields. The two fields which were close to 200 characters of real usage were:
  • Field: source_file in OPL_global_statistics order which had a max length of 198 characters when I looked at it, and that is a PRImary key field.
  • Field: path in OPL_path order which had a max length of 171 characters when I looked at it.