pheloniusfriar | UTF-8 or bust!

Just a quick rant: UTF-8 is a character encoding that is used for almost all web sites and much electronic information these days. It has become ubiquitous, much as ASCII was before UTF-8 became dominant in about 2009. I am using MariaDB (which was forked from MySQL after Oracle acquired MySQL and fucked it up), and there is a character encoding that they call "utf8" that they inherited from MySQL. The main difference between the two encodings is that UTF-8 encodes Unicode characters (which can represent just about every character and variant currently and historically in use globally), each of which can be 1 to 4 bytes long, whereas ASCII is 7-bit and fits in 1 byte. That leaves one bit left over, which gave rise to a plethora of other character encodings that used that upper bit to denote "extended characters", and that is another reason why UTF-8 became so popular once it was introduced: one character encoding could be used for all characters. I use UTF-8 everywhere in the application I'm writing because it's the only sensible thing to do when multiple languages need to be supported.
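To make the "1 to 4 bytes" point concrete, here is a minimal Python sketch. The sample characters and the helper name utf8_length are picked purely for illustration; nothing here comes from my application.

```python
# Illustrative samples only: one character from each UTF-8 length class.
SAMPLES = [
    "A",           # U+0041, plain ASCII
    "\u00e9",      # U+00E9, e with acute accent
    "\u20ac",      # U+20AC, euro sign
    "\U0001F600",  # U+1F600, an emoji outside the Basic Multilingual Plane
]


def utf8_length(ch):
    """Return how many bytes the single character `ch` occupies in UTF-8."""
    return len(ch.encode("utf-8"))


if __name__ == "__main__":
    for ch in SAMPLES:
        encoded = ch.encode("utf-8")
        # Print the code point, the byte count, and the raw bytes in hex.
        print(f"U+{ord(ch):04X}: {len(encoded)} byte(s): {encoded.hex()}")
```

Note that the ASCII character encodes to exactly the same single byte in both ASCII and UTF-8, which is a big part of why the switch was relatively painless for English-only text.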

I had a problem yesterday: I was storing strings in the database as UTF-8, and had explicitly stated in the schema that the strings were "utf8", but I was getting gibberish in my query results (using the C API) any time the string contained a multi-byte UTF-8 character. Doing a bit of searching, I found that I needed to use the function call 'mysql_set_character_set(db_con, "utf8")' to tell the API to return the results in UTF-8 format (I have no idea what it was doing before). Problem solved. However, today I was looking to see if GNU m4 supported UTF-8 character encoding (it doesn't fyi, but there are workarounds, sigh), when I ran across references to MySQL and "utf8mb4" and mumblings about problems with "utf8". Upon further reading... holy fuck, what is the matter with people??? I implemented support in my application for UTF-8 from the start and it took nearly no effort, but the chuckleheads working on MySQL decided that they would only support a subset of UTF-8 (at most three bytes per character, so anything outside the Basic Multilingual Plane, emoji for instance, doesn't fit) and called it "utf8". They apparently quietly introduced a new character encoding called "utf8mb4" (again, it was searching for UTF-8 and m4 that brought up this information at random; it was lurking as a bomb for me to step on some time in the future while I thought I was moving in safe territory). Apparently, this encoding properly supports the full range of UTF-8.
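A quick way to see which strings would trip over the three-byte limit is to check for code points above U+FFFF, since those are exactly the ones that need four bytes in UTF-8. This is only a sketch: the function names are made up for illustration, and it says nothing about how MySQL or MariaDB actually reacts to such input (that depends on server settings).

```python
# Code points above this need 4 bytes in UTF-8 and so do not fit
# in a 3-bytes-per-character encoding like MySQL's "utf8".
MAX_THREE_BYTE_CODE_POINT = 0xFFFF


def fits_mysql_utf8(text):
    """Return True if every character in `text` needs at most 3 UTF-8 bytes."""
    return all(ord(ch) <= MAX_THREE_BYTE_CODE_POINT for ch in text)


def offending_characters(text):
    """Return the characters in `text` that need 4 UTF-8 bytes."""
    return [ch for ch in text if ord(ch) > MAX_THREE_BYTE_CODE_POINT]


if __name__ == "__main__":
    for sample in ["na\u00efve caf\u00e9 \u20ac", "thumbs up \U0001F44D"]:
        bad = [f"U+{ord(ch):04X}" for ch in offending_characters(sample)]
        print(repr(sample), "fits" if fits_mysql_utf8(sample) else f"does not fit: {bad}")
```

Accented Latin letters and the euro sign pass the check; it's things like emoji (and some less common CJK characters) that need the fourth byte.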

This article gives an excellent overview of the issue: "In MySQL, never use 'utf8'. Use 'utf8mb4'." Fuuuuuu.

As the saying goes, "if builders built buildings the way programmers wrote programs, then one woodpecker could destroy the entirety of civilisation".

Ugh.

Here's a guide on how to convert from "utf8" to "utf8mb4" if you, like me, have found yourself bitten by this lossage: "How to support full Unicode in MySQL databases". The connection and client character sets should probably also be updated (for the C API, that presumably means passing "utf8mb4" rather than "utf8" to mysql_set_character_set); you can check what is currently in effect with: "SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';".
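For what it's worth, the conversion itself mostly comes down to a handful of ALTER statements. Here's a rough Python sketch that just builds the SQL text for a database and a list of tables. The database name, table names, helper names and the choice of utf8mb4_unicode_ci as the collation are all assumptions for illustration; it doesn't connect to anything, and you'd want a backup and a read of the guide's caveats before running the output against real data.

```python
def quote_identifier(name):
    """Backtick-quote a MySQL identifier, doubling any embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def conversion_statements(database, tables, collation="utf8mb4_unicode_ci"):
    """Build ALTER statements that switch a database and its tables to utf8mb4.

    `database` and `tables` are hypothetical names supplied by the caller;
    the collation default is just one common choice, not a requirement.
    """
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} "
        f"CHARACTER SET = utf8mb4 COLLATE = {collation};"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} "
            f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
        )
    return statements


if __name__ == "__main__":
    # "appdb", "users" and "notes" are placeholder names.
    for statement in conversion_statements("appdb", ["users", "notes"]):
        print(statement)
```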

I also don't have a lot of polite things to say about a program like m4 not natively supporting UTF-8 in 2018. I tried to volunteer to write the necessary changes (and may still do it as part of wanting to contribute back to open source projects I use), but the web site I found was apparently abandoned and the email address they had against the "hey, we need to implement UTF-8 would you like to volunteer?" entry bounced. There is a more modern site, but they don't really list UTF-8 as a task available for doing. Their statement on multi-byte characters is: "GNU m4 does not yet understand multibyte locales; all operations are byte-oriented rather than character-oriented (although if your locale uses a single byte encoding, such as ISO-8859-1, you will not notice a difference). However, m4 is eight-bit clean, so you can use non-ASCII characters in quoted strings (see Changequote), comments (see Changecom), and macro names (see Indir), with the exception of the NUL character (the zero byte ‘'\0'’)." m4 is niche, but this makes it even more niche, and I may just write my own UTF-8-aware text substitution program to do the work I need and ditch m4.
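The byte-oriented versus character-oriented distinction is easy to show in a few lines of Python. To be clear, this is not a model of m4 itself, just an illustration of what goes wrong in general when a tool counts or cuts by bytes on UTF-8 text; the function names are made up for the example.

```python
def byte_length(text):
    """Length as a byte-oriented tool would see it (UTF-8 bytes)."""
    return len(text.encode("utf-8"))


def char_length(text):
    """Length as a character-oriented tool would see it (code points)."""
    return len(text)


def truncate_bytes(text, n):
    """Keep the first n bytes; this may cut a multi-byte character in half."""
    return text.encode("utf-8")[:n]


def truncate_chars(text, n):
    """Keep the first n code points; never splits an encoded character."""
    return text[:n]


if __name__ == "__main__":
    word = "na\u00efve"  # the third letter takes two bytes in UTF-8
    print(byte_length(word), char_length(word))
    print(truncate_chars(word, 3))
    chopped = truncate_bytes(word, 3)
    try:
        chopped.decode("utf-8")
    except UnicodeDecodeError as err:
        # The cut landed in the middle of the two-byte sequence.
        print("byte-oriented cut produced invalid UTF-8:", err)
```

As long as text is only passed through untouched (the "eight-bit clean" part), bytes are fine; it's counting, indexing and substring operations where the two views disagree.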

For those that have the sense to skip my technobabble, here's an absolutely delightful music video with some tremendous visuals and fun music and people (and a cat in a diving suit?).

"Crumb - Locket [Official Video]" (Watch on YouTube)