A group of developers proposes switching to UTF-8

Recently, a manifesto written by programmers from Tel Aviv was shared on Hacker News. Its authors propose making UTF-8 the default encoding for storing text strings in memory and for communication.

The post sparked a lively discussion, so we decided to look into the situation and review the arguments of IT experts, including IBM engineers and specialists from the W3C.


Photo: Raphael Schaller / Unsplash

The current state of encodings


In 1988, Joe Becker introduced the first draft of the Unicode standard. The document assumed that 16 bits would be enough to store any character. It quickly became clear that this was not the case, so new encoding forms appeared, including UTF-8 and UTF-16. However, the variety of formats and the lack of strict recommendations on their use led to confusion in the IT industry, including confusion over terminology.

Windows uses UTF-16 as its internal format. The authors of the manifesto discussed on Hacker News point out that Microsoft at one time used the terms Unicode and widechar as synonyms for UTF-16 and UCS-2 (the latter is considered the predecessor of UTF-16). In the Linux ecosystem, UTF-8 is the customary choice. This mix of encodings sometimes causes files to be corrupted when they are transferred between computers running different operating systems.

Industry-wide standardization could solve the problem: a move to UTF-8 for storing text strings in memory or on disk and for exchanging packets over the network.

Why UTF-8 is considered better than UTF-16


One of the main arguments is that UTF-8 reduces the memory taken up by Latin characters, which many programming languages rely on. Latin letters, digits, and common punctuation are encoded in UTF-8 with a single byte. Moreover, their codes match the ASCII codes, which provides backward compatibility.
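The ASCII compatibility claim is easy to check. The snippet below is a minimal sketch using Python's built-in codecs; the sample string and variable names are chosen here purely for illustration.

```python
# Sample text made up for this illustration: Latin letters and punctuation only.
text = "Hello, world!"

utf8 = text.encode("utf-8")
ascii_bytes = text.encode("ascii")
# An explicit byte order ("-le") is used so that no byte order mark is prepended.
utf16 = text.encode("utf-16-le")

# For pure ASCII text the UTF-8 bytes are identical to the ASCII bytes.
print(utf8 == ascii_bytes)

# One byte per character in UTF-8, two bytes per character in UTF-16.
print(len(text), len(utf8), len(utf16))
```

For text like this, UTF-16 takes exactly twice as much space as UTF-8.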

IBM experts also say that UTF-8 is better suited for interacting with systems that do not expect multibyte data. Other Unicode encodings contain numerous null bytes, which utilities may interpret as the end of a file or string. For example, in UTF-16 the character A looks like this: 00000000 01000001. In a C string, this sequence can be cut off at the zero byte. In UTF-8, a zero byte only ever encodes NUL. The first letter of the Latin alphabet is represented as 01000001, so there is no risk of an unexpected truncation.

For the same reason, engineers at the W3C recommend using UTF-8 when developing front-end interfaces. This helps avoid problems with network devices.
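The truncation problem can be imitated in a few lines. This is only a sketch: `c_string_view` is a hypothetical helper written for this example that mimics a routine treating the first zero byte as the end of the string, the way C string functions do.

```python
def c_string_view(data: bytes) -> bytes:
    """Return the bytes a NUL-terminated string routine would see:
    everything up to (not including) the first zero byte."""
    return data.split(b"\x00", 1)[0]


# Big-endian UTF-16 matches the bit pattern quoted in the text: 00000000 01000001.
utf16_be = "A".encode("utf-16-be")
utf8 = "A".encode("utf-8")

print(utf16_be)                 # the zero byte comes first
print(c_string_view(utf16_be))  # nothing is left before the terminator
print(c_string_view(utf8))      # the UTF-8 byte survives intact
```

With little-endian UTF-16 the zero byte follows the letter instead, so the same routine would stop after the first character of a longer string.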


Photo: Kristian Strand / Unsplash

A Hacker News user noted that UTF-8 lets you catch encoding errors at an early stage. Bytes are read sequentially, and the service bits of the leading byte determine how many bytes belong to the character. As a result, the code point value is computed unambiguously, and application developers do not need to think about the little-endian versus big-endian problem.
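To make the role of the service bits concrete, here is a simplified sketch. The helper names `sequence_length` and `is_valid_utf8` are invented for this example, and the length check looks only at the bit pattern of the leading byte; it does not reject every byte value that the standard forbids.

```python
def sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged by the leading byte's high bits."""
    if lead >> 7 == 0b0:      # 0xxxxxxx: single-byte (ASCII) character
        return 1
    if lead >> 5 == 0b110:    # 110xxxxx: two-byte sequence
        return 2
    if lead >> 4 == 0b1110:   # 1110xxxx: three-byte sequence
        return 3
    if lead >> 3 == 0b11110:  # 11110xxx: four-byte sequence
        return 4
    # 10xxxxxx is a continuation byte and cannot start a character.
    raise ValueError("not a UTF-8 leading byte")


def is_valid_utf8(data: bytes) -> bool:
    """Use Python's strict decoder to detect malformed input."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(sequence_length(0xE8))            # leading byte 11101000 announces three bytes
print(is_valid_utf8(b"\xe8\xaa\x9e"))   # complete sequence
print(is_valid_utf8(b"\xe8\xaa"))       # truncated sequence is detected

# UTF-16 has two possible byte orders; UTF-8 has only one byte sequence per character.
print("A".encode("utf-16-le"), "A".encode("utf-16-be"))
```

Because a truncated or misplaced byte breaks the expected pattern, a strict decoder can report the error immediately instead of silently producing wrong characters.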

Where UTF-16 has the advantage


Latin letters and punctuation can take up less memory in UTF-8 than in UTF-16. Some code points require the same number of bytes in both encodings; this is true, for example, of Greek and Hebrew.

The situation is different for Asian characters: in UTF-8 they need more space. For example, the Chinese character 語 (U+8A9E) is represented by three bytes: 11101000 10101010 10011110. The same character in UTF-16 looks like 10001010 10011110.
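The size comparison can be reproduced directly. The sketch below assumes one sample character per script, picked for this illustration; the three-byte sequence quoted above decodes to code point U+8A9E, which is used as the CJK sample. The `bits` helper is a hypothetical name introduced here.

```python
def bits(data: bytes) -> str:
    """Render bytes as space-separated groups of eight bits."""
    return " ".join(f"{byte:08b}" for byte in data)


# Sample characters chosen for this example, written as escapes for clarity.
samples = {
    "Latin A": "\u0041",
    "Greek alpha": "\u03b1",
    "Hebrew alef": "\u05d0",
    "CJK U+8A9E": "\u8a9e",
}

# Map each sample to (bytes in UTF-8, bytes in UTF-16).
sizes = {
    name: (len(ch.encode("utf-8")), len(ch.encode("utf-16-be")))
    for name, ch in samples.items()
}

for name, (utf8_len, utf16_len) in sizes.items():
    print(f"{name}: UTF-8 {utf8_len} bytes, UTF-16 {utf16_len} bytes")

cjk = samples["CJK U+8A9E"]
print(bits(cjk.encode("utf-8")))      # three bytes
print(bits(cjk.encode("utf-16-be")))  # two bytes
```

For these samples UTF-8 is smaller for Latin, equal for Greek and Hebrew, and larger for the CJK character.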

Where things stand


The debate over adopting a single encoding has been going on for a long time. The question was raised almost eleven years ago in a thread on Stack Overflow, and Pavel Radzivilovsky, one of the authors of the manifesto, took part in it. Since then, UTF-8 has become one of the most popular encodings on the Internet. The WHATWG, a community of HTML and API specialists that develops the relevant standards, has recognized it as mandatory for “all situations”.

Microsoft has also recently begun recommending UTF-8 for web application development. Perhaps this practice will extend to other tools in the future.


