Unicode is more than encoding


Nowadays, average users rarely have to deal with encodings and don’t need to know anything about Unicode. Better software, better libraries and a better understanding have contributed a lot to that. Programmers are in a different position. Manipulating characters and strings is among the most basic operations we learn, and even writing the simplest hello world program in a different script requires either understanding encodings or relying on how the programming language handles them.

Here is some recommended reading:

Encoding is not Unicode

Unicode works at the character level for the world’s writing systems. It doesn’t help you deal with the language itself, such as searching Chinese text, where words aren’t separated by spaces. But it does reveal the relationship between encodings and characters.

Simply remember:

There isn’t a string without encoding.

If you ever forget that, or never understood it, you won’t know why the same string can turn out to be different byte arrays.
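As a small illustration of that point, here is a minimal Python sketch. The sample text and the three encodings are arbitrary choices made for this example, not something the discussion above depends on.

```python
# One string, written with an escape so the source file's own encoding
# cannot change what we are testing: "h", U+00E9 (e with acute), "llo".
text = "h\u00e9llo"

# The same five characters become three different byte arrays.
utf8_bytes = text.encode("utf-8")        # U+00E9 takes two bytes here
utf16_bytes = text.encode("utf-16-le")   # every character takes two bytes here
latin1_bytes = text.encode("latin-1")    # U+00E9 takes one byte here

print(len(text), len(utf8_bytes), len(utf16_bytes), len(latin1_bytes))
print(utf8_bytes, utf16_bytes, latin1_bytes)

# Decoding each byte array with the encoding that produced it
# gives back the same string every time.
round_trips = [
    utf8_bytes.decode("utf-8"),
    utf16_bytes.decode("utf-16-le"),
    latin1_bytes.decode("latin-1"),
]
print(all(s == text for s in round_trips))
```

The bytes only mean something once you know which encoding produced them, which is why comparing raw byte arrays from two different sources can fail even when the text is identical.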

Many programming languages use a UTF-x encoding now, but that isn’t the whole story. Ruby uses a Code Set Independent (CSI) model for strings, which basically stores the encoding along with each string, while most other programming languages pick one UTF-x encoding to represent strings, so a string in memory is already encoded. The scariest problems appear when a string comes from a database or from JSON in an HTTP response, is written to a file, or sits in your JavaScript literals. Now the string is involved in I/O, and I/O operates at the byte level, so it is necessarily tied to an encoding. If you are arrogant and ignore the fact that libraries, the operating system and your programming language each handle encoding in their own way, it will eventually bite you.
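To make the I/O point concrete, here is a hedged Python sketch of what can go wrong when bytes cross a boundary and the reader assumes the wrong encoding. The variable names and the sample word are invented for illustration. No real file, socket or database is involved, only the bytes such a channel would carry.

```python
# The writer encodes "caf" + U+00E9 as UTF-8 before sending or storing it.
original = "caf\u00e9"
wire_bytes = original.encode("utf-8")

# A reader that assumes Latin-1 gets no error, just silently wrong text.
misread = wire_bytes.decode("latin-1")

# A reader that assumes the right encoding recovers the original string.
correct = wire_bytes.decode("utf-8")

print(repr(misread), repr(correct))

# The opposite mistake is louder: Latin-1 bytes are not valid UTF-8 here,
# so a strict decoder raises instead of guessing.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    strict_decode_failed = False
except UnicodeDecodeError:
    strict_decode_failed = True

print(strict_decode_failed)
```

The silent case is the more dangerous one, since the wrong text can be stored and passed along long before anyone notices.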

Unicode regulates more than code points

A common programming task is comparing strings. This problem has two aspects. One is the encoding, explained above. The other is the Unicode standard itself. Unicode intends to represent the characters of the world’s writing systems, and a single character can be represented by more than one sequence of code points. The abstraction layers in software also try to make this easy for programmers. So now it’s time to remind you of one more thing.

For example, the Vietnamese letter “ệ” can be expressed in five different ways:

  • Fully precomposed: U+1EC7 “ệ”
  • Partially precomposed: U+1EB9 “ẹ” + U+0302 “◌̂”
  • Partially precomposed: U+00EA “ê” + U+0323 “◌̣”
  • Fully decomposed: U+0065 “e” + U+0323 “◌̣” + U+0302 “◌̂”
  • Fully decomposed: U+0065 “e” + U+0302 “◌̂” + U+0323 “◌̣”

A Programmer’s Introduction to Unicode, Nathan Reed

A character can have different representations even within the same encoding.
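The five spellings of “ệ” listed above can be checked with Python’s standard unicodedata module. This is only a sketch of the idea: normalizing to NFC or NFD before comparing is one common approach, not the only one, and the variable names are made up for the example.

```python
import unicodedata

# The five code point sequences from the list above, written as escapes.
forms = [
    "\u1ec7",          # fully precomposed
    "\u1eb9\u0302",    # e with dot below + combining circumflex
    "\u00ea\u0323",    # e with circumflex + combining dot below
    "e\u0323\u0302",   # e + dot below + circumflex
    "e\u0302\u0323",   # e + circumflex + dot below
]

# Compared as raw strings, all five are different.
distinct_raw = len(set(forms))

# After normalization they collapse to a single form.
nfc_forms = {unicodedata.normalize("NFC", s) for s in forms}
nfd_forms = {unicodedata.normalize("NFD", s) for s in forms}

# Even in one encoding (UTF-8 here) the byte lengths differ before normalizing.
utf8_lengths = [len(s.encode("utf-8")) for s in forms]

print(distinct_raw, nfc_forms, nfd_forms, utf8_lengths)
```

A comparison that should treat these as the same letter has to normalize both sides to the same form first. Comparing the strings, or their bytes, directly is not enough.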

I won’t elaborate further. Here are some directions for further reading.

The Unicode website offers plenty of reading on these problems and the intentions behind the standard, and ICU provides C/C++ and Java implementations that address them.