Bridging ICU with Ruby

Chinese Discourse users complain more often about text problems. Community software invites users to read and write, so it inevitably deals with text. Numerous efforts have been made along the way, such as tokenizers for Chinese, but maintaining such a project is not easy. One feature request for Discourse is Unicode usernames. A core technical problem there is visually confusable usernames. A Discourse community may well be multilingual, so this is certainly important to deal with. The username is the core of a user’s identity, though, so the problem goes beyond Unicode itself.

Motivation and goals

For some reason, the Ruby community doesn’t have many tools for Unicode processing. Unicode normalization was added to the core library in Ruby 2.2 (String#unicode_normalize). But what about Unicode security problems? I don’t want to reinvent the wheel. ICU (International Components for Unicode) is an old, battle-tested Unicode library, so a binding to it would be worth the effort. The binding should be fast, easy to maintain, and deployable with MRI.
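To make the gap concrete, here is a minimal sketch of what core normalization does and does not cover. The helper name same_after_nfkc? is made up for this illustration. Normalization unifies different encodings of the same character, but it does not catch look-alike characters from different scripts, which is the kind of problem ICU’s spoof checker (uspoof) is meant for.

```ruby
# Hypothetical helper: compares two strings after NFKC normalization.
def same_after_nfkc?(a, b)
  a.unicode_normalize(:nfkc) == b.unicode_normalize(:nfkc)
end

# Precomposed "é" versus "e" followed by a combining acute accent:
# normalization treats them as the same text.
puts same_after_nfkc?("\u00E9", "e\u0301")            # true

# Latin "a" versus Cyrillic "а" (U+0430): they look alike on screen,
# but normalization keeps them distinct, so the usernames differ.
puts same_after_nfkc?("paypal", "p\u0430yp\u0430l")   # false
```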

There are already some gems.

Do it right

Ruby really differs from other programming languages in its string implementation. Its Code Set Independent (CSI) model doesn’t impose a common internal encoding; each string stores its byte representation together with its encoding. This lets Ruby convert encodings when I/O operations are involved. Ruby also keeps an external encoding, Encoding.default_external, which is how it reads from an IO object by default. The IO object is typically a file (File is a subclass of IO), and the external encoding is usually UTF-8, taken from the environment. Encoding.default_internal, if set, is the encoding that incoming data is transcoded to; it is nil by default, while newly created string literals take the source file’s encoding, which has been UTF-8 by default since Ruby 2.0. This model dates from Ruby 1.9.
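A small sketch of the model described above, using only core String and Encoding methods. The sample string is arbitrary, and the last two lines print whatever the current environment provides.

```ruby
# A Ruby string is a byte sequence plus an encoding tag.
utf8 = "caf\u00E9"          # "café"; a \u escape yields a UTF-8 string
puts utf8.encoding          # UTF-8
puts utf8.bytesize          # 5, because "é" takes two bytes in UTF-8

# encode transcodes: the bytes change, the characters stay the same.
latin1 = utf8.encode(Encoding::ISO_8859_1)
puts latin1.bytesize        # 4

# force_encoding only relabels: the bytes stay, their interpretation changes.
relabeled = utf8.dup.force_encoding(Encoding::ISO_8859_1)
puts relabeled.length       # 5, each byte now counts as one character

# Process-wide defaults consulted by IO; the values depend on the environment.
p Encoding.default_external
p Encoding.default_internal
```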

ICU provides many internationalization features, and they operate on strings, so most operations happen in memory. But ICU uses UTF-16 internally, so a conversion is needed whenever UTF-8 is the default encoding.
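The conversion can be sketched in plain Ruby to show what changes on the way in and out. This is only an illustration, not the gem’s code: the names NATIVE_UTF16 and to_utf16 are made up here, and picking the machine’s byte order reflects the fact that ICU’s UChar is a native-endian 16-bit code unit.

```ruby
# Pick the UTF-16 flavor that matches the machine's byte order.
NATIVE_UTF16 =
  [1].pack("S") == [1].pack("v") ? Encoding::UTF_16LE : Encoding::UTF_16BE

# Hypothetical helper: transcode any Ruby string into native-endian UTF-16.
def to_utf16(str)
  str.encode(NATIVE_UTF16)
end

text  = "ICU \u4E2D\u6587 \u{1F600}"   # ASCII, two CJK characters, one emoji
utf16 = to_utf16(text)

puts text.bytesize                  # 15 bytes in UTF-8
puts utf16.bytesize                 # 18 bytes in UTF-16
puts utf16.unpack("S*").length      # 9 code units; the emoji is a surrogate pair

# Converting back restores the original string.
puts utf16.encode(Encoding::UTF_8) == text   # true
```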

Also, ICU’s C API reports errors through a UErrorCode status parameter, and its converters use callback functions to handle invalid or unmappable input.

A gem written in C can easily access a string’s bytes directly, and doing the conversion in C should be faster than relying on MRI’s implementation.

Design

The ICU binding should be as transparent as possible: the user only works with a Ruby API equivalent to ICU’s. Its module holds an optional internal encoding for returned results (usually strings). Since a Ruby string can be in any encoding, it has to be converted to UTF-16 before being fed to ICU.
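A rough pure-Ruby sketch of that design follows. Everything in it is hypothetical: the module name ICUBridge, its methods, and the fallback to UTF-8 are assumptions made for illustration, and the block passed to with_icu merely stands in for a real call into ICU.

```ruby
# Hypothetical module illustrating the design; not the gem's actual API.
module ICUBridge
  # Native-endian UTF-16, the form ICU works with.
  UTF16 =
    [1].pack("S") == [1].pack("v") ? Encoding::UTF_16LE : Encoding::UTF_16BE

  class << self
    # Optional encoding for returned strings.
    attr_writer :internal_encoding

    # Assumed fallback order: explicit setting, Ruby's default, then UTF-8.
    def internal_encoding
      @internal_encoding || Encoding.default_internal || Encoding::UTF_8
    end

    # Accepts a string in any encoding and hands UTF-16 to the block,
    # then converts the block's result to the internal encoding.
    def with_icu(str)
      result = yield(str.encode(UTF16))
      result.encode(internal_encoding)
    end
  end
end

# The caller passes a Latin-1 string and never sees UTF-16.
ICUBridge.internal_encoding = Encoding::UTF_8
latin1 = "stra\u00DFe".encode(Encoding::ISO_8859_1)
out = ICUBridge.with_icu(latin1) { |utf16| utf16 }   # identity stands in for ICU
puts out.encoding    # UTF-8
```

Keeping the conversions at the module boundary means callers never have to care which encoding ICU uses.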

Along the way, The Definitive Guide to Ruby’s C API helped me quite a lot. It’s the clearest reference for Ruby’s C API I’ve seen.
