Unicode Normalization

Unicode Technical Report #15 from the Unicode Consortium (since published as Unicode Standard Annex #15, "Unicode Normalization Forms") specifies how to normalize Unicode text so that strings can be compared reliably when the same character has two or more representations in the Unicode character set.

For example, é (e-acute) can be represented either as the single Latin-1 compatible precomposed character U+00E9, or as a sequence of two characters: the ASCII letter e (U+0065) followed by the combining acute accent (U+0301). Unicode defines a non-spacing combining mark for each of the diacritics used in the Latin-1 precomposed characters.
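The two representations of é can be compared with a minimal sketch like the one below. It uses the standard java.text.Normalizer class from the JDK, not the Unicode Consortium sample code or our C# port. The class name NormalizationDemo and the helper canonicallyEqual are hypothetical names chosen for illustration.

```java
import java.text.Normalizer;

public class NormalizationDemo {

    // Hypothetical helper: treats two strings as equal when their
    // NFC (canonical composition) forms are identical.
    static boolean canonicallyEqual(String a, String b) {
        String na = Normalizer.normalize(a, Normalizer.Form.NFC);
        String nb = Normalizer.normalize(b, Normalizer.Form.NFC);
        return na.equals(nb);
    }

    public static void main(String[] args) {
        String precomposed = "\u00E9";   // é as one code point
        String decomposed = "e\u0301";   // e followed by combining acute accent

        // A plain comparison sees two different code point sequences.
        System.out.println(precomposed.equals(decomposed));            // false

        // After normalization the two strings compare as equal.
        System.out.println(canonicallyEqual(precomposed, decomposed)); // true

        // NFD (canonical decomposition) splits the precomposed form in two.
        String nfd = Normalizer.normalize(precomposed, Normalizer.Form.NFD);
        System.out.println(nfd.length());                              // 2
    }
}
```

Normalizing both sides to the same form before comparing is the essential step; NFD would serve equally well here, as long as both strings use the same form.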

Some legacy character sets also use combining characters to represent more characters than their limited number of code points would otherwise allow.

As part of the technical report, the Unicode Consortium provides sample code written in Java to illustrate the algorithms involved. We have ported this code to C# for a real-world project and as an experiment in Java-to-C# porting.

Download (25K) ::