Unicode string normalization schemes in Python
By Niraj Zade | 2024 May 06 | 8 min read
The usual case-insensitive ASCII string comparison is simple -
# ascii string comparison
def compare_ascii(str1: str, str2: str) -> bool:
    return str1.lower() == str2.lower()
On the other hand, Unicode string comparison is much harder. The strings cannot always be normalized by a simple uppercase/lowercase conversion. Eg - some scripts don't have the concept of uppercase/lowercase, but still have multiple representations for the same character.
Note: Unicode considers written characters to be part of scripts, not languages. Eg - the language Hindi is written in the Devanagari script.
Here is a comparison of two different Unicode representations of the letter 'a':
print('a' == '𝐚')
>> False
So, how do you even tackle this problem?
The source of the problem is that the same character can be represented by many different Unicode code points.
Example:
Here are the ways in which the letter 'a' can be written:
ａ 𝐚 ᵃ 𝒂 𝕒 ª 𝖆 𝒶 𝓪 𝔞 ⓐ ₐ 𝑎 𝖺 𝗮 𝘢 𝙖 𝚊
And when you compare these characters, the comparison will fail.
# compare `𝖺` and `ａ`
print('𝖺' == 'ａ')
>> False
# compare lowercase representations
print('𝖺'.lower() == 'ａ'.lower())
>> False
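All of these variants are compatibility-equivalent to the plain letter 'a'. As a quick sketch (using the normalize function covered later in this article, on a few variants from the list above):
# NFKC collapses compatibility variants of 'a' into the plain letter
from unicodedata import normalize
for variant in ['ａ', '𝐚', 'ᵃ', 'ª', 'ⓐ', '𝖺']:
    print(normalize("NFKC", variant) == 'a')
>> True
>> True
>> True
>> True
>> True
>> True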
Example - In German, 'ß' and 'SS' are equivalent. But they cannot be directly compared.
'ß' == 'SS'
>> False
'ß'.lower() == 'SS'.lower()
>> False
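The case-pair half of this problem is handled by str.casefold(), which applies Unicode case folding rules that lower() doesn't:
# casefold() maps 'ß' to 'ss', so the comparison succeeds
print('ß'.casefold() == 'SS'.casefold())
>> True
But casefold() alone does nothing for the 'a' variants shown earlier - those need normalization as well.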
So, to compare such strings, you need a smarter comparison system. The system should know which characters are logically equivalent. The Python unicodedata module helps with this.
The solution has 2 steps: normalize both strings to a common form, then casefold them before comparing.
from unicodedata import normalize

def check_equality_unicode(str1: str, str2: str) -> bool:
    # step 1: normalize to a common form; step 2: casefold before comparing
    str1_normalized = normalize("NFKC", str1)
    str1_normalized_caseless = str1_normalized.casefold()
    str2_normalized = normalize("NFKC", str2)
    str2_normalized_caseless = str2_normalized.casefold()
    return str1_normalized_caseless == str2_normalized_caseless
This function is universal - it works for both Unicode and ASCII strings.
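For example, it handles both comparisons that failed earlier:
print(check_equality_unicode('ß', 'SS'))
>> True
print(check_equality_unicode('𝖺', 'ａ'))
>> True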
"NFKC"
used above is a normalization form. There are 4 standard unicode normalization forms, each with different behaviours. More on them later.
There are 3 test cases that you can put into your codebase (a runnable version follows):
- check_equality_unicode('Σίσυφος', 'ΣΊΣΥΦΟΣ') should equate to True
- check_equality_unicode('a', 'ａ') should equate to True
- check_equality_unicode('abc', 'ABC') should equate to True
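As a runnable version of these checks (using the check_equality_unicode function defined above):
# all three assertions pass silently
assert check_equality_unicode('Σίσυφος', 'ΣΊΣΥΦΟΣ')  # Greek, including the final sigma ς
assert check_equality_unicode('a', 'ａ')  # plain 'a' vs fullwidth 'ａ'
assert check_equality_unicode('abc', 'ABC')  # plain ASCII case-insensitive match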
There are 4 normalization forms available for Unicode:
- NFD - Normalization Form D - Characters undergo canonical decomposition
- NFC - Normalization Form C - Characters undergo canonical decomposition, followed by canonical composition
- NFKD - Normalization Form KD - Characters undergo compatibility decomposition
- NFKC - Normalization Form KC - Characters undergo compatibility decomposition, followed by canonical composition

Legend to understand the naming scheme: D stands for Decomposition, C for Composition, and K for Compatibility.
A table makes their differences clearer -
Form | Normalization form | Canonical decomposition | Compatibility decomposition | Canonical composition
---|---|---|---|---
NFD | Form D | Yes | |
NFC | Form C | Yes | | Yes
NFKD | Form KD | | Yes |
NFKC | Form KC | | Yes | Yes
The decomposition can be canonical or compatible. But the re-composition is always canonical.
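To see decomposition and re-composition in action, here is a small sketch using the accented letter 'é':
from unicodedata import normalize
s = 'é'  # a single code point: U+00E9
nfd = normalize('NFD', s)  # canonical decomposition: 'e' + U+0301 (combining acute accent)
nfc = normalize('NFC', nfd)  # canonical composition puts the pieces back together
print(len(s), len(nfd), len(nfc))
>> 1 2 1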
More details are available in Unicode's documentation - unicode.org - Unicode Normalization Forms
Choice of normalization form depends on the application. The decision is made by answering two questions:
Canonical vs compatibility decomposition
Compatibility decomposition is built on top of canonical decomposition, by adding extra rules. The compatibility conversion transforms characters into their more common forms. This "simplification" into more common forms causes some information loss. This is why every pair of canonically equivalent strings is also compatibility-equivalent, but not every pair of compatibility-equivalent strings is canonically equivalent.
Here is an example from the original unicode.org document, using the superscript five character. Notice how the exponent 5 gets converted into a plain digit 5 by the compatibility decomposition of NFKD and NFKC. This simplification causes information loss - looking at the normalized string, there is no way to know whether the 5 was originally an exponent or a normal digit.
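The same behaviour can be reproduced in Python:
from unicodedata import normalize
# canonical forms leave U+2075 SUPERSCRIPT FIVE untouched
print(normalize('NFC', '⁵'))
>> ⁵
# compatibility forms flatten it into the plain digit
print(normalize('NFKC', '⁵'))
>> 5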
Normalization causes information loss. Once a string is normalized, it cannot always be converted back to the original string.
Here's an example - Suppose the ASCII string "HelLO TheRE" is converted to lowercase - "hello there". During this operation, the information about which characters were uppercase is lost. So the lowercase "hello there" cannot be converted back to the original "HelLO TheRE".
A similar thing happens during unicode normalization. Once a string is normalized, it cannot always be converted back to the original form due to information loss.
There are several reasons for this. Here is an example from a Stack Overflow discussion on NFC vs NFD normalization - stackoverflow - When to use Unicode Normalization Forms NFC and NFD?
U+0387 GREEK ANO TELEIA (·) is defined as canonical equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it's too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
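This pitfall is easy to reproduce - NFC silently replaces the ano teleia with a middle dot:
from unicodedata import normalize
# U+0387 canonically decomposes to U+00B7, and NFC does not restore it
print(normalize('NFC', '\u0387') == '\u00b7')
>> True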
WHEN TO USE?
Normalization is required when two strings are being compared, especially when dealing with non-English strings in multilingual apps.
WHEN NOT TO USE?
By default, never normalize during data storage. If you're overriding this default, you'd better have a good reason for it. If you're normalizing a string before storing it into a database, store both versions of the string - the original (un-normalized) and the normalized one.
Don't use normalization during storage in use cases where the information loss due to normalization can cause problems. Normalization can change the way a string looks.
Some examples are: