While handling text in Python, you might run into the dreaded error
"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 21: ordinal not in range(128)"
After about two hours of searching the internet and reading all I could about encoding, decoding, Unicode and UTF-8 (and encountering a class called UnicodeDammit :) ), the simple gist of the problem, as I understand it, is:
- You are trying to convert a string (for example, text read from a file) to Unicode using the unicode() function. By default, unicode() decodes the byte string you pass in as 'ascii' unless you tell it which encoding to use. ASCII only covers byte values 0 to 127, so any byte above that, such as 0xe9, makes the decode fail. So when you call unicode('Café') you get the following error:
>>> print unicode('Café')
Traceback (most recent call last):
File "", line 1, in
print unicode('Café')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
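For readers on Python 3, where the unicode() function no longer exists, the same failure can be reproduced with bytes.decode(). This is only a rough sketch of the equivalent, not the code from the session above, and the variable names (raw, error_message) are made up for illustration:

```python
# Python 3 sketch; the examples in this post are Python 2.
# 'Café' stored as Latin-1 bytes: 0xe9 is the single byte for 'é'.
raw = b"Caf\xe9"

try:
    # ASCII only defines byte values 0-127, so 0xe9 cannot be decoded.
    raw.decode("ascii")
    error_message = None
except UnicodeDecodeError as err:
    error_message = str(err)

print(error_message)
```

The message names the offending byte (0xe9) and its position in the input (3, counting from zero), which is usually the quickest clue to what went wrong.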
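Letters like 'é' turn up quite often, even in English text (café, résumé), but they are not part of ASCII. In this example the string is encoded in 'latin-1' (ISO-8859-1), where 'é' is the single byte 0xe9. To convert it to Unicode, you must also specify the encoding the bytes are actually in. For example,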
print unicode('Café', 'latin-1') gives you
>>> print unicode('Café', 'latin-1')
Café
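In Python 3 the same fix might look roughly like the sketch below, assuming the bytes really are Latin-1. It also shows why the encoding name has to match the data: the same text stored as UTF-8 uses different bytes, and decoding those as Latin-1 raises no error but quietly produces the wrong characters. The variable names are illustrative only.

```python
# Python 3 sketch; the examples in this post are Python 2.
raw = b"Caf\xe9"                 # 'Café' as Latin-1 bytes

# Name the encoding explicitly instead of relying on a default.
text = raw.decode("latin-1")
print(text)

# The same text encoded as UTF-8 uses two bytes for 'é'.
utf8_bytes = text.encode("utf-8")
print(utf8_bytes)

# Decoding UTF-8 bytes as Latin-1 does not fail, it just gives mojibake.
wrong = utf8_bytes.decode("latin-1")
print(wrong)
```

So the error goes away once you name an encoding, but the output is only right if it is the encoding the bytes were written in.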
Amen !!! :)
This useful piece of information is described further here.
Understanding Unicode and how it works is very important when you are doing text processing. I will be posting a few more articles on this in the future.