I know I know another tech blog... : November 2010

While handling text in Python, you might receive the dreaded error:

"UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 21: ordinal not in range(128)"

After about two hours of searching the internet and reading all I could about encoding, decoding, Unicode and UTF-8 (and encountering a class called UnicodeDammit :) ), the simple gist of the problem, as far as I can tell, was:

You are trying to convert a string / text from a file to unicode using the unicode() function. By default, unicode() tries to decode the string you pass in as 'ascii' unless you specify which encoding to use. So when you pass a string containing a non-ASCII character to the function, as in unicode('Café'), you get the following error:

>>> print unicode('Café')
Traceback (most recent call last):
  File "", line 1, in 
    print unicode('Café')
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe9 in position 3: ordinal not in range(128)
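
The same failure can be reproduced without depending on how the terminal encodes 'é', by writing the byte out explicitly. This is only a minimal sketch: the names `raw` and `error` are made up for illustration, the bytes are assumed to be Latin-1 as in the session above, and `.decode('ascii')` stands in for the default ASCII decode that unicode() performs.

```python
# 'Café' as raw bytes, with é stored as the single byte 0xe9 (assumed Latin-1).
raw = b'Caf\xe9'

try:
    raw.decode('ascii')  # ASCII only covers byte values 0-127
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception object records which codec failed and at which byte offset.
print(error.encoding)  # ascii
print(error.start)     # 3
```

The offset matches the "position 3" in the traceback: 'C', 'a' and 'f' decode fine, and the fourth byte does not.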

Letters like 'é' are fairly common, even in English text, but they are not part of ASCII. Here the 'é' is stored as the single byte 0xe9, which is its 'latin-1' encoding. To convert the string to unicode, you must also specify that encoding. For example:

print unicode('Café', 'latin-1') gives you

>>> print unicode('Café', 'latin-1')
Café
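
The same idea applies to bytes read from a file: decode them with the encoding they were actually written in. Below is a small sketch under stated assumptions: `to_text` is a hypothetical helper (not something from the standard library), and its Latin-1 default is chosen only to match the example above.

```python
def to_text(raw, encoding='latin-1'):
    """Decode a byte string into unicode text using an explicit encoding."""
    return raw.decode(encoding)

# é as the single Latin-1 byte 0xe9
text = to_text(b'Caf\xe9')
print(text == u'Caf\xe9')  # True

# é as the two-byte UTF-8 sequence 0xc3 0xa9, so the encoding must be named
print(to_text(b'Caf\xc3\xa9', 'utf-8') == u'Caf\xe9')  # True
```

Both calls give the same four-character text; only the encoding named for the input bytes differs.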

Amen !!! :)

This useful piece of information is described further here.

Understanding Unicode and how it works is very important when you are working on text processing. I will be posting a few more articles on this in the future.


Saturday, November 27, 2010

Unicode!!!