Pragmatic Unicode, or, How do I stop the pain?

At some point the following started to happen in my small httplib2-based script:

Traceback:
...
File "/usr/lib/python2.7/httplib.py", line 996, in _send_request
    self.endheaders(body)
File "/usr/lib/python2.7/httplib.py", line 958, in endheaders
    self._send_output(message_body)
File "/usr/lib/python2.7/httplib.py", line 816, in _send_output
    msg += message_body
UnicodeDecodeError: 'ascii' codec can't decode byte 0xd0 in position 245: ordinal not in range(128)

Earlier I would have started sprinkling .encode() and .decode() calls around at random until it ran, but now I understand the cause of such failures much better after watching an awesome talk by Ned Batchelder titled “Pragmatic Unicode, or, How do I stop the pain?”.

This time it took me mere seconds to find the cause. In the traceback above, UnicodeDecodeError was raised because msg was already a unicode object while message_body was a str. msg had become unicode because the URL supplied to the request method was unicode. When Python 2.7 concatenates unicode and str, it implicitly decodes the str into unicode using the default ascii codec, but message_body contained bytes outside the ASCII range (0xd0 at position 245), so the decode failed. Converting the URL to str fixed the issue; URLs are not good candidates to be passed around decoded.
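To make the mechanism concrete, here is a minimal sketch of what I believe happens at the msg += message_body line. It is not httplib's actual code: the header text, the body, and the helper implicit_concat are made up for illustration, and the explicit .decode("ascii") only imitates the decode that Python 2 performs silently when unicode and str are mixed. It is written so that it runs on both Python 2.7 and Python 3.

```python
# Illustrative values only; none of these come from the original script.
# Header block as text: what msg turns into once a unicode URL is mixed in.
msg = u"POST /items HTTP/1.1\r\nHost: example.com\r\n\r\n"

# Request body as UTF-8 bytes. The text is Cyrillic, so the first byte is 0xd0.
message_body = u"\u043f\u0440\u0438\u0432\u0435\u0442".encode("utf-8")


def implicit_concat(text, raw):
    """Imitate Python 2's `unicode + str`: decode the bytes as ASCII first."""
    return text + raw.decode("ascii")


try:
    implicit_concat(msg, message_body)
    error = None
except UnicodeDecodeError as exc:
    # Same failure as in the traceback: 'ascii' codec can't decode byte 0xd0
    error = exc

# The fix: keep the header side as bytes too, so nothing is decoded implicitly.
msg_bytes = msg.encode("ascii")
payload = msg_bytes + message_body
```

With both sides as bytes, the concatenation is a plain byte join and the codec never gets involved, which is why turning the URL back into a str made the error go away.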