Sebastian Kirsch: Blog

Friday, 11 February 2005

Python string handling

Filed under: — Sebastian Kirsch @ 01:30

LUUSA runs the feed aggregator PlanetPlanet! on planet.luusa.org, which also subscribes to my weblog’s feed.

PlanetPlanet is written in Python, and we got bitten by some peculiarities in Python string handling, specifically the conversion between byte strings and unicode strings.

For some reason, the feed parser appears to put almost all parts of the content into byte strings (even if they contain non-ASCII characters), but very rarely it constructs unicode strings instead. Those unicode strings typically contain hyperlinks with, let’s say, “strange” URLs, for example URLs with query strings. In this case, it was the URL http://ithaka.ikp.uni-bonn.de/cgi-bin/lv/view.pl?lvNummer=3919&semDir=winter0405. I haven’t been able to identify the exact cause yet.

When the feed parser tries to merge the byte strings and the unicode strings, a conversion error occurs and causes the offending feed to be ignored.
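To illustrate the kind of failure involved, here is a minimal sketch. It assumes the Python 2 behaviour of the time, where combining a byte string with a unicode string implicitly decoded the byte string as ASCII. The sketch is written for Python 3 and makes that decoding step explicit; the helper py2_style_join and the sample byte string are hypothetical and are not taken from feedparser.py.

```python
# Emulates, in Python 3, the implicit ASCII decoding that Python 2 applied
# when a byte string was combined with a unicode string.
# py2_style_join is a hypothetical helper, not part of feedparser.py.

def py2_style_join(pieces):
    """Join pieces into one text string, decoding bytes as ASCII."""
    out = []
    for piece in pieces:
        if isinstance(piece, bytes):
            # Raises UnicodeDecodeError on any byte >= 0x80.
            piece = piece.decode("ascii")
        out.append(piece)
    return "".join(out)


pieces = [
    # Hypothetical feed content: a byte string holding non-ASCII UTF-8 data.
    "Sebastian\u2019s weblog ".encode("utf-8"),
    # A unicode string, like the ones built for URLs with query strings.
    "http://ithaka.ikp.uni-bonn.de/cgi-bin/lv/view.pl?lvNummer=3919&semDir=winter0405",
]

try:
    py2_style_join(pieces)
except UnicodeDecodeError as exc:
    print("merge failed:", exc.reason)
```

As long as every byte string is pure ASCII, the merge succeeds, which would be consistent with the error showing up only rarely.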

I found a very strange workaround for this problem: By converting all unicode strings to byte strings (unicodestring.encode("utf-8")) and all byte strings to unicode (bytestring.decode("utf-8", "ignore")), I was able to make the error disappear. I still don’t know what caused it, and why this method caused it to disappear.
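The general idea behind such a workaround can be sketched as follows: bring every piece to the same string type explicitly before merging, so that no implicit conversion has to happen. This is only a sketch in Python 3 syntax and not the actual change made to feedparser.py; the helpers to_text and merge are hypothetical, and the assumption that the feed data is UTF-8 is mine.

```python
# Normalize all pieces to text before joining them.
# to_text and merge are hypothetical helpers, not part of feedparser.py.

def to_text(piece, encoding="utf-8"):
    """Return piece as a text string.

    Byte strings are decoded with the given encoding (assumed UTF-8);
    bytes that cannot be decoded are silently dropped ("ignore").
    """
    if isinstance(piece, bytes):
        return piece.decode(encoding, "ignore")
    return piece


def merge(pieces):
    """Join a mix of byte strings and text strings into one text string."""
    return "".join(to_text(p) for p in pieces)


merged = merge([
    "caf\u00e9 ".encode("utf-8"),               # byte string with non-ASCII data
    "view.pl?lvNummer=3919&semDir=winter0405",  # already a text string
])
print(merged)
```

Note that the "ignore" error handler discards undecodable bytes without warning, so content in a different encoding would be damaged quietly rather than rejected.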

Our version of PlanetPlanet uses feedparser.py 2.7.6 by Mark Pilgrim; the error occurs in the output method of the class BaseHTMLProcessor in feedparser.py. There’s a version 3.3 of feedparser on SourceForge; we’ll have to see whether it’s a drop-in replacement for our version, and whether it fixes the problem.

Mark Pilgrim, the author of feedparser, also has a few choice words to say about Python and unicode:

I had a flash of insight and suddenly the entirety of Python’s Unicode support became clear to me. I coded madly for several hours until it faded. It’s entirely possible that that’s just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.

1 Comment

  1. utf8 problems on planet continued
    Unfortunately, Sebastian’s hacks to feedparser.py did not solve our utf8 problems for good. Waldemar’s latest blog entry caused another error. The sf.net version of feedparser.py contains critical api changes and therefore is not a drop-in replacemen…

    Trackback by Lars’ weblog - Monday, 14 February 2005 @ 00:32


Copyright © 1999-2004 Sebastian Marius Kirsch, webmaster@sebastian-kirsch.org, all rights reserved.