PlanetPlanet is written in Python, and we got bitten by some peculiarities in Python string handling, specifically the conversion between byte strings and unicode strings.
For some reason, it appears that the feed parser puts all parts of the content into byte strings (even if they contain unicode characters), but sometimes, very rarely, constructs unicode strings. These typically contain hyperlinks with, let’s say, “strange” URLs, for example URLs with query strings. In this case, it was the URL http://ithaka.ikp.uni-bonn.de/cgi-bin/lv/view.pl?lvNummer=3919&semDir=winter0405. I haven’t been able to identify the exact cause yet.
When it tries to merge the byte strings and the unicode strings, this error occurs and causes the offending feed to be ignored.
I found a very strange workaround for this problem: By converting all unicode strings to byte strings (unicodestring.encode("utf-8″)) and all byte strings to unicode (bytestring.decode("utf-8″, “ignore")), I was able to make the error disappear. I still don’t know what caused it, and why this method caused it to disappear.
Our version of PlanetPlanet uses feedparser.py 2.7.6 by Mark Pilgrim; the error occurs in the output method of the class BaseHTMLProcessor of feedparser.py. There’s a version 3.3 of feedparser on sourceforge; we’ll have to see whether it’s a drop-in replacement for our version, and whether it fixes the problem.
I had a flash of insight and suddenly the entirety of Python’s Unicode support became clear to me. I coded madly for several hours until it faded. It’s entirely possible that that’s just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.
Leave a comment
Sorry, the comment form is closed at this time.