Sebastian Kirsch: Blog

Friday, 11 February 2005

Python string handling

Filed under: — Sebastian Kirsch @ 01:30

LUUSA runs the feed aggregator PlanetPlanet! on planet.luusa.org, which also subscribes to my weblog’s feed.

PlanetPlanet is written in Python, and we got bitten by some peculiarities in Python string handling, specifically the conversion between byte strings and unicode strings.

For some reason, the feed parser appears to put almost all parts of the content into byte strings (even when they contain non-ASCII characters), but on rare occasions it constructs unicode strings instead. These typically contain hyperlinks with, let’s say, “strange” URLs, for example URLs with query strings. In this case, it was the URL http://ithaka.ikp.uni-bonn.de/cgi-bin/lv/view.pl?lvNummer=3919&semDir=winter0405. I haven’t been able to identify the exact cause yet.

When the parser tries to merge the byte strings and the unicode strings, an error occurs and the offending feed is ignored.
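The post does not name the exception, so the following is only one plausible reading of the failure. Python 2 converts a byte string to unicode with the ASCII codec whenever the two types are mixed, and that conversion fails on any non-ASCII byte. The sketch is written for Python 3, where the implicit conversion no longer exists, so the ASCII decode is spelled out by hand; merge_like_python2 is a hypothetical helper for illustration, not part of feedparser or PlanetPlanet.

```python
def merge_like_python2(pieces):
    """Join mixed byte/unicode pieces the way Python 2 would.

    Assumption: the failure comes from Python 2's implicit coercion,
    which decodes byte strings with the default ASCII codec.
    """
    merged = []
    for piece in pieces:
        if isinstance(piece, bytes):
            # Python 2 does this step silently; it raises
            # UnicodeDecodeError for any byte above 0x7F.
            piece = piece.decode("ascii")
        merged.append(piece)
    return "".join(merged)


if __name__ == "__main__":
    # Pure ASCII bytes merge without trouble.
    print(merge_like_python2([b"plain ", "text"]))

    # UTF-8 encoded bytes next to a unicode string trigger the error.
    try:
        merge_like_python2(["caf\u00e9".encode("utf-8"), " and a link"])
    except UnicodeDecodeError as exc:
        print("merge failed:", exc)
```

This would also fit the rarity of the problem: as long as every piece is a byte string, nothing is ever decoded, and a single unicode piece is enough to force the conversion of all the others.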

I found a very strange workaround for this problem: by converting all unicode strings to byte strings (unicodestring.encode("utf-8")) and all byte strings to unicode (bytestring.decode("utf-8", "ignore")), I was able to make the error disappear. I still don’t know what caused it, and why this method caused it to disappear.
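A related and somewhat tidier approach is to normalise every piece to a single string type before joining, so that no implicit conversion can happen. The sketch below is written for Python 3 and is not the patch applied to PlanetPlanet; to_text and join_pieces are hypothetical names, and the UTF-8 encoding with the "ignore" error handler is carried over from the workaround above.

```python
def to_text(piece, encoding="utf-8", errors="ignore"):
    """Return piece as a unicode string, decoding it if it is bytes."""
    if isinstance(piece, bytes):
        # "ignore" silently drops bytes that are not valid UTF-8,
        # mirroring bytestring.decode("utf-8", "ignore").
        return piece.decode(encoding, errors)
    return piece


def join_pieces(pieces):
    """Join byte and unicode pieces after normalising them to unicode."""
    return "".join(to_text(piece) for piece in pieces)


if __name__ == "__main__":
    pieces = ["caf\u00e9 ".encode("utf-8"), "link"]
    print(join_pieces(pieces))
```

The price of "ignore" is that malformed input disappears without a trace; "replace" would leave a visible marker in the output instead.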

Our version of PlanetPlanet uses feedparser.py 2.7.6 by Mark Pilgrim; the error occurs in the output method of the class BaseHTMLProcessor in feedparser.py. Version 3.3 of feedparser is available on SourceForge; we’ll have to see whether it’s a drop-in replacement for our version, and whether it fixes the problem.

Mark Pilgrim, the author of feedparser, also has a few choice words to say about Python and unicode:

I had a flash of insight and suddenly the entirety of Python’s Unicode support became clear to me. I coded madly for several hours until it faded. It’s entirely possible that that’s just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.

Copyright © 1999--2004 Sebastian Marius Kirsch webmaster@sebastian-kirsch.org , all rights reserved.