Sebastian Kirsch: Blog

Friday, 11 February 2005

Python string handling

Filed under: tech

— Sebastian Kirsch @ 01:30

LUUSA runs the feed aggregator PlanetPlanet! on planet.luusa.org, which also subscribes to my weblog’s feed.

PlanetPlanet is written in Python, and we got bitten by some peculiarities in Python string handling, specifically the conversion between byte strings and unicode strings.

For some reason, it appears that the feed parser puts all parts of the content into byte strings (even if they contain unicode characters), but sometimes, very rarely, constructs unicode strings. These typically contain hyperlinks with, let’s say, “strange” URLs, for example URLs with query strings. In this case, it was the URL http://ithaka.ikp.uni-bonn.de/cgi-bin/lv/view.pl?lvNummer=3919&semDir=winter0405. I haven’t been able to identify the exact cause yet.

When the byte strings and the unicode strings are merged, an error occurs and the offending feed is ignored.
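The post does not record the exact error message, so the following is only a sketch of the most likely mechanism, not a reproduction of the feedparser code. Python 2 silently decoded a byte string as ASCII whenever it was combined with a unicode string, and that decode fails as soon as the byte string holds non-ASCII data. The example is written for Python 3, where the decode has to be spelled out; the helper name py2_style_concat and the sample strings are made up for illustration.

```python
# Python 3 sketch of what Python 2 did implicitly when a byte string and a
# unicode string were joined: the byte string was decoded as ASCII first.
# py2_style_concat is a hypothetical helper, not part of feedparser.
def py2_style_concat(byte_part: bytes, text_part: str) -> str:
    # Python 2 performed this decode silently; here it is explicit.
    return byte_part.decode("ascii") + text_part


# Pure ASCII bytes merge without trouble.
plain = py2_style_concat(b"<a href='x'>", "link</a>")

try:
    # UTF-8 encoded "ü" is the two bytes 0xC3 0xBC, neither of which is
    # valid ASCII, so the decode step raises UnicodeDecodeError.
    py2_style_concat("Zürich ".encode("utf-8"), "link")
    failed = False
except UnicodeDecodeError:
    failed = True
```

This would also explain why the problem showed up so rarely: it needs both a unicode string and a byte string with non-ASCII content in the same entry.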

I found a very strange workaround for this problem: By converting all unicode strings to byte strings (unicodestring.encode("utf-8")) and all byte strings to unicode (bytestring.decode("utf-8", "ignore")), I was able to make the error disappear. I still don’t know what caused it, and why this method caused it to disappear.
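A more conventional variant of the same idea is to pick one string type and convert everything to it before merging. The sketch below is not the patch that was applied to feedparser.py; it is a Python 3 illustration that assumes the byte strings are UTF-8 and uses the same "ignore" error handler as the workaround above. The names to_text and join_pieces are hypothetical.

```python
def to_text(piece, encoding="utf-8"):
    """Return piece as text, decoding byte strings and dropping invalid bytes."""
    if isinstance(piece, bytes):
        # "ignore" silently discards bytes that are not valid in the encoding.
        return piece.decode(encoding, "ignore")
    return piece


def join_pieces(pieces):
    """Merge a mixed list of byte strings and text strings into one text string."""
    return "".join(to_text(p) for p in pieces)


# A mix of byte strings and text, similar in shape to parsed feed content.
pieces = ["Zürich ".encode("utf-8"), "<a href='view.pl?a=1&b=2'>", b"link</a>"]
result = join_pieces(pieces)
```

Note that "ignore" trades errors for silent data loss: a byte that is not valid UTF-8 simply vanishes from the output.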

Our version of PlanetPlanet uses feedparser.py 2.7.6 by Mark Pilgrim; the error occurs in the output method of the class BaseHTMLProcessor of feedparser.py. There’s a version 3.3 of feedparser on sourceforge; we’ll have to see whether it’s a drop-in replacement for our version, and whether it fixes the problem.

Mark Pilgrim, the author of feedparser, also has a few choice words to say about Python and unicode:

I had a flash of insight and suddenly the entirety of Python’s Unicode support became clear to me. I coded madly for several hours until it faded. It’s entirely possible that that’s just the LSD talking, but thanks to the magic of open source, everyone can now share in my good trip.

1 Comment

utf8 problems on planet continued
Unfortunately, Sebastian’s hacks to feedparser.py did not solve our utf8 problems for good. Waldemar’s latest blog entry caused another error. The sf.net version of feedparser.py contains critical api changes and therefore is not a drop-in replacemen…

Trackback by Lars' weblog, Monday, 14 February 2005 @ 00:32
