Sebastian Kirsch: Blog

Friday, 01 April 2005

Data Structures in Python

Filed under: — Sebastian Kirsch @ 19:05

Modern programming languages, especially scripting languages like Perl and Python exhibit a suspicious lack of data structures. They provide arrays (or lists), and associative arrays, which are often implemented using a hash table with a fixed number of buckets.

More complicated data structures are missing: you won’t find binary search trees (such as AVL trees or red-black trees), or priority queues (like fibonacci heaps), or tree structures for implementing sets in their standard library, nor data structures for common problems, such as graph structures. And this is despite the fact that the standard library of those languages is usually huge, providing everything from HTTP clients over modules for manipulating emails to XML-RPC interfaces. (In contrast, the C++ Standard Template Library provides a set and priority queue implementation, as does the Moscow ML standard library.)

Why is that? Is it because everyone who needs a specialised data structure is supposed to implement it himself? Or are the performance gains deemed to be negligible because the built-in data structures are heavily optimized and therefore “fast enough"? Or because nobody expects these languages to be used in performance-critical environments anyway? I don’t know …

Common wisdom in data structure design is that an algorithm that is in a lower complexity class will always outperform one in a higher class for large problems, regardless of its overhead. So a search tree implementation in Python should be able to outperform the built-in hash table implementation for large data sets regardless of how well optimized the hash table is.

For example: A lookup in a hash with 8 buckets (the standard size in Perl) and about 10000 items will need on average about 600 operations. (On average 1250 items in one bucket, and the right one will be found after searching on average half of the items.) Hash tables are only appropriate when the number of items that will be stored is about the same as the number of buckets. A lookup in a balanced search tree of the same size will need at most 14 operations (because a balanced binary tree with 10000 nodes has a depth of log2(10000) ≈ 13.3)

Every computer science student has implemented these data structures at least once, and most have before the end of their second year – so it cannot be the shortage of capable programmers or maintainers that prevents them from being included in the standard library.

Of course, there are Perl packages on CPAN that implement trees and sets and whatnot. Usually, they are at version 0.1 or 0.3 and have not seen a release for a couple of years – and look like they were implemented by a computer science student before the end of his second year. By putting those algorithms in the standard library, one would ensure that the code is properly maintained and follows the coding standards.

For Python, there is an online book called Data Structures in Python, with downloadable source code. The author has a Ph.D. in computer science, was associate professor at the University of Waterloo, and published two books on data structures and algorithm with OO languages, namely Java and C++. These books, plus their C#, Python and Ruby versions are all available online. However, the source code is copyrighted by the author and is not licensed for further use, so one cannot use it in one’s own projects.

Comments

No comments yet.

RSS feed for comments on this post.

Leave a comment

Sorry, the comment form is closed at this time.


Copyright © 1999--2004 Sebastian Marius Kirsch webmaster@sebastian-kirsch.org , all rights reserved.