Similarity thesauri

This code was used to produce the examples for my talk "Similarity thesauri and cross-language retrieval". More information about this talk is on my studies page; a handout is also available.

The example code was written in Python; it requires Python version 2.2 or higher.

`IRstruct.py`

This file contains the following classes:

Token: This class provides the link between items and features.
Tokenizable: This is the superclass of all objects that can be decomposed into tokens.
TokString: The simplest kind of document, consisting of just a string.
Properties: This class is derived from UserDict and implements the subsumption order on feature structures as operators <= and >=.
Document: Documents can have additional properties, for example their language. Furthermore, they can be composed of other documents.
IndexComp: This is the common superclass of Item and Feature. The constructor takes a Properties object as its argument and either returns a previously constructed object with the same properties or constructs a new one.
Item: This class is derived from IndexComp without any changes.
Feature: This class is derived from IndexComp without any changes.
IRstruct: This class provides the basic functions for IR systems, for example weighting methods and storage of items and features.
SimThes: This class implements the construction of a similarity thesaurus as described in the handout.
SimThes_CL: This class implements a cross-language similarity thesaurus. It changes only the output functions.
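
The descriptions of Properties and IndexComp above can be made concrete with a small sketch. It is not the code from `IRstruct.py`: it is written for Python 3 (where UserDict lives in `collections`), it treats feature structures as flat attribute/value mappings, and it assumes that `a <= b` means "every pair in a also occurs in b"; the original may assign the direction differently. The `key` method and the `_instances` cache are hypothetical names introduced only for illustration.

```python
from collections import UserDict


class Properties(UserDict):
    """Flat feature structure: a mapping from attribute names to values."""

    def __le__(self, other):
        # Assumed direction: self <= other when every attribute/value
        # pair of self also occurs in other (self is more general).
        return all(key in other and other[key] == value
                   for key, value in self.items())

    def __ge__(self, other):
        # Mirror image of __le__.
        return other <= self

    def key(self):
        # Hypothetical helper: hashable, order-independent summary of
        # the pairs. Requires all values to be hashable.
        return frozenset(self.items())


class IndexComp:
    """Hands back the cached object when the same properties were seen."""

    _instances = {}  # hypothetical cache shared by all subclasses

    def __new__(cls, props):
        # The class is part of the cache key, so an Item and a Feature
        # with equal properties remain distinct objects.
        key = (cls, props.key())
        obj = IndexComp._instances.get(key)
        if obj is None:
            obj = super().__new__(cls)
            obj.props = props
            IndexComp._instances[key] = obj
        return obj


class Item(IndexComp):
    pass


class Feature(IndexComp):
    pass


# Usage with made-up properties:
german = Properties({"lang": "de"})
german_noun = Properties({"lang": "de", "pos": "noun"})
more_general = german <= german_noun              # subsumption test
same_object = Item(german) is Item(Properties({"lang": "de"}))
```

Doing the lookup in `__new__` keeps the call site looking like an ordinary constructor call, which matches the description of IndexComp; a factory function would work equally well.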

Most classes contain methods .asTeX and .asMP that produce TeX and MetaPost snippets describing the object.

`docs.py`

This file contains the documents used to construct the examples in the handout.

`irtest.py`

This file constructs two similarity thesauri from the documents in docs.py and writes the corresponding TeX and MetaPost snippets to files.
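
For readers without the handout, the general idea of a similarity thesaurus can be sketched as follows. This is a generic illustration and not the method implemented in SimThes: each term is represented by the documents it occurs in, and two terms count as similar when those vectors point in a similar direction (cosine similarity). Raw occurrence counts stand in for whatever weighting the handout uses, the function names are hypothetical, and the three toy documents are invented rather than taken from `docs.py`. The sketch is written for Python 3.

```python
import math
from collections import defaultdict


def term_vectors(docs):
    """Map each term to {document name: occurrence count}."""
    vectors = defaultdict(dict)
    for name, text in docs.items():
        for term in text.lower().split():
            vectors[term][name] = vectors[term].get(name, 0) + 1
    return dict(vectors)


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(name, 0) for name, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v)


def similar_terms(term, vectors, top=3):
    """Return the `top` terms most similar to `term`, best first."""
    scores = [(cosine(vectors[term], vec), other)
              for other, vec in vectors.items() if other != term]
    # Sort by descending score, breaking ties alphabetically.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(other, score) for score, other in scores[:top]]


# Invented toy collection, for illustration only.
docs = {"d1": "cat dog", "d2": "cat dog bird", "d3": "bird fish"}
vectors = term_vectors(docs)
neighbours = similar_terms("cat", vectors, top=2)
```

Here "cat" and "dog" occur in exactly the same documents, so "dog" ranks first among the neighbours of "cat", followed by "bird", which shares one document with it.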