A student software project by Timo Taglieber and Florian Winkelmeier.
Completed in 2007. Only the documentation is available here.

Tokentree concept and implementation

The main idea of this project is to transform a subset of a corpus into a binary tree-like structure, in which the user is able to navigate and find the document(s) that best match the search query. We used the current German Wikipedia article corpus, a large UTF-8 encoded XML file (~4 GB) downloadable at http://download.wikimedia.org/dewiki/. The code is entirely written in Python. The database back-end is a MySQL server, and the web interface is provided by a standard Apache installation with CGI execution of Python scripts enabled.

Preprocessing the corpus

The following steps were taken to make this web interface (and therefore the whole project) available:

  1. Tokenize corpus (map tokens to IDs).
  2. Store the tokens in the database.
  3. Vectorize documents (store the tokens and their frequency in the database).
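The three steps above can be sketched as follows. This is only an illustration: the function and variable names are hypothetical, and the real project stores its results in MySQL tables rather than in-memory dictionaries.

```python
# Hypothetical sketch of the preprocessing pipeline: tokenize,
# assign token IDs, and build per-document frequency vectors.
# Names and the tokenizer regex are illustrative, not the project's
# actual schema.
import collections
import re

def tokenize(text):
    """Split a document into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())

def preprocess(documents):
    """Map tokens to IDs (steps 1-2) and vectorize documents (step 3).

    documents: {doc_id: raw text}
    returns:   ({token: token_id}, {doc_id: {token_id: frequency}})
    """
    token_ids = {}
    vectors = {}
    for doc_id, text in documents.items():
        counts = collections.Counter(tokenize(text))
        vec = {}
        for token, freq in counts.items():
            # setdefault assigns a fresh ID the first time a token is seen
            tid = token_ids.setdefault(token, len(token_ids))
            vec[tid] = freq
        vectors[doc_id] = vec
    return token_ids, vectors
```

In the real system the two mappings would be written to the database once, so that query-time code only reads the precomputed vectors.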

How the tree grows breadth-first

If a query is submitted through the web interface, the database is searched for all documents that contain any matching tokens. Those documents form the relevant subset from which the tokentree will be created. The node-pairing algorithm, explained in the section below, bundles similar documents into node-pairs. After this is done progressively for every level of the tree, the core method of the tree module returns the tree's root node for further operations.

Two documents/nodes are merged simply by determining their token intersection: every token that occurs in both nodes at least once is copied to the freshly created parent node, which points to the two pair nodes with its child references. The binary tree structure (necessary to always end up with a single root node) is retained by adding empty dummy nodes on tree levels with an odd number of nodes (in a perfect binary tree, the number of leaves is a power of two).

The main tree traversal algorithm iterates over all nodes in the tree breadth-first, starting from the root node. As the HTML output of the tree is implemented with simple tables, the dummy nodes are useful in making the connection between child and parent nodes visible, even without drawing actual arrows between them.
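A minimal sketch of the level-by-level merging might look like the following. The class and function names are invented for illustration; in particular, the choice of which nodes form a pair is made by the similarity-based pairing described in the next section, so here the input order is simply assumed to be the pairing order, and summing the children's frequencies in the parent is an assumption, not necessarily what the project does.

```python
# Illustrative sketch: merge nodes pairwise, level by level, padding
# odd levels with empty dummy nodes, until a single root remains.
class Node:
    def __init__(self, tokens, children=()):
        self.tokens = tokens            # {token_id: frequency}
        self.children = list(children)  # empty for leaves and dummies

def merge(a, b):
    """Parent node holds the token intersection of its two children."""
    common = a.tokens.keys() & b.tokens.keys()
    # Summing the frequencies is an assumed policy for this sketch.
    tokens = {t: a.tokens[t] + b.tokens[t] for t in common}
    return Node(tokens, children=(a, b))

def build_tree(level):
    """Collapse a list of nodes into one root, breadth-first."""
    while len(level) > 1:
        if len(level) % 2:              # odd level: append a dummy node
            level.append(Node({}))
        level = [merge(level[i], level[i + 1])
                 for i in range(0, len(level), 2)]
    return level[0]
```

Note that a dummy node shares no tokens with anything, so intersections along its branch are empty; that is harmless for navigation and, as described above, keeps the table-based HTML layout aligned.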

Node-pairing algorithm

Combining two similar vectors (documents) into a pair is kept deliberately simple in our project, since this part of the tree-building engine is performance-critical. The first vector is taken from the stack and compared to every other vector in the subset. The vector with the highest similarity (determined by the similarity measures in the module simfunctions.py) becomes the first vector's partner. The pair is then removed from the stack and the selection starts over. This greedy method is of course only an approximation of a perfect pairing, but it is nevertheless good enough to produce reasonable results.
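The greedy selection described above can be sketched like this. Cosine similarity is used here as a stand-in for whatever measures simfunctions.py actually provides; the function names are hypothetical.

```python
# Greedy node-pairing sketch: pop the first vector, find its most
# similar partner among the remaining ones, record the pair, repeat.
# cosine() is an assumed similarity measure standing in for the
# project's simfunctions.py.
import math

def cosine(u, v):
    """Cosine similarity of two sparse {token_id: frequency} vectors."""
    common = u.keys() & v.keys()
    dot = sum(u[t] * v[t] for t in common)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def pair_up(vectors):
    """Return (i, j) index pairs; an odd leftover vector stays unpaired."""
    stack = list(range(len(vectors)))
    pairs = []
    while len(stack) > 1:
        first = stack.pop(0)
        best = max(stack, key=lambda j: cosine(vectors[first], vectors[j]))
        stack.remove(best)
        pairs.append((first, best))
    return pairs
```

Each pass scans the whole remaining stack, so the pairing of one tree level is quadratic in the number of nodes, which is exactly why the comparison itself has to stay cheap.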

Approaches for improvements

The bottleneck of this project is the work required to compute the subset of documents relevant to a given query. Reading four gigabytes of data from a database (even on the same machine) takes far too long before the actual output (the tokentree) appears. To get quick results, we currently use a smaller version of the big corpus (only 40 megabytes, containing about 3000 documents). Please keep in mind that this massively limits the variety of any tokentree created right now. The idea of using a binary tree to represent the corpus goes back to Dr. Markus Demleitner, a former lecturer at the Department of Computational Linguistics. We have limited experience with statistics, so there may well be ways to improve this project with better similarity calculations. In conclusion, the query results show that the basic idea works (you just have to take a close enough look :> ).