An important part of doing research in computer science involves coding. Implementing ideas and algorithms to verify a hypothesis, or visualizing certain aspects of data, means getting your hands dirty. Especially in the field of information retrieval, practical, system-based evaluation of novel retrieval models and algorithms is essential. As such, I have implemented all of the mathematical models developed in my research using a variety of languages, frameworks, and technologies, including (but not limited to) C++, Java, Perl, and Hadoop. All in all, over the years I’ve coded quite a number of things, ranging from Web 2.0 interfaces to C++ libraries. Below you can find a sample of these. In case you’re interested in the implementation of a particular model in a paper of mine, don’t hesitate to ask.

AIDA toolkit

I was the main developer of the AIDA toolkit from 2007 through 2009. It is a suite of tools for extracting, storing, and retrieving information from textual documents. In particular, it uses text mining techniques to populate an RDF knowledge base, which in turn is used to improve information access to the source documents. The main language is Java, and we use a service-oriented architecture (SOA) in which each component is exposed as both a SOAP and a REST web service. These web services were used by four project partners (companies and/or other universities) to integrate our tools into their workflows. I also developed several clients, using HTML/Javascript/servlets, that integrate and aggregate the web services in a common interface. The development team had five members, four of whom actively contributed code. It was my responsibility to make sure all components were thoroughly tested and fully functional and interoperable. The main operating systems I used for this project were Linux (CentOS and Red Hat) and Mac OS X, although I also regularly used Windows for cross-platform testing. All in all, I have produced thousands of lines of code for this project.
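The actual AIDA service interfaces are not reproduced here, but to give an idea of the SOA setup, here is a minimal, hypothetical sketch of exposing a component as a REST endpoint using the JDK's built-in `HttpServer`. The component itself (a trivial tokenizer) and all names are illustrative, not the toolkit's.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: exposing a text-processing component as a REST
// endpoint. The real AIDA services used a full SOAP/REST stack; the
// component and endpoint names here are stand-ins.
public class RestComponent {

    // The "component": a trivial tokenizer standing in for a text mining tool.
    static String tokenize(String text) {
        return String.join("\n", text.trim().split("\\s+"));
    }

    // Start an HTTP server that serves the component at /tokenize?text=...
    public static HttpServer start(int port) throws Exception {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/tokenize", exchange -> {
            // The query string carries the input text; getQuery() percent-decodes it.
            String query = exchange.getRequestURI().getQuery();
            String text = (query == null) ? "" : query.replaceFirst("^text=", "");
            byte[] body = tokenize(text).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(200, body.length);
            try (OutputStream os = exchange.getResponseBody()) {
                os.write(body);
            }
        });
        server.start();
        return server;
    }
}
```

In the real toolkit each such endpoint sat next to a SOAP binding of the same component, so partners could pick whichever protocol fit their workflow.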

ILPS Lucene

ILPS Lucene is a heavily modified version of Apache Lucene that replaces Lucene’s heuristic retrieval model with an implementation of the multinomial language modeling framework for information retrieval. Apache Lucene is highly optimized towards its own retrieval model, and adding “common” language modeling calculations required substantial rewrites of the (Java) code. See http://ilps.science.uva.nl/resources/lm-lucene for more information.
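To make the difference concrete, here is a small, self-contained sketch of the kind of score a multinomial language model assigns: the log-probability of the query under a document model, smoothed against the collection model (Jelinek–Mercer smoothing; the interpolation weight lambda is my choice here, not necessarily the toolkit's default). This is not the ILPS Lucene code itself, just the underlying calculation.

```java
import java.util.HashMap;
import java.util.Map;

// Self-contained sketch of multinomial language model scoring with
// Jelinek-Mercer smoothing; not the actual ILPS Lucene implementation.
public class LmScorer {

    // Term frequency counts for a token sequence.
    static Map<String, Integer> counts(String[] tokens) {
        Map<String, Integer> m = new HashMap<>();
        for (String t : tokens) m.merge(t, 1, Integer::sum);
        return m;
    }

    // log P(q|d) = sum over query terms t of
    //   log( (1 - lambda) * tf(t,d)/|d| + lambda * cf(t)/|C| )
    static double logScore(String[] query, String[] doc,
                           String[] collection, double lambda) {
        Map<String, Integer> tf = counts(doc);
        Map<String, Integer> cf = counts(collection);
        double score = 0.0;
        for (String t : query) {
            double pDoc = tf.getOrDefault(t, 0) / (double) doc.length;
            double pColl = cf.getOrDefault(t, 0) / (double) collection.length;
            score += Math.log((1 - lambda) * pDoc + lambda * pColl);
        }
        return score;
    }
}
```

Wiring a score like this into Lucene is exactly where the rewrites come in: Lucene's scoring hooks assume its own tf-idf-style formula, so collection statistics and smoothing have to be threaded through the index and search code.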

Lucene Query Interface

This interface is based on SOAP Lucene and can interact with a Sesame repository (also through SOAP).

SOAP Lucene

This wrapper turns Lucene into a SOAP webservice.

GridLucene

I have implemented Grid-specific classes for Lucene that let Lucene interact with files on a Grid (both for indexing and retrieval), as described in Deploying Lucene on the Grid. They use the Jargon API extensively; Jargon also makes it possible to incorporate metadata about files, directories, and/or collections transparently into Lucene. The files can be obtained here, under the same license as the one Lucene is distributed with. You will also need the Jargon and Lucene jar files. GridLucene has been tested to work with Jargon v1.4.20 and Lucene v2.0.0. If you have any questions, suggestions, and/or comments regarding GridLucene, feel free to send me an e-mail; I’ll be happy to answer them.

Lucene/Lemur/Indri Utilities and Classes

I have written various tools for preprocessing, normalization, and calculations such as pointwise mutual information (PMI). One day these will all end up here.
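As an illustration of the kind of calculation these utilities perform, here is a minimal sketch of PMI, PMI(x, y) = log( p(x, y) / (p(x) p(y)) ), estimated from raw counts. The counting scheme (what constitutes an "event": a window, a sentence, a document) is left open, as it depends on the task; this is not the utilities' actual code.

```java
// Minimal PMI sketch (not the actual utility code): probabilities are
// estimated by maximum likelihood from occurrence counts.
public class Pmi {

    // nX, nY: occurrences of each term; nXY: co-occurrences; n: total events.
    // Returns log( p(x,y) / (p(x) * p(y)) ); positive when x and y co-occur
    // more often than independence would predict.
    static double pmi(long nX, long nY, long nXY, long n) {
        double pX = nX / (double) n;
        double pY = nY / (double) n;
        double pXY = nXY / (double) n;
        return Math.log(pXY / (pX * pY));
    }
}
```

For example, two terms that each occur in 10% of events and always co-occur get PMI log(0.1 / (0.1 * 0.1)) = log 10.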

Parsimonious Implementation

Please send me an e-mail if you’re interested in obtaining a copy of the code (based on Lemur/Indri) that I used for the parsimonization experiments.

OTRS Statistics

I did my master’s internship project at a web hosting company called Hostnet, located in Amsterdam. My assignment was to perform an in-depth statistical and qualitative analysis of the response times of the various departments, based on the open source ticketing system OTRS. To this end, I wrote a Perl module that performs several statistical functions on OTRS data.

Caveat

Mind you, this was the first major Perl script I ever wrote, so it is definitely not optimized. Secondly, the script is tailored to the specific environment of Hostnet, meaning, for example, that it only supports a MySQL database. However, I do believe some of the ideas may be of use to the interested reader. It has been tested to work with OTRS v1, but should port to v2. Please note that the most recent versions promise to offer a new statistical framework, built directly into OTRS.

Statistics

The script itself is able to produce the following ticket statistics, aggregated on a per-day or per-week basis, between specified dates and in specified queues:

  • Number of new tickets (total/calls/e-mails)
  • Workload (number of calls, in- and outgoing e-mails)
  • Efficiency (time calculations):
    • Open time
    • Reply time
    • Resolution time
  • Efficacy:
    • Average number of follow-ups/replies per ticket
    • Number of first time fixes

The output format can be selected as well: graphs, CSV, and HTML tables are supported. The package can be obtained upon request.
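The script itself is in Perl, but one of the statistics above, average reply time aggregated per day, can be sketched in a few lines of Java (hypothetical names; the input is assumed to be pairs of ticket-creation and first-reply timestamps):

```java
import java.time.Duration;
import java.time.Instant;
import java.time.LocalDate;
import java.time.ZoneOffset;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch (not the Perl module itself) of per-day reply-time
// aggregation: tickets are grouped by creation day, and reply times are
// averaged within each day.
public class ReplyTimes {

    // tickets: pairs of [created, firstReply] instants.
    static Map<LocalDate, Double> avgReplySeconds(List<Instant[]> tickets) {
        Map<LocalDate, long[]> acc = new TreeMap<>(); // day -> {totalSecs, count}
        for (Instant[] t : tickets) {
            LocalDate day = t[0].atZone(ZoneOffset.UTC).toLocalDate();
            long secs = Duration.between(t[0], t[1]).getSeconds();
            long[] a = acc.computeIfAbsent(day, d -> new long[2]);
            a[0] += secs;
            a[1]++;
        }
        Map<LocalDate, Double> out = new TreeMap<>();
        acc.forEach((d, a) -> out.put(d, a[0] / (double) a[1]));
        return out;
    }
}
```

Per-week aggregation works the same way with a different grouping key, and the other time-based statistics (open time, resolution time) differ only in which pair of timestamps is subtracted.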

Example

An example graph, created with this package:

[Figure: workload graph]