An important part of doing research in computer science involves coding. Implementing ideas and algorithms to verify a hypothesis, or visualizing certain aspects of data, involves getting your hands dirty. Especially in the field of information retrieval, practical, system-based evaluation of novel retrieval models and algorithms is essential. As such, I have implemented all of the mathematical models developed in my research using a variety of languages, frameworks, and technologies, including (but not limited to) C++, Java, Perl, and Hadoop. Over the years I’ve coded quite a number of things, ranging from Web 2.0 interfaces to C++ libraries. Below you can find a sample of these. If you’re interested in the implementation of a particular model from one of my papers, don’t hesitate to ask.
ILPS Lucene is a heavily modified version of Apache Lucene that replaces Apache Lucene’s heuristic retrieval model with an implementation of the multinomial language modeling framework for information retrieval. Apache Lucene is highly optimized for its own retrieval model, and adding “common” language modeling calculations required substantial rewrites of the (Java) code. See http://ilps.science.uva.nl/resources/lm-lucene for more information.
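To give a feel for what a “common” language modeling calculation looks like, here is a minimal Python sketch of query-likelihood scoring under a multinomial document model. It is not the ILPS Lucene code: the Dirichlet smoothing, the value of `mu`, the function name, and the toy documents are all assumptions made for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query_terms, doc_terms, collection_tf, collection_len, mu=2000.0):
    """Log P(query | document) under a Dirichlet-smoothed multinomial model.

    The smoothing method and mu are assumptions of this sketch.
    """
    doc_tf = Counter(doc_terms)
    doc_len = len(doc_terms)
    score = 0.0
    for term in query_terms:
        # Background probability of the term in the whole collection.
        p_collection = collection_tf.get(term, 0) / collection_len
        if p_collection == 0.0:
            continue  # term never seen in the collection: ignored in this sketch
        # Document estimate, smoothed towards the collection model.
        p_term = (doc_tf[term] + mu * p_collection) / (doc_len + mu)
        score += math.log(p_term)
    return score


# Toy collection (hypothetical data).
docs = {
    "d1": "language models for information retrieval".split(),
    "d2": "grid computing with lucene".split(),
}
collection_tf = Counter(t for terms in docs.values() for t in terms)
collection_len = sum(collection_tf.values())

query = ["language", "retrieval"]
ranking = sorted(
    docs,
    key=lambda d: query_log_likelihood(query, docs[d], collection_tf, collection_len),
    reverse=True,
)
print(ranking)
```

Each score depends on document length and on collection-wide statistics at query time. That is the kind of information a heuristic scoring pipeline does not necessarily expose in the form needed here.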
Lucene Query Interface
This interface is based on SOAP Lucene and can interact with a Sesame repository (also through SOAP).
This wrapper turns Lucene into a SOAP webservice.
I have implemented Grid-specific classes for Lucene, to let Lucene interact with files on a Grid (both for indexing and retrieval), as described in Deploying Lucene on the Grid. They use the Jargon API extensively. Additionally, the use of Jargon makes it possible to incorporate metadata about files, directories, and/or collections transparently into Lucene. The files can be obtained here, under the same license as the one Lucene is distributed with. You will also need the Jargon and Lucene jar files. GridLucene has been tested to work with Jargon v1.4.20 and Lucene v2.0.0. If you have any questions, suggestions, or comments regarding GridLucene, feel free to send me an e-mail.
Lucene/Lemur/Indri Utilities and Classes
I have written various tools for preprocessing, normalization, calculations such as PMI, etc. One day these will all end up here.
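As an illustration of the PMI calculation mentioned above, here is a minimal Python sketch of pointwise mutual information computed from raw counts. The function name and the example counts are hypothetical, and the sketch assumes that co-occurrence is counted over some fixed unit such as documents or windows.

```python
import math


def pmi(count_xy, count_x, count_y, total):
    """Pointwise mutual information (in bits) from raw counts.

    PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), with probabilities
    estimated as counts divided by `total` (e.g. number of documents).
    """
    if count_xy == 0:
        return float("-inf")  # the pair never co-occurs
    return math.log2((count_xy * total) / (count_x * count_y))


# Hypothetical counts: x occurs in 100 of 1000 documents, y in 50, both in 20.
print(pmi(20, 100, 50, 1000))  # 2.0
```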
Please send me an e-mail if you’re interested in obtaining a copy of the code (based on Lemur/Indri) that I used for the parsimonization experiments.
I did my master’s internship project at a webhosting company called Hostnet, located in Amsterdam. My assignment was to perform an in-depth statistical and qualitative analysis of the response times of the various departments, based on the open source ticketing system OTRS. To this end I’ve written a Perl module that performs several statistical functions, based on OTRS data.
Mind you, it was the first major Perl script I ever wrote, so it’s definitely not optimized. Secondly, the script is biased towards the specific environment of Hostnet, meaning for example that it only supports a MySQL DB. However, I do believe some of the ideas may be of use to the interested reader. It has been tested to work using OTRS v1, but should port to v2. Please note that the most recent versions promise to offer a new statistical framework, built directly into OTRS.
The script itself is able to produce the following ticket statistics, aggregated on a per-day or per-week basis, between specified dates and in specified queues:
- Number of new tickets (total/calls/e-mails)
- Workload (number of calls, in- and outgoing e-mails)
- Efficiency (time calculations):
  - Open time
  - Reply time
  - Resolution time
- Average number of follow-ups/replies per ticket
- Number of first-time fixes
The output can be selected as well; graphs, CSV and HTML (tables) are supported. The package can be obtained upon request.
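To make the per-week aggregation concrete, here is a small Python sketch of the idea. It is not the Perl module itself: the record layout `(created, first_reply, closed)`, the function name, and the sample tickets are hypothetical and do not reflect the OTRS database schema.

```python
from collections import defaultdict
from datetime import datetime
from statistics import mean

# Hypothetical ticket records: (created, first_reply, closed).
tickets = [
    (datetime(2006, 3, 6, 9, 0), datetime(2006, 3, 6, 10, 0), datetime(2006, 3, 7, 9, 0)),
    (datetime(2006, 3, 8, 14, 0), datetime(2006, 3, 8, 17, 0), datetime(2006, 3, 8, 18, 0)),
    (datetime(2006, 3, 13, 11, 0), datetime(2006, 3, 13, 11, 30), datetime(2006, 3, 14, 11, 0)),
]


def hours(delta):
    """Convert a timedelta to hours."""
    return delta.total_seconds() / 3600


def weekly_stats(tickets):
    """Group tickets by ISO (year, week) of creation and average the timings."""
    buckets = defaultdict(list)
    for created, first_reply, closed in tickets:
        iso = created.isocalendar()
        buckets[(iso[0], iso[1])].append((created, first_reply, closed))

    stats = {}
    for week, rows in sorted(buckets.items()):
        stats[week] = {
            "new_tickets": len(rows),
            "avg_reply_hours": mean(hours(r - c) for c, r, _ in rows),
            "avg_resolution_hours": mean(hours(d - c) for c, _, d in rows),
        }
    return stats


stats = weekly_stats(tickets)
for week, row in stats.items():
    print(week, row)
```

Switching to a per-day basis would only require changing the grouping key, for example to `created.date()`.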
An example graph, created with this package: