Deploying Lucene on the Grid

We investigate whether and how open source retrieval engines can be deployed in a grid environment. Compared with conventional distributed IR, one of the most significant differences is the lack of a priori knowledge about which nodes are available. In addition, it is not known in advance when a particular node will have the time and resources to start a submitted job. Conventional methods such as RMI are therefore not directly usable, and we propose a different approach that uses middleware designed specifically for grids. We describe GridLucene, an extension of the open source engine Lucene with grid-specific classes built on this middleware. We report on an initial comparison between GridLucene and Lucene and find a minor penalty (in terms of execution time) for grid-based indexing and a more serious penalty for grid-based retrieval.

The middleware we use can gather a set of physical resources into a single logical resource with abstract, user-definable properties. These properties can be used during indexing and retrieval to tell GridLucene which files it needs to access. By using this kind of semantic information, grid nodes can “discover” which indices exist on the grid and which particular documents need to be indexed.
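The idea of a logical resource with user-definable properties can be illustrated with a minimal sketch. It does not use the actual middleware or GridLucene classes: the `LogicalResource` type, the `discover` function, and the property names `kind` and `indexed` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LogicalResource:
    """Hypothetical stand-in: a named group of physical files plus properties."""
    name: str
    files: list[str]
    properties: dict[str, str] = field(default_factory=dict)


def discover(catalog, **wanted):
    """Return the resources whose properties match every requested key/value."""
    return [
        resource for resource in catalog
        if all(resource.properties.get(key) == value for key, value in wanted.items())
    ]


# Illustrative catalog; names and property values are made up.
catalog = [
    LogicalResource("docs-part-1", ["a.txt", "b.txt"], {"kind": "documents", "indexed": "no"}),
    LogicalResource("index-part-0", ["segments_1"], {"kind": "index"}),
    LogicalResource("docs-part-0", ["c.txt"], {"kind": "documents", "indexed": "yes"}),
]

# A node that starts an indexing job asks which documents still need indexing.
to_index = discover(catalog, kind="documents", indexed="no")

# A node that starts a retrieval job asks which indices exist.
indices = discover(catalog, kind="index")
```

Because a node queries by property rather than by host or path, it does not need to know in advance where the files physically reside.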

GridLucene is available for download under the same license as Lucene.

  • E. Meij and M. de Rijke, “Deploying Lucene on the Grid,” in Proceedings SIGIR 2006 Workshop on Open Source Information Retrieval (OSIR2006), 2006.
    Author = {Meij, E. and de Rijke, M.},
    Booktitle = {Proceedings SIGIR 2006 workshop on Open Source Information Retrieval (OSIR2006)},
    Title = {Deploying Lucene on the Grid},
    Year = {2006}}

Combining Thesauri-based Methods for Biomedical Retrieval

This paper describes our participation in the TREC 2005 Genomics track. We took part in the ad hoc retrieval task and aimed to integrate thesauri into the retrieval model. We developed three thesauri-based methods, two of which make use of the existing MeSH thesaurus. The first method uses blind relevance feedback on MeSH terms; the second uses an index of the MeSH thesaurus for query expansion. The third method makes use of a dynamically generated lookup list from which acronyms and synonyms can be inferred. We show that, although each method applied individually yields only relatively minor improvements in retrieval performance, a combination works best and delivers significant improvements over the baseline.
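As a rough illustration of the first method, blind relevance feedback on MeSH terms can be sketched as follows. This is a generic sketch under stated assumptions, not the procedure from the paper: the function `expand_with_mesh`, its parameters, the unweighted counting, and the toy ranking are all hypothetical.

```python
from collections import Counter


def expand_with_mesh(query_terms, ranked_docs, mesh_terms_by_doc, top_docs=10, top_terms=3):
    """Add the MeSH terms that occur most often in the top-ranked documents.

    The top-ranked documents are assumed to be relevant without any user
    judgement (hence "blind"). Each document counts a term at most once.
    """
    counts = Counter()
    for doc_id in ranked_docs[:top_docs]:
        counts.update(set(mesh_terms_by_doc.get(doc_id, [])))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    expansion = [term for term, _ in ranked if term not in query_terms][:top_terms]
    return list(query_terms) + expansion


# Toy data: an initial ranking and the MeSH terms assigned to each document.
ranked_docs = ["d1", "d2", "d3"]
mesh_terms_by_doc = {
    "d1": ["Neoplasms", "Genes, p53"],
    "d2": ["Genes, p53", "Apoptosis"],
    "d3": ["Genes, p53", "Neoplasms"],
}

expanded = expand_with_mesh(["p53", "tumor"], ranked_docs, mesh_terms_by_doc,
                            top_docs=2, top_terms=2)
```

In practice the expansion terms would usually be weighted rather than simply appended, and the expanded query would be run against the collection a second time.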

  • E. Meij, L. H. L. IJzereef, L. A. Azzopardi, J. Kamps, M. de Rijke, M. Voorhees, and L. P. Buckland, “Combining Thesauri-based Methods for Biomedical Retrieval,” in The Fourteenth Text REtrieval Conference (TREC 2005), 2006.
    Author = {Meij, E. and IJzereef, L.H.L. and Azzopardi, L.A. and Kamps, J. and de Rijke, M. and Voorhees, M. and Buckland, L.P.},
    Booktitle = {The Fourteenth Text REtrieval Conference},
    Series = {TREC 2005},
    Title = {Combining Thesauri-based Methods for Biomedical Retrieval},
    Year = {2006}}