We investigate whether and how open source retrieval engines can be deployed in a grid environment. Compared with conventional distributed IR, one of the most significant differences in a grid is the lack of a priori knowledge about which nodes are available. Moreover, it is not known in advance when a particular node will have the time and resources to start a submitted job. Conventional methods such as RMI are therefore not directly usable, and we propose a different approach that uses middleware designed specifically for grids. We describe GridLucene, an extension of the open source engine Lucene with grid-specific classes built on this middleware. We report on an initial comparison between GridLucene and Lucene and find a minor penalty (in terms of execution time) for grid-based indexing and a more serious penalty for grid-based retrieval.

The middleware we use can gather a set of physical resources into a single logical resource with abstract, user-definable properties. These properties can be used during indexing and retrieval to tell GridLucene which files it needs to access. Using this kind of semantic information, grid nodes can “discover” which indices exist on the grid and which particular documents need to be indexed.
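
The discovery idea can be illustrated with a minimal sketch. It is not GridLucene's code and does not use the actual middleware API: the `LogicalResource` and `Catalog` classes, the property names, and the paths are all hypothetical, and an in-memory list stands in for the grid's metadata service.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class LogicalResource:
    """Hypothetical model: a named group of physical files plus user-defined properties."""
    name: str
    physical_paths: list[str]
    properties: dict[str, str] = field(default_factory=dict)


class Catalog:
    """In-memory stand-in (an assumption) for the middleware's metadata catalog."""

    def __init__(self) -> None:
        self._resources: list[LogicalResource] = []

    def register(self, resource: LogicalResource) -> None:
        self._resources.append(resource)

    def find(self, **wanted: str) -> list[LogicalResource]:
        """Return resources whose properties match every requested key/value pair."""
        return [
            r for r in self._resources
            if all(r.properties.get(k) == v for k, v in wanted.items())
        ]


catalog = Catalog()
catalog.register(LogicalResource(
    name="collection-part-1",
    physical_paths=["node-a:/data/part1.xml"],  # made-up location
    properties={"type": "documents", "status": "unindexed"},
))
catalog.register(LogicalResource(
    name="index-part-0",
    physical_paths=["node-b:/index/part0"],  # made-up location
    properties={"type": "index", "collection": "demo"},
))

# An indexing job asks which documents still need work;
# a retrieval job asks which indices exist.
to_index = catalog.find(type="documents", status="unindexed")
indices = catalog.find(type="index")
```

Because a job queries by properties rather than by host name, it does not need to know in advance which nodes hold the data, which matches the lack of a priori knowledge about nodes described above.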

GridLucene is available for download under the same license as Lucene.

  • E. Meij and M. de Rijke, “Deploying Lucene on the Grid,” in Proceedings SIGIR 2006 workshop on Open Source Information Retrieval (OSIR2006), 2006.
    [BibTeX]
    @inproceedings{OSIR:2005:meij,
      Author = {Meij, E. and de Rijke, M.},
      Booktitle = {Proceedings SIGIR 2006 workshop on Open Source Information Retrieval (OSIR2006)},
      Date-Added = {2011-10-12 23:08:51 +0200},
      Date-Modified = {2011-10-12 23:08:51 +0200},
      Title = {Deploying Lucene on the Grid},
      Year = {2006}}