TREC

Expanding Queries Using Multiple Resources

We describe our participation in the TREC 2006 Genomics track, in which our main focus was on query expansion. We hypothesized that applying query expansion techniques would help us both to identify and retrieve synonymous terms and to cope with ambiguity. To this end, we developed several collection-specific as well as web-based strategies. We also performed post-submission experiments, in which we compared various retrieval engines, such as Lucene, Indri, and Lemur, using a simple baseline topic set. When indexing entire paragraphs as pseudo-documents, we find that Lemur achieves the highest document-, passage-, and aspect-level scores, using the KL-divergence method and its default settings. Additionally, we index the collection at a finer level of granularity, by creating pseudo-documents consisting of individual sentences. When we search these instead of paragraphs in Lucene, the passage-level scores improve considerably. Finally, we note that stemming improves overall scores by at least 10%.
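
To make the KL-divergence ranking mentioned above more concrete, the following is a minimal sketch of the rank-equivalent form of the method over paragraph pseudo-documents. Dirichlet smoothing, the value of `mu`, the whitespace tokenization, and all function and variable names are assumptions made for illustration; they are not taken from Lemur or from the paper's experimental setup.

```python
import math
from collections import Counter


def kl_score(query_terms, doc_terms, coll_counts, coll_len, mu=1000.0):
    """Rank-equivalent negative KL divergence: sum_w P(w|Q) * log P(w|D).

    P(w|D) is Dirichlet-smoothed with the collection model (an assumption).
    """
    q_counts = Counter(query_terms)
    d_counts = Counter(doc_terms)
    q_len = len(query_terms)
    d_len = len(doc_terms)
    score = 0.0
    for term, qf in q_counts.items():
        p_coll = coll_counts.get(term, 0) / coll_len
        if p_coll == 0.0:
            continue  # term never occurs in the collection; it cannot discriminate
        p_doc = (d_counts.get(term, 0) + mu * p_coll) / (d_len + mu)
        score += (qf / q_len) * math.log(p_doc)
    return score


# Toy collection: each paragraph is indexed as its own pseudo-document.
docs = {
    "p1": "the brca1 gene is linked to breast cancer".split(),
    "p2": "protein folding in yeast cells".split(),
}
coll_counts = Counter(t for tokens in docs.values() for t in tokens)
coll_len = sum(coll_counts.values())

query = ["brca1", "cancer"]
ranking = sorted(
    docs,
    key=lambda d: kl_score(query, docs[d], coll_counts, coll_len),
    reverse=True,
)
```

In this toy collection the paragraph that contains both query terms is ranked first. Indexing sentences instead of paragraphs would only change how `docs` is built, not the scoring function.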

  • [PDF] E. Meij, M. Jansen, and M. de Rijke, “Expanding queries using multiple resources (the AID group at TREC 2006: genomics track),” in The fifteenth text retrieval conference, 2007.
    [Bibtex]
    @inproceedings{TREC:2006:meij,
    Author = {Meij, E. and Jansen, M. and de Rijke, M.},
    Booktitle = {The Fifteenth Text REtrieval Conference},
    Date-Added = {2011-10-12 23:24:14 +0200},
    Date-Modified = {2012-10-30 09:23:12 +0000},
    Series = {TREC 2006},
    Title = {Expanding Queries Using Multiple Resources (The {AID} Group at {TREC} 2006: Genomics Track)},
    Year = {2007}}

Deploying Lucene on the Grid

We investigate whether and how open source retrieval engines can be deployed in a grid environment. When comparing grids to conventional distributed IR, the lack of a priori knowledge about available nodes is one of the most significant differences. On top of that, it is also unknown when a particular node will have time and resources available to start a submitted job. Therefore, conventional methods such as RMI are not directly usable and we propose a different approach, using middleware designed specifically for grids. We describe GridLucene, an extension of the open source engine Lucene with grid-specific classes, based on this middleware. We report on an initial comparison between GridLucene and Lucene, and find a minor penalty (in terms of execution time) for grid-based indexing and a more serious penalty for grid-based retrieval.

The middleware we use can gather a set of physical resources to form a single logical resource with some abstract properties. These user-definable properties can be used during indexing and retrieval to let GridLucene know which files it needs to access. By using this kind of semantic information, grid nodes can “discover” which indices exist on the grid and which particular documents need to be indexed.

GridLucene is available for download under the same license as Lucene.

  • [PDF] E. Meij and M. de Rijke, “Deploying Lucene on the grid,” in Proceedings SIGIR 2006 workshop on open source information retrieval (OSIR2006), 2006.
    [Bibtex]
    @inproceedings{OSIR:2005:meij,
    Author = {Meij, E. and de Rijke, M.},
    Booktitle = {Proceedings SIGIR 2006 workshop on Open Source Information Retrieval (OSIR2006)},
    Date-Added = {2011-10-12 23:08:51 +0200},
    Date-Modified = {2011-10-12 23:08:51 +0200},
    Title = {Deploying Lucene on the Grid},
    Year = {2006}}

Combining Thesauri-based Methods for Biomedical Retrieval

This paper describes our participation in the TREC 2005 Genomics track. We took part in the ad hoc retrieval task and aimed at integrating thesauri in the retrieval model. We developed three thesauri-based methods, two of which made use of the existing MeSH thesaurus. The first method uses blind relevance feedback on MeSH terms; the second uses an index of the MeSH thesaurus for query expansion. The third method makes use of a dynamically generated lookup list, from which acronyms and synonyms can be inferred. We show that, although the individually applied methods yield only relatively minor improvements in retrieval performance, a combination works best and is able to deliver significant improvements over the baseline.
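
As an illustration of blind relevance feedback on MeSH terms, the sketch below assumes that the top-ranked documents of an initial run are relevant and adds their most frequent MeSH headings to the query. The function name, the parameters `top_docs` and `top_terms`, and the toy data are hypothetical; the paper's actual weighting and selection scheme is not reproduced here.

```python
from collections import Counter


def expand_with_mesh_feedback(query_terms, ranked_doc_ids, doc_mesh_terms,
                              top_docs=10, top_terms=5):
    """Append the MeSH headings that occur most often in the top-ranked documents."""
    counts = Counter()
    for doc_id in ranked_doc_ids[:top_docs]:
        # Count each heading once per document; sorting keeps tie order deterministic.
        counts.update(sorted(set(doc_mesh_terms.get(doc_id, []))))
    expansion = [t for t, _ in counts.most_common() if t not in query_terms]
    return list(query_terms) + expansion[:top_terms]


# Hypothetical MeSH annotations for three retrieved documents.
doc_mesh = {
    "d1": ["Breast Neoplasms", "Genes, BRCA1"],
    "d2": ["Breast Neoplasms", "Mutation"],
    "d3": ["Yeasts"],
}
expanded = expand_with_mesh_feedback(
    ["brca1"], ["d1", "d2", "d3"], doc_mesh, top_docs=2, top_terms=1
)
```

Here only the two top-ranked documents are used as feedback, so the heading they share is the one added to the query.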

  • [PDF] E. Meij, L. H. L. IJzereef, L. A. Azzopardi, J. Kamps, M. de Rijke, M. Voorhees, and L. P. Buckland, “Combining thesauri-based methods for biomedical retrieval,” in The fourteenth text retrieval conference, 2006.
    [Bibtex]
    @inproceedings{TREC:2005:meij,
    Author = {Meij, E. and IJzereef, L.H.L. and Azzopardi, L.A. and Kamps, J. and de Rijke, M. and Voorhees, M. and Buckland, L.P.},
    Booktitle = {The Fourteenth Text REtrieval Conference},
    Date-Added = {2011-10-12 23:16:44 +0200},
    Date-Modified = {2012-10-30 09:23:12 +0000},
    Series = {TREC 2005},
    Title = {Combining Thesauri-based Methods for Biomedical Retrieval},
    Year = {2006}}

Van Case-Based Reasoning tot Information Retrieval; Case retrieval voor de helpdesk van een webhosting bedrijf

The helpdesk department of Hostnet, a web hosting company, receives 35 to 50 questions from its customers every day. Within the domain in which Hostnet operates, only a few off-the-shelf manuals exist, and this is particularly noticeable on the helpdesk. Currently, only a few possibilities for knowledge management and/or elicitation exist within the organization. Questions are answered and problems are solved mostly by relying on the expertise of the staff. Staff members therefore need to have up-to-date knowledge of a variety of possible questions, problem situations, and solutions. They also need to be creative and flexible when handling novel questions.

Hostnet uses a ticketing system to handle questions from its customers. One of the many advantages of using such a system is that all questions are stored, along with their corresponding answers. Hostnet has used the system for some time now and has thus collected a large amount of domain- and organization-specific knowledge. This kind of information is exactly the type on which the research area of case-based reasoning focuses. Case-based reasoning uses previously solved problems (cases) as a knowledge source to aid in solving similar cases in the future. One of the main components of any case-based reasoning system is the retrieval module. This module searches for similar cases, given a new case and a similarity measure. Techniques from the area of Information Retrieval may be used to assist in finding these similar questions, for example by implementing vector-space based, statistical methods.

This research analyzes to what extent previously solved cases can serve as a basis for a statistical information retrieval module of a case-based reasoning system within Hostnet, by measuring the effects of different information retrieval techniques on the results. The evaluated techniques are stemming, term weighting, and combinations thereof. The organizational setting described above is not unique to Hostnet. Every service-providing company with direct customer contact is probably familiar with the described situation and could benefit from the presented results.
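
A vector-space retrieval module of the kind described above could look roughly like the following sketch, which ranks stored questions against a new question by cosine similarity over tf-idf weights. The particular idf formula, the whitespace tokenization, the absence of stemming, and the example tickets are all assumptions for illustration and are not taken from the thesis.

```python
import math
from collections import Counter


def build_idf(token_lists):
    """Inverse document frequency per term (smoothed with +1, an assumption)."""
    n = len(token_lists)
    df = Counter(t for tokens in token_lists for t in set(tokens))
    return {t: math.log(n / df[t]) + 1.0 for t in df}


def tfidf_vector(tokens, idf):
    """Sparse tf-idf vector; terms unknown to the collection are ignored."""
    return {t: tf * idf[t] for t, tf in Counter(tokens).items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_cases(new_question, stored_questions):
    """Return indices of stored questions, most similar first."""
    token_lists = [q.lower().split() for q in stored_questions]
    idf = build_idf(token_lists)
    case_vectors = [tfidf_vector(tokens, idf) for tokens in token_lists]
    query_vector = tfidf_vector(new_question.lower().split(), idf)
    scores = [cosine(query_vector, v) for v in case_vectors]
    return sorted(range(len(stored_questions)), key=lambda i: scores[i], reverse=True)


# Hypothetical stored helpdesk questions.
stored = [
    "how do i reset my email password",
    "my website is down",
    "change dns records for my domain",
]
order = rank_cases("email password reset", stored)
```

Switching term weighting off would amount to replacing the tf-idf weights with raw term counts, and stemming would be applied to the tokens before the vectors are built.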

The suggested approach yields adequate results: at best, 60% of new questions can be answered based on the first 10 retrieved stored questions. The mean reciprocal rank of the first matching question, however, left room for improvement, with a value of 7 out of 10. The most important conclusion is that the best results are achieved when applying none of the aforementioned information retrieval techniques. The suggested approach needs to be improved before it can be successfully integrated into a case-based reasoning system, but it does seem viable.
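
For reference, the two kinds of measures reported above can be computed as in the sketch below: the fraction of new questions with at least one matching stored question in the top k, and the mean reciprocal rank of the first match. The function names and the toy runs are hypothetical, and the numbers they produce are unrelated to the results of the thesis.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """1 / rank of the first relevant item, or 0.0 if none is retrieved."""
    for rank, item_id in enumerate(ranked_ids, start=1):
        if item_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranking, relevant set) pairs."""
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def success_at_k(runs, k=10):
    """Fraction of queries with at least one relevant item in the top k."""
    hits = sum(1 for r, rel in runs if any(i in rel for i in r[:k]))
    return hits / len(runs)


# Three toy queries: first match at rank 2, at rank 1, and not retrieved at all.
runs = [
    (["a", "b", "c"], {"b"}),
    (["x", "y"], {"x"}),
    (["m", "n"], {"z"}),
]
mrr = mean_reciprocal_rank(runs)
s10 = success_at_k(runs, k=10)
```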

  • [PDF] E. Meij, “Van case-based reasoning tot information retrieval; case retrieval voor de helpdesk van een webhosting bedrijf,” Master Thesis, 2005.
    [Bibtex]
    @mastersthesis{2005:meij,
    Author = {Edgar Meij},
    Date-Added = {2011-10-12 21:53:59 +0200},
    Date-Modified = {2011-10-12 21:55:28 +0200},
    School = {University of Amsterdam},
    Title = {Van Case-Based Reasoning tot Information Retrieval; Case retrieval voor de helpdesk van een webhosting bedrijf.},
    Year = {2005}}