
Parsimonious Relevance Models

Relevance feedback is often applied to better capture a user’s information need. Automatically reformulating queries (or blind relevance feedback) entails looking at the terms in some set of (pseudo-)relevant documents and selecting the most informative ones with respect to the set or the collection. These terms may then be reweighted based on information pertinent to the query or the documents and, in a language modeling setting, be used to estimate a query model, P(t|θQ), i.e., a distribution over terms t for a given query Q.
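For background, such a query model is typically plugged into a KL-divergence ranking function; this standard usage (not spelled out in the abstract itself) can be written as:

    \[
      \mathrm{Score}(Q, D) \;=\; -\,\mathrm{KL}\bigl(\theta_Q \,\|\, \theta_D\bigr)
      \;\stackrel{\mathrm{rank}}{=}\; \sum_{t} P(t \mid \theta_Q)\, \log P(t \mid \theta_D),
    \]

where θD is a (smoothed) document model; plain query-likelihood retrieval is recovered when P(t|θQ) is simply the maximum-likelihood estimate over the query terms.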

Not all of the terms obtained using blind relevance feedback are equally informative given the query, even after reweighting. Some may be common terms, whilst others may describe the general domain of interest. We hypothesize that refining the results of blind relevance feedback using a technique called parsimonious language modeling will improve retrieval effectiveness. Hiemstra et al. already provide a mechanism for incorporating (parsimonious) blind relevance feedback, viewing it as a three-component mixture model of the document, the set of feedback documents, and the collection. Our approach is more straightforward: it considers each feedback document separately and hence does not require the additional mixture model parameter. To create parsimonious language models we use an EM algorithm to update the maximum-likelihood (ML) estimates. Zhai and Lafferty have proposed an approach that uses a similar EM algorithm; it differs, however, in the way the set of feedback documents is handled. Whereas we parsimonize each individual document, they apply their EM algorithm to the entire set of feedback documents.
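To make the parsimonization step concrete, here is a minimal Python sketch of the EM updates for a single document; the mixing weight lam and all names are illustrative, not taken from the paper:

    def parsimonize(doc_tf, collection_prob, lam=0.1, iters=50):
        """EM estimation of a parsimonious document model.

        doc_tf          : dict term -> term frequency in the document
        collection_prob : dict term -> P(t|C), the background model
        lam             : weight of the document-specific component
        """
        # Initialize with the maximum-likelihood estimate.
        total = sum(doc_tf.values())
        p_doc = {t: tf / total for t, tf in doc_tf.items()}
        for _ in range(iters):
            # E-step: expected document-specific occurrences of each term.
            e = {}
            for t, tf in doc_tf.items():
                doc_part = lam * p_doc[t]
                bg_part = (1 - lam) * collection_prob.get(t, 1e-12)
                e[t] = tf * doc_part / (doc_part + bg_part)
            # M-step: renormalize the expected counts.
            norm = sum(e.values())
            p_doc = {t: v / norm for t, v in e.items()}
        return p_doc

After convergence, the probability mass of common terms has largely been “explained away” by the background model, leaving a compact, document-specific term distribution (terms with near-zero probability can be pruned).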

To verify our hypothesis, we use a specific instance of blind relevance feedback, namely relevance modeling (RM). We choose this particular method because it has been shown to achieve state-of-the-art retrieval performance. Relevance modeling assumes that the query and the set of relevant documents are samples from an underlying term distribution, the relevance model. Lavrenko and Croft formulate two ways of estimating the parameters of this model. We build upon their work and compare the results of our proposed parsimonious relevance models with RMs as well as with a query-likelihood baseline. To measure the effects in different contexts, we employ five test collections taken from the TREC-7, TREC Robust, Genomics, Blog, and Enterprise tracks, and show that our proposed model improves performance in terms of mean average precision on all the topic sets over both a query-likelihood baseline and a run based on relevance models. Moreover, although blind relevance feedback is mainly a recall-enhancing technique, we observe that parsimonious relevance models (unlike their non-parsimonized counterparts) can also improve early precision and the reciprocal rank of the first relevant result. Thus, our parsimonious relevance models (i) improve retrieval effectiveness in terms of MAP on all collections, (ii) significantly outperform their non-parsimonious counterparts on most measures, and (iii) have a precision-enhancing effect, unlike other blind relevance feedback methods.
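For illustration, a minimal sketch of an RM1-style estimate built from the top-ranked feedback documents; in the parsimonious variant, each document model would simply be replaced by its parsimonized counterpart (all names are our own):

    def relevance_model(doc_models, query_likelihood):
        """RM1-style estimate: P(t|R) ~ sum over D of P(t|theta_D) * P(D|Q).

        doc_models       : dict doc_id -> {term: P(t|theta_D)} for the top-k docs
        query_likelihood : dict doc_id -> P(Q|D), may be unnormalized
        """
        # P(D|Q) is proportional to P(Q|D) under a uniform document prior.
        z = sum(query_likelihood.values())
        p_rel = {}
        for doc_id, term_probs in doc_models.items():
            w = query_likelihood[doc_id] / z
            for t, p in term_probs.items():
                p_rel[t] = p_rel.get(t, 0.0) + w * p
        return p_rel  # a distribution over terms: the relevance model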

  • [PDF] E. Meij, W. Weerkamp, K. Balog, and M. de Rijke, “Parsimonious relevance models,” in Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, 2008.
    [Bibtex]
    @inproceedings{SIGIR:2008:Meij-prm,
    Author = {Meij, Edgar and Weerkamp, Wouter and Balog, Krisztian and de Rijke, Maarten},
    Booktitle = {Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:47:44 +0000},
    Series = {SIGIR 2008},
    Title = {Parsimonious relevance models},
    Year = {2008},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1390334.1390520}}

Parsimonious concept modeling

In many collections, documents are annotated using concepts from a structured knowledge source such as an ontology or thesaurus. Examples include the news domain, where each news item is categorized according to the nature of the event that took place, and Wikipedia, with its per-article categories. These categorizing systems originally stem from the cataloging systems used in libraries, and conceptual search is commonly used in digital library environments at the front-end to support search and navigation. In this paper we want to employ the explicit knowledge used for annotation at the back-end, not just to improve retrieval performance, but also to generate high-quality term and concept suggestions. To do so, we use the dual document representation (concepts and terms) to create a generative language model for each concept, which bridges the gap between vocabulary terms and concepts. Related work has also used textual representations to represent concepts; there are, however, two important differences. First, we use statistical language modeling techniques to parametrize the concept models, by leveraging the dual representation of the documents. Second, we found that simple maximum-likelihood estimation assigns too much probability mass to terms and concepts which may not be relevant to each document. Thus we apply an EM algorithm to “parsimonize” the document models.
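As a sketch of the bridging idea: under the simplifying assumption of a uniform P(D|c), a concept’s language model can be estimated from the documents annotated with it, plugging in the parsimonized document models for P(t|θD); all names are illustrative:

    def concept_models(annotations, doc_models):
        """Estimate a textual model P(t|c) for each concept c.

        annotations : dict concept -> list of doc_ids annotated with it
        doc_models  : dict doc_id -> {term: P(t|theta_D)}, parsimonized
        """
        models = {}
        for concept, doc_ids in annotations.items():
            weight = 1.0 / len(doc_ids)  # uniform P(D|c)
            p = {}
            for doc_id in doc_ids:
                for t, prob in doc_models[doc_id].items():
                    p[t] = p.get(t, 0.0) + weight * prob
            models[concept] = p  # bridges vocabulary terms and the concept
        return models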

The research questions we address are twofold: (i) what are the results of applying our model, compared to a query-likelihood baseline as well as to a run based on relevance models, and (ii) what is the influence of parsimonization? To answer these questions, we use the TREC Genomics track test collections in conjunction with MEDLINE. MEDLINE contains over 16 million bibliographic records of publications from the life sciences domain, and each abstract therein has been manually indexed by trained curators, who use concepts from the MeSH (Medical Subject Headings) thesaurus. We show that our approach is able to achieve similar or better performance than relevance models, whilst at the same time providing high-quality concepts to facilitate navigation. Examples show that our parsimonious concept models generate terms that are more specific than those acquired through maximum-likelihood estimates.

  • [PDF] E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij, “Parsimonious concept modeling,” in Proceedings of the 31st annual international ACM SIGIR conference on research and development in information retrieval, 2008.
    [Bibtex]
    @inproceedings{SIGIR:2008:Meij-cm,
    Author = {Meij, Edgar and Trieschnigg, Dolf and de Rijke, Maarten and Kraaij, Wessel},
    Booktitle = {Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:46:38 +0000},
    Series = {SIGIR 2008},
    Title = {Parsimonious concept modeling},
    Year = {2008},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1390334.1390519}}

Bootstrapping Language Associated with Biomedical Entities

The TREC Genomics 2007 task included recognizing topic-specific entities in the returned passages. To address this task, we have designed and implemented a novel data-driven approach that combines information extraction with language modeling techniques. Instead of using an exhaustive list of all possible instances for an entity type, we look at the language usage around each entity type and use that as a classifier to determine whether or not a piece of text discusses such an entity type. We do so by comparing it with language models of the passages. For example, given the entity type “genes”, our approach can measure the gene-iness of a piece of text.
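A minimal sketch of the scoring idea: build a language model for the entity type and measure how well it explains a passage, here via cross-entropy against a smoothed passage model (the smoothing scheme and all names are our own):

    import math

    def typeness(passage_tf, type_model, background, mu=0.9):
        """Score how strongly a passage discusses an entity type,
        e.g. its gene-iness for the entity type genes.
        """
        total = sum(passage_tf.values())
        score = 0.0
        for t, p_type in type_model.items():
            # Jelinek-Mercer-style smoothing of the passage model.
            p_passage = (mu * passage_tf.get(t, 0) / total
                         + (1 - mu) * background.get(t, 1e-12))
            score += p_type * math.log(p_passage)
        return score  # higher = more characteristic of the entity type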

Our algorithm works as follows. Given an entity type, it first uses Hearst patterns to extract instances of the type. To extract more instances, we look for new contextual patterns around the instances and use them as input for a bootstrapping method, in which new instances and patterns are discovered iteratively. Afterwards, all discovered instances and patterns are used to find the sentences in the collection that are most characteristic of the requested entity type. A language model is then generated from these sentences and, at retrieval time, we use this model to rerank retrieved passages.
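A condensed sketch of the bootstrapping loop; real contextual patterns are more flexible than the literal sentence templates used here, and in practice both sets are scored and pruned each round to limit semantic drift:

    def bootstrap(sentences, seed_instances, rounds=5):
        """Iteratively grow instance and pattern sets for one entity type.

        sentences      : list of sentence strings from the collection
        seed_instances : instances extracted with Hearst patterns
        """
        instances, patterns = set(seed_instances), set()
        for _ in range(rounds):
            # Step 1: harvest contextual patterns around known instances.
            for s in sentences:
                for inst in instances:
                    if inst in s:
                        patterns.add(s.replace(inst, "<X>", 1))
            # Step 2: match the patterns elsewhere to find new instances.
            for s in sentences:
                for pat in patterns:
                    pre, _, post = pat.partition("<X>")
                    if (s.startswith(pre) and s.endswith(post)
                            and len(s) > len(pre) + len(post)):
                        instances.add(s[len(pre):len(s) - len(post)])
        return instances, patterns

Sentences that match many of the resulting instances and patterns then serve as the training material for the entity-type language model used in reranking.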

As to the results of our submitted runs, we find that our baseline run performs well above the median of all participants’ scores. Additionally, we find that our proposed method helps most for entity types with unambiguous patterns and numerous instances.

  • [PDF] E. Meij and S. Katrenko, “Bootstrapping language associated with biomedical entities,” in The Sixteenth Text REtrieval Conference, 2008.
    [Bibtex]
    @inproceedings{TREC:2008:meij,
    Author = {Meij, E. and Katrenko, S.},
    Booktitle = {The Sixteenth Text REtrieval Conference},
    Date-Added = {2011-10-16 10:24:41 +0200},
    Date-Modified = {2012-10-30 09:23:12 +0000},
    Series = {TREC 2007},
    Title = {Bootstrapping Language Associated with Biomedical Entities},
    Year = {2008}}

Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method

We address the issue of combining explicit background knowledge with pseudo-relevance feedback from within a document collection. To this end, we use document-level annotations in tandem with generative language models to generate terms from pseudo-relevant documents and to bias the probability estimates of expansion terms in a principled manner. By applying the knowledge inherent in document annotations, we aim to control query drift and reap the benefits of automatic query expansion in terms of recall without losing precision. We consider the parameters associated with our model and describe ways of estimating them automatically. We then evaluate our model and estimation methods on two test collections, both provided by the TREC Genomics track.
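One way to read the estimation idea, as a hedged sketch only (the paper’s model is more refined): expansion-term probabilities are generated through the annotations of the pseudo-relevant documents, here with a uniform P(c|D):

    def biased_expansion(query_likelihood, doc_concepts, concept_models):
        """Bias expansion-term estimates through document annotations.

        query_likelihood : dict doc_id -> P(Q|D) for pseudo-relevant docs
        doc_concepts     : dict doc_id -> list of annotated concepts
        concept_models   : dict concept -> {term: P(t|c)}
        """
        z = sum(query_likelihood.values())
        p_exp = {}
        for doc_id, ql in query_likelihood.items():
            concepts = doc_concepts.get(doc_id, [])
            if not concepts:
                continue
            w_doc = ql / z
            for c in concepts:
                w = w_doc / len(concepts)  # uniform P(c|D)
                for t, p in concept_models[c].items():
                    p_exp[t] = p_exp.get(t, 0.0) + w * p
        return p_exp  # expansion terms biased by the background knowledge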

  • [PDF] E. Meij and M. de Rijke, “Integrating conceptual knowledge into relevance models: a model and estimation method,” in Proceedings of the 1st international conference on theory of information retrieval, 2007.
    [Bibtex]
    @inproceedings{ICTIR:2007:meij,
    Author = {E. Meij and de Rijke, M.},
    Booktitle = {Proceedings of the 1st International Conference on Theory of Information Retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:50:30 +0000},
    Series = {ICTIR 2007},
    Title = {{Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method}},
    Year = {2007}}

Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments

We propose and evaluate a query expansion mechanism that supports searching and browsing in collections of annotated documents. Based on generative language models, our feedback mechanism uses document-level annotations to bias the generation of expansion terms and to generate browsing suggestions in the form of concepts selected from a controlled vocabulary (as typically used in digital library settings). We provide a detailed formalization of our feedback mechanism and evaluate its effectiveness using the TREC 2006 Genomics track test set. As to retrieval effectiveness, we find a 20% improvement in mean average precision over a query-likelihood baseline, whilst also increasing precision at 10. When we base the parameter estimation and feedback generation of our algorithm on a large corpus, we also find an improvement over state-of-the-art relevance models. The browsing suggestions are assessed along two dimensions: relevance and specificity. We present an account of per-topic results, which helps us understand for which types of queries our feedback mechanism is particularly helpful.
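The browsing-suggestion side admits a particularly simple sketch: rank controlled-vocabulary concepts by how strongly the pseudo-relevant documents vote for them (uniform per-document concept weights assumed; the paper gives the full formalization):

    def suggest_concepts(query_likelihood, doc_concepts, k=10):
        """Rank the concepts annotated on pseudo-relevant documents.

        query_likelihood : dict doc_id -> P(Q|D)
        doc_concepts     : dict doc_id -> list of concept annotations
        """
        z = sum(query_likelihood.values())
        scores = {}
        for doc_id, ql in query_likelihood.items():
            for c in doc_concepts.get(doc_id, []):
                scores[c] = scores.get(c, 0.0) + ql / z
        # The top-k concepts double as browsing suggestions.
        return sorted(scores, key=scores.get, reverse=True)[:k]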

  • [PDF] E. Meij and M. de Rijke, “Thesaurus-based feedback to support mixed search and browsing environments,” in Research and advanced technology for digital libraries, 11th European Conference, ECDL 2007, 2007.
    [Bibtex]
    @inproceedings{ECDL:2007:meij,
    Author = {Edgar Meij and Maarten de Rijke},
    Booktitle = {Research and Advanced Technology for Digital Libraries, 11th European Conference, ECDL 2007},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-28 23:04:22 +0000},
    Title = {Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments},
    Year = {2007}}

Using Prior Information Derived from Citations in Literature Search

Researchers spend a large amount of their time searching through an ever-increasing number of scientific articles. Although users of scientific literature search engines prefer results to be ranked according to the number of citations a publication has received, it is unknown whether this notion of authoritativeness can also benefit more traditional and objective measures. Is it also an indicator of relevance, given an information need? In this paper, we examine the relationship between the citation features of a scientific article and its prior probability of actually being relevant to an information need. We propose various ways of modeling this relationship and show how this kind of contextual information can be incorporated within a language modeling framework. We experiment with three document priors, which we evaluate on three distinct sets of queries and two document collections from the TREC Genomics track. Empirical results show that two of the proposed priors can significantly improve retrieval effectiveness, measured in terms of mean average precision.
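A hedged sketch of how such a prior enters the framework: document scores become log P(Q|D) + log P(D). The log-scaled citation count below is one plausible instantiation, not necessarily one of the three priors evaluated in the paper:

    import math

    def score_with_citation_prior(query_loglik, citations):
        """Combine query likelihood with a citation-based document prior:
        log P(D|Q) is rank-equivalent to log P(Q|D) + log P(D).
        """
        # One plausible prior: P(D) proportional to 1 + log(1 + citations).
        raw = {d: 1.0 + math.log1p(citations.get(d, 0)) for d in query_loglik}
        z = sum(raw.values())
        return {d: ll + math.log(raw[d] / z) for d, ll in query_loglik.items()}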

  • [PDF] E. Meij and M. de Rijke, “Using prior information derived from citations in literature search,” in RIAO 2007, 2007.
    [Bibtex]
    @inproceedings{RIAO:2007:Meij,
    Author = {Meij, E. and de Rijke, M.},
    Booktitle = {RIAO 2007},
    Date-Added = {2011-10-13 09:05:34 +0200},
    Date-Modified = {2012-10-30 08:49:59 +0000},
    Title = {Using Prior Information Derived from Citations in Literature Search},
    Year = {2007}}