
The University of Amsterdam at TREC 2010: Session, Entity, and Relevance Feedback

We describe the participation of the University of Amsterdam’s ILPS group in the session, entity, and relevance feedback tracks at TREC 2010. In the Session Track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity Track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models, and we explore the use of Freebase for type filtering, entity normalization, and homepage finding. In the ELC task we rank candidates by the number of links they share with the example entities. In the Relevance Feedback Track we experiment with a novel model that uses Wikipedia to expand the query language model.
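The ELC idea of ranking candidates by the links they share with example entities can be illustrated with a minimal sketch. The entity names, link sets, and the function `rank_by_shared_links` are hypothetical and chosen for illustration only; the scoring used in the actual track runs may differ (for instance, in how links are extracted or normalized).

```python
from typing import Dict, List, Set, Tuple


def rank_by_shared_links(
    candidates: Dict[str, Set[str]],
    examples: Dict[str, Set[str]],
) -> List[Tuple[str, int]]:
    """Rank candidates by how many links they share with the example entities."""
    scored = []
    for name, links in candidates.items():
        # Sum the link overlap with every example entity's link set.
        score = sum(len(links & example_links) for example_links in examples.values())
        scored.append((name, score))
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy data: each entity maps to the set of links on its page.
examples = {"ex1": {"a", "b", "c"}, "ex2": {"b", "c", "d"}}
candidates = {"cand1": {"b", "c"}, "cand2": {"d", "x"}, "cand3": {"y"}}
ranking = rank_by_shared_links(candidates, examples)
```

Summing the overlap across all examples favors candidates that resemble several examples at once; other aggregations (maximum, or a length-normalized overlap) would be equally plausible under these assumptions.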

  • [PDF] M. Bron, J. He, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “The University of Amsterdam at TREC 2010: session, entity, and relevance feedback,” in The nineteenth text retrieval conference, 2011.
    [Bibtex]
    @inproceedings{TREC:2011:bron,
    Abstract = {We describe the participation of the University of Amsterdam's ILPS group in the session, entity, and relevance feedback tracks at TREC 2010. In the Session Track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity Track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models and explore the use of Freebase for type filtering, entity normalization and homepage finding. In the ELC task we use an approach that uses the number of links shared between candidate and example entities to rank candidates. In the Relevance Feedback Track we experiment with a novel model that uses Wikipedia to expand the query language model.},
    Author = {Bron, M. and He, J. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Nineteenth Text REtrieval Conference},
    Date-Added = {2011-10-20 11:18:35 +0200},
    Date-Modified = {2012-10-30 09:25:06 +0000},
    Series = {TREC 2010},
    Title = {{The University of Amsterdam at TREC 2010}: Session, Entity, and Relevance Feedback},
    Year = {2011}}

Combining Concepts and Language Models for Information Access

Since the middle of the last century, information retrieval has attracted increasing interest. From its inception, much research has been devoted to finding optimal ways of representing both documents and queries, as well as to improving ways of matching one with the other. In cases where document annotations or explicit semantics are available, matching algorithms can be informed using the concept languages in which such semantics are usually defined. These algorithms are able to match queries and documents based on both textual and semantic evidence.

Recent advances have enabled the use of rich query representations in the form of query language models. This, in turn, allows us to account for the language associated with concepts within the retrieval model in a principled and transparent manner. Developments in the semantic web community, such as the Linked Open Data cloud, have enabled the association of texts with concepts on a large scale. Taken together, these developments facilitate a move beyond manually assigned concepts in domain-specific contexts into the general domain.

This thesis investigates how one can improve information access by employing the actual use of concepts as measured by the language that people use when they discuss them. The main contribution is a set of models and methods that enable users to retrieve and access information on a conceptual level. Through extensive evaluations, we systematically explore and thoroughly analyze the experimental results of the proposed models. Our empirical results show that a combination of top-down conceptual information and bottom-up statistical information obtains optimal performance on a variety of tasks and test collections.

See http://phdthes.is/ for more information.

  • [PDF] E. Meij, “Combining concepts and language models for information access,” PhD Thesis, 2010.
    [Bibtex]
    @phdthesis{2010:meij,
    Author = {Meij, Edgar},
    Date-Added = {2011-10-20 10:18:00 +0200},
    Date-Modified = {2011-10-22 12:23:33 +0200},
    School = {University of Amsterdam},
    Title = {Combining Concepts and Language Models for Information Access},
    Year = {2010}}

 


A query model based on normalized log-likelihood

A query is usually a brief, sometimes imprecise expression of an underlying information need. Examining how queries can be transformed to equivalent, potentially better queries is a theme of recurring interest to the information retrieval community. Such transformations include expansion of short queries to long queries, paraphrasing queries using an alternative vocabulary, mapping unstructured queries to structured ones, identifying key concepts in verbose queries, etc.

To inform the transformation process, multiple types of information sources have been considered. One recent source is search engine logs, which have been used for query substitutions. Another recent example is where users complement their traditional keyword query with additional information, such as example documents, tags, images, categories, or their search history. The ultimate source of information for transforming a query, however, is the user, through relevance feedback: given a query and a set of judged documents for that query, how does a system take advantage of the judgments in order to transform the original query and retrieve more documents that will be useful to the user? As demonstrated by the recent launch of a dedicated relevance feedback track at TREC, we still lack a definitive answer to this question.

Let’s consider an example to see what aspects play a role in transforming a query based on judgments for a set of initially retrieved documents. Suppose we have a set of documents which are judged to be relevant to a query. These documents may vary in length and, furthermore, they need not be completely on topic because they may discuss more topics than the ones that are relevant to the query. While the users’ judgments are at the document level, not all of the documents’ sections can be assumed to be equally relevant. Most relevance feedback models that are currently available do not model or capture this phenomenon; instead, they attempt to transform the original query based on the full content of the documents. Clearly this is not ideal and we would like to account for the possibly multi-faceted character of documents. We hypothesize that a relevance feedback model that attempts to capture the topical structure of individual judged documents (“For each judged document, what is important about it?”) as well as of the set of all judged documents (“Which topics are shared by the entire set of judged documents?”) will outperform relevance feedback models that capture only one of these types of information.

We work in a language modeling (LM) setting, and our aim in this paper is to present an LM-based relevance feedback model that uses both types of information (the topical relevance of each document and the general topic of the set of relevant documents) to transform the original query. The proposed model uses the whole set of relevance assessments to determine how much each document that has been judged relevant should contribute to the query transformation. We use the TREC relevance feedback track test collection to evaluate our model and compare it to other, established relevance feedback methods. We answer the following two research questions in this paper. (i) Can we develop a relevance feedback model that uses evidence from both the individual relevant documents and the set of relevant documents as a whole? (ii) Can our new model achieve state-of-the-art results and how do these results compare to related models? Our evaluation shows that the model outperforms all other evaluated models and significantly improves over state-of-the-art feedback methods.
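To make the setting concrete, the following is a minimal sketch of a generic LM-style feedback step: the original query model is interpolated with a model estimated from the judged relevant documents. This is a simple baseline shown for orientation, not the normalized log-likelihood model of the paper. In particular, the uniform document weights, the parameter `lam`, and the function names are assumptions; the paper's model instead derives how much each judged document contributes from the whole set of assessments.

```python
from collections import Counter
from typing import Dict, List


def mle(tokens: List[str]) -> Dict[str, float]:
    """Maximum-likelihood unigram model of a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def feedback_query_model(
    query: List[str],
    relevant_docs: List[List[str]],
    lam: float = 0.5,
) -> Dict[str, float]:
    """Interpolate the query model with a uniform mixture of document models."""
    p_query = mle(query)
    p_feedback: Dict[str, float] = {}
    for doc in relevant_docs:
        # Assumption: every judged relevant document contributes equally.
        for term, prob in mle(doc).items():
            p_feedback[term] = p_feedback.get(term, 0.0) + prob / len(relevant_docs)
    terms = set(p_query) | set(p_feedback)
    return {
        t: lam * p_query.get(t, 0.0) + (1 - lam) * p_feedback.get(t, 0.0)
        for t in terms
    }


# Hypothetical toy query and judged relevant documents (already tokenized).
query = ["trec", "feedback"]
docs = [
    ["feedback", "query", "model"],
    ["feedback", "feedback", "judgments", "trec"],
]
model = feedback_query_model(query, docs, lam=0.5)
```

Replacing the uniform `1 / len(relevant_docs)` weight with a per-document weight is the natural place where a model that accounts for each document's topical fit with the whole set would differ from this baseline.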

  • [PDF] E. Meij, W. Weerkamp, and M. de Rijke, “A query model based on normalized log-likelihood,” in Proceedings of the 18th ACM conference on information and knowledge management, 2009.
    [Bibtex]
    @inproceedings{CIKM:2009:Meij,
    Author = {Meij, Edgar and Weerkamp, Wouter and de Rijke, Maarten},
    Booktitle = {Proceedings of the 18th ACM conference on Information and knowledge management},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:42:51 +0000},
    Series = {CIKM 2009},
    Title = {A query model based on normalized log-likelihood},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1645953.1646261}}

Incorporating Non-Relevance Information in the Estimation of Query Models

We describe the participation of the University of Amsterdam’s ILPS group in the relevance feedback track at TREC 2008. We introduce a new model which incorporates information from relevant and non-relevant documents to improve the estimation of query models. Our main findings are twofold: (i) in terms of statMAP, a larger number of judged non-relevant documents improves retrieval effectiveness and (ii) on the TREC Terabyte topics, we can effectively replace the estimates on the judged non-relevant documents with estimations on the document collection.

  • [PDF] E. Meij, W. Weerkamp, J. He, and M. de Rijke, “Incorporating non-relevance information in the estimation of query models,” in The seventeenth text retrieval conference, 2009.
    [Bibtex]
    @inproceedings{TREC:2009:meij,
    Abstract = {We describe the participation of the University of Amsterdam's ILPS group in the relevance feedback track at TREC 2008. We introduce a new model which incorporates information from relevant and non-relevant documents to improve the estimation of query models. Our main findings are twofold: (i) in terms of statMAP, a larger number of judged non-relevant documents improves retrieval effectiveness and (ii) on the TREC Terabyte topics, we can effectively replace the estimates on the judged non-relevant documents with estimations on the document collection.},
    Author = {Meij, E. and Weerkamp, W. and He, J. and de Rijke, M.},
    Booktitle = {The Seventeenth Text REtrieval Conference},
    Date-Added = {2011-10-16 16:03:56 +0200},
    Date-Modified = {2012-10-30 09:23:32 +0000},
    Series = {TREC 2008},
    Title = {Incorporating Non-Relevance Information in the Estimation of Query Models},
    Year = {2009}}