Supervised query modeling using Wikipedia

In a web retrieval setting, there is a clear need for precision enhancing methods. For example, the query “the secret garden” (a novel that has been adapted into movies and musicals) is a query that is easily led astray because of the generality of the individual query terms. While some methods address this issue at the document level, e.g., by using anchor texts or some function of the web graph, we are interested in improving the query; a prime example of such an approach is leveraging phrasal or proximity information. Besides degrading the user experience, another significant downside of a lack of precision is its negative impact on the effectiveness of pseudo relevance feedback methods. An example of this phenomenon can be observed for a query such as “indexed annuity” where the richness of the financial domain plus the broad commercial use of the web introduces unrelated terms. To address these issues, we propose a semantically informed manner of representing queries that uses supervised machine learning on Wikipedia. We train an SVM that automatically links queries to Wikipedia articles which are subsequently used to update the query model.

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model. We, however, are interested in selecting those Wikipedia articles which best describe the query and use those to sample terms from. This is similar to the unsupervised manner used, e.g., in the context of retrieving blogs. Such approaches are completely unsupervised in that they only consider a fixed number of pseudo relevant Wikipedia articles. As we show, focusing this set using machine learning improves overall retrieval performance. In particular, we apply supervised machine learning to automatically link queries to Wikipedia articles and sample terms from the linked articles to re-estimate the query model. On a recent large web corpus, we observe substantial gains in terms of both traditional metrics and diversity measures.

  • [PDF] E. Meij and M. de Rijke, “Supervised query modeling using Wikipedia,” in Proceedings of the 33rd international acm sigir conference on research and development in information retrieval, 2010.
    Author = {Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Date-Added = {2012-05-03 22:16:10 +0200},
    Date-Modified = {2012-10-30 08:40:21 +0000},
    Series = {SIGIR 2010},
    Title = {Supervised query modeling using {Wikipedia}},
    Year = {2010},
    Bdsk-Url-1 = {}}
Traditional Library Card Catalog

Conceptual language models for domain-specific retrieval

Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model.

Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms.

Our contributions are threefold. First, an extensive study is conducted on how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.

  • [PDF] [DOI] E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij, “Conceptual language models for domain-specific retrieval,” Inf. process. manage., vol. 46, iss. 4, p. 448–469, 2010.
    Address = {Tarrytown, NY, USA},
    Author = {Meij, Edgar and Trieschnigg, Dolf and de Rijke, Maarten and Kraaij, Wessel},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2011-10-12 18:31:55 +0200},
    Doi = {},
    Issn = {0306-4573},
    Journal = {Inf. Process. Manage.},
    Number = {4},
    Pages = {448--469},
    Publisher = {Pergamon Press, Inc.},
    Title = {Conceptual language models for domain-specific retrieval},
    Volume = {46},
    Year = {2010},
    Bdsk-Url-1 = {}}