Dynamic query modeling for related content finding

While watching television, people increasingly consume additional content related to what they are watching. We consider the task of finding video content related to a live television broadcast for which we leverage the textual stream of subtitles associated with the broadcast. We model this task as a Markov decision process and propose a method that uses reinforcement learning to directly optimize the retrieval effectiveness of queries generated from the stream of subtitles. Our dynamic query modeling approach significantly outperforms state-of-the-art baselines for stationary query modeling and for text-based retrieval in a television setting. In particular we find that carefully weighting terms and decaying these weights based on recency significantly improves effectiveness. Moreover, our method is highly efficient and can be used in a live television setting, i.e., in near real time.

  • [PDF] D. Odijk, E. Meij, I. Sijaranamual, and M. de Rijke, “Dynamic query modeling for related content finding,” in SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval, 2015.
    [Bibtex]
    @inproceedings{SIGIR:2015:Odijk,
    Author = {Odijk, Daan and Meij, Edgar and Sijaranamual, Isaac and de Rijke, Maarten},
    Booktitle = {{SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval}},
    Date-Added = {2015-08-06 13:14:13 +0000},
    Date-Modified = {2015-08-06 13:39:24 +0000},
    Month = {August},
    Publisher = {ACM},
    Title = {Dynamic query modeling for related content finding},
    Year = {2015}}
TREC

The University of Amsterdam at the TREC 2011 Session Track

We describe the participation of the University of Amsterdam’s ILPS group in the Session track at TREC 2011.

The stream of interactions created by a user engaging with a search system contains a wealth of information. For retrieval purposes, previous interactions can help inform us about a user’s current information need. Building on this intuition, our contribution to this TREC year’s session track focuses on session modeling and learning to rank using session information. In this paper, we present and compare three complementary strategies that we designed for improving retrieval for a current query using previous queries and clicked results: probabilistic session modeling, semantic query modeling, and implicit feedback.

In our experiments we examined three complementary strategies for improving retrieval for a current query. Our first strategy, based on probabilistic session modeling, was the best performing strategy.

Our second strategy, based on semantic query modeling, did less well than we expected, likely due to topic drift from excessively aggressive query expansion. We expect that performance of this strategy would improve by limiting the number of terms and/or improving the probability estimates.

With respect to our third strategy, based on learning from feedback, we found that learning weights for linear weighted combinations of features from an external collection can be beneficial, if characteristics of the collection are similar to the current data. Feedback available in the form of user clicks appeared to be less beneficial. Our run learning from implicit feedback did perform substantially lower than a run where weights were learned from an external collection with explicit feedback using the same learning algorithm and set of features.

  • [PDF] B. Huurnink, R. Berendsen, K. Hofmann, E. Meij, and M. de Rijke, “The University of Amsterdam at the TREC 2011 session track,” in The twentieth text retrieval conference, 2012.
    [Bibtex]
    @inproceedings{TREC:2011:huurnink,
    Author = {Huurnink, Bouke and Berendsen, Richard and Hofmann, Katja and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {The Twentieth Text REtrieval Conference},
    Date-Added = {2011-10-22 12:22:18 +0200},
    Date-Modified = {2013-05-22 11:44:53 +0000},
    Month = {January},
    Series = {TREC 2011},
    Title = {The {University of Amsterdam} at the {TREC} 2011 Session Track},
    Year = {2012}}
P30 difference plot

Team COMMIT at TREC 2011

We describe the participation of Team COMMIT in this year’s Microblog and Entity track.

In our participation in the Microblog track, we used a feature-based approach. Specifically, we pursued a precision oriented recency-aware retrieval approach for tweets. Amongst others we used various types of external data. In particular, we examined the potential of link retrieval on a corpus of crawled content pages and we use semantic query expansion using Wikipedia. We also deployed pre-filtering based on query-dependent and query-independent features. For the Microblog track we found that a simple cut-off based on the z-score is not sufficient: for differently distributed scores, this can decrease recall. A well set cut-off parameter can however significantly increase precision, especially if there are few highly relevant tweets. Filtering based on query-independent filtering does not help for already small result list. With a high occurrence of links in relevant tweets, we found that using link retrieval helps improving precision and recall for highly relevant and relevant tweets. Future work should focus on a score-distribution dependent selection criterion.

In this years Entity track participation we focused on the Entity List Completion (ELC) task. We experimented with a text based and link based approach to retrieve entities in Linked Data (LD). Additionally we experimented with selecting candidate entities from a web corpus. Our intuition is that entities occurring on pages with many of the example entities are more likely to be good candidates than entities that do not. For the Entity track there are no analyses or conclusions to report yet; at the time of writing no evaluation results are available for the Entity track.

  • [PDF] M. Bron, E. Meij, M. Peetz, M. Tsagkias, and M. de Rijke, “Team COMMIT at TREC 2011,” in The twentieth text retrieval conference, 2012.
    [Bibtex]
    @inproceedings{TREC:2011:commit,
    Author = {Bron, Marc and Meij, Edgar and Peetz, Maria-Hendrike and Tsagkias, Manos and de Rijke, Maarten},
    Booktitle = {The Twentieth Text REtrieval Conference},
    Date-Added = {2011-10-22 12:22:19 +0200},
    Date-Modified = {2012-10-30 09:26:12 +0000},
    Series = {TREC 2011},
    Title = {Team {COMMIT} at {TREC 2011}},
    Year = {2012}}
Plot of a query-specific burst

Adaptive Temporal Query Modeling

We present an approach to query modeling that uses the temporal distribution of documents in an initially retrieved set of documents. Such distributions tend to exhibit bursts, especially in news related document collections. We hypothesize that documents in those bursts are more likely to be relevant than others. Predicated on this, we expand queries with the most distinguishing terms in high quality documents sampled from bursts. We show how the most commonly used decay function for recent document retrieval can be used as probabilistic model for temporal retrieval in general. The effectiveness of our models is demonstrated on both news collections and a collection of blog posts.

  • [PDF] M. Peetz, E. Meij, M. de Rijke, and W. Weerkamp, “Adaptive temporal query modeling,” in Advances in information retrieval – 34th european conference on ir research, ecir 2012, 2012.
    [Bibtex]
    @inproceedings{ECIR:2012:peetz,
    Author = {Peetz, Maria-Hendrike and Meij, Edgar and de Rijke, Maarten and Weerkamp, Wouter},
    Booktitle = {Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012},
    Date-Added = {2011-11-23 18:10:40 +0100},
    Date-Modified = {2012-10-28 23:01:12 +0000},
    Title = {Adaptive Temporal Query Modeling},
    Year = {2012}}
TREC

The University of Amsterdam at Trec 2010: Session, Entity, and Relevance Feedback

We describe the participation of the University of Amsterdam’s ILPS group in the session, entity, and relevance feedback track at TREC 2010. In the Session Track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity Track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models and explore the use of Freebase for type filtering, entity normalization and homepage finding. In the ELC task we use an approach that uses the number of links shared between candidate and example entities to rank candidates. In the Relevance Feedback Track we experiment with a novel model that uses Wikipedia to expand the query language model.

  • [PDF] M. Bron, J. He, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “The University of Amsterdam at TREC 2010: session, entity, and relevance feedback,” in The nineteenth text retrieval conference, 2011.
    [Bibtex]
    @inproceedings{TREC:2011:bron,
    Abstract = {We describe the participation of the University of Amsterdam's Intelligent Systems Lab in the web track at TREC 2009. We participated in the adhoc and diversity task. We find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost the retrieval performance, which is assumed to potentially reduce spam in the top ranked results. As for the diversity task, we explored different methods. Clustering and a topic model-based approach have a similar performance and both are relatively better than a query log based approach.},
    Author = {M. Bron and He, J. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Nineteenth Text REtrieval Conference},
    Date-Added = {2011-10-20 11:18:35 +0200},
    Date-Modified = {2012-10-30 09:25:06 +0000},
    Series = {TREC 2010},
    Title = {{The University of Amsterdam at TREC 2010}: Session, Entity, and Relevance Feedback},
    Year = {2011}}
thesis cover image of a smart computer

Combining Concepts and Language Models for Information Access

Since the middle of last century, information retrieval has gained an increasing interest. Since its inception, much research has been devoted to finding optimal ways of representing both documents and queries, as well as improving ways of matching one with the other. In cases where document annotations or explicit semantics are available, matching algorithms can be informed using the concept languages in which such semantics are usually defined. These algorithms are able to match queries and documents based on textual and semantic evidence.

Recent advances have enabled the use of rich query representations in the form of query language models. This, in turn, allows us to account for the language associated with concepts within the retrieval model in a principled and transparent manner. Developments in the semantic web community, such as the Linked Open Data cloud, have enabled the association of texts with concepts on a large scale. Taken together, these developments facilitate a move beyond manually assigned concepts in domain-specific contexts into the general domain.

This thesis investigates how one can improve information access by employing the actual use of concepts as measured by the language that people use when they discuss them. The main contribution is a set of models and methods that enable users to retrieve and access information on a conceptual level. Through extensive evaluations, a systematic exploration and thorough analysis of the experimental results of the proposed models is performed. Our empirical results show that a combination of top-down conceptual information and bottom-up statistical information obtains optimal performance on a variety of tasks and test collections.

See http://phdthes.is/ for more information.

  • [PDF] E. Meij, “Combining concepts and language models for information access,” PhD Thesis, 2010.
    [Bibtex]
    @phdthesis{2010:meij,
    Author = {Meij, Edgar},
    Date-Added = {2011-10-20 10:18:00 +0200},
    Date-Modified = {2011-10-22 12:23:33 +0200},
    School = {University of Amsterdam},
    Title = {Combining Concepts and Language Models for Information Access},
    Year = {2010}}

 

Wikipedia

Supervised query modeling using Wikipedia

In a web retrieval setting, there is a clear need for precision enhancing methods. For example, the query “the secret garden” (a novel that has been adapted into movies and musicals) is a query that is easily led astray because of the generality of the individual query terms. While some methods address this issue at the document level, e.g., by using anchor texts or some function of the web graph, we are interested in improving the query; a prime example of such an approach is leveraging phrasal or proximity information. Besides degrading the user experience, another significant downside of a lack of precision is its negative impact on the effectiveness of pseudo relevance feedback methods. An example of this phenomenon can be observed for a query such as “indexed annuity” where the richness of the financial domain plus the broad commercial use of the web introduces unrelated terms. To address these issues, we propose a semantically informed manner of representing queries that uses supervised machine learning on Wikipedia. We train an SVM that automatically links queries to Wikipedia articles which are subsequently used to update the query model.

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model. We, however, are interested in selecting those Wikipedia articles which best describe the query and use those to sample terms from. This is similar to the unsupervised manner used, e.g., in the context of retrieving blogs. Such approaches are completely unsupervised in that they only consider a fixed number of pseudo relevant Wikipedia articles. As we show, focusing this set using machine learning improves overall retrieval performance. In particular, we apply supervised machine learning to automatically link queries to Wikipedia articles and sample terms from the linked articles to re-estimate the query model. On a recent large web corpus, we observe substantial gains in terms of both traditional metrics and diversity measures.

  • [PDF] E. Meij and M. de Rijke, “Supervised query modeling using Wikipedia,” in Proceedings of the 33rd international acm sigir conference on research and development in information retrieval, 2010.
    [Bibtex]
    @inproceedings{SIGIR:2010:meij,
    Author = {Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Date-Added = {2012-05-03 22:16:10 +0200},
    Date-Modified = {2012-10-30 08:40:21 +0000},
    Series = {SIGIR 2010},
    Title = {Supervised query modeling using {Wikipedia}},
    Year = {2010},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1835449.1835660}}
Traditional Library Card Catalog

Conceptual language models for domain-specific retrieval

Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model.

Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms.

Our contributions are threefold. First, an extensive study is conducted on how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.

  • [PDF] [DOI] E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij, “Conceptual language models for domain-specific retrieval,” Inf. process. manage., vol. 46, iss. 4, pp. 448-469, 2010.
    [Bibtex]
    @article{IPM:2010:Meij,
    Address = {Tarrytown, NY, USA},
    Author = {Meij, Edgar and Trieschnigg, Dolf and de Rijke, Maarten and Kraaij, Wessel},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2011-10-12 18:31:55 +0200},
    Doi = {http://dx.doi.org/10.1016/j.ipm.2009.09.005},
    Issn = {0306-4573},
    Journal = {Inf. Process. Manage.},
    Number = {4},
    Pages = {448--469},
    Publisher = {Pergamon Press, Inc.},
    Title = {Conceptual language models for domain-specific retrieval},
    Volume = {46},
    Year = {2010},
    Bdsk-Url-1 = {http://dx.doi.org/10.1016/j.ipm.2009.09.005}}