WSDM 2017

Utilizing Knowledge Bases in Text-centric Information Retrieval (WSDM 2017)

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs makes them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications, making this an ideal time to pause and report current findings to the community, summarizing successful approaches, and soliciting new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners.

CIKM 2016

Document Filtering for Long-tail Entities

Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on, and are trained on, features that are specific to each individual entity. Moreover, these approaches tend to use so-called extrinsic information, such as Wikipedia page views and related entities, which is typically available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities.

Utilizing Knowledge Bases in Text-centric Information Retrieval (ICTIR 2016)

General-purpose knowledge bases keep growing in terms of depth (content) and width (coverage). Moreover, algorithms for entity linking and entity retrieval have improved tremendously in the past years. Together, these developments give rise to a new line of research that exploits and combines them for the purposes of text-centric information retrieval applications. This tutorial focuses on a) how to retrieve a set of entities for an ad-hoc query, or more broadly, how to assess the relevance of KB elements for the information need, b) how to annotate text with such elements, and c) how to use this information to assess the relevance of text. We discuss different kinds of information available in a knowledge graph and how to leverage each most effectively.


Dynamic Collective Entity Representations for Entity Ranking

Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity’s description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that are associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time and learn the best representation as time evolves so as to capture the dynamics of how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.
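To make the idea of a collective entity representation a bit more concrete, here is a minimal sketch in Python. It is not the method from the paper: the term-overlap scoring function, the source names, and the fixed weights are all assumptions made for illustration, whereas the paper learns the source weights with a ranker. It only shows the general shape: one text field per description source, a weight per source, and the ability to append new descriptions as they arrive.

```python
from collections import Counter


def field_score(query_terms, text):
    """Toy matching function (an assumption): the fraction of the
    field's tokens that are query terms."""
    tokens = text.lower().split()
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    return sum(counts[t] for t in set(query_terms)) / len(tokens)


def entity_score(query, fields, weights):
    """Weighted sum of per-source scores. `fields` maps a source name
    to its text; `weights` maps a source name to a weight. Here the
    weights are hand-set placeholders, not learned values."""
    query_terms = query.lower().split()
    return sum(
        weights.get(source, 0.0) * field_score(query_terms, text)
        for source, text in fields.items()
    )


def add_description(fields, source, text):
    """Append a newly observed description to the field of its source."""
    fields[source] = (fields.get(source, "") + " " + text).strip()


# Hypothetical entity with a single knowledge base description.
fields = {"kb": "international treaty on emissions"}
weights = {"kb": 1.0, "queries": 2.0}

before = entity_score("climate agreement", fields, weights)
add_description(fields, "queries", "climate agreement")
after = entity_score("climate agreement", fields, weights)
```

In this toy example the query shares no terms with the knowledge base description, so the entity only starts to match once a description from another source (here a hypothetical "queries" source) has been added.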

  • D. Graus, M. Tsagkias, W. Weerkamp, E. Meij, and M. de Rijke, “Dynamic Collective Entity Representations for Entity Ranking,” in Proceedings of the Ninth ACM International Conference on Web Search and Data Mining (WSDM 2016), 2016.

Dynamic query modeling for related content finding

While watching television, people increasingly consume additional content related to what they are watching. We consider the task of finding video content related to a live television broadcast for which we leverage the textual stream of subtitles associated with the broadcast. We model this task as a Markov decision process and propose a method that uses reinforcement learning to directly optimize the retrieval effectiveness of queries generated from the stream of subtitles. Our dynamic query modeling approach significantly outperforms state-of-the-art baselines for stationary query modeling and for text-based retrieval in a television setting. In particular we find that carefully weighting terms and decaying these weights based on recency significantly improves effectiveness. Moreover, our method is highly efficient and can be used in a live television setting, i.e., in near real time.
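As a rough illustration of the recency-based weighting mentioned above, the following Python sketch keeps a weight per term and decays all weights whenever a new subtitle line comes in. This is a simplification under stated assumptions: the function names, the whitespace tokenization, and the fixed decay factor are hypothetical, and the paper optimizes its query model with reinforcement learning rather than with a hand-set decay.

```python
def update_weights(weights, subtitle_line, decay=0.9):
    """Decay every existing term weight, then add 1.0 for each term
    in the newest subtitle line. `decay` is a hand-picked value."""
    updated = {term: w * decay for term, w in weights.items()}
    for term in subtitle_line.lower().split():
        updated[term] = updated.get(term, 0.0) + 1.0
    return updated


def top_terms(weights, k=3):
    """Return the k highest-weighted terms as the current query,
    breaking ties alphabetically."""
    ranked = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Two hypothetical subtitle lines arriving one after the other.
weights = {}
weights = update_weights(weights, "election results", decay=0.5)
weights = update_weights(weights, "election debate", decay=0.5)
query = top_terms(weights, k=2)
```

Terms that keep recurring in the stream stay at the top of the query, while terms that were mentioned once fade out as newer subtitles arrive.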

  • D. Odijk, E. Meij, I. Sijaranamual, and M. de Rijke, “Dynamic query modeling for related content finding,” in SIGIR 2015: 38th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM, August 2015.

Linking queries to entities

I’m happy to announce we’re releasing a new test collection for entity linking for web queries (within user sessions) to Wikipedia. About half of the queries in this dataset are sampled from Yahoo search logs; the other half comes from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.

The key properties of the dataset are as follows.

  • Queries are taken from Yahoo US Web Search and from the TREC Session track (2010–2013).
  • There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
  • The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
  • The annotators have identified the “main” entity or entities for each query, if available.
  • The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous. They also flagged queries that mention an out-of-Wikipedia entity, i.e., an entity for which no suitable Wikipedia article exists.
  • The file includes session information: each session consists of an anonymized id, initial query, as well as all the queries issued within the same session and their relative date/timestamp if available.
  • Sessions are demarcated using a 30-minute time-out.
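
For readers who want to apply the same kind of session demarcation to their own query logs, here is a minimal Python sketch of a 30-minute time-out. It assumes that the time-out is measured between consecutive queries of the same user and that a gap of exactly 30 minutes does not start a new session; the README excerpt does not spell out either detail. The function name and the input layout are hypothetical and unrelated to the file format of the dataset.

```python
from datetime import datetime, timedelta


def split_sessions(events, timeout=timedelta(minutes=30)):
    """Group one user's queries into sessions.

    `events` is a list of (timestamp, query) pairs sorted by time.
    A new session starts when the gap to the previous query is
    longer than `timeout` (an assumption about how the gap is measured).
    """
    sessions = []
    last_time = None
    for timestamp, query in events:
        if last_time is None or timestamp - last_time > timeout:
            sessions.append([])
        sessions[-1].append(query)
        last_time = timestamp
    return sessions


# Hypothetical log: the third query comes 40 minutes after the second.
events = [
    (datetime(2014, 1, 1, 10, 0), "wsdm 2014"),
    (datetime(2014, 1, 1, 10, 10), "wsdm 2014 tutorials"),
    (datetime(2014, 1, 1, 10, 50), "entity linking"),
]
sessions = split_sessions(events)
```

With these timestamps the first two queries end up in one session and the third query starts a new one.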

Entity Linking and Retrieval for Semantic Search (WSDM 2014)

This morning, we presented the last edition of our tutorial series on Entity Linking and Retrieval, entitled “Entity Linking and Retrieval for Semantic Search” (with Krisztian Balog and Daan Odijk) at WSDM 2014! This final edition of the series builds upon our earlier tutorials at WWW 2013 and SIGIR 2013. The focus of this edition is on the practical applications of Entity Linking and Retrieval, in particular for semantic search: more and more search engine users expect direct answers to their information needs (rather than just documents). Semantic search and its recent applications are enabling search engines to organize their wealth of information around entities. Entity linking and retrieval is at the basis of these developments, providing the building blocks for organizing the web of entities.

This tutorial aims to cover all facets of semantic search from a unified point of view and connect real-world applications with results from scientific publications. We provide a comprehensive overview of entity linking and retrieval in the context of semantic search and thoroughly explore techniques for query understanding, entity-based retrieval and ranking on unstructured text, structured knowledge repositories, and a mixture of these. We point out the connections between published approaches and applications, and provide hands-on examples on real-world use cases and datasets.

As before, all our tutorial materials are available for free online, see

Entity Linking and Retrieval Tutorial @ SIGIR 2013 – Slides, Code, and Bibliography

The material for our “Entity Linking and Retrieval” tutorial (with Krisztian Balog and Daan Odijk) for SIGIR 2013 has been updated and is available online on GitHub (slides), Dropbox (slides), Mendeley, and Codecademy. All material is summarized at the webpage for the tutorial: See my other blogpost for a brief summary.

Semantic TED

Multilingual Semantic Linking for Video Streams: Making “Ideas Worth Sharing” More Accessible

This paper describes our (winning!) submission to the Developers Challenge at WoLE2013, “Doing Good by Linking Entities.” We present a fully automatic system – called “Semantic TED” – which provides intelligent suggestions in the form of links to Wikipedia articles for video streams in multiple languages, based on the subtitles that accompany the visual content. The system is applied to online conference talks. In particular, we adapt a recently proposed semantic linking approach for streams of television broadcasts to facilitate generating contextual links while a TED talk is being viewed. TED is a highly popular global conference series covering many research domains; the publicly available talks have accumulated a total view count of over one billion at the time of writing. We exploit the multi-linguality of Wikipedia and the TED subtitles to provide contextual suggestions in the language of the user watching a video. In this way, a vast source of educational and intellectual content is disclosed to a broad audience that might otherwise experience difficulties interpreting it.

  • D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual Semantic Linking for Video Streams: Making "Ideas Worth Sharing" More Accessible,” in Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013), 2013.