
Utilizing Knowledge Bases in Text-centric Information Retrieval (WSDM 2017)

The past decade has witnessed the emergence of several publicly available and proprietary knowledge graphs (KGs). The increasing depth and breadth of content in KGs makes them not only rich sources of structured knowledge by themselves but also valuable resources for search systems. A surge of recent developments in entity linking and retrieval methods gave rise to a new line of research that aims at utilizing KGs for text-centric retrieval applications, making this an ideal time to pause and report current findings to the community, summarizing successful approaches, and soliciting new ideas. This tutorial is the first to disseminate the progress in this emerging field to researchers and practitioners.

Utilizing Knowledge Bases in Text-centric Information Retrieval (ICTIR 2016)

General-purpose knowledge bases keep growing in depth (content) and width (coverage). Moreover, algorithms for entity linking and entity retrieval have improved tremendously in recent years. Together, these developments have given rise to a new line of research that exploits and combines them for text-centric information retrieval applications. This tutorial focuses on (a) how to retrieve a set of entities for an ad-hoc query or, more broadly, how to assess the relevance of KB elements for an information need, (b) how to annotate text with such elements, and (c) how to use this information to assess the relevance of text. We discuss the different kinds of information available in a knowledge graph and how to leverage each most effectively.

Linking queries to entities

I’m happy to announce we’re releasing a new test collection for entity linking for web queries (within user sessions) to Wikipedia. About half of the queries in this dataset are sampled from Yahoo search logs, the other half comes from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold-standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.
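As a rough illustration of what benchmarking against such gold-standard links could look like, here is a minimal sketch that scores predicted (span, entity) links with precision, recall, and F1. The tuple layout, the exact-match criterion, and the example queries are assumptions made for this sketch; they are not the dataset's file format or its official evaluation procedure.

```python
from typing import Set, Tuple

# A link is (query_id, span_text, wikipedia_title). This layout is an
# assumption for illustration, not the dataset's actual file format.
Link = Tuple[str, str, str]


def precision_recall_f1(gold: Set[Link], predicted: Set[Link]) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact link matches."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical gold annotations and system output.
gold = {
    ("q1", "obama", "Barack_Obama"),
    ("q1", "white house", "White_House"),
    ("q2", "paris", "Paris"),
}
predicted = {
    ("q1", "obama", "Barack_Obama"),
    ("q2", "paris", "Paris_Hilton"),  # right span, wrong entity
}

precision, recall, f1 = precision_recall_f1(gold, predicted)
print(f"P={precision:.2f} R={recall:.2f} F1={f1:.2f}")
```

Dropping the entity from each tuple and comparing only (query_id, span_text) would turn the same function into a simple span-detection score, which is one way to use the span annotations for query segmentation experiments.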

The key properties of the dataset are as follows.

  • Queries are taken from Yahoo US Web Search and from the TREC Session track (2010-2013).
  • There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
  • The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
  • The annotators have identified the “main” entity/ies for each query, if available.
  • The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous and also if an out-of-Wikipedia entity is mentioned in the query, i.e., when an entity is mentioned in a query but no suitable Wikipedia article exists.
  • The file includes session information: each session consists of an anonymized id, initial query, as well as all the queries issued within the same session and their relative date/timestamp if available.
  • Sessions are demarcated using a 30-minute time-out.
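To make the session demarcation concrete, the following sketch splits one user's query log into sessions with a 30-minute time-out. It assumes the time-out is measured as the gap between consecutive queries, which the README excerpt does not spell out; the function name and the example log are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(
    queries: List[Tuple[datetime, str]],
    timeout: timedelta = SESSION_TIMEOUT,
) -> List[List[str]]:
    """Group one user's queries into sessions.

    A new session starts whenever the gap to the previous query
    exceeds the time-out (an assumed reading of "30-minute time-out").
    """
    sessions: List[List[str]] = []
    previous_time = None
    for timestamp, text in sorted(queries, key=lambda q: q[0]):
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])  # open a new session
        sessions[-1].append(text)
        previous_time = timestamp
    return sessions


# Hypothetical log: the third query arrives 50 minutes after the second.
log = [
    (datetime(2014, 1, 1, 9, 0), "trec session track"),
    (datetime(2014, 1, 1, 9, 10), "trec session track 2013"),
    (datetime(2014, 1, 1, 10, 0), "yahoo webscope"),
]
sessions = split_into_sessions(log)
print(sessions)
```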

Entity Linking and Retrieval Tutorial @ SIGIR 2013 – Slides, Code, and Bibliography

The material for our “Entity Linking and Retrieval” tutorial (with Krisztian Balog and Daan Odijk) for SIGIR 2013 has been updated and is available online on GitHub (slides), Dropbox (slides), Mendeley, and Codecademy. All material is summarized on the webpage for the tutorial. See my other blog post for a brief summary.

Semantic TED

Multilingual Semantic Linking for Video Streams: Making “Ideas Worth Sharing” More Accessible

This paper describes our (winning!) submission to the Developers Challenge at WoLE2013, “Doing Good by Linking Entities.” We present a fully automatic system – called “Semantic TED” – which provides intelligent suggestions in the form of links to Wikipedia articles for video streams in multiple languages, based on the subtitles that accompany the visual content. The system is applied to online conference talks. In particular, we adapt a recently proposed semantic linking approach for streams of television broadcasts to facilitate generating contextual links while a TED talk is being viewed. TED is a highly popular global conference series covering many research domains; the publicly available talks have accumulated a total view count of over one billion at the time of writing. We exploit the multi-linguality of Wikipedia and the TED subtitles to provide contextual suggestions in the language of the user watching a video. In this way, a vast source of educational and intellectual content is disclosed to a broad audience that might otherwise experience difficulties interpreting it.

  • D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual Semantic Linking for Video Streams: Making "Ideas Worth Sharing" More Accessible,” in Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013), 2013.

Hadoop code for TREC KBA

I’ve decided to put some of the Hadoop code I developed for the TREC KBA task online. It’s available on GitHub. In particular, it provides classes to read/write topic files, read/write run files, and expose the documents in the Thrift files as Hadoop-readable objects (‘ThriftFileInputFormat’) to be used as input to mappers. I obviously also implemented a toy KBA system on Hadoop :-). See GitHub for more info.

Twitter aspects

Identifying Entity Aspects in Microblog Posts

Online reputation management is about monitoring and handling the public image of entities (such as companies) on the Web. An important task in this area is identifying aspects of the entity of interest (such as products, services, competitors, key people, etc.) given a stream of microblog posts referring to the entity. In this paper we compare different IR techniques and opinion target identification methods for automatically identifying aspects and find that (i) simple statistical methods such as TF.IDF are a strong baseline for the task, being significantly better than opinion-oriented methods, and (ii) considering only terms tagged as nouns improves the results for all the methods analyzed.
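To give a feel for the TF.IDF baseline with a noun filter, here is a minimal sketch. It is not the paper's implementation: the weighting (term frequency in the entity's tweets times the log of the inverse document frequency over all tweets), the function name, the pre-tagged input format, and the toy tweets are all assumptions for illustration. Part-of-speech tags are expected to come from any external tagger.

```python
import math
from collections import Counter
from typing import List, Tuple

# A tweet is a list of (token, POS tag) pairs. Tags are assumed to come
# from an external tagger; noun tags are assumed to start with "N".
TaggedTweet = List[Tuple[str, str]]


def rank_aspects(
    entity_tweets: List[TaggedTweet],
    background_tweets: List[TaggedTweet],
    nouns_only: bool = True,
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank candidate aspect terms for an entity by a simple TF.IDF score."""

    def terms(tweet: TaggedTweet) -> List[str]:
        return [tok.lower() for tok, tag in tweet if not nouns_only or tag.startswith("N")]

    # Term frequency within the entity's tweets.
    tf = Counter(t for tweet in entity_tweets for t in terms(tweet))
    # Document frequency over entity and background tweets together.
    all_tweets = entity_tweets + background_tweets
    df = Counter(t for tweet in all_tweets for t in set(terms(tweet)))
    n_docs = len(all_tweets)
    scores = {t: freq * math.log(n_docs / df[t]) for t, freq in tf.items()}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Toy, hand-tagged tweets (hypothetical data).
entity_tweets = [
    [("new", "JJ"), ("iphone", "NN"), ("battery", "NN"), ("is", "VBZ"), ("great", "JJ")],
    [("iphone", "NN"), ("battery", "NN"), ("drains", "VBZ"), ("fast", "RB")],
    [("love", "VBP"), ("the", "DT"), ("iphone", "NN"), ("camera", "NN")],
]
background_tweets = [
    [("my", "PRP$"), ("iphone", "NN"), ("fell", "VBD")],
    [("great", "JJ"), ("weather", "NN"), ("today", "NN")],
    [("coffee", "NN"), ("time", "NN")],
]

ranked = rank_aspects(entity_tweets, background_tweets)
for term, score in ranked:
    print(f"{term}\t{score:.3f}")
```

Setting `nouns_only=False` lets adjectives and verbs such as “great” into the candidate list, which is the kind of noise the noun restriction is meant to remove.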

More information on the dataset that we created (and used in this paper) can be found here.

  • D. Spina, E. Meij, M. de Rijke, A. Oghina, B. M. Thuong, and M. Breuss, “Identifying Entity Aspects in Microblog Posts,” in The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), 2012.

A Corpus for Entity Profiling in Microblog Posts

Microblogs have become an invaluable source of information for online reputation management. An emerging problem in this field is identifying the key aspects of an entity that are commented on in microblog posts. Streams of microblog posts are of great value because of their direct and real-time nature, and synthesizing them in the form of entity profiles helps reputation managers keep track of the public image of the entity. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication.

In this paper we present two manually annotated corpora to evaluate the task of identifying aspects on Twitter, both of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human assessors then labeled each of the resulting candidates for relevance. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.
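For readers unfamiliar with pooling, the idea can be sketched as follows: take the top-ranked candidates from each extraction method and hand their union to the assessors. The method names, candidate terms, and pool depth below are hypothetical; the paper's actual methods and depth are not given in this summary.

```python
from typing import Dict, List


def build_pool(rankings: Dict[str, List[str]], depth: int) -> List[str]:
    """Union of the top-`depth` candidates from every method, sorted for stable output."""
    pool = set()
    for ranked_candidates in rankings.values():
        pool.update(ranked_candidates[:depth])
    return sorted(pool)


# Hypothetical ranked aspect candidates for one entity, one list per method.
rankings = {
    "method_a": ["battery", "camera", "screen", "price"],
    "method_b": ["camera", "store", "battery", "update"],
    "method_c": ["price", "battery", "ceo", "camera"],
}

pool = build_pool(rankings, depth=2)
print(pool)  # candidates to be judged by human assessors
```

Because assessors see only the pooled candidates, anything that no method ranked within the chosen depth remains unjudged, which is the usual trade-off of this methodology.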

You can find more information on this test collection at

  • D. Spina, E. Meij, A. Oghina, B. M. Thuong, M. Breuss, and M. de Rijke, “A Corpus for Entity Profiling in Microblog Posts,” in LREC 2012 Workshop on Language Engineering for Online Reputation Management, 2012.