Fast and Space-Efficient Entity Linking in Queries

Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for an O(k^2) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times, at least two orders of magnitude faster than existing systems.
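To make the O(k^2) claim concrete, here is a minimal sketch of how independent segment scores allow a dynamic program over query segmentations. It is not the model from the paper: the `ALIASES` table, its probabilities, the `UNLINKED_LOGP` penalty, and the function names are all hypothetical and chosen only for illustration. Because each segment is scored on its own, the best segmentation needs only one lookup per contiguous span, and there are O(k^2) such spans in a query of k terms.

```python
import math
from typing import Dict, List, Optional, Tuple

# Hypothetical alias table: surface form -> {entity: P(entity | alias)}.
ALIASES: Dict[str, Dict[str, float]] = {
    "new york": {"New_York_City": 0.8, "New_York_(state)": 0.2},
    "york": {"York": 0.9},
    "new": {"New_(album)": 0.05},
    "pizza": {"Pizza": 0.95},
}
# Assumed per-term penalty for leaving a term unlinked.
UNLINKED_LOGP = math.log(0.01)


def best_entity(segment: str) -> Tuple[Optional[str], float]:
    """Score one segment in isolation: best entity and its log-probability."""
    candidates = ALIASES.get(segment)
    if not candidates:
        return None, UNLINKED_LOGP * len(segment.split())
    entity, p = max(candidates.items(), key=lambda kv: kv[1])
    return entity, math.log(p)


def link_query(query: str) -> List[Tuple[str, Optional[str]]]:
    """Return the highest-scoring segmentation as (segment, entity) pairs."""
    tokens = query.lower().split()
    k = len(tokens)
    # best[i] = (score, start of last segment, entity of last segment)
    # for the best segmentation of tokens[:i].
    best: List[Tuple[float, int, Optional[str]]] = [(0.0, 0, None)]
    best += [(-math.inf, 0, None)] * k
    for end in range(1, k + 1):
        for start in range(end):  # O(k^2) (start, end) spans in total
            segment = " ".join(tokens[start:end])
            entity, logp = best_entity(segment)
            score = best[start][0] + logp
            if score > best[end][0]:
                best[end] = (score, start, entity)
    # Walk the backpointers to recover the chosen segments.
    result: List[Tuple[str, Optional[str]]] = []
    end = k
    while end > 0:
        _, start, entity = best[end]
        result.append((" ".join(tokens[start:end]), entity))
        end = start
    return result[::-1]


print(link_query("new york pizza"))
```

With the toy table above, the query "new york pizza" is split into "new york" and "pizza", since linking "new" and "york" separately scores lower than linking the two-term span.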

  • R. Blanco, G. Ottaviano, and E. Meij, “Fast and space-efficient entity linking in queries,” in Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, 2015.
    [Bibtex]
    @inproceedings{WSDM:2015:blanco,
    Author = {Blanco, Roi and Ottaviano, Giuseppe and Meij, Edgar},
    Booktitle = {Proceedings of the eighth ACM international conference on Web search and data mining},
    Date-Added = {2011-10-26 11:21:51 +0200},
    Date-Modified = {2015-01-20 20:29:19 +0000},
    Series = {WSDM 2015},
    Title = {Fast and Space-Efficient Entity Linking in Queries},
    Year = {2015},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1935826.1935842}}
CIKM 2014

Time-Aware Rank Aggregation for Microblog Search

We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
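As a rough illustration of what time-aware aggregation means, the sketch below merges several ranked lists and then boosts posts that fall in a time window holding much of the fused score. It deliberately swaps in a much simpler technique than TimeRA: plain reciprocal-rank fusion plus a burst bonus, with no latent factor modeling, and it only rewards posts inside a burst window, not near one. The function name, the `bucket` width, and `burst_weight` are assumptions made for this example.

```python
from collections import defaultdict
from typing import Dict, List


def aggregate(ranked_lists: List[List[str]],
              timestamps: Dict[str, int],
              bucket: int = 3600,
              burst_weight: float = 0.5) -> List[str]:
    """Fuse ranked lists and reward posts published in high-scoring windows."""
    # Step 1: reciprocal-rank fusion; posts ranked highly in many lists win.
    base: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, post in enumerate(ranking, start=1):
            base[post] += 1.0 / rank

    # Step 2: burst mass of a time window = fused score of the posts in it.
    mass: Dict[int, float] = defaultdict(float)
    for post, score in base.items():
        mass[timestamps[post] // bucket] += score
    total = sum(mass.values()) or 1.0

    # Step 3: scale each post by the share of fused score its window holds.
    final = {
        post: score * (1.0 + burst_weight * mass[timestamps[post] // bucket] / total)
        for post, score in base.items()
    }
    # Sort by descending score, breaking ties by post id for determinism.
    return sorted(final, key=lambda post: (-final[post], post))


lists = [["a", "b", "c"], ["b", "a", "d"]]
times = {"a": 0, "b": 10, "c": 7200, "d": 20}
print(aggregate(lists, times))
```

In this toy input, posts "c" and "d" have the same fused score, but "d" was published in the same hour as the highly ranked posts "a" and "b", so it ends up above "c".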

Entity Linking and Retrieval for Semantic Search (WSDM 2014)

This morning, we presented the last edition of our tutorial series on Entity Linking and Retrieval, entitled “Entity Linking and Retrieval for Semantic Search” (with Krisztian Balog and Daan Odijk) at WSDM 2014! This final edition of the series builds upon our earlier tutorials at WWW 2013 and SIGIR 2013. The focus of this edition lies on the practical applications of Entity Linking and Retrieval, in particular for semantic search: more and more search engine users are expecting direct answers to their information needs (rather than just documents). Semantic search and its recent applications are enabling search engines to organize their wealth of information around entities. Entity linking and retrieval is at the basis of these developments, providing the building stones for organizing the web of entities.

This tutorial aims to cover all facets of semantic search from a unified point of view and connect real-world applications with results from scientific publications. We provide a comprehensive overview of entity linking and retrieval in the context of semantic search and thoroughly explore techniques for query understanding, entity-based retrieval and ranking on unstructured text, structured knowledge repositories, and a mixture of these. We point out the connections between published approaches and applications, and provide hands-on examples on real-world use cases and datasets.

As before, all our tutorial materials are available for free online, see http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/.

RepLab 2014

RepLab is a competitive evaluation exercise for Online Reputation Management systems. In 2012 and 2013, RepLab focused on the problem of monitoring the reputation of (company) entities on Twitter, and dealt with the tasks of entity linking (“Is the tweet about the entity?”), reputation polarity (“Does the tweet have positive or negative implications for the entity’s reputation?”), topic detection (“What is the issue relative to the entity that is discussed in the tweet?”), and topic ranking (“Is the topic an alert that deserves immediate attention?”).

RepLab 2014 will again focus on Reputation Management on Twitter and will address two new tasks, listed below. We will use tweets in two languages: English and Spanish.

  1. The classification of tweets with respect to standard reputation dimensions such as Performance, Leadership, Innovation, etc.
  2. The classification of Twitter profiles (authors) with respect to a certain domain, classifying them as journalists, professionals, etc. This task focuses on finding opinion makers.

The second task is a part of the shared PAN-RepLab author profiling task. Besides the characterization of profiles from a reputation analysis perspective, participants can also attempt to classify authors by gender and age, which is the focus of PAN 2014.

Important dates:

  • March 1 – Training data released
  • March 17 – Test data released
  • May 5 – System results due

See http://nlp.uned.es/replab2014/ for more info and how to participate.

Time-Aware Chi-squared for Document Filtering over Time

To appear at TAIA2013 (a SIGIR 2013 workshop).

Document filtering over time is widely applied in various tasks such as tracking topics in online news or social media. We consider it a classification task, where topics of interest correspond to classes, and the feature space consists of the words associated with each class. In “streaming” settings the set of words associated with a concept may change. In this paper we employ a multinomial Naive Bayes classifier and perform periodic feature selection to adapt to evolving topics. We propose two ways of employing Pearson’s χ² test for feature selection and demonstrate its benefit on the TREC KBA 2012 data set. By incorporating a time-dependent function in our equations for χ² we provide an elegant method for applying different weighting schemes. Experiments show improvements of our approach over a non-adaptive baseline.
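One way to picture a time-dependent χ² is to let every document contribute a decayed weight, instead of a count of one, to the usual 2×2 term/class contingency table. The sketch below does this with an exponential half-life decay. This is an assumption for illustration only; the weighting functions and exact equations in the paper may differ, and the function name and parameters are hypothetical.

```python
import math
from typing import Iterable, Tuple

# (timestamp, document contains the term, document belongs to the class)
Doc = Tuple[float, bool, bool]


def time_weighted_chi2(docs: Iterable[Doc], now: float, half_life: float) -> float:
    """Pearson's chi-squared for one term and one class, with decayed counts."""
    a = b = c = d = 0.0  # cells of the 2x2 contingency table
    for timestamp, has_term, in_class in docs:
        # Assumed weighting: a document loses half its weight every half_life.
        weight = 0.5 ** ((now - timestamp) / half_life)
        if has_term and in_class:
            a += weight
        elif has_term:
            b += weight
        elif in_class:
            c += weight
        else:
            d += weight
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0.0:
        return 0.0  # the term or the class never varies: no evidence
    return n * (a * d - b * c) ** 2 / denominator


docs = [(0.0, True, True), (0.0, True, True), (0.0, False, False), (0.0, False, False)]
# An infinite half-life disables the decay and gives the classic statistic.
print(time_weighted_chi2(docs, now=10.0, half_life=math.inf))
```

Swapping the decay line for another function of `now - timestamp` gives a different weighting scheme without touching the rest of the computation. Terms would then be re-ranked by this score at each periodic feature selection step.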

Semantic TED

Multilingual Semantic Linking for Video Streams: Making “Ideas Worth Sharing” More Accessible

This paper describes our (winning!) submission to the Developers Challenge at WoLE2013, “Doing Good by Linking Entities.” We present a fully automatic system – called “Semantic TED” – which provides intelligent suggestions in the form of links to Wikipedia articles for video streams in multiple languages, based on the subtitles that accompany the visual content. The system is applied to online conference talks. In particular, we adapt a recently proposed semantic linking approach for streams of television broadcasts to facilitate generating contextual links while a TED talk is being viewed. TED is a highly popular global conference series covering many research domains; the publicly available talks have accumulated a total view count of over one billion at the time of writing. We exploit the multi-linguality of Wikipedia and the TED subtitles to provide contextual suggestions in the language of the user watching a video. In this way, a vast source of educational and intellectual content is disclosed to a broad audience that might otherwise experience difficulties interpreting it.

  • D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual semantic linking for video streams: making "ideas worth sharing" more accessible,” in Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013), 2013.
    [Bibtex]
    @inproceedings{WOLE:2013:Odijk,
    Author = {Odijk, Daan and Meij, Edgar and Graus, David and Kenter, Tom},
    Booktitle = {Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013)},
    Date-Added = {2013-05-15 14:09:58 +0000},
    Date-Modified = {2013-05-15 14:11:37 +0000},
    Title = {Multilingual Semantic Linking for Video Streams: Making "Ideas Worth Sharing" More Accessible},
    Year = {2013}}
Trade-off between diversity and precision

Result diversification based on query-specific cluster ranking

Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters to be included in a ranked list in a round robin fashion. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework, ranking and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, while there should be no dominantly large clusters. Also, documents from these high-quality clusters should have a diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework.
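The round robin selection step can be sketched in a few lines. The example assumes that clusters have already been ranked and that the documents inside each cluster are already ordered by retrieval score; how clusters are built and ranked is the actual subject of the paper and is not shown here. The function name and the `num_clusters` cut-off are hypothetical.

```python
from collections import deque
from typing import List


def round_robin_diversify(ranked_clusters: List[List[str]],
                          num_clusters: int,
                          top_n: int) -> List[str]:
    """Interleave documents from the top-ranked clusters, one per cluster per round."""
    # Restrict diversification to the clusters ranked highest.
    queues = [deque(cluster) for cluster in ranked_clusters[:num_clusters]]
    result: List[str] = []
    # Keep cycling while any selected cluster still has documents left.
    while len(result) < top_n and any(queues):
        for queue in queues:
            if queue and len(result) < top_n:
                result.append(queue.popleft())
    return result


clusters = [["d1", "d2", "d3"], ["d4"], ["d5", "d6"], ["d7"]]
print(round_robin_diversify(clusters, num_clusters=3, top_n=5))
```

Here the fourth cluster is never consulted, and the first round takes one document from each of the three selected clusters before any cluster contributes a second one.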

  • J. He, E. Meij, and M. de Rijke, “Result diversification based on query-specific cluster ranking,” J. Am. Soc. Inf. Sci., vol. 62, iss. 3, pp. 550-571, 2011.
    [Bibtex]
    @article{JASIST:2011:he,
    Address = {New York, NY, USA},
    Author = {He, Jiyin and Meij, Edgar and de Rijke, Maarten},
    Citeulike-Article-Id = {9425102},
    Citeulike-Linkout-0 = {http://portal.acm.org/citation.cfm?id=1952338},
    Citeulike-Linkout-1 = {http://dx.doi.org/10.1002/asi.21468},
    Date-Added = {2011-10-20 10:40:50 +0200},
    Date-Modified = {2012-10-28 21:59:28 +0000},
    Doi = {10.1002/asi.21468},
    Issn = {1532-2882},
    Journal = {J. Am. Soc. Inf. Sci.},
    Keywords = {todo},
    Number = {3},
    Pages = {550--571},
    Posted-At = {2011-10-20 09:40:35},
    Priority = {2},
    Publisher = {Wiley Subscription Services, Inc., A Wiley Company},
    Title = {Result diversification based on query-specific cluster ranking},
    Url = {http://dx.doi.org/10.1002/asi.21468},
    Volume = {62},
    Year = {2011},
    Bdsk-Url-1 = {http://dx.doi.org/10.1002/asi.21468}}