CIKM 2016

Document Filtering for Long-tail Entities

Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on and are also trained on the specifics of differentiating features for each specific entity. Moreover, these approaches tend to use so-called extrinsic information such as Wikipedia page views and related entities which is typically only available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities.

In this paper we propose a document filtering method for long-tail entities that is entity-independent and thus also generalizes to unseen or rarely seen entities. It is based on intrinsic features, i.e., features that are derived from the documents in which the entities are mentioned. We propose a set of features that capture informativeness, entity-saliency, and timeliness. In particular, we introduce features based on entity aspect similarities, relation patterns, and temporal expressions and combine these with standard features for document filtering.

Experiments following the TREC KBA 2014 setup on a publicly available dataset show that our model is able to improve the filtering performance for long-tail entities over several baselines. Results of applying the model to unseen entities are promising, indicating that the model is able to learn the general characteristics of a vital document. The overall performance across all entities–i.e., not just long-tail entities–improves upon the state-of-the-art without depending on any entity-specific training data.

  • [PDF] R. Reinanda, E. Meij, and M. de Rijke, “Document filtering for long-tail entities,” in Cikm 2016: 25th acm conference on information and knowledge management, 2016.
    [Bibtex]
    @inproceedings{CIKM:2016:Reinanda,
    Author = {Reinanda, Ridho and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {CIKM 2016: 25th ACM Conference on Information and Knowledge Management},
    Date-Added = {2016-09-05 18:55:21 +0000},
    Date-Modified = {2016-09-05 19:00:33 +0000},
    Month = {October},
    Publisher = {ACM},
    Title = {Document filtering for long-tail entities},
    Year = {2016}}
WSDM

Dynamic Collective Entity Representations for Entity Ranking

Entity ranking, i.e., successfully positioning a relevant entity at the top of the ranking for a given query, is inherently difficult due to the potential mismatch between the entity’s description in a knowledge base, and the way people refer to the entity when searching for it. To counter this issue we propose a method for constructing dynamic collective entity representations. We collect entity descriptions from a variety of sources and combine them into a single entity representation by learning to weight the content from different sources that are associated with an entity for optimal retrieval effectiveness. Our method is able to add new descriptions in real time and learn the best representation as time evolves so as to capture the dynamics of how people search entities. Incorporating dynamic description sources into dynamic collective entity representations improves retrieval effectiveness by 7% over a state-of-the-art learning to rank baseline. Periodic retraining of the ranker enables higher ranking effectiveness for dynamic collective entity representations.

  • [PDF] D. Graus, M. Tsagkias, W. Weerkamp, E. Meij, and M. de Rijke, “Dynamic collective entity representations for entity ranking,” in Proceedings of the ninth acm international conference on web search and data mining, 2016.
    [Bibtex]
    @inproceedings{WSDM:2016:Graus,
    Author = {Graus, David and Tsagkias, Manos and Weerkamp, Wouter and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the ninth ACM international conference on Web search and data mining},
    Date-Added = {2016-01-07 17:24:16 +0000},
    Date-Modified = {2016-01-07 17:25:55 +0000},
    Series = {WSDM 2016},
    Title = {Dynamic Collective Entity Representations for Entity Ranking},
    Year = {2016},
    Bdsk-Url-1 = {http://aclweb.org/anthology/P15-1055}}
SIGIR logo

Mining, ranking and recommending entity aspects

Entity queries constitute a large fraction of web search queries and most of these queries are in the form of an entity mention plus some context terms that represent an intent in the context of that entity. We refer to these entity-oriented search intents as entity aspects. Recognizing entity aspects in a query can improve various search applications such as providing direct answers, diversifying search results, and recommending queries. In this paper we focus on the tasks of identifying, ranking, and recommending entity aspects, and propose an approach that mines, clusters, and ranks such aspects from query logs.

We perform large-scale experiments based on users’ search sessions from actual query logs to evaluate the aspect ranking and recommendation tasks. In the aspect ranking task, we aim to satisfy most users’ entity queries, and evaluate this task in a query-independent fashion. We find that entropy-based methods achieve the best performance compared to maximum likelihood and language modeling approaches. In the aspect recommendation task, we recommend other aspects related to the aspect currently being queried. We propose two approaches based on semantic relatedness and aspect transitions within user sessions and find that a combined approach gives the best performance. As an additional experiment, we utilize entity aspects for actual query recommendation and find that our approach improves the effectiveness of query recommendations built on top of the query-flow graph.

  • [PDF] R. Reinanda, E. Meij, and M. de Rijke, “Mining, ranking and recommending entity aspects,” in SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval, 2015.
    [Bibtex]
    @inproceedings{SIGIR:2015:Reinanda,
    Author = {Reinanda, Ridho and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {{SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval}},
    Date-Added = {2015-08-06 13:12:53 +0000},
    Date-Modified = {2015-08-06 13:39:33 +0000},
    Month = {August},
    Publisher = {ACM},
    Title = {Mining, ranking and recommending entity aspects},
    Year = {2015}}
SIGIR logo

Dynamic query modeling for related content finding

While watching television, people increasingly consume additional content related to what they are watching. We consider the task of finding video content related to a live television broadcast for which we leverage the textual stream of subtitles associated with the broadcast. We model this task as a Markov decision process and propose a method that uses reinforcement learning to directly optimize the retrieval effectiveness of queries generated from the stream of subtitles. Our dynamic query modeling approach significantly outperforms state-of-the-art baselines for stationary query modeling and for text-based retrieval in a television setting. In particular we find that carefully weighting terms and decaying these weights based on recency significantly improves effectiveness. Moreover, our method is highly efficient and can be used in a live television setting, i.e., in near real time.

  • [PDF] D. Odijk, E. Meij, I. Sijaranamual, and M. de Rijke, “Dynamic query modeling for related content finding,” in SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval, 2015.
    [Bibtex]
    @inproceedings{SIGIR:2015:Odijk,
    Author = {Odijk, Daan and Meij, Edgar and Sijaranamual, Isaac and de Rijke, Maarten},
    Booktitle = {{SIGIR 2015: 38th international ACM SIGIR conference on Research and development in information retrieval}},
    Date-Added = {2015-08-06 13:14:13 +0000},
    Date-Modified = {2015-08-06 13:39:24 +0000},
    Month = {August},
    Publisher = {ACM},
    Title = {Dynamic query modeling for related content finding},
    Year = {2015}}
ACL logo

Learning to Explain Entity Relationships in Knowledge Graphs

We study the problem of explaining relationships between pairs of knowledge graph entities with human-readable descriptions. Our method extracts and enriches sentences that refer to an entity pair from a corpus and ranks the sentences according to how well they describe the relationship between the entities. We model this task as a learning to rank problem for sentences and employ a rich set of features. When evaluated on a large set of manually annotated sentences, we find that our method significantly improves over state-of-the-art baseline models.

  • [PDF] N. Voskarides, E. Meij, M. Tsagkias, M. de Rijke, and W. Weerkamp, “Learning to explain entity relationships in knowledge graphs,” in Proceedings of the 53rd annual meeting of the association for computational linguistics and the 7th international joint conference on natural language processing (volume 1: long papers), 2015, pp. 564-574.
    [Bibtex]
    @inproceedings{ACL:2015:Voskarides,
    Author = {Voskarides, Nikos and Meij, Edgar and Tsagkias, Manos and de Rijke, Maarten and Weerkamp, Wouter},
    Booktitle = {Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)},
    Date-Added = {2015-08-06 13:08:02 +0000},
    Date-Modified = {2015-08-06 13:08:14 +0000},
    Location = {Beijing, China},
    Pages = {564--574},
    Publisher = {Association for Computational Linguistics},
    Title = {Learning to Explain Entity Relationships in Knowledge Graphs},
    Url = {http://aclweb.org/anthology/P15-1055},
    Year = {2015},
    Bdsk-Url-1 = {http://aclweb.org/anthology/P15-1055}}
wsdm 2015

Fast and Space-Efficient Entity Linking in Queries

Entity linking deals with identifying entities from a knowledge base in a given piece of text and has become a fundamental building block for web search engines, enabling numerous downstream improvements from better document ranking to enhanced search results pages. A key problem in the context of web search queries is that this process needs to run under severe time constraints as it has to be performed before any actual retrieval takes place, typically within milliseconds. In this paper we propose a probabilistic model that leverages user-generated information on the web to link queries to entities in a knowledge base. There are three key ingredients that make the algorithm fast and space-efficient. First, the linking process ignores any dependencies between the different entity candidates, which allows for a O(k^2) implementation in the number of query terms. Second, we leverage hashing and compression techniques to reduce the memory footprint. Finally, to equip the algorithm with contextual knowledge without sacrificing speed, we factor the distance between distributional semantics of the query words and entities into the model. We show that our solution significantly outperforms several state-of-the-art baselines by more than 14% while being able to process queries in sub-millisecond times—at least two orders of magnitude faster than existing systems.

  • [PDF] R. Blanco, G. Ottaviano, and E. Meij, “Fast and space-efficient entity linking in queries,” in Proceedings of the eighth acm international conference on web search and data mining, 2015.
    [Bibtex]
    @inproceedings{WSDM:2015:blanco,
    Author = {Blanco, Roi and Ottaviano, Giuseppe and Meij, Edgar},
    Booktitle = {Proceedings of the eighth ACM international conference on Web search and data mining},
    Date-Added = {2011-10-26 11:21:51 +0200},
    Date-Modified = {2015-01-20 20:29:19 +0000},
    Series = {WSDM 2015},
    Title = {Fast and Space-Efficient Entity Linking in Queries},
    Year = {2015},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1935826.1935842}}
CIKM 2014

Time-Aware Rank Aggregation for Microblog Search

We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.

WSDM

WSDM 2014, a recap

WSDM is wrapping up today, with only the workshops left for tomorrow. All in all, it was an exciting WSDM with lots of interesting talks and discussions. And of course Times Square cannot be beaten as conference venue location. Some papers/talks that caught my (semantic search) eye at WSDM, in no particular order.

Until next year in Shanghai!

 

Yahoo Labs

Linking queries to entities

I’m happy to announce we’re releasing a new test collection for entity linking for web queries (within user sessions) to Wikipedia. About half of the queries in this dataset are sampled from Yahoo search logs, the other half comes from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. With releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e, that perform query segmentation for web search queries, in the context of search sessions.

The key properties of the dataset are as follows.

  • Queries are taken from Yahoo US Web Search and from the TREC Session track (2010-2013).
  • There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
  • The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
  • The annotators have identified the “main” entity/ies for each query, if available.
  • The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous and also if an out-of-Wikipedia entity is mentioned in the query, i.e., when an entity is mentioned in a query but no suitable Wikipedia article exists.
  • The file includes session information: each session consists of an anonymized id, initial query, as well as all the queries issued within the same session and their relative date/timestamp if available.
  • Sessions are demarcated using a 30 minute time-out.
Entity Linking

Entity Linking and Retrieval for Semantic Search (WSDM 2014)

This morning, we presented the last edition of our tutorial series on Entity Linking and Retrieval, entitled “Entity Linking and Retrieval for Semantic Search” (with Krisztian Balog and Daan Odijk) at WSDM 2014! This final edition of the series builds upon our earlier tutorials at WWW 2013 and SIGIR 2013. The focus of this edition lies on the practical applications of Entity Linking and Retrieval, in particular for semantic search: more and more search engine users are expecting direct answers to their information needs (rather than just documents). Semantic search and its recent applications are enabling search engines to organize their wealth of information around entities. Entity linking and retrieval is at the basis of these developments, providing the building stones for organizing the web of entities.

This tutorial aims to cover all facets of semantic search from a unified point of view and connect real-world applications with results from scientific publications. We provide a comprehensive overview of entity linking and retrieval in the context of semantic search and thoroughly explore techniques for query understanding, entity-based retrieval and ranking on unstructured text, structured knowledge repositories, and a mixture of these. We point out the connections between published approaches and applications, and provide hands-on examples on real-world use cases and datasets.

As before, all our tutorial materials are available for free online, see http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/.