We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
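TimeRA itself infers rank scores via latent factor modeling; as a rough, hypothetical illustration of the time-aware reward idea only (not the paper's actual model), one could combine a simple reciprocal-rank fusion with a bonus for documents published in or near a "bursty" time bin, i.e., a bin holding an above-average number of highly ranked posts. All parameter values below (`bin_size`, `top_k`, `bonus`) are assumptions for the sketch:

```python
from collections import defaultdict

def fuse_with_burst_bonus(ranked_lists, timestamps, bin_size=3600, top_k=10, bonus=0.25):
    """Illustrative time-aware fusion: reciprocal-rank scores plus a reward
    for documents in or adjacent to a burst of highly ranked documents."""
    # Base score: reciprocal-rank fusion over all input lists.
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] += 1.0 / rank

    # Find "bursty" time bins: bins holding an above-average number
    # of documents that appear in the top-k of some input list.
    bin_counts = defaultdict(int)
    for ranking in ranked_lists:
        for doc in ranking[:top_k]:
            bin_counts[timestamps[doc] // bin_size] += 1
    bursty = set()
    if bin_counts:
        mean = sum(bin_counts.values()) / len(bin_counts)
        bursty = {b for b, c in bin_counts.items() if c > mean}

    # Reward documents published in, or right next to, a bursty bin.
    for doc in scores:
        b = timestamps[doc] // bin_size
        if b in bursty or b - 1 in bursty or b + 1 in bursty:
            scores[doc] += bonus
    return sorted(scores, key=scores.get, reverse=True)
```

The burst-detection step here is deliberately naive (count threshold over fixed bins); the point is only to show how temporal evidence can be folded into an aggregated ranking.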
WSDM is wrapping up today, with only the workshops left for tomorrow. All in all, it was an exciting WSDM with lots of interesting talks and discussions. And of course Times Square cannot be beaten as a conference venue location. Here are some papers/talks that caught my (semantic search) eye at WSDM, in no particular order:
- Entity Linking at the Tail: Sparse Signals, Unknown Entities, and Phrase Models
- Lessons from the Journey: A Query Log Analysis of Within-Session Learning
- On Building Entity Recommender Systems Using User Click Log and Freebase Knowledge
- Latent Dirichlet Allocation based Diversified Retrieval for E-commerce Search
- Improving Search Relevance for Short Queries in Community Question Answering
- Knowledge-based Graph Document Modeling
- Using Linked Data to Mine Facts from Wikipedia’s Tables
Until next year in Shanghai!
I’m happy to announce we’re releasing a new test collection for entity linking for web queries (within user sessions) to Wikipedia. About half of the queries in this dataset are sampled from Yahoo search logs, the other half comes from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.
With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.
This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold-standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.
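To illustrate how gold-standard span/entity annotations like these might be used for benchmarking, here is a minimal sketch of micro-averaged precision/recall/F1 over (span, entity) pairs. The input format below — a mapping from query ids to sets of (span, Wikipedia title) pairs — is an assumption for the example, not the dataset's actual file format:

```python
def span_link_prf(gold, predicted):
    """Micro-averaged precision, recall, and F1 over (span, entity) pairs,
    where `gold` and `predicted` map query ids to sets of such pairs.
    (Hypothetical input format, for illustration only.)"""
    tp = fp = fn = 0
    for qid in gold.keys() | predicted.keys():
        g = gold.get(qid, set())
        p = predicted.get(qid, set())
        tp += len(g & p)   # correctly linked (span, entity) pairs
        fp += len(p - g)   # spurious predictions
        fn += len(g - p)   # missed gold annotations
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Because each link is tied to a span, the same evaluation credits query segmentation as well as entity disambiguation.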
The key properties of the dataset are as follows.
- Queries are taken from Yahoo US Web Search and from the TREC Session track (2010–2013).
- There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
- The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
- The annotators have identified the “main” entity (or entities) for each query, where available.
- The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous. They further flagged queries that mention an out-of-Wikipedia entity, i.e., an entity for which no suitable Wikipedia article exists.
- The file includes session information: each session consists of an anonymized id, the initial query, and all queries issued within the same session, with their relative date/timestamp where available.
- Sessions are demarcated using a 30 minute time-out.
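The 30-minute time-out rule above can be sketched as follows (a minimal illustration; the input format of chronologically sorted `(timestamp, query)` pairs is an assumption, not the dataset's own representation):

```python
from datetime import timedelta

def demarcate_sessions(timestamped_queries, timeout=timedelta(minutes=30)):
    """Split a chronologically sorted list of (timestamp, query) pairs into
    sessions: a gap longer than `timeout` since the previous query starts
    a new session."""
    sessions = []
    for ts, query in timestamped_queries:
        if sessions and ts - sessions[-1][-1][0] <= timeout:
            sessions[-1].append((ts, query))  # continue current session
        else:
            sessions.append([(ts, query)])    # start a new session
    return sessions
```

Note that this variant measures the gap from the previous query rather than from the session start, which is the common interpretation of a session time-out.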
This morning, we presented the last edition of our tutorial series on Entity Linking and Retrieval, entitled “Entity Linking and Retrieval for Semantic Search” (with Krisztian Balog and Daan Odijk), at WSDM 2014! This final edition of the series builds upon our earlier tutorials at WWW 2013 and SIGIR 2013. The focus of this edition is on the practical applications of entity linking and retrieval, in particular for semantic search: more and more search engine users expect direct answers to their information needs, rather than just documents. Semantic search and its recent applications enable search engines to organize their wealth of information around entities. Entity linking and retrieval is at the basis of these developments, providing the building blocks for organizing the web of entities.
This tutorial aims to cover all facets of semantic search from a unified point of view and connect real-world applications with results from scientific publications. We provide a comprehensive overview of entity linking and retrieval in the context of semantic search and thoroughly explore techniques for query understanding, entity-based retrieval and ranking on unstructured text, structured knowledge repositories, and a mixture of these. We point out the connections between published approaches and applications, and provide hands-on examples on real-world use cases and datasets.
As before, all our tutorial materials are available for free online, see http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/.
RepLab is a competitive evaluation exercise for Online Reputation Management systems. In 2012 and 2013, RepLab focused on the problem of monitoring the reputation of (company) entities on Twitter, and dealt with the tasks of entity linking (“Is the tweet about the entity?”), reputation polarity (“Does the tweet have positive or negative implications for the entity’s reputation?”), topic detection (“What is the issue relative to the entity that is discussed in the tweet?”), and topic ranking (“Is the topic an alert that deserves immediate attention?”).
RepLab 2014 will again focus on Reputation Management on Twitter and will be addressing two new tasks, see below. We will use tweets in two languages: English and Spanish.
- The classification of tweets with respect to standard reputation dimensions such as Performance, Leadership, Innovation, etc.
- The classification of Twitter profiles (authors) with respect to a certain domain, e.g., as journalists, professionals, etc. This task also focuses on identifying opinion makers.
The second task is part of the shared PAN-RepLab author profiling task. Besides characterizing profiles from a reputation analysis perspective, participants can also attempt to classify authors by gender and age, which is the focus of PAN 2014.
- March 1 – Training data released
- March 17 – Test data released
- May 5 – System results due
See http://nlp.uned.es/replab2014/ for more info and how to participate.
In our paper “Using Temporal Bursts for Query Modeling”, we present an approach to query modeling that leverages the temporal distribution of documents in an initially retrieved set.