Linking queries to entities

I’m happy to announce we’re releasing a new test collection for entity linking for web queries (within user sessions) to Wikipedia. About half of the queries in this dataset are sampled from Yahoo search logs, the other half comes from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries, in the context of search sessions.
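
As a rough illustration of benchmarking against gold standard links, the sketch below scores one query's predicted links with set-based precision, recall, and F1. The README excerpt does not prescribe a metric, so this choice is an assumption, as are the function name, the (span, article) pair representation, and the example queries and article titles.

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall, and F1 over (span, article) pairs."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical annotations for a single query (not taken from the dataset).
gold_links = {("new york", "New_York_City"), ("pizza", "Pizza")}
predicted_links = {
    ("new york", "New_York_City"),
    ("new york pizza", "New_York-style_pizza"),
}

p, r, f = precision_recall_f1(gold_links, predicted_links)
print(p, r, f)
```

Because a link only counts as correct when both the span and the article match, the same function could score query segmentation by comparing spans alone.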

The key properties of the dataset are as follows.

  • Queries are taken from Yahoo US Web Search and from the TREC Session track (2010–2013).
  • There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
  • The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
  • The annotators have identified the “main” entity/ies for each query, if available.
  • The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous, and whether they mention an out-of-Wikipedia entity, i.e., an entity for which no suitable Wikipedia article exists.
  • The file includes session information: each session consists of an anonymized id, the initial query, and all the queries issued within the same session, with their relative date/timestamp if available.
  • Sessions are demarcated using a 30-minute time-out.
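
The 30-minute time-out in the last item can be sketched as follows. This is a minimal illustration, assuming the time-out is measured as inactivity between consecutive queries from the same user; the function name, the (timestamp, query) input format, and the sample log are hypothetical and do not reflect the dataset's file format.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)


def split_sessions(events, timeout=SESSION_TIMEOUT):
    """Group one user's (timestamp, query) pairs into sessions.

    A new session starts whenever the gap to the previous query
    exceeds `timeout`.
    """
    sessions = []
    previous_time = None
    for timestamp, query in sorted(events, key=lambda event: event[0]):
        if previous_time is None or timestamp - previous_time > timeout:
            sessions.append([])
        sessions[-1].append(query)
        previous_time = timestamp
    return sessions


# Hypothetical log for a single user.
log = [
    (datetime(2014, 1, 1, 9, 0), "obama"),
    (datetime(2014, 1, 1, 9, 10), "obama family tree"),
    (datetime(2014, 1, 1, 10, 0), "cheap flights"),
]
sessions = split_sessions(log)
print(sessions)
```

The third query arrives 50 minutes after the second, so it opens a new session.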

People searching for people: analysis of a people search engine log

Recent years show an increasing interest in vertical search: searching within a particular type of information. Understanding what people search for in these “verticals” gives direction to research and provides pointers for the search engines themselves. In this paper we analyze the search logs of one particular vertical: people search engines. Based on an extensive analysis of the logs of a search engine geared towards finding people, we propose a classification scheme for people search at three levels: (a) queries, (b) sessions, and (c) users. For queries, we identify three types, (i) event-based high-profile queries (people that become “popular” because of an event happening), (ii) regular high-profile queries (celebrities), and (iii) low-profile queries (other, less-known people). We present experiments on automatic classification of queries. On the session level, we observe five types: (i) family sessions (users looking for relatives), (ii) event sessions (querying the main players of an event), (iii) spotting sessions (trying to “spot” different celebrities online), (iv) polymerous sessions (sessions without a clear relation between queries), and (v) repetitive sessions (query refinement and copying). Finally, for users we identify four types: (i) monitors, (ii) spotters, (iii) followers, and (iv) polymers.

Our findings not only offer insight into search behavior in people search engines, but they are also useful to identify future research directions and to provide pointers for search engine improvements.

  • [PDF] W. Weerkamp, R. Berendsen, B. Kovachev, E. Meij, K. Balog, and M. de Rijke, “People searching for people: analysis of a people search engine log,” in Proceedings of the 34th international acm sigir conference on research and development in information, 2011.
    [Bibtex]
    @inproceedings{sigir:2011:weerkamp,
    Author = {Weerkamp, Wouter and Berendsen, Richard and Kovachev, Bogomil and Meij, Edgar and Balog, Krisztian and de Rijke, Maarten},
    Booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information},
    Date-Added = {2011-10-20 10:50:25 +0200},
    Date-Modified = {2012-10-30 08:41:27 +0000},
    Series = {SIGIR 2011},
    Title = {People searching for people: analysis of a people search engine log},
    Year = {2011},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/2009916.2009927}}



Classifying Queries Submitted to a Vertical Search Engine

We propose and motivate a scheme for classifying queries submitted to a people search engine. We specify a number of features for automatically classifying people queries into the proposed classes and examine the effectiveness of these features. Our main finding is that classification is feasible and that using information from past searches, clickouts and news sources is important.

  • [PDF] R. Berendsen, B. Kovachev, E. Meij, M. de Rijke, and W. Weerkamp, “Classifying queries submitted to a vertical search engine,” in Web science 2011, Koblenz, 2011.
    [Bibtex]
    @inproceedings{websci:2011:berendsen,
    Address = {Koblenz},
    Author = {Berendsen, R. and Kovachev, B. and Meij, E. and de Rijke, M. and Weerkamp, W.},
    Booktitle = {Web Science 2011},
    Date-Added = {2011-10-20 10:49:24 +0200},
    Date-Modified = {2012-10-30 08:39:05 +0000},
    Title = {Classifying Queries Submitted to a Vertical Search Engine},
    Year = {2011}}

Heuristic Ranking and Diversification of Web Documents

We describe the participation of the University of Amsterdam’s Intelligent Systems Lab in the web track at TREC 2009. We participated in the ad hoc and diversity tasks. We find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost the retrieval performance, which is assumed to potentially reduce spam in the top ranked results. As for the diversity task, we explored different methods. Clustering and a topic model-based approach have a similar performance and both are relatively better than a query log based approach.

  • [PDF] J. He, K. Balog, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “Heuristic ranking and diversification of web documents,” in The eighteenth text retrieval conference, 2010.
    [Bibtex]
    @inproceedings{TREC:2010:he,
    Abstract = {We describe the participation of the University of Amsterdam's Intelligent Systems Lab in the web track at TREC 2009. We participated in the adhoc and diversity task. We find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost the retrieval performance, which is assumed to potentially reduce spam in the top ranked results. As for the diversity task, we explored different methods. Clustering and a topic model-based approach have a similar performance and both are relatively better than a query log based approach.},
    Author = {He, J. and Balog, K. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Eighteenth Text REtrieval Conference},
    Date-Added = {2011-10-20 09:45:15 +0200},
    Date-Modified = {2012-10-30 09:24:20 +0000},
    Series = {TREC 2009},
    Title = {Heuristic Ranking and Diversification of Web Documents},
    Year = {2010}}

Learning Semantic Query Suggestions

An important application of semantic web technology is recognizing human-defined concepts in text. Query transformation is a strategy often used in search engines to derive queries that are able to return more useful search results than the original query and most popular search engines provide facilities that let users complete, specify, or reformulate their queries. We study the problem of semantic query suggestion, a special type of query transformation based on identifying semantic concepts contained in user queries. We use a feature-based approach in conjunction with supervised machine learning, augmenting term-based features with search history-based and concept-specific features. We apply our method to the task of linking queries from real-world query logs (the transaction logs of the Netherlands Institute for Sound and Vision) to the DBpedia knowledge base. We evaluate the utility of different machine learning algorithms, features, and feature types in identifying semantic concepts using a manually developed test bed and show significant improvements over an already high baseline. The resources developed for this paper, i.e., queries, human assessments, and extracted features, are available for download.

  • [PDF] E. Meij, M. Bron, B. Huurnink, L. Hollink, and M. de Rijke, “Learning semantic query suggestions,” in Proceedings of the 8th international conference on the semantic web, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:Meij,
    Abstract = {Learning Semantic Query Suggestions by Edgar Meij, Marc Bron, Laura Hollink, Bouke Huurnink and Maarten de Rijke is available online now. An important application of semantic web technology is recognizing human-defined concepts in text. Query transformation is a strategy often used in search engines to derive queries that are able to return more useful search results than the original query and most popular search engines provide facilities that let users complete, specify, or reformulate their queries. We study the problem of semantic query suggestion, a special type of query transformation based on identifying semantic concepts contained in user queries. We use a feature-based approach in conjunction with supervised machine learning, augmenting term-based features with search history-based and concept-specific features. We apply our method to the task of linking queries from real-world query logs (the transaction logs of the Netherlands Institute for Sound and Vision) to the DBpedia knowledge base. We evaluate the utility of different machine learning algorithms, features, and feature types in identifying semantic concepts using a manually developed test bed and show significant improvements over an already high baseline. The resources developed for this paper, i.e., queries, human assessments, and extracted features, are available for download. },
    Author = {E. Meij and M. Bron and B. Huurnink and Hollink, L. and de Rijke, M.},
    Booktitle = {Proceedings of the 8th International Conference on The Semantic Web},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:04 +0000},
    Series = {ISWC 2009},
    Title = {Learning Semantic Query Suggestions},
    Year = {2009}}

Investigating the Semantic Gap through Query Log Analysis

Significant efforts have focused in the past years on bringing large amounts of metadata online and the success of these efforts can be seen by the impressive number of web sites exposing data in RDFa or RDF/XML. However, little is known about the extent to which this data fits the needs of ordinary web users with everyday information needs. In this paper we study what we perceive as the semantic gap between the supply of data on the Semantic Web and the needs of web users as expressed in the queries submitted to a major Web search engine. We perform our analysis on both the level of instances and ontologies. First, we look at how much data is actually relevant to Web queries and what kind of data it is. Second, we provide a generic method to extract the attributes that Web users are searching for regarding particular classes of entities. This method allows us to contrast class definitions found in Semantic Web vocabularies with the attributes of objects that users are interested in. Our findings are crucial to measuring the potential of semantic search, but also speak to the state of the Semantic Web in general.

  • [PDF] P. Mika, E. Meij, and H. Zaragoza, “Investigating the semantic gap through query log analysis,” in Proceedings of the 8th international semantic web conference, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:mika,
    Author = {Peter Mika and Edgar Meij and Hugo Zaragoza},
    Booktitle = {Proceedings of the 8th International Semantic Web Conference},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:11 +0000},
    Series = {ISWC 2009},
    Title = {Investigating the Semantic Gap through Query Log Analysis.},
    Year = {2009},
    Bdsk-Url-1 = {http://dblp.uni-trier.de/db/conf/semweb/iswc2009.html#MikaMZ09}}

A Semantic Perspective on Query Log Analysis

We present our views on the CLEF log file analysis task. We argue for a task definition that focuses on the semantic enrichment of query logs. In addition, we discuss how additional information about the context in which queries are being made could further our understanding of users’ information seeking and how to better facilitate this process.

  • [PDF] K. Hofmann, M. de Rijke, B. Huurnink, and E. Meij, “A semantic perspective on query log analysis,” in Working notes for the clef 2009 workshop, 2009.
    [Bibtex]
    @inproceedings{CLEF:2009:hofmann,
    Abstract = {We present our views on the CLEF log file analysis task. We argue for a task definition that focuses on the semantic enrichment of query logs. In addition, we discuss how additional information about the context in which queries are being made could further our understanding of users' information seeking and how to better facilitate this process. },
    Author = {Hofmann, K. and de Rijke, M. and Huurnink, B. and Meij, E.},
    Booktitle = {Working Notes for the CLEF 2009 Workshop},
    Date-Added = {2011-10-17 09:46:16 +0200},
    Date-Modified = {2011-10-17 09:46:16 +0200},
    Title = {A Semantic Perspective on Query Log Analysis},
    Year = {2009}}

An evaluation of entity and frequency based query completion methods

From the days of Boolean search on library catalogues, users have reformulated their queries after an inspection of initial search results. Traditional information retrieval studies this in frameworks such as query expansion, relevance feedback, interactive retrieval, etc. These methods mostly exploit document contents because that is typically all the information that is available. The situation is very different in web search engines because of the large numbers of users whose queries are collected in query logs. Query logs reflect how large numbers of users express their queries and can be a rich source of information when optimizing search results or determining query suggestions.

In this paper we study a special case of query suggestion: query completion, which aims to help users complete their queries. In particular, we are interested in comparing a commonly adopted frequency-based approach with methods that exploit an understanding of the type of entities in queries. Our intuition is that completion for rare queries can be improved by understanding the type of entity being sought. For example, if we know that “LX354” is a kind of digital camera, we can generate sensible completions by choosing them from the set of completions used with other digital cameras. Besides suggesting queries, the obtained completions can also function as facets for faceted browsing or as input for ontology engineering since they represent query refinements common to a class of entities. In this paper, we address the following questions: (i) How can we recognize entities and their types in queries? (ii) How can we rank possible completions given an entity type? (iii) How can our methods be evaluated and how do they perform? To address (iii), we propose a novel method which evaluates the prediction of real web queries. We show that a purely frequency-based approach without any entity type information works quite well for more frequent queries, but is surpassed by type-based methods for rare queries.
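
The contrast between the two families of methods can be sketched roughly as follows. This is not the paper's method: it is a simplified illustration that counts completions per entity (frequency-based) and per entity type (type-based), and falls back to the type-level counts when an entity has too little evidence of its own. The function names, the `min_evidence` threshold, the fallback rule, and the toy log are all assumptions made for this example.

```python
from collections import Counter, defaultdict


def build_counts(log, entity_types):
    """Count completions per entity and per entity type.

    `log` holds (entity, completion) pairs; `entity_types` maps
    an entity to its type.
    """
    by_entity = defaultdict(Counter)
    by_type = defaultdict(Counter)
    for entity, completion in log:
        by_entity[entity][completion] += 1
        entity_type = entity_types.get(entity)
        if entity_type is not None:
            by_type[entity_type][completion] += 1
    return by_entity, by_type


def suggest(entity, entity_types, by_entity, by_type, k=3, min_evidence=2):
    """Rank completions by frequency, backing off to the entity's type."""
    entity_counts = by_entity.get(entity, Counter())
    if sum(entity_counts.values()) >= min_evidence:
        counts = entity_counts  # frequency-based: enough direct evidence
    else:
        counts = by_type.get(entity_types.get(entity), Counter())
    return [completion for completion, _ in counts.most_common(k)]


# Toy data; "LX354" never occurs in the log but is known to be a camera.
entity_types = {"canon eos": "camera", "nikon d90": "camera", "LX354": "camera"}
log = [
    ("canon eos", "lens"), ("canon eos", "lens"), ("canon eos", "review"),
    ("nikon d90", "review"), ("nikon d90", "review"), ("nikon d90", "price"),
]
by_entity, by_type = build_counts(log, entity_types)

frequent = suggest("canon eos", entity_types, by_entity, by_type, k=1)
rare = suggest("LX354", entity_types, by_entity, by_type, k=1)
print(frequent, rare)
```

For the frequent entity the top completion comes from its own history, while the unseen "LX354" inherits the most common completion among other cameras.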

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “An evaluation of entity and frequency based query completion methods,” in Proceedings of the 32nd international acm sigir conference on research and development in information retrieval, 2009.
    [Bibtex]
    @inproceedings{SIGIR:2009:meij,
    Author = {Meij, Edgar and Mika, Peter and Zaragoza, Hugo},
    Booktitle = {Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:25 +0000},
    Series = {SIGIR 2009},
    Title = {An evaluation of entity and frequency based query completion methods},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1571941.1572074}}

Investigating the Demand Side of Semantic Search through Query Log Analysis

Semantic search is by its broadest definition a collection of approaches that aim at matching the Web’s content with the information need of Web users at a semantic level. Most of the work in this area has focused on the supply-side of semantic search, in particular elevating Web content to the semantic level by relying on methods of information extraction or working with explicit metadata embedded inside or linked to Web resources. With respect to explicit metadata, several studies have been done on the adoption of semantic web formats in the wild, mostly based on statistics from the crawls of semantic web search engines. Much less effort has focused on the demand-side of semantic search, i.e., interpreting queries at the semantic level and studying information needs at this level. Conversely, little is known as to how much the supply of metadata actually matches the demand for information on the Web.

In this paper, we address the problem of studying the information need of Web searchers at an ontological level, i.e., in terms of the particular attributes of objects they are interested in. We describe a set of methods for extracting the context words to certain classes of objects from a Web search query log. We do so based on the idea that common context words reflect aspects of objects users are interested in. We implement these methods in an interactive tool called the Semantic Search Assist. The original purpose of this tool was to generate type-based query suggestions when there is not enough statistical evidence for entity-based query suggestions. However, from an ontology engineering perspective, this tool answers the question of what attributes a class of objects would have if the ontology for it was engineered purely based on the information needs of end users. As such it allows us to reflect on the gap between the properties defined in Semantic Web ontologies and the attributes of objects that people are searching for on the Web. We evaluate our tool by measuring its predictive power on the query log itself. We leave the study of the gap between particular information needs and Semantic Web data for future work.

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “Investigating the demand side of semantic search through query log analysis,” in Proceedings of the workshop on semantic search (semsearch 2009) at the 18th international world wide web conference (www 2009), 2009.
    [Bibtex]
    @inproceedings{semsearch:2009:meij,
    Author = {Edgar Meij and P. Mika and H. Zaragoza},
    Booktitle = {Proceedings of the Workshop on Semantic Search (SemSearch 2009) at the 18th International World Wide Web Conference (WWW 2009)},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:47 +0000},
    Title = {Investigating the Demand Side of Semantic Search through Query Log Analysis},
    Year = {2009}}