
A query model based on normalized log-likelihood

A query is usually a brief, sometimes imprecise expression of an underlying information need. Examining how queries can be transformed into equivalent, potentially better queries is a recurring theme in the information retrieval community. Such transformations include expanding short queries into longer ones, paraphrasing queries using an alternative vocabulary, mapping unstructured queries to structured ones, and identifying key concepts in verbose queries.

To inform the transformation process, multiple types of information sources have been considered. A recent one is search engine logs, used for query substitutions. Another recent example is letting users complement their traditional keyword query with additional information, such as example documents, tags, images, categories, or their search history. The ultimate source of information for transforming a query, however, is the user, through relevance feedback: given a query and a set of judged documents for that query, how does a system take advantage of the judgments to transform the original query and retrieve more documents that will be useful to the user? As demonstrated by the recent launch of a dedicated relevance feedback track at TREC, we still lack a definitive answer to this question.

Let’s consider an example to see what aspects play a role in transforming a query based on judgments for a set of initially retrieved documents. Suppose we have a set of documents that are judged to be relevant to a query. These documents may vary in length and need not be completely on topic, since they may discuss more topics than the ones relevant to the query. While the users’ judgments are at the document level, not all sections of the documents can be assumed to be equally relevant. Most currently available relevance feedback models do not capture this phenomenon; instead, they attempt to transform the original query based on the full content of the documents. Clearly this is not ideal, and we would like to account for the possibly multi-faceted character of documents. We hypothesize that a relevance feedback model that captures the topical structure of individual judged documents (“For each judged document, what is important about it?”) as well as of the set of all judged documents (“Which topics are shared by the entire set of judged documents?”) will outperform relevance feedback models that capture only one of these types of information.

We work in a language modeling (LM) setting, and our aim in this paper is to present an LM-based relevance feedback model that uses both types of information—about the topical relevance of a document and about the general topic of the set of relevant documents—to transform the original query. The proposed model uses the whole set of relevance assessments to determine how much each document that has been judged relevant should contribute to the query transformation. We use the TREC relevance feedback track test collection to evaluate our model and compare it to other, established relevance feedback methods. We answer the following two research questions in this paper. (i) Can we develop a relevance feedback model that uses evidence from both the individual relevant documents and the set of relevant documents as a whole? (ii) Can our new model achieve state-of-the-art results and how do these results compare to related models? In our evaluation, the proposed model significantly improves over state-of-the-art feedback methods and achieves superior performance over all evaluated models.
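
To make the idea concrete, here is a minimal sketch in Python of one way such a feedback query model could be estimated: each judged-relevant document contributes its smoothed term distribution, weighted by how well that document's language model agrees with a language model built from the entire set of relevant documents (a normalized log-likelihood style score). The smoothing method, the exponential weighting, and all function names are illustrative assumptions, not the paper's exact formulation.

    import math
    from collections import Counter

    def smoothed_lm(tokens, background, lam=0.5):
        # Jelinek-Mercer smoothed unigram language model over a token list.
        tf = Counter(tokens)
        n = sum(tf.values()) or 1
        vocab = set(tf) | set(background)
        return {t: lam * tf[t] / n + (1 - lam) * background.get(t, 1e-9)
                for t in vocab}

    def avg_log_likelihood(reference, doc_model):
        # Log-likelihood of the reference (set) model's terms under the
        # document model, used here as a document weight.
        return sum(p * math.log(doc_model.get(t, 1e-9))
                   for t, p in reference.items())

    def feedback_query_model(relevant_docs, background, top_k=20):
        # relevant_docs: token lists of the judged-relevant documents.
        # Documents whose language model sits close to the model of the
        # whole relevant set get a larger say in the expanded query model.
        set_model = smoothed_lm([t for d in relevant_docs for t in d], background)
        doc_models = [smoothed_lm(d, background) for d in relevant_docs]
        scores = [avg_log_likelihood(set_model, dm) for dm in doc_models]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]  # stabilized soft weighting
        z = sum(weights)
        expanded = Counter()
        for w, dm in zip(weights, doc_models):
            for term, p in dm.items():
                expanded[term] += (w / z) * p
        return dict(expanded.most_common(top_k))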

  • [PDF] E. Meij, W. Weerkamp, and M. de Rijke, “A query model based on normalized log-likelihood,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
    [Bibtex]
    @inproceedings{CIKM:2009:Meij,
    Author = {Meij, Edgar and Weerkamp, Wouter and de Rijke, Maarten},
    Booktitle = {Proceedings of the 18th ACM conference on Information and knowledge management},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:42:51 +0000},
    Series = {CIKM 2009},
    Title = {A query model based on normalized log-likelihood},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1645953.1646261}}
Histogram indicating the number of documents vs the number of keyphrases

A Comparative Study of Features for Keyphrase Extraction

Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Many different features have been suggested, and have been used individually or in combination. However, it is not clear which of these features are most informative for this task.

We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new corpus that consists of full-text journal articles and is substantially larger than the data sets used in previous work. In addition, the rich collection and document structure available at the publishing stage is explicitly annotated. We suggest new features based on this structure and compare them to existing features, analyzing how the different features capture different aspects of the keyphrase extraction task.
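
For readers who want a feel for what these features look like in practice, the sketch below (in Python, with assumed section names and an assumed feature set, not the paper's actual list) computes a few classic term-based features next to simple structure-based ones.

    def candidate_features(doc_text, sections, candidates):
        # Toy feature extractor for keyphrase candidates. `sections` maps a
        # structural element (e.g. "title", "abstract", "headings") to its
        # content; the feature set below is illustrative only.
        text = doc_text.lower()
        length = max(len(text), 1)
        features = {}
        for cand in candidates:
            c = cand.lower()
            first = text.find(c)
            features[cand] = {
                "tf": text.count(c),                                 # frequency in body text
                "first_pos": first / length if first >= 0 else 1.0,  # relative first occurrence
                "in_title": c in sections.get("title", "").lower(),
                "in_abstract": c in sections.get("abstract", "").lower(),
                "in_headings": any(c in h.lower()
                                   for h in sections.get("headings", [])),
                "num_tokens": len(c.split()),                        # phrase length
            }
        return features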

  • [PDF] K. Hofmann, M. Tsagkias, E. Meij, and M. de Rijke, “The impact of document structure on keyphrase extraction,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
    [Bibtex]
    @inproceedings{CIKM:2009:hofmann,
    Author = {Hofmann, Katja and Tsagkias, Manos and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 18th ACM conference on Information and knowledge management},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:42:45 +0000},
    Series = {CIKM 2009},
    Title = {The impact of document structure on keyphrase extraction},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1645953.1646215}}

Learning Semantic Query Suggestions

An important application of semantic web technology is recognizing human-defined concepts in text. Query transformation is a strategy often used in search engines to derive queries that return more useful search results than the original query, and most popular search engines provide facilities that let users complete, specify, or reformulate their queries. We study the problem of semantic query suggestion, a special type of query transformation based on identifying semantic concepts contained in user queries. We use a feature-based approach in conjunction with supervised machine learning, augmenting term-based features with search history-based and concept-specific features. We apply our method to the task of linking queries from real-world query logs (the transaction logs of the Netherlands Institute for Sound and Vision) to the DBpedia knowledge base. We evaluate the utility of different machine learning algorithms, features, and feature types in identifying semantic concepts using a manually developed test bed and show significant improvements over an already high baseline. The resources developed for this paper, i.e., queries, human assessments, and extracted features, are available for download.
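
As a rough illustration of the feature-based, supervised setup: the sketch below scores (query, candidate concept) pairs and trains a classifier on judged examples. The concept fields, the feature names, and the use of scikit-learn's RandomForestClassifier are assumptions made for this sketch, not the paper's actual feature set or learner.

    from sklearn.ensemble import RandomForestClassifier

    def concept_features(query, concept):
        # Illustrative features for a (query, candidate concept) pair;
        # `concept` is assumed to be a dict with a "label" and an optional
        # "inlinks" popularity count.
        q_terms = set(query.lower().split())
        c_terms = set(concept["label"].lower().split())
        overlap = len(q_terms & c_terms)
        return [
            overlap,                                         # shared terms
            overlap / max(len(c_terms), 1),                  # fraction of label matched
            int(concept["label"].lower() == query.lower()),  # exact label match
            len(c_terms),                                    # label length
            concept.get("inlinks", 0),                       # popularity proxy
        ]

    def train_concept_linker(training_pairs):
        # training_pairs: (query, concept_dict, is_relevant) triples with
        # manually judged labels; the choice of learner is arbitrary here.
        X = [concept_features(q, c) for q, c, _ in training_pairs]
        y = [label for _, _, label in training_pairs]
        clf = RandomForestClassifier(n_estimators=100, random_state=0)
        clf.fit(X, y)
        return clf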

  • [PDF] E. Meij, M. Bron, B. Huurnink, L. Hollink, and M. de Rijke, “Learning semantic query suggestions,” in Proceedings of the 8th International Conference on the Semantic Web, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:Meij,
    Author = {Meij, Edgar and Bron, Marc and Huurnink, Bouke and Hollink, Laura and de Rijke, Maarten},
    Booktitle = {Proceedings of the 8th International Conference on The Semantic Web},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:04 +0000},
    Series = {ISWC 2009},
    Title = {Learning Semantic Query Suggestions},
    Year = {2009}}
Distribution of structured data embedded in XHTML

Investigating the Semantic Gap through Query Log Analysis

Significant efforts have focused in the past years on bringing large amounts of metadata online, and the success of these efforts can be seen in the impressive number of web sites exposing data in RDFa or RDF/XML. However, little is known about the extent to which this data fits the needs of ordinary web users with everyday information needs. In this paper we study what we perceive as the semantic gap between the supply of data on the Semantic Web and the needs of web users as expressed in the queries submitted to a major Web search engine. We perform our analysis at both the instance level and the ontology level. First, we look at how much data is actually relevant to Web queries and what kind of data it is. Second, we provide a generic method to extract the attributes that Web users are searching for regarding particular classes of entities. This method allows us to contrast class definitions found in Semantic Web vocabularies with the attributes of objects that users are interested in. Our findings are crucial to measuring the potential of semantic search, but also speak to the state of the Semantic Web in general.
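
To illustrate the second step, here is a small Python sketch that collects the words users type around known entities and aggregates them per class; these aggregated context words act as the attributes that can then be contrasted with class definitions in Semantic Web vocabularies. Entity detection is naive substring matching and the data structures are assumptions, so this is a sketch of the idea rather than the method from the paper.

    from collections import Counter, defaultdict

    def class_attributes(queries, entity_to_class, top_k=10):
        # queries:         iterable of query strings from a log
        # entity_to_class: maps a known entity string to its class,
        #                  e.g. {"canon eos 450d": "digital camera"}
        per_class = defaultdict(Counter)
        for q in queries:
            q_low = q.lower()
            for entity, cls in entity_to_class.items():
                if entity in q_low:
                    context = q_low.replace(entity, " ").split()
                    per_class[cls].update(context)
        return {cls: [w for w, _ in counts.most_common(top_k)]
                for cls, counts in per_class.items()}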

  • [PDF] P. Mika, E. Meij, and H. Zaragoza, “Investigating the semantic gap through query log analysis,” in Proceedings of the 8th International Semantic Web Conference, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:mika,
    Author = {Peter Mika and Edgar Meij and Hugo Zaragoza},
    Booktitle = {Proceedings of the 8th International Semantic Web Conference},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:11 +0000},
    Series = {ISWC 2009},
    Title = {Investigating the Semantic Gap through Query Log Analysis},
    Year = {2009},
    Bdsk-Url-1 = {http://dblp.uni-trier.de/db/conf/semweb/iswc2009.html#MikaMZ09}}

A Semantic Perspective on Query Log Analysis

We present our views on the CLEF log file analysis task. We argue for a task definition that focuses on the semantic enrichment of query logs. In addition, we discuss how additional information about the context in which queries are being made could further our understanding of users’ information seeking and how to better facilitate this process.

  • [PDF] K. Hofmann, M. de Rijke, B. Huurnink, and E. Meij, “A semantic perspective on query log analysis,” in Working Notes for the CLEF 2009 Workshop, 2009.
    [Bibtex]
    @inproceedings{CLEF:2009:hofmann,
    Author = {Hofmann, Katja and de Rijke, Maarten and Huurnink, Bouke and Meij, Edgar},
    Booktitle = {Working Notes for the CLEF 2009 Workshop},
    Date-Added = {2011-10-17 09:46:16 +0200},
    Date-Modified = {2011-10-17 09:46:16 +0200},
    Title = {A Semantic Perspective on Query Log Analysis},
    Year = {2009}}
Type completion

An evaluation of entity and frequency based query completion methods

Since the days of Boolean search on library catalogues, users have reformulated their queries after inspecting initial search results. Traditional information retrieval studies this in frameworks such as query expansion, relevance feedback, and interactive retrieval. These methods mostly exploit document contents, because that is typically all the information available. The situation is very different for web search engines because of the large numbers of users whose queries are collected in query logs. Query logs reflect how large numbers of users express their queries and can be a rich source of information when optimizing search results or determining query suggestions.

In this paper we study a special case of query suggestion: query completion, which aims to help users complete the query they are typing. In particular, we are interested in comparing a commonly adopted frequency-based approach with methods that exploit an understanding of the type of entities in queries. Our intuition is that completion for rare queries can be improved by understanding the type of entity being sought. For example, if we know that “LX354” is a kind of digital camera, we can generate sensible completions by choosing them from the set of completions used with other digital cameras. Besides suggesting queries, the obtained completions can also function as facets for faceted browsing or as input for ontology engineering, since they represent query refinements common to a class of entities. In this paper, we address the following questions: (i) How can we recognize entities and their types in queries? (ii) How can we rank possible completions given an entity type? (iii) How can our methods be evaluated and how do they perform? To address (iii), we propose a novel evaluation method based on predicting real web queries. We show that a purely frequency-based approach without any entity type information works quite well for more frequent queries, but is surpassed by type-based methods for rare queries.
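
The contrast between the two families of methods can be made concrete with a small Python sketch: a frequency-based completer simply ranks observed queries that share the typed prefix, while a type-based completer pools the suffixes seen after any entity of the same type. Names, signatures, and the plain counting used here are illustrative assumptions, not the ranking models evaluated in the paper.

    from collections import Counter

    def frequency_completions(prefix, query_log, k=5):
        # Baseline: rank full queries extending `prefix` by log frequency.
        counts = Counter(q for q in query_log if q.startswith(prefix))
        return [q for q, _ in counts.most_common(k)]

    def type_based_completions(entity, entity_type, query_log, entity_to_type, k=5):
        # Pool the suffixes that follow *any* entity of the same type, so a
        # rare entity (say, a new camera model) inherits completions
        # observed for other entities of its class.
        suffix_counts = Counter()
        for q in query_log:
            for other, t in entity_to_type.items():
                if t == entity_type and q.startswith(other + " "):
                    suffix_counts[q[len(other) + 1:]] += 1
        return [entity + " " + s for s, _ in suffix_counts.most_common(k)]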

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “An evaluation of entity and frequency based query completion methods,” in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009.
    [Bibtex]
    @inproceedings{SIGIR:2009:meij,
    Author = {Meij, Edgar and Mika, Peter and Zaragoza, Hugo},
    Booktitle = {Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:25 +0000},
    Series = {SIGIR 2009},
    Title = {An evaluation of entity and frequency based query completion methods},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1571941.1572074}}

Investigating the Demand Side of Semantic Search through Query Log Analysis

Semantic search is, by its broadest definition, a collection of approaches that aim at matching the Web’s content with the information need of Web users at a semantic level. Most of the work in this area has focused on the supply side of semantic search, in particular elevating Web content to the semantic level by relying on methods of information extraction or working with explicit metadata embedded inside or linked to Web resources. With respect to explicit metadata, several studies have been done on the adoption of Semantic Web formats in the wild, mostly based on statistics from the crawls of Semantic Web search engines. Much less effort has focused on the demand side of semantic search, i.e., interpreting queries at the semantic level and studying information needs at this level. Consequently, little is known as to how much the supply of metadata actually matches the demand for information on the Web.

In this paper, we address the problem of studying the information need of Web searchers at an ontological level, i.e., in terms of the particular attributes of objects they are interested in. We describe a set of methods for extracting the context words associated with certain classes of objects from a Web search query log. We do so based on the idea that common context words reflect aspects of the objects users are interested in. We implement these methods in an interactive tool called the Semantic Search Assist. The original purpose of this tool was to generate type-based query suggestions when there is not enough statistical evidence for entity-based query suggestions. However, from an ontology engineering perspective, this tool answers the question of what attributes a class of objects would have if its ontology were engineered purely based on the information needs of end users. As such, it allows us to reflect on the gap between the properties defined in Semantic Web ontologies and the attributes of objects that people are searching for on the Web. We evaluate our tool by measuring its predictive power on the query log itself. We leave the study of the gap between particular information needs and Semantic Web data for future work.
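
The evaluation mentioned above, measuring the tool's predictive power on the query log itself, could look roughly like the held-out recall computation below. The metric, the signature of the suggester, and the data layout are all assumptions made for illustration, not the protocol used in the paper.

    def predictive_power(entity_refinements, suggest, k=5):
        # entity_refinements: maps an entity to the set of refinement strings
        #                     observed for it in the log
        # suggest:            callable taking an entity and returning suggested
        #                     refinements built *without* that entity's own queries
        recalls = []
        for entity, observed in entity_refinements.items():
            if not observed:
                continue
            predicted = set(suggest(entity)[:k])
            recalls.append(len(predicted & observed) / len(observed))
        return sum(recalls) / len(recalls) if recalls else 0.0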

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “Investigating the demand side of semantic search through query log analysis,” in Proceedings of the Workshop on Semantic Search (SemSearch 2009) at the 18th International World Wide Web Conference (WWW 2009), 2009.
    [Bibtex]
    @inproceedings{semsearch:2009:meij,
    Author = {Meij, Edgar and Mika, Peter and Zaragoza, Hugo},
    Booktitle = {Proceedings of the Workshop on Semantic Search (SemSearch 2009) at the 18th International World Wide Web Conference (WWW 2009)},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:47 +0000},
    Title = {Investigating the Demand Side of Semantic Search through Query Log Analysis},
    Year = {2009}}
Annals of Information Systems

Semantic disclosure in an e-Science environment

The Virtual Laboratory for e-Science (VL-e) project serves as a backdrop for the ideas described in this chapter. VL-e is a project with academic and industrial partners in which e-science has been applied to several domains of scientific research. Adaptive Information Disclosure (AID), a subprogram within VL-e, is a multi-disciplinary group that concentrates expertise in information extraction, machine learning, and the Semantic Web – a powerful combination of technologies that can be used to extract and store knowledge in a Semantic Web framework. In this chapter, the authors explain what “semantic disclosure” means and how it is essential to knowledge sharing in e-Science. The authors describe several Semantic Web applications and how they were built using components of the AIDA Toolkit (AID Application Toolkit). The lessons learned and the future of e-Science are also discussed.

  • [PDF] M. S. Marshall, M. Roos, E. Meij, S. Katrenko, W. R. van Hage, and P. W. Adriaans, “Semantic disclosure in an e-Science environment,” in Semantic e-Science (Springer Annals of Information Systems, AoIS), 2009.
    [Bibtex]
    @inproceedings{AIS:2009:marshall,
    Author = {Marshall, M.S. and Roos, M. and Meij, E. and Katrenko, S. and van Hage, W.R. and Adriaans, P.W.},
    Booktitle = {Semantic e-Science (Springer Annals of Information Systems AoIS)},
    Date-Added = {2011-10-16 15:03:17 +0200},
    Date-Modified = {2012-10-28 17:21:26 +0000},
    Publisher = {Springer},
    Series = {Annals of Information Systems},
    Title = {Semantic disclosure in an e-Science environment},
    Volume = {11},
    Year = {2009}}
INEX

A Generative Language Modeling Approach for Ranking Entities

We describe our participation in the INEX 2008 Entity Ranking track. We develop a generative language modeling approach for the entity ranking and list completion tasks. Our framework comprises the following components: (i) entity and (ii) query language models, (iii) an entity prior, (iv) the probability of an entity for a given category, and (v) the probability of an entity given another entity. We explore various ways of estimating these components and report on our results. We find that improving the estimation of these components has very positive effects on performance, yet there is room for further improvement.
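
A rough sketch of how the five components could be combined into a single ranking score is given below (in Python, in log space, using a plain product of the components). How each component is estimated and combined here, and the field names on the entity records, are assumptions for illustration rather than the exact model from the paper.

    import math

    def rank_entities(query_terms, entities, target_categories=None, example_entities=None):
        # Each entity record is assumed to provide:
        #   "lm"       : dict term -> P(term | entity language model)
        #   "prior"    : P(entity)
        #   "p_cat"    : dict category -> P(entity | category)
        #   "p_entity" : dict other entity -> P(entity | other entity)
        eps = 1e-12
        scores = {}
        for name, e in entities.items():
            score = math.log(e.get("prior", eps) + eps)            # (iii) entity prior
            for t in query_terms:                                  # (i)+(ii) query likelihood
                score += math.log(e["lm"].get(t, eps))
            for c in (target_categories or []):                    # (iv) category evidence
                score += math.log(e["p_cat"].get(c, eps))
            for ex in (example_entities or []):                    # (v) list-completion evidence
                score += math.log(e["p_entity"].get(ex, eps))
            scores[name] = score
        return sorted(scores, key=scores.get, reverse=True)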

  • [PDF] W. Weerkamp, K. Balog, and E. Meij, “A generative language modeling approach for ranking entities,” in Advances in Focused Retrieval, 2009.
    [Bibtex]
    @inproceedings{INEX:2008:weerkamp,
    Author = {Weerkamp, W. and Balog, K. and Meij, E.},
    Booktitle = {Advances in Focused Retrieval},
    Date-Added = {2011-10-16 12:29:08 +0200},
    Date-Modified = {2011-10-16 12:29:08 +0200},
    Organization = {Springer},
    Publisher = {Springer},
    Title = {A Generative Language Modeling Approach for Ranking Entities},
    Year = {2009}}