Classifying Queries Submitted to a Vertical Search Engine

We propose and motivate a scheme for classifying queries submitted to a people search engine. We specify a number of features for automatically classifying people queries into the proposed classes and examine the effectiveness of these features. Our main finding is that classification is feasible and that using information from past searches, clickouts and news sources is important.
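
To make the feature-based setup concrete, here is a minimal sketch of such a classifier in Python; the feature values and class labels below are hypothetical illustrations, not the scheme or feature set from the paper:

    # A hypothetical feature-based classifier for people queries.
    # Feature values and class labels are illustrative only.
    from sklearn.linear_model import LogisticRegression

    # Per query: [frequency in past searches, clickout count, news-source hits]
    X = [
        [120, 35, 2],   # frequently searched, often clicked
        [3, 0, 14],     # suddenly prominent in the news
        [45, 10, 0],
        [1, 0, 0],      # rarely seen before
    ]
    y = ["frequent", "news", "frequent", "rare"]  # hypothetical classes

    clf = LogisticRegression(max_iter=1000).fit(X, y)
    print(clf.predict([[80, 20, 1]]))  # -> most likely "frequent"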

  • [PDF] R. Berendsen, B. Kovachev, E. Meij, M. de Rijke, and W. Weerkamp, “Classifying queries submitted to a vertical search engine,” in Web Science 2011, Koblenz, 2011.
    [Bibtex]
    @inproceedings{websci:2011:berendsen,
    Address = {Koblenz},
    Author = {Berendsen, R. and Kovachev, B. and Meij, E. and de Rijke, M. and Weerkamp, W.},
    Booktitle = {Web Science 2011},
    Date-Added = {2011-10-20 10:49:24 +0200},
    Date-Modified = {2012-10-30 08:39:05 +0000},
    Title = {Classifying Queries Submitted to a Vertical Search Engine},
    Year = {2011}}

Supervised query modeling using Wikipedia

In a web retrieval setting, there is a clear need for precision-enhancing methods. For example, the query “the secret garden” (a novel that has been adapted into movies and musicals) is easily led astray because of the generality of the individual query terms. While some methods address this issue at the document level, e.g., by using anchor texts or some function of the web graph, we are interested in improving the query; a prime example of such an approach is leveraging phrasal or proximity information. Besides degrading the user experience, another significant downside of a lack of precision is its negative impact on the effectiveness of pseudo relevance feedback methods. An example of this phenomenon can be observed for a query such as “indexed annuity”, where the richness of the financial domain, combined with the broad commercial use of the web, introduces unrelated terms. To address these issues, we propose a semantically informed manner of representing queries that uses supervised machine learning on Wikipedia. We train an SVM that automatically links queries to Wikipedia articles; these articles are subsequently used to update the query model.

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model. We, however, are interested in selecting the Wikipedia articles that best describe the query, and in sampling terms from those articles. This is similar to unsupervised approaches that have been used, e.g., in the context of blog retrieval, which simply consider a fixed number of pseudo-relevant Wikipedia articles. As we show, focusing this set using machine learning improves overall retrieval performance. In particular, we apply supervised machine learning to automatically link queries to Wikipedia articles and sample terms from the linked articles to re-estimate the query model. On a recent large web corpus, we observe substantial gains in terms of both traditional metrics and diversity measures.
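
As an illustration of the final step, the following Python sketch re-estimates a query model by interpolating the original query with terms sampled from linked Wikipedia articles; the interpolation weight and article text are illustrative, and the article selection (the SVM) is assumed to have happened already:

    # Interpolate the original query model with a model estimated from
    # the texts of the linked Wikipedia articles (illustrative weights).
    from collections import Counter

    def query_model(query_terms, linked_articles, lam=0.5):
        q_model = {t: 1.0 / len(query_terms) for t in query_terms}
        counts = Counter(w for a in linked_articles for w in a.split())
        total = sum(counts.values())
        wiki_model = {w: c / total for w, c in counts.items()}
        vocab = set(q_model) | set(wiki_model)
        return {w: lam * q_model.get(w, 0.0) + (1 - lam) * wiki_model.get(w, 0.0)
                for w in vocab}

    model = query_model(
        ["secret", "garden"],
        ["the secret garden is a novel by frances hodgson burnett"],
    )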

  • [PDF] E. Meij and M. de Rijke, “Supervised query modeling using Wikipedia,” in Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2010.
    [Bibtex]
    @inproceedings{SIGIR:2010:meij,
    Author = {Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Date-Added = {2012-05-03 22:16:10 +0200},
    Date-Modified = {2012-10-30 08:40:21 +0000},
    Series = {SIGIR 2010},
    Title = {Supervised query modeling using {Wikipedia}},
    Year = {2010},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1835449.1835660}}

Enabling Data Transport between Web Services

Despite numerous benefits, many Web Services (WS) face problems with respect to data transport, either because SOAP does not offer a scalable way of transporting large datasets or because orchestration workflows (WF) do not move data around efficiently. In this paper we address both problems with the development of the ProxyWS: a WS that utilizes protocols offered by the Virtual Resource System (VRS) to enable other WS to transfer and access large datasets without modifying the WS or the underlying environment.

There is currently an abundance of deployed (legacy) WS using SOAP, which fail to produce, access, and return large datasets. Moreover, orchestration WFs cause WS to pass messages containing data back through the WF engine. To address these problems we introduce the ProxyWS: a WS that is able to access data from remote resources (GridFTP, LFC, etc.), thanks to the VRS, and to transport large datasets produced by WS, both legacy and new. For the ProxyWS to provide large data transfers for legacy WS, it has to be deployed on the same Axis-based container, just like a normal WS. This enables clients to make proxy calls to the ProxyWS instead of calling the legacy WS directly; the ProxyWS then returns a SOAP message containing a URI referring to the data location. For new implementations the ProxyWS is used as an API that can create data streams from remote data resources and from other WS using the ProxyWS. This approach proved to be the most scalable, since a WS can process data as it is generated by the producing WS. Thus, with the introduction of the ProxyWS, we provide a separate channel for data transfers that allows for more scalable SOA-based applications.
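
The reference-passing pattern described above can be sketched as follows; all names, the staging location, and the transport here are illustrative stand-ins, not the actual ProxyWS API (which builds on VRS-supported protocols such as GridFTP):

    # Reference passing: the proxy invokes the wrapped service, stages the
    # (potentially large) result on a data resource, and returns only a URI
    # in the SOAP-sized reply. Names and transport are illustrative.
    def proxy_call(legacy_ws, request):
        payload = legacy_ws(request)             # invoke the legacy WS
        uri = stage_data(payload)                # move data out of band
        return {"status": "ok", "data_ref": uri}

    def stage_data(payload):
        # The real system would use VRS-supported protocols (e.g., GridFTP);
        # here we simply write to a local file and return its URI.
        path = "/tmp/proxyws_payload.dat"
        with open(path, "wb") as f:
            f.write(payload)
        return "file://" + path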

Many different approaches have been introduced in an attempt to address the problems mentioned earlier; examples include Styx Grid Services, Data Proxy Web Services for Taverna, and Flex-SwA. Noteworthy features of these approaches are direct streaming between WS, the use of alternative protocols for data transport, and large data delivery to legacy WS. However, each of these examples addresses only one part of the problem and, furthermore, does not provide any means of accessing remote data resources. Leveraging these existing proposals and combining them with the VRS, we implemented the ProxyWS. To validate it, we tested its performance using two data-intensive WFs. The first is a distributed indexing application that uses a set of WS to speed up the indexing of a large set of documents; the second relies on the created index to retrieve documents and recognize the protein names they contain. With the use of the ProxyWS we are able to retrieve data from remote locations (8.4 GB of documents for indexing), as well as to obtain more results for a query (8,300 documents using the ProxyWS versus 1,100 using SOAP).

We have presented the ProxyWS, which may be used to support large data transfers for legacy and new WS. We have verified its ability to deliver large datasets on two real-life tasks: indexing documents using WS in a distributed environment and annotating documents from an index. From our experiments we have found that the ProxyWS is able to facilitate data transports where normal SOAP messages would have failed. We have also demonstrated that with the use of the ProxyWS legacy WS can scale further, by avoiding data delivery via SOAP and by delivering data directly from the producing to the consuming WS.

  • [PDF] S. Koulouzis, E. Meij, and A. Belloum, “Enabling large data transfers between web services,” in 5th EGEE User Forum, 2010.
    [Bibtex]
    @inproceedings{EGEE:2010:koulouzis,
    Author = {Koulouzis, S. and Meij, E. and Belloum, A.},
    Booktitle = {5th EGEE User Forum},
    Date-Added = {2011-10-20 10:00:08 +0200},
    Date-Modified = {2011-10-20 10:00:08 +0200},
    Title = {Enabling Large Data Transfers Between Web Services},
    Year = {2010}}

A query model based on normalized log-likelihood

A query is usually a brief, sometimes imprecise expression of an underlying information need. Examining how queries can be transformed into equivalent, potentially better queries is a theme of recurring interest to the information retrieval community. Such transformations include expansion of short queries to long queries, paraphrasing queries using an alternative vocabulary, mapping unstructured queries to structured ones, identifying key concepts in verbose queries, etc.

To inform the transformation process, multiple types of information sources have been considered. A recent one is search engine logs, which have been used for query substitutions. Another recent example is where users complement their traditional keyword query with additional information, such as example documents, tags, images, categories, or their search history. The ultimate source of information for transforming a query, however, is the user, through relevance feedback: given a query and a set of judged documents for that query, how does a system take advantage of the judgments in order to transform the original query and retrieve more documents that will be useful to the user? As demonstrated by the recent launch of a dedicated relevance feedback track at TREC, we still lack the definitive answer to this question.

Let’s consider an example to see what aspects play a role in transforming a query based on judgments for a set of initially retrieved documents. Suppose we have a set of documents which are judged to be relevant to a query. These documents may vary in length and, furthermore, they need not be completely on topic because they may discuss more topics than the ones that are relevant to the query. While the users’ judgments are at the document level, not all of the documents’ sections can be assumed to be equally relevant. Most relevance feedback models that are currently available do not model or capture this phenomenon; instead, they attempt to transform the original query based on the full content of the documents. Clearly this is not ideal and we would like to account for the possibly multi-faceted character of documents. We hypothesize that a relevance feedback model that attempts to capture the topical structure of individual judged documents (“For each judged document, what is important about it?”) as well as of the set of all judged documents (“Which topics are shared by the entire set of judged documents?”) will outperform relevance feedback models that capture only one of these types of information.

We are working in a language modeling (LM) setting and our aim in this paper is to present an LM-based relevance feedback model that uses both types of information—about the topical relevance of a document and about the general topic of the set of relevant documents—to transform the original query. The proposed model uses the whole set of relevance assessments to determine how much each document that has been judged relevant should contribute to the query transformation. We use the TREC relevance feedback track test collection to evaluate our model and compare it to other, established relevance feedback methods. We answer the following two research questions in this paper. (i) Can we develop a relevance feedback model that uses evidence from both the individual relevant documents and the set of relevant documents as a whole? (ii) Can our new model achieve state-of-the-art results and how do these results compare to related models? We show that our model is able to significantly improve over state-of-the-art feedback methods.
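
A hedged sketch of the underlying idea, assuming simple maximum-likelihood document models: weight each judged document by a log-likelihood ratio that measures how much better the set of relevant documents explains its terms than the collection does, then mix the per-document models accordingly. The estimation details below are illustrative, not the paper's exact model:

    # Weight each judged document by a log-likelihood ratio of its terms
    # under the set model versus the collection model, then mix the
    # per-document models into a single feedback query model.
    import math
    from collections import Counter

    def doc_lm(text):
        counts = Counter(text.split())
        total = sum(counts.values())
        return {w: c / total for w, c in counts.items()}

    def llr(doc_model, set_model, coll_model, eps=1e-9):
        return sum(p * math.log(set_model.get(w, eps) / coll_model.get(w, eps))
                   for w, p in doc_model.items())

    def feedback_model(relevant_texts, coll_model):
        models = [doc_lm(t) for t in relevant_texts]
        vocab = set().union(*models)
        set_model = {w: sum(m.get(w, 0.0) for m in models) / len(models)
                     for w in vocab}
        weights = [max(llr(m, set_model, coll_model), 0.0) for m in models]
        z = sum(weights) or 1.0
        return {w: sum(wt * m.get(w, 0.0) for wt, m in zip(weights, models)) / z
                for w in vocab}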

  • [PDF] E. Meij, W. Weerkamp, and M. de Rijke, “A query model based on normalized log-likelihood,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
    [Bibtex]
    @inproceedings{CIKM:2009:Meij,
    Author = {Meij, Edgar and Weerkamp, Wouter and de Rijke, Maarten},
    Booktitle = {Proceedings of the 18th ACM Conference on Information and Knowledge Management},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:42:51 +0000},
    Series = {CIKM 2009},
    Title = {A query model based on normalized log-likelihood},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1645953.1646261}}

A Comparative Study of Features for Keyphrase Extraction

Keyphrases are short phrases that reflect the main topic of a document. Because manually annotating documents with keyphrases is a time-consuming process, several automatic approaches have been developed. Typically, candidate phrases are extracted using features such as position or frequency in the document text. Many different features have been suggested, and have been used individually or in combination. However, it is not clear which of these features are most informative for this task.

We address this issue in the context of keyphrase extraction from scientific literature. We introduce a new corpus that consists of full-text journal articles and is substantially larger than the data sets used in previous work. In addition, the rich collection and document structure available at the publishing stage is explicitly annotated. We suggest new features based on this structure and compare them to existing features, analyzing how the different features capture different aspects of the keyphrase extraction task.
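
As a concrete illustration of the two classic features mentioned above (position and frequency), consider the following sketch; the paper's actual feature set, including the structure-based features, is considerably richer:

    # Two classic candidate features: phrase frequency and (normalized)
    # position of first occurrence in the document.
    def phrase_features(doc_tokens, phrase):
        n = len(phrase)
        positions = [i for i in range(len(doc_tokens) - n + 1)
                     if doc_tokens[i:i + n] == phrase]
        return {
            "frequency": len(positions),
            "first_position": positions[0] / len(doc_tokens) if positions else 1.0,
        }

    tokens = "keyphrases reflect the main topic of a document".split()
    print(phrase_features(tokens, ["keyphrases"]))  # frequency 1, position 0.0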

  • [PDF] K. Hofmann, M. Tsagkias, E. Meij, and M. de Rijke, “The impact of document structure on keyphrase extraction,” in Proceedings of the 18th ACM Conference on Information and Knowledge Management, 2009.
    [Bibtex]
    @inproceedings{CIKM:2009:hofmann,
    Author = {Hofmann, Katja and Tsagkias, Manos and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 18th ACM Conference on Information and Knowledge Management},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:42:45 +0000},
    Series = {CIKM 2009},
    Title = {The impact of document structure on keyphrase extraction},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1645953.1646215}}

Learning Semantic Query Suggestions

An important application of semantic web technology is recognizing human-defined concepts in text. Query transformation is a strategy often used in search engines to derive queries that are able to return more useful search results than the original query and most popular search engines provide facilities that let users complete, specify, or reformulate their queries. We study the problem of semantic query suggestion, a special type of query transformation based on identifying semantic concepts contained in user queries. We use a feature-based approach in conjunction with supervised machine learning, augmenting term-based features with search history-based and concept-specific features. We apply our method to the task of linking queries from real-world query logs (the transaction logs of the Netherlands Institute for Sound and Vision) to the DBpedia knowledge base. We evaluate the utility of different machine learning algorithms, features, and feature types in identifying semantic concepts using a manually developed test bed and show significant improvements over an already high baseline. The resources developed for this paper, i.e., queries, human assessments, and extracted features, are available for download.
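
A minimal sketch of the feature-based linking setup, with hypothetical features and toy training data (the paper's term-based, history-based, and concept-specific features are more elaborate):

    # Score (query, concept) pairs with simple, hypothetical features and
    # a supervised classifier; labels mark whether the link is correct.
    from sklearn.tree import DecisionTreeClassifier

    def features(query, concept_label):
        q, c = set(query.lower().split()), set(concept_label.lower().split())
        return [len(q & c) / len(q),                      # term overlap
                1.0 if query.lower() == concept_label.lower() else 0.0]

    train = [("barack obama", "Barack Obama", 1),
             ("obama speech", "Barack Obama", 1),
             ("apple pie recipe", "Apple Inc.", 0)]
    X = [features(q, c) for q, c, _ in train]
    y = [label for _, _, label in train]
    clf = DecisionTreeClassifier().fit(X, y)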

  • [PDF] E. Meij, M. Bron, B. Huurnink, L. Hollink, and M. de Rijke, “Learning semantic query suggestions,” in Proceedings of the 8th International Conference on The Semantic Web, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:Meij,
    Author = {E. Meij and M. Bron and B. Huurnink and Hollink, L. and de Rijke, M.},
    Booktitle = {Proceedings of the 8th International Conference on The Semantic Web},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:04 +0000},
    Series = {ISWC 2009},
    Title = {Learning Semantic Query Suggestions},
    Year = {2009}}

Investigating the Semantic Gap through Query Log Analysis

Significant efforts have focused in the past years on bringing large amounts of metadata online, and the success of these efforts can be seen in the impressive number of web sites exposing data in RDFa or RDF/XML. However, little is known about the extent to which this data fits the needs of ordinary web users with everyday information needs. In this paper we study what we perceive as the semantic gap between the supply of data on the Semantic Web and the needs of web users as expressed in the queries submitted to a major Web search engine. We perform our analysis at both the level of instances and of ontologies. First, we look at how much of this data is actually relevant to Web queries and what kind of data it is. Second, we provide a generic method to extract the attributes that Web users are searching for regarding particular classes of entities. This method allows us to contrast class definitions found in Semantic Web vocabularies with the attributes of the objects that users are interested in. Our findings are crucial to measuring the potential of semantic search, but also speak to the state of the Semantic Web in general.
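
The attribute-extraction idea can be illustrated as follows: given queries that mention a known instance of a class, the remaining query terms are treated as candidate attributes of that class. The instance list and queries below are toy examples, not the method's actual implementation:

    # Treat the remainder of a query mentioning a known class instance as a
    # candidate attribute of that class (toy instance list and queries).
    from collections import Counter

    instances = {"canon eos": "camera", "nikon d90": "camera"}

    def class_attributes(queries):
        attrs = Counter()
        for q in queries:
            for name, cls in instances.items():
                if name in q:
                    rest = q.replace(name, "").strip()
                    if rest:
                        attrs[(cls, rest)] += 1
        return attrs

    print(class_attributes(["canon eos battery", "nikon d90 battery",
                            "canon eos price"]))
    # -> {('camera', 'battery'): 2, ('camera', 'price'): 1}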

  • [PDF] P. Mika, E. Meij, and H. Zaragoza, “Investigating the semantic gap through query log analysis,” in Proceedings of the 8th International Semantic Web Conference, 2009.
    [Bibtex]
    @inproceedings{ISWC:2009:mika,
    Author = {Peter Mika and Edgar Meij and Hugo Zaragoza},
    Booktitle = {Proceedings of the 8th International Semantic Web Conference},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:11 +0000},
    Series = {ISWC 2009},
    Title = {Investigating the Semantic Gap through Query Log Analysis},
    Year = {2009},
    Bdsk-Url-1 = {http://dblp.uni-trier.de/db/conf/semweb/iswc2009.html#MikaMZ09}}

An evaluation of entity and frequency based query completion methods

From the days of Boolean search on library catalogues, users have reformulated their queries after an inspection of initial search results. Traditional information retrieval studies this in frameworks such as query expansion, relevance feedback, interactive retrieval, etc. These methods mostly exploit document contents, because that is typically all the information that is available. The situation is very different in web search engines because of the large numbers of users whose queries are collected in query logs. Query logs reflect how large numbers of users express their queries and can be a rich source of information when optimizing search results or determining query suggestions.

In this paper we study a special case of query suggestion: query completion, which aims to help users complete their queries. In particular, we are interested in comparing a commonly adopted frequency-based approach with methods that exploit an understanding of the type of entities in queries. Our intuition is that completion for rare queries can be improved by understanding the type of entity being sought. For example, if we know that “LX354” is a kind of digital camera, we can generate sensible completions by choosing them from the set of completions used with other digital cameras. Besides suggesting queries, the obtained completions can also function as facets for faceted browsing or as input for ontology engineering since they represent query refinements common to a class of entities. In this paper, we address the following questions: (i) How can we recognize entities and their types in queries? (ii) How can we rank possible completions given an entity type? (iii) How can our methods be evaluated and how do they perform? To address (iii), we propose a novel method which evaluates the prediction of real web queries. We show that a purely frequency-based approach without any entity type information works quite well for more frequent queries, but is surpassed by type-based methods for rare queries.
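
The contrast between the two strategies can be sketched as follows; the query log, entity types, and rankings are toy illustrations:

    # Frequency-based completion uses only the log; type-based completion
    # pools refinements across entities of the same type, helping rare
    # entities such as "lx354" (toy data throughout).
    from collections import Counter

    log = ["canon eos battery", "canon eos price",
           "nikon d90 battery", "lx354 battery"]
    types = {"canon eos": "digital camera", "nikon d90": "digital camera",
             "lx354": "digital camera"}

    def freq_completions(prefix):
        return Counter(q for q in log if q.startswith(prefix)).most_common()

    def type_completions(entity):
        pool = Counter()
        for q in log:
            for e, t in types.items():
                if q.startswith(e) and t == types[entity]:
                    pool[q[len(e):].strip()] += 1
        return pool.most_common()

    print(freq_completions("lx354"))   # sparse: only the entity's own queries
    print(type_completions("lx354"))   # pooled: battery (3), price (1)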

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “An evaluation of entity and frequency based query completion methods,” in Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009.
    [Bibtex]
    @inproceedings{SIGIR:2009:meij,
    Author = {Meij, Edgar and Mika, Peter and Zaragoza, Hugo},
    Booktitle = {Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:25 +0000},
    Series = {SIGIR 2009},
    Title = {An evaluation of entity and frequency based query completion methods},
    Year = {2009},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1571941.1572074}}

A Generative Language Modeling Approach for Ranking Entities

We describe our participation in the INEX 2008 Entity Ranking track. We develop a generative language modeling approach for the entity ranking and list completion tasks. Our framework comprises the following components: (i) entity and (ii) query language models, (iii) entity prior, (iv) the probability of an entity for a given category, and (v) the probability of an entity given another entity. We explore various ways of estimating these components, and report on our results. We find that improving the estimation of these components has very positive effects on performance, yet, there is room for further improvements.
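
An illustrative way the five components could combine in a generative model is sketched below; the precise estimation of each probability is the paper's contribution, so this scoring function is a hedged sketch rather than the actual model:

    # Combine the five components multiplicatively (in log space); each
    # probability function is assumed given, estimated as in the paper.
    import math

    def score_entity(entity, query_terms, p_entity, p_term, p_cat, p_rel):
        log_p = math.log(p_entity(entity))       # (iii) entity prior
        log_p += math.log(p_cat(entity))         # (iv) entity given category
        log_p += math.log(p_rel(entity))         # (v) entity given another entity
        for t in query_terms:                    # (i)+(ii) query likelihood
            log_p += math.log(p_term(entity, t))
        return log_p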

  • [PDF] W. Weerkamp, K. Balog, and E. Meij, “A generative language modeling approach for ranking entities,” in Advances in Focused Retrieval, 2009.
    [Bibtex]
    @inproceedings{INEX:2008:weerkamp,
    Author = {Weerkamp, W. and Balog, K. and Meij, E.},
    Booktitle = {Advances in Focused Retrieval},
    Date-Added = {2011-10-16 12:29:08 +0200},
    Date-Modified = {2011-10-16 12:29:08 +0200},
    Organization = {Springer},
    Publisher = {Springer},
    Title = {A Generative Language Modeling Approach for Ranking Entities},
    Year = {2009}}