Combining Concepts and Language Models for Information Access

Since the middle of the last century, information retrieval has attracted increasing interest. Much of this research has been devoted to finding optimal ways of representing both documents and queries, as well as improving ways of matching one with the other. In cases where document annotations or explicit semantics are available, matching algorithms can be informed using the concept languages in which such semantics are usually defined. These algorithms are able to match queries and documents based on both textual and semantic evidence.

Recent advances have enabled the use of rich query representations in the form of query language models. This, in turn, allows us to account for the language associated with concepts within the retrieval model in a principled and transparent manner. Developments in the semantic web community, such as the Linked Open Data cloud, have enabled the association of texts with concepts on a large scale. Taken together, these developments facilitate a move beyond manually assigned concepts in domain-specific contexts into the general domain.

This thesis investigates how one can improve information access by employing the actual use of concepts, as measured by the language that people use when they discuss them. The main contribution is a set of models and methods that enable users to retrieve and access information on a conceptual level. Through extensive evaluations, we perform a systematic exploration and thorough analysis of the experimental results of the proposed models. Our empirical results show that a combination of top-down conceptual information and bottom-up statistical information obtains optimal performance on a variety of tasks and test collections.

See for more information.

  • [PDF] E. Meij, “Combining concepts and language models for information access,” PhD Thesis, 2010.
    Author = {Meij, Edgar},
    School = {University of Amsterdam},
    Title = {Combining Concepts and Language Models for Information Access},
    Year = {2010}}


Linking Archives with Semantic Search Engines (Archieven Linken met Semantische Zoekmachines)

Increasingly, large-scale archives are being made accessible to a broad audience. Prominent examples include the archives of national newspapers, national archives, government archives, archives managed by the Koninklijke Bibliotheek (the National Library of the Netherlands), television archives such as those managed by the Nationaal Instituut voor Beeld en Geluid (the Netherlands Institute for Sound and Vision) and, more generally, the archives of cultural heritage institutions.

An archive is not an island. Events described in a news archive gain an extra dimension when they are linked to visual material. Historical television footage gains in meaning when it is linked to contemporaneous commentary and news material from the printed press. And more specialized or technically oriented archives become more useful when they are linked to background information.

Research shows that end users benefit when links between archives are meaningful and preferably run along semantic lines, with a strong orientation toward entities (people, locations, organizations, artifacts, etc.), themes (such as “city life,” “festivities,” or “consumer culture”), and events (such as “Prague Spring,” “Opening of the Channel Tunnel,” or “Amsterdam Marathon”). Meaningful access to archives thus comes down to search and exploration technologies built around entities, themes, and events, plus their mutual relations.

Given the size of the archives that are or will become available, manual methods for creating the desired links, or for identifying entities, themes, and events in archival objects, are simply not realistic. An important development in research at the intersection of search engine technology and language technology is semantic search, in which the desired links between archives are created automatically along the axes mentioned above.

  • [PDF] M. de Rijke, K. Balog, M. Bron, J. He, B. Huurnink, V. B. Jijkoun, F. Laan, E. Meij, E. Tsagkias, A. Vishneuski, and W. Weerkamp, “Archieven linken met semantische zoekmachines,” Dixit (tijdschrift over toegepaste taal- en spraaktechnologie), vol. 7, iss. 1, pp. 7-9, 2010.
    Author = {de Rijke, M. and Balog, K. and Bron, M. and He, J. and Huurnink, B. and Jijkoun, V.B. and Laan, F. and Meij, E. and Tsagkias, E. and Vishneuski, A. and Weerkamp, W.},
    Journal = {DIXIT (Tijdschrift over toegepaste taal- en spraaktechnologie)},
    Number = {1},
    Pages = {7-9},
    Title = {Archieven Linken met Semantische Zoekmachines},
    Volume = {7},
    Year = {2010}}

Supervised query modeling using Wikipedia

In a web retrieval setting, there is a clear need for precision-enhancing methods. For example, the query “the secret garden” (a novel that has been adapted into movies and musicals) is easily led astray because of the generality of the individual query terms. While some methods address this issue at the document level, e.g., by using anchor texts or some function of the web graph, we are interested in improving the query; a prime example of such an approach is leveraging phrasal or proximity information. Besides degrading the user experience, another significant downside of a lack of precision is its negative impact on the effectiveness of pseudo-relevance feedback methods. An example of this phenomenon can be observed for a query such as “indexed annuity,” where the richness of the financial domain plus the broad commercial use of the web introduces unrelated terms. To address these issues, we propose a semantically informed manner of representing queries that uses supervised machine learning on Wikipedia. We train an SVM that automatically links queries to Wikipedia articles, which are subsequently used to update the query model.

Wikipedia and supervised machine learning have previously been used to select optimal terms to include in the query model. We, however, are interested in selecting those Wikipedia articles that best describe the query and using them to sample terms from. This is similar to the unsupervised approach used, e.g., in the context of retrieving blogs. Such approaches are completely unsupervised in that they only consider a fixed number of pseudo-relevant Wikipedia articles. As we show, focusing this set using machine learning improves overall retrieval performance. In particular, we apply supervised machine learning to automatically link queries to Wikipedia articles and sample terms from the linked articles to re-estimate the query model. On a recent large web corpus, we observe substantial gains in terms of both traditional metrics and diversity measures.
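The re-estimation step described above can be illustrated with a minimal sketch. It assumes the linking step has already selected the articles (the SVM itself is not shown), uses plain maximum-likelihood term estimates, and mixes them with the original query through a hypothetical weight `lam`. The function names and the toy article text are illustrative assumptions, not the paper's actual estimator.

```python
from collections import Counter


def term_distribution(texts):
    """Maximum-likelihood term distribution over whitespace-tokenized texts."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {term: n / total for term, n in counts.items()}


def update_query_model(query, linked_articles, lam=0.5):
    """Interpolate the original query model with terms from the linked articles.

    lam is a hypothetical mixing weight; lam = 1.0 keeps only the original query.
    """
    original = term_distribution([query])
    if not linked_articles:
        # Nothing was linked, so fall back to the original query model.
        return original
    expansion = term_distribution(linked_articles)
    vocabulary = set(original) | set(expansion)
    return {
        term: lam * original.get(term, 0.0) + (1 - lam) * expansion.get(term, 0.0)
        for term in vocabulary
    }


# Toy stand-in for the text of an article that the classifier linked to the query.
articles = ["the secret garden is a novel by frances hodgson burnett"]
model = update_query_model("the secret garden", articles, lam=0.5)
```

In this sketch, terms that occur only in the linked article (such as "novel") receive non-zero probability in the updated model, while the original query terms keep most of the mass.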

  • [PDF] E. Meij and M. de Rijke, “Supervised query modeling using Wikipedia,” in Proceedings of the 33rd international acm sigir conference on research and development in information retrieval, 2010.
    Author = {Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 33rd International ACM SIGIR Conference on Research and Development in Information Retrieval},
    Series = {SIGIR 2010},
    Title = {Supervised query modeling using {Wikipedia}},
    Year = {2010}}

Conceptual language models for domain-specific retrieval

Over the years, various meta-languages have been used to manually enrich documents with conceptual knowledge of some kind. Examples include keyword assignment to citations or, more recently, tags to websites. In this paper we propose generative concept models as an extension to query modeling within the language modeling framework, which leverages these conceptual annotations to improve retrieval. By means of relevance feedback the original query is translated into a conceptual representation, which is subsequently used to update the query model.

Extensive experimental work on five test collections in two domains shows that our approach gives significant improvements in terms of recall, initial precision and mean average precision with respect to a baseline without relevance feedback. On one test collection, it is also able to outperform a text-based pseudo-relevance feedback approach based on relevance models. On the other test collections it performs similarly to relevance models. Overall, conceptual language models have the added advantage of offering query and browsing suggestions in the form of conceptual annotations. In addition, the internal structure of the meta-language can be exploited to add related terms.

Our contributions are threefold. First, we conduct an extensive study of how to effectively translate a textual query into a conceptual representation. Second, we propose a method for updating a textual query model using the concepts in the conceptual representation. Finally, we provide an extensive analysis of when and how this conceptual feedback improves retrieval.
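The two steps described above, translating the query into concepts through feedback documents and then updating the query model from those concepts, might be sketched as follows. This is a simplified illustration under stated assumptions: it uses unsmoothed maximum-likelihood estimates, and the function names, the mixing weight `lam`, and the toy annotated documents are all hypothetical rather than the estimators used in the paper.

```python
from collections import Counter, defaultdict


def concept_model(feedback_docs):
    """Estimate P(concept | query) from the annotations of the feedback documents."""
    counts = Counter(c for doc in feedback_docs for c in doc["concepts"])
    total = sum(counts.values())
    return {concept: n / total for concept, n in counts.items()}


def concept_term_models(collection):
    """Estimate P(term | concept) from the text of documents annotated with it."""
    term_counts = defaultdict(Counter)
    for doc in collection:
        tokens = doc["text"].lower().split()
        for concept in doc["concepts"]:
            term_counts[concept].update(tokens)
    models = {}
    for concept, counts in term_counts.items():
        total = sum(counts.values())
        models[concept] = {term: n / total for term, n in counts.items()}
    return models


def expanded_query_model(query, feedback_docs, collection, lam=0.5):
    """Mix the original query with terms generated through the query's concepts."""
    p_concept = concept_model(feedback_docs)
    p_term_given_concept = concept_term_models(collection)
    tokens = query.lower().split()
    p_original = {term: n / len(tokens) for term, n in Counter(tokens).items()}
    # Marginalize over concepts: sum_c P(term | c) * P(c | query).
    p_expansion = defaultdict(float)
    for concept, pc in p_concept.items():
        for term, pt in p_term_given_concept.get(concept, {}).items():
            p_expansion[term] += pc * pt
    vocabulary = set(p_original) | set(p_expansion)
    return {
        term: lam * p_original.get(term, 0.0) + (1 - lam) * p_expansion.get(term, 0.0)
        for term in vocabulary
    }


# Toy annotated collection; the first two documents play the role of feedback documents.
collection = [
    {"text": "aspirin reduces fever", "concepts": ["Aspirin"]},
    {"text": "aspirin relieves pain", "concepts": ["Aspirin", "Analgesics"]},
    {"text": "ibuprofen relieves pain", "concepts": ["Analgesics"]},
]
model = expanded_query_model("pain relief", collection[:2], collection)
```

The intermediate output of `concept_model` is also what would be shown to a user as conceptual query or browsing suggestions in this simplified view.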

  • [PDF] [DOI] E. Meij, D. Trieschnigg, M. de Rijke, and W. Kraaij, “Conceptual language models for domain-specific retrieval,” Inf. process. manage., vol. 46, iss. 4, pp. 448-469, 2010.
    Address = {Tarrytown, NY, USA},
    Author = {Meij, Edgar and Trieschnigg, Dolf and de Rijke, Maarten and Kraaij, Wessel},
    Issn = {0306-4573},
    Journal = {Inf. Process. Manage.},
    Number = {4},
    Pages = {448--469},
    Publisher = {Pergamon Press, Inc.},
    Title = {Conceptual language models for domain-specific retrieval},
    Volume = {46},
    Year = {2010}}

Entity Search: Building Bridges between Two Worlds

We have come to depend on technological resources to create order and find meaning in the ever-growing amount of online data. One frequently recurring type of query in web search is the query containing named entities (persons, organizations, locations, etc.): we organize our environments around entities that are meaningful to us. Hence, to support humans in dealing with massive volumes of data, next-generation search engines need to organize information in semantically meaningful ways, structured around entities. Furthermore, instead of merely finding documents that mention an entity, finding the entity itself is required.

The problem of entity search has been and is being looked at by both the Information Retrieval (IR) and Semantic Web (SW) communities and is, in fact, ranked high on the research agendas of the two communities. The entity search task comes in several flavors. One is known as entity ranking (given a query and target category, return a ranked list of relevant entities), another is list completion (given a query and example entities, return similar entities), and a third is related entity finding (given a source entity, a relation and a target type, identify target entities that enjoy the specified relation with the source entity and that satisfy the target type constraint).

State-of-the-art IR models allow us to address entity search by identifying relevant entities in large volumes of web data. These methods often approach entity-oriented retrieval tasks by establishing associations between topics, documents, and entities or amongst entities themselves, where such associations are modeled by observing the language usage around entities. A major challenge with current IR approaches to entity retrieval is that they fail to produce interpretable descriptions of the found entities or of the relationships between entities. The generated models tend to lack human-interpretable semantics and are rarely meaningful for human consumption: interpretable labels are needed (both for entities and for relations). Linked Open Data (LOD) is a recent contribution of the emerging semantic web that has the potential of providing the required semantic information.

From a SW point of view, entity retrieval should be as simple as running SPARQL queries over structured data. However, since a true semantic web still has not been fully realized, the results of such queries are currently not sufficient to answer common information needs. By now, the LOD cloud contains millions of concepts from over one hundred structured data sets. This abundance, however, also introduces novel issues such as “cheap semantics” (e.g. wikilink relations in DBpedia) and the need for ranking potentially very large amounts of results. Furthermore, given the fact that most web users are not proficient users of semantic web languages such as SPARQL or standards such as RDF and OWL, the free-form text input used by most IR systems is more appealing to end users.

These concurrent developments give rise to the following general question: to what extent are state-of-the-art IR and SW technologies capable of answering information needs related to entity finding? In this paper we focus on the task of related entity finding (REF). For example, given a source entity (“Michael Schumacher”), a relation (“Michael’s teammates while he was racing in Formula 1”) and a target type (“people”), a REF system should return entities such as “Eddie Irvine” and “Felipe Massa.” REF aims at making arbitrary relations between entities searchable. We focus on an adaptation of the official task as it was run at TREC 2009 and restrict the target entities to those having a primary Wikipedia article: this modification provides an elegant way of making the IR and SW results comparable.

From an IR perspective, a natural way of capturing the relation between a source and target entity is based on their co-occurrence in suitable contexts. Later, we use an aggregate of methods all of which are based on this approach. In contrast, a SW perspective on the same task is to search for entities through links such as the ones in LOD and for this we apply both standard SPARQL queries and an exhaustive graph search algorithm.
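The contrast between the two perspectives can be made concrete with a small sketch. It is only an illustration under stated assumptions: the IR side is reduced to counting shared contexts, the SW side to a breadth-first search over in-memory triples rather than SPARQL over LOD, and the function names, the `ex:` predicates, and the toy sentences and triples are all hypothetical.

```python
from collections import Counter, deque


def cooccurrence_rank(source, candidates, contexts):
    """IR view: rank candidates by how many contexts they share with the source."""
    scores = Counter()
    for text in contexts:
        if source in text:
            for candidate in candidates:
                if candidate in text:
                    scores[candidate] += 1
    return [candidate for candidate, _ in scores.most_common()]


def linked_entities(source, triples, max_hops=1):
    """SW view: breadth-first search over triples, treating links as undirected."""
    neighbours = {}
    for subject, _, obj in triples:
        neighbours.setdefault(subject, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subject)
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return seen - {source}


# Toy contexts and triples, invented for illustration only.
contexts = [
    "Michael Schumacher and Eddie Irvine were teammates.",
    "Eddie Irvine raced alongside Michael Schumacher.",
    "Felipe Massa was a teammate of Michael Schumacher.",
    "An unrelated sentence about the Amsterdam Marathon.",
]
triples = [
    ("Michael Schumacher", "ex:teammateOf", "Eddie Irvine"),
    ("Felipe Massa", "ex:teammateOf", "Michael Schumacher"),
    ("Amsterdam", "ex:hosts", "Amsterdam Marathon"),
]
candidates = ["Eddie Irvine", "Felipe Massa", "Amsterdam"]
ranked = cooccurrence_rank("Michael Schumacher", candidates, contexts)
linked = linked_entities("Michael Schumacher", triples)
```

Note that the co-occurrence function yields a ranking but no label for the relation, whereas the graph search yields labeled links but no ranking, which mirrors the complementarity discussed in the text.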

In this paper, we analyze and discuss to what extent REF can be solved by IR and SW methods. It is important to note that our goal is not to perform a quantitative comparison or to claim that one approach is better than the other. Rather, we investigate the results returned by either approach and perform a more qualitative evaluation. We find that IR and SW methods discover different, although overlapping, sets of entities. Based on the results of our evaluation, we demonstrate that the two approaches are complementary in nature and we discuss how each field could potentially benefit from the other. We arrive at and motivate a proposal to combine text-based entity models with semantic information from the Linking Open Data cloud.

  • [PDF] K. Balog, E. Meij, and M. de Rijke, “Entity search: building bridges between two worlds,” in Proceedings of the 3rd international semantic search workshop, 2010.
    Author = {Balog, Krisztian and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 3rd International Semantic Search Workshop},
    Series = {SEMSEARCH 2010},
    Title = {Entity search: building bridges between two worlds},
    Year = {2010}}

Enabling Data Transport between Web Services

Despite numerous benefits, many Web Services (WS) face problems with respect to data transport, either because SOAP doesn’t offer a scalable way of transporting large datasets or because orchestration workflows (WF) don’t move data around efficiently. In this paper we address both problems with the development of the ProxyWS. This is a WS that uses protocols offered by the Virtual Resource System (VRS) to enable other WS to transfer and access large datasets without modifying either the WS or the underlying environment.

There is currently an abundance of deployed (legacy) WS using SOAP, which fail to produce, access, and return large datasets. Moreover, orchestration WF cause WS to pass messages containing data back through the WF engine. To address these problems we introduce the ProxyWS: a WS that is able to access data from remote resources (GridFTP, LFC, etc.), thanks to the VRS, and also transport larger data produced by WS, both legacy and new. For the ProxyWS to be able to provide larger data transfers to legacy WS, it has to be deployed on the same Axis-based container, just like a normal WS. This enables clients to make proxy calls to the ProxyWS instead of a legacy WS. As a consequence, the ProxyWS returns a SOAP message containing a URI referring to the data location. For new implementations, the ProxyWS is used as an API that can create data streams from remote data resources and other WS using the ProxyWS. This approach proved to be the most scalable, since WS can process data as they are generated by the producing WS. Thus, with the introduction of the ProxyWS we are able to provide a separate channel for data transfers that allows for more scalable SOA-based applications.
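The idea of returning a reference instead of the payload can be illustrated with a small, self-contained sketch. It is not the ProxyWS or VRS API: the function names are hypothetical, a plain Python function stands in for the legacy WS, a dictionary stands in for the SOAP response, and a local `file://` URI stands in for a remote location such as a GridFTP resource.

```python
import tempfile
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import url2pathname


def legacy_service(n_records):
    """Stand-in for a legacy WS that produces a large payload."""
    return "\n".join(f"record-{i}" for i in range(n_records)).encode("utf-8")


def proxy_call(service, *args, data_dir):
    """Call the service, store its output, and return only a small reference."""
    payload = service(*args)
    with tempfile.NamedTemporaryFile(dir=data_dir, suffix=".dat", delete=False) as fh:
        fh.write(payload)
        path = Path(fh.name).resolve()
    # The message carries a URI and a size, never the data itself.
    return {"data_uri": path.as_uri(), "size_bytes": len(payload)}


def fetch(reference):
    """Consumer side: resolve the URI and read the data outside the message channel."""
    path = Path(url2pathname(urlparse(reference["data_uri"]).path))
    return path.read_bytes()


data_dir = tempfile.mkdtemp()
reference = proxy_call(legacy_service, 1000, data_dir=data_dir)
data = fetch(reference)
```

The message that passes through the workflow engine stays the same small size regardless of how much data the service produced, which is the property the separate data channel is meant to provide.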

Many different approaches have been introduced in an attempt to address the problems mentioned earlier. Examples of these include Styx Grid Services, Data Proxy Web services for Taverna and Flex-SwA. Some noteworthy features of these approaches are direct streaming between WS, the use of alternative protocols for data transport, and delivery of larger data to legacy WS. However, each of these examples addresses only one part of the problem and, furthermore, does not include any means of allowing access to remote data resources. Leveraging these existing proposals and combining them with the VRS, we implemented a ProxyWS. To validate it, we tested its performance using two data-intensive WF. The first is a distributed indexing application that uses a set of WS to speed up the indexing of a large set of documents, while the second relies on that index to retrieve and recognize protein names contained in the results of a query. With the use of the ProxyWS we are able to retrieve data from remote locations (8.4 GB of documents for indexing), as well as to obtain more results for a query (8300 documents using the ProxyWS versus 1100 using SOAP).

We have presented the ProxyWS, which may be used to support large data transfers for legacy and new WS. We have verified its ability to deliver large datasets on two real-life tasks: indexing using WS in a distributed environment and annotating documents from an index. From our experiments we have found that the ProxyWS is able to facilitate data transports where normal SOAP messages would have failed. We have also demonstrated that, with the use of the ProxyWS, legacy WS can scale further by avoiding data delivery via SOAP and by delivering data directly from the producing to the consuming WS.

  • [PDF] S. Koulouzis, E. Meij, and A. Belloum, “Enabling large data transfers between web services,” in 5th egee user forum, 2010.
    Author = {Koulouzis, S. and Meij, E. and Belloum, A.},
    Booktitle = {5th EGEE User Forum},
    Title = {Enabling Large Data Transfers Between Web Services},
    Year = {2010}}

Heuristic Ranking and Diversification of Web Documents

We describe the participation of the University of Amsterdam’s Intelligent Systems Lab in the web track at TREC 2009. We participated in the ad hoc and diversity tasks. We find that spam is an important issue in the ad hoc task and that Wikipedia-based heuristic optimization approaches help to boost retrieval performance, which we assume is due to a potential reduction of spam in the top-ranked results. As for the diversity task, we explored different methods. Clustering and a topic model-based approach perform similarly, and both perform better than a query log-based approach.

  • [PDF] J. He, K. Balog, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “Heuristic ranking and diversification of web documents,” in The eighteenth text retrieval conference, 2010.
    Author = {He, J. and Balog, K. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Eighteenth Text REtrieval Conference},
    Series = {TREC 2009},
    Title = {Heuristic Ranking and Diversification of Web Documents},
    Year = {2010}}