Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method

We address the issue of combining explicit background knowledge with pseudo-relevance feedback from within a document collection. To this end, we use document-level annotations in tandem with generative language models to generate terms from pseudo-relevant documents and bias the probability estimates of expansion terms in a principled manner. By applying the knowledge inherent in document annotations, we aim to control query drift and reap the benefits of automatic query expansion in terms of recall without losing precision. We consider the parameters which are associated with our modeling and describe ways of estimating these automatically. We then evaluate our modeling and estimation methods on two test collections, both provided by the TREC Genomics track.
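
To make the idea of biasing expansion-term estimates with document annotations concrete, here is a minimal sketch in Python. It is not the estimator from the paper, which the abstract does not spell out. It assumes a simple relevance-model-style mixture over pseudo-relevant documents, in which each document's weight is its retrieval score multiplied by a hypothetical annotation bonus. The function name `relevance_model`, the `boost` parameter, and the multiplicative form of the bias are all assumptions made for illustration.

```python
from collections import Counter


def relevance_model(feedback_docs, doc_scores, doc_concepts, query_concepts, boost=1.0):
    """Estimate P(term | R) from pseudo-relevant documents.

    feedback_docs  : list of non-empty token lists, one per feedback document
    doc_scores     : list of positive retrieval scores, one per document
    doc_concepts   : list of sets of annotation concepts, one per document
    query_concepts : set of concepts considered relevant to the query
    boost          : hypothetical strength of the annotation bias (0 disables it)
    """
    # Document weight = retrieval score, scaled up for every annotation
    # the document shares with the query concepts (assumed form).
    weights = []
    for score, concepts in zip(doc_scores, doc_concepts):
        overlap = len(concepts & query_concepts)
        weights.append(score * (1.0 + boost * overlap))
    total_weight = sum(weights)

    # Mix the maximum-likelihood term distributions of the documents.
    model = Counter()
    for tokens, weight in zip(feedback_docs, weights):
        counts = Counter(tokens)
        for term, count in counts.items():
            model[term] += (weight / total_weight) * (count / len(tokens))
    return dict(model)


if __name__ == "__main__":
    docs = [["gene", "expression", "gene"], ["stock", "market", "gene"]]
    scores = [0.5, 0.5]
    concepts = [{"Genes"}, {"Economics"}]
    model = relevance_model(docs, scores, concepts, {"Genes"})
    for term, prob in sorted(model.items(), key=lambda item: -item[1]):
        print(f"{term}\t{prob:.3f}")
```

In this sketch, terms from documents whose annotations match the query concepts gain probability mass at the expense of terms from unmatched documents, which is one simple way to limit query drift.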

  • E. Meij and M. de Rijke, “Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method,” in Proceedings of the 1st International Conference on Theory of Information Retrieval, 2007.
    Author = {E. Meij and de Rijke, M.},
    Booktitle = {Proceedings of the 1st International Conference on Theory of Information Retrieval},
    Series = {ICTIR 2007},
    Title = {{Integrating Conceptual Knowledge into Relevance Models: A Model and Estimation Method}},
    Year = {2007}}

Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments

We propose and evaluate a query expansion mechanism that supports searching and browsing in collections of annotated documents. Based on generative language models, our feedback mechanism uses document-level annotations to bias the generation of expansion terms and to generate browsing suggestions in the form of concepts selected from a controlled vocabulary (as typically used in digital library settings). We provide a detailed formalization of our feedback mechanism and evaluate its effectiveness using the TREC 2006 Genomics track test set. In terms of retrieval effectiveness, we find a 20% improvement in mean average precision over a query-likelihood baseline, while also increasing precision at 10. When we base the parameter estimation and feedback generation of our algorithm on a large corpus, we also find an improvement over state-of-the-art relevance models. The browsing suggestions are assessed along two dimensions: relevancy and specificity. We present an account of per-topic results, which helps to understand for which types of queries our feedback mechanism is particularly helpful.
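
For readers unfamiliar with the two measures mentioned above, the following Python sketch shows the standard definitions of precision at k and (mean) average precision. The function names and the toy data are illustrative only and are not taken from the paper or from any evaluation toolkit. Rankings are assumed to contain no duplicate document identifiers.

```python
def precision_at_k(ranking, relevant, k=10):
    """Fraction of the top-k ranked documents that are relevant."""
    return sum(1 for doc in ranking[:k] if doc in relevant) / k


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def mean_average_precision(runs):
    """Average of average_precision over (ranking, relevant) pairs, one per topic."""
    return sum(average_precision(ranking, relevant) for ranking, relevant in runs) / len(runs)


if __name__ == "__main__":
    ranking = ["a", "b", "c", "d"]
    relevant = {"a", "c"}
    print(precision_at_k(ranking, relevant, k=2))
    print(average_precision(ranking, relevant))
```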

  • E. Meij and M. de Rijke, “Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments,” in Research and Advanced Technology for Digital Libraries, 11th European Conference, ECDL 2007, 2007.
    Author = {Edgar Meij and Maarten de Rijke},
    Booktitle = {Research and Advanced Technology for Digital Libraries, 11th European Conference, ECDL 2007},
    Title = {Thesaurus-Based Feedback to Support Mixed Search and Browsing Environments},
    Year = {2007}}

Using Prior Information Derived from Citations in Literature Search

Researchers spend a large amount of their time searching through an ever-increasing number of scientific articles. Although users of scientific literature search engines prefer the ranking of results according to the number of citations a publication has received, it is unknown whether this notion of authoritativeness could also benefit more traditional and objective measures. Is it also an indicator of relevance, given an information need? In this paper, we examine the relationship between citation features of a scientific article and its prior probability of actually being relevant to an information need. We propose various ways of modeling this relationship and show how this kind of contextual information can be incorporated within a language modeling framework. We experiment with three document priors, which we evaluate on three distinct sets of queries and two document collections from the TREC Genomics track. Empirical results show that two of the proposed priors can significantly improve retrieval effectiveness, measured in terms of mean average precision.
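
The general mechanism of a document prior in a language modeling framework can be sketched as follows: documents are ranked by log P(d) + log P(q | d), where P(d) does not depend on the query. The abstract does not specify the three priors that were tested, so the prior below (proportional to one plus the citation count) is purely an assumption for illustration, as are the function names and the Jelinek-Mercer smoothing weight `lam`.

```python
import math
from collections import Counter


def log_citation_priors(citation_counts):
    """Return log P(d) per document, with P(d) proportional to 1 + citations.

    The add-one term is an assumed choice so uncited documents keep some mass.
    """
    total = sum(1 + count for count in citation_counts.values())
    return {doc: math.log((1 + count) / total) for doc, count in citation_counts.items()}


def query_log_likelihood(query, doc_tokens, collection_counts, collection_len, lam=0.5):
    """log P(q | d) with Jelinek-Mercer smoothing against the collection model."""
    counts = Counter(doc_tokens)
    score = 0.0
    for term in query:
        p_col = collection_counts[term] / collection_len
        if p_col == 0:
            continue  # skip terms that never occur in the collection
        p_doc = counts[term] / len(doc_tokens)
        score += math.log(lam * p_doc + (1 - lam) * p_col)
    return score


def rank(query, docs, citation_counts, lam=0.5):
    """Rank document ids by log prior plus smoothed query log-likelihood."""
    collection_counts = Counter(t for tokens in docs.values() for t in tokens)
    collection_len = sum(collection_counts.values())
    priors = log_citation_priors(citation_counts)
    scores = {
        doc: priors[doc] + query_log_likelihood(query, tokens, collection_counts, collection_len, lam)
        for doc, tokens in docs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    docs = {"d1": ["enzyme", "disease"], "d2": ["enzyme", "disease"]}
    print(rank(["enzyme"], docs, {"d1": 0, "d2": 50}))
```

With identical text, the more frequently cited document is ranked first; when the texts differ, the prior acts only as one factor alongside the query likelihood.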

  • E. Meij and M. de Rijke, “Using Prior Information Derived from Citations in Literature Search,” in RIAO 2007, 2007.
    Author = {Meij, E. and de Rijke, M.},
    Booktitle = {RIAO 2007},
    Title = {Using Prior Information Derived from Citations in Literature Search},
    Year = {2007}}

My first BioAID: heuristic support for hypothesis construction from literature


Constructing a new hypothesis is often the first step for a new cycle of experiments. A typical approach to harvesting biological literature is to scan the results of a PubMed query and read what we think is most relevant. In this scenario, we are limited by the selection of papers and, for future applications, we are limited by our capacity to recall the knowledge we have gained. As part of the development of a ‘virtual laboratory for bioinformatics,’ we seek alternative ways to support the construction of hypotheses from biological literature.


Our objective is to provide automated support for hypothesis formation from literature based on an initial seed of knowledge.


Our approach consists of the following steps: first we create a ‘proto-ontology’ from the knowledge that we want to extend, for instance, a table in a review that lists diseases associated with a particular enzyme. We then identify the collection of documents that we want to search (typically Medline). Subsequently, we use concepts from our proto-ontology as input to retrieve relevant documents from a collection and to inform us of concepts such as protein names or relationships that are putatively associated with the proto-ontology. These results are used to enrich the proto-ontology with additional concepts and relations. The ontology can be iteratively enriched by using the results from one run as input for the next.
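
The iterative enrichment loop described above can be illustrated with a toy Python sketch. It is not the AIDA implementation: documents are assumed to be already annotated with sets of concepts, retrieval is reduced to "shares at least one concept with the current ontology", and a co-occurrence threshold (`min_support`) stands in for the concept discovery step. All names and the example concepts other than EZH2 are hypothetical.

```python
from collections import Counter


def enrich(seed_concepts, annotated_docs, rounds=2, min_support=2):
    """Iteratively extend a set of concepts with co-occurring concepts.

    seed_concepts  : initial concepts (the 'proto-ontology' in this toy setting)
    annotated_docs : list of sets of concepts, one set per document
    rounds         : maximum number of enrichment iterations
    min_support    : documents in which a new concept must co-occur to be added
    """
    ontology = set(seed_concepts)
    for _ in range(rounds):
        # "Retrieve" every document mentioning at least one known concept.
        retrieved = [doc for doc in annotated_docs if doc & ontology]
        # Count candidate concepts that are not yet in the ontology.
        support = Counter(c for doc in retrieved for c in doc - ontology)
        new_concepts = {c for c, n in support.items() if n >= min_support}
        if not new_concepts:
            break  # nothing left to add
        ontology |= new_concepts  # results of this run feed the next one
    return ontology


if __name__ == "__main__":
    docs = [
        {"EZH2", "disease_A"},
        {"EZH2", "disease_A"},
        {"disease_A", "protein_B"},
        {"disease_A", "protein_B"},
        {"protein_C", "organism_D"},
    ]
    print(sorted(enrich({"EZH2"}, docs, rounds=1)))
    print(sorted(enrich({"EZH2"}, docs, rounds=2)))
```

The second round picks up a concept that never co-occurs with the seed directly, which mirrors the idea of bridging across the boundaries of single papers.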


Our implementation is based on a collection of web services, allowing us to construct custom workflows for specific tasks. Together, these web services form a toolbox called AIDA (Adaptive Information Disclosure Application), for annotating documents, searching documents, discovering knowledge from documents, and storing ontological data. AIDA uses open source software such as Lucene for document retrieval and Sesame for handling ontologies. For the purposes of this implementation, we have also used Protégé, as well as Taverna to construct our workflows.


We have created workflows from services in the AIDA toolbox, and applied them to extend a proto-ontology with knowledge extracted from literature. Technically, the most challenging workflow uses our own proto-ontology as input for machine learning services, after which biological concepts are discovered that are related to terms from our own ontology. As a proof of concept, we have (re)discovered diseases that are known to be related to EZH2, an enzyme associated with gene regulation via chromatin remodelling. A second workflow, which discovers genomics concepts, is used to identify proteins that might present a previously unreported link between two biological concepts, e.g. histones and transcription factors. The proto-ontology and enriched ontology are written in the Web Ontology Language OWL, and stored in Sesame via another service from the toolbox.


Services and workflows are available from Ontologies are available from


Workflows constructed from the AIDA toolbox can be used as an aid in constructing hypotheses from literature. We show that we can automatically extend a proto-ontology with new hypothetical concepts and relationships that bridge across the boundaries of single papers or biological subdomains. Our approach can be customized to particular domains and vocabularies through the choice of ontology and literature corpora.

  • M. Roos, S. Katrenko, W. R. van Hage, E. Meij, M. S. Marshall, and P. W. Adriaans, “My first BioAID: heuristic support for hypothesis construction,” in ISMB-ECCB'07, 2007.
    Author = {Roos, M. and Katrenko, S. and van Hage, W.R. and Meij, E. and Marshall, M.S. and Adriaans, P.W.},
    Booktitle = {ISMB-ECCB'07},
    Title = {My first BioAID: heuristic support for hypothesis construction},
    Year = {2007}}

Language Models for Enterprise Search: Query Expansion and Combination of Evidence

We describe our participation in the TREC 2006 Enterprise track. We provide a detailed account of the ideas underlying our language modeling approaches to both the discussion search and expert search tasks. For discussion search, our focus was on query expansion techniques, using additional information from the topic statement and from message threads; while the former was generally helpful, the latter mostly hurt performance. In expert search our main experiments concerned query expansion as well as combinations of expert finding and expert profiling techniques.

  • K. Balog, E. Meij, and M. de Rijke, “The University of Amsterdam at the TREC 2006 Enterprise Track,” in The Fifteenth Text REtrieval Conference, 2007.
    Author = {Balog, K. and Meij, E. and de Rijke, M.},
    Booktitle = {The Fifteenth Text REtrieval Conference},
    Series = {TREC 2006},
    Title = {{The University of Amsterdam at the TREC 2006 Enterprise Track}},
    Year = {2007}}

Expanding Queries Using Multiple Resources

We describe our participation in the TREC 2006 Genomics track, in which our main focus was on query expansion. We hypothesized that applying query expansion techniques would help us both to identify and retrieve synonymous terms, and to cope with ambiguity. To this end, we developed several collection-specific as well as web-based strategies. We also performed post-submission experiments, in which we compare various retrieval engines, such as Lucene, Indri, and Lemur, using a simple baseline topic set. When indexing entire paragraphs as pseudo-documents, we find that Lemur is able to achieve the highest document-, passage-, and aspect-level scores, using the KL-divergence method and its default settings. Additionally, we index the collection at a lower level of granularity, by creating pseudo-documents consisting of individual sentences. When we search these instead of paragraphs in Lucene, the passage-level scores improve considerably. Finally, we note that stemming improves overall scores by at least 10%.
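
The difference between paragraph-level and sentence-level pseudo-documents can be sketched in a few lines of Python. This is an illustration only, not the preprocessing used for the track submission: paragraphs are assumed to be separated by blank lines, sentences are split naively on end punctuation (which will mis-split abbreviations such as "e.g."), and the identifier scheme is made up.

```python
import re


def split_paragraphs(text):
    """Split text into paragraphs on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]


def split_sentences(paragraph):
    """Naive sentence split: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", paragraph) if s.strip()]


def pseudo_documents(doc_id, text, granularity="paragraph"):
    """Return (pseudo_doc_id, text) pairs to be indexed instead of the full document."""
    units = []
    for p_idx, paragraph in enumerate(split_paragraphs(text)):
        if granularity == "paragraph":
            units.append((f"{doc_id}:p{p_idx}", paragraph))
        elif granularity == "sentence":
            for s_idx, sentence in enumerate(split_sentences(paragraph)):
                units.append((f"{doc_id}:p{p_idx}:s{s_idx}", sentence))
        else:
            raise ValueError(f"unknown granularity: {granularity}")
    return units


if __name__ == "__main__":
    text = "Gene A binds B. It is regulated by C.\n\nProtein D is unrelated."
    for unit in pseudo_documents("doc1", text, granularity="sentence"):
        print(unit)
```

Keeping the paragraph index in each sentence identifier makes it possible to map sentence-level hits back to passages when scoring at the passage level.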

  • E. Meij, M. Jansen, and M. de Rijke, “Expanding Queries Using Multiple Resources (The AID Group at TREC 2006: Genomics Track),” in The Fifteenth Text REtrieval Conference, 2007.
    Author = {Meij, E. and Jansen, M. and de Rijke, M.},
    Booktitle = {The Fifteenth Text REtrieval Conference},
    Series = {TREC 2006},
    Title = {Expanding Queries Using Multiple Resources (The {AID} Group at {TREC} 2006: Genomics Track)},
    Year = {2007}}