My first BioAID: heuristic support for hypothesis construction from literature

Motivation

Constructing a new hypothesis is often the first step for a new cycle of experiments. A typical approach to harvesting biological literature is to scan the results of a PubMed query and read what we think is most relevant. In this scenario, we are limited by the selection of papers and, for future applications, we are limited by our capacity to recall the knowledge we have gained. As part of the development of a ‘virtual laboratory for bioinformatics,’ we seek alternative ways to support the construction of hypotheses from biological literature.

Objectives

Our objective is to provide automated support for hypothesis formation from literature based on an initial seed of knowledge.

Approach

Our approach consists of the following steps. First, we create a ‘proto-ontology’ from the knowledge that we want to extend, for instance, a table in a review that lists diseases associated with a particular enzyme. We then identify the collection of documents that we want to search (typically Medline). Subsequently, we use concepts from our proto-ontology as input to retrieve relevant documents from the collection and to identify concepts, such as protein names or relationships, that are putatively associated with the proto-ontology. These results are used to enrich the proto-ontology with additional concepts and relations. The ontology can be enriched iteratively by using the results from one run as input for the next.
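The iterative enrichment loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the AIDA implementation: retrieval is plain substring matching instead of a Lucene index, concept discovery is a simple document-frequency threshold instead of the machine learning services, and the function names (`retrieve`, `discover_concepts`, `enrich`), the toy corpus, and the placeholder concepts (`DiseaseA`, `ProcessB`, and so on) are all invented for the example.

```python
from collections import Counter

def retrieve(corpus, concepts):
    """Return documents that mention at least one known concept (case-insensitive)."""
    lowered = {c.lower() for c in concepts}
    return [doc for doc in corpus if any(c in doc.lower() for c in lowered)]

def discover_concepts(docs, vocabulary, known, min_docs=2):
    """Propose vocabulary terms, not yet known, that occur in at least `min_docs` documents."""
    counts = Counter()
    for doc in docs:
        text = doc.lower()
        for term in vocabulary:
            if term not in known and term.lower() in text:
                counts[term] += 1
    return {term for term, n in counts.items() if n >= min_docs}

def enrich(seed, corpus, vocabulary, max_rounds=3):
    """Grow the concept set, feeding the output of each round into the next."""
    ontology = set(seed)
    for _ in range(max_rounds):
        docs = retrieve(corpus, ontology)
        new_concepts = discover_concepts(docs, vocabulary, ontology)
        if not new_concepts:  # nothing new was found, so further rounds cannot help
            break
        ontology |= new_concepts
    return ontology

# Toy data: invented sentences with placeholder concept names (assumption).
SEED = {"EZH2"}
VOCABULARY = ["DiseaseA", "ProcessB", "DiseaseC", "DiseaseD"]
CORPUS = [
    "EZH2 is discussed together with DiseaseA.",
    "DiseaseA samples mention EZH2.",
    "EZH2 is linked to ProcessB.",
    "ProcessB involves EZH2.",
    "ProcessB is altered in DiseaseC.",
    "DiseaseC samples show changes in ProcessB.",
    "DiseaseD is described without any of the other terms.",
]

if __name__ == "__main__":
    print(sorted(enrich(SEED, CORPUS, VOCABULARY)))
```

In this toy run, `DiseaseC` is never mentioned together with the seed concept; it only becomes reachable in the second round, once `ProcessB` has been added. That is the sense in which iterating can bridge across individual documents.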

Implementation

Our implementation is based on a collection of web services, allowing us to construct custom workflows for specific tasks. Together, these web services form a toolbox called AIDA (Adaptive Information Disclosure Application) for annotating documents, searching documents, discovering knowledge from documents, and storing ontological data. AIDA uses open source software such as Lucene for document retrieval and Sesame for handling ontologies. For the purposes of this implementation, we have also used Taverna to construct our workflows, together with the ontology editor Protégé.

Results

We have created workflows from services in the AIDA toolbox and applied them to extend a proto-ontology with knowledge extracted from literature. Technically, the most challenging workflow uses our own proto-ontology as input for machine learning services, which then discover biological concepts related to terms from our ontology. As a proof of concept, we have (re)discovered diseases that are known to be related to EZH2, an enzyme associated with gene regulation via chromatin remodelling. A second workflow, which discovers genomics concepts, is used to identify proteins that might present a previously unreported link between two biological concepts, e.g. histones and transcription factors. The proto-ontology and the enriched ontology are written in the Web Ontology Language (OWL) and stored in Sesame via another service from the toolbox.
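The idea behind the second workflow, finding a protein that may link two concepts, can be illustrated with a simple co-occurrence heuristic. This is an assumption made for illustration and not the method used by the AIDA services: a candidate counts as a possible bridge when it shares at least one document with each of the two concepts. The function names, the toy sentences, and the protein names `ProteinX`, `ProteinY`, and `ProteinZ` are hypothetical placeholders.

```python
def cooccurring(corpus, concept, candidates):
    """Return the candidates that share at least one document with `concept`."""
    concept = concept.lower()
    hits = set()
    for doc in corpus:
        text = doc.lower()
        if concept in text:
            hits |= {c for c in candidates if c.lower() in text}
    return hits

def bridging_candidates(corpus, concept_a, concept_b, candidates):
    """Return candidates that co-occur with both concepts, sorted by name."""
    linked_to_a = cooccurring(corpus, concept_a, candidates)
    linked_to_b = cooccurring(corpus, concept_b, candidates)
    return sorted(linked_to_a & linked_to_b)

# Toy data: invented sentences with placeholder protein names (assumption).
CANDIDATES = ["ProteinX", "ProteinY", "ProteinZ"]
TOY_CORPUS = [
    "ProteinX binds histones in vitro.",
    "ProteinX interacts with several transcription factors.",
    "ProteinY modifies histones.",
    "ProteinZ regulates transcription factors.",
]

if __name__ == "__main__":
    print(bridging_candidates(TOY_CORPUS, "histones", "transcription factors", CANDIDATES))
```

Only `ProteinX` appears with both concepts here, although no single toy sentence mentions histones and transcription factors together. A real workflow would need entity recognition and some measure of association strength rather than bare substring matches.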

Availability

Services and workflows are available from http://ws.adaptivedisclosure.org/BioAIDdemo1. Ontologies are available from http://rdf.adaptivedisclosure.org/BioAIDdemo1.

Conclusion

Workflows constructed from the AIDA toolbox can be used as an aid in constructing hypotheses from literature. We show that we can automatically extend a proto-ontology with new hypothetical concepts and relationships that bridge across the boundaries of single papers or biological subdomains. Our approach can be customized to particular domains and vocabularies through the choice of ontology and literature corpora.

  • [PDF] M. Roos, S. Katrenko, W. R. van Hage, E. Meij, M. S. Marshall, and P. W. Adriaans, “My first BioAID: heuristic support for hypothesis construction,” in ISMB-ECCB’07, 2007.
    [Bibtex]
    @inproceedings{ISMB:2007:Roos,
    Author = {Roos, M. and Katrenko, S. and van Hage, W.R. and Meij, E. and Marshall, M.S. and Adriaans, P.W.},
    Booktitle = {ISMB-ECCB'07},
    Date-Added = {2011-10-13 08:56:20 +0200},
    Date-Modified = {2011-10-13 08:56:20 +0200},
    Title = {My first BioAID: heuristic support for hypothesis construction},
    Year = {2007}}
TREC

Expanding Queries Using Multiple Resources

We describe our participation in the TREC 2006 Genomics track, in which our main focus was on query expansion. We hypothesized that applying query expansion techniques would help us both to identify and retrieve synonymous terms and to cope with ambiguity. To this end, we developed several collection-specific as well as web-based strategies. We also performed post-submission experiments, in which we compared various retrieval engines, such as Lucene, Indri, and Lemur, using a simple baseline topic set. When indexing entire paragraphs as pseudo-documents, we find that Lemur achieves the highest document-, passage-, and aspect-level scores, using the KL-divergence method and its default settings. Additionally, we index the collection at a finer level of granularity by creating pseudo-documents consisting of individual sentences. When we search these instead of paragraphs in Lucene, the passage-level scores improve considerably. Finally, we note that stemming improves overall scores by at least 10%.
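The combination of query expansion, sentence-level pseudo-documents, and KL-divergence ranking can be sketched as follows. This is a self-contained toy under stated assumptions, not the TREC system: the synonym table is hand-written rather than derived from the collection or the web, the sentence splitter is a naive regular expression, the smoothing parameter `mu=10.0` is chosen for the tiny example and is not Lemur's default, and all function names and sentences are invented. The score is the query-weighted log-likelihood under a Dirichlet-smoothed document model, which ranks documents in the same order as the negative KL divergence between the query model and the document model.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_sentences(paragraph):
    """Naive splitter: break after '.', '!' or '?' when followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]

def expand_query(query, synonyms):
    """Append the synonyms of every query token (hypothetical hand-made table)."""
    tokens = tokenize(query)
    expanded = list(tokens)
    for token in tokens:
        for synonym in synonyms.get(token, []):
            expanded.extend(tokenize(synonym))
    return expanded

def kl_score(query_tokens, doc_tokens, coll_counts, coll_len, mu=10.0):
    """Sum of P(w|Q) * log P(w|D) with Dirichlet smoothing (higher is better)."""
    query_counts = Counter(query_tokens)
    doc_counts = Counter(doc_tokens)
    query_len = sum(query_counts.values())
    score = 0.0
    for word, freq in query_counts.items():
        p_coll = coll_counts[word] / coll_len
        if p_coll == 0.0:
            continue  # word unseen in the collection: same for every document
        p_doc = (doc_counts[word] + mu * p_coll) / (len(doc_tokens) + mu)
        score += (freq / query_len) * math.log(p_doc)
    return score

def rank(query_tokens, units, mu=10.0):
    """Rank pseudo-documents (paragraphs or sentences) by descending score."""
    tokenized = [tokenize(unit) for unit in units]
    coll_counts = Counter(tok for toks in tokenized for tok in toks)
    coll_len = sum(coll_counts.values())
    scored = [(kl_score(query_tokens, toks, coll_counts, coll_len, mu), unit)
              for unit, toks in zip(units, tokenized)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

# Toy data: invented paragraphs and a placeholder gene name (assumption).
PARAGRAPHS = [
    "GENE1 was cloned in 1990. Loss of GENE1 promotes tumour growth in mice. "
    "Funding is acknowledged.",
    "Plants need light. Soil quality matters.",
]
SYNONYMS = {"tumor": ["tumour", "neoplasm"]}
SENTENCES = [s for p in PARAGRAPHS for s in split_sentences(p)]

if __name__ == "__main__":
    query = expand_query("GENE1 tumor", SYNONYMS)
    for score, sentence in rank(query, SENTENCES):
        print(f"{score:.3f}  {sentence}")
```

Without expansion, the spelling variant "tumour" in the second sentence would not match the query term "tumor" at all. Ranking sentences rather than whole paragraphs returns a much shorter span around the match, which is what passage-level evaluation rewards.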

  • [PDF] E. Meij, M. Jansen, and M. de Rijke, “Expanding queries using multiple resources (the AID group at TREC 2006: genomics track),” in The Fifteenth Text REtrieval Conference (TREC 2006), 2007.
    [Bibtex]
    @inproceedings{TREC:2006:meij,
    Author = {Meij, E. and Jansen, M. and de Rijke, M.},
    Booktitle = {The Fifteenth Text REtrieval Conference},
    Date-Added = {2011-10-12 23:24:14 +0200},
    Date-Modified = {2012-10-30 09:23:12 +0000},
    Series = {TREC 2006},
    Title = {Expanding Queries Using Multiple Resources (The {AID} Group at {TREC} 2006: Genomics Track)},
    Year = {2007}}