Annals of Information Systems

Semantic disclosure in an e-Science environment

The Virtual Laboratory for e-Science (VL-e) project serves as a backdrop for the ideas described in this chapter. VL-e is a project with academic and industrial partners where e-science has been applied to several domains of scientific research. Adaptive Information Disclosure (AID), a subprogram within VL-e, is a multi-disciplinary group that concentrates expertise in information extraction, machine learning, and Semantic Web – a powerful combination of technologies that can be used to extract and store knowledge in a Semantic Web framework. In this chapter, the authors explain what “semantic disclosure” means and how it is essential to knowledge sharing in e-Science. The authors describe several Semantic Web applications and how they were built using components of the AIDA Toolkit (AID Application Toolkit). The lessons learned and the future of e-Science are also discussed.

  • [PDF] M. S. Marshall, M. Roos, E. Meij, S. Katrenko, W. R. van Hage, and P. W. Adriaans, “Semantic disclosure in an e-Science environment,” in Semantic e-Science (Springer Annals of Information Systems AoIS), 2009.
    [Bibtex]
    @inproceedings{AIS:2009:marshall,
    Author = {Marshall, M.S. and Roos, M. and Meij, E. and Katrenko, S. and van Hage, W.R. and Adriaans, P.W.},
    Booktitle = {Semantic e-Science (Springer Annals of Information Systems AoIS)},
    Date-Added = {2011-10-16 15:03:17 +0200},
    Date-Modified = {2012-10-28 17:21:26 +0000},
    Publisher = {Springer},
    Series = {Annals of Information Systems},
    Title = {Semantic disclosure in an e-Science environment},
    Volume = {11},
    Year = {2009}}
AGRO informatica

De AIDA toolbox: Een gecombineerde aanpak voor het beheren van kennis (The AIDA toolbox: a combined approach to managing knowledge)

A computational network environment such as the grid contains an abundance of very diverse kinds of resources: journal articles, images, mass spectrometry data, R scripts for statistics, web services, workflows, or spreadsheets, for example. This abundance can be a major obstacle. How is a user to find the right resources for the problem at hand? Many factors make it a complex problem to match the requirements and wishes of users against what the resources can deliver and the rules governing their use. The problem occurs at several levels. End users want to find what they need within their own domain. Application and middleware developers must be able to find services and data, preferably in an automated way so that changes in availability and accessibility can be accommodated. The problem is not limited to grids; the Web and all kinds of data storage applications face it as well. Managing heterogeneous resources is also an important challenge for ‘enhanced science’ (e-science).

  • [PDF] M. S. Marshall, M. Roos, E. Meij, S. Katrenko, W. R. van Hage, and P. W. Adriaans, “De AIDA toolbox: een gecombineerde aanpak voor het beheren van kennis,” Agro Informatica, vol. 21, iss. 4, pp. 5-7, 2009.
    [Bibtex]
    @article{AGRO:2009:marshall,
    Author = {Marshall, M.S. and Roos, M. and Meij, Edgar and Katrenko, S. and van Hage, W.R. and Adriaans, P.W.},
    Date-Added = {2011-10-16 15:55:36 +0200},
    Date-Modified = {2012-10-28 23:04:41 +0000},
    Edition = {1},
    Journal = {Agro Informatica},
    Number = {4},
    Pages = {5--7},
    Title = {De {AIDA} toolbox: Een gecombineerde aanpak voor het beheren van kennis},
    Volume = {21},
    Year = {2009}}

Biological applications of AIDA knowledge management components

Given the important role of knowledge in biology, knowledge in a machine readable form can be an important asset for bioinformatics. We present two applications of AIDA (Adaptive Information Disclosure Application), a collection of knowledge management components. One is a workflow that extends a semantic model with putative relations between proteins and diseases extracted from literature by machine learning techniques. The other extends vBrowser, a virtual resource browser tool, with the ability to find relevant biological resources (e.g. data, workflows, documents) via semantic relationships.

Central to our semantic web approach is the separation of a ‘virtual knowledge space’ from its applications. In other words, knowledge is disclosed and accessed in a knowledge space rather than being coded into the application. The workflow adds knowledge to this space with knowledge extraction, while vBrowser accesses the knowledge resources for use during search. We use RDF and OWL to represent knowledge and Sesame to store these representations.

The workflow contains the following steps: (i) add the ontology that you want to extend to Sesame (e.g. a model that contains the protein EZH2), (ii) extract the entities of interest from the ontology (e.g. EZH2), (iii) retrieve abstracts from Medline for these entities, (iv) extract proteins and protein-protein relationships from the abstracts, (v) add a ranking score to the discoveries, (vi) query OMIM with the extracted proteins, via a service from the National Institute of Genetics in Japan, and retrieve the disease labels, (vii) add the discoveries and their interrelationships to the repository, and (viii) export the enriched ontology to the knowledge space, where for instance vBrowser can be used to explore the results. Future work includes metrics to more effectively retrieve biologically interesting suggestions from semantic data.

We show how vBrowser can be used to browse both data resources and knowledge resources from the same basic interface. We show how vBrowser uses an AIDA thesaurus service to improve finding resources such as Medline documents and workflows on myExperiment.org. We found thesaurus terms effective for search and advocate SKOS for its intuitive ‘broader/narrower-than’ relationships. We further show that the protein-disease relationships resulting from our knowledge capture workflow, as well as the documents that contained these relationships, can be accessed as knowledge resources from vBrowser. We think OWL can adequately represent the knowledge in many biological cartoon models and have used it to represent the workflow provenance in our knowledge capture workflow.
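To illustrate why ‘broader/narrower-than’ relationships help search, the following is a minimal sketch of thesaurus-based query expansion. It is not the AIDA thesaurus service or its API: the `NARROWER` table, the function names, and the sample documents are all hypothetical, and a plain dictionary stands in for a SKOS thesaurus.

```python
from collections import deque

# Hypothetical toy thesaurus: each term maps to its narrower terms
# (a plain-dict stand-in for skos:narrower links).
NARROWER = {
    "enzyme": ["methyltransferase", "kinase"],
    "methyltransferase": ["EZH2"],
    "kinase": [],
    "EZH2": [],
}

def expand_query(term, narrower, max_depth=2):
    """Return the term followed by its narrower terms, up to max_depth levels down."""
    seen = [term]
    queue = deque([(term, 0)])
    while queue:
        current, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not descend below the depth limit
        for child in narrower.get(current, []):
            if child not in seen:
                seen.append(child)
                queue.append((child, depth + 1))
    return seen

def search(documents, terms):
    """Return ids of documents that mention any of the terms (case-insensitive)."""
    lowered = [t.lower() for t in terms]
    return [doc_id for doc_id, text in documents.items()
            if any(t in text.lower() for t in lowered)]

if __name__ == "__main__":
    docs = {
        "d1": "EZH2 is a histone methyltransferase",
        "d2": "unrelated text",
    }
    print(search(docs, ["enzyme"]))                        # literal match only
    print(search(docs, expand_query("enzyme", NARROWER)))  # with narrower terms
```

In this toy setting, a query for the broad term alone misses the document, while the expanded query finds it through the narrower terms.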

  • M. Roos, M. S. Marshall, P. T. de Boer, K. van den Berg, S. Katrenko, E. Meij, W. R. van Hage, and P. W. Adriaans, “Biological applications of AIDA knowledge management components,” in ISMB ’08, 2008.
    [Bibtex]
    @inproceedings{ISMB:2008:roos,
    Author = {Marco Roos and M. Scott Marshall and Piter T. de Boer and Kasper van den Berg and Sophia Katrenko and Edgar Meij and Willem R. van Hage and Pieter W. Adriaans},
    Booktitle = {ISMB '08},
    Date-Added = {2011-10-16 10:45:35 +0200},
    Date-Modified = {2012-10-28 23:04:46 +0000},
    Title = {Biological applications of {AIDA} knowledge management components},
    Year = {2008}}

My first BioAID: heuristic support for hypothesis construction from literature

Motivation

Constructing a new hypothesis is often the first step for a new cycle of experiments. A typical approach to harvesting biological literature is to scan the results of a PubMed query and read what we think is most relevant. In this scenario, we are limited by the selection of papers and, for future applications, we are limited by our capacity to recall the knowledge we have gained. As part of the development of a ‘virtual laboratory for bioinformatics,’ we seek alternative ways to support the construction of hypotheses from biological literature.

Objectives

Our objective is to provide automated support for hypothesis formation from literature based on an initial seed of knowledge.

Approach

Our approach consists of the following steps. First, we create a ‘proto-ontology’ from the knowledge that we want to extend, for instance, a table in a review that lists diseases associated with a particular enzyme. We then identify the collection of documents that we want to search (typically Medline). Subsequently, we use concepts from our proto-ontology as input to retrieve relevant documents from the collection and to inform us of concepts, such as protein names or relationships, that are putatively associated with the proto-ontology. These results are used to enrich the proto-ontology with additional concepts and relations. The ontology can be iteratively enriched by using the results from one run as input for the next.
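The iterative enrichment loop described above can be sketched as follows. This is an illustration under stated assumptions, not the BioAID implementation: the function names are hypothetical, retrieval is a naive substring match rather than a Lucene query, and a dictionary lookup against a fixed vocabulary stands in for the machine learning extractors.

```python
def retrieve(corpus, concepts):
    """Return documents that mention at least one known concept (naive substring match)."""
    return [text for text in corpus
            if any(c.lower() in text.lower() for c in concepts)]

def extract_concepts(text, vocabulary):
    """Stand-in for learned extractors: look up vocabulary terms in the text."""
    return {v for v in vocabulary if v.lower() in text.lower()}

def enrich(seed, corpus, vocabulary, max_rounds=5):
    """Grow the concept set until a round adds nothing new or max_rounds is reached."""
    concepts = set(seed)
    for _ in range(max_rounds):
        found = set()
        for text in retrieve(corpus, concepts):
            found |= extract_concepts(text, vocabulary)
        new = found - concepts
        if not new:
            break  # fixed point: the last run produced no new concepts
        concepts |= new  # results of this run become input for the next
    return concepts

if __name__ == "__main__":
    corpus = [
        "EZH2 interacts with EED.",
        "EED binds SUZ12.",
        "BRCA1 is a tumor suppressor.",
    ]
    vocabulary = {"EZH2", "EED", "SUZ12", "BRCA1"}
    print(sorted(enrich({"EZH2"}, corpus, vocabulary)))
```

The `max_rounds` cap is a design choice for the sketch: without a ranking or filtering step, each run can pull in loosely related concepts, so bounding the number of iterations keeps the enriched set close to the seed.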

Implementation

Our implementation is based on a collection of web services, allowing us to construct custom workflows for specific tasks. Together, these web services form a toolbox called AIDA (Adaptive Information Disclosure Application), for annotating documents, searching documents, discovering knowledge from documents, and storing ontological data. AIDA uses open source software such as Lucene for document retrieval and Sesame for handling ontologies. For the purposes of this implementation, we have also used Taverna to construct our workflows, along with the ontology editor Protégé.

Results

We have created workflows from services in the AIDA toolbox, and applied them to extend a proto-ontology with knowledge extracted from literature. Technically, the most challenging workflow uses our own proto-ontology as input for machine learning services, after which biological concepts are discovered that are related to terms from our own ontology. As a proof of concept, we have (re)discovered diseases that are known to be related to EZH2, an enzyme associated with gene regulation via chromatin remodelling. A second workflow, which discovers genomics concepts, is used to identify proteins that might present a previously unreported link between two biological concepts, e.g. histones and transcription factors. The proto-ontology and enriched ontology are written in the Web Ontology Language (OWL), and stored in Sesame via another service from the toolbox.
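One simple way to look for proteins that may link two concepts is co-occurrence counting, sketched below. This is an assumed stand-in technique for illustration only; the abstract does not say how the second workflow ranks its candidates. The function name and sample abstracts are hypothetical, and matching is a naive substring test.

```python
from collections import Counter

def bridging_terms(abstracts, concept_a, concept_b, candidates):
    """Rank candidates that co-occur with both concepts somewhere in the corpus."""
    with_a, with_b = Counter(), Counter()
    for text in abstracts:
        lowered = text.lower()
        has_a = concept_a.lower() in lowered
        has_b = concept_b.lower() in lowered
        for cand in candidates:
            if cand.lower() in lowered:
                if has_a:
                    with_a[cand] += 1
                if has_b:
                    with_b[cand] += 1
    # Score by the weaker of the two links, so both sides need support.
    scores = {c: min(with_a[c], with_b[c])
              for c in candidates if with_a[c] and with_b[c]}
    # Highest score first; ties broken alphabetically for stable output.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

if __name__ == "__main__":
    abstracts = [
        "Protein P1 binds histone H3.",
        "P1 recruits a transcription factor.",
        "P2 modifies histone tails.",
    ]
    print(bridging_terms(abstracts, "histone", "transcription factor", ["P1", "P2"]))
```

A candidate that appears with only one of the two concepts is dropped, since it offers no evidence of a link between them.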

Availability

Services and workflows are available from http://ws.adaptivedisclosure.org/BioAIDdemo1. Ontologies are available from http://rdf.adaptivedisclosure.org/BioAIDdemo1.

Conclusion

Workflows constructed from the AIDA toolbox can be used as an aid in constructing hypotheses from literature. We show that we can automatically extend a proto-ontology with new hypothetical concepts and relationships that bridge across the boundaries of single papers or biological subdomains. Our approach can be customized to particular domains and vocabularies through the choice of ontology and literature corpora.

  • [PDF] M. Roos, S. Katrenko, W. R. van Hage, E. Meij, M. S. Marshall, and P. W. Adriaans, “My first BioAID: heuristic support for hypothesis construction,” in ISMB-ECCB’07, 2007.
    [Bibtex]
    @inproceedings{ISMB:2007:Roos,
    Author = {Roos, M. and Katrenko, S. and van Hage, W.R. and Meij, E. and Marshall, M.S. and Adriaans, P.W.},
    Booktitle = {ISMB-ECCB'07},
    Date-Added = {2011-10-13 08:56:20 +0200},
    Date-Modified = {2011-10-13 08:56:20 +0200},
    Title = {My first BioAID: heuristic support for hypothesis construction},
    Year = {2007}}