
Towards a combined model for search and navigation of annotated documents

Documents whose textual content is complemented with annotations of one kind or another are ubiquitous. Examples include biomedical documents (annotated with MeSH terms) and news articles (annotated with IPTC terms). Such annotations, or concepts, have typically been used for query expansion, to suggest alternative or related query formulations, and to facilitate browsing of the document collection. In recent years, we have seen two important developments in this area: (i) a renewed interest in the knowledge sources underlying the annotations, mainly inspired by Semantic Web initiatives, and (ii) the creation of social annotations as part of Web 2.0 developments. These developments motivate a renewed interest in models and methods for accessing annotated documents.

The theme of my proposed research is to capture two aspects in a single, unified model: retrieval and navigation. Given a query, this entails using both term-based and concept-based evidence to locate relevant information (retrieval) and offering useful browsing suggestions (navigation). I imagine this to be a “two-way” process, i.e., the user can browse the document collection using concepts and the relations between concepts, but she can also navigate the knowledge structure using the (vocabulary) terms from the documents. Such information seeking behavior is witnessed in an increasing number of applications and domains (e.g., suggesting related tags in Bibsonomy or Flickr), providing a solid motivation for my research agenda. To accomplish this unification, I will first need to address three separate but intertwined issues. First, a way of “bridging the gap” between concepts and (vocabulary) terms is needed, since concepts are not directly observable. Second, relations between concepts need to be modeled in some way. Finally, the concepts and relations thus modeled should be integrated in the information seeking process, thereby improving both retrieval and navigation.

So far, I have formulated concept modeling as a form of text classification, by representing concepts as distributions over vocabulary terms. In a digital library setting, I have shown that integrating conceptual knowledge in this way can both improve retrieval performance and facilitate navigation. More recently, I have taken these experiments a step further by creating parsimonious concept models. In these experiments, integrating concepts into the query model estimation delivers significantly better results than both a query likelihood run and a run based on relevance models.
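To make the idea of representing concepts as distributions over vocabulary terms concrete, here is a minimal sketch. It is not the estimation procedure used in the experiments: it simply pools the text of all documents annotated with a concept and smooths the resulting relative frequencies with a collection model. The function name, the input format, the smoothing weight `lam`, and the toy documents are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def estimate_concept_models(annotated_docs, lam=0.9):
    """Estimate P(term | concept) from documents annotated with concepts.

    annotated_docs: iterable of (tokens, concepts) pairs (hypothetical format).
    lam: weight of the concept-specific estimate; the remaining mass goes to
         the collection model (Jelinek-Mercer smoothing, an assumption here).
    Assumes every concept is attached to at least one non-empty document.
    """
    collection = Counter()
    per_concept = defaultdict(Counter)
    for tokens, concepts in annotated_docs:
        collection.update(tokens)
        for concept in concepts:
            # Pool the text of every document annotated with this concept.
            per_concept[concept].update(tokens)

    total = sum(collection.values())
    background = {term: n / total for term, n in collection.items()}

    models = {}
    for concept, counts in per_concept.items():
        size = sum(counts.values())
        # Interpolate the concept estimate with the collection model so that
        # every vocabulary term receives non-zero probability.
        models[concept] = {
            term: lam * counts[term] / size + (1 - lam) * p_bg
            for term, p_bg in background.items()
        }
    return models, background


# Toy data, invented for illustration only.
docs = [
    ("gene expression in tumor cells".split(), ["Neoplasms", "Gene Expression"]),
    ("tumor growth and metastasis".split(), ["Neoplasms"]),
    ("protein folding and gene regulation".split(), ["Gene Expression"]),
]
models, background = estimate_concept_models(docs)
```

Each entry of `models` is a proper probability distribution over the collection vocabulary, which is what allows concept models to be compared with one another or combined with a query model.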

To determine the strength of relations between concepts, I have looked at the divergence between concept models. The estimates are based on differences in language use, measured as the cross-entropy reduction between concept models. Experimental results show that this approach outperforms both path-based and information content-based methods on two separate test sets. While this approach measures the similarity between concepts, it does not explicitly take the relation type into consideration. Thus, any explicit link structure present in the underlying knowledge structure disappears. Whether discarding relation types in this way is reasonable for my work is still unclear, and it is something I intend to investigate.
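The following sketch shows one common formulation of cross-entropy reduction between two concept models: the cross-entropy of one concept model with a background (collection) model, minus its cross-entropy with the other concept model. The exact estimator and smoothing used in the experiments are not given here, so the function name, the use of a background model, and the toy probabilities are assumptions.

```python
import math


def cross_entropy_reduction(source, target, background):
    """CER(source; target) = sum_t P(t|source) * log(P(t|target) / P(t|background)).

    All three arguments map terms to probabilities over the same vocabulary.
    target and background are assumed to be smoothed (no zero probabilities).
    """
    return sum(
        p * math.log(target[term] / background[term])
        for term, p in source.items()
        if p > 0
    )


# Toy models over a four-term vocabulary, invented for illustration only.
background = {"tumor": 0.25, "gene": 0.25, "cell": 0.25, "heart": 0.25}
neoplasms = {"tumor": 0.60, "gene": 0.20, "cell": 0.15, "heart": 0.05}
carcinoma = {"tumor": 0.50, "gene": 0.25, "cell": 0.20, "heart": 0.05}
cardiology = {"tumor": 0.05, "gene": 0.05, "cell": 0.10, "heart": 0.80}

related = cross_entropy_reduction(neoplasms, carcinoma, background)
unrelated = cross_entropy_reduction(neoplasms, cardiology, background)
```

A higher score means the target concept model explains the source concept's language better than the background model does. The score is asymmetric and uses only term distributions, which is why relation types and the explicit link structure play no role in it.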

In future work, I would also like to address the question of how the retrieval-oriented models I have introduced so far may be used to further aid navigation. I have already used the TREC Genomics test collections to evaluate navigational effectiveness to some extent, but future work, possibly observing users directly in a user study or indirectly through log analysis, should indicate what impact, if any, the model has on navigation.

  • [PDF] E. Meij, “Towards a combined model for search and navigation of annotated documents,” in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
    [Bibtex]
    @inproceedings{SIGIR:2008:meij-doctcons,
    Author = {Meij, Edgar},
    Booktitle = {Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:48:04 +0000},
    Series = {SIGIR 2008},
    Title = {Towards a combined model for search and navigation of annotated documents},
    Year = {2008},
    Bdsk-Url-1 = {http://dx.doi.org/10.1145/1390334.1390573}}

Measuring Concept Relatedness Using Language Models

Over the years, the notion of concept relatedness has attracted considerable attention. A variety of approaches, based on ontology structure, information content, association, or context, have been proposed to indicate the relatedness of abstract ideas. In this paper we present a novel context-based measure of concept relatedness: the cross-entropy reduction between language models of concepts, which are estimated from document-concept assignments. After introducing our method, we compare it to earlier methods by evaluating each against relatedness judgments provided by human assessors. The approach shows improved or competitive results compared to state-of-the-art methods on two test sets in the biomedical domain.
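One common way to compare a relatedness measure with human judgments is rank correlation over a set of concept pairs. The sketch below computes Spearman's rank correlation in plain Python. Whether the paper uses this particular statistic is not stated above, so treat the choice as an assumption; the function names and all scores are invented for illustration and are not results from the paper.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    var_x = sum((a - mean_x) ** 2 for a in rx)
    var_y = sum((b - mean_y) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical scores for four concept pairs (not data from the paper).
measure_scores = [0.30, -1.37, 0.10, -0.50]
human_judgments = [4.0, 1.0, 3.5, 2.0]
rho = spearman(measure_scores, human_judgments)
```

Because only the ordering of the pairs matters, a measure on an arbitrary scale, such as cross-entropy reduction, can be compared directly with graded human judgments.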

  • [PDF] D. Trieschnigg, E. Meij, M. de Rijke, and W. Kraaij, “Measuring concept relatedness using language models,” in Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2008.
    [Bibtex]
    @inproceedings{SIGIR:2008:trieschnigg,
    Author = {Trieschnigg, Dolf and Meij, Edgar and de Rijke, Maarten and Kraaij, Wessel},
    Booktitle = {Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:45:51 +0000},
    Series = {SIGIR 2008},
    Title = {Measuring concept relatedness using language models},
    Year = {2008},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1390334.1390523}}