
Team COMMIT at TREC 2011

We describe the participation of Team COMMIT in this year’s Microblog and Entity tracks.

In our participation in the Microblog track, we used a feature-based approach. Specifically, we pursued a precision-oriented, recency-aware retrieval approach for tweets. Among other things, we used various types of external data: we examined the potential of link retrieval on a corpus of crawled content pages, and we used semantic query expansion based on Wikipedia. We also deployed pre-filtering based on query-dependent and query-independent features. For the Microblog track we found that a simple cut-off based on the z-score is not sufficient: for differently distributed scores, it can decrease recall. A well-set cut-off parameter can, however, significantly increase precision, especially if there are few highly relevant tweets. Filtering based on query-independent features does not help when the result list is already small. Given the high occurrence of links in relevant tweets, we found that link retrieval helps improve precision and recall for both relevant and highly relevant tweets. Future work should focus on a selection criterion that depends on the score distribution.
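As a rough illustration of the z-score cut-off discussed above, the sketch below keeps only the tweets whose retrieval score lies a given number of standard deviations above the mean of the result list. The function name, the threshold value, and the toy scores are assumptions made for this example; the abstract does not specify the exact formulation used in the runs.

```python
from statistics import mean, pstdev


def zscore_cutoff(scored_tweets, threshold=1.0):
    """Keep tweets whose score is at least `threshold` standard
    deviations above the mean score of the result list.

    `scored_tweets` is a list of (tweet_id, score) pairs; the threshold
    is a hypothetical parameter, not a value taken from the paper.
    """
    scores = [score for _, score in scored_tweets]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so the z-score is undefined; keep everything.
        return list(scored_tweets)
    return [(tweet, score) for tweet, score in scored_tweets
            if (score - mu) / sigma >= threshold]


# Hypothetical result list: two strong matches followed by a long tail.
results = [("t1", 9.0), ("t2", 8.5), ("t3", 2.0),
           ("t4", 1.5), ("t5", 1.0), ("t6", 1.0)]
kept = zscore_cutoff(results, threshold=1.0)
print([tweet for tweet, _ in kept])  # ['t1', 't2']
```

With a few clearly separated top scores, the cut-off removes the tail and keeps the strong matches. On a more evenly spread list the same threshold discards most results, which is one way a fixed cut-off can hurt recall.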

In this year’s Entity track participation we focused on the Entity List Completion (ELC) task. We experimented with a text-based and a link-based approach to retrieving entities in Linked Data (LD). Additionally, we experimented with selecting candidate entities from a web corpus. Our intuition is that entities occurring on pages with many of the example entities are more likely to be good candidates than entities that do not. We have no analyses or conclusions to report for the Entity track yet: at the time of writing, no evaluation results are available.
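The candidate-selection intuition can be sketched as follows: an entity earns credit for every page it appears on, in proportion to how many example entities appear on that same page. The scoring function, the page representation (a set of entity identifiers per page), and the toy data are assumptions for illustration only, not the method used in the submitted runs.

```python
from collections import Counter


def rank_candidates(pages, examples):
    """Rank non-example entities by co-occurrence with example entities.

    For each page, every non-example entity on it receives a credit equal
    to the number of example entities found on that page.
    """
    examples = set(examples)
    scores = Counter()
    for entities_on_page in pages:
        entities_on_page = set(entities_on_page)
        overlap = len(entities_on_page & examples)
        if overlap == 0:
            continue  # pages without example entities give no evidence
        for entity in entities_on_page - examples:
            scores[entity] += overlap
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical pages, each reduced to the set of entities it mentions.
pages = [
    {"example_a", "example_b", "cand_x"},
    {"example_a", "example_b", "cand_x", "cand_y"},
    {"example_a", "cand_z"},
    {"cand_z", "cand_w"},
]
ranking = rank_candidates(pages, examples={"example_a", "example_b"})
print(ranking)  # [('cand_x', 4), ('cand_y', 2), ('cand_z', 1)]
```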

  • [PDF] M. Bron, E. Meij, M. Peetz, M. Tsagkias, and M. de Rijke, “Team COMMIT at TREC 2011,” in The twentieth text retrieval conference, 2012.
    [Bibtex]
    @inproceedings{TREC:2011:commit,
    Author = {Bron, Marc and Meij, Edgar and Peetz, Maria-Hendrike and Tsagkias, Manos and de Rijke, Maarten},
    Booktitle = {The Twentieth Text REtrieval Conference},
    Date-Added = {2011-10-22 12:22:19 +0200},
    Date-Modified = {2012-10-30 09:26:12 +0000},
    Series = {TREC 2011},
    Title = {Team {COMMIT} at {TREC 2011}},
    Year = {2012}}

The University of Amsterdam at TREC 2010: Session, Entity, and Relevance Feedback

We describe the participation of the University of Amsterdam’s ILPS group in the Session, Entity, and Relevance Feedback tracks at TREC 2010. In the Session track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models, and we explore the use of Freebase for type filtering, entity normalization, and homepage finding. In the ELC task we rank candidates by the number of links they share with the example entities. In the Relevance Feedback track we experiment with a novel model that uses Wikipedia to expand the query language model.
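The link-based ELC ranking can be illustrated with a minimal sketch in which each entity is represented by the set of links attached to it, and a candidate's score is the total number of links it shares with the example entities. The data layout, the summation over examples, and all identifiers below are assumptions for this example; the abstract only states that shared link counts are used for ranking.

```python
def rank_by_shared_links(links, examples, candidates):
    """Rank candidates by the number of links shared with example entities.

    `links` maps an entity identifier to the set of link targets
    associated with it (a hypothetical representation).
    """
    ranking = []
    for candidate in candidates:
        candidate_links = links.get(candidate, set())
        # Sum the overlap with each example entity's link set.
        score = sum(len(candidate_links & links.get(example, set()))
                    for example in examples)
        ranking.append((candidate, score))
    # Sort by descending score, breaking ties alphabetically.
    return sorted(ranking, key=lambda item: (-item[1], item[0]))


# Hypothetical link sets for two example entities and three candidates.
links = {
    "ex1": {"a", "b", "c"},
    "ex2": {"b", "c", "d"},
    "cand1": {"b", "c"},
    "cand2": {"a", "x"},
    "cand3": {"y"},
}
shared_link_ranking = rank_by_shared_links(
    links, examples=["ex1", "ex2"], candidates=["cand1", "cand2", "cand3"])
print(shared_link_ranking)  # [('cand1', 4), ('cand2', 1), ('cand3', 0)]
```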

  • [PDF] M. Bron, J. He, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “The University of Amsterdam at TREC 2010: session, entity, and relevance feedback,” in The nineteenth text retrieval conference, 2011.
    [Bibtex]
    @inproceedings{TREC:2011:bron,
    Abstract = {We describe the participation of the University of Amsterdam's ILPS group in the Session, Entity, and Relevance Feedback tracks at TREC 2010. In the Session track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models, and we explore the use of Freebase for type filtering, entity normalization, and homepage finding. In the ELC task we rank candidates by the number of links they share with the example entities. In the Relevance Feedback track we experiment with a novel model that uses Wikipedia to expand the query language model.},
    Author = {Bron, M. and He, J. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Nineteenth Text REtrieval Conference},
    Date-Added = {2011-10-20 11:18:35 +0200},
    Date-Modified = {2012-10-30 09:25:06 +0000},
    Series = {TREC 2010},
    Title = {{The University of Amsterdam at TREC 2010}: Session, Entity, and Relevance Feedback},
    Year = {2011}}

Entity Search: Building Bridges between Two Worlds

We have come to depend on technological resources to create order and find meaning in the ever-growing amount of online data. One frequently recurring type of query in web search is the query containing named entities (persons, organizations, locations, etc.): we organize our environments around entities that are meaningful to us. Hence, to support humans in dealing with massive volumes of data, next-generation search engines need to organize information in semantically meaningful ways, structured around entities. Furthermore, instead of merely finding documents that mention an entity, finding the entity itself is required.

The problem of entity search has been and is being looked at by both the Information Retrieval (IR) and Semantic Web (SW) communities and is, in fact, ranked high on the research agendas of the two communities. The entity search task comes in several flavors. One is known as entity ranking (given a query and target category, return a ranked list of relevant entities), another is list completion (given a query and example entities, return similar entities), and a third is related entity finding (given a source entity, a relation and a target type, identify target entities that enjoy the specified relation with the source entity and that satisfy the target type constraint).

State-of-the-art IR models allow us to address entity search by identifying relevant entities in large volumes of web data. These methods often approach entity-oriented retrieval tasks by establishing associations between topics, documents, and entities or amongst entities themselves, where such associations are modeled by observing the language usage around entities. A major challenge with current IR approaches to entity retrieval is that they fail to produce interpretable descriptions of the found entities or of the relationships between entities. The generated models tend to lack human-interpretable semantics and are rarely meaningful for human consumption: interpretable labels are needed (both for entities and for relations). Linked Open Data (LOD) is a recent contribution of the emerging semantic web that has the potential of providing the required semantic information.

From an SW point of view, entity retrieval should be as simple as running SPARQL queries over structured data. However, since a true semantic web has not yet been fully realized, the results of such queries are currently not sufficient to answer common information needs. By now, the LOD cloud contains millions of concepts from over one hundred structured data sets. This abundance, however, also introduces novel issues such as “cheap semantics” (e.g., wikilink relations in DBpedia) and the need to rank potentially very large numbers of results. Furthermore, given that most web users are not proficient in semantic web languages such as SPARQL or standards such as RDF and OWL, the free-form text input used by most IR systems is more appealing to end users.

These concurrent developments give rise to the following general question: to what extent are state-of-the-art IR and SW technologies capable of answering information needs related to entity finding? In this paper we focus on the task of related entity finding (REF). For example, for a source entity (“Michael Schumacher”), a relation (“Michael’s teammates while he was racing in Formula 1”) and a target type (“people”), a REF system should return entities such as “Eddie Irvine” and “Felipe Massa.” REF aims at making arbitrary relations between entities searchable. We focus on an adaptation of the official task as it was run at TREC 2009 and restrict the target entities to those having a primary Wikipedia article: this modification provides an elegant way of making the IR and SW results comparable.

From an IR perspective, a natural way of capturing the relation between a source and target entity is based on their co-occurrence in suitable contexts. Later, we use an aggregate of methods, all of which are based on this approach. In contrast, an SW perspective on the same task is to search for entities through links such as the ones in LOD; for this we apply both standard SPARQL queries and an exhaustive graph search algorithm.
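To make the link-based view concrete, the sketch below runs a breadth-first search over a small set of triples and returns all entities of the target type that can be reached from the source entity within a fixed number of hops. This is only one possible reading of "exhaustive graph search": the hop limit, the undirected treatment of links, the `ex:` predicates, and the placeholder nodes `Team A` and `Country X` are assumptions for illustration, not details taken from the paper.

```python
from collections import deque


def reachable_entities(triples, source, target_type, types, max_hops=2):
    """Return {entity: distance} for entities of `target_type` that lie
    within `max_hops` links of `source`, ignoring link direction."""
    # Build an undirected adjacency list from (subject, predicate, object) triples.
    neighbours = {}
    for subj, _pred, obj in triples:
        neighbours.setdefault(subj, set()).add(obj)
        neighbours.setdefault(obj, set()).add(subj)

    distance = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours.get(node, ()):
            if nxt not in distance:
                distance[nxt] = distance[node] + 1
                queue.append(nxt)

    return {entity: dist for entity, dist in distance.items()
            if entity != source and types.get(entity) == target_type}


# Toy triples with hypothetical predicates and placeholder nodes.
triples = [
    ("Michael Schumacher", "ex:memberOf", "Team A"),
    ("Eddie Irvine", "ex:memberOf", "Team A"),
    ("Felipe Massa", "ex:memberOf", "Team A"),
    ("Team A", "ex:basedIn", "Country X"),
]
types = {
    "Michael Schumacher": "person",
    "Eddie Irvine": "person",
    "Felipe Massa": "person",
    "Team A": "organization",
    "Country X": "location",
}
found = reachable_entities(triples, "Michael Schumacher", "person", types)
print(sorted(found.items()))  # [('Eddie Irvine', 2), ('Felipe Massa', 2)]
```

A search like this finds entities connected through any chain of links, so the results still need ranking or filtering by relation; the type constraint alone does not say whether a reachable person actually satisfies the requested relation.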

In this paper, we analyze and discuss to what extent REF can be solved by IR and SW methods. It is important to note that our goal is not to perform a quantitative comparison or to claim that one approach is better than the other. Rather, we investigate the results returned by either approach and perform a more qualitative evaluation. We find that IR and SW methods discover different sets of entities, although these sets overlap. Based on the results of our evaluation, we demonstrate that the two approaches are complementary in nature and we discuss how each field could potentially benefit from the other. We arrive at and motivate a proposal to combine text-based entity models with semantic information from the Linking Open Data cloud.

  • [PDF] K. Balog, E. Meij, and M. de Rijke, “Entity search: building bridges between two worlds,” in Proceedings of the 3rd international semantic search workshop, 2010.
    [Bibtex]
    @inproceedings{semsearch:2010:balog,
    Author = {Balog, Krisztian and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 3rd International Semantic Search Workshop},
    Date-Added = {2011-10-20 10:07:31 +0200},
    Date-Modified = {2012-10-30 08:41:54 +0000},
    Series = {SEMSEARCH 2010},
    Title = {Entity search: building bridges between two worlds},
    Year = {2010},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1863879.1863888}}