TREC

The University of Amsterdam at the TREC 2011 Session Track

We describe the participation of the University of Amsterdam’s ILPS group in the Session track at TREC 2011.

The stream of interactions created by a user engaging with a search system contains a wealth of information. For retrieval purposes, previous interactions can help inform us about a user’s current information need. Building on this intuition, our contribution to this year’s TREC Session track focuses on session modeling and learning to rank using session information. In this paper, we present and compare three complementary strategies that we designed for improving retrieval for a current query using previous queries and clicked results: probabilistic session modeling, semantic query modeling, and implicit feedback.

In our experiments we examined three complementary strategies for improving retrieval for a current query. Our first strategy, based on probabilistic session modeling, was the best performing strategy.

Our second strategy, based on semantic query modeling, did less well than we expected, likely due to topic drift from excessively aggressive query expansion. We expect that performance of this strategy would improve by limiting the number of terms and/or improving the probability estimates.

With respect to our third strategy, based on learning from feedback, we found that learning weights for linear weighted combinations of features from an external collection can be beneficial if the characteristics of that collection are similar to the current data. Feedback available in the form of user clicks appeared to be less beneficial. Our run that learned from implicit feedback performed substantially worse than a run in which weights were learned from an external collection with explicit feedback, using the same learning algorithm and set of features.

  • [PDF] B. Huurnink, R. Berendsen, K. Hofmann, E. Meij, and M. de Rijke, “The University of Amsterdam at the TREC 2011 session track,” in The twentieth text retrieval conference, 2012.
    [Bibtex]
    @inproceedings{TREC:2011:huurnink,
    Author = {Huurnink, Bouke and Berendsen, Richard and Hofmann, Katja and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {The Twentieth Text REtrieval Conference},
    Date-Added = {2011-10-22 12:22:18 +0200},
    Date-Modified = {2013-05-22 11:44:53 +0000},
    Month = {January},
    Series = {TREC 2011},
    Title = {The {University of Amsterdam} at the {TREC} 2011 Session Track},
    Year = {2012}}
P30 difference plot

Team COMMIT at TREC 2011

We describe the participation of Team COMMIT in this year’s Microblog and Entity tracks.

In the Microblog track, we used a feature-based approach. Specifically, we pursued a precision-oriented, recency-aware retrieval approach for tweets. Among other things, we used various types of external data: we examined the potential of link retrieval on a corpus of crawled content pages, and we used semantic query expansion based on Wikipedia. We also deployed pre-filtering based on query-dependent and query-independent features. We found that a simple cut-off based on the z-score is not sufficient: for differently distributed scores, it can decrease recall. A well-set cut-off parameter can, however, significantly increase precision, especially if there are few highly relevant tweets. Filtering on query-independent features does not help when the result list is already small. Given the high occurrence of links in relevant tweets, we found that link retrieval helps improve precision and recall for both relevant and highly relevant tweets. Future work should focus on a selection criterion that depends on the score distribution.
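The simple z-score cut-off discussed above can be sketched as follows. This is a minimal illustration under stated assumptions, not the team's implementation: the function name `zscore_cutoff`, the default threshold, and the example scores are all hypothetical. It keeps only results whose retrieval score lies at least `threshold` standard deviations above the mean of the result list.

```python
from statistics import mean, pstdev


def zscore_cutoff(scored, threshold=1.0):
    """Keep (doc_id, score) pairs whose z-score is at least `threshold`.

    `scored` is a list of (doc_id, score) pairs for a single query.
    The threshold is a hypothetical tuning parameter.
    """
    if not scored:
        return []
    scores = [score for _, score in scored]
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        # All scores are identical, so a z-score is undefined; keep everything.
        return list(scored)
    return [(doc, s) for doc, s in scored if (s - mu) / sigma >= threshold]


# Hypothetical retrieval scores for five tweets.
ranked = [("t1", 9.0), ("t2", 8.5), ("t3", 3.0), ("t4", 2.5), ("t5", 2.0)]
kept = zscore_cutoff(ranked, threshold=0.5)
print(kept)
```

Because the mean and standard deviation are computed per result list, the same threshold removes very different fractions of results depending on how the scores are distributed, which is one way to read the recall problem noted above.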

In this year’s Entity track participation we focused on the Entity List Completion (ELC) task. We experimented with text-based and link-based approaches to retrieving entities in Linked Data (LD). Additionally, we experimented with selecting candidate entities from a web corpus. Our intuition is that entities occurring on pages with many of the example entities are more likely to be good candidates than entities that do not. We have no analyses or conclusions to report for the Entity track yet, as no evaluation results were available at the time of writing.

  • [PDF] M. Bron, E. Meij, M. Peetz, M. Tsagkias, and M. de Rijke, “Team COMMIT at TREC 2011,” in The twentieth text retrieval conference, 2012.
    [Bibtex]
    @inproceedings{TREC:2011:commit,
    Author = {Bron, Marc and Meij, Edgar and Peetz, Maria-Hendrike and Tsagkias, Manos and de Rijke, Maarten},
    Booktitle = {The Twentieth Text REtrieval Conference},
    Date-Added = {2011-10-22 12:22:19 +0200},
    Date-Modified = {2012-10-30 09:26:12 +0000},
    Series = {TREC 2011},
    Title = {Team {COMMIT} at {TREC 2011}},
    Year = {2012}}
Plot of a query-specific burst

Adaptive Temporal Query Modeling

We present an approach to query modeling that uses the temporal distribution of documents in an initially retrieved set of documents. Such distributions tend to exhibit bursts, especially in news related document collections. We hypothesize that documents in those bursts are more likely to be relevant than others. Predicated on this, we expand queries with the most distinguishing terms in high quality documents sampled from bursts. We show how the most commonly used decay function for recent document retrieval can be used as probabilistic model for temporal retrieval in general. The effectiveness of our models is demonstrated on both news collections and a collection of blog posts.
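To make the notion of a burst concrete, here is a minimal sketch of one common way to find bursts in the temporal distribution of an initially retrieved document set. It is an assumption-laden illustration, not the method from the paper: the function name `find_burst_days`, the integer day numbering, and the "mean plus `k` standard deviations" rule are all hypothetical choices.

```python
from collections import Counter
from statistics import mean, pstdev


def find_burst_days(doc_days, k=1.0):
    """Return the days on which the number of retrieved documents is unusually high.

    `doc_days` holds one integer day number per retrieved document.
    A day counts as a burst when its document count exceeds the mean daily
    count by more than `k` standard deviations (a hypothetical rule).
    """
    counts = Counter(doc_days)
    if not counts:
        return []
    first, last = min(counts), max(counts)
    days = range(first, last + 1)
    # Include days without any documents so that quiet periods lower the mean.
    series = [counts[day] for day in days]
    cutoff = mean(series) + k * pstdev(series)
    return [day for day in days if counts[day] > cutoff]


# Hypothetical publication days of the top-ranked documents for one query.
doc_days = [1, 2, 2, 3, 5, 5, 5, 5, 5, 6, 9]
print(find_burst_days(doc_days))
```

Documents published on the returned days would then be the pool from which expansion terms are sampled.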

  • [PDF] M. Peetz, E. Meij, M. de Rijke, and W. Weerkamp, “Adaptive temporal query modeling,” in Advances in information retrieval – 34th european conference on ir research, ecir 2012, 2012.
    [Bibtex]
    @inproceedings{ECIR:2012:peetz,
    Author = {Peetz, Maria-Hendrike and Meij, Edgar and de Rijke, Maarten and Weerkamp, Wouter},
    Booktitle = {Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012},
    Date-Added = {2011-11-23 18:10:40 +0100},
    Date-Modified = {2012-10-28 23:01:12 +0000},
    Title = {Adaptive Temporal Query Modeling},
    Year = {2012}}
onszelf voorbij

We-Words on Websites: Search Engines for Humanities Scholars (Wij-woorden op websites: Zoekmachines voor geesteswetenschappers)

According to many in our society, we have completely overshot ourselves in the process of integration and multiculturalism. Over the past ten years the tone of the debate in the media and on the internet has changed entirely. The government proclaims that the multicultural society has failed and is therefore being abolished. Disadvantaged ethnic groups are left to fend for themselves, and populist statements go down well, sometimes with extreme consequences. Anyone who thinks otherwise is soft and surely belongs to the ‘left-wing church’ (‘linkse kerk’).

A team of researchers is investigating how theatre companies, political parties, churches, and other groups draw their boundaries and cross them. How do they imagine the world beyond their own borders? As the domain of a threatening other? Or does it hold fallow ground, full of possibilities for projects of their own that have yet to be realized? And what political consequences do these different conceptions have? Throughout, the question is whether looking beyond the boundaries of one’s own identity is a gain or a loss. Medieval mapmakers did not know the whole world and often simply left the unknown parts blank. ‘Here be dragons’ or ‘where the lions are’, they would write there, but sometimes also ‘Eldorado’ or even ‘Paradise’. Onszelf voorbij (‘Beyond Ourselves’) is about changing communities today, and above all about whether it is a loss or a gain to look beyond the boundaries of one’s own identity. According to many in society, we have completely overshot ourselves in the process of integration and multiculturalism. Others think that we should first actually take such a step ‘beyond our own yard’.

Onszelf voorbij is about changing communities today. Where do I belong? And who belongs with me? That used to be clear: church, trade union, party, and family drew the boundaries. Now these forms of belonging are changing rapidly. Have we become afraid of what lies beyond the boundaries of our own group? Or do we dare to step across that border, beyond ourselves?


Dynamic term cloud screenshot

Online Religious Studies

Data transitions have revolutionized many scientific disciplines, starting with the exact sciences and then the life sciences. The social sciences and humanities are now in the process of becoming data-intensive sciences, with descriptions through quantitative measurements. New analysis tools, together with publicly accessible utterances, opinions, transactions, and interactions resulting from widespread Internet and social media usage, facilitate new, data-intensive research methods in disciplines that have so far relied on small-scale literature and/or panel-based studies. To illustrate the new possibilities, we report on a pilot carried out by a cross-disciplinary team consisting of computer scientists and researchers in religious studies. In the latter area, research is often focused on mapping out the convictions, hopes, and beliefs of groups of people, be it within certain religions or within any other group, such as those defined by a political party.

In the pilot, religious scholars examined the core keywords in a left-wing political party in order to determine their hopes and beliefs. Rather than following their standard way-of-working, they were equipped with a search engine with an index of content crawled from discussion forums, the party’s web site plus a range of online publications relating to the party and going back to 1990. In this paper we focus on lessons learned and on methodological innovations for religious scholars as well as for computer scientists building the enabling technology.

  • [PDF] J. Bekkenkamp, E. Meij, and M. de Rijke, “Online religious studies,” in Web science 2011, Koblenz, 2011.
    [Bibtex]
    @inproceedings{websci:2011:meij,
    Abstract = {Data transitions have revolutionized many scientific disciplines, starting with the exact sciences, then the life sciences, and now the social sciences and humanities are in the process of making the transition to becoming data intensive sciences, with descriptions through quantitative measurements. New analysis tools and publicly accessible utterances, opinions, transactions and interactions resulting from widespread internet and social media usage facilitate new, data-intensive research methods in disciplines that have so far relied on small-scale literature and/or panel-based studies. To illustrate the new possibilities, we report on a pilot carried out by a cross-disciplinary team consisting of computer scientists and researchers in religious studies. In the latter area, research is often focused on mapping out the convictions, hopes, and beliefs of groups of people, be it within certain religions or within any other group, such as those defined by a political party.
    In the pilot, religious scholars examined the core keywords in a left-wing political party in order to determine their hopes and beliefs. Rather than following their standard way-of-working, they were equipped with a search engine with an index of content crawled from discussion forums, the party's web site plus a range of online publications relating to the party and going back to 1990. In this paper we focus on lessons learned and on methodological innovations for religious scholars as well as for computer scientists building the enabling technology.},
    Address = {Koblenz},
    Author = {Bekkenkamp, J. and Meij, E. and de Rijke, M.},
    Booktitle = {Web Science 2011},
    Date-Added = {2011-10-20 10:49:41 +0200},
    Date-Modified = {2012-10-30 08:39:02 +0000},
    Title = {Online Religious Studies},
    Year = {2011}}
Classifying People Queries

Classifying Queries Submitted to a Vertical Search Engine

We propose and motivate a scheme for classifying queries submitted to a people search engine. We specify a number of features for automatically classifying people queries into the proposed classes and examine the effectiveness of these features. Our main finding is that classification is feasible and that using information from past searches, clickouts and news sources is important.

  • [PDF] R. Berendsen, B. Kovachev, E. Meij, M. de Rijke, and W. Weerkamp, “Classifying queries submitted to a vertical search engine,” in Web science 2011, Koblenz, 2011.
    [Bibtex]
    @inproceedings{websci:2011:berendsen,
    Address = {Koblenz},
    Author = {Berendsen, R. and Kovachev, B. and Meij, E. and de Rijke, M. and Weerkamp, W.},
    Booktitle = {Web Science 2011},
    Date-Added = {2011-10-20 10:49:24 +0200},
    Date-Modified = {2012-10-30 08:39:05 +0000},
    Title = {Classifying Queries Submitted to a Vertical Search Engine},
    Year = {2011}}
Dutch Belgian Information Retrieval Workshop logo

DIR 2011: the eleventh Dutch-Belgian information retrieval workshop

The 11th edition of the annual Dutch-Belgian Information Retrieval workshop (DIR 2011) took place on February 4 in Amsterdam. It was organized by the University of Amsterdam and the Centrum Wiskunde & Informatica. The focus of this year’s workshop was on interaction, with the goal of facilitating and increasing interaction, especially within the local research community, and between industry and academia. The scientific program included demos, research papers, and compressed contributions. The keynotes by Nick Belkin and Gabriella Kazai provided intriguing outlooks on the future of IR evaluation.

  • [PDF] C. Boscarino, K. Hofmann, V. B. Jijkoun, E. Meij, M. de Rijke, and W. Weerkamp, “Workshop report: dutch-belgian information retrieval,” Sigir forum, vol. 45, iss. 1, pp. 42-44, 2011.
    [Bibtex]
    @article{forum:2011:dir,
    Author = {Boscarino, C. and Hofmann, K. and Jijkoun, V.B. and Meij, E. and de Rijke, M. and Weerkamp, W.},
    Chapter = {42},
    Date-Added = {2011-10-20 10:48:47 +0200},
    Date-Modified = {2011-10-20 10:48:52 +0200},
    Journal = {SIGIR Forum},
    Number = {1},
    Pages = {42-44},
    Title = {Workshop report: Dutch-Belgian Information Retrieval},
    Volume = {45},
    Year = {2011}}

DBpedia

Mapping queries to the Linking Open Data cloud: A case study using DBpedia

We introduce the task of mapping search engine queries to DBpedia, a major linking hub in the Linking Open Data cloud. We propose and compare various methods for addressing this task, using a mixture of information retrieval and machine learning techniques. Specifically, we present a supervised machine learning-based method to determine which concepts are intended by a user issuing a query. The concepts are obtained from an ontology and may be used to provide contextual information, related concepts, or navigational suggestions to the user submitting the query. Our approach first ranks candidate concepts using a language modeling for information retrieval framework. We then extract query, concept, and search-history feature vectors for these concepts. Using manual annotations we inform a machine learning algorithm that learns how to select concepts from the candidates given an input query. Simply performing a lexical match between the queries and concepts is found to perform poorly and so does using retrieval alone, i.e., omitting the concept selection stage. Our proposed method significantly improves upon these baselines and we find that support vector machines are able to achieve the best performance out of the machine learning algorithms evaluated.

  • [PDF] [DOI] E. Meij, M. Bron, L. Hollink, B. Huurnink, and M. de Rijke, “Mapping queries to the Linking Open Data cloud: a case study using DBpedia,” Web semantics: science, services and agents on the world wide web, vol. 9, iss. 4, pp. 418-433, 2011.
    [Bibtex]
    @article{JWS:2011:meij,
    Abstract = {We introduce the task of mapping search engine queries to DBpedia, a major linking hub in the Linking Open Data cloud. We propose and compare various methods for addressing this task, using a mixture of information retrieval and machine learning techniques. Specifically, we present a supervised machine learning-based method to determine which concepts are intended by a user issuing a query. The concepts are obtained from an ontology and may be used to provide contextual information, related concepts, or navigational suggestions to the user submitting the query. Our approach first ranks candidate concepts using a language modeling for information retrieval framework. We then extract query, concept, and search-history feature vectors for these concepts. Using manual annotations we inform a machine learning algorithm that learns how to select concepts from the candidates given an input query. Simply performing a lexical match between the queries and concepts is found to perform poorly and so does using retrieval alone, i.e., omitting the concept selection stage. Our proposed method significantly improves upon these baselines and we find that support vector machines are able to achieve the best performance out of the machine learning algorithms evaluated.},
    Author = {Edgar Meij and Marc Bron and Laura Hollink and Bouke Huurnink and Maarten de Rijke},
    Date-Added = {2011-11-25 08:45:19 +0100},
    Date-Modified = {2012-10-28 21:59:08 +0000},
    Doi = {10.1016/j.websem.2011.04.001},
    Issn = {1570-8268},
    Journal = {Web Semantics: Science, Services and Agents on the World Wide Web},
    Keywords = {Information retrieval},
    Number = {4},
    Pages = {418 - 433},
    Title = {Mapping queries to the {Linking Open Data} cloud: A case study using {DBpedia}},
    Url = {http://www.sciencedirect.com/science/article/pii/S1570826811000187},
    Volume = {9},
    Year = {2011},
    Bdsk-Url-1 = {http://www.sciencedirect.com/science/article/pii/S1570826811000187},
    Bdsk-Url-2 = {http://dx.doi.org/10.1016/j.websem.2011.04.001}}
Trade-off between diversity and precision

Result diversification based on query-specific cluster ranking

Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters to be included in a ranked list in a round-robin fashion. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework: ranking clusters and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, while there should be no dominantly large clusters. Also, documents from these high-quality clusters should have diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework.
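The round-robin selection mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `round_robin_diversify` is hypothetical, and the sketch assumes that clusters have already been ranked and filtered and that the documents inside each cluster are ordered by retrieval score.

```python
def round_robin_diversify(ranked_clusters, k):
    """Build a ranked list of up to `k` documents by cycling over the clusters.

    `ranked_clusters` is a list of clusters, best cluster first; each cluster
    is a list of document ids, best document first (both orderings are assumed
    to come from earlier ranking steps).
    """
    result = []
    seen = set()
    depth = 0  # position inside each cluster visited in the current pass
    while len(result) < k and any(depth < len(c) for c in ranked_clusters):
        for cluster in ranked_clusters:
            if depth < len(cluster):
                doc = cluster[depth]
                if doc not in seen:  # skip documents assigned to several clusters
                    seen.add(doc)
                    result.append(doc)
                    if len(result) == k:
                        break
        depth += 1
    return result


# Three hypothetical clusters, already ranked by estimated quality.
clusters = [["d1", "d4", "d6"], ["d2", "d5"], ["d3"]]
print(round_robin_diversify(clusters, 5))  # ['d1', 'd2', 'd3', 'd4', 'd5']
```

Each pass takes the next-best document from every cluster in turn, so the top of the list covers all selected clusters before any cluster contributes a second document.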

  • [PDF] [DOI] J. He, E. Meij, and M. de Rijke, “Result diversification based on query-specific cluster ranking,” J. am. soc. inf. sci., vol. 62, iss. 3, pp. 550–571, 2011.
    [Bibtex]
    @article{JASIST:2011:he,
    Abstract = {Result diversification is a retrieval strategy for dealing with ambiguous or multi-faceted queries by providing documents that cover as many facets of the query as possible. We propose a result diversification framework based on query-specific clustering and cluster ranking, in which diversification is restricted to documents belonging to clusters that potentially contain a high percentage of relevant documents. Empirical results show that the proposed framework improves the performance of several existing diversification methods. The framework also gives rise to a simple yet effective cluster-based approach to result diversification that selects documents from different clusters to be included in a ranked list in a round robin fashion. We describe a set of experiments aimed at thoroughly analyzing the behavior of the two main components of the proposed diversification framework, ranking and selecting clusters for diversification. Both components have a crucial impact on the overall performance of our framework, but ranking clusters plays a more important role than selecting clusters. We also examine properties that clusters should have in order for our diversification framework to be effective. Most relevant documents should be contained in a small number of high-quality clusters, while there should be no dominantly large clusters. Also, documents from these high-quality clusters should have a diverse content. These properties are strongly correlated with the overall performance of the proposed diversification framework.},
    Address = {New York, NY, USA},
    Author = {He, Jiyin and Meij, Edgar and de Rijke, Maarten},
    Citeulike-Article-Id = {9425102},
    Citeulike-Linkout-0 = {http://portal.acm.org/citation.cfm?id=1952338},
    Citeulike-Linkout-1 = {http://dx.doi.org/10.1002/asi.21468},
    Date-Added = {2011-10-20 10:40:50 +0200},
    Date-Modified = {2012-10-28 21:59:28 +0000},
    Doi = {10.1002/asi.21468},
    Issn = {1532-2882},
    Journal = {J. Am. Soc. Inf. Sci.},
    Keywords = {todo},
    Number = {3},
    Pages = {550--571},
    Posted-At = {2011-10-20 09:40:35},
    Priority = {2},
    Publisher = {Wiley Subscription Services, Inc., A Wiley Company},
    Title = {Result diversification based on query-specific cluster ranking},
    Url = {http://dx.doi.org/10.1002/asi.21468},
    Volume = {62},
    Year = {2011},
    Bdsk-Url-1 = {http://dx.doi.org/10.1002/asi.21468}}
TREC

The University of Amsterdam at TREC 2010: Session, Entity, and Relevance Feedback

We describe the participation of the University of Amsterdam’s ILPS group in the Session, Entity, and Relevance Feedback tracks at TREC 2010. In the Session Track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity Track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models and explore the use of Freebase for type filtering, entity normalization, and homepage finding. In the ELC task we rank candidates by the number of links they share with the example entities. In the Relevance Feedback Track we experiment with a novel model that uses Wikipedia to expand the query language model.
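The ELC ranking idea, scoring a candidate by the links it shares with the example entities, can be sketched as follows. This is a minimal sketch under stated assumptions and not the system described in the paper: the function name `rank_by_shared_links`, the representation of links as sets of target identifiers, and the choice to sum the overlap over all example entities are hypothetical.

```python
def rank_by_shared_links(candidate_links, example_links):
    """Rank candidate entities by the number of links shared with example entities.

    Both arguments map an entity id to the set of link targets of that entity.
    A candidate's score is its link overlap summed over all example entities
    (one possible reading of "shared links", assumed here).
    """
    scored = []
    for candidate, links in candidate_links.items():
        score = sum(len(links & ex_links) for ex_links in example_links.values())
        scored.append((candidate, score))
    # Highest score first; ties are broken alphabetically for reproducibility.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical link sets for two example entities and three candidates.
examples = {"e1": {"a", "b", "c"}, "e2": {"b", "c", "d"}}
candidates = {"c1": {"b", "c"}, "c2": {"a"}, "c3": {"x", "y"}}
ranking = rank_by_shared_links(candidates, examples)
print(ranking)
```

Summing over examples rewards candidates that resemble many of the example entities; counting distinct shared targets instead would be an equally plausible variant.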

  • [PDF] M. Bron, J. He, K. Hofmann, E. Meij, M. de Rijke, E. Tsagkias, and W. Weerkamp, “The University of Amsterdam at TREC 2010: session, entity, and relevance feedback,” in The nineteenth text retrieval conference, 2011.
    [Bibtex]
    @inproceedings{TREC:2011:bron,
    Abstract = {We describe the participation of the University of Amsterdam's ILPS group in the Session, Entity, and Relevance Feedback tracks at TREC 2010. In the Session Track we explore the use of blind relevance feedback to bias a follow-up query towards or against the topics covered in documents returned to the user in response to the original query. In the Entity Track REF task we experiment with a window size parameter to limit the amount of context considered by the entity co-occurrence models and explore the use of Freebase for type filtering, entity normalization, and homepage finding. In the ELC task we rank candidates by the number of links they share with the example entities. In the Relevance Feedback Track we experiment with a novel model that uses Wikipedia to expand the query language model.},
    Author = {M. Bron and He, J. and Hofmann, K. and Meij, E. and de Rijke, M. and Tsagkias, E. and Weerkamp, W.},
    Booktitle = {The Nineteenth Text REtrieval Conference},
    Date-Added = {2011-10-20 11:18:35 +0200},
    Date-Modified = {2012-10-30 09:25:06 +0000},
    Series = {TREC 2010},
    Title = {{The University of Amsterdam at TREC 2010}: Session, Entity, and Relevance Feedback},
    Year = {2011}}