Example entity linking for tweets, to support tweets summarization

Personalized Time-Aware Tweets Summarization

To appear as a full paper at SIGIR 2013.

In this paper we focus on selecting meaningful tweets given a user’s interests. Specifically, we consider the task of time-aware tweets summarization, based on a user’s history and collaborative social influences from “social circles.”

Generating Pseudo Test Collections for Learning to Rank Scientific Articles

Pseudo test collections are automatically generated to provide training material for learning to rank methods. We propose a method for generating pseudo test collections in the domain of digital libraries, where data is relatively sparse, but comes with rich annotations. Our intuition is that documents are annotated to make them better findable for certain information needs. We use these annotations and the associated documents as a source for pairs of queries and relevant documents. We investigate how learning to rank performance varies when we use different methods for sampling annotations, and show how our pseudo test collection ranks systems compared to editorial topics with editorial judgements. Our results demonstrate that it is possible to train a learning to rank algorithm on generated pseudo judgments. In some cases, performance is on par with learning on manually obtained ground truth.
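The core idea above, using annotations and their documents as pairs of queries and relevant documents, can be sketched as follows. This is a minimal illustration only, not the paper's method: the input format, the `build_pseudo_collection` name, and the `min_docs` filter are all assumptions made for the example, and the sampling strategies studied in the paper are not reproduced.

```python
from collections import defaultdict


def build_pseudo_collection(docs, min_docs=2):
    """Turn document annotations into (query -> relevant doc ids) pairs.

    docs: mapping doc_id -> iterable of annotation strings (hypothetical format).
    min_docs: drop annotations attached to fewer documents (assumed filter).
    """
    by_annotation = defaultdict(set)
    for doc_id, annotations in docs.items():
        for annotation in annotations:
            # Normalize so "Migration" and "migration" become one pseudo query.
            by_annotation[annotation.strip().lower()].add(doc_id)
    return {
        query: sorted(doc_ids)
        for query, doc_ids in by_annotation.items()
        if len(doc_ids) >= min_docs
    }


# Toy data, invented for illustration.
docs = {
    "d1": ["Labour market", "Migration"],
    "d2": ["migration", "Housing"],
    "d3": ["labour market"],
}
pseudo_qrels = build_pseudo_collection(docs)
```

Each key of `pseudo_qrels` acts as a pseudo query and its value as the set of documents judged relevant, which is the shape of training data a learning to rank method expects.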

  • [PDF] R. Berendsen, M. Tsagkias, M. de Rijke, and E. Meij, “Generating pseudo test collections for learning to rank scientific articles,” in Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics – Third International Conference of the CLEF Initiative, CLEF 2012, 2012.
    [Bibtex]
    @inproceedings{CLEF:2012:berendsen,
    Author = {Berendsen, Richard and Tsagkias, Manos and de Rijke, Maarten and Meij, Edgar},
    Booktitle = {Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics - Third International Conference of the CLEF Initiative, CLEF 2012},
    Date-Added = {2012-07-03 13:44:06 +0200},
    Date-Modified = {2012-10-30 08:37:52 +0000},
    Title = {Generating Pseudo Test Collections for Learning to Rank Scientific Articles},
    Year = {2012}}
Twitter aspects

Identifying Entity Aspects in Microblog Posts

Online reputation management is about monitoring and handling the public image of entities (such as companies) on the Web. An important task in this area is identifying aspects of the entity of interest (such as products, services, competitors, key people, etc.) given a stream of microblog posts referring to the entity. In this paper we compare different IR techniques and opinion target identification methods for automatically identifying aspects and find that (i) simple statistical methods such as TF.IDF are a strong baseline for the task, being significantly better than opinion-oriented methods, and (ii) only considering terms tagged as nouns improves the results for all the methods analyzed.
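To make the TF.IDF baseline with a noun filter concrete, here is a small sketch. It assumes posts arrive already tokenized and part-of-speech tagged as `(token, tag)` pairs with Penn Treebank style tags, which is an assumption for this example; the `rank_aspects` function, the exact TF.IDF weighting, and the toy data are illustrative and not taken from the paper.

```python
import math
from collections import Counter


def rank_aspects(entity_posts, background_posts, nouns_only=True, top_k=5):
    """Rank candidate aspect terms for an entity by a simple TF.IDF score.

    Each post is a list of (token, pos_tag) pairs (hypothetical input format).
    """
    def terms(post):
        # Keep only noun-tagged tokens when nouns_only is set.
        return [tok.lower() for tok, tag in post
                if not nouns_only or tag.startswith("N")]

    # Term frequency over the posts that mention the entity.
    tf = Counter(t for post in entity_posts for t in terms(post))
    # Document frequency over entity posts plus a background sample.
    all_posts = entity_posts + background_posts
    df = Counter(t for post in all_posts for t in set(terms(post)))
    n = len(all_posts)
    scores = {t: tf[t] * math.log(n / df[t]) for t in tf}
    # Sort by descending score, breaking ties alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))[:top_k]


# Toy tagged posts, invented for illustration.
entity_posts = [
    [("new", "JJ"), ("battery", "NN"), ("drains", "VBZ"), ("fast", "RB")],
    [("battery", "NN"), ("life", "NN"), ("is", "VBZ"), ("short", "JJ")],
]
background_posts = [
    [("life", "NN"), ("is", "VBZ"), ("good", "JJ")],
    [("weather", "NN"), ("is", "VBZ"), ("nice", "JJ")],
]
aspects = rank_aspects(entity_posts, background_posts)
```

Terms that are frequent in the entity's posts but rare in the background sample rise to the top, and the noun filter removes verbs and adjectives before scoring.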

More information on the dataset that we created (and used in this paper) can be found here.

  • [PDF] D. Spina, E. Meij, M. de Rijke, A. Oghina, B. M. Thuong, and M. Breuss, “Identifying entity aspects in microblog posts,” in The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2012.
    [Bibtex]
    @inproceedings{SIGIR:2012:spina,
    Author = {Damiano Spina and Meij, Edgar and de Rijke, Maarten and Andrei Oghina and Bui Minh Thuong and Mathias Breuss},
    Booktitle = {The 35th International ACM SIGIR conference on research and development in Information Retrieval},
    Date-Added = {2012-05-03 22:17:17 +0200},
    Date-Modified = {2012-10-30 08:40:47 +0000},
    Series = {SIGIR 2012},
    Title = {Identifying Entity Aspects in Microblog Posts},
    Year = {2012}}
Research on Twitter

Adding Semantics to Microblog Posts

Microblogs have become an important source of information for marketing, intelligence, and reputation management purposes. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication.

We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to them and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision.
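As a rough illustration of what linking a post to Wikipedia articles involves, the sketch below looks up word n-grams of a post in a small dictionary of anchor phrases. This covers candidate generation only and is an assumption made for the example: the `candidate_concepts` function and the toy `anchors` dictionary are hypothetical, and the machine learning step and features described above are not shown.

```python
def candidate_concepts(post, anchors, max_n=3):
    """Return Wikipedia article titles whose anchor phrase occurs in the post.

    anchors: dict mapping a lowercase phrase to an article title
             (toy stand-in for a real anchor-text lexicon).
    """
    tokens = post.lower().split()
    found = []
    # Try longer n-grams first so multi-word phrases are listed before
    # single words.
    for n in range(max_n, 0, -1):
        for i in range(len(tokens) - n + 1):
            phrase = " ".join(tokens[i:i + n])
            title = anchors.get(phrase)
            if title and title not in found:
                found.append(title)
    return found


# Toy lexicon and post, invented for illustration.
anchors = {"new york": "New York City", "marathon": "Marathon"}
links = candidate_concepts("Running the New York marathon tomorrow", anchors)
```

In a full system, each candidate would then be scored and either kept or discarded; that ranking step is where a learned model would come in.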

  • [PDF] E. Meij, W. Weerkamp, and M. de Rijke, “Adding semantics to microblog posts,” in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining, 2012.
    [Bibtex]
    @inproceedings{WSDM:2012:meij,
    Author = {Meij, Edgar and Weerkamp, Wouter and de Rijke, Maarten},
    Booktitle = {Proceedings of the fifth ACM international conference on Web search and data mining},
    Date-Added = {2015-01-20 20:28:31 +0000},
    Date-Modified = {2015-01-20 20:28:31 +0000},
    Series = {WSDM 2012},
    Title = {Adding Semantics to Microblog Posts},
    Year = {2012},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1935826.1935842}}
Plot of a query-specific burst

Adaptive Temporal Query Modeling

We present an approach to query modeling that uses the temporal distribution of documents in an initially retrieved set of documents. Such distributions tend to exhibit bursts, especially in news-related document collections. We hypothesize that documents in those bursts are more likely to be relevant than others. Predicated on this, we expand queries with the most distinguishing terms in high-quality documents sampled from bursts. We show how the most commonly used decay function for recent document retrieval can be used as a probabilistic model for temporal retrieval in general. The effectiveness of our models is demonstrated on both news collections and a collection of blog posts.
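The two temporal ingredients mentioned above, bursts in the retrieved set and a decay function, can be sketched as follows. Both parts rest on assumptions made for this example: a burst is taken to be any day whose document count exceeds the mean by `threshold` standard deviations, and the decay is taken to be exponential. The paper's actual burst definition and decay function may differ.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def burst_days(doc_days, threshold=1.0):
    """Return the days whose document count is unusually high.

    doc_days: one integer day index per retrieved document (hypothetical input).
    """
    counts = Counter(doc_days)
    first, last = min(counts), max(counts)
    # Build the full time series, including days without any document.
    series = [counts[day] for day in range(first, last + 1)]
    cutoff = mean(series) + threshold * pstdev(series)
    return [day for day in range(first, last + 1) if counts[day] > cutoff]


def exponential_decay(age_days, rate=0.1):
    """Exponential decay weight for a document that is age_days old (assumed form)."""
    return rate * math.exp(-rate * age_days)


# Publication days of an initially retrieved set, invented for illustration.
doc_days = [1, 2, 2, 3, 5, 5, 5, 5, 5, 6, 8]
bursts = burst_days(doc_days)
```

Documents published on the days in `bursts` would be the pool from which expansion terms are drawn, while `exponential_decay` gives older documents a smaller weight.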

  • [PDF] M. Peetz, E. Meij, M. de Rijke, and W. Weerkamp, “Adaptive temporal query modeling,” in Advances in Information Retrieval – 34th European Conference on IR Research, ECIR 2012, 2012.
    [Bibtex]
    @inproceedings{ECIR:2012:peetz,
    Author = {Peetz, Maria-Hendrike and Meij, Edgar and de Rijke, Maarten and Weerkamp, Wouter},
    Booktitle = {Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012},
    Date-Added = {2011-11-23 18:10:40 +0100},
    Date-Modified = {2012-10-28 23:01:12 +0000},
    Title = {Adaptive Temporal Query Modeling},
    Year = {2012}}
social media icons

A Framework for Unsupervised Spam Detection in Social Networking Sites

Social networking sites offer users the option to submit user spam reports for a given message, indicating this message is inappropriate. In this paper we present a framework that uses these user spam reports for spam detection. The framework is based on the HITS web link analysis framework and is instantiated in three models. The models successively introduce propagation between messages reported by the same user, messages authored by the same user, and messages with similar content. Each of the models can also be converted to a simple semi-supervised scheme. We test our models on data from a popular social network and compare the models to two baselines, based on message content and raw report counts. We find that our models outperform both baselines and that each of the additions (reporters, authors, and similar messages) further improves the performance of the framework.
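The HITS-style intuition behind the framework can be sketched on a bipartite graph of reporters and messages: a message looks more like spam when it is reported by reliable reporters, and a reporter looks more reliable when the messages they report look like spam. The code below is an illustrative simplification under that assumption. It is not one of the paper's three models, and the `hits_spam_scores` function and toy reports are hypothetical.

```python
def hits_spam_scores(reports, iterations=20):
    """Iteratively score messages and reporters from (reporter, message) pairs."""
    reports = set(reports)  # ignore duplicate reports
    reporters = {r for r, _ in reports}
    messages = {m for _, m in reports}
    reporter_score = dict.fromkeys(reporters, 1.0)
    message_score = dict.fromkeys(messages, 0.0)

    def normalize(scores):
        # Scale to unit Euclidean length, as in standard HITS.
        norm = sum(v * v for v in scores.values()) ** 0.5
        return {k: v / norm for k, v in scores.items()}

    for _ in range(iterations):
        # Message score: sum of the scores of the users who reported it.
        message_score = dict.fromkeys(messages, 0.0)
        for r, m in reports:
            message_score[m] += reporter_score[r]
        message_score = normalize(message_score)
        # Reporter score: sum of the scores of the messages they reported.
        reporter_score = dict.fromkeys(reporters, 0.0)
        for r, m in reports:
            reporter_score[r] += message_score[m]
        reporter_score = normalize(reporter_score)
    return message_score, reporter_score


# Toy spam reports, invented for illustration.
reports = [("u1", "m1"), ("u2", "m1"), ("u3", "m1"), ("u1", "m2")]
message_score, reporter_score = hits_spam_scores(reports)
```

Extending the graph with edges between messages by the same author or with similar content would be the natural next step, in line with the additions the abstract describes.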

  • [PDF] M. Bosma, E. Meij, and W. Weerkamp, “A framework for unsupervised spam detection in social networking sites,” in Advances in Information Retrieval – 34th European Conference on IR Research, ECIR 2012, 2012.
    [Bibtex]
    @inproceedings{ECIR:2012:bosma,
    Author = {Maarten Bosma and Meij, Edgar and Weerkamp, Wouter},
    Booktitle = {Advances in Information Retrieval - 34th European Conference on IR Research, ECIR 2012},
    Date-Added = {2011-11-23 18:10:33 +0100},
    Date-Modified = {2012-10-28 23:00:37 +0000},
    Title = {A Framework for Unsupervised Spam Detection in Social Networking Sites},
    Year = {2012}}
hits per time of day

People searching for people: analysis of a people search engine log

Recent years show an increasing interest in vertical search: searching within a particular type of information. Understanding what people search for in these “verticals” gives direction to research and provides pointers for the search engines themselves. In this paper we analyze the search logs of one particular vertical: people search engines. Based on an extensive analysis of the logs of a search engine geared towards finding people, we propose a classification scheme for people search at three levels: (a) queries, (b) sessions, and (c) users. For queries, we identify three types, (i) event-based high-profile queries (people that become “popular” because of an event happening), (ii) regular high-profile queries (celebrities), and (iii) low-profile queries (other, less-known people). We present experiments on automatic classification of queries. On the session level, we observe five types: (i) family sessions (users looking for relatives), (ii) event sessions (querying the main players of an event), (iii) spotting sessions (trying to “spot” different celebrities online), (iv) polymerous sessions (sessions without a clear relation between queries), and (v) repetitive sessions (query refinement and copying). Finally, for users we identify four types: (i) monitors, (ii) spotters, (iii) followers, and (iv) polymers.

Our findings not only offer insight into search behavior in people search engines, but they are also useful to identify future research directions and to provide pointers for search engine improvements.

  • [PDF] W. Weerkamp, R. Berendsen, B. Kovachev, E. Meij, K. Balog, and M. de Rijke, “People searching for people: analysis of a people search engine log,” in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information, 2011.
    [Bibtex]
    @inproceedings{sigir:2011:weerkamp,
    Author = {Weerkamp, Wouter and Berendsen, Richard and Kovachev, Bogomil and Meij, Edgar and Balog, Krisztian and de Rijke, Maarten},
    Booktitle = {Proceedings of the 34th international ACM SIGIR conference on Research and development in Information},
    Date-Added = {2011-10-20 10:50:25 +0200},
    Date-Modified = {2012-10-30 08:41:27 +0000},
    Series = {SIGIR 2011},
    Title = {People searching for people: analysis of a people search engine log},
    Year = {2011},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/2009916.2009927}}


Dynamic term cloud screenshot

Online Religious Studies

Data transitions have revolutionized many scientific disciplines, starting with the exact sciences and then the life sciences. The social sciences and humanities are now in the process of becoming data-intensive sciences, with descriptions through quantitative measurements. New analysis tools and publicly accessible utterances, opinions, transactions and interactions resulting from widespread Internet and social media usage facilitate new, data-intensive research methods in disciplines that have so far relied on small-scale literature and/or panel-based studies. To illustrate the new possibilities, we report on a pilot carried out by a cross-disciplinary team consisting of computer scientists and researchers in religious studies. In the latter area, research is often focused on mapping out the convictions, hopes, and beliefs of groups of people, be it within certain religions or within any other group, such as those defined by a political party.

In the pilot, religious scholars examined the core keywords in a left-wing political party in order to determine their hopes and beliefs. Rather than following their standard way of working, they were equipped with a search engine with an index of content crawled from discussion forums, the party’s web site, and a range of online publications relating to the party and going back to 1990. In this paper we focus on lessons learned and on methodological innovations for religious scholars as well as for computer scientists building the enabling technology.

  • [PDF] J. Bekkenkamp, E. Meij, and M. de Rijke, “Online religious studies,” in Web Science 2011, Koblenz, 2011.
    [Bibtex]
    @inproceedings{websci:2011:meij,
    Abstract = {Data transitions have revolutionized many scientific disciplines, starting with the exact sciences, then the life sciences, and now the social sciences and humanities are in the process of making the transition to becoming data intensive sciences, with descriptions through quantitative measurements. New analysis tools and publicly accessible utterances, opinions, transactions and interactions resulting from widespread internet and social media usage facilitate new, data-intensive research methods in disciplines that have so far relied on small-scale literature and/or panel-based studies. To illustrate the new possibilities, we report on a pilot carried out by a cross-disciplinary team consisting of computer scientists and researchers in religious studies. In the latter area, research is often focused on mapping out the convictions, hopes, and beliefs of groups of people, be it within certain religions or within any other group, such as those defined by a political party.
    In the pilot, religious scholars examined the core keywords in a left-wing political party in order to determine their hopes and beliefs. Rather than following their standard way-of-working, they were equipped with a search engine with an index of content crawled from discussion forums, the party's web site plus a range of online publications relating to the party and going back to 1990. In this paper we focus on lessons learned and on methodological innovations for religious scholars as well as for computer scientists building the enabling technology.},
    Address = {Koblenz},
    Author = {Bekkenkamp, J. and Meij, E. and de Rijke, M.},
    Booktitle = {Web Science 2011},
    Date-Added = {2011-10-20 10:49:41 +0200},
    Date-Modified = {2012-10-30 08:39:02 +0000},
    Title = {Online Religious Studies},
    Year = {2011}}