Example entity linking for tweets, to support tweets summarization

Personalized Time-Aware Tweets Summarization

To appear as full paper at SIGIR 2013.

In this paper we focus on selecting meaningful tweets given a user’s interests. Specifically, we consider the task of time-aware tweets summarization, based on a user’s history and collaborative social influences from “social circles.”

Overview of RepLab 2012: Evaluating Online Reputation Management Systems

This paper summarizes the goals, organization and results of the first RepLab competitive evaluation campaign for Online Reputation Management Systems (RepLab 2012). RepLab focused on the reputation of companies, and asked participant systems to annotate different types of information on tweets containing the names of several companies. Two tasks were proposed: a profiling task, where tweets had to be annotated for relevance and polarity for reputation, and a monitoring task, where tweets had to be clustered thematically and clusters had to be ordered by priority (for reputation management purposes). The gold standard consisted of annotations made by reputation management experts, a feature which turns the RepLab 2012 test collection into a useful source not only to evaluate systems, but also to reach a better understanding of the notions of polarity and priority in the context of reputation management.

  • [PDF] E. Amigó, A. Corujo, J. Gonzalo, E. Meij, and M. de Rijke, “Overview of RepLab 2012: evaluating online reputation management systems,” in CLEF (Online Working Notes/Labs/Workshop), 2012.
    [Bibtex]
    @inproceedings{CLEF:2012:replab,
    Author = {Enrique Amig{\'o} and Adolfo Corujo and Julio Gonzalo and Edgar Meij and Maarten de Rijke},
    Booktitle = {CLEF (Online Working Notes/Labs/Workshop)},
    Date-Added = {2012-09-20 12:48:33 +0000},
    Date-Modified = {2012-10-30 09:30:49 +0000},
    Title = {Overview of {RepLab} 2012: Evaluating Online Reputation Management Systems},
    Year = {2012}}

Generating Pseudo Test Collections for Learning to Rank Scientific Articles

Pseudo test collections are automatically generated to provide training material for learning to rank methods. We propose a method for generating pseudo test collections in the domain of digital libraries, where data is relatively sparse, but comes with rich annotations. Our intuition is that documents are annotated to make them easier to find for certain information needs. We use these annotations and the associated documents as a source for pairs of queries and relevant documents. We investigate how learning to rank performance varies when we use different methods for sampling annotations, and show how our pseudo test collection ranks systems compared to editorial topics with editorial judgments. Our results demonstrate that it is possible to train a learning to rank algorithm on generated pseudo judgments. In some cases, performance is on par with learning on manually obtained ground truth.
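The core idea of turning annotations into query/relevant-document pairs can be illustrated with a minimal sketch. This is not the method from the paper: the function name `build_pseudo_collection`, the input layout, and the uniform random sampling of annotations are all assumptions made for illustration (the paper compares several sampling methods that are not described here).

```python
import random
from collections import defaultdict


def build_pseudo_collection(docs, sample_size=None, seed=0):
    """Turn document annotations into pseudo queries with pseudo judgments.

    docs: mapping of document id -> list of annotation strings (assumed layout).
    Returns a mapping of query -> sorted list of document ids treated as relevant.
    """
    by_annotation = defaultdict(set)
    for doc_id, annotations in docs.items():
        for annotation in annotations:
            # Each annotation acts as a query; its documents act as relevant.
            by_annotation[annotation.lower()].add(doc_id)

    queries = sorted(by_annotation)
    if sample_size is not None and sample_size < len(queries):
        # Assumption: uniform sampling, used here only as a placeholder.
        queries = sorted(random.Random(seed).sample(queries, sample_size))

    return {query: sorted(by_annotation[query]) for query in queries}


# Hypothetical toy data.
docs = {
    "d1": ["information retrieval", "learning to rank"],
    "d2": ["Learning to Rank"],
    "d3": ["digital libraries"],
}
collection = build_pseudo_collection(docs)
```

The resulting pairs could then be fed to any learning to rank toolkit as training judgments; how well that works depends on the sampling strategy, which is what the paper investigates.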

  • [PDF] R. Berendsen, M. Tsagkias, M. de Rijke, and E. Meij, “Generating pseudo test collections for learning to rank scientific articles,” in Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics – Third International Conference of the CLEF Initiative, CLEF 2012, 2012.
    [Bibtex]
    @inproceedings{CLEF:2012:berendsen,
    Author = {Berendsen, Richard and Tsagkias, Manos and de Rijke, Maarten and Meij, Edgar},
    Booktitle = {Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics - Third International Conference of the CLEF Initiative, CLEF 2012},
    Date-Added = {2012-07-03 13:44:06 +0200},
    Date-Modified = {2012-10-30 08:37:52 +0000},
    Title = {Generating Pseudo Test Collections for Learning to Rank Scientific Articles},
    Year = {2012}}

OpenGeist: Insight in the Stream of Page Views on Wikipedia

We present a RESTful interface that captures insights into the zeitgeist of Wikipedia users. In recent years many so-called zeitgeist applications have been launched. Such applications are used to gain insights into the current gist of society and current affairs. Several news sources run zeitgeist applications for popular and trending news. In addition, there are zeitgeist applications that report on trending publications, such as LibraryThing, and on trending topics, such as Google Zeitgeist. There is an interesting open data source from which a stream of people’s changing interests can be observed across a very broad spectrum of areas: the Wikimedia access logs. These logs contain the number of requests made to any Wikimedia domain, sorted by subdomain, and aggregated on an hourly basis. Since they are a log of the actual requests, they are noisy and can also contain requests for non-existent pages. They are also quite large, yielding 60 GB worth of compressed textual data per month. Currently, we update the data on a daily basis and filter the raw source data by matching the URLs of all English Wikipedia articles and their redirects.
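The filtering step described above might look roughly like the sketch below. It is not the OpenGeist implementation. It assumes each hourly log line has four space-separated fields (project code, page title, request count, bytes transferred), and the function name `filter_hourly_counts` and the `known_titles` and `redirects` inputs are hypothetical.

```python
def filter_hourly_counts(lines, known_titles, redirects, project="en"):
    """Keep counts for known articles of one project, folding in redirects.

    lines: iterable of raw log lines, assumed "project title count bytes".
    known_titles: set of existing article titles.
    redirects: mapping of redirect title -> target article title.
    Returns a mapping of article title -> request count for this hour.
    """
    counts = {}
    for line in lines:
        parts = line.split()
        # Skip malformed lines and other projects (the logs are noisy).
        if len(parts) != 4 or parts[0] != project:
            continue
        title, raw_count = parts[1], parts[2]
        if not raw_count.isdigit():
            continue
        # Resolve redirects, then drop requests for non-existent pages.
        target = redirects.get(title, title)
        if target not in known_titles:
            continue
        counts[target] = counts.get(target, 0) + int(raw_count)
    return counts


# Hypothetical sample of one hour of log lines.
sample_lines = [
    "en Amsterdam 120 4000",
    "en A'dam 5 300",
    "en No_such_page 3 100",
    "de Amsterdam 40 900",
    "garbage",
]
hourly = filter_hourly_counts(
    sample_lines,
    known_titles={"Amsterdam"},
    redirects={"A'dam": "Amsterdam"},
)
```

Appending one such dictionary per hour yields the per-concept time series that the rest of the system works with.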

In this paper we describe an API that facilitates easy access to the access logs. We have identified the following requirements for our system:

  • The user must have access to the raw time series data for a concept.
  • The user must be able to find the N most temporally similar concepts.
  • The user must be able to group concepts and their data, based either on the category system of Wikipedia or on similarity between concepts.
  • The system must return either a textual or a visual representation.
  • The user should be able to apply time series filters to extract trends and (recurring) events.

The API is an interface for clustering and comparing concepts based on the time series of the number of views of their Wikipedia page.
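As an illustration of the second requirement (finding the N most temporally similar concepts), here is a minimal sketch. The text does not say which similarity measure the API uses, so Pearson correlation between equal-length page view series is an assumption, and the names `pearson` and `most_similar` are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def most_similar(concept, series, n=2):
    """Return the n concepts whose series correlate best with `concept`."""
    target = series[concept]
    scored = [
        (pearson(target, values), name)
        for name, values in series.items()
        if name != concept
    ]
    scored.sort(reverse=True)  # highest correlation first
    return [name for _, name in scored[:n]]


# Hypothetical page view counts over five time steps.
series = {
    "Christmas": [1, 1, 2, 9, 1],
    "Santa_Claus": [2, 2, 3, 10, 2],
    "Easter": [1, 8, 1, 1, 1],
}
similar = most_similar("Christmas", series, n=1)
```

Correlation ignores differences in absolute traffic, which seems appropriate when comparing a very popular page with a niche one; other measures, such as dynamic time warping, would tolerate peaks that are shifted in time.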

See http://www.opengeist.org for more info and examples.

  • [PDF] M-H. Peetz, E. Meij, and M. de Rijke, “OpenGeist: insight in the stream of page views on Wikipedia,” in SIGIR 2012 Workshop on Time-aware Information Access, 2012.
    [Bibtex]
    @inproceedings{SIGIR-WS:2012:Peetz,
    Author = {Peetz, M-H. and Meij, E. and de Rijke, M.},
    Booktitle = {SIGIR 2012 Workshop on Time-aware Information Access},
    Date-Added = {2012-10-28 16:35:47 +0000},
    Date-Modified = {2012-10-31 10:48:46 +0000},
    Title = {{OpenGeist}: Insight in the Stream of Page Views on {Wikipedia}},
    Year = {2012}}