CIKM 2014

Time-Aware Rank Aggregation for Microblog Search

We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
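To make the rank aggregation setting concrete, here is a minimal sketch of a classic baseline, the Borda count, which merges several ranked lists into one. This is not TimeRA: it uses no latent factor model and no burst-based reward. The function name `borda_aggregate` and the toy post IDs are illustrative assumptions.

```python
from collections import defaultdict


def borda_aggregate(ranked_lists):
    """Merge ranked lists of post IDs with a Borda count (baseline, not TimeRA)."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        n = len(ranking)
        for position, post_id in enumerate(ranking):
            # The top post in a list of n earns n points, the last earns 1.
            scores[post_id] += n - position
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda p: (-scores[p], p))


# Three hypothetical rankers, each returning its own result list.
rankers = [
    ["t1", "t2", "t3"],
    ["t2", "t1", "t4"],
    ["t2", "t3", "t1"],
]
merged = borda_aggregate(rankers)
print(merged)
```

A post that is missing from a list simply earns no points from that list, which is one of the gaps a method that infers missing rank scores is meant to address.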

Time series

OpenGeist: Insight in the Stream of Page Views on Wikipedia

We present a RESTful interface that captures insights into the zeitgeist of Wikipedia users. In recent years many so-called zeitgeist applications have been launched. Such applications are used to gain insight into what currently occupies society and into current affairs. Several news sources run zeitgeist applications for popular and trending news. In addition, there are zeitgeist applications that report on trending publications, such as LibraryThing, and on trending topics, such as Google Zeitgeist. There is an interesting open data source from which a stream of people’s changing interests can be observed across a very broad spectrum of areas: the Wikimedia access logs. These logs contain the number of requests made to any Wikimedia domain, sorted by subdomain and aggregated on an hourly basis. Since they are a log of the actual requests, they are noisy and can also contain requests for non-existent pages. They are also quite large, amounting to 60 GB of compressed textual data per month. Currently, we update the data on a daily basis and filter the raw source data by matching the URLs of all English Wikipedia articles and their redirects.
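The filtering step can be sketched as follows. This is a minimal illustration, not the system's actual code. It assumes each log line has four whitespace-separated fields (project code, page title, request count, bytes transferred) and that a hypothetical `canonical` dictionary maps every known article title and redirect to its article.

```python
from collections import Counter


def hourly_views(lines, canonical, project="en"):
    """Sum the views in one hourly log file, folding redirects into their articles."""
    views = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # malformed line: part of the noise in the raw logs
        proj, title, count, _size = parts
        target = canonical.get(title)
        if proj != project or target is None or not count.isdigit():
            continue  # other project, non-existent page, or a bad count
        views[target] += int(count)
    return views


# Hypothetical sample lines in the assumed format.
sample = [
    "en Zeitgeist 12 34567",
    "en Spirit_of_the_age 3 9000",  # treated as a redirect in this example
    "de Zeitgeist 7 20000",         # not English Wikipedia
    "en No_such_page_xyz 1 500",    # request for a page that does not exist
    "garbage",
]
canonical = {"Zeitgeist": "Zeitgeist", "Spirit_of_the_age": "Zeitgeist"}
views = hourly_views(sample, canonical)
print(dict(views))
```

Matching against a known title set discards requests for non-existent pages in the same pass that resolves redirects.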

In this paper we describe an API that facilitates easy access to the access logs. We have identified the following requirements that our system should meet:

  • The user must have access to the raw time series data for a concept.
  • The user must be able to find the N most temporally similar concepts.
  • The user must be able to group concepts and their data, based either on the category system of Wikipedia or on similarity between concepts.
  • The system must return either a textual or a visual representation.
  • The user should be able to apply time series filters to extract trends and (recurring) events.
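As an illustration of the last requirement, a trailing moving average is one simple time series filter for extracting a trend from hourly view counts. The source does not say which filters the system offers, so this choice, the function name, and the window size are assumptions.

```python
def moving_average(series, window=24):
    """Smooth a series with a trailing moving average over `window` points."""
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    running = 0.0
    for i, value in enumerate(series):
        running += value
        if i >= window:
            # Drop the value that just left the window.
            running -= series[i - window]
        # Early points are averaged over the values seen so far.
        smoothed.append(running / min(i + 1, window))
    return smoothed


trend = moving_average([2, 4, 6, 8], window=2)
print(trend)
```

Subtracting the trend from the raw counts leaves a residual in which short spikes, such as events, stand out.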

The API is an interface for clustering and comparing concepts based on the time series of the number of views of their Wikipedia page.
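Comparing concepts by their view counts can be sketched as below. The similarity measure here is Pearson correlation, chosen only for illustration since the source does not name the measure the API uses. The function names and toy series are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equally long series (0.0 if either is constant)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("series must have the same length of at least 2")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def most_similar(target, series_by_concept, n=5):
    """Return the n concepts whose view series correlate best with the target's."""
    reference = series_by_concept[target]
    scored = [
        (pearson(reference, series), name)
        for name, series in series_by_concept.items()
        if name != target
    ]
    # Highest correlation first; ties broken by concept name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [(name, score) for score, name in scored[:n]]


# Toy page view series, one value per time step.
page_views = {
    "Olympics": [1, 2, 9, 2],
    "Sprint": [2, 4, 18, 4],
    "Tax": [5, 5, 5, 5],
    "Winter": [9, 2, 1, 2],
}
neighbours = most_similar("Olympics", page_views, n=2)
print(neighbours)
```

Correlation ignores the absolute number of views, so a niche page can match a popular one when their shapes over time agree.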

See for more info and examples.

  • M-H. Peetz, E. Meij, and M. de Rijke, “OpenGeist: Insight in the Stream of Page Views on Wikipedia,” in SIGIR 2012 Workshop on Time-aware Information Access, 2012.