CIKM 2014

Time-Aware Rank Aggregation for Microblog Search

We tackle the problem of searching microblog posts and frame it as a rank aggregation problem where we merge result lists generated by separate rankers so as to produce a final ranking to be returned to the user. We propose a rank aggregation method, TimeRA, that is able to infer the rank scores of documents via latent factor modeling. It is time-aware and rewards posts that are published in or near a burst of posts that are ranked highly in many of the lists being aggregated. Our experimental results show that it significantly outperforms state-of-the-art rank aggregation and time-sensitive microblog search algorithms.
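To make the rank-aggregation setting concrete, here is a toy score-based merge in the style of CombSUM with rank-decayed scores. This only illustrates the general setting of merging result lists from separate rankers; it is not the TimeRA model, which additionally models latent factors and temporal bursts.

```python
# Toy score-based rank aggregation (CombSUM-style, rank-decayed scores).
# Illustrates merging ranked lists from several rankers; NOT TimeRA itself.

def aggregate(result_lists, depth=100):
    """Merge ranked lists of document IDs into one final ranking."""
    scores = {}
    for ranking in result_lists:
        for rank, doc in enumerate(ranking[:depth]):
            # Each ranker contributes a score that decays with rank.
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (rank + 1)
    # Highest aggregated score first.
    return sorted(scores, key=scores.get, reverse=True)
```

Documents that are ranked highly by many rankers accumulate the largest scores; TimeRA goes further by also rewarding posts published in or near bursts of highly ranked posts.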


Identifying Entity Aspects in Microblog Posts

Online reputation management is about monitoring and handling the public image of entities (such as companies) on the Web. An important task in this area is identifying aspects of the entity of interest (such as products, services, competitors, key people, etc.) given a stream of microblog posts referring to the entity. In this paper we compare different IR techniques and opinion target identification methods for automatically identifying aspects and find that (i) simple statistical methods such as TF.IDF are a strong baseline for the task, being significantly better than opinion-oriented methods, and (ii) considering only terms tagged as nouns improves the results for all the methods analyzed.

More information on the dataset that we created (and used in this paper) can be found here.

  • [PDF] D. Spina, E. Meij, M. de Rijke, A. Oghina, B. M. Thuong, and M. Breuss, “Identifying entity aspects in microblog posts,” in The 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2012), 2012.

A Corpus for Entity Profiling in Microblog Posts

Microblogs have become an invaluable source of information for the purpose of online reputation management. An emerging problem in the field of online reputation management consists of identifying the key aspects of an entity commented on in microblog posts. Streams of microblogs are of great value because of their direct and real-time nature, and synthesizing them in the form of entity profiles helps reputation managers keep track of the public image of the entity. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication.

In this paper we present two manually annotated corpora to evaluate the task of identifying aspects on Twitter, both of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity. Human assessors have then labeled each of the candidates for relevance. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.

You can find more information on this test collection here.

  • [PDF] D. Spina, E. Meij, A. Oghina, B. M. Thuong, M. Breuss, and M. de Rijke, “A corpus for entity profiling in microblog posts,” in LREC 2012 Workshop on Language Engineering for Online Reputation Management, 2012.

Adding Semantics to Microblog Posts

Microblogs have become an important source of information for marketing, intelligence, and reputation management purposes. Streams of microblogs are of great value because of their direct and real-time nature. Determining what an individual microblog post is about, however, can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication.

We propose a solution to the problem of determining what a microblog post is about through semantic linking: we add semantics to posts by automatically identifying concepts that are semantically related to them and generating links to the corresponding Wikipedia articles. The identified concepts can subsequently be used for, e.g., social media mining, thereby reducing the need for manual inspection and selection. Using a purpose-built test collection of tweets, we show that recently proposed approaches for semantic linking do not perform well, mainly due to the idiosyncratic nature of microblog posts. We propose a novel method based on machine learning with a set of innovative features and show that it is able to achieve significant improvements over all other methods, especially in terms of precision.
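To illustrate the general idea of semantic linking, here is a minimal lexical-matching sketch: looking up tweet n-grams in a dictionary of Wikipedia anchor texts. This shows only a simple candidate-generation step, not the paper's machine-learning method, and the `ANCHORS` dictionary below is a hypothetical stand-in for one built from a full Wikipedia dump.

```python
# Minimal sketch of anchor-text-based concept linking for a tweet.
# ANCHORS is a hypothetical toy dictionary; a real one would be built
# from the anchor texts of a Wikipedia dump.
ANCHORS = {
    "barack obama": "Barack Obama",
    "white house": "White House",
}

def link_candidates(tweet, max_n=3):
    """Return Wikipedia article titles whose anchor text occurs in the tweet."""
    tokens = tweet.lower().split()
    candidates = set()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = " ".join(tokens[i:i + n])
            if ngram in ANCHORS:
                candidates.add(ANCHORS[ngram])
    return candidates
```

A learned ranker, as in the paper, would then score these candidates using features of the tweet and the candidate articles.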

  • [PDF] E. Meij, W. Weerkamp, and M. de Rijke, “Adding semantics to microblog posts,” in Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (WSDM 2012), 2012.

LREC 2012 Workshop on Language Engineering for Online Reputation Management

I am co-organizing an LREC workshop on Language Engineering for Online Reputation Management.

The LREC 2012 workshop on Language Engineering for Online Reputation Management intends to bring together the Language Engineering community (including researchers and developers) with representatives from the Online Reputation Management industry, a fast-growing sector which poses challenging demands to text mining technologies. The goal is to establish a five-year roadmap on the topic, focusing on what language technologies are required to get there in terms of resources, algorithms and applications.

Online Reputation Management deals with the image that online media project about individuals and organizations. The growing relevance of social media and the speed at which facts and opinions travel in microblogging networks make online reputation an essential part of a company’s public relations.

While traditional reputation analysis was based mostly on manual analysis (clipping from media, surveys, etc.), the key value from online media comes from the ability of processing, understanding and aggregating potentially huge streams of facts and opinions about a company or individual. Information to be mined includes answers to questions such as: What is the general state of opinion about a company/individual in online media? What are its perceived strengths and weaknesses, as compared to its peers/competitors? How is the company positioned with respect to its strategic market? Can incoming threats to its reputation be detected early enough to be neutralized before they effectively affect reputation?

In this context, Natural Language Processing plays a key, enabling role, and we are already witnessing an unprecedented demand for text mining software in this area. Note that, while the area of opinion mining has made significant advances in the last few years, most tangible progress has been focused on products. However, mining and understanding opinions about companies and individuals is, in general, a much harder and less understood problem.

The aim of this workshop is to bring together the Language Engineering community (including researchers and developers) with representatives from the Online Reputation Management industry, with the ultimate goal of establishing a five-year roadmap on the topic, and a description of the language technologies required to get there in terms of resources, algorithms and applications.

With this purpose in mind, the workshop will welcome both research papers and position statements from industry and academia. The agenda for the event will include both presentations (from accepted submissions and selected invited speakers) and a collaborative discussion to sketch a roadmap for Language Engineering in Online Reputation Management. The EU project Limosine (starting November 2011) will be used as a funding instrument to ensure that participation is representative and key players are engaged in the workshop. The workshop is held in coordination with the RepLab initiative, a CLEF 2012 evaluation initiative for systems dealing with Online Reputation Management challenges.


A comparison of five semantic linking algorithms on tweets

Late last December, Yahoo! released a new version of their Content Analysis service and they announced that the initial version will be deprecated in 2012. Inspired by a recent post by Tony Hirst, entitled A Quick Peek at Three Content Analysis Services, this seemed like a perfect opportunity to test out various algorithms/APIs for semantically annotating text, in particular tweets. For my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke), we have developed a gold-standard test collection for exactly this, i.e., automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet.

What I wanted to do here is take our recently released test collection and compare several off-the-shelf annotation APIs. In the paper we already compare various methods, including Tagme and DBpedia Spotlight, as well as a variant solely based on the anchor texts found in Wikipedia, called ‘CMNS’ in the paper. In this post, I also include the new Yahoo! service and a service called Wikimeta. I have excluded OpenCalais from this list, mainly because it doesn’t link to Wikipedia.

Highlights of the experimental setup:

  • Approximately 500 tweets, with a maximum of 50 retrieved concepts, i.e., Wikipedia articles, per tweet.
  • The tweet is tokenized, i.e., punctuation and capitalization are removed. Twitter-specific “terms,” such as mentions and URLs, are also removed. For hashtags, I remove the ‘#’ character but leave the term itself. Stopwords are removed. (More on this later.)
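The preprocessing above can be sketched as follows; the exact tokenizer and stopword list used in the experiments are not spelled out here, so both are assumptions in this sketch.

```python
import re

# A sketch of the tweet preprocessing described above. The stopword
# list is a tiny illustrative stand-in, not the one actually used.
STOPWORDS = {"the", "a", "an", "is", "at", "on", "in", "of", "to"}

def normalize_tweet(text):
    text = text.lower()                        # remove capitalization
    text = re.sub(r"@\w+", " ", text)          # drop mentions
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = text.replace("#", "")               # keep hashtag term, drop '#'
    text = re.sub(r"[^\w\s]", " ", text)       # strip remaining punctuation
    return [t for t in text.split() if t not in STOPWORDS]
```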

First, some general observations with respect to each API.

  • DBpedia Spotlight feels sluggish and actually takes the longest to annotate all tweets (approx. 30 minutes).
  • Tagme is blazingly fast, processing all tweets in under 60 seconds.
  • Yahoo! is also fast, but not very robust. It gives intermittent HTTP 500 responses to web service calls.
  • Wikimeta, well… First of all, the returned XML is not valid, containing unescaped ‘&’ characters. After having manually fixed the output, it started nicely, but the web service seems to have crashed after processing 50 tweets. Update: things are back up and it finished within a few minutes.
  • Finally, our method is also quite fast; it finished processing all tweets in under 90 seconds. Obviously we have a local installation of this, so there is little networking overhead.

Now, onto the results. Below, I report on a number of metrics: average R-precision, i.e., precision at R, where R denotes the number of relevant concepts per tweet; reciprocal rank, i.e., the reciprocal of the rank of the first relevant concept; recall; and MAP (mean average precision).
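For reference, these metrics can be computed per tweet as follows, assuming each system produces a ranked list of concept IDs and `relevant` is the gold-standard set of concepts for that tweet (the reported numbers are averages over all tweets).

```python
# Per-tweet implementations of the reported evaluation metrics.

def r_precision(ranking, relevant):
    """Precision at R, where R is the number of relevant concepts."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return len([c for c in ranking[:r] if c in relevant]) / r

def reciprocal_rank(ranking, relevant):
    """1 / rank of the first relevant concept (0 if none retrieved)."""
    for i, c in enumerate(ranking, start=1):
        if c in relevant:
            return 1.0 / i
    return 0.0

def recall(ranking, relevant):
    if not relevant:
        return 0.0
    return len(set(ranking) & relevant) / len(relevant)

def average_precision(ranking, relevant):
    """Average of precision values at the ranks of relevant concepts."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for i, c in enumerate(ranking, start=1):
        if c in relevant:
            hits += 1
            total += hits / i
    return total / len(relevant)
```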

Comparison results

  Method               R-Prec    Recip. Rank    Recall    MAP
  DBpedia Spotlight    0.2623    0.4301         0.3904    0.2865

From this table it is clear that Tagme obtains high precision, with our method a close second. Reciprocal rank is high for both methods; a value of 0.6289 indicates that the average rank of the first relevant concept lies around 1.6. Our method obtains the highest recall (retrieving over 80% of all relevant concepts) and the highest MAP, this time with Tagme as a close second.

When running these experiments, it turned out that some methods use capitalization, punctuation, and related information to determine candidate concept links and targets; in particular Wikimeta and Yahoo! seem to be affected by this. So, in the next table you’ll find the same results, only this time without any tokenization performed (and also without any stopwords removed). Indeed, Wikimeta improves considerably and also Yahoo! improves somewhat. There seems to be a little gain for DBpedia Spotlight in this case.

Comparison results - untokenized

  Method               R-Prec    Recip. Rank    Recall    MAP
  DBpedia Spotlight    0.2650    0.4298         0.4273    0.2950

To round up, some concluding remarks. Tweets are inherently different from “ordinary” text, and this evaluation has shown that the methods that perform best on short texts with little data available for disambiguation (for instance, the Tagme system) also perform best on tweets. Wikimeta parses the input text and is thus helped by providing it with full text (as far as that goes with Twitter).

Finally, I discovered something interesting with respect to our test collection, namely that some of the contents already seem to be outdated. One of the tweets refers to “Pia Toscano,” but she wasn’t in the annotators’ version of Wikipedia yet. As such, some systems retrieve her correctly, although the annotations deem her not relevant. “Dynamic semantics.” Sounds like a nice title for my next paper.



Dataset for “Adding Semantics to Microblog Posts”

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts, all available for download. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.


Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet IDs, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools used in the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
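Once you have fetched a tweet, you can check it against the distributed checksums. The sketch below assumes the `.dat` file is tab-separated with the columns described above (tweet ID, username, MD5 checksum), and that the checksum is computed over the UTF-8 encoded tweet text; both details are assumptions here, not part of the release notes.

```python
import hashlib

# Sketch of verifying a downloaded tweet against its distributed MD5.
# Assumption: the checksum is over the UTF-8 bytes of the tweet text.

def verify_tweet(fetched_text, expected_md5):
    """Return True if the fetched tweet text matches the checksum."""
    digest = hashlib.md5(fetched_text.encode("utf-8")).hexdigest()
    return digest == expected_md5
```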

Note that for the experiments done in the paper, we annotated 562 tweets. In the meantime, however, tweets have been deleted and accounts banned. As such, you’ll find that we are left with a slightly smaller number of tweets: 502, to be precise.


We have asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned). For the 502 tweets listed above, the statistics are slightly different than reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as not being in either of the two erroneous categories. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (for both of these cases the Wikipedia article title equals ‘-‘).

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance), that can be used with a tool such as trec_eval as a gold standard. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as being non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task; more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, R-precision, etc. to evaluate and compare different methods.