
LREC 2012 Workshop on Language Engineering for Online Reputation Management

I am co-organizing an LREC workshop on Language Engineering for Online Reputation Management.

The LREC 2012 workshop on Language Engineering for Online Reputation Management intends to bring together the Language Engineering community (including researchers and developers) with representatives from the Online Reputation Management industry, a fast-growing sector that places challenging demands on text mining technologies. The goal is to establish a five-year roadmap on the topic, focusing on the language technologies required to get there in terms of resources, algorithms, and applications.

Online Reputation Management deals with the image that online media project about individuals and organizations. The growing relevance of social media and the speed at which facts and opinions travel in microblogging networks make online reputation an essential part of a company’s public relations.

While traditional reputation analysis was based mostly on manual analysis (clipping from media, surveys, etc.), the key value from online media comes from the ability of processing, understanding and aggregating potentially huge streams of facts and opinions about a company or individual. Information to be mined includes answers to questions such as: What is the general state of opinion about a company/individual in online media? What are its perceived strengths and weaknesses, as compared to its peers/competitors? How is the company positioned with respect to its strategic market? Can incoming threats to its reputation be detected early enough to be neutralized before they effectively affect reputation?

In this context, Natural Language Processing plays a key, enabling role, and we are already witnessing an unprecedented demand for text mining software in this area. Note that, while the area of opinion mining has made significant advances in the last few years, most tangible progress has been focused on products. However, mining and understanding opinions about companies and individuals is, in general, a much harder and less understood problem.


With this purpose in mind, the workshop will welcome both research papers and position statements from industry and academia. The agenda for the event will include both presentations (from accepted submissions and selected invited speakers) and a collaborative discussion to sketch a roadmap for Language Engineering in Online Reputation Management. The EU project Limosine (starting November 2011) will be used as a funding instrument to ensure that participation is representative and key players are engaged in the workshop. The workshop is held in coordination with RepLab, a CLEF 2012 evaluation campaign for systems dealing with Online Reputation Management challenges.


The University of Amsterdam at the TREC 2011 Session Track

We describe the participation of the University of Amsterdam’s ILPS group in the Session track at TREC 2011.

The stream of interactions created by a user engaging with a search system contains a wealth of information. For retrieval purposes, previous interactions can help inform us about a user’s current information need. Building on this intuition, our contribution to this year’s TREC Session track focuses on session modeling and learning to rank using session information. In this paper, we present and compare three complementary strategies that we designed for improving retrieval for a current query using previous queries and clicked results: probabilistic session modeling, semantic query modeling, and implicit feedback.

In our experiments we examined three complementary strategies for improving retrieval for a current query. Our first strategy, based on probabilistic session modeling, was the best performing strategy.

Our second strategy, based on semantic query modeling, did less well than we expected, likely due to topic drift from excessively aggressive query expansion. We expect that performance of this strategy would improve by limiting the number of terms and/or improving the probability estimates.

With respect to our third strategy, based on learning from feedback, we found that learning weights for linear weighted combinations of features from an external collection can be beneficial, provided the characteristics of that collection are similar to the current data. Feedback available in the form of user clicks appeared to be less beneficial. Our run that learned from implicit feedback performed substantially worse than a run whose weights were learned from an external collection with explicit feedback, using the same learning algorithm and feature set.

  • [PDF] B. Huurnink, R. Berendsen, K. Hofmann, E. Meij, and M. de Rijke, “The University of Amsterdam at the TREC 2011 Session Track,” in The Twentieth Text REtrieval Conference (TREC 2011), 2012.
[Figure: P30 difference plot]

Team COMMIT at TREC 2011

We describe the participation of Team COMMIT in this year’s Microblog and Entity tracks.

In our participation in the Microblog track, we used a feature-based approach. Specifically, we pursued a precision-oriented, recency-aware retrieval approach for tweets. Among other things, we used various types of external data: in particular, we examined the potential of link retrieval on a corpus of crawled content pages, and we used semantic query expansion based on Wikipedia. We also deployed pre-filtering based on query-dependent and query-independent features. For the Microblog track we found that a simple cut-off based on the z-score is not sufficient: for differently distributed scores, it can decrease recall. A well-chosen cut-off parameter can, however, significantly increase precision, especially if there are few highly relevant tweets. Filtering based on query-independent features does not help for already small result lists. Given the high occurrence of links in relevant tweets, we found that link retrieval helps improve precision and recall for highly relevant and relevant tweets. Future work should focus on a score-distribution-dependent selection criterion.
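As an aside, a z-score based cut-off of the kind discussed above can be sketched as follows. This is a minimal illustration, not our actual run code; the threshold value and the example scores are made up:

```python
from statistics import mean, stdev

def zscore_cutoff(scored_tweets, z_min=1.0):
    """Keep tweets whose retrieval score lies at least z_min standard
    deviations above the mean score of the result list."""
    scores = [s for _, s in scored_tweets]
    if len(scores) < 2:
        return scored_tweets  # too few results to estimate a distribution
    mu, sigma = mean(scores), stdev(scores)
    if sigma == 0:
        return scored_tweets
    return [(t, s) for t, s in scored_tweets if (s - mu) / sigma >= z_min]

# One clearly dominant tweet: the cut-off keeps only that one.
results = [("t1", 9.2), ("t2", 3.1), ("t3", 2.9), ("t4", 3.0)]
filtered = zscore_cutoff(results, z_min=1.0)  # [("t1", 9.2)]
```

The sketch also illustrates the failure mode mentioned above: when scores are spread near-uniformly, even relevant tweets fall below a fixed threshold, which hurts recall.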

In this year’s Entity track participation we focused on the Entity List Completion (ELC) task. We experimented with text-based and link-based approaches to retrieving entities in Linked Data (LD). Additionally, we experimented with selecting candidate entities from a web corpus. Our intuition is that entities occurring on pages together with many of the example entities are more likely to be good candidates than entities that do not. For the Entity track we have no analyses or conclusions to report yet; at the time of writing, no evaluation results are available.

  • [PDF] M. Bron, E. Meij, M. Peetz, M. Tsagkias, and M. de Rijke, “Team COMMIT at TREC 2011,” in The Twentieth Text REtrieval Conference (TREC 2011), 2012.

ECIR preprints published

The camera-ready versions of the ECIR papers A Framework for Unsupervised Spam Detection in Social Networking Sites (with Maarten Bosma and Wouter Weerkamp) and Adaptive Temporal Query Modeling (with Maria-Hendrike Peetz, Wouter Weerkamp, and Maarten de Rijke) are now available.

In the first paper, we report on the effectiveness of an unsupervised spam detection method for community-based websites, where users can indicate whether messages posted by others are spam. The collection of user-generated messages that we used, together with their spam reports and labels, will be released soon; stay tuned.


A comparison of five semantic linking algorithms on tweets

Late last December, Yahoo! released a new version of their Content Analysis service and they announced that the initial version will be deprecated in 2012. Inspired by a recent post by Tony Hirst, entitled A Quick Peek at Three Content Analysis Services, this seemed like a perfect opportunity to test out various algorithms/APIs for semantically annotating text, in particular tweets. For my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke), we have developed a gold-standard test collection for exactly this, i.e., automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet.

What I wanted to do here is take our recently released test collection and compare several off-the-shelf annotation APIs. In the paper, we already compare various methods, including Tagme, DBpedia Spotlight, and a variant based solely on the anchor texts found in Wikipedia, called ‘CMNS’ in the paper. In this post, I also include the new Yahoo! service and a service called Wikimeta. I have excluded OpenCalais from this list, mainly because it doesn’t link to Wikipedia.

Highlights of the experimental setup:

  • Approximately 500 tweets, with a maximum of 50 retrieved concepts, i.e., Wikipedia articles, per tweet.
  • The tweets are tokenized: punctuation and capitalization are removed. Twitter-specific “terms,” such as mentions and URLs, are also removed. For hashtags, I remove the ‘#’ character but keep the term itself. Stopwords are removed. (More on this later.)
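For concreteness, the tokenization step can be sketched like this. This is illustrative only; the stopword list shown is a made-up subset, not the one actually used:

```python
import re

STOPWORDS = {"the", "a", "an", "is", "at", "on", "in", "of", "to", "and"}  # illustrative subset

def tokenize_tweet(text):
    """Lowercase, strip mentions/URLs, keep hashtag terms without '#',
    drop punctuation and stopwords."""
    text = re.sub(r"https?://\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)          # remove mentions
    text = text.replace("#", " ")              # keep the hashtag term, drop '#'
    text = re.sub(r"[^\w\s]", " ", text)       # remove remaining punctuation
    tokens = text.lower().split()
    return [t for t in tokens if t not in STOPWORDS]

tokenize_tweet("Check out #WSDM2012 at http://t.co/abc @edgarmeij!")
# → ["check", "out", "wsdm2012"]
```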

First, some general observations with respect to each API.

  • DBpedia Spotlight feels sluggish and actually takes the longest to annotate all tweets (approx. 30 minutes).
  • Tagme is blazingly fast, processing all tweets in under 60 seconds.
  • Yahoo! is also fast, but not very robust. It gives intermittent HTTP 500 responses to web service calls.
  • Wikimeta, well… First of all, the returned XML is not valid, containing unescaped ‘&’ characters. After having manually fixed the output, it started nicely, but the web service seems to have crashed after processing 50 tweets. Update: things are back up and it finished within a few minutes.
  • Finally, our method is also quite fast; it finished processing all tweets in under 90 seconds. Obviously we have a local installation of this, so there is little networking overhead.

Now, onto the results. Below, I report on a number of metrics, including average R-precision, i.e., precision at R, where R denotes the number of relevant concepts per tweet; reciprocal rank, i.e., the reciprocal of the rank of the first relevant concept; recall; and MAP (mean average precision).
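These metrics are standard, and for a single tweet they can be computed along these lines. A small self-contained sketch; the example ranking and relevance judgments are made up:

```python
def average_precision(ranked, relevant):
    """AP: mean of the precision values at each rank where a relevant item appears."""
    hits, score = 0, 0.0
    for i, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            score += hits / i
    return score / len(relevant) if relevant else 0.0

def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant concepts."""
    r = len(relevant)
    return len(set(ranked[:r]) & relevant) / r if r else 0.0

def reciprocal_rank(ranked, relevant):
    """Reciprocal of the rank of the first relevant concept."""
    for i, concept in enumerate(ranked, start=1):
        if concept in relevant:
            return 1.0 / i
    return 0.0

# Hypothetical ranking of Wikipedia articles for one tweet.
ranked = ["Barack_Obama", "Chicago", "White_House", "Obama_family"]
relevant = {"Barack_Obama", "White_House"}
```

MAP is then simply the mean of `average_precision` over all tweets.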

Comparison results

  Method               R-Prec    Recip. Rank    Recall    MAP
  DBpedia Spotlight    0.2623    0.4301         0.3904    0.2865

From this table it is clear that Tagme obtains the highest precision, with our method a close second. Reciprocal rank is high for both methods; a value of 0.6289 indicates that the average rank of the first relevant concept lies around 1.6. Our method obtains the highest recall (retrieving over 80% of all relevant concepts) and the highest MAP, this time with Tagme a close second.

When running these experiments, it turned out that some methods use capitalization, punctuation, and related information to determine candidate concept links and targets; in particular Wikimeta and Yahoo! seem to be affected by this. So, in the next table you’ll find the same results, only this time without any tokenization performed (and also without any stopwords removed). Indeed, Wikimeta improves considerably and also Yahoo! improves somewhat. There seems to be a little gain for DBpedia Spotlight in this case.

Comparison results - untokenized

  Method               R-Prec    Recip. Rank    Recall    MAP
  DBpedia Spotlight    0.2650    0.4298         0.4273    0.2950

To round up, some concluding remarks. Tweets are inherently different from “ordinary” text, and this evaluation has shown that the methods that perform best on short texts (for instance, the Tagme system) also perform best on tweets, when there is little data available for disambiguation. Wikimeta parses the input text and is thus helped by providing it with full-text (for as far as that goes with Twitter).

Finally, I discovered something interesting with respect to our test collection, namely that some of the contents already seem to be outdated. One of the tweets refers to “Pia Toscano,” but she wasn’t in the annotators’ version of Wikipedia yet. As such, some systems retrieve her correctly, although the annotations deem her not relevant. “Dynamic semantics.” Sounds like a nice title for my next paper.



Dataset for “Adding Semantics to Microblog Posts”

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.


Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet IDs, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools used in the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
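A sketch of how one might verify fetched tweets against the distributed checksums. Note that the tab-separated column order assumed for ‘wsdm2012_tweets.dat’ is my guess; check the file itself:

```python
import hashlib

def load_manifest(path="wsdm2012_tweets.dat"):
    """Assumes one tab-separated line per tweet: <tweet_id> <username> <md5>.
    The actual column order/separator may differ; inspect the file first."""
    manifest = {}
    with open(path) as f:
        for line in f:
            tweet_id, username, md5 = line.rstrip("\n").split("\t")
            manifest[tweet_id] = (username, md5)
    return manifest

def verify_tweet(text, expected_md5):
    """Check a fetched tweet's text against its distributed MD5 checksum."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() == expected_md5
```

A checksum mismatch typically means the tweet was edited, re-encoded, or fetched incorrectly, and it should be excluded from evaluation.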

Note that for the experiments done in the paper, we annotated 562 tweets. In the meantime, however, tweets were deleted and accounts were banned. As such, you’ll find that we are left with a slightly smaller number of tweets: 502, to be precise.


We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotators should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (multiple target concepts exist) or erroneous (no relevant concept could be assigned). For the 502 tweets listed above, the statistics are slightly different from those reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as falling in neither of the two error categories. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (for both of these cases the Wikipedia article title equals ‘-‘).
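Reading this file could look roughly as follows. The tweet IDs, article IDs, and titles in the sample are illustrative, not actual entries from the dataset:

```python
AMBIGUOUS, UNKNOWN = "-1", "-2"

def parse_annotations(lines):
    """Parse tab-separated annotation lines into
    {tweet_id: [(article_id, title), ...]}; tweets labeled ambiguous (-1)
    or unknown (-2) end up with an empty concept list."""
    annotations = {}
    for line in lines:
        tweet_id, article_id, title = line.rstrip("\n").split("\t")
        concepts = annotations.setdefault(tweet_id, [])
        if article_id not in (AMBIGUOUS, UNKNOWN):
            concepts.append((article_id, title))
    return annotations

sample = [
    "1001\t534366\tBarack Obama",
    "1001\t33057\tWhite House",
    "1002\t-1\t-",
]
parsed = parse_annotations(sample)
# parsed["1001"] holds two concepts; parsed["1002"] is empty (ambiguous).
```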

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used with a tool such as trec_eval as a gold standard. It is derived from the manual annotations by treating all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in the paper we approach the task of linking tweets to concepts as a ranking task: more relevant concepts should be ranked above less relevant ones. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
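Deriving qrels lines from the annotations is straightforward. A sketch, assuming annotations are given as a dict from tweet ID to (article ID, title) pairs; the helper name is mine:

```python
def annotations_to_qrels(annotations):
    """Emit standard TREC qrels lines: <topic> 0 <doc_id> <relevance>.
    Every annotated tweet-article link becomes relevance 1; articles not
    listed are treated as non-relevant by evaluation tools."""
    lines = []
    for tweet_id, concepts in sorted(annotations.items()):
        for article_id, _title in concepts:
            lines.append(f"{tweet_id} 0 {article_id} 1")
    return lines
```

The resulting file can then be passed to trec_eval together with a run file, e.g. `trec_eval wsdm2012_qrels.txt myrun.txt`.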