Twitter standing

A comparison of five semantic linking algorithms on tweets

Late last December, Yahoo! released a new version of its Content Analysis service and announced that the initial version would be deprecated in 2012. Inspired by a recent post by Tony Hirst, entitled A Quick Peek at Three Content Analysis Services, this seemed like a perfect opportunity to test out various algorithms/APIs for semantically annotating text, in particular tweets. For my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke), we developed a gold-standard test collection for exactly this, i.e., automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet.

What I want to do here is take our recently released test collection and use it to compare several off-the-shelf annotation APIs. In the paper, we already compare various methods, including Tagme and DBpedia Spotlight, and add a variant based solely on the anchor texts found in Wikipedia (called ‘CMNS’ in the paper). In this post, I also include the new Yahoo! service and a service called Wikimeta. I have excluded OpenCalais from this list, mainly because it doesn’t link to Wikipedia.

Highlights of the experimental setup:

  • Approximately 500 tweets, with a maximum of 50 retrieved concepts, i.e., Wikipedia articles, per tweet.
  • The tweet is tokenized, i.e., punctuation and capitalization are removed. Twitter-specific “terms” such as mentions and URLs are also removed. For hashtags, I remove the ‘#’ character but leave the term itself. Stopwords are removed. (More on this later.)
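
A minimal sketch of the preprocessing steps listed above, in Python. The regular expressions and the tiny stopword list are assumptions made for illustration; they are not the exact rules or list used in the experiments.

```python
import re
import string

# Illustrative stopword list (an assumption; the actual list is not given here).
STOPWORDS = {"a", "an", "the", "is", "in", "of", "to", "and", "for", "on", "at"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
# Map every ASCII punctuation character to a space.
PUNCT_TABLE = str.maketrans(string.punctuation, " " * len(string.punctuation))


def preprocess(tweet):
    """Return the list of tokens left after the cleaning steps."""
    text = URL_RE.sub(" ", tweet)      # drop URLs
    text = MENTION_RE.sub(" ", text)   # drop @mentions
    text = text.replace("#", " ")      # keep the hashtag term, drop '#'
    text = text.lower()                # remove capitalization
    text = text.translate(PUNCT_TABLE) # remove punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]


print(preprocess("@bob Watching the #Oscars tonight! http://t.co/abc"))
```

URLs and mentions are removed before punctuation is stripped, since otherwise their fragments would survive as ordinary tokens.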

First, some general observations with respect to each API.

  • DBpedia Spotlight feels sluggish and actually takes the longest to annotate all tweets (approx. 30 minutes).
  • Tagme is blazingly fast, processing all tweets in under 60 seconds.
  • Yahoo! is also fast, but not very robust. It gives intermittent HTTP 500 responses to web service calls.
  • Wikimeta, well… First of all, the returned XML is not valid, containing unescaped ‘&’ characters. After having manually fixed the output, it started nicely, but the web service seems to have crashed after processing 50 tweets. Update: things are back up and it finished within a few minutes.
  • Finally, our method is also quite fast; it finished processing all tweets in under 90 seconds. Obviously we have a local installation of this, so there is little networking overhead.

Now, on to the results. Below, I report on a number of metrics: average R-precision (precision at R, where R denotes the number of relevant concepts per tweet), reciprocal rank (the reciprocal of the rank of the first relevant concept), recall, and MAP (mean average precision).
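
To make these definitions concrete, here is a small Python sketch that computes each metric for a single tweet. The function names and the toy concepts are made up for illustration, and the sketch assumes a ranking contains no duplicate concepts; averaging each value over all tweets gives the reported numbers (MAP is the mean of the per-tweet average precision).

```python
def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant concepts."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for c in ranked[:r] if c in relevant) / r


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant concept, or 0 if none is retrieved."""
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            return 1.0 / rank
    return 0.0


def recall(ranked, relevant):
    """Fraction of the relevant concepts that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of precision@k over the ranks k that hold a relevant concept."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


ranked = ["A", "B", "C", "D"]   # system output, best first (toy data)
relevant = {"B", "D", "E"}      # gold-standard concepts (toy data)
print(r_precision(ranked, relevant), reciprocal_rank(ranked, relevant),
      recall(ranked, relevant), average_precision(ranked, relevant))
```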

Comparison results

                     R-Prec   Recip. Rank   Recall   MAP
DBpedia Spotlight    0.2623   0.4301        0.3904   0.2865

From this table it is clear that Tagme obtains the highest precision, with our method a close second. Reciprocal rank is high for both methods: a value of 0.6289 means the first relevant concept typically sits at around rank 1.6 (1/0.6289 ≈ 1.59). Our method obtains the highest recall (retrieving over 80% of all relevant concepts) and MAP, this time with Tagme as a close second.

When running these experiments, it turned out that some methods use capitalization, punctuation, and related information to determine candidate concept links and targets; Wikimeta and Yahoo! in particular seem to be affected by this. So, in the next table you’ll find the same results, only this time without any tokenization (and without stopword removal). Indeed, Wikimeta improves considerably and Yahoo! improves somewhat. DBpedia Spotlight also gains a little in this setting, mostly in recall (0.3904 to 0.4273).

Comparison results - untokenized

                     R-Prec   Recip. Rank   Recall   MAP
DBpedia Spotlight    0.2650   0.4298        0.4273   0.2950

To wrap up, some concluding remarks. Tweets are inherently different from “ordinary” text, and this evaluation has shown that the methods that perform best on short texts (for instance, the Tagme system) also perform best on tweets, where there is little data available for disambiguation. Wikimeta parses the input text and is thus helped by being given the full text (as far as that goes on Twitter).

Finally, I discovered something interesting with respect to our test collection, namely that some of the contents already seem to be outdated. One of the tweets refers to “Pia Toscano,” but she wasn’t in the annotators’ version of Wikipedia yet. As such, some systems retrieve her correctly, although the annotations deem her not relevant. “Dynamic semantics.” Sounds like a nice title for my next paper.


Research on Twitter

Dataset for “Adding Semantics to Microblog Posts”

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.


Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet IDs, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools used in the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
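
As a rough illustration of the two practical points above, the Python sketch below splits a list of tweet IDs into blocks of 150 and checks a downloaded tweet against its MD5 checksum. It assumes the checksum is the hex MD5 of the UTF-8 encoded tweet text; that is an assumption, so check it against what the official tools produce. The helper names are hypothetical, and the sketch does not parse ‘wsdm2012_tweets.dat’ or call the Twitter API.

```python
import hashlib


def blocks(ids, size=150):
    """Yield consecutive slices of at most `size` tweet IDs."""
    for start in range(0, len(ids), size):
        yield ids[start:start + size]


def md5_matches(text, expected_md5):
    """Compare the MD5 of the tweet text with the expected hex digest.

    Assumes the checksum covers the UTF-8 encoded tweet text.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest == expected_md5.strip().lower()


tweet_ids = [str(i) for i in range(502)]  # placeholder IDs, not real ones
print([len(block) for block in blocks(tweet_ids)])
```

A tweet whose checksum no longer matches was most likely edited, deleted, or fetched in a different form than the one that was annotated.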

Note that for the experiments in the paper, we annotated 562 tweets. In the meantime, however, tweets were deleted and accounts were banned. As such, you’ll find that we are left with a slightly smaller number of tweets: 502, to be precise.


We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned). For the 502 tweets listed above, the statistics differ slightly from those reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as being in neither of these two categories. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (for both of these cases the Wikipedia article title equals ‘-’).
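
Reading this file could look roughly like the Python sketch below. It follows the three-column layout described above; the function name is hypothetical and the sample rows use made-up IDs, not values from the actual file.

```python
from collections import defaultdict

AMBIGUOUS, UNKNOWN = "-1", "-2"


def load_annotations(lines):
    """Split rows into real concept links and tweets flagged -1 / -2."""
    concepts = defaultdict(list)  # tweet ID -> [(article ID, title), ...]
    flagged = {}                  # tweet ID -> "ambiguous" or "unknown"
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) < 3:
            continue  # skip blank or malformed rows
        tweet_id, article_id, title = parts[0], parts[1], parts[2]
        if article_id == AMBIGUOUS:
            flagged[tweet_id] = "ambiguous"
        elif article_id == UNKNOWN:
            flagged[tweet_id] = "unknown"
        else:
            concepts[tweet_id].append((article_id, title))
    return dict(concepts), flagged


# Made-up rows in the described format.
sample = ["100\t12\tAnarchism\n", "100\t34\tSome article\n",
          "101\t-1\t-\n", "102\t-2\t-\n"]
concepts, flagged = load_annotations(sample)
print(concepts, flagged)
```

With the real file, passing `open("wsdm2012_annotations.txt", encoding="utf-8")` as `lines` should work, assuming the file is UTF-8 encoded.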

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used as a gold standard with a tool such as trec_eval. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task; more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
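
To score your own system with trec_eval you also need a run file. The Python sketch below writes both kinds of lines in the standard trec_eval layouts (qrels: query ID, iteration, document ID, relevance; run: query ID, Q0, document ID, rank, score, tag), with the tweet ID as the query and the Wikipedia article ID as the document. The function names, the IDs, and the run tag are made up, and the exact column layout of the released qrels file is not confirmed here.

```python
def qrels_lines(annotations):
    """annotations: tweet ID -> list of relevant Wikipedia article IDs."""
    return [f"{tweet_id} 0 {article_id} 1"
            for tweet_id, article_ids in annotations.items()
            for article_id in article_ids]


def run_lines(tweet_id, scored, tag="myrun"):
    """scored: list of (article ID, score) pairs; highest score ranks first."""
    ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [f"{tweet_id} Q0 {article_id} {rank} {score:.4f} {tag}"
            for rank, (article_id, score) in enumerate(ordered, start=1)]


print("\n".join(qrels_lines({"100": ["12", "34"]})))
print("\n".join(run_lines("100", [("12", 1.0), ("34", 2.5)], tag="demo")))
```

Once both files are written to disk, `trec_eval wsdm2012_qrels.txt myrun.txt` reports MAP, R-precision, and reciprocal rank among its default measures.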