Late last December, Yahoo! released a new version of their Content Analysis service and announced that the initial version will be deprecated in 2012. Together with a recent post by Tony Hirst, entitled “A Quick Peek at Three Content Analysis Services,” this seemed like a perfect opportunity to test out various algorithms and APIs for semantically annotating text, in particular tweets. For my WSDM paper, “Adding Semantics to Microblog Posts” (with Wouter Weerkamp and Maarten de Rijke), we developed a gold-standard test collection for exactly this task, i.e., automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet.

What I wanted to do here is take our recently released test collection and compare several off-the-shelf annotation APIs. In the paper, we already compare various methods, including Tagme and DBpedia Spotlight, and we add a variant based solely on the anchor texts found in Wikipedia, called ‘CMNS’ in the paper. In this post, I also include the new Yahoo! service and a service called Wikimeta. I have excluded OpenCalais from this list, mainly because it doesn’t link to Wikipedia.

Highlights of the experimental setup:

  • Approximately 500 tweets, with a maximum of 50 retrieved concepts, i.e., Wikipedia articles, per tweet.
  • Each tweet is tokenized, i.e., punctuation and capitalization are removed. Twitter-specific “terms” such as mentions and URLs are also removed. For hashtags, I remove the ‘#’ character but leave the term itself. Stopwords are removed. (More on this later.)
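
The preprocessing in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the code used for the experiments: the function name `normalize_tweet`, the regular expressions, the tiny stopword list, and the example tweet are all assumptions made up for this sketch (the post does not say which stopword list was used).

```python
import re
import string

# Tiny illustrative stopword list (an assumption; the actual list is not given).
STOPWORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "at", "on", "for"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
# Translation table that deletes every ASCII punctuation character.
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def normalize_tweet(text):
    """Return the list of normalized terms for one tweet."""
    text = URL_RE.sub(" ", text)        # drop URLs
    text = MENTION_RE.sub(" ", text)    # drop @mentions
    text = text.replace("#", "")        # keep the hashtag term, drop '#'
    text = text.lower().translate(PUNCT_TABLE)  # remove case and punctuation
    return [tok for tok in text.split() if tok not in STOPWORDS]


if __name__ == "__main__":
    # Made-up tweet, for illustration only.
    print(normalize_tweet("Reading the #WSDM paper by @someone: http://example.org/p.pdf"))
```

URLs and mentions are removed before punctuation is stripped, since otherwise their fragments would survive as ordinary terms.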

First, some general observations with respect to each API.

  • DBpedia Spotlight feels sluggish and actually takes the longest to annotate all tweets (approx. 30 minutes).
  • Tagme is blazingly fast, processing all tweets in under 60 seconds.
  • Yahoo! is also fast, but not very robust. It gives intermittent HTTP 500 responses to web service calls.
  • Wikimeta, well… First of all, the returned XML is not valid, containing unescaped ‘&’ characters. After I manually fixed the output, it started off nicely, but the web service seems to have crashed after processing 50 tweets. Update: things are back up and it finished within a few minutes.
  • Finally, our method is also quite fast; it finished processing all tweets in under 90 seconds. Obviously we have a local installation of this, so there is little networking overhead.
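
Two of the problems above (intermittent HTTP 500 responses and unescaped ‘&’ characters in XML) can be worked around in code. The sketch below is one possible approach, not what was done for this post, where the XML was fixed by hand. The helper names `escape_bare_ampersands` and `call_with_retries` are hypothetical, and no real service endpoint is called.

```python
import re
import time
import xml.etree.ElementTree as ET

# An '&' that does not start a named, decimal, or hexadecimal entity reference.
BARE_AMP_RE = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")


def escape_bare_ampersands(xml_text):
    """Replace stray '&' characters with '&amp;' so an XML parser accepts them.

    Existing references such as '&amp;' or '&#38;' are left untouched.
    Named entities that XML does not predefine (e.g. '&nbsp;') will still
    fail to parse; this sketch does not handle them.
    """
    return BARE_AMP_RE.sub("&amp;", xml_text)


def call_with_retries(call, attempts=3, delay=1.0, retry_on=(Exception,)):
    """Run call() and retry on failure, waiting a little longer each time."""
    for attempt in range(1, attempts + 1):
        try:
            return call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: propagate the last error
            time.sleep(delay * attempt)


if __name__ == "__main__":
    broken = "<concept>AT&T &amp; friends</concept>"
    fixed = escape_bare_ampersands(broken)
    print(fixed)
    print(ET.fromstring(fixed).text)
```

In practice `call` would wrap the actual web service request, and `retry_on` would be narrowed to the HTTP error type raised by whichever client library is used.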

Now, on to the results. Below, I report on a number of metrics: average R-precision, i.e., precision at R, where R denotes the number of relevant concepts per tweet; reciprocal rank, i.e., the reciprocal of the rank of the first relevant concept; recall; and MAP (mean average precision).
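
For reference, the per-tweet versions of these metrics can be sketched as below, using their standard definitions; the reported numbers are averages over all tweets. The function names and the toy ranking are assumptions for illustration, and the sketch assumes each ranked list contains no duplicate concepts.

```python
def r_precision(ranked, relevant):
    """Precision at rank R, where R is the number of relevant concepts."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for c in ranked[:r] if c in relevant) / r


def reciprocal_rank(ranked, relevant):
    """1 / rank of the first relevant concept, or 0 if none is retrieved."""
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            return 1.0 / rank
    return 0.0


def recall(ranked, relevant):
    """Fraction of the relevant concepts that appear anywhere in the ranking."""
    if not relevant:
        return 0.0
    return len(set(ranked) & set(relevant)) / len(relevant)


def average_precision(ranked, relevant):
    """Mean of the precision values at each rank where a relevant concept occurs."""
    if not relevant:
        return 0.0
    hits, total = 0, 0.0
    for rank, concept in enumerate(ranked, start=1):
        if concept in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Toy example: four retrieved concepts, three relevant ones.
    ranked = ["A", "B", "C", "D"]
    relevant = {"B", "D", "E"}
    for metric in (r_precision, reciprocal_rank, recall, average_precision):
        print(metric.__name__, round(metric(ranked, relevant), 4))
```

MAP is then the mean of `average_precision` over all tweets, and the other columns are averaged in the same way.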

Comparison results

System              R-Prec   Recip. Rank   Recall   MAP
DBpedia Spotlight   0.2623   0.4301        0.3904   0.2865
Tagme               0.4621   0.6289        0.5973   0.4851
Yahoo!              0.0785   0.1427        0.0690   0.0781
Wikimeta            0.0319   0.0573        0.0283   0.0314
CMNS                0.4427   0.6275        0.8239   0.5247

From this table it is clear that Tagme obtains the highest R-precision, with our method a close second. Reciprocal rank is high for both methods: a value of 0.6289 indicates that the first relevant concept sits, on average, at around rank 1.6 (1/0.6289 ≈ 1.59). Our method obtains the highest recall, retrieving over 80% of all relevant concepts, and the highest MAP, this time with Tagme as a close second.

When running these experiments, it turned out that some methods use capitalization, punctuation, and related information to determine candidate concept links and targets; Wikimeta and Yahoo! in particular seem to be affected by this. So, in the next table you’ll find the same results, only this time without any tokenization performed (and also without any stopwords removed). Indeed, Wikimeta improves considerably and Yahoo! improves somewhat. There also seems to be a small gain for DBpedia Spotlight in this case, while Tagme’s scores drop slightly.

Comparison results (untokenized)

System              R-Prec   Recip. Rank   Recall   MAP
DBpedia Spotlight   0.2650   0.4298        0.4273   0.2950
Tagme               0.4553   0.6133        0.5813   0.4766
Yahoo!              0.1094   0.1827        0.0985   0.1091
Wikimeta            0.2060   0.3347        0.2167   0.2047
CMNS                0.4427   0.6275        0.8239   0.5247

To round up, some concluding remarks. Tweets are inherently different from “ordinary” text, and this evaluation has shown that the methods that perform best on short texts (for instance, the Tagme system) also perform best on tweets, where there is little data available for disambiguation. Wikimeta parses the input text and is thus helped by being given the full, unprocessed text (as far as that goes with Twitter).

Finally, I discovered something interesting with respect to our test collection, namely that some of its contents already seem to be outdated. One of the tweets refers to “Pia Toscano,” but she wasn’t in the annotators’ version of Wikipedia yet. As such, some systems retrieve her correctly, although the annotations deem her not relevant. “Dynamic semantics.” Sounds like a nice title for my next paper.