As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, please remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.

Tweets

Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet ID, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools from the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
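To make the batching and checksum ideas above concrete, here is a minimal Python sketch. It is not part of the TREC Microblog tools. The helper names `blocks` and `matches_checksum` are hypothetical, and the assumption that the checksum is the MD5 hex digest of the UTF-8 encoded tweet text is mine; check it against a few tweets you have fetched before relying on it.

```python
import hashlib
from typing import Iterator, List, Sequence

BLOCK_SIZE = 150  # block size recommended above because of rate limits


def blocks(ids: Sequence[str], size: int = BLOCK_SIZE) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` tweet IDs."""
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def matches_checksum(text: str, expected_md5: str) -> bool:
    """Compare a fetched tweet against its published checksum.

    Assumption: the checksum is the MD5 hex digest of the UTF-8
    encoded tweet text. The dataset may compute it differently.
    """
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return digest == expected_md5.strip().lower()
```

A tweet whose text no longer matches its checksum has probably been edited, truncated, or re-encoded somewhere along the way, so it is safest to leave it out of any comparison with the numbers in the paper.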

Note that for the experiments in the paper, we annotated 562 tweets. In the meantime, however, tweets have been deleted and accounts have been banned. As such, we are left with a slightly smaller number of tweets: 502, to be precise.

Annotations

We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned).

For the 502 tweets listed above, the statistics differ slightly from those reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as being neither ambiguous nor erroneous. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list of annotations. The first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (in both cases the Wikipedia article title equals ‘-’).
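The three-column layout described above can be read with a few lines of Python. This is only a sketch: the function name `read_annotations` is hypothetical, and I am assuming one annotation per line with no header row.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

AMBIGUOUS_ID = "-1"  # article ID used for ambiguous tweets
UNKNOWN_ID = "-2"    # article ID used for unknown tweets


def read_annotations(
    lines: Iterable[str],
) -> Tuple[Dict[str, List[Tuple[str, str]]], Set[str], Set[str]]:
    """Split annotation lines into concepts, ambiguous and unknown tweets.

    Each line is expected to hold: tweet ID, article ID, article title,
    separated by tabs. Assumption: the file has no header row.
    """
    concepts: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    ambiguous: Set[str] = set()
    unknown: Set[str] = set()
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        tweet_id, article_id, title = fields[0], fields[1], fields[2]
        if article_id == AMBIGUOUS_ID:
            ambiguous.add(tweet_id)
        elif article_id == UNKNOWN_ID:
            unknown.add(tweet_id)
        else:
            concepts[tweet_id].append((article_id, title))
    return dict(concepts), ambiguous, unknown
```

Since an open file is an iterable of lines, something like `read_annotations(open("wsdm2012_annotations.txt", encoding="utf-8"))` should work, assuming the file is UTF-8 encoded.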

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used as a gold standard with a tool such as trec_eval. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task; more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
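For readers who want to see what these metrics compute without running trec_eval, here is a rough Python sketch of average precision and R-precision for a single tweet. The function names are hypothetical, and I am assuming the qrels file follows the usual four-column TREC layout (topic, iteration, document ID, relevance), with the tweet ID as topic and the Wikipedia article ID as document. For reported numbers, trec_eval remains the reference.

```python
from collections import defaultdict
from typing import Dict, Iterable, Sequence, Set


def read_qrels(lines: Iterable[str]) -> Dict[str, Set[str]]:
    """Map each topic (tweet ID) to its set of relevant article IDs.

    Assumption: whitespace-separated columns 'topic iteration doc rel'.
    """
    relevant: Dict[str, Set[str]] = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc_id, rel = parts
        if int(rel) > 0:
            relevant[topic].add(doc_id)
    return dict(relevant)


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of precision@k over the ranks k where a relevant item appears.

    Assumes the ranking contains no duplicate article IDs.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant items."""
    r = len(relevant)
    if r == 0:
        return 0.0
    return sum(1 for doc_id in ranking[:r] if doc_id in relevant) / r
```

MAP is then simply the mean of `average_precision` over all tweets that have at least one relevant article.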