Dataset for “Adding Semantics to Microblog Posts”

05/01/2012

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.

Tweets

Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet IDs, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools used in the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
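As a rough illustration of how the ID file could be handled, the sketch below reads the entries, splits them into blocks of 150, and checks a fetched tweet against its checksum. It rests on several assumptions that are not confirmed above: that each line holds the tweet ID, username, and MD5 checksum in that order, separated by whitespace, and that the checksum is computed over the UTF-8 encoded tweet text. The names `TweetRef`, `parse_tweet_refs`, `blocks`, and `matches_checksum` are hypothetical helpers, not part of the TREC Microblog tools.

```python
import hashlib
from typing import Iterable, Iterator, List, NamedTuple


class TweetRef(NamedTuple):
    tweet_id: str
    username: str
    md5: str


def parse_tweet_refs(lines: Iterable[str]) -> List[TweetRef]:
    """Parse lines assumed to look like '<tweet_id> <username> <md5>'."""
    refs = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        refs.append(TweetRef(parts[0], parts[1], parts[2].lower()))
    return refs


def blocks(items: List[TweetRef], size: int = 150) -> Iterator[List[TweetRef]]:
    """Yield consecutive blocks of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def matches_checksum(ref: TweetRef, text: str) -> bool:
    """Assumption: the checksum is the MD5 of the UTF-8 encoded tweet text."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() == ref.md5
```

With a list of parsed entries, `for block in blocks(refs): ...` would then hand each block of 150 to whatever fetching tool is used; a failed checksum suggests the tweet was not retrieved as originally annotated, or that the checksum assumption above does not hold.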

Note that for the experiments in the paper, we annotated 562 tweets. In the meantime, however, some tweets have been deleted and some accounts banned. As a result, a slightly smaller number of tweets remains: 502, to be precise.

Annotations

We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned).

For the 502 tweets listed above, the statistics differ slightly from those reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as being in neither of these two categories (ambiguous or erroneous). For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (for both of these cases the Wikipedia article title equals ‘-’).
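The column layout described above can be read with a few lines of Python. This is a minimal sketch, assuming the file has no header row and exactly three tab-separated columns per line; `load_annotations` and `mean_concepts_per_tweet` are hypothetical helper names, and the labels "ambiguous" and "unknown" simply mirror the ‘-1’ and ‘-2’ markers.

```python
from collections import defaultdict

AMBIGUOUS_ID = "-1"  # marker for tweets annotated as ambiguous
UNKNOWN_ID = "-2"    # marker for tweets annotated as unknown


def load_annotations(lines):
    """Return (concepts, flagged).

    concepts maps tweet ID -> list of (article ID, article title);
    flagged maps tweet ID -> "ambiguous" or "unknown".
    """
    concepts = defaultdict(list)
    flagged = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        tweet_id, article_id, title = line.split("\t", 2)
        if article_id == AMBIGUOUS_ID:
            flagged[tweet_id] = "ambiguous"
        elif article_id == UNKNOWN_ID:
            flagged[tweet_id] = "unknown"
        else:
            concepts[tweet_id].append((article_id, title))
    return dict(concepts), flagged


def mean_concepts_per_tweet(concepts):
    """Average number of annotated concepts over tweets that have any."""
    if not concepts:
        return 0.0
    return sum(len(v) for v in concepts.values()) / len(concepts)
```

Typical use would be `with open("wsdm2012_annotations.txt", encoding="utf-8") as f: concepts, flagged = load_annotations(f)`, where the UTF-8 encoding is also an assumption.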

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used as a gold standard with a tool such as trec_eval. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task; more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
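To make the ranking evaluation concrete, here is a small sketch of two of these metrics computed against a qrels file. It assumes the file follows the usual four-column TREC qrels layout (topic, iteration, document, relevance), with the tweet ID as topic and the Wikipedia article ID as document, and that a ranking contains no duplicate articles. The function names are hypothetical; for reported results, trec_eval itself remains the safer choice.

```python
from collections import defaultdict


def load_qrels(lines):
    """Map each topic (assumed: tweet ID) to its set of relevant documents."""
    relevant = defaultdict(set)
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        topic, _iteration, doc, rel = parts
        if int(rel) > 0:
            relevant[topic].add(doc)
    return dict(relevant)


def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


def r_precision(ranking, relevant):
    """Precision at rank R, where R is the number of relevant documents."""
    if not relevant:
        return 0.0
    r = len(relevant)
    return sum(1 for doc in ranking[:r] if doc in relevant) / r
```

MAP is then the mean of `average_precision` over all tweets in the qrels, using each method's ranked list of Wikipedia article IDs for that tweet.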

15 thoughts on “Dataset for “Adding Semantics to Microblog Posts””

edelawit

12/03/2013 at 17:28

Hi Edgar Meij,

I was doing a project on Twitter and was hoping to use your dataset, but I wasn’t able to download it using the Twitter tools that were provided. Could you please help me out? Thanks.

- Edgar Meij
  
  13/03/2013 at 10:56
  
  Yes, sure. Please send me an e-mail and we’ll take it from there.
  
  Edgar
  
  - edelawit
    
    18/03/2013 at 12:22
    
    ok, thank you!
    minilikmem@gmail.com
    
    P.S. Just to make it a bit clearer: I had trouble downloading the tweets (status blocks) that the IDs in your dataset refer to.
    
Bernhard

06/11/2013 at 09:33

Hi Edgar,
could you please elaborate on how you used the twitter-tools to fetch the actual tweets? E.g. which class did you use and how did you adapt it?
Thanks.
Bernhard

- Edgar Meij
  
  20/11/2013 at 12:26
  
  Hi Bernhard,
  
  It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.
  
  Edgar
  
Anonymous

27/11/2013 at 10:54

Hi Edgar Meij,

I was doing a project and was hoping to use your dataset, but I wasn’t able to download it with the tools that were provided. Could you please help me out? Thanks.

- Edgar Meij
  
  28/11/2013 at 16:16
  
  Hi,
  
  It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.
  
  Edgar
  
Duy Van Khanh

19/01/2014 at 10:39

Hi Edgar Meij,

I was doing a project about NED for tweets and was hoping to use your dataset, but I wasn’t able to download it with the tools that were provided. Could you please send me your dataset? Thanks. My email: duychipmunk@gmail.com

Xin Chen

29/04/2014 at 23:52

Dear Edgar,

I’m currently working on a health tweet classifier, for which I may need your dataset to test a semantics-based approach.

Would you kindly share this dataset with me (xchen97@emory.edu)? We would sincerely acknowledge your help in our research output (paper, presentation, etc.).

SeoHyun Kim

22/09/2015 at 10:19

Hi Edgar Meij,

I was doing a project about NED for tweets and was hoping to use your dataset, but I wasn’t able to download it with the tools that were provided. Could you please send me your dataset? Thanks.
My email: tjgus3253@gmail.com

Anonymous

12/11/2015 at 15:33

Hi Edgar Meij,
I want to use ‘wsdm2012_annotations.txt’, but I couldn’t find the link to download it. Can you email it to me?
thank you so much

- Edgar Meij
  
  30/11/2015 at 15:23
  
  Sure. Please send me an email and I’ll send it to you.
  
Talaat

21/06/2016 at 17:38

Hello Dr. Meij,
I’d be grateful if you can provide me with the dataset via the following mail:
t.maher@nu.edu.eg
Thanks!
Talaat

- Edgar Meij
  
  28/06/2016 at 21:27
  
  Sure. Email sent.
  
sleephing

15/07/2016 at 04:06

Hi Edgar Meij,
I’m working on a project about entity linking.
I’d appreciate it if you could send me the data sets.
sleephing@gmail.com
Thank u so much!
