Dataset for “Adding Semantics to Microblog Posts”

Research on Twitter

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.

Tweets

Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet IDs, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official tools used in the TREC Microblog track to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
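For those scripting the download themselves, here is a minimal Python sketch of the bookkeeping around it: reading the ID file, splitting it into blocks of 150, and checking a fetched tweet against its checksum. It makes two assumptions that you should verify against the actual file: that each line holds the tweet ID, username, and MD5 checksum separated by whitespace, and that the checksum was computed over the UTF-8 encoded tweet text. The function names are made up for illustration, and the actual fetching is left to whatever tool you use.

```python
import hashlib
from typing import Iterator, List, Sequence, Tuple


def read_tweet_index(lines: Sequence[str]) -> List[Tuple[str, str, str]]:
    """Parse lines assumed to look like: tweet_id <whitespace> username <whitespace> md5."""
    records = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        records.append((parts[0], parts[1], parts[2]))
    return records


def in_blocks(items: Sequence, size: int = 150) -> Iterator[Sequence]:
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def matches_checksum(text: str, expected_md5: str) -> bool:
    """Compare the MD5 of the UTF-8 encoded text (an assumption) with the listed checksum."""
    return hashlib.md5(text.encode("utf-8")).hexdigest() == expected_md5.lower()
```

You would pass the lines of ‘wsdm2012_tweets.dat’ to read_tweet_index, hand each block from in_blocks to your fetching tool, and call matches_checksum on every tweet that comes back. A mismatch may simply mean that my assumption about what was hashed is wrong.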

Note that for the experiments done in the paper, we annotated 562 tweets. In the meantime, however, some tweets were deleted and some accounts were banned. As a result, we are left with a slightly smaller number of tweets: 502, to be precise.

Annotations

We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned). For the 502 tweets listed above, the statistics differ slightly from those reported in the paper. The average length of a tweet in this set equals 37. Out of the 502 tweets, 375 were labeled as neither ambiguous nor erroneous. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets (those to which no relevant concept could be assigned) the ID equals ‘-2’. In both of these cases the Wikipedia article title equals ‘-’.
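To make the column layout concrete, here is a small Python sketch that reads such lines into a per-tweet list of concepts and sets aside the ‘-1’ and ‘-2’ rows. The function names are my own for this illustration, and I am assuming the file has no header row.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple

AMBIGUOUS_ID = "-1"  # article ID used for ambiguous tweets
UNKNOWN_ID = "-2"    # article ID used for unknown tweets

Concepts = Dict[str, List[Tuple[str, str]]]


def load_annotations(lines: Iterable[str]) -> Tuple[Concepts, Set[str], Set[str]]:
    """Return (concepts per tweet, ambiguous tweet IDs, unknown tweet IDs)."""
    concepts = defaultdict(list)
    ambiguous, unknown = set(), set()
    for line in lines:
        row = line.rstrip("\r\n").split("\t")
        if len(row) < 3:
            continue  # skip blank or malformed lines
        tweet_id, article_id, title = row[0], row[1], row[2]
        if article_id == AMBIGUOUS_ID:
            ambiguous.add(tweet_id)
        elif article_id == UNKNOWN_ID:
            unknown.add(tweet_id)
        else:
            concepts[tweet_id].append((article_id, title))
    return dict(concepts), ambiguous, unknown


def mean_concepts_per_tweet(concepts: Concepts) -> float:
    """Average number of annotated concepts over tweets with at least one concept."""
    if not concepts:
        return 0.0
    return sum(len(items) for items in concepts.values()) / len(concepts)
```

Passing an open file handle for ‘wsdm2012_annotations.txt’ to load_annotations should work, since the function only iterates over lines. Calling mean_concepts_per_tweet on the result is a quick sanity check against the average number of concepts per tweet mentioned above.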

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used as a gold standard with a tool such as trec_eval. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task: more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
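As an illustration of both steps, the Python sketch below turns a mapping from tweet IDs to relevant article IDs into qrels lines and computes (mean) average precision for a set of rankings. I am assuming the usual four-column trec_eval layout here (query ID, an iteration field that is ignored, document ID, relevance), with the tweet in the role of the query. The function names are invented for this example, and for reported numbers you should rely on trec_eval itself.

```python
from typing import Dict, List, Sequence, Set


def to_qrels_lines(relevant: Dict[str, Set[str]]) -> List[str]:
    """One line per relevant (tweet, article) pair: tweet_id 0 article_id 1."""
    lines = []
    for tweet_id in sorted(relevant):
        for article_id in sorted(relevant[tweet_id]):
            lines.append(f"{tweet_id} 0 {article_id} 1")
    return lines


def average_precision(ranking: Sequence[str], relevant_ids: Set[str]) -> float:
    """Average precision of one ranking; assumes the ranking has no duplicates."""
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, article_id in enumerate(ranking, start=1):
        if article_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(rankings: Dict[str, Sequence[str]],
                           relevant: Dict[str, Set[str]]) -> float:
    """Mean of average precision over all tweets that have relevant articles."""
    if not relevant:
        return 0.0
    scores = [average_precision(rankings.get(tweet_id, []), ids)
              for tweet_id, ids in relevant.items()]
    return sum(scores) / len(scores)
```

A tweet for which a method returns no ranking at all contributes an average precision of zero, so it still counts against that method in the mean.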

15 Comments

    edelawit

    Hi Edgar Meij,

    I was doing a project on twitter and was hoping to use your dataset but wasn’t able to download them using the twitter tools that were provided, could you please help me out? Thanks.

      Edgar Meij

      Yes, sure. Please send me an e-mail and we’ll take it from there.

      Edgar

        edelawit

        ok, thank you!
        minilikmem@gmail.com

        P.S. Just to make it a bit clearer: I had trouble downloading the tweets (status blocks) that the IDs in your dataset refer to.

    Bernhard

    Hi Edgar,
    could you please elaborate on how you used the twitter-tools to fetch the actual tweets? E.g. which class did you use and how did you adapt it?
    Thanks.
    Bernhard

      Edgar Meij

      Hi Bernhard,

      It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.

      Edgar

    Anonymous

    Hi Edgar Meij,

    I was doing a project and was hoping to use your dataset but wasn’t able to download them that were provided, could you please help me out? Thanks.

      Edgar Meij

      Hi,

      It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.

      Edgar

    Duy Van Khanh

    Hi Edgar Meij,

    I was doing a project about NED for Tweet and was hoping to use your dataset but wasn’t able to download them that were provided, could you please send me your dataset? Thanks. My email: duychipmunk@gmail.com

    Xin Chen

    Dear Edgar,

    I’m currently working on a health tweet classifier, for which I may need your dataset to test a semantics-based approach.

    Would you kindly share this dataset with me (xchen97@emory.edu)? We would sincerely acknowledge your help in our research output (paper, presentation, etc.).

    SeoHyun Kim

    Hi Edgar Meij,

    I was doing a project about NED for Tweet and was hoping to use your dataset but wasn’t able to download them that were provided, could you please send me your dataset? Thanks.
    My email: tjgus3253@gmail.com

    Anonymous

    Hi Edgar Meij,
    I want to use ‘wsdm2012_annotations.txt’, but I couldn’t find a link to download it. Can you email it to me?
    thank you so much

      Edgar Meij

      Sure. Please send me an email and I’ll send it to you.

    Talaat

    Hello Dr. Meij,
    I’d be grateful if you can provide me with the dataset via the following mail:
    t.maher@nu.edu.eg
    Thanks!
    Talaat

      Edgar Meij

      Sure. Email sent.

    sleephing

    Hi Edgar Meij,
    I’m working on a project about entity linking.
    I’d appreciate it if you could send me the data sets.
    sleephing@gmail.com
    Thank u so much!
