
Edgar Meij

semantic search research ッ

Research on Twitter

Dataset for “Adding Semantics to Microblog Posts”

05/01/2012 Blog 15 Comments

As promised, I’m releasing the dataset used for my WSDM paper, Adding Semantics to Microblog Posts (with Wouter Weerkamp and Maarten de Rijke). In the paper, we evaluate various methods for automatically identifying concepts (in the form of Wikipedia articles) that are contained in or meant by a tweet. This release will consist of a number of parts and be downloadable from http://ilps.science.uva.nl/resources/wsdm2012-adding-semantics-to-microblog-posts/. The first part, described below, contains the tweets that we used, as well as the manual annotations, i.e., links to Wikipedia articles. If there is sufficient interest, I will also release the extracted features that were used in the paper. Let me know in the comments or drop me a line.

If you make use of this dataset, do remember to cite our paper. The bibliographic details can be found here. If you have any questions, don’t hesitate to ask me in the comments or by sending me an e-mail.

Tweets

Twitter’s Terms of Service do not allow me to redistribute the tweets directly, so I’m providing a file containing the tweet ID, the username, and the MD5 checksum of each tweet. With the file ‘wsdm2012_tweets.dat’ you can use the official TREC Microblog track tools to fetch them. Because of Twitter rate limits, I recommend using the JSON option in blocks of 150 tweets. If you are unsuccessful in downloading the tweets, drop me a line and I’ll try to help you out.
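
As a rough illustration of the "blocks of 150" advice, here is a minimal Python sketch that reads the tweet references and groups them into batches. It assumes each line of ‘wsdm2012_tweets.dat’ holds three whitespace-separated fields (tweet ID, username, MD5 checksum); I have not restated the exact layout above, so check it against the file first. The names `TweetRef`, `parse_tweet_refs`, and `in_blocks` are made up for this example and are not part of the TREC tools.

```python
from itertools import islice
from typing import Iterable, Iterator, List, NamedTuple


class TweetRef(NamedTuple):
    """One line of the tweets file (assumed layout: id, username, md5)."""
    tweet_id: str
    username: str
    md5: str


def parse_tweet_refs(lines: Iterable[str]) -> Iterator[TweetRef]:
    """Yield a TweetRef per well-formed line; skip blank or malformed lines."""
    for line in lines:
        fields = line.split()
        if len(fields) == 3:
            yield TweetRef(*fields)


def in_blocks(items: Iterable[TweetRef], size: int = 150) -> Iterator[List[TweetRef]]:
    """Group items into consecutive lists of at most `size` elements."""
    iterator = iter(items)
    while True:
        block = list(islice(iterator, size))
        if not block:
            return
        yield block


if __name__ == "__main__":
    # Assumes the file sits in the current directory.
    with open("wsdm2012_tweets.dat", encoding="utf-8") as handle:
        for number, block in enumerate(in_blocks(parse_tweet_refs(handle)), start=1):
            print(f"block {number}: {len(block)} tweets")
```

Each block can then be handed to whichever fetching tool you use, with a pause between blocks to stay within the rate limits.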

Note that for the experiments in the paper, we annotated 562 tweets. In the meantime, however, tweets were deleted and accounts were banned. As such, you’ll find that we are left with a slightly smaller number of tweets: 502, to be precise.

Annotations

We asked two volunteers to manually annotate the tweets. They were presented with an annotation interface with which they could search through Wikipedia articles using separate article fields such as title, content, incoming anchor texts, first sentence, and first paragraph. The annotation guidelines specified that the annotator should identify concepts contained in, meant by, or relevant to the tweet. They could also indicate that an entire tweet was either ambiguous (where multiple target concepts exist) or erroneous (when no relevant concept could be assigned). For the 502 tweets listed above, the statistics differ slightly from those reported in the paper. The average length of a tweet in this set is 37. Out of the 502 tweets, 375 were labeled as neither ambiguous nor erroneous. For these, the annotators identified 2.17 concepts per tweet on average.

In the file ‘wsdm2012_annotations.txt’ you will find a tab-separated list with annotations. Here, the first column contains the tweet ID, the second column the annotated Wikipedia article ID, and the third column the title of the Wikipedia article. For ambiguous tweets the Wikipedia article ID equals ‘-1’ and for unknown tweets the ID equals ‘-2’ (for both of these cases the Wikipedia article title equals ‘-’).
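
To make the column layout concrete, here is a small Python sketch that loads the annotations and separates real concept links from the two special markers. It relies only on the three tab-separated columns and the ‘-1’/‘-2’ convention described above; the function and variable names are my own choices for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

AMBIGUOUS = "-1"  # article ID used for ambiguous tweets
UNKNOWN = "-2"    # article ID used for unknown tweets


def load_annotations(
    lines: Iterable[str],
) -> Tuple[Dict[str, List[Tuple[str, str]]], Dict[str, str]]:
    """Return (concepts, flagged).

    concepts maps a tweet ID to its (article ID, article title) pairs;
    flagged maps a tweet ID to '-1' or '-2' when the whole tweet was marked.
    """
    concepts: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    flagged: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        tweet_id, article_id, title = line.split("\t", 2)
        if article_id in (AMBIGUOUS, UNKNOWN):
            flagged[tweet_id] = article_id
        else:
            concepts[tweet_id].append((article_id, title))
    return dict(concepts), flagged


if __name__ == "__main__":
    # Assumes the file sits in the current directory.
    with open("wsdm2012_annotations.txt", encoding="utf-8") as handle:
        concepts, flagged = load_annotations(handle)
    total = sum(len(links) for links in concepts.values())
    print(f"{len(concepts)} tweets with concepts, {len(flagged)} flagged")
    if concepts:
        print(f"{total / len(concepts):.2f} concepts per tweet on average")
```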

The ‘wsdm2012_qrels.txt’ file is a so-called qrels file (in TREC parlance) that can be used as a gold standard with a tool such as trec_eval. This file is derived from the manual annotations by considering all annotated links between a tweet and Wikipedia articles as ‘relevant’ and the remainder as non-relevant. Recall that in our paper, we approach the task of linking tweets to concepts as a ranking task; more relevant concepts should be ranked above less relevant concepts. As such, we can rank Wikipedia articles for a given tweet and use common Information Retrieval metrics, such as precision, MAP, and R-precision, to evaluate and compare different methods.
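
For readers who have not used these metrics before, the sketch below shows how precision at a cutoff, R-precision, and average precision can be computed for a single tweet, given a ranked list of article IDs and the set of articles judged relevant. It is an illustration of the textbook definitions with binary relevance, not a replacement for trec_eval and not the code used in the paper; the function names and the toy IDs are invented for the example, and the ranking is assumed to contain no duplicates.

```python
from typing import Sequence, Set


def precision_at_k(ranking: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked articles that are relevant."""
    if k <= 0:
        return 0.0
    hits = sum(1 for article in ranking[:k] if article in relevant)
    return hits / k


def r_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Precision at rank R, where R is the number of relevant articles."""
    return precision_at_k(ranking, relevant, len(relevant))


def average_precision(ranking: Sequence[str], relevant: Set[str]) -> float:
    """Mean of the precision values at each rank where a relevant article appears,
    divided by the total number of relevant articles."""
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for rank, article in enumerate(ranking, start=1):
        if article in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant)


if __name__ == "__main__":
    # Toy data: article IDs are placeholders, not taken from the dataset.
    ranking = ["a", "x", "b", "y"]
    relevant = {"a", "b", "c"}
    print(precision_at_k(ranking, relevant, 2))
    print(r_precision(ranking, relevant))
    print(average_precision(ranking, relevant))
```

MAP is then simply the mean of `average_precision` over all tweets that have at least one relevant article.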

 

 

 


15 thoughts on “Dataset for “Adding Semantics to Microblog Posts””
  1. edelawit
    12/03/2013 at 17:28

    Hi Edgar Meij,

    I was doing a project on twitter and was hoping to use your dataset but wasn’t able to download them using the twitter tools that were provided, could you please help me out? Thanks.

    • Edgar Meij
      13/03/2013 at 10:56

      Yes, sure. Please send me an e-mail and we’ll take it from there.

      Edgar

      • edelawit
        18/03/2013 at 12:22

        ok, thank you!
        minilikmem@gmail.com

        P.S. Just to make it a bit clearer, I had trouble downloading the tweets (status blocks) that the IDs in your dataset refer to.

  2. Bernhard
    06/11/2013 at 09:33

    Hi Edgar,
    could you please elaborate on how you used the twitter-tools to fetch the actual tweets? E.g. which class did you use and how did you adapt it?
    Thanks.
    Bernhard

    • Edgar Meij
      20/11/2013 at 12:26

      Hi Bernhard,

      It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.

      Edgar

  3. Anonymous
    27/11/2013 at 10:54

    Hi Edgar Meij,

    I was doing a project and was hoping to use your dataset but wasn’t able to download them that were provided, could you please help me out? Thanks.

    • Edgar Meij
      28/11/2013 at 16:16

      Hi,

      It’s been quite some time since I used these tools, and I have received reports that in some cases it doesn’t work anymore. If you’d like you can send me an e-mail and I can provide you with the dataset.

      Edgar

  4. Duy Van Khanh
    19/01/2014 at 10:39

    Hi Edgar Meij,

    I was doing a project about NED for Tweet and was hoping to use your dataset but wasn’t able to download them that were provided, could you please send me your dataset? Thanks. My email: duychipmunk@gmail.com

  5. Xin Chen
    29/04/2014 at 23:52

    Dear Edgar,

    I’m currently working on a health tweet classifier, which may need your dataset for testing a semantic-based approach.

    Would you kindly share this dataset with me (xchen97@emory.edu)? We would sincerely acknowledge your help in our research outcome (paper, presentation, etc.).

  6. SeoHyun Kim
    22/09/2015 at 10:19

    Hi Edgar Meij,

    I was doing a project about NED for Tweet and was hoping to use your dataset but wasn’t able to download them that were provided, could you please send me your dataset? Thanks.
    My email: tjgus3253@gmail.com

  7. Anonymous
    12/11/2015 at 15:33

    Hi Edgar Meij,
    I want to use ‘wsdm2012_annotations.txt’, but I didn’t find a link to download it. Can you email it to me?
    Thank you so much.

    • Edgar Meij
      30/11/2015 at 15:23

      Sure. Please send me an email and I’ll send it to you.

  8. Talaat
    21/06/2016 at 17:38

    Hello Dr. Meij,
    I’d be grateful if you can provide me with the dataset via the following mail:
    t.maher@nu.edu.eg
    Thanks!
    Talaat

    • Edgar Meij
      28/06/2016 at 21:27

      Sure. Email sent.

  9. sleephing
    15/07/2016 at 04:06

    Hi Edgar Meij,
    I’m working on a project about entity linking.
    I’d appreciate it if you could send me the data sets.
    sleephing@gmail.com
    Thank u so much!

Welcome!

This is the website of Edgar Meij. I lead several groups of researchers and engineers at Bloomberg working on knowledge graphs, question answering, information retrieval, machine learning, and more…
