WSDM 2014, a recap

WSDM is wrapping up today, with only the workshops left for tomorrow. All in all, it was an exciting WSDM with lots of interesting talks and discussions. And of course, Times Square is hard to beat as a conference venue. Some papers/talks that caught my (semantic search) eye at WSDM, in no particular order.

Until next year in Shanghai!

 

Linking queries to entities

I’m happy to announce that we’re releasing a new test collection for linking web queries (within user sessions) to Wikipedia entities. About half of the queries in this dataset are sampled from Yahoo search logs; the other half come from the TREC Session track. Check out the L24 dataset on Yahoo Webscope, or drop me a line for more information. Below you’ll find an excerpt of the README text associated with it.

With this dataset you can train, test, and benchmark entity linking systems on the task of linking web search queries – within the context of a search session – to entities. Entities are a key enabling component for semantic search, as many information needs can be answered by returning a list of entities, their properties, and/or their relations. A first step in any such scenario is to determine which entities appear in a query – a process commonly referred to as named entity resolution, named entity disambiguation, or semantic linking.

This dataset allows researchers and other practitioners to evaluate their systems for linking web search engine queries to entities. The dataset contains manually identified links to entities in the form of Wikipedia articles and provides the means to train, test, and benchmark such systems using manually created, gold standard data. By releasing this dataset publicly, we aim to foster research into entity linking systems for web search queries. To this end, we also include sessions and queries from the TREC Session track (years 2010–2013). Moreover, since the linked entities are aligned with a specific part of each query (a “span”), this data can also be used to evaluate systems that identify spans in queries, i.e., that perform query segmentation for web search queries in the context of search sessions.
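As a rough illustration of the benchmarking use case, the sketch below scores a system's output against gold standard links using micro-averaged precision, recall, and F1 with exact matching on the query, span, and entity. The tuple layout, the example queries, and the function name `precision_recall_f1` are assumptions made for this example; they do not reflect the dataset's actual file format or an official evaluation script.

```python
from typing import Iterable, Set, Tuple

# A link is (query_id, span_text, wikipedia_title). This tuple layout is a
# hypothetical in-memory representation, not the dataset's file format.
Link = Tuple[str, str, str]


def precision_recall_f1(
    gold: Iterable[Link], predicted: Iterable[Link]
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over exact-match links."""
    gold_set: Set[Link] = set(gold)
    pred_set: Set[Link] = set(predicted)
    true_positives = len(gold_set & pred_set)

    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up gold annotations and system output, for illustration only.
gold_links = [
    ("q1", "times square", "Times_Square"),
    ("q1", "new york", "New_York_City"),
    ("q2", "barcelona", "Barcelona"),
]
system_links = [
    ("q1", "times square", "Times_Square"),
    ("q2", "barcelona", "FC_Barcelona"),  # right span, wrong entity
]

p, r, f = precision_recall_f1(gold_links, system_links)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

Dropping the entity from each tuple and comparing only (query, span) pairs would give a comparable score for query segmentation.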

The key properties of the dataset are as follows.

  • Queries are taken from Yahoo US Web Search and from the TREC Session track (2010–2013).
  • There are 2635 queries in 980 sessions, 7482 spans, and 5964 links to Wikipedia articles in this dataset.
  • The annotations include the part of the query (the “span”) that is linked to each Wikipedia article. This information can also be used for query segmentation experiments.
  • The annotators have identified the “main” entity or entities for each query, if available.
  • The annotators also labeled the queries, identifying whether they are non-English, navigational, quote-or-question, adult, or ambiguous. They also marked queries that mention an out-of-Wikipedia entity, i.e., an entity for which no suitable Wikipedia article exists.
  • The file includes session information: each session consists of an anonymized id, initial query, as well as all the queries issued within the same session and their relative date/timestamp if available.
  • Sessions are demarcated using a 30-minute time-out.
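To make the session demarcation concrete, here is a minimal sketch of time-out based session splitting. It assumes a single user's query log held in memory as (timestamp, query) pairs, and it treats a gap of exactly 30 minutes as still belonging to the same session. Both are assumptions for this example, and the function name and sample queries are made up; the README does not specify these details.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# One log entry for a single user: (timestamp, query string).
# This layout is hypothetical and chosen only for illustration.
Event = Tuple[datetime, str]

SESSION_TIMEOUT = timedelta(minutes=30)


def split_into_sessions(
    events: List[Event], timeout: timedelta = SESSION_TIMEOUT
) -> List[List[Event]]:
    """Group events into sessions; a gap longer than `timeout` starts a new one."""
    sessions: List[List[Event]] = []
    for event in sorted(events, key=lambda e: e[0]):
        previous_time = sessions[-1][-1][0] if sessions else None
        if previous_time is not None and event[0] - previous_time <= timeout:
            sessions[-1].append(event)
        else:
            sessions.append([event])
    return sessions


log = [
    (datetime(2014, 2, 28, 10, 0), "wsdm 2014"),
    (datetime(2014, 2, 28, 10, 10), "wsdm 2014 program"),
    (datetime(2014, 2, 28, 10, 45), "times square hotels"),  # 35-minute gap
    (datetime(2014, 2, 28, 11, 0), "times square subway"),
]

sessions = split_into_sessions(log)
for number, session in enumerate(sessions, start=1):
    print(number, [query for _, query in session])
```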

Entity Linking and Retrieval for Semantic Search (WSDM 2014)

This morning, we presented the last edition of our tutorial series on Entity Linking and Retrieval, entitled “Entity Linking and Retrieval for Semantic Search” (with Krisztian Balog and Daan Odijk), at WSDM 2014! This final edition of the series builds upon our earlier tutorials at WWW 2013 and SIGIR 2013. The focus of this edition lies on the practical applications of Entity Linking and Retrieval, in particular for semantic search: more and more search engine users are expecting direct answers to their information needs (rather than just documents). Semantic search and its recent applications are enabling search engines to organize their wealth of information around entities. Entity linking and retrieval are at the basis of these developments, providing the building blocks for organizing the web of entities.

This tutorial aims to cover all facets of semantic search from a unified point of view and connect real-world applications with results from scientific publications. We provide a comprehensive overview of entity linking and retrieval in the context of semantic search and thoroughly explore techniques for query understanding, entity-based retrieval and ranking on unstructured text, structured knowledge repositories, and a mixture of these. We point out the connections between published approaches and applications, and provide hands-on examples on real-world use cases and datasets.

As before, all our tutorial materials are freely available online at http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/.

RepLab 2014

RepLab is a competitive evaluation exercise for Online Reputation Management systems. In 2012 and 2013, RepLab focused on the problem of monitoring the reputation of (company) entities on Twitter, and dealt with the tasks of entity linking (“Is the tweet about the entity?”), reputation polarity (“Does the tweet have positive or negative implications for the entity’s reputation?”), topic detection (“What is the issue relative to the entity that is discussed in the tweet?”), and topic ranking (“Is the topic an alert that deserves immediate attention?”).

RepLab 2014 will again focus on reputation management on Twitter and will address two new tasks, listed below. We will use tweets in two languages: English and Spanish.

  1. The classification of tweets with respect to standard reputation dimensions such as Performance, Leadership, Innovation, etc.
  2. The classification of Twitter profiles (authors) with respect to a certain domain, classifying them as journalists, professionals, etc. This task also focuses on finding the opinion makers.

The second task is a part of the shared PAN-RepLab author profiling task. Besides the characterization of profiles from a reputation analysis perspective, participants can also attempt to classify authors by gender and age, which is the focus of PAN 2014.

Important dates:

  • March 1 – Training data released
  • March 17 – Test data released
  • May 5 – System results due

See http://nlp.uned.es/replab2014/ for more info and how to participate.

We’re now hiring next year’s interns!

I’m happy to announce that we have just opened up our applications for next year’s internships at Yahoo Labs in Barcelona. So, if you’re a PhD student in a related field, do consider applying. Especially if you’re interested in spending some time in sunny Barcelona and gaining research experience along the way, you’re more than welcome.

The application form can be found at http://comunicacio.barcelonamedia.org/yahoo/; the deadline is January 13th, 2014.

Do reach out if you have any questions!

Entity Linking and Retrieval Tutorial @ SIGIR 2013 – Slides, Code, and Bibliography

The material for our “Entity Linking and Retrieval” tutorial (with Krisztian Balog and Daan Odijk) for SIGIR 2013 has been updated and is available online on GitHub (slides), Dropbox (slides), Mendeley, and Codecademy. All material is summarized on the tutorial webpage: http://ejmeij.github.io/entity-linking-and-retrieval-tutorial/. See my other blog post for a brief summary.

Time-Aware Chi-squared for Document Filtering over Time

To appear at TAIA2013 (a SIGIR 2013 workshop).

Document filtering over time is widely applied in various tasks such as tracking topics in online news or social media. We consider it a classification task, where topics of interest correspond to classes, and the feature space consists of the words associated with each class. In “streaming” settings the set of words associated with a concept may change. In this paper we employ a multinomial Naive Bayes classifier and perform periodic feature selection to adapt to evolving topics. We propose two ways of employing Pearson’s χ² test for feature selection and demonstrate its benefit on the TREC KBA 2012 data set. By incorporating a time-dependent function in our equations for χ² we provide an elegant method for applying different weighting schemes. Experiments show improvements of our approach over a non-adaptive baseline.
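The abstract does not spell out the equations, so the following is only one possible reading of a time-dependent χ², not the formulation from the paper. It computes Pearson's χ² statistic for a term from a 2×2 contingency table (term present or absent, document relevant or not), but lets every document contribute a weight that decays exponentially with its age instead of a count of 1. The decay function, the `rate` parameter, the document layout, and all names are assumptions introduced for this sketch.

```python
import math
from typing import Iterable, Set, Tuple

# A document is (timestamp, set of terms, is_relevant).
# This layout is hypothetical and chosen only for illustration.
Doc = Tuple[float, Set[str], bool]


def exponential_decay(age: float, rate: float) -> float:
    """Assumed weighting scheme: older documents count less."""
    return math.exp(-rate * age)


def time_aware_chi2(docs: Iterable[Doc], term: str, now: float, rate: float = 0.1) -> float:
    """Pearson's chi-squared for `term` over a time-weighted 2x2 table."""
    a = b = c = d = 0.0  # weighted cell totals of the contingency table
    for timestamp, terms, relevant in docs:
        weight = exponential_decay(now - timestamp, rate)
        has_term = term in terms
        if has_term and relevant:
            a += weight  # term present, relevant
        elif has_term:
            b += weight  # term present, not relevant
        elif relevant:
            c += weight  # term absent, relevant
        else:
            d += weight  # term absent, not relevant
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0.0:
        return 0.0  # degenerate table: the term carries no evidence
    return n * (a * d - b * c) ** 2 / denominator


# Toy stream: the topic's vocabulary shifts from "old" to "new" over time.
docs = [
    (0.0, {"old", "topic"}, True),
    (1.0, {"old"}, True),
    (8.0, {"new", "topic"}, True),
    (9.0, {"new"}, True),
    (2.0, {"other"}, False),
    (9.0, {"other", "misc"}, False),
]

for term in ("old", "new"):
    unweighted = time_aware_chi2(docs, term, now=10.0, rate=0.0)
    decayed = time_aware_chi2(docs, term, now=10.0, rate=0.5)
    print(term, round(unweighted, 3), round(decayed, 3))
```

With `rate=0.0` every weight is 1 and the function reduces to the standard χ² statistic, so both terms score the same in the toy stream. With a positive rate, the recently used term scores higher than the outdated one, which is the kind of adaptation that periodic feature selection aims for.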

Do support groups members disclose less to their partners? The dynamics of HIV disclosure in four African countries

To appear in BMC Public Health.

Background: Recent efforts to curtail the HIV epidemic in Africa have emphasized preventing sexual transmission to partners through antiretroviral therapy. A component of current strategies is disclosure to partners; understanding its motivations will therefore help maximise results. This study examines the rates, dynamics and consequences of partner disclosure in Burkina Faso, Kenya, Malawi and Uganda, with special attention to the role of support groups and stigma in disclosure.

Methods: The study employs mixed methods, including a cross-sectional client survey of counseling and testing services, focus groups, and in-depth interviews with HIV-positive individuals in stable partnerships in Burkina Faso, Kenya, Malawi and Uganda, recruited at healthcare facilities offering HIV testing.

Results: Rates of disclosure to partners varied between countries (32.7% – 92.7%). The lowest rate was reported in Malawi. Reasons for disclosure included preventing the transmission of HIV, the need for care, and upholding the integrity of the relationship. Fear of stigma was an important reason for non-disclosure. Women reported experiencing more negative reactions when disclosing to partners. Disclosure was positively associated with living in urban areas, higher education levels, and being male, and negatively associated with membership of support groups.

Conclusions: Understanding the reasons for disclosure and recognizing the role of support groups in the process can help improve current prevention efforts, which increasingly focus on treatment as prevention as a way to halt new infections. Support groups can help spread secondary prevention messages by explaining to their members that antiretroviral treatment has benefits for HIV-positive individuals and their partners. Home-based testing can further facilitate partner disclosure, as couples can test together and be counseled jointly.

Semantic TED

Multilingual Semantic Linking for Video Streams: Making “Ideas Worth Sharing” More Accessible

This paper describes our (winning!) submission to the Developers Challenge at WoLE2013, “Doing Good by Linking Entities.” We present a fully automatic system, called “Semantic TED,” which provides intelligent suggestions in the form of links to Wikipedia articles for video streams in multiple languages, based on the subtitles that accompany the visual content. The system is applied to online conference talks. In particular, we adapt a recently proposed semantic linking approach for streams of television broadcasts to facilitate generating contextual links while a TED talk is being viewed. TED is a highly popular global conference series covering many research domains; the publicly available talks have accumulated a total view count of over one billion at the time of writing. We exploit the multi-linguality of Wikipedia and the TED subtitles to provide contextual suggestions in the language of the user watching a video. In this way, a vast source of educational and intellectual content is disclosed to a broad audience that might otherwise experience difficulties interpreting it.

  • D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual semantic linking for video streams: making "ideas worth sharing" more accessible,” in Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013), 2013.
    [Bibtex]
    @inproceedings{WOLE:2013:Odijk,
    Author = {Odijk, Daan and Meij, Edgar and Graus, David and Kenter, Tom},
    Booktitle = {Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013)},
    Title = {Multilingual Semantic Linking for Video Streams: Making "Ideas Worth Sharing" More Accessible},
    Year = {2013}}