Filtering relevant documents with respect to entities is an essential task in the context of knowledge base construction and maintenance. It entails processing a time-ordered stream of documents that might be relevant to an entity in order to select only those that contain vital information. State-of-the-art approaches to document filtering for popular entities are entity-dependent: they rely on and are also trained on the specifics of differentiating features for each specific entity. Moreover, these approaches tend to use so-called extrinsic information such as Wikipedia page views and related entities which is typically only available only for popular head entities. Entity-dependent approaches based on such signals are therefore ill-suited as filtering methods for long-tail entities. Continue reading “Document Filtering for Long-tail Entities” »
Time-Aware Chi-squared for Document Filtering over Time
To appear at TAIA2013 (a SIGIR 2013 workshop).
Document filtering over time is widely applied in various tasks such as tracking topics in online news or social media. We consider it a classification task, where topics of interest correspond to classes, and the feature space consists of the words associated to each class. In “streaming” settings the set of words associated with a concept may change. In this paper we employ a multinomial Naive Bayes classifier and perform periodic feature selection to adapt to evolving topics. We propose two ways of employing Pearson’s χ2 test for feature selection and demonstrate its benefit on the TREC KBA 2012 data set. By incorporating a time-dependent function in our equations for χ2 we provide an elegant method for applying different weighting schemes. Experiments show improvements of our approach over a non-adaptive baseline.
TREC 2012 summary
In the 21st Text REtrieval Conference (TREC 2012), seven tracks ran: KBA, Contextual suggestion, Session, Web, Medical, Crowdsourcing, and Microblog. Of these, Microblog attracted the largest number of participating groups (40) closely followed by Medical (24). Continue reading “TREC 2012 summary” »
The University of Amsterdam at TREC 2012
This year the Information and Language Processing Systems (ILPS) group of the University of Amsterdam participated in the Microblog and the Knowledge Base Acceleration (KBA) tracks. Continue reading “The University of Amsterdam at TREC 2012” »
Hadoop code for TREC KBA
I’ve decided to put some of the Hadoop code I developed for the TREC KBA task online. It’s available on Github: https://github.com/