Time-Aware Chi-squared for Document Filtering over Time

To appear at TAIA2013 (a SIGIR 2013 workshop).

Document filtering over time is widely applied in various tasks such as tracking topics in online news or social media. We consider it a classification task, where topics of interest correspond to classes, and the feature space consists of the words associated with each class. In “streaming” settings the set of words associated with a concept may change. In this paper we employ a multinomial Naive Bayes classifier and perform periodic feature selection to adapt to evolving topics. We propose two ways of employing Pearson’s χ2 test for feature selection and demonstrate their benefit on the TREC KBA 2012 data set. By incorporating a time-dependent function into our equations for χ2, we provide an elegant method for applying different weighting schemes. Experiments show improvements of our approach over a non-adaptive baseline.
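
The abstract does not spell out the weighting function itself, so the following is only a minimal sketch of the idea of folding a time-dependent function into χ2: every document contributes a time-decayed weight to the contingency counts rather than a raw count of one. The exponential decay, the half-life parameter, and the document representation are assumptions made for illustration, not the authors’ implementation.

```python
def time_weight(doc_time, now, half_life_days=7.0):
    """Hypothetical time-dependent function: exponential decay so that
    recent documents contribute more to the statistic."""
    age_days = (now - doc_time) / 86400.0
    return 0.5 ** (age_days / half_life_days)

def time_aware_chi2(docs, term, topic, now):
    """Chi-squared for a (term, topic) pair in which every document
    contributes its time weight instead of a raw count of one."""
    a = b = c = d = 0.0  # weighted 2x2 contingency table
    for doc in docs:  # assumed shape: {"words": set, "topic": str, "time": float}
        w = time_weight(doc["time"], now)
        has_term = term in doc["words"]
        in_topic = doc["topic"] == topic
        if has_term and in_topic:
            a += w
        elif has_term:
            b += w
        elif in_topic:
            c += w
        else:
            d += w
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom
```

With a very large half-life the weights approach one and the statistic reduces to the ordinary χ2, which corresponds to the non-adaptive end of the spectrum; the highest-scoring terms per topic would then be retained at each periodic feature-selection step.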


Multilingual Semantic Linking for Video Streams: Making “Ideas Worth Sharing” More Accessible

This paper describes our (winning!) submission to the Developers Challenge at WoLE2013, “Doing Good by Linking Entities.” We present a fully automatic system – called “Semantic TED” – which provides intelligent suggestions in the form of links to Wikipedia articles for video streams in multiple languages, based on the subtitles that accompany the visual content. The system is applied to online conference talks. In particular, we adapt a recently proposed semantic linking approach for streams of television broadcasts to facilitate generating contextual links while a TED talk is being viewed. TED is a highly popular global conference series covering many research domains; the publicly available talks have accumulated a total view count of over one billion at the time of writing. We exploit the multi-linguality of Wikipedia and the TED subtitles to provide contextual suggestions in the language of the user watching a video. In this way, a vast source of educational and intellectual content is disclosed to a broad audience that might otherwise experience difficulties interpreting it.
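
The paper describes the system at a high level only, but the cross-lingual step it relies on (serving a suggestion in the viewer’s language rather than in English) can be illustrated with Wikipedia’s langlinks API, which maps an article to its counterparts in other languages. The function below is an illustrative sketch, not part of Semantic TED, and the example titles in the comment are only indicative.

```python
import requests

def cross_lingual_title(en_title, target_lang):
    """Map an English Wikipedia article to its counterpart in another
    language using the MediaWiki langlinks API."""
    resp = requests.get(
        "https://en.wikipedia.org/w/api.php",
        params={
            "action": "query",
            "prop": "langlinks",
            "titles": en_title,
            "lllang": target_lang,
            "format": "json",
            "formatversion": "2",
        },
        timeout=10,
    )
    pages = resp.json()["query"]["pages"]
    links = pages[0].get("langlinks", [])
    return links[0]["title"] if links else None

# e.g. cross_lingual_title("Climate change", "nl")
# would return the title of the corresponding Dutch article, if one exists
```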

  • [PDF] D. Odijk, E. Meij, D. Graus, and T. Kenter, “Multilingual semantic linking for video streams: making "ideas worth sharing" more accessible,” in Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013), 2013.
    [Bibtex]
    @inproceedings{WOLE:2013:Odijk,
    Author = {Odijk, Daan and Meij, Edgar and Graus, David and Kenter, Tom},
    Booktitle = {Proceedings of the 2nd International Workshop on Web of Linked Entities (WoLE 2013)},
    Date-Added = {2013-05-15 14:09:58 +0000},
    Date-Modified = {2013-05-15 14:11:37 +0000},
    Title = {Multilingual Semantic Linking for Video Streams: Making "Ideas Worth Sharing" More Accessible},
    Year = {2013}}

OpenGeist: Insight in the Stream of Page Views on Wikipedia

We present a RESTful interface that captures insights into the zeitgeist of Wikipedia users. In recent years many so-called zeitgeist applications have been launched. Such applications are used to gain insight into the current gist of society and current affairs. Several news sources run zeitgeist applications for popular and trending news. In addition, there are zeitgeist applications that report on trending publications, such as LibraryThing, and on trending topics, such as Google Zeitgeist. There is an interesting open data source from which a stream of people’s changing interests can be observed across a very broad spectrum of areas: the Wikimedia access logs. These logs contain the number of requests made to any Wikimedia domain, sorted by subdomain, and aggregated on an hourly basis. Since they are a log of the actual requests, they are noisy and can contain non-existent pages. They are also quite large, amounting to some 60 GB of compressed textual data per month. Currently, we update the data on a daily basis and filter the raw source data by matching the URLs of all English Wikipedia articles and their redirects.

In this paper we describe an API that facilitates easy access to these access logs. We have identified the following requirements for our system:

  • The user must have access to the raw time series data for a concept.
  • The user must be able to find the N most temporally similar concepts.
  • The user must be able to group concepts and their data, based either on Wikipedia’s category system or on the similarity between concepts.
  • The system must return either a textual or a visual representation.
  • The user should be able to apply time series filters to extract trends and (recurring) events.

The API is an interface for clustering and comparing concepts based on the time series of the number of views of their Wikipedia pages.
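
The abstract does not document the API’s endpoints, so the sketch below only illustrates the core operation behind the second requirement: ranking concepts by how similar their page-view time series are to a target concept’s. The z-normalization and correlation-style similarity are assumptions chosen for the example, not necessarily the measures used by OpenGeist.

```python
import math

def zscore(series):
    """Normalize a page-view series so that scale differences between
    popular and niche articles do not dominate the comparison."""
    mean = sum(series) / len(series)
    var = sum((x - mean) ** 2 for x in series) / len(series)
    std = math.sqrt(var) or 1.0
    return [(x - mean) / std for x in series]

def temporal_similarity(a, b):
    """Pearson-style similarity between two equally long view series."""
    za, zb = zscore(a), zscore(b)
    return sum(x * y for x, y in zip(za, zb)) / len(za)

def most_similar(target, candidates, n=5):
    """Requirement 2: the N concepts whose view counts move most
    similarly to the target concept's over time."""
    scored = [(name, temporal_similarity(target, series))
              for name, series in candidates.items()]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:n]
```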

See http://www.opengeist.org for more info and examples.

  • [PDF] M-H. Peetz, E. Meij, and M. de Rijke, “OpenGeist: insight in the stream of page views on Wikipedia,” in SIGIR 2012 Workshop on Time-aware Information Access, 2012.
    [Bibtex]
    @inproceedings{SIGIR-WS:2012:Peetz,
    Author = {Peetz, M-H. and Meij, E. and de Rijke, M.},
    Booktitle = {SIGIR 2012 Workshop on Time-aware Information Access},
    Date-Added = {2012-10-28 16:35:47 +0000},
    Date-Modified = {2012-10-31 10:48:46 +0000},
    Title = {{OpenGeist}: Insight in the Stream of Page Views on {Wikipedia}},
    Year = {2012}}

A Corpus for Entity Profiling in Microblog Posts

Microblogs have become an invaluable source of information for the purpose of online reputation management. An emerging problem in this field is identifying the key aspects of an entity that are commented on in microblog posts. Streams of microblog posts are of great value because of their direct and real-time nature, and synthesizing them into entity profiles helps reputation managers keep track of the public image of an entity. Determining such aspects can be non-trivial because of creative language usage, the highly contextualized and informal nature of microblog posts, and the limited length of this form of communication.

In this paper we present two manually annotated corpora for evaluating the task of identifying aspects on Twitter, both of them based upon the WePS-3 ORM task dataset and made available online. The first is created using a pooling methodology, for which we have implemented various methods for automatically extracting aspects from tweets that are relevant for an entity; human assessors have then labeled each of these candidates for relevance. The second corpus is more fine-grained and contains opinion targets. Here, annotators consider individual tweets related to an entity and manually identify whether the tweet is opinionated and, if so, which part of the tweet is subjective and what the target of the sentiment is, if any.
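
The extraction methods used to build the pool are only named here, not described. As a purely illustrative stand-in, a frequency-based candidate generator might look like the sketch below; the stopword list, tokenization, and scoring are assumptions, not the methods from the paper.

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "or", "to", "of", "in", "is", "for", "on"}

def candidate_aspects(tweets, entity, top_n=10):
    """Naive candidate generator: frequent terms in tweets that mention
    the entity, excluding the entity name itself and stopwords."""
    counts = Counter()
    for tweet in tweets:
        text = tweet.lower()
        if entity.lower() not in text:
            continue
        for token in re.findall(r"[#@]?\w+", text):
            if token != entity.lower() and token not in STOPWORDS:
                counts[token] += 1
    return [term for term, _ in counts.most_common(top_n)]
```

Candidates produced this way would then be pooled and judged for relevance by the human assessors.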

You can find more information on this test collection at http://nlp.uned.es/~damiano/datasets/entityProfiling_ORM_Twitter.html.

  • [PDF] D. Spina, E. Meij, A. Oghina, B. M. Thuong, M. Breuss, and M. de Rijke, “A corpus for entity profiling in microblog posts,” in LREC 2012 Workshop on Language Engineering for Online Reputation Management, 2012.
    [Bibtex]
    @inproceedings{LEROM:2012:spina,
    Author = {Damiano Spina and Edgar Meij and Andrei Oghina and Bui Minh Thuong and Mathias Breuss and Maarten de Rijke},
    Booktitle = {LREC 2012 Workshop on Language Engineering for Online Reputation Management},
    Date-Added = {2012-03-29 12:18:51 +0200},
    Date-Modified = {2012-03-29 12:20:09 +0200},
    Title = {A Corpus for Entity Profiling in Microblog Posts},
    Year = {2012}}

Entity Search: Building Bridges between Two Worlds

We have come to depend on technological resources to create order and find meaning in the ever-growing amount of online data. One frequently recurring type of query in web search is the query that contains named entities (persons, organizations, locations, etc.): we organize our environments around entities that are meaningful to us. Hence, to support humans in dealing with massive volumes of data, next-generation search engines need to organize information in semantically meaningful ways, structured around entities. Furthermore, instead of merely finding documents that mention an entity, finding the entity itself is required.

The problem of entity search has been and is being looked at by both the Information Retrieval (IR) and Semantic Web (SW) communities and is, in fact, ranked high on the research agendas of the two communities. The entity search task comes in several flavors. One is known as entity ranking (given a query and target category, return a ranked list of relevant entities), another is list completion (given a query and example entities, return similar entities), and a third is related entity finding (given a source entity, a relation and a target type, identify target entities that enjoy the specified relation with the source entity and that satisfy the target type constraint).

State-of-the-art IR models allow us to address entity search by identifying relevant entities in large volumes of web data. These methods often approach entity-oriented retrieval tasks by establishing associations between topics, documents, and entities or amongst entities themselves, where such associations are modeled by observing the language usage around entities. A major challenge with current IR approaches to entity retrieval is that they fail to produce interpretable descriptions of the entities found or of the relationships between entities. The generated models tend to lack human-interpretable semantics and are rarely meaningful for human consumption: interpretable labels are needed (both for entities and for relations). Linked Open Data (LOD) is a recent contribution of the emerging semantic web that has the potential to provide the required semantic information.

From a SW point of view, entity retrieval should be as simple as running SPARQL queries over structured data. However, since a true semantic web has not yet been fully realized, the results of such queries are currently not sufficient to answer common information needs. By now, the LOD cloud contains millions of concepts from over one hundred structured data sets. This abundance, however, also introduces new issues such as “cheap semantics” (e.g., wikilink relations in DBpedia) and the need to rank potentially very large numbers of results. Furthermore, given that most web users are not proficient in semantic web languages such as SPARQL or standards such as RDF and OWL, the free-form text input used by most IR systems is more appealing to end users.

These concurrent developments give rise to the following general question: to what extent are state-of-the-art IR and SW technologies capable of answering information needs related to entity finding? In this paper we focus on the task of related entity finding (REF). For example, given a source entity (“Michael Schumacher”), a relation (“Michael’s teammates while he was racing in Formula 1”) and a target type (“people”), a REF system should return entities such as “Eddie Irvine” and “Felipe Massa.” REF aims at making arbitrary relations between entities searchable. We focus on an adaptation of the official task as it was run at TREC 2009 and restrict the target entities to those having a primary Wikipedia article; this modification provides an elegant way of making the IR and SW results comparable.

From an IR perspective, a natural way of capturing the relation between a source and a target entity is through their co-occurrence in suitable contexts. In this paper we use an aggregate of methods, all of which are based on this approach. In contrast, a SW perspective on the same task is to search for entities through explicit links such as those in the LOD cloud; for this we apply both standard SPARQL queries and an exhaustive graph search algorithm.
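
The paper does not include its SPARQL queries, so the snippet below is only a hedged illustration of the SW route for the Michael Schumacher example, using the public DBpedia endpoint via the SPARQLWrapper library; the specific graph pattern is an assumption, not the query used in the paper.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Illustrative only: find people directly linked to the source entity in
# DBpedia, a naive SW-style take on related entity finding.
QUERY = """
PREFIX dbr: <http://dbpedia.org/resource/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>

SELECT DISTINCT ?other WHERE {
  { dbr:Michael_Schumacher ?p ?other . }
  UNION
  { ?other ?p dbr:Michael_Schumacher . }
  ?other rdf:type dbo:Person .
}
LIMIT 50
"""

sparql = SPARQLWrapper("https://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

for binding in results["results"]["bindings"]:
    print(binding["other"]["value"])  # URIs of candidate related entities
```

A query like this returns every person directly connected to the source entity, which also illustrates the issues raised above: without a dedicated “teammate” predicate the relation constraint cannot be expressed as a simple triple pattern, and the potentially large result set still needs to be ranked.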

In this paper, we analyze and discuss to what extent REF can be solved by IR and SW methods. It is important to note that our goal is not to perform a quantitative comparison or to make claims about one approach being better than the other. Rather, we investigate the results returned by either approach and perform a more qualitative evaluation. We find that IR and SW methods discover different sets of entities, although these sets overlap. Based on the results of our evaluation, we demonstrate that the two approaches are complementary in nature and we discuss how each field could potentially benefit from the other. We arrive at and motivate a proposal to combine text-based entity models with semantic information from the Linked Open Data cloud.

  • [PDF] K. Balog, E. Meij, and M. de Rijke, “Entity search: building bridges between two worlds,” in Proceedings of the 3rd International Semantic Search Workshop (SEMSEARCH 2010), 2010.
    [Bibtex]
    @inproceedings{semsearch:2010:balog,
    Author = {Balog, Krisztian and Meij, Edgar and de Rijke, Maarten},
    Booktitle = {Proceedings of the 3rd International Semantic Search Workshop},
    Date-Added = {2011-10-20 10:07:31 +0200},
    Date-Modified = {2012-10-30 08:41:54 +0000},
    Series = {SEMSEARCH 2010},
    Title = {Entity search: building bridges between two worlds},
    Year = {2010},
    Bdsk-Url-1 = {http://doi.acm.org/10.1145/1863879.1863888}}

Investigating the Demand Side of Semantic Search through Query Log Analysis

Semantic search is, by its broadest definition, a collection of approaches that aim at matching the Web’s content with the information needs of Web users at a semantic level. Most of the work in this area has focused on the supply side of semantic search, in particular elevating Web content to the semantic level by relying on methods of information extraction or working with explicit metadata embedded inside or linked to Web resources. With respect to explicit metadata, several studies have been done on the adoption of semantic web formats in the wild, mostly based on statistics from the crawls of semantic web search engines. Much less effort has focused on the demand side of semantic search, i.e., interpreting queries at the semantic level and studying information needs at this level. As a result, little is known about how well the supply of metadata actually matches the demand for information on the Web.

In this paper, we address the problem of studying the information needs of Web searchers at an ontological level, i.e., in terms of the particular attributes of objects they are interested in. We describe a set of methods for extracting the context words for certain classes of objects from a Web search query log. We do so based on the idea that common context words reflect the aspects of objects that users are interested in. We implement these methods in an interactive tool called the Semantic Search Assist. The original purpose of this tool was to generate type-based query suggestions when there is not enough statistical evidence for entity-based query suggestions. However, from an ontology engineering perspective, this tool answers the question of what attributes a class of objects would have if its ontology were engineered purely based on the information needs of end users. As such, it allows us to reflect on the gap between the properties defined in Semantic Web ontologies and the attributes of objects that people are searching for on the Web. We evaluate our tool by measuring its predictive power on the query log itself. We leave the study of the gap between particular information needs and Semantic Web data for future work.
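
The extraction methods themselves are only named in the abstract. The sketch below shows one simple way context words for a class of objects could be collected from a query log, given a mapping from entities to classes; the toy log, the class mapping, and the counting scheme are assumptions for illustration only, not the methods behind the Semantic Search Assist.

```python
from collections import Counter, defaultdict

def class_context_words(query_log, entity_to_class):
    """Aggregate the words that co-occur with entities of each class,
    e.g. "lyrics" or "tickets" for entities of class "musician"."""
    contexts = defaultdict(Counter)
    for query in query_log:
        tokens = query.lower().split()
        joined = " ".join(tokens)
        for entity, cls in entity_to_class.items():
            ent_tokens = entity.lower().split()
            if " ".join(ent_tokens) in joined:  # crude containment check
                rest = [t for t in tokens if t not in ent_tokens]
                contexts[cls].update(rest)
    return contexts

# Toy example (hypothetical data):
log = ["madonna lyrics", "madonna tour tickets", "u2 lyrics", "u2 tickets dublin"]
mapping = {"madonna": "musician", "u2": "musician"}
print(class_context_words(log, mapping)["musician"].most_common(3))
# most frequent context words for the "musician" class, e.g. "lyrics", "tickets"
```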

  • [PDF] E. Meij, P. Mika, and H. Zaragoza, “Investigating the demand side of semantic search through query log analysis,” in Proceedings of the Workshop on Semantic Search (SemSearch 2009) at the 18th International World Wide Web Conference (WWW 2009), 2009.
    [Bibtex]
    @inproceedings{semsearch:2009:meij,
    Author = {Edgar Meij and P. Mika and H. Zaragoza},
    Booktitle = {Proceedings of the Workshop on Semantic Search (SemSearch 2009) at the 18th International World Wide Web Conference (WWW 2009)},
    Date-Added = {2011-10-12 18:31:55 +0200},
    Date-Modified = {2012-10-30 08:43:47 +0000},
    Title = {Investigating the Demand Side of Semantic Search through Query Log Analysis},
    Year = {2009}}

Deploying Lucene on the Grid

We investigate whether and how open source retrieval engines can be deployed in a grid environment. When comparing grids to conventional distributed IR, the lack of a priori knowledge about available nodes is one of the most significant differences. On top of that, it is unknown when a particular node will have time and resources available to start a submitted job. Therefore, conventional methods such as RMI are not directly usable and we propose a different approach, using middleware designed specifically for grids. We describe GridLucene, an extension of the open source engine Lucene with grid-specific classes, built on this middleware. We report on an initial comparison between GridLucene and Lucene, and find a minor penalty (in terms of execution time) for grid-based indexing and a more serious penalty for grid-based retrieval.

The middleware can gather a set of physical resources into a single logical resource with abstract, user-definable properties. These properties are used during indexing and retrieval to tell GridLucene which files it needs to access. By using this kind of semantic information, grid nodes can “discover” which indices exist on the grid and which particular documents need to be indexed.

GridLucene is available for download under the same license as Lucene.

  • [PDF] E. Meij and M. de Rijke, “Deploying Lucene on the grid,” in Proceedings of the SIGIR 2006 Workshop on Open Source Information Retrieval (OSIR2006), 2006.
    [Bibtex]
    @inproceedings{OSIR:2005:meij,
    Author = {Meij, E. and de Rijke, M.},
    Booktitle = {Proceedings SIGIR 2006 workshop on Open Source Information Retrieval (OSIR2006)},
    Date-Added = {2011-10-12 23:08:51 +0200},
    Date-Modified = {2011-10-12 23:08:51 +0200},
    Title = {Deploying Lucene on the Grid},
    Year = {2006}}