Time series

OpenGeist: Insight in the Stream of Page Views on Wikipedia

We present a RESTful interface that captures insights into the zeitgeist of Wikipedia users. In recent years many so-called zeitgeist applications have been launched. Such applications are used to gain insights into the gist of society and current affairs. Several news sources run zeitgeist applications for popular and trending news. In addition, there are zeitgeist applications that report on trending publications, such as LibraryThing, and on trending topics, such as Google Zeitgeist. There is an interesting open data source from which a stream of people’s changing interests can be observed across a very broad spectrum of areas: the Wikimedia access logs. These logs contain the number of requests made to any Wikimedia domain, sorted by subdomain and aggregated on an hourly basis. Since they are a log of the actual requests, they are noisy and can also contain requests for non-existent pages. They are also quite large, amounting to 60 GB of compressed textual data per month. Currently, we update the data on a daily basis and filter the raw source data by matching the URLs of all English Wikipedia articles and their redirects.
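The filtering step described above can be illustrated with a minimal Python sketch. It assumes each hourly log line has four space-separated fields (project code, page title, request count, bytes transferred); that layout, the function name `filter_page_views`, and the toy data are assumptions made for illustration, not details taken from the paper.

```python
def filter_page_views(lines, articles, redirects, project="en"):
    """Sum request counts per known article, folding redirects into their target.

    Assumed line layout (hypothetical): "<project> <title> <count> <bytes>".
    """
    totals = {}
    for line in lines:
        parts = line.split()
        # Skip malformed lines and lines from other projects.
        if len(parts) != 4 or parts[0] != project:
            continue
        title, count = parts[1], parts[2]
        if not count.isdigit():
            continue
        # Map a redirect title onto the article it points to.
        canonical = redirects.get(title, title)
        # Drop requests for pages that are not known articles.
        if canonical not in articles:
            continue
        totals[canonical] = totals.get(canonical, 0) + int(count)
    return totals


# Toy data, invented for this example.
sample_lines = [
    "en Zeitgeist 10 2000",
    "en Zeit_geist 2 400",
    "de Zeitgeist 7 900",
    "en No_such_page 1 100",
    "garbage",
]
hourly_totals = filter_page_views(
    sample_lines, articles={"Zeitgeist"}, redirects={"Zeit_geist": "Zeitgeist"}
)
print(hourly_totals)
```

Folding redirects into their target article keeps a single time series per concept, which is what the later comparison steps would need.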

In this paper we describe an API that facilitates easy access to the access logs. We have identified the following requirements our system should have:

  • The user must have access to the raw time series data for a concept.
  • The user must be able to find the N most temporally similar concepts.
  • The user must be able to group concepts and their data, based either on the category system of Wikipedia or on similarity between concepts.
  • The system must return either a textual or a visual representation.
  • The user should be able to apply time series filters to extract trends and (recurring) events.

The API is an interface for clustering and comparing concepts based on the time series of the number of views of their Wikipedia page.
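Two of the requirements above, finding the N most temporally similar concepts and applying a time series filter, can be sketched in a few lines of Python. This is only an illustration: the source does not say which similarity measure or filters the API uses, so Pearson correlation and a simple moving average are assumptions here, and the function names and view counts are hypothetical.

```python
from math import sqrt


def pearson(a, b):
    """Pearson correlation of two equal-length series; 0.0 if either is constant."""
    n = len(a)
    if n == 0 or n != len(b):
        raise ValueError("series must be non-empty and of equal length")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)


def most_similar(target, series_by_concept, n):
    """Return the n concepts whose series correlate most strongly with the target."""
    reference = series_by_concept[target]
    scores = [
        (name, pearson(reference, series))
        for name, series in series_by_concept.items()
        if name != target
    ]
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:n]


def moving_average(series, window):
    """Smooth a series with a sliding mean; output has len(series) - window + 1 points."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical view counts per time step, invented for this example.
views = {
    "Olympics": [1, 2, 10, 2, 1],
    "Athletics": [2, 3, 12, 3, 2],
    "Christmas": [1, 1, 1, 5, 9],
}
print(most_similar("Olympics", views, 1))
print(moving_average(views["Christmas"], 2))
```

Correlation compares the shape of two series rather than their absolute volume, so a rarely viewed page can still be matched with a popular one that peaks at the same time.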

See http://www.opengeist.org for more info and examples.

  • [PDF] M-H. Peetz, E. Meij, and M. de Rijke, “OpenGeist: Insight in the Stream of Page Views on Wikipedia,” in SIGIR 2012 Workshop on Time-aware Information Access, 2012.
    [Bibtex]
    @inproceedings{SIGIR-WS:2012:Peetz,
    Author = {Peetz, M-H. and Meij, E. and de Rijke, M.},
    Booktitle = {SIGIR 2012 Workshop on Time-aware Information Access},
    Date-Added = {2012-10-28 16:35:47 +0000},
    Date-Modified = {2012-10-31 10:48:46 +0000},
    Title = {{OpenGeist}: Insight in the Stream of Page Views on {Wikipedia}},
    Year = {2012}}

escience graph

Enabling Data Transport between Web Services

Despite numerous benefits, many Web Services (WS) face problems with respect to data transport, either because SOAP doesn’t offer a scalable way of transporting large datasets or because orchestration workflows (WF) don’t move data around efficiently. In this paper we address both problems with the development of the ProxyWS. This is a WS that uses protocols offered by the Virtual Resource System (VRS) to enable other WS to transfer and access large datasets without modifying either the WS or the underlying environment.

There is currently an abundance of deployed (legacy) WS using SOAP, which fail to produce, access, and return large datasets. Moreover, orchestration WF cause WS to pass messages containing data back through the WF engine. To address these problems we introduce the ProxyWS: a WS that is able to access data from remote resources (GridFTP, LFC, etc.), thanks to the VRS, and also to transport larger data produced by WS, both legacy and new. For the ProxyWS to be able to provide larger data transfers to legacy WS, it has to be deployed in the same Axis-based container, just like a normal WS. This enables clients to make proxy calls to the ProxyWS instead of to a legacy WS. As a consequence, the ProxyWS returns a SOAP message containing a URI referring to the data location. For new implementations, the ProxyWS is used as an API that can create data streams from remote data resources and from other WS using the ProxyWS. This approach proved to be the most scalable, since WS can process data as they are generated by the producing WS. Thus, with the introduction of the ProxyWS we are able to provide a separate channel for data transfers that allows for more scalable SOA-based applications.
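The idea of returning a URI in place of the data itself can be sketched as follows. This is a hedged, minimal Python illustration of the pass-by-reference pattern, not the ProxyWS API: `legacy_service`, `proxy_call`, and `fetch` are hypothetical names, and a local temporary file with a `file://` URI stands in for the remote resources (GridFTP, LFC, etc.) that the VRS would provide.

```python
import tempfile
import urllib.request
from pathlib import Path


def legacy_service(size):
    """Stand-in for a legacy WS that produces a large payload (hypothetical)."""
    return b"x" * size


def proxy_call(service, *args):
    """Call the service, store its output, and return only a URI to that output.

    The small URI would travel in the control message; the payload stays
    out of it. A temporary local file is an assumption of this sketch.
    """
    payload = service(*args)
    with tempfile.NamedTemporaryFile(delete=False, suffix=".dat") as handle:
        handle.write(payload)
        path = Path(handle.name)
    return path.resolve().as_uri()


def fetch(uri):
    """Consumer side: dereference the URI over a separate data channel."""
    with urllib.request.urlopen(uri) as response:
        return response.read()


uri = proxy_call(legacy_service, 1000)
data = fetch(uri)
print(uri.startswith("file://"), len(data))
```

Because only the reference passes through the caller, a workflow engine sitting between producer and consumer never has to hold the payload.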

Many different approaches have been introduced in an attempt to address the problems mentioned earlier. Examples of these include Styx Grid Services, Data Proxy Web services for Taverna, and Flex-SwA. Some noteworthy features of these approaches are direct streaming between WS, the use of alternative protocols for data transport, and larger data delivery to legacy WS. However, each of these examples addresses only one part of the problem and, furthermore, does not include any means of accessing remote data resources. Leveraging these existing proposals and combining them with the VRS, we implemented the ProxyWS. To validate it, we tested its performance using two data-intensive WF. The first is a distributed indexing application that uses a set of WS to speed up the indexing of a large set of documents, while the second relies on that index to retrieve and recognize protein names contained in the results of a query. With the use of the ProxyWS we are able to retrieve data from remote locations (8.4 GB of documents for indexing), as well as to obtain more results for a query (8300 documents using the ProxyWS versus 1100 using SOAP).

We have presented the ProxyWS, which may be used to support large data transfers for legacy and new WS. We have verified its ability to deliver large datasets on two real-life tasks: indexing using WS in a distributed environment and annotating documents from an index. From our experiments we have found that the ProxyWS is able to facilitate data transports where normal SOAP messages would have failed. We have also demonstrated that, with the use of the ProxyWS, legacy WS can scale further by avoiding data delivery via SOAP and by delivering data directly from the producing to the consuming WS.

  • [PDF] S. Koulouzis, E. Meij, and A. Belloum, “Enabling Large Data Transfers Between Web Services,” in 5th EGEE User Forum, 2010.
    [Bibtex]
    @inproceedings{EGEE:2010:koulouzis,
    Author = {Koulouzis, S. and Meij, E. and Belloum, A.},
    Booktitle = {5th EGEE User Forum},
    Date-Added = {2011-10-20 10:00:08 +0200},
    Date-Modified = {2011-10-20 10:00:08 +0200},
    Title = {Enabling Large Data Transfers Between Web Services},
    Year = {2010}}
escience graph

Enabling Data Transport between Web Services through alternative protocols and Streaming

As web services gain acceptance in the e-Science community, some of their shortcomings have begun to appear. A significant challenge is to find reliable and efficient methods to transfer large data between web services. This paper describes the problem of scalable data transport between web services and proposes a solution: the development of a modular Server/Client library that uses SOAP as a control channel while the actual data transport is accomplished by various protocol implementations, as well as a simple API that developers can use for data-intensive applications. Apart from file transport, the proposed approach offers the facility of direct data streaming between web services, an approach that could reduce workflow execution time by creating a data pipeline between web services. Finally, the performance and usability of this library are evaluated using the indexing application that the Adaptive Information Disclosure Application (AIDA) Toolkit offers as a Web Service.
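The data pipeline idea can be illustrated with Python generators, where each stage consumes items as soon as the previous stage produces them instead of waiting for a complete file. This is a sketch of the streaming pattern only: the stage names and sample documents are hypothetical, in-process generators stand in for network streams between services, and nothing here reflects the library's actual API.

```python
def produce_documents(texts):
    """Producing service: emit documents one at a time."""
    for text in texts:
        yield text


def tokenize(documents):
    """Intermediate service: turn each document into (id, tokens) as it arrives."""
    for doc_id, text in enumerate(documents):
        yield doc_id, text.lower().split()


def build_index(token_stream):
    """Consuming service: build an inverted index incrementally."""
    index = {}
    for doc_id, tokens in token_stream:
        for token in tokens:
            index.setdefault(token, set()).add(doc_id)
    return index


# Sample documents, invented for this example.
documents = ["Protein kinase", "kinase inhibitor"]
index = build_index(tokenize(produce_documents(documents)))
print(sorted(index))
```

No stage holds more than one document at a time, which is the property that lets downstream work overlap with upstream production.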

  • [PDF] S. Koulouzis, E. Meij, M. S. Marshall, and A. Belloum, “Enabling Data Transport between Web Services through alternative protocols and Streaming,” in 4th IEEE International Conference on e-Science, 2008.
    [Bibtex]
    @inproceedings{IEEE:2008:koulouzis,
    Author = {Koulouzis, S. and Meij, E. and Marshall, M.S. and Belloum, A.},
    Booktitle = {4th IEEE International Conference on e-Science},
    Date-Added = {2011-10-16 10:35:31 +0200},
    Date-Modified = {2011-10-16 10:35:31 +0200},
    Title = {Enabling Data Transport between Web Services through alternative protocols and Streaming},
    Year = {2008}}