Data Stream Processing: An Extended Publish Subscribe System

Overview

In many different situations, such as stock tickers, e-mail correspondence, and network analysis, data arrives in a stream. In these situations one often wants to pose queries about this data stream, asking, for example, if a stock price was seen to increase by 10% and then decrease by at least the same amount within 30 minutes. In general, the queries can be divided into 2 classes, stateless and stateful. Stateless queries can be answered just by looking at a single event. For example, IBM > $10 or MICROSOFT < IBM. Stateful queries, on the other hand, require previous information to be kept around, like the query mentioned in the first paragraph. There are a variety of stateless publish subscribe systems whose performance is quite impressive, such as Le Subscribe. Yet there is a lack of well defined systems able to answer stateful queries.

Goal
The purpose of this project is to design an extended publish subscribe system that is able to answer some subset of stateful queries. One of the major flaws with most existing stateful stream processing systems is that their semantics are not well defined. Many have quite powerful query langauges, but often it is unclear as to exactly what a given query should mean in the context of the stream. As such, this project began developing a formal algebra to describe the query language. With a complete formal backing, the work then shifted into implementing the algebra in an efficient real-time system.

Design
The extended publish subscribe system is broken into three main components. There is the classic publish subscribe system, the state manager, and the io wrapper. The classic system is used in its entirety to answer stateless queries. The architecture of the system allows for any publish subcribe system to be plugged in for this component. The state manager is what answers the stateful queries. It builds a finite state automaton to answer each query, using the publish subscribe system when appropriate. The io wrapper allows the system to hook up to an arbitrary stream when a schema has been provided, and allows the user to issue queries about this stream.

Demo
The demo will show how the extended publish subcribe system is able to process a large number of stateful queries in real time.

Involved Undergraduates