Publish-Subscribe System

Muthiah M Muthaia Chettiar
Manpreet Singh   Mingsheng Hong   Prof. Johannes Gehrke   Prof. Jayavel Shanmugasundaram  


Introduction

Publish-Subscribe system, also known as pubsub in short, is a type of database system that serves a group of subscribers interested in various events with notifications as the events occur. Subscribers who are interested in a set of attributes register their interests in the system. The list of these subscriptions is then indexed. Once a particular event occurs, it searches through its database of subscribers and finds all the subscribers who are interested in this event and notifies them that an event of their interest has occured. A simple example is a system that has a group of subscribers who are interested in the stock prices of various companies. An event in this case would be that the share price of a company, say Intel, has gone above $50. The system would then use the index it has built on the subscribers to identify all and only those subscribers who are interested in this event and output their id's. We can then use these id's as necessary to notify the subscribers.

One research direction in publish-subscribe systems is to find the best indexing and event matching model so that we have the best throughput - i.e. the number of events matched by the system is substantial for a given set of subscribers whose size is non-trivial.

While our research considers the above as one of our objectives, we also wish to design a pubsub system that takes advantage of pattern of subscriptions and events that comes up with a system that performs better on average. The existing publish-subscribe systems treat the set of subscribers and events to be uniformly distributed - that is, equal number of subscribers are interested in any attribute of the system. However, the reality is that this distribution is highy skewed and dynamic(e.g. a lot more subscribers will be interested in the share price of Google right after its IPO, than any other company, and its stock price will show more activity during this period). If this property of the system could be taken advantage of, we can come up a system that performs much better and adjusts itself to changes in the subscribers' pattern of interest as well as changing event distribution, and thus stay viable all the time. Mathematically, thus, we assume that both the subscription and event distribution follows the zipfian distribution, with respect to the attributes. That is, there will be a small set of attributes in the system that majority of the subscribers will be interested in and the generally large set of remaining attributes is much less popular.

Implementation

We are using the R+ Tree, the multidimensional version of the B+ Tree, as our indexing data structure. In our system, each attribute forms a dimension; subscribers register their interest by subscribing to any subset of these dimensions, and for each dimension, they express a range of interest. Thus, abstractly, each subscriber forms a hyper rectangle in our system. We index on these rectangles using R+ Tree and use it as our backbone of the pubsub system.

Progress and future work

Currently, we have implemented the basic R+ Tree structure as well as the event matching engine on this data structure. We have tested this system and it works out reasonably well for a subscriber pool of size 500000, in a system of 32 attributes, with each subscriber subscribing to 4 attributes on average. The throughput of the system reaches above 1000 events per second, fluctuating subject to various parameters of the system. We compared it with another existing pubsub system, Le Subscribe, and found that at this scale, on average, our system performs better under many circumstances (i.e. various sets of parameters of the system).

Next we plan to enhance the system so that it can adjust to dynamic workloads, i.e. to changing event patterns. That is, we expect to improve the indexing and event matching mechanism so that the index structure detects subscribers who subscribe to popular attribute and value set and adjust itself so that the event matching time involving these subscribers is much faster. To put it simply, serving frequently matching subscribers in the system becomes faster than serving subscribers who are rarely matched by incoming events. Considering the zipfian distribution of subscribers and events in terms of the attributes they subscribe to, we expect that there are going to be many events on these popular attributes and for these events, our system is expected to perform much better thatn otherwise, and thus, on average, we expect the throughput of the system to be better. More importantly, as the attributes that are popular changes over time, we expect our index to dynamically adjust itself so that it is now faster to serve events on these new set of popular attributes.