Exploratory Data Processing [PUMPKIN]

PUMPKIN is a framework for building a distributed Data Transformation Network (DTN). It implements a protocol for distributed data processing. A data packet is a self-contained processing unit that incorporates data, state, code, and routing information. An automaton models how the packet state is transformed from one node to the next, and the automaton graph doubles as routing information for the packet. A decentralized distributed system also takes care of routing the data packets.

Data partitionability, processing complexity and locality play a crucial role in the effectiveness of distributed systems. Through virtualization, resources have become scattered, heterogeneous, and dynamic in performance and networking. Collective and collaborative use of these resources for data processing is our main challenge.

In eScience, coordinating multiple tasks for running in-silico experiments is often the realm of Scientific Workflow Management Systems (SWMS). These are often centralized systems that operate within a confined set of resources.

A common denominator in most workflow systems is that the unit of reason is the process; that is, the abstract workflow describes a topology of tasks configured in a certain way. This topology is often tailored to the underlying infrastructure. The process ordering is thus a description of how best to exploit resources, and not necessarily a description of the data processing itself.

The complexity and dynamism of big data processing call for a new unit of reason: the data itself. An abstract model for data processing should describe only the data transformations, independently of the underlying resources.

Automata as a Data Processing Schema

An automaton is an intuitive way to describe data processing as a transformation from one state to the next. The data transformation model can be expressed as a nondeterministic finite automaton (NFA), defined by a 5-tuple:

(Q, Σ, δ, q0, F)

Q is the set of states the data object can be in. Σ is the set of functions that perform the data transformations. δ is the transition function that maps a state and a function to a set of possible new states, δ: Q × Σ → P(Q), where P(Q) is the power set of Q. q0 is the starting data state. F is the set of final data states, which mark the completion of processing.
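The 5-tuple can be written down directly as data. The sketch below is only an illustration: the state names (raw, tokenized, filtered, done) and function names (tokenize, filter, store) are hypothetical and do not come from PUMPKIN itself. It shows δ as a dictionary from (state, function) pairs to sets of states, and checks whether a sequence of functions can bring a data object from q0 to a final state.

```python
from typing import Dict, Iterable, Set, Tuple

# Hypothetical example automaton; all names are illustrative assumptions.
Q: Set[str] = {"raw", "tokenized", "filtered", "done"}
SIGMA: Set[str] = {"tokenize", "filter", "store"}
DELTA: Dict[Tuple[str, str], Set[str]] = {
    ("raw", "tokenize"): {"tokenized"},
    # Nondeterministic: filtering may yield a filtered object or finish outright.
    ("tokenized", "filter"): {"filtered", "done"},
    ("filtered", "store"): {"done"},
}
Q0: str = "raw"
F: Set[str] = {"done"}


def step(states: Set[str], function: str) -> Set[str]:
    """Apply delta to every current state and take the union of the results."""
    next_states: Set[str] = set()
    for q in states:
        next_states |= DELTA.get((q, function), set())
    return next_states


def accepts(functions: Iterable[str]) -> bool:
    """Return True if applying the functions in order can reach a final state."""
    current = {Q0}
    for f in functions:
        current = step(current, f)
    return bool(current & F)
```

Because δ returns a set, a single function application may leave the data object in several candidate states, which is what makes the model an NFA rather than a DFA.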

Distributed Data Processing as a Protocol

The automaton data model describes data processing in the abstract. The same model is used to build a distributed processing infrastructure around the data processing schema. The schema represents the knowledge of how data can be processed; at the network and resource level, this knowledge acts as a data routing table.
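One way to picture the schema acting as a routing table is to combine the transition function with a record of which nodes host which functions. The following is a minimal sketch under that assumption; the placement mapping, the node names, and the routing_table helper are hypothetical and are not part of the PUMPKIN API.

```python
from typing import Dict, List, Set, Tuple

Delta = Dict[Tuple[str, str], Set[str]]


def routing_table(delta: Delta, placement: Dict[str, List[str]]) -> Dict[str, Set[str]]:
    """Map each data state to the nodes hosting a function that accepts that state.

    delta:     (state, function) -> set of next states (the schema)
    placement: function name -> names of nodes hosting it (assumed to be known)
    """
    table: Dict[str, Set[str]] = {}
    for state, function in delta:
        for node in placement.get(function, []):
            table.setdefault(state, set()).add(node)
    return table


# Hypothetical schema and deployment, for illustration only.
example_delta: Delta = {
    ("raw", "tokenize"): {"tokenized"},
    ("tokenized", "filter"): {"filtered"},
}
example_placement = {"tokenize": ["node-a", "node-b"], "filter": ["node-c"]}
example_table = routing_table(example_delta, example_placement)
```

Under these assumptions, a packet in state raw could be sent to node-a or node-b, and a packet in state tokenized to node-c.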

Globally distributed resources are combined in the PUMPKIN framework through a data processing protocol. Data is partitioned into packets, and a packet is the atomic unit of data processing. Each data packet can encapsulate the automaton as part of its header. The automaton header makes the packet self-aware of where it has to go for processing. The data packet can also contain the code for processing it.

Data Packet = Data + Automaton + Code + State
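The composition above could be modelled as a small record type. This is a sketch only: the field names, the choice of a dictionary for the automaton, and the next_functions helper are assumptions made for illustration, not the actual PUMPKIN packet format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Transitions = Dict[Tuple[str, str], Set[str]]


@dataclass
class Packet:
    """Hypothetical packet layout: data + automaton + code + state."""
    data: bytes                 # the payload to be transformed
    automaton: Transitions      # (state, function) -> possible next states
    state: str                  # current data state
    code: Dict[str, str] = field(default_factory=dict)  # function name -> source text

    def next_functions(self) -> List[str]:
        """Functions that accept the packet in its current state."""
        return sorted(f for (q, f) in self.automaton if q == self.state)


# Illustrative packet with made-up state and function names.
example_packet = Packet(
    data=b"some text",
    automaton={("raw", "tokenize"): {"tokenized"}, ("tokenized", "filter"): {"filtered"}},
    state="raw",
)
```

With the automaton in the header, a node only needs the packet itself to work out which functions may process it next.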

The processing granularity is the data packet. This allows control at the packet level, such as scaling: a packet source load-balances data packets across multiple identical replicas of a data processing function. Replication is also at the packet level: a data packet can be replicated to multiple functions that request the same data state.
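Packet-level load balancing and replication can be sketched in a few lines. The text does not say which balancing policy PUMPKIN uses, so round-robin is assumed here purely for illustration; the function names and the dictionary-based packet are likewise hypothetical.

```python
from itertools import cycle
from typing import Dict, List


def load_balance(packets: List[str], replicas: List[str]) -> Dict[str, List[str]]:
    """Spread packets over identical function replicas (assumed round-robin policy)."""
    assignment: Dict[str, List[str]] = {replica: [] for replica in replicas}
    for packet, replica in zip(packets, cycle(replicas)):
        assignment[replica].append(packet)
    return assignment


def replicate(packet: dict, subscribers: List[str]) -> Dict[str, dict]:
    """Give every function that requests this data state its own copy of the packet."""
    return {subscriber: dict(packet) for subscriber in subscribers}


# Illustrative usage with made-up packet and replica names.
balanced = load_balance(["p0", "p1", "p2"], ["replica-a", "replica-b"])
copies = replicate({"state": "filtered", "data": "x"}, ["store", "index"])
```

Load balancing sends each packet to exactly one replica of the same function, while replication sends copies of one packet to different functions that consume the same state.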

Data processing functions are hosted on nodes. Functions can be deployed statically or delivered through the data packet, since the packet can also carry code. Each node has a twofold task: it processes data and it routes data.
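The difference between static deployment and deployment through the packet can be illustrated with a node-local function registry. This is a sketch under assumptions: the registry, the deploy_from_packet helper, and the idea that code travels as Python source text are all hypothetical, and the source does not describe how PUMPKIN actually ships or isolates code.

```python
from typing import Callable, Dict

Registry = Dict[str, Callable]


def deploy_from_packet(registry: Registry, name: str, source: str) -> Callable:
    """Install a function carried by a packet into the node's registry.

    Illustration only: exec runs arbitrary code, so a real system would need
    sandboxing or signing before running code received over the network.
    """
    namespace: dict = {}
    exec(source, namespace)
    registry[name] = namespace[name]
    return registry[name]


# Static deployment: the function is present before any packet arrives.
registry: Registry = {"strip": str.strip}

# Dynamic deployment: the (hypothetical) packet carries the source of "upper".
carried_source = "def upper(data):\n    return data.upper()\n"
upper = deploy_from_packet(registry, "upper", carried_source)
```

After deployment, both functions are looked up the same way, so the processing path does not need to know how a function reached the node.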

Each node in PUMPKIN discovers routes to other nodes. A routing table allows a node to send data packets to the next node in a peer-to-peer (P2P) fashion. In software-defined networks (SDNs), the routing table can also be used to reconfigure the network.
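The two roles of a node, processing and routing, can be sketched together. The Node class below is a hypothetical, in-memory illustration: real discovery would happen over the network, and the class, method names, and packet layout are assumptions rather than PUMPKIN's implementation.

```python
from typing import Callable, Dict, List, Tuple


class Node:
    """Hypothetical node that hosts functions and keeps a routing table."""

    def __init__(self, name: str, functions: Dict[str, Tuple[Callable, str]]):
        self.name = name
        # input state -> (function to apply, resulting state)
        self.functions = functions
        # data state -> names of peers hosting a function for that state
        self.routes: Dict[str, List[str]] = {}

    def discover(self, peers: List["Node"]) -> None:
        """Record which peers can process which data states."""
        for peer in peers:
            for state in peer.functions:
                self.routes.setdefault(state, []).append(peer.name)

    def handle(self, packet: dict) -> Tuple[dict, List[str]]:
        """Process the packet, then look up where its new state can go next."""
        function, out_state = self.functions[packet["state"]]
        processed = {"data": function(packet["data"]), "state": out_state}
        return processed, self.routes.get(out_state, [])


# Illustrative two-node pipeline with made-up functions.
tokenizer = Node("n1", {"raw": (str.split, "tokenized")})
hashtags = Node("n2", {"tokenized": (lambda words: [w for w in words if w.startswith("#")], "filtered")})
tokenizer.discover([hashtags])

packet, next_hops = tokenizer.handle({"data": "a #b", "state": "raw"})
final_packet, final_hops = hashtags.handle(packet)
```

In this sketch the routing decision depends only on the packet's new data state, which mirrors the idea that the automaton doubles as routing information.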

PUMPKIN in Action

Figure (four panels). Top left: data automaton for a bio-med application. Top right: data automaton for a Twitter filtering application. Bottom left: network connections between bio-med VMs on a private cloud. Bottom right: network connections between Twitter VMs from various providers, including Amazon (US, Europe), the VPH-Share infrastructure, a private cloud, Docker, and a PC.
