Towards Low Overhead Provenance Tracking in Near Real-Time Stream Filtering
Nithya N. Vijayakumar, Beth Plale
DDE Lab, Indiana University
{nvijayak, plale}@cs.indiana.edu
Project Description
• Provenance collection in stream filtering systems
• Identify unique challenges posed by stream filtering systems to provenance tracking
• Low overhead data model and collection model that address these challenges
Outline
• Stream filtering systems
• Challenges posed by stream filtering systems
• Current provenance solutions applied to streams
• Proposed provenance data model
• Low overhead provenance collection model
• Calder stream processing system
• Implementation of provenance models in Calder
• Application in LEAD
• Future work
Stream filtering systems
• Data-driven systems that accept events in real time
– appropriate when data is continuously generated
– a data stream is an indefinite sequence of time-ordered events
• Filter (query, user-defined application)
– a processing unit that takes one or more event sequences as input and generates a new event sequence as output
– queries in a well-defined language or customized application code
– long running and associated with a lifetime
• Applications
– monitoring, stock ticks in financial applications, performance measurements in network monitoring and traffic management, sensor data, scientific datasets
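The filter definition above can be sketched as a lazy generator. This is only an illustration, not how any particular stream system implements filters: the `Event` fields, the `run_filter` helper, and the temperature threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator


@dataclass(frozen=True)
class Event:
    """One element of a time-ordered stream (hypothetical layout)."""
    timestamp: float  # events are assumed to arrive in timestamp order
    source: str       # identifier of the stream the event came from
    value: float


def run_filter(events: Iterable[Event],
               predicate: Callable[[Event], bool]) -> Iterator[Event]:
    """Consume an input event sequence lazily and yield the derived sequence."""
    for event in events:
        if predicate(event):
            yield event


# A finite stand-in for an indefinite stream of sensor readings.
readings = [
    Event(0.0, "B0011", 18.5),
    Event(1.0, "B0011", 31.2),
    Event(2.0, "B0011", 29.9),
    Event(3.0, "B0011", 35.0),
]

# The derived stream keeps only readings above an assumed threshold.
hot = list(run_filter(readings, lambda e: e.value > 30.0))
```

Because `run_filter` is a generator, it never needs the whole input in memory, which matches the indefinite, non-persistent nature of the streams described here.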
Challenges posed by stream filtering systems
• Identifying provenance entities
– atomic unit? event / stream / source
• Capturing stream filtering conditions with low overhead
– distributed environment
– environmental and configuration changes
• Maintaining relevance with non-persistent data
– trace back the source of events long after they were derived
• Dynamic accuracy estimation
– quality of service guarantees for derived streams
– provenance across streams
– deduce accuracy of derived streams
Current provenance solutions applied to streams: What is the challenge?
• Representing provenance for stream entities using Virtual Data Grid system
– indefinite sequence of time-ordered datasets
– non-persistent data events
– need accountability more than reproducibility
• Provenance collection using PASOA or Karma
– provenance to be collected for each stream and for the filters executed on streams
– communication between components of the stream filtering system is less important than the entities themselves
Current provenance solutions applied to streams (contd…)
• Logging environmental conditions using Log4j
– non-trivial load on the service
– aggregating provenance traces difficult
• Augmenting accuracy and lineage using Trio
– lineage cannot be associated with datasets
– need to trace the accuracy of a set of events long after the stream is generated
Provenance data model: What to track?
• Atomic units
– streams generated outside the system (base streams)
– declarative queries or application code that executes continuously (adaptive filters)
– streams generated by executing adaptive filters on base and derived streams (derived streams)
Provenance data model: How to store it?
• Provenance stack
– base provenance information and a list of changes
– the latest information is identified by its timestamp and is current from that point onwards
• Provenance tree
– a derived stream refers to the provenance of its input streams (base and derived) + adaptive filters
– provenance can refer to annotations outside the system (SAM)
• Store the provenance history (compressed or uncompressed) of streams and filters
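One way to picture the stack and the tree together is the sketch below. It is a minimal illustration under stated assumptions, not the implementation behind these slides: the class names `ProvenanceStack` and `ProvenanceNode`, the numeric timestamps, and the choice to make D0005 a leaf are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ProvenanceStack:
    """Base provenance plus a time-ordered list of changes."""
    base: Dict[str, str]
    changes: List[Tuple[float, Dict[str, str]]] = field(default_factory=list)

    def push(self, timestamp: float, change: Dict[str, str]) -> None:
        # Assumption: changes are logged in non-decreasing timestamp order.
        if self.changes and timestamp < self.changes[-1][0]:
            raise ValueError("changes must be pushed in timestamp order")
        self.changes.append((timestamp, change))

    def current_at(self, timestamp: float) -> Dict[str, str]:
        """Replay the base plus every change stamped at or before `timestamp`."""
        state = dict(self.base)
        for ts, change in self.changes:
            if ts > timestamp:
                break
            state.update(change)
        return state


@dataclass
class ProvenanceNode:
    """A base stream, derived stream, or adaptive filter in the tree."""
    entity_id: str
    kind: str  # "base", "derived", or "filter"
    stack: ProvenanceStack
    inputs: List["ProvenanceNode"] = field(default_factory=list)

    def lineage(self) -> List[str]:
        """Depth-first list of this entity and everything it depends on."""
        seen: List[str] = []

        def visit(node: "ProvenanceNode") -> None:
            if node.entity_id in seen:
                return
            seen.append(node.entity_id)
            for child in node.inputs:
                visit(child)

        visit(self)
        return seen


# Hypothetical entities; D0005 is treated as a leaf only to keep this short.
b0011 = ProvenanceNode("B0011", "base", ProvenanceStack({"owner": "foo"}))
d0005 = ProvenanceNode("D0005", "derived", ProvenanceStack({"owner": "foo"}))
q0099 = ProvenanceNode("Q0099", "filter", ProvenanceStack({"owner": "foo"}))
d0010 = ProvenanceNode("D0010", "derived", ProvenanceStack({"owner": "foo"}),
                       inputs=[b0011, d0005, q0099])

# Log one change: from t=100 onwards the derived stream is approximate.
d0010.stack.push(100.0, {"accuracy": "0.85"})
before = d0010.stack.current_at(50.0)
after = d0010.stack.current_at(150.0)
```

Storing only the base record and the deltas keeps each write small; the cost is a replay on read, which is acceptable if provenance is queried far less often than it is updated.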
Low overhead provenance collection model
• Base provenance
– collected from the user when registering a stream/filter
– document the available information (inputs, filters, rate, sources, etc.)
– store system and user-defined metadata as name-value pairs in base provenance information
– base provenance can be updated by the user
• Dynamic provenance
– subset of a stream identified by a starting timestamp and an ending timestamp
– changes logged with a starting timestamp and current from then on
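The two collection paths above can be sketched as a small in-memory registry. Everything here is an assumption made for illustration: the `ProvenanceRegistry` class, its method names, and the numeric timestamps do not come from the slides.

```python
from bisect import bisect_right
from typing import Dict, List, Tuple


class ProvenanceRegistry:
    """Hypothetical in-memory store for base and dynamic provenance."""

    def __init__(self) -> None:
        self._base: Dict[str, Dict[str, str]] = {}
        self._changes: Dict[str, List[Tuple[float, str]]] = {}

    def register(self, entity_id: str, metadata: Dict[str, str]) -> None:
        """Base provenance: name-value pairs supplied at registration."""
        self._base[entity_id] = dict(metadata)
        self._changes[entity_id] = []

    def update_base(self, entity_id: str, name: str, value: str) -> None:
        """The user may revise base provenance later."""
        self._base[entity_id][name] = value

    def base(self, entity_id: str) -> Dict[str, str]:
        return dict(self._base[entity_id])

    def log_change(self, entity_id: str, timestamp: float,
                   description: str) -> None:
        """Dynamic provenance: a change that is current from `timestamp` on."""
        changes = self._changes[entity_id]
        stamps = [ts for ts, _ in changes]
        changes.insert(bisect_right(stamps, timestamp),
                       (timestamp, description))

    def dynamic_provenance(self, entity_id: str, start: float,
                           end: float) -> List[Tuple[float, str]]:
        """Changes that apply to the stream subset [start, end].

        That is the latest change at or before `start` (still in effect)
        plus every change logged inside the interval.
        """
        changes = self._changes[entity_id]
        stamps = [ts for ts, _ in changes]
        first = max(bisect_right(stamps, start) - 1, 0)
        last = bisect_right(stamps, end)
        return changes[first:last]


registry = ProvenanceRegistry()
registry.register("D0010", {"owner": "foo", "rate": "1/s"})
registry.log_change("D0010", 10.0, "filter deployed")
registry.log_change("D0010", 20.0, "B0011 down, sampling enabled")
registry.log_change("D0010", 30.0, "B0011 restored")

subset = registry.dynamic_provenance("D0010", 25.0, 35.0)
```

Nothing is written per event: the cost of collection scales with the number of configuration changes, not with the event rate, which is the point of a low overhead model.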
A simple example
<derivedstream>
  <name>Temperature Feed</name>
  <uniqueID>D0010</uniqueID> <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
  <systemmetadata>
    <name> owner </name> <value> foo </value>
    <name> permissions </name> <value> open to everyone </value>
  </systemmetadata>
  <starttime> <timestamp> 13:00:00 Feb-10-2006 </timestamp> </starttime>
  <changelog>
    <event>
      <timestamp> 13:34:56 Feb-10-2006 </timestamp>
      <description> B0011 down </description>
      <approximation> Sampling </approximation>
      <accuracy> 0.85 </accuracy>
    </event>
  </changelog>
</derivedstream>
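A record like this can be read with Python's standard `xml.etree.ElementTree` module. The sketch below embeds the example record as a string and pulls out the fields; the variable names and the pairing of consecutive name and value elements are assumptions about how the record is meant to be interpreted.

```python
import xml.etree.ElementTree as ET

RECORD = """
<derivedstream>
  <name>Temperature Feed</name>
  <uniqueID>D0010</uniqueID> <queryID>Q0099</queryID>
  <inputstreams>
    <streamID>B0011</streamID>
    <streamID>D0005</streamID>
  </inputstreams>
  <systemmetadata>
    <name> owner </name> <value> foo </value>
    <name> permissions </name> <value> open to everyone </value>
  </systemmetadata>
  <starttime> <timestamp> 13:00:00 Feb-10-2006 </timestamp> </starttime>
  <changelog>
    <event>
      <timestamp> 13:34:56 Feb-10-2006 </timestamp>
      <description> B0011 down </description>
      <approximation> Sampling </approximation>
      <accuracy> 0.85 </accuracy>
    </event>
  </changelog>
</derivedstream>
"""

root = ET.fromstring(RECORD.strip())

stream_id = root.findtext("uniqueID")
query_id = root.findtext("queryID")
inputs = [e.text.strip() for e in root.findall("inputstreams/streamID")]

# Assumption: each name element is paired with the value that follows it.
meta = root.find("systemmetadata")
names = [n.text.strip() for n in meta.findall("name")]
values = [v.text.strip() for v in meta.findall("value")]
metadata = dict(zip(names, values))

# Each changelog event is current from its timestamp onwards.
changes = [
    {
        "timestamp": ev.findtext("timestamp").strip(),
        "description": ev.findtext("description").strip(),
        "approximation": ev.findtext("approximation").strip(),
        "accuracy": float(ev.findtext("accuracy").strip()),
    }
    for ev in root.findall("changelog/event")
]
```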
Calder stream processing system
• Distributed processing of streams
• Service-oriented access to data streams
• SQL-based rule-action support
• Extends OGSA-DAI v6 GDS to streaming resources
• Synchronous and asynchronous data delivery
[Figure: Calder Query Execution. Diagram labels: Data Streams; Data Management Subsystem (Stream Grid Data Service, Query Planning Service, Stream Rowset Service, Provenance Service, Monitoring Service); Users/Application (Queries/Requests, Result data); Pub-sub system; Computation Node Running Query Processing Engine]
Provenance collection in Calder
[Figure: provenance collection in Calder. Diagram labels: Query Planner Service (query execution plan updates); Monitoring Service (monitoring updates); Provenance Service (subscribes to receive events of interest; provenance queries/updates; provenance results); XML Database; Provenance Propagation; Monitoring Updates; Computation nodes]
Application in LEAD
• Radar metadata is sent through the pub-sub system
• User submits a filter query
• Calder executes the filter query on incoming data streams
• Filtered datasets are processed using data mining algorithms (MDA & ADaM)
• Triggers (WS-Notifications) are sent to workflows that invoke forecast models
• Provenance tracking will help in understanding why and when a trigger was sent
Future work
• Complex Event Processing
– processing multiple streams
– identifying global behavior
• Context Management
– informative search based on past usage
– predicting system characteristics
– managing profiles for users and dynamic system configuration
Thank you
Questions and Feedback Welcome!
Nithya Vijayakumar