20060426-datalines-haberman


DataLines
a framework for building streaming data applications
Mike Haberman
Senior Software/Network Engineer
[email protected]
The Problem
• Data deluge: routers, switches, IDS,
servers (web, mail, logs, etc), software
(tcpdump, web100, SNMP, tarpit, etc),
sensors, taps, …
(help me)
?
The Problem (continued)
• Disparate data formats
• Software (sometimes) to manage each
• Tweaking to get what you want (custom
software)
• Correlating data (more custom software)
DataLines
• Can we build a framework that can
remove all (most) of the tedium of
working with all these disparate data
formats?
DataLines Framework
• designed to manage and build
streaming data processing applications
• Manage: would like one tool to handle all
these different data sources.
• Build: uniform way of creating a data
processing application.
• Streaming data:
– Never-ending stream of ‘manageable’ chunks of data
– No random access, no blocking operators
– One look, linear or sub-linear algorithms/data ops
– Each data item (a tuple in DataLines) is an independent entity
• Many tools were not designed for streaming data
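To make the "one look" constraint concrete, here is a minimal Python sketch (not DataLines code) that keeps a running mean and variance over a stream using Welford's algorithm. The names `RunningStats` and `update` are invented for illustration. Each item is examined exactly once and memory use stays constant however long the stream runs.

```python
class RunningStats:
    """One-pass mean/variance (Welford's algorithm) with O(1) memory."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        # Fold one new value into the summary; the value is not stored.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        # Population variance; reported as 0.0 before any data arrives.
        return self._m2 / self.count if self.count else 0.0


stats = RunningStats()
for packet_size in (40, 1500, 576, 40):  # stand-in for a never-ending stream
    stats.update(packet_size)
```

An operator that needs to sort or re-read the whole input would not fit this pattern, which is what "no blocking operators" rules out.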
• Processing:
– Something you want to do to the data (e.g. reading, writing, parsing, event generation, filtering, statistics, reports, data synopsis, …)
DataLines
• Creating a DataLines application: XML → “compile” → DataLines Application
DataLines
• XML file defines 3 major components:
– Data Processors
• What one does with the data
– Processing Order
• The order in which the processors will operate
on the data
– Event Management
• What to do when a processor generates an
event
DataLines Processors
• Data Processors are the heart of DataLines
– I/O: socket, file
– Filters: inline, dispatch
– Collectors: binning, windowing (w/operators)
– GUI: charts, picture taking
– Converters: binary to tuple
– Misc: printers, counters, iterators, timers, data generators, gates, delays
• Processors can generate events
• Processors can drop, mutate, mutilate the
tuple being processed, generate new tuples
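The three things a processor can do to a tuple (drop it, change it, or emit new ones) can be sketched in Python as follows. This is an illustration only: the class names, the `process` method, and the use of plain dicts as tuples are assumptions, not the DataLines API.

```python
class MinLengthFilter:
    """Filter: drops tuples whose 'length' field is below a threshold."""

    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, tup):
        # An empty list means the tuple was dropped.
        return [tup] if tup["length"] >= self.min_len else []


class AddKilobytes:
    """Mutator: returns a modified copy with a derived 'kb' field."""

    def process(self, tup):
        return [{**tup, "kb": tup["length"] / 1024}]


class SplitEndpoints:
    """Generator: replaces one flow tuple with two endpoint tuples."""

    def process(self, tup):
        return [{"ip": tup["src"], "role": "src"},
                {"ip": tup["dst"], "role": "dst"}]


def run(processors, tup):
    """Feed one tuple through the processors in order, fanning out as needed."""
    batch = [tup]
    for proc in processors:
        batch = [out for item in batch for out in proc.process(item)]
    return batch


flow = {"src": "10.0.0.1", "dst": "10.0.0.2", "length": 2048}
kept = run([MinLengthFilter(100), AddKilobytes()], flow)
dropped = run([MinLengthFilter(4096), AddKilobytes()], flow)
endpoints = run([SplitEndpoints()], flow)
```

Returning a list from every processor keeps the drop, mutate, and generate cases uniform: zero, one, or many outputs.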
DataLines Pipelines
• Control tuple movement among
processors
• Can connect either processors or other
pipelines
• Two paths within a pipeline: binary and
tuple
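A rough Python sketch of how a pipeline might route tuples between named processors is shown here. It assumes each processor is a callable returning a list of output tuples and that the routes form no cycles. The `Pipeline` class and `push` method are hypothetical, and the separate binary and tuple paths are not modeled.

```python
from collections import defaultdict


class Pipeline:
    """Hypothetical router: forwards each processor's output downstream."""

    def __init__(self, processors, pipes):
        self.processors = processors      # name -> callable(item) -> list
        self.routes = defaultdict(list)   # name -> downstream names
        for src, dst in pipes:
            self.routes[src].append(dst)

    def push(self, name, item):
        # Run the item through processor `name`, then hand every output
        # to each processor connected downstream of it.
        for out in self.processors[name](item):
            for dst in self.routes[name]:
                self.push(dst, out)


seen = []


def parser(line):
    # Converts a raw line into one tuple per word.
    return [{"word": w} for w in line.split()]


def printer(tup):
    # Terminal processor: records the tuple and emits nothing further.
    seen.append(tup)
    return []


p1 = Pipeline({"parser": parser, "printer": printer},
              pipes=[("parser", "printer")])
p1.push("parser", "hello world")
```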
Event Management
• Allow processors to signal an event
– timers, open/close, client connects, etc
• Allow the user to tie in domain logic
• Allow the user to call a processor-specific API
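As an illustration of the idea, the following Python sketch binds an event (optionally tied to the processor that raised it) to a method call on a target processor. `EventManager`, `bind`, `fire`, and the `Stub` processor are invented names for this sketch, not DataLines classes.

```python
class Stub:
    """Stand-in processor exposing start/stop methods."""

    def __init__(self):
        self.running = False

    def start(self):
        self.running = True

    def stop(self):
        self.running = False


class EventManager:
    """Maps (event name, source processor) to method calls on targets."""

    def __init__(self, processors):
        self.processors = processors  # name -> processor object
        self.bindings = {}            # (event, source) -> [(method, target)]

    def bind(self, event, source, method, target):
        self.bindings.setdefault((event, source), []).append((method, target))

    def fire(self, event, source=None):
        # Look up the calls registered for this event and invoke each one.
        for method, target in self.bindings.get((event, source), []):
            getattr(self.processors[target], method)()


procs = {"reader": Stub(), "parser": Stub()}
events = EventManager(procs)
events.bind("start", None, "start", "reader")
events.bind("alert", "reader", "stop", "parser")

procs["parser"].start()
events.fire("start")                   # starts the reader
events.fire("alert", source="reader")  # reader's alert stops the parser
```

Domain logic would plug in the same way: a handler is just another callable registered against an event.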
DataLines Data
• The generalization of data is a DlTuple
• Tuple is just a set of values
• DlTuple is the interface processors use
– String[] <-- getFieldNames()
– DlValue <-- getValue(fieldname)
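A small Python analogue of this interface is sketched below. Only the two accessors come from the slide; the Python method names, the dict-backed `DictTuple` class, and the `describe` helper are assumptions made for illustration.

```python
class DictTuple:
    """Hypothetical tuple backed by a dict, exposing the two accessors."""

    def __init__(self, fields):
        self._fields = dict(fields)

    def get_field_names(self):
        return list(self._fields)

    def get_value(self, field_name):
        return self._fields[field_name]


def describe(tup):
    """A processor relies only on the accessors, not on the concrete class."""
    return ", ".join(f"{name}={tup.get_value(name)}"
                     for name in tup.get_field_names())


summary = describe(DictTuple({"Src Ip": "10.0.0.1", "length": 4}))
```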
DataLines Data
• Tuples can have virtual fields
– calculated values, static values
• Tuples can have composite fields
• The creation of the tuple is left to the
processor in charge of conversion
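One possible reading of virtual and composite fields, sketched in Python: a virtual field is computed on access instead of being stored, and a composite field holds sub-fields. The `VirtualTuple` class and all field names here are hypothetical.

```python
class VirtualTuple:
    """Tuple whose virtual fields are computed when they are read."""

    def __init__(self, stored, virtual):
        self._stored = dict(stored)
        self._virtual = dict(virtual)  # name -> function(tuple) -> value

    def get_field_names(self):
        return list(self._stored) + list(self._virtual)

    def get_value(self, name):
        if name in self._virtual:
            return self._virtual[name](self)
        return self._stored[name]


flow = VirtualTuple(
    stored={
        "bytes": 3000,
        "packets": 4,
        "src": {"ip": "10.0.0.1", "port": 80},  # composite field
    },
    virtual={
        # Calculated value derived from two stored fields.
        "avg_size": lambda t: t.get_value("bytes") / t.get_value("packets"),
        # Static value that is the same for every tuple.
        "sensor": lambda t: "tap-1",
    },
)
```

Downstream processors see `avg_size` and `sensor` through the same accessors as stored fields, so they need not know which kind they are reading.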
XML Syntax … run away!
<application>
  <dataline name="dl">
    <processor name="reader" type="FileReader">
      <configInfo>
      </configInfo>
    </processor>
    <pipeline name="p1">
      <pipe from="reader" to="parser" />
      <pipe from="parser" to="printer" />
    </pipeline>
    <eventManagement>
      <event name="start">
        <call method="start" target="reader" />
      </event>
      <event name="alert" from="reader">
        <call method="stop" target="parser" />
      </event>
    </eventManagement>
  </dataline>
</application>
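To show how the three sections of such a file map onto data structures, here is a Python sketch that reads the configuration with the standard `xml.etree.ElementTree` module. It is not the DataLines compiler, only a guess at what a first parsing step could look like. The copy of the configuration in the string uses plain ASCII quotes and a closing `</dataline>` tag so that a standard XML parser accepts it.

```python
import xml.etree.ElementTree as ET

CONFIG = """
<application>
  <dataline name="dl">
    <processor name="reader" type="FileReader"><configInfo/></processor>
    <pipeline name="p1">
      <pipe from="reader" to="parser"/>
      <pipe from="parser" to="printer"/>
    </pipeline>
    <eventManagement>
      <event name="start"><call method="start" target="reader"/></event>
      <event name="alert" from="reader">
        <call method="stop" target="parser"/>
      </event>
    </eventManagement>
  </dataline>
</application>
"""

root = ET.fromstring(CONFIG)

# Data processors: name -> type
processors = {p.get("name"): p.get("type") for p in root.iter("processor")}

# Processing order: list of (from, to) connections
pipes = [(p.get("from"), p.get("to")) for p in root.iter("pipe")]

# Event management: (event name, source or None) -> [(method, target)]
events = {
    (e.get("name"), e.get("from")):
        [(c.get("method"), c.get("target")) for c in e.iter("call")]
    for e in root.iter("event")
}
```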
Data Example
<arg name="tupleField">
  <map name="name" value="Src Ip" />
  <map name="peer" value="IpV4AddressPeer" />
  <map name="length" value="4" />
</arg>
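What a 4-byte IPv4 field declaration like this amounts to can be sketched in Python with the standard `ipaddress` module. The `decode_ipv4` helper and the sample bytes are made up for illustration, and how `IpV4AddressPeer` actually works is not shown in the slides.

```python
import ipaddress


def decode_ipv4(raw, offset=0, length=4):
    """Interpret `length` bytes at `offset` as a packed IPv4 address."""
    return ipaddress.IPv4Address(raw[offset:offset + length])


record = bytes([192, 168, 1, 10, 0, 80])  # 4 address bytes, then other data
tup = {"Src Ip": decode_ipv4(record)}
```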
Data Example
<arg name="tupleField">
  <map name="name" value="A" />
  <map name="peer" value="IntegerPeer" />
  <map name="length" value="4" />
</arg>
<arg name="tupleField">
  <map name="name" value="B" />
  <map name="peer" value="IntegerPeer" />
  <map name="length" value="4" />
</arg>
<arg name="tupleField">
  <map name="name" value="C" />
  <map name="peer" value="JepPeer" />
  <data name="expression">
    ${A} + ${B}
  </data>
</arg>
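The same layout can be mimicked in Python: two 4-byte integers are decoded from the binary record, and the third field is calculated from an expression string. This is a sketch under assumptions. Network byte order is assumed, the `evaluate` helper is hypothetical, and it supports only basic arithmetic instead of reproducing whatever expression engine `JepPeer` uses.

```python
import ast
import operator
import struct
from string import Template

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}


def evaluate(expression, fields):
    """Substitute ${name} placeholders, then evaluate + - * / on numbers."""
    text = Template(expression).substitute(fields).strip()

    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")

    return walk(ast.parse(text, mode="eval").body)


raw = struct.pack("!ii", 7, 35)    # sample record: two 4-byte integers
a, b = struct.unpack("!ii", raw)   # assumed byte order: big-endian
fields = {"A": a, "B": b}
fields["C"] = evaluate("${A} + ${B}", fields)  # the virtual field
```

Walking the parsed expression, as opposed to calling `eval`, keeps the sketch from executing arbitrary code found in a configuration file.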
DataLines Tutorial
• Fast forward past a painful 3-hour
tutorial covering each of those sections
in detail (tuples, processors, pipelines,
event management, configurations)
• You have seen all the XML though!
DataLines Distilled
• A library of data processors that operate
on “Tuples”
– one of the processors takes the raw data and creates the tuple
• An XML compiler that takes the XML file
and the library and creates an application
DataLines Example
DataLines in use
• DataLines does make it easier to hit the
ground running. Much of the tedious
work you need to do is taken care of.
• For highly specific needs, you still need
to write code. But that code then
becomes part of the DataLines library,
which others can build on.
Balance Sheet
• Positive
– Flexible (vendor neutral, data, debugging)
– Reusable (pipelines, processors)
– Fast development time
– “Easy” to change the client (CLI, desktop, web page)
• Negative
– May need to write domain-specific code
– Learning curve: processor config, data expectations, events
DataLines in Action
• Network Engineering group
– Monitor router, tar pit, IDS, packet
sampling, L2/L3 mappings
• Security Group
– Network forensics
• Intergroup Wiring
– Use DataLines to share data between groups/projects
DataLines in Action
• Network Research group
– Monitor cluster network activity from MPI
layer
– Data Mining
– Misc. NSF data oriented projects
Future
• Open Source
• More Info: [email protected]
• http://datalines.ncsa.uiuc.edu
(a work in progress)