Stream Processing in Emerging Distributed Applications
Course Project Ideas
Yanlei Diao
University of Massachusetts Amherst
New Directions for DB Research
Sensor data: new architecture
XML: new data model
Streams: new execution model
Data quality and lineage: new services
…
Yanlei Diao, University of Massachusetts Amherst
7/20/2015
Querying in Sensor Networks
• Store data locally at sensors and push queries into the sensor network
– Flash memory energy efficiency.
– Limited capabilities of sensor platforms.
[Figure: queries arrive from the Internet through a gateway and are pushed to sensors; sensors keep acoustic and image streams in flash memory.]
Optimize for Flash and Limited RAM
• Flash Memory Constraints
– Data cannot be over-written, only erased
– Pages can often only be erased in blocks (16-64 KB)
– Unlike magnetic disks, cannot modify in-place
• Challenges:
– Energy: Organize data on flash to minimize read/write/erase operations
– Memory: Minimize use of memory for flash database.
[Figure: memory (~4-10 KB) and a flash erase block (~16-64 KB). 1. Load block into memory; 2. Modify in-memory; 3. Save block back.]
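The load / modify / save cycle on this slide can be sketched as follows. This is a toy model under stated assumptions, not StonesDB code: the `FlashBlock` class, its method names, and the rule that programming may only clear bits are illustrative choices made here.

```python
class FlashBlock:
    """Toy model of one flash erase block (hypothetical, for illustration)."""

    def __init__(self, size):
        self.size = size
        self.data = bytearray(b"\xff" * size)  # erased flash reads as all 1s
        self.erase_count = 0

    def erase(self):
        """Erase the whole block; this is the only way to reset bits to 1."""
        self.data = bytearray(b"\xff" * self.size)
        self.erase_count += 1

    def program(self, offset, payload):
        """Write bytes; programming can clear bits (1 -> 0) but not set them."""
        for i, byte in enumerate(payload):
            if self.data[offset + i] & byte != byte:
                raise ValueError("overwrite needs an erase first")
            self.data[offset + i] = byte


def update_in_block(block, offset, payload):
    """Read-modify-write: load block, modify in memory, erase, save back."""
    buffer = bytearray(block.data)                   # 1. load block into memory
    buffer[offset:offset + len(payload)] = payload   # 2. modify in memory
    block.erase()                                    # erase the whole block
    block.program(0, bytes(buffer))                  # 3. save block back
```

Every in-place update costs one full block erase in this model, which is why the slide asks for data layouts that minimize read/write/erase operations.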
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
Proxy Cache of Image Summaries
• Identify “best” sensors to forward query.
• Provide hints to reduce search complexity at sensor.
StonesDB: System Operation
Image Retrieval: Return images taken last month with at least two birds, one of which is a bird of type A.
Query Engine
Partitioned Access Methods
Research Issues in StonesDB
• Local Database Layer
– Reduce updates for indexing and aging.
– New cost models for self-tuning sensor databases.
– Energy-optimized query processing.
– Query processing over aged data.
• Distributed Database Layer
– What summaries are relevant to queries?
– What remainder queries to send to sensors?
– What resolution of summaries to cache?
XML (Extensible Markup Language)
<bibliography>
<book> <title> Foundations… </title>
<author> Abiteboul </author>
<author> Hull </author>
<author> Vianu </author>
<publisher> Addison Wesley </publisher>
<year> 1995 </year>
</book>
…
</bibliography>
XML: a tagging mechanism to describe content.
XML Data Model (Graph)
[Figure: XML data graph with nodes db (#0), b1, b2, mkp, and #1-#7; edges labeled book, publisher, pub, title, author, name, state; pcdata leaves "Complete...", "Chamberlin", "Principles...", "Bernstein", "Newcomer", "Morgan...", "CA".]
Main structure: ordered, labeled tree
References between nodes: becoming a graph
XQuery: XML Query Language
• A declarative language for querying XML data
• XPath: path expressions
– Patterns to be matched against an XML graph
– /bib/paper[author/lastname='Croft']/title
• FLWOR expressions
– Combining matching and restructuring of XML data
– For $p in distinct(document("bib.xml")//publisher)
  Let $b := document("bib.xml")/book[publisher = $p]
  Where count($b) > 100
  Order by $p/name
  Return $p
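For readers without an XQuery engine at hand, the two queries on this slide can be approximated in plain Python. This is a rough sketch, not an XQuery implementation: the sample document `BIB`, the function names, and the flat `<publisher>` text are assumptions made for illustration, and the path predicate is checked by hand because the standard-library `ElementTree` supports only a limited XPath subset.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical sample data; the slide's bib.xml is not available.
BIB = """
<bib>
  <paper><author><lastname>Croft</lastname></author><title>T1</title></paper>
  <paper><author><lastname>Diao</lastname></author><title>T2</title></paper>
  <book><publisher>P1</publisher></book>
  <book><publisher>P1</publisher></book>
  <book><publisher>P2</publisher></book>
</bib>
"""


def titles_by_lastname(root, lastname):
    """Approximates /bib/paper[author/lastname='...']/title."""
    return [
        paper.findtext("title")
        for paper in root.findall("paper")
        if any(n.text == lastname for n in paper.findall("author/lastname"))
    ]


def publishers_with_more_than(root, threshold):
    """Approximates the For/Let/Where/Order by/Return query over publishers."""
    counts = Counter(b.findtext("publisher") for b in root.findall("book"))
    return sorted(p for p, c in counts.items() if c > threshold)


root = ET.fromstring(BIB)
print(titles_by_lastname(root, "Croft"))
print(publishers_with_more_than(root, 1))
```

The `Counter` plays the role of the `Let` clause (grouping books per publisher), the comprehension filter is the `Where` clause, and `sorted` stands in for `Order by`.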
Metadata Management using XML
• File systems for large-scale scientific simulations
– File systems: petabytes or even more
– Directory tree (metadata): large, can’t fit in memory
– Links between files: steps in a simulation, data derivation
• File Searches
– all the files generated on Oct 1, 2005
– all the files whose name is like ‘*simu*.txt’
– all the files that were generated from the file ‘basic-measures.txt’
Build an XML store to manage directory trees!
– XML data model
– XML Query language
– XML Indices
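The three file searches listed above can be tried on a small in-memory directory tree. This is only a sketch of the idea: the element names (`dir`, `file`), the attribute names (`created`, `derived-from`), and the sample files other than 'basic-measures.txt' are assumptions, and a real store would need indices because the tree does not fit in memory.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical directory tree encoded as XML.
TREE = ET.fromstring("""
<dir name="run1">
  <file name="basic-measures.txt" created="2005-09-30"/>
  <dir name="step2">
    <file name="simu-a.txt" created="2005-10-01" derived-from="basic-measures.txt"/>
    <file name="plot.png" created="2005-10-01"/>
  </dir>
</dir>
""")


def files_created_on(tree, date):
    """All files generated on a given date (ISO string)."""
    return [f.get("name") for f in tree.iter("file") if f.get("created") == date]


def files_named_like(tree, pattern):
    """All files whose name matches a shell-style pattern such as '*simu*.txt'."""
    return [f.get("name") for f in tree.iter("file")
            if fnmatch.fnmatchcase(f.get("name"), pattern)]


def files_derived_from(tree, source):
    """All files directly derived from `source` (one derivation step only)."""
    return [f.get("name") for f in tree.iter("file")
            if f.get("derived-from") == source]
```

Each search is a full scan of the tree here; replacing those scans with index lookups is where the "XML Indices" item would come in.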
XML Document Processing
• Multi-hierarchical XML markup of text documents
– Multi-hierarchies: part-of-speech, page-line
– Features in different hierarchies overlap in scope
– Need a query language & querying mechanism
– References [Nakov et al., 2005; Iacob & Dekhtyar, 2005]
• Querying and ranking of XML data
– XML fragments returned as results
– Fuzzy matches
– Ranking of matches
– References [Amer-Yahia et al., 2005; Luo et al., 2003]
• Well-defined problems identify your contributions!
Data Stream Management
Traditional Database (queries run against stored data, e.g., a table with Attr1, Attr2, Attr3, and return results)
• Data at rest
• One-shot or periodic queries
• Query-driven execution
Data Stream Processor (arriving data runs against stored queries, rules, event specs, and subscriptions, and returns results)
• Data in motion, unending
• Continuous, long-running queries
• Data-driven execution
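A continuous, data-driven query can be illustrated with a generator that emits a new result every time a tuple arrives. This is a minimal sketch; the function name `sliding_average` and the choice of a count-based window are assumptions made for the example.

```python
from collections import deque


def sliding_average(stream, window):
    """Continuous query: after each arrival, emit the average of the
    most recent `window` values seen so far."""
    recent = deque(maxlen=window)  # old values fall out automatically
    for value in stream:
        recent.append(value)
        yield sum(recent) / len(recent)


# The query is registered once and then driven by arriving data.
for result in sliding_average(iter([1, 2, 3, 4]), window=2):
    print(result)
```

Unlike a one-shot query, nothing here ever sees the whole input; only the current window is held in memory, so the input may be unending.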
In-Network XML Processing
• XML is becoming the wire format for data
• In-network XML processing
– Authentication
– Authorization
– Routing
– Transformation
– Pattern matching
Expedite traffic; enhance security; real-time monitoring & diagnosis
• XPath widely used for in-network XML processing
• Applied directly to streaming XML data
• Line-speed performance
Research Issues
• Gigabit rate XPath processing
– Take one look, process XPath, buffer data for future use if necessary
– Processing needs to be gigabit rate
– Memory usage needs to be minimized
• Time/space complexity of XPath stream processing
– Theoretical analysis for common features of XPath
• Minimizing memory usage of YFilter technology
– YFilter: state-of-the-art for multi-XPath processing
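The "take one look" style of processing can be sketched with an incremental parser and a stack of open tags. This is not YFilter and handles only simple absolute paths such as /a/b/c with no predicates or wildcards; the function name `stream_match` is an assumption for the example.

```python
import io
import xml.etree.ElementTree as ET


def stream_match(xml_bytes, path):
    """Yield the text of each element matching a simple absolute path,
    reading the document once from start to end."""
    steps = path.strip("/").split("/")
    stack = []  # tags of the currently open elements, root first
    source = io.BytesIO(xml_bytes)
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            stack.append(elem.tag)
        else:
            if stack == steps:
                yield (elem.text or "").strip()
            stack.pop()
            elem.clear()  # release content that has already been processed


doc = (b"<bib><paper><title>A</title></paper>"
       b"<book><title>B</title></book>"
       b"<paper><title>C</title></paper></bib>")
print(list(stream_match(doc, "/bib/paper/title")))
```

The only state kept per query is the tag stack, whose size is bounded by document depth. Matching many XPath expressions at once, as YFilter does, requires sharing that state across queries, which this sketch does not attempt.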
RFID Technology
• RFID technology
[Figure: RFID tags with IDs 01.01298.6EF.0A, 04.0768E.001.F0, and 01.01267.60D.01; a reader reports each reading as (reader_id, tag_id, timestamp).]
RFID Stream Processing
[Figure: an RFID reader reads an RFID tag and emits PML records such as the two below.]
<pml>
<tag>01.01298.6EF.0A</tag>
<time>00129038</time>
<location>shelf 2</location>
</pml>
<pml>
<tag>01.01298.6EF.0A</tag>
<time>02183947</time>
<location>exit1</location>
</pml>
Shoplifting: an item was taken out of store without being checked out.
Out of stocks: the number of items of product X on shelf ≤ 3.
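The two events on this slide, shoplifting and out-of-stock, can be expressed as simple checks over a list of (tag, time, location) readings. This is a sketch under assumptions: the location name "checkout", the `product_of` mapping from tag to product, and the function names are hypothetical, and the readings are assumed to be complete and clean.

```python
def detect_shoplifting(readings, exit_locations=("exit1",), checkout="checkout"):
    """Return tags read at an exit with no earlier read at the checkout.
    A tag read at an exit several times is reported each time."""
    checked_out, alerts = set(), []
    for tag, time, location in sorted(readings, key=lambda r: r[1]):
        if location == checkout:
            checked_out.add(tag)
        elif location in exit_locations and tag not in checked_out:
            alerts.append(tag)
    return alerts


def low_stock(readings, product_of, shelf, threshold=3):
    """Return products with at most `threshold` tags last seen on `shelf`."""
    last_location = {}
    for tag, time, location in sorted(readings, key=lambda r: r[1]):
        last_location[tag] = location  # later readings overwrite earlier ones
    counts = {product: 0 for product in set(product_of.values())}
    for tag, location in last_location.items():
        if location == shelf and tag in product_of:
            counts[product_of[tag]] += 1
    return sorted(p for p, n in counts.items() if n <= threshold)
```

Both functions process a finished batch of readings for clarity; a stream processor would maintain `checked_out` and `last_location` incrementally and raise the alert as soon as the exit reading arrives.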
RFID Processing: Global Tracking
Counterfeit drugs: a bottle is accepted at the retailer if it came from a legal manufacturer and followed all necessary steps in the distribution network.
Expired/spoiled drugs: a bottle is accepted at the retailer if it went through the distribution network in less than 3 months and was never exposed to temperature > 96 F.
Missing pallet, expected case, illegally cloned tags…
[Figure: a sequence of <pml> records for tag 01.001298.6EF.0A across the distribution network. Each record carries <epc>, <ts> with <date> (including type="begin" and type="end"), <entity> (e.g., type="maker" with legal name X Ltd., type="retailer" with legal name CVS), <location>, and <msr label="temperature"> readings (80, 85, 90, 95).]
Challenges in RFID Management
• Data-Information Mismatch
– RFID raw data: (tag id, reader id, timestamp)
– Meaningful information: shoplifting, misplaced inventory, out-of-stocks; expired drugs, spoiled drugs…
• Incomplete, inaccurate data
– Readers miss tags
– Readers can pick up tags from overlapping areas
• High-volume data
– Readers read constantly, from all tags in range, without line-of-sight
– Can create up to millions of terabytes of data in a single day
• Low-latency processing
– Up-to-the-second information, time-critical actions
Research Issues
• Real-time event stream processing
– Handling duplicate readings/results
– Data cleaning
– Data compression
• Handling incomplete readings
– Inferences in event databases
– Inferences over event streams
• Distributed processing
– Real time anomaly detection
– Distributed inferences
Adaptive Sensing of Atmosphere
• Environmental monitoring: real-time processing of huge-volume meteorological data
• Challenges
– Large volume but limited bandwidth
– Real-time processing
– Uncertain data
– Data archiving and querying the history
[Figure: two sensing nodes each Sense and Send; their data goes through Merge, then Detection, then Prediction.]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Managing uncertain data
– Model sources of data uncertainty
– Develop uncertainty calculus to combine the effects of these sources
– Augment results with confidence values
[Figure: sources (1)-(3) marked on each of two sensing paths, source (4) at the Merge step; Merge feeds Tornado Detection and Prediction (confidence?).]
Managing Uncertain Data
• Sources of data uncertainty
1) Sensing noise and partial scanning
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Self diagnosis and tuning
– Compare prediction at t with observation at t+1 (no ground truth?!)
– System diagnosis when confidence value is low
– Automatically tune the system
[Figure: sources (1)-(3) marked on each of two sensing paths, source (4) at the Merge step; Merge feeds Tornado Detection and Prediction (confidence?).]
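The self-diagnosis idea of comparing a prediction made at t with the observation at t+1 can be sketched in a few lines. This is only an illustration: the function name `diagnose`, the scalar values, and the fixed `tolerance` are assumptions, and the slide itself notes that the observation is not ground truth.

```python
def diagnose(predictions, observations, tolerance):
    """Return the time steps t whose prediction for t+1 missed the
    observation at t+1 by more than `tolerance`.

    predictions[t] is the forecast made at time t for time t+1;
    observations[t] is the value sensed at time t.
    """
    flagged = []
    for t, predicted in enumerate(predictions):
        if t + 1 < len(observations):
            if abs(predicted - observations[t + 1]) > tolerance:
                flagged.append(t)
    return flagged
```

The flagged time steps are candidates for system diagnosis and tuning; how to tune is the open research question on the slide.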
Questions
Outline
• An outside look: DB Application
• An inside look: Anatomy of DBMS
• Project ideas: DB Application
• Project ideas: DBMS Internals
Application: UMass CS Pub DB
• UMass Computer Science Publication Database
– All papers on professors’ web pages and in their DBLP records
– All technical reports
• Search:
– Catalog search (author, title, year, conference, etc.)
– Text search (using SQL “LIKE”)
• Navigation
– Overview of the structure of document collection
– Area-based “drill down” and “roll up” with statistics
• Add document
• Top hits
• Example: http://dbpubs.stanford.edu:8090/aux/index-en.html
• Deliverables: useful software, user-friendly interface
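The catalog search and the SQL "LIKE" text search mentioned on this slide could start from something like the sketch below, using Python's built-in `sqlite3` module. The table name `paper`, its columns, and the function names are assumptions for illustration, not a prescribed schema.

```python
import sqlite3


def build_catalog(rows):
    """Create an in-memory catalog from (author, title, year, venue) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE paper (author TEXT, title TEXT, year INTEGER, venue TEXT)")
    conn.executemany("INSERT INTO paper VALUES (?, ?, ?, ?)", rows)
    return conn


def catalog_search(conn, author, year):
    """Catalog search: exact match on structured fields."""
    cur = conn.execute(
        "SELECT title FROM paper WHERE author = ? AND year = ? ORDER BY title",
        (author, year))
    return [title for (title,) in cur]


def text_search(conn, keyword):
    """Text search: substring match on the title using LIKE."""
    cur = conn.execute(
        "SELECT title FROM paper WHERE title LIKE ? ORDER BY year, title",
        (f"%{keyword}%",))
    return [title for (title,) in cur]
```

Parameters are passed with `?` placeholders rather than string formatting, which keeps user-supplied search terms from being interpreted as SQL.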
Application: RFID Database
• RFID technology
• RFID supply chain
– Locations: Manufacturer, Supplier DC, Retail DC, Retail Store
– Objects: Truck, Pallet, Case
Application: RFID Database
• RFID technology
• RFID Supply chain
• Database propagation
– Streams of (reader_id, tag_id, time)
– Semantics: reader_id → location, tag_id → object
– Containment
• Location-based, items in a case, cases on a pallet, pallets in a truck…
• Duration of containment
– History of movement: (object, location, time_in, time_out)
– Data compression for duplicate readings
– Integration with sensors: temperature, location…
• Track and trace queries
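Turning the raw stream into a history of movement, and compressing duplicate readings along the way, can be sketched as follows. The function name `compress_readings` and the `location_of` mapping from reader to location are assumptions for the example; containment and sensor integration are left out.

```python
def compress_readings(readings, location_of):
    """Collapse (reader_id, tag_id, time) readings into
    (object, location, time_in, time_out) stays."""
    stays = []
    open_stay = {}  # tag_id -> index of that tag's current stay in `stays`
    for reader_id, tag_id, time in sorted(readings, key=lambda r: r[2]):
        location = location_of[reader_id]
        idx = open_stay.get(tag_id)
        if idx is not None and stays[idx][1] == location:
            obj, loc, time_in, _ = stays[idx]
            stays[idx] = (obj, loc, time_in, time)   # extend the current stay
        else:
            open_stay[tag_id] = len(stays)           # object moved: new stay
            stays.append((tag_id, location, time, time))
    return stays
```

Repeated readings of the same tag at the same location collapse into a single row, so storage grows with the number of movements rather than the number of readings. A track-and-trace query then becomes a lookup over the stays of one object.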
Data Quality
• Closed world assumption: not any more!
• Various sources of data loss
1) Sensing noise
2) Data compression
3) Lossy wireless links
4) Incomplete merging
• Probabilistic query processing
– Model sources of data loss
– Quantify the effect on queries max(), avg(), percentile…
– Output query results with confidence level
[Figure: loss sources (1)-(3) marked on each of two sensing paths, source (4) at the Merge step.]
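One simple way to attach a confidence level to avg() is a normal-approximation confidence interval over the readings that actually arrived. This is a sketch under a strong assumption, namely that lost readings are missing at random, so the received values behave like a random sample; modeling the four loss sources individually is the project itself. The function name and the default z = 1.96 (about 95% confidence) are choices made here.

```python
import math
import statistics


def avg_with_confidence(received, z=1.96):
    """Return (mean, half_width) so that the answer reads mean ± half_width.

    Assumes the received readings are a random sample of all readings.
    """
    n = len(received)
    mean = statistics.mean(received)
    if n < 2:
        return mean, float("inf")  # one reading gives no spread estimate
    half_width = z * statistics.stdev(received) / math.sqrt(n)
    return mean, half_width
```

The interval widens as fewer readings survive, which gives the query result a confidence that degrades with data loss. The same approach does not carry over directly to max() or percentiles, which need different estimators.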
• Some ideas on INFOD/data dissemination