Models and Sensor Networks
Probabilistic Databases
Amol Deshpande, University of Maryland
Overview
V.S. Subrahmanian: ProbView, PXML, Temporal Probabilistic Databases, Probabilistic Aggregates
Lise Getoor: Statistical Relational Learning, Probabilistic Relational Models, Entity Resolution
Amol: MauveDB: Statistical Modeling in Databases, Correlated tuples in probabilistic databases
Overview of Today’s Presentation
Model-based Views/MauveDB [Amol]
- Goal: Making it easy to continuously apply statistical models to streaming data
- Current focus on designing declarative interfaces, and on efficient maintenance algorithms
- Less on the “probabilistic databases” issues
Statistical Relational Learning [Lise]
Representing arbitrarily correlated data and processing queries over it [Prithviraj]
Motivation
Unprecedented, and rapidly increasing, instrumentation of our every-day world
- Wireless sensor networks
- Distributed measurement networks (e.g. GPS)
- RFID
- Industrial monitoring
Huge data volumes generated continuously that must be processed in real-time
Typically imprecise, unreliable and incomplete data
- Measurement noise, low success rates, failures etc.
Data Processing Step 1
Process data using a statistical/probabilistic model
Regression and interpolation models
- To eliminate spatial or temporal biases, handle missing data, and make predictions
Filtering techniques (e.g. Kalman Filters), Bayesian Networks
- To eliminate measurement noise, to infer hidden variables etc.
Examples:
- Temperature monitoring: regression/interpolation models
- GPS data: Kalman Filters etc.
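As a rough illustration of the filtering step, here is a minimal sketch of a scalar Kalman filter applied to a noisy temperature-like stream. It assumes a random-walk state model; the function name `kalman_1d`, the variance parameters, and the sample readings are all made up for this example and do not come from the talk.

```python
def kalman_1d(measurements, process_var=1e-3, meas_var=0.5, x0=0.0, p0=1.0):
    """Scalar Kalman filter assuming the true value drifts slowly (random walk).

    A measurement of None stands for a missing reading: the filter then
    only predicts and carries the previous estimate forward.
    """
    x, p = x0, p0          # current state estimate and its variance
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        p = p + process_var
        if z is not None:
            # Update: blend prediction and measurement using the Kalman gain.
            k = p / (p + meas_var)
            x = x + k * (z - x)
            p = (1.0 - k) * p
        estimates.append(x)
    return estimates


# Hypothetical readings with one missing value and one outlier.
readings = [20.3, 19.8, None, 20.9, 20.1, 35.0, 20.2]
smoothed = kalman_1d(readings, x0=readings[0])
```

The missing reading is filled with the previous estimate, and the outlier is pulled toward the running estimate rather than passed through unchanged.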
A Motivating Example
Inferring “transportation mode”/“activities” [Henry Kautz et al]
- Using easily obtainable sensor data, e.g. GPS, RFID proximity data
- Can do much if we can infer these automatically
[Figure labels: home, office]
Have access to noisy “GPS” data
Infer the transportation mode: walking, running, in a car, in a bus
Preferred end result:
Clean path annotated with transportation mode
Dynamic Bayesian Network
Use a “generative model” for describing how the observations were generated
Time = t
- Mt: Transportation Mode (Walking, Running, Car, Bus)
- Xt: True velocity and location
- Ot: Observed location
Need conditional probability distributions
- e.g. a distribution on (velocity, location) given the transportation mode
- Prior knowledge or learned from data
Dynamic Bayesian Network
Use a “generative model” for describing how the observations were generated
Time = t: Mt, Xt, Ot
Time = t+1: Mt+1, Xt+1, Ot+1
- Mt, Mt+1: Transportation Mode (Walking, Running, Car, Bus)
- Xt, Xt+1: True velocity and location
- Ot, Ot+1: Observed location
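The generative model can be written as a factorized joint distribution. The transcript does not preserve the arrows of the network diagram, so the sketch below assumes a common structure: the mode depends on the previous mode, the true state depends on the previous true state and the current mode, and the observation depends only on the current true state. Subscripts correspond to the slide's Mt, Xt, Ot.

```latex
P(M_{1:T}, X_{1:T}, O_{1:T})
  = P(M_1)\, P(X_1 \mid M_1)\, P(O_1 \mid X_1)
    \prod_{t=2}^{T} P(M_t \mid M_{t-1})\, P(X_t \mid X_{t-1}, M_t)\, P(O_t \mid X_t)
```

Each factor is one of the conditional probability distributions that must be supplied as prior knowledge or learned from data.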
Dynamic Bayesian Network
Given a sequence of observations (Ot), find the most likely Mt’s that explain it.
Or could provide a probability distribution on the possible Mt’s.
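To make the inference task concrete, here is a minimal sketch that finds the most likely mode sequence with the Viterbi algorithm. It simplifies the network to a hidden Markov model: the continuous state Xt is dropped, and each observation is a coarse speed bucket assumed to be derived from consecutive GPS fixes. All probabilities and names (`PRIOR`, `TRANS`, `EMIT`, `viterbi`) are invented for illustration.

```python
import math

MODES = ["Walking", "Running", "Car", "Bus"]

# All numbers below are made-up illustration values, not learned parameters.
PRIOR = {m: 0.25 for m in MODES}
STAY = 0.85  # probability of keeping the same mode between time steps
TRANS = {a: {b: (STAY if a == b else (1 - STAY) / 3) for b in MODES} for a in MODES}
EMIT = {
    "Walking": {"slow": 0.80, "medium": 0.15, "fast": 0.05},
    "Running": {"slow": 0.20, "medium": 0.70, "fast": 0.10},
    "Car":     {"slow": 0.05, "medium": 0.25, "fast": 0.70},
    "Bus":     {"slow": 0.15, "medium": 0.45, "fast": 0.40},
}


def viterbi(observations):
    """Return the most likely mode sequence for the observed speed buckets."""
    # Work in log space to avoid underflow on long sequences.
    score = {m: math.log(PRIOR[m]) + math.log(EMIT[m][observations[0]]) for m in MODES}
    back = []  # back[i][m] = best predecessor of mode m at step i + 1
    for obs in observations[1:]:
        new_score, pointers = {}, {}
        for m in MODES:
            best_prev = max(MODES, key=lambda p: score[p] + math.log(TRANS[p][m]))
            new_score[m] = (score[best_prev] + math.log(TRANS[best_prev][m])
                            + math.log(EMIT[m][obs]))
            pointers[m] = best_prev
        back.append(pointers)
        score = new_score
    # Trace back from the best final mode.
    path = [max(MODES, key=lambda m: score[m])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


obs = ["slow", "slow", "slow", "fast", "fast", "fast"]
path = viterbi(obs)
# path == ["Walking", "Walking", "Walking", "Car", "Car", "Car"]
```

Returning a probability distribution over the Mt’s instead of a single sequence would use the forward-backward algorithm over the same tables.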
Statistical Modeling of Sensor Data
No support in database systems, so the database ends up being used as a backing store
- With much replication of functionality
- Very inefficient, not declarative…
How can we push statistical modeling inside a database system?
Abstraction: Model-based Views
An abstraction analogous to traditional database views
Present the output of applying the model as a database view
That the user can query as with normal database views
Example DBN View
User view of the data
- Smoothed locations
- Inferred variables

User   Time     Location      Mode      prob
John   5pm      (x’1, y’1)    Walking   0.9
John   5pm      (x’1, y’1)    Car       0.1
John   5:05pm   (x’2, y’2)    Walking   0
John   5:05pm   (x’2, y’2)    Car       1

e.g.
select count(*)
group by mode
sliding window 5 minutes

Original noisy GPS data

User   Time     Location
John   5pm      (x1, y1)
John   5:05pm   (x2, y2)

Application of the model/inference is pushed inside the database
Opens up many optimization opportunities
- e.g. can do inference lazily when queried etc.
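The idea of a view whose rows come from a model, with inference deferred until query time, can be sketched as follows. This is a hypothetical illustration and not MauveDB's interface: `ModelBasedView`, `toy_model`, and `expected_count_by_mode` are invented names, the stand-in model hard-codes the probabilities from the example table instead of running a DBN, and the count is interpreted as an expected count over a single window that covers both timestamps.

```python
class ModelBasedView:
    """Hypothetical sketch: rows are produced by running a model over raw data."""

    def __init__(self, raw_rows, model):
        self._raw = list(raw_rows)
        self._model = model
        self._cache = None        # materialized view rows, if any
        self.inference_runs = 0   # counts how often the model was applied

    def insert(self, row):
        self._raw.append(row)
        self._cache = None        # new data invalidates the materialized rows

    def rows(self):
        # Lazy evaluation: inference runs only when the view is queried.
        if self._cache is None:
            self._cache = self._model(self._raw)
            self.inference_runs += 1
        return self._cache

    def expected_count_by_mode(self):
        # Expected number of tuples per mode: the sum of tuple probabilities.
        counts = {}
        for _user, _time, _loc, mode, prob in self.rows():
            counts[mode] = counts.get(mode, 0.0) + prob
        return counts


def toy_model(raw_rows):
    """Stand-in for DBN inference; posteriors copied from the example table."""
    posteriors = {
        "5pm": {"Walking": 0.9, "Car": 0.1},
        "5:05pm": {"Walking": 0.0, "Car": 1.0},
    }
    out = []
    for user, time, loc in raw_rows:
        for mode, prob in posteriors[time].items():
            out.append((user, time, loc, mode, prob))  # location left unsmoothed
    return out


raw = [("John", "5pm", ("x1", "y1")), ("John", "5:05pm", ("x2", "y2"))]
view = ModelBasedView(raw, toy_model)
runs_before_query = view.inference_runs          # 0: nothing computed yet
counts = view.expected_count_by_mode()           # triggers inference once
counts_again = view.expected_count_by_mode()     # served from the cache
```

Caching until the next insert is only one possible policy; an eager policy would instead update the rows on every new measurement.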
Correlations
User   Time     Location      Mode      prob
John   5pm      (x’1, y’1)    Walking   0.9
John   5pm      (x’1, y’1)    Car       0.1
John   5:05pm   (x’2, y’2)    Walking   0
John   5:05pm   (x’2, y’2)    Car       1

Strong and complex correlations across tuples
- Mutual exclusivity
- Temporal correlations
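A small possible-worlds enumeration shows why mutual exclusivity matters. The sketch below uses the probabilities from the table and compares two readings of the 5pm tuples: treating them as independent, and treating them as alternatives of which exactly one holds. It assumes the two timestamps are independent of each other, which ignores the temporal correlations also listed above; the names `blocks` and `worlds` are made up for this example.

```python
from collections import defaultdict
from itertools import product

# One block of mutually exclusive alternatives per timestamp (from the table).
blocks = {
    "5pm": {"Walking": 0.9, "Car": 0.1},
    "5:05pm": {"Walking": 0.0, "Car": 1.0},
}


def worlds(blocks):
    """Yield (world, probability); a world picks exactly one mode per timestamp.

    Timestamps are assumed independent of each other here, which is a
    simplification: it ignores temporal correlations.
    """
    times = list(blocks)
    for choice in product(*(blocks[t].items() for t in times)):
        prob = 1.0
        for _mode, p in choice:
            prob *= p
        world = {(t, mode) for t, (mode, _p) in zip(times, choice)}
        yield world, prob


# Wrong reading: both 5pm tuples treated as independent events.
independent_both = blocks["5pm"]["Walking"] * blocks["5pm"]["Car"]

# With mutual exclusivity, no world contains both 5pm tuples.
exclusive_both = sum(
    p for w, p in worlds(blocks)
    if ("5pm", "Walking") in w and ("5pm", "Car") in w
)

# Distribution of count(*) where mode = Car, over the possible worlds.
car_count_dist = defaultdict(float)
for w, p in worlds(blocks):
    car_count_dist[sum(1 for (_t, mode) in w if mode == "Car")] += p
```

Under independence, John would be both walking and in a car at 5pm with probability 0.09; under mutual exclusivity that probability is 0.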
MauveDB: Status
Implemented in Apache Derby, an open source Java database system
Support for Regression- and Interpolation-based views
- Neither produces probabilistic data
- SIGMOD 2006 (w/ Sam Madden)
Currently building support for views based on Dynamic Bayesian networks [Bhargav]
- Kalman Filters, HMMs etc.
- Initial focus on the user interfaces and efficient inference
- Will generate probabilistic data; may not be able to do anything too sophisticated with it
Research Challenges/Future Work
Generalizing to arbitrary models?
Develop APIs for adding arbitrary models
Try to minimize the work of the model developer
Probabilistic databases
Uncertain data with complex correlation patterns
Query processing, query optimization
View maintenance in presence of high-rate measurement streams
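One way to picture view maintenance under a high-rate stream is a regression view kept up to date from running sums, so that each new measurement costs constant work and old data is never rescanned. The sketch below is a generic incremental least-squares fit of a line and is not claimed to be MauveDB's maintenance algorithm; the class name `IncrementalLinearFit` and its methods are hypothetical.

```python
class IncrementalLinearFit:
    """Maintains the least-squares line y = a + b*x under inserts.

    Only five running sums are stored, so each insert is O(1).
    """

    def __init__(self):
        self.n = 0
        self.sx = self.sy = self.sxx = self.sxy = 0.0

    def insert(self, x, y):
        self.n += 1
        self.sx += x
        self.sy += y
        self.sxx += x * x
        self.sxy += x * y

    def coefficients(self):
        denom = self.n * self.sxx - self.sx ** 2
        if self.n < 2 or denom == 0:
            raise ValueError("need at least two distinct x values")
        b = (self.n * self.sxy - self.sx * self.sy) / denom
        a = (self.sy - b * self.sx) / self.n
        return a, b

    def predict(self, x):
        a, b = self.coefficients()
        return a + b * x


fit = IncrementalLinearFit()
for x in range(5):
    fit.insert(x, 2 + 3 * x)   # synthetic, noise-free measurements
a, b = fit.coefficients()
```

Models without such compact sufficient statistics, such as a DBN over a long history, would need a different maintenance strategy.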
Thanks!!
Mauve == Model-based User Views