The Case for Predictive Database Systems: Opportunities
Download
Report
Transcript The Case for Predictive Database Systems: Opportunities
The Case for Predictive Database
Systems: Opportunities and Challenges
Mert Akdere, Ugur Cetintemel,
Matteo Riondato, Eli Upfal, Stan Zdonik
Brown University
.:CIDR 2011:.
Data-driven Predictive Analytics
Essentials: Data-driven Model-based Prediction for
Regression, Classification, Clustering, and Association
Historical
& Current
data
Statistical
Models
(Bayesian Nets,
Decision Trees)
Future,
missing data
values,
events,
trends
Widespread apps: monitoring, planning and
decision support
User-facing apps: financial analysis, customer decision
support, inventory management, …
System-facing apps: optimization, tuning, self management…
1
Support for Predictive Applications:
Off-Database Design
Iterative Learning Process
Database
System
Read
Data
Train
Sampling
Test
Database used as backend
storage
Off-database prediction
with Matlab, R, other machine
learning libraries or custom
code
2
Select
Feature(s)
Build
Model
Type1
Build
Model
Type2
Apply
Model
Build
Model
Type3
Accuracy
feedback
Output
Model
Off-Database Design:
Problems
Iterative Learning Process
Database
System
Read
Data
Train
Sampling
Too much data movement!
Duplication of database
functionality required for
efficiency
Usability issues:
• Custom, hard-to-manage
prediction logic
3
Select
Feature(s)
Test
Build
Model
Type1
Build
Model
Type2
Apply
Model
Build
Model
Type3
Accuracy
feedback
Output
Model
Predictive Database Systems
In-Database Prediction
Systemfacing apps
Prediction
Unit
Declarative
Prediction
Queries
Predictive
Database System
Declarative Prediction Queries
Models as First-class Citizens
Model Management
4
User-facing
Applications
Predictive Database Architecture
Data Management
Database
Catalog
Statistics
Data
Indices /
Views
Database Components
(Prediction)
Query
Optimizer
Executor
Model Management
Model
Catalog
Model
Instances
Prediction Components
Data and Model management in similar ways
Data independence and model independence
5
Prediction
Statistics
Relational DBMSs and Predictive DBMSs
6
RDBMS
PDBMS
Physical design and specialized
data structures
Facilitate model building and
predictive query execution
Pre-computation and
materialization
Pre-building models for adhoc and interactive queries
Profiling and modeling
Cost/accuracy characteristics
of models for optimization
Query optimization
Alternate ways of model
building and prediction
Longview
Predictive database project @Brown
Prototype implementation on PostgreSQL
Built on database extensibility mechanisms
(e.g., views, triggers) for portability
Offers standard and well-tested
predictive functionality
Longview does not compete with
custom/optimized prediction methods
Prediction as a “commodity” service,
e.g., Google prediction API
7
Use-Case: Customer Analysis
Given status for a subset
of the customers, build a
model for predicting the
status of all customers.
customer status =
‘preferred’ or ‘regular’
Database:
Customer(cid, age, city)
Orders(oid, cid, total)
Training Data:
CustomerStatus(cid, status)
Build Prediction Model(s):
Features
age, city, total
8
Model
Target
SVM,
status
Linear
Regression,…
Customer Analysis w/
Declarative Prediction Queries
Predictor schema (p_schema)
age, city, total status
Status predictor definition and
training data specification
Prediction with status
predictor for new data
9
Create P_schema StatusSchema (
age int, city text, total int,
TARGET status text)
Create Predictor StatusPredictor
On StatusSchema
With Data select age, city, sum(total), status
from customer c, orders o, customerstatus cs
where c.cid = o.cid and c.cid = cs.cid group by
c.cid, name, city, status
Select c.cid, StatusPredictor(age, city,
total) from customer c,
(select cid, sum(total) as total from Orders
group by cid) as o where c.cid = o.cid
Customer(cid, age, city),
Orders(oid, cid, total), CustomerStatus(cid, status)
Model Building and Maintenance
Prediction models integrated as black-box
functions with Prediction Model API
Automatically build, materialize and maintain
models for predictors
Model management modes:
Accuracy or time thresholds
Continuous in the background
10
Model Space
The space of models searched by
Longview in model building
Model Space Dimensions:
Model selection
Feature selection
Data selection
Sampling
Partitioning
11
Model Search
12
Task
Description
Opportunities
Model
Selection
Choosing the model
type:
SVMs, Bayesian
Networks, …
• Profiling and Modeling for
computation/memory and I/O
requirements of prediction models
• Parallel model building & testing
Feature
Selection
Choosing the model • Specialized data structures, access
attributes
methods and data organizations for
efficient access to feature subsets,
e.g., columnar storage, multidimensional indices
Data
Selection
Choosing the model
building data w/
Sampling,
Partitioning
• Efficient disk organizations for
(sequential) sampling, and efficient
access and storage methods for data
partitions
Model Composition
Model Ensembles
Combine/Select results of many
models based on decision rules
weighted vote
model1
For Computational Efficiency,
Improved Accuracy and Extensibility
Model Sequencing
Two or models are combined
where the output of one
model is used as input in
another model.
For Extensibility
13
model1
model2
model2
model3
model3
Scaling Up w/ Model Composition
Brute-force approach:
Exact (or approximate) distributed implementations of
prediction methods, e.g., Mahout
Using ensembles:
Sampling
Ensembles of models on memory-size data samples
Partitioning
Separate models for each data partition
14
Case Study:
Query Performance Prediction (QPP)
Predict query execution times with pre-runtime
features
Useful for workload scheduling, resource allocation
Database systems are complex with many
interacting components
Hard to achieve an accurate and portable
analytical model
15
QPP w/ Model Composition
Prediction on TPC-H queries
10GB DB, workload of 500 queries
Plan-level prediction
10% avg. error on known
templates
232% avg. error on an
unforeseen query template
Operator-level prediction
56% avg. error on
unforeseen templates
Hybrid prediction combines
operator-level and plan-level
38% avg. error on templates
16
Related Work
Declarative Forecasting Queries.
Duan and Babu. Processing forecasting queries. VLDB 2007
MauveDB. Querying model-based views.
Deshpande and Madden. MauveDB: supporting model-based user
views in database systems. SIGMOD 2006
Data Mining Services in DBMSs. Data mining with
stored procedures.
Microsoft SQL Server Data Mining
IBM InfoShpere
Oracle Data Mining
17
Conclusion
Predictive DBMSs to enable efficient declarative
predictive analytics
Both user and system facing apps benefit
Model management as the key technology
Similar to data management
Predictive models as first class citizens
Automated model building and use in predictive queries
Cost awareness
Longview aims to build a full-fledged PDBMS
18
Thanks!!!
Questions ?
19