The Case for Predictive Database Systems: Opportunities

Download Report

Transcript The Case for Predictive Database Systems: Opportunities

The Case for Predictive Database
Systems: Opportunities and Challenges
Mert Akdere, Ugur Cetintemel,
Matteo Riondato, Eli Upfal, Stan Zdonik
Brown University
.:CIDR 2011:.
Data-driven Predictive Analytics
 Essentials: Data-driven Model-based Prediction for
Regression, Classification, Clustering, and Association
Historical
& Current
data
Statistical
Models
(Bayesian Nets,
Decision Trees)
Future,
missing data
values,
events,
trends
 Widespread apps: monitoring, planning and
decision support
 User-facing apps: financial analysis, customer decision
support, inventory management, …
 System-facing apps: optimization, tuning, self management…
1
Support for Predictive Applications:
Off-Database Design
Iterative Learning Process
Database
System
Read
Data
Train
Sampling
Test
 Database used as backend
storage
 Off-database prediction
with Matlab, R, other machine
learning libraries or custom
code
2
Select
Feature(s)
Build
Model
Type1
Build
Model
Type2
Apply
Model
Build
Model
Type3
Accuracy
feedback
Output
Model
Off-Database Design:
Problems
Iterative Learning Process
Database
System
Read
Data
Train
Sampling
 Too much data movement!
 Duplication of database
functionality required for
efficiency
 Usability issues:
• Custom, hard-to-manage
prediction logic
3
Select
Feature(s)
Test
Build
Model
Type1
Build
Model
Type2
Apply
Model
Build
Model
Type3
Accuracy
feedback
Output
Model
Predictive Database Systems
 In-Database Prediction
Systemfacing apps
Prediction
Unit
Declarative
Prediction
Queries
Predictive
Database System
 Declarative Prediction Queries
 Models as First-class Citizens
 Model Management
4
User-facing
Applications
Predictive Database Architecture
Data Management
Database
Catalog
Statistics
Data
Indices /
Views
 Database Components
(Prediction)
Query
Optimizer
Executor
Model Management
Model
Catalog
Model
Instances
 Prediction Components
 Data and Model management in similar ways
 Data independence and model independence
5
Prediction
Statistics
Relational DBMSs and Predictive DBMSs
6
RDBMS
PDBMS
Physical design and specialized
data structures
Facilitate model building and
predictive query execution
Pre-computation and
materialization
Pre-building models for adhoc and interactive queries
Profiling and modeling
Cost/accuracy characteristics
of models for optimization
Query optimization
Alternate ways of model
building and prediction
Longview
 Predictive database project @Brown
 Prototype implementation on PostgreSQL
 Built on database extensibility mechanisms
(e.g., views, triggers) for portability
 Offers standard and well-tested
predictive functionality
 Longview does not compete with
custom/optimized prediction methods
 Prediction as a “commodity” service,
e.g., Google prediction API
7
Use-Case: Customer Analysis
 Given status for a subset
of the customers, build a
model for predicting the
status of all customers.
 customer status =
‘preferred’ or ‘regular’
Database:
Customer(cid, age, city)
Orders(oid, cid, total)
Training Data:
CustomerStatus(cid, status)
Build Prediction Model(s):
Features
age, city, total
8
Model
Target
SVM,
status
Linear
Regression,…
Customer Analysis w/
Declarative Prediction Queries
Predictor schema (p_schema)
age, city, total  status
Status predictor definition and
training data specification
Prediction with status
predictor for new data
9
Create P_schema StatusSchema (
age int, city text, total int,
TARGET status text)
Create Predictor StatusPredictor
On StatusSchema
With Data select age, city, sum(total), status
from customer c, orders o, customerstatus cs
where c.cid = o.cid and c.cid = cs.cid group by
c.cid, name, city, status
Select c.cid, StatusPredictor(age, city,
total) from customer c,
(select cid, sum(total) as total from Orders
group by cid) as o where c.cid = o.cid
Customer(cid, age, city),
Orders(oid, cid, total), CustomerStatus(cid, status)
Model Building and Maintenance
 Prediction models integrated as black-box
functions with Prediction Model API
 Automatically build, materialize and maintain
models for predictors
 Model management modes:
 Accuracy or time thresholds
 Continuous in the background
10
Model Space
 The space of models searched by
Longview in model building
 Model Space Dimensions:
 Model selection
 Feature selection
 Data selection
 Sampling
 Partitioning
11
Model Search
12
Task
Description
Opportunities
Model
Selection
Choosing the model
type:
SVMs, Bayesian
Networks, …
• Profiling and Modeling for
computation/memory and I/O
requirements of prediction models
• Parallel model building & testing
Feature
Selection
Choosing the model • Specialized data structures, access
attributes
methods and data organizations for
efficient access to feature subsets,
e.g., columnar storage, multidimensional indices
Data
Selection
Choosing the model
building data w/
Sampling,
Partitioning
• Efficient disk organizations for
(sequential) sampling, and efficient
access and storage methods for data
partitions
Model Composition
 Model Ensembles
 Combine/Select results of many
models based on decision rules
weighted vote
model1
For Computational Efficiency,
Improved Accuracy and Extensibility
 Model Sequencing
 Two or models are combined
where the output of one
model is used as input in
another model.
For Extensibility
13
model1
model2
model2
model3
model3
Scaling Up w/ Model Composition
 Brute-force approach:
 Exact (or approximate) distributed implementations of
prediction methods, e.g., Mahout
 Using ensembles:
 Sampling
 Ensembles of models on memory-size data samples
 Partitioning
 Separate models for each data partition
14
Case Study:
Query Performance Prediction (QPP)
 Predict query execution times with pre-runtime
features
 Useful for workload scheduling, resource allocation
 Database systems are complex with many
interacting components
 Hard to achieve an accurate and portable
analytical model
15
QPP w/ Model Composition
 Prediction on TPC-H queries
 10GB DB, workload of 500 queries
 Plan-level prediction
 10% avg. error on known
templates
 232% avg. error on an
unforeseen query template
 Operator-level prediction
 56% avg. error on
unforeseen templates
 Hybrid prediction combines
operator-level and plan-level
 38% avg. error on templates
16
Related Work
 Declarative Forecasting Queries.
Duan and Babu. Processing forecasting queries. VLDB 2007
 MauveDB. Querying model-based views.
Deshpande and Madden. MauveDB: supporting model-based user
views in database systems. SIGMOD 2006
 Data Mining Services in DBMSs. Data mining with
stored procedures.
Microsoft SQL Server Data Mining
IBM InfoShpere
Oracle Data Mining
17
Conclusion
 Predictive DBMSs to enable efficient declarative
predictive analytics
 Both user and system facing apps benefit
 Model management as the key technology
 Similar to data management
 Predictive models as first class citizens
 Automated model building and use in predictive queries
 Cost awareness
 Longview aims to build a full-fledged PDBMS
18
Thanks!!!
Questions ?
19