Selectivity Estimation using Probabilistic Models

Download Report

Transcript Selectivity Estimation using Probabilistic Models

Selectivity Estimation using
Probabilistic Models
Author: Lise Getoor, Ben Taskar, Daphne Koller
Presenter: Qidan Cheng
Outline





Introduction
Estimation for single tables
SRM: Statistical Relational Models
Selectivity Estimation using SRMs
Learning SRM
Introduction

Accurate estimates of the result size of
queries are crucial to several query
processing components of DBMS.
 Cost-based query optimizers: choose
the optimal query execution plan.
 Query profilers: predicting resource
consumption and distribution of query
results.
 Answer counting queries
Introduction

How to estimate the size of a selection query
over multiple attributes for a single table?
Query:
select * from R
where R.A1 = a1 and … and
R.Ak = ak
Size(Q) = |R|  PD(a1,…,ak)
The result size is determined by the joint
frequency distribution of the values of these
attributes.
Introduction



But…exponential in # of attributes vn,
representing all combination of attribute
values is infeasible.
Attribute value independence assumption:
joint distribution is product of single attribute
distributions
Problem: overestimate or underestimate the
query size.
→ Bayesian Network
Estimation for single table

Example:
Given a simple relation R with three
attributes:
Education (high-school, college,degree) 3 values
Income (low, medium, high)
3 values
Home-Owner (false, true)
2 values

Joint distribution need 18 numbers.
Observation:
Some of the correlations between
attributes might be indirect ones,
mediated by other attributes.
→ Conditional Independence
Estimation for single table

P(H=h|E=e,I=i)=P(H=h|I=i)
Home-owner is conditionally
independent of Education given
Income.
Education
Income
Home-owner
→ Compact form of the joint distribution
P(H,E,I) = P(E)P(I|E)P(H|I,E)=P(E)P(I|E)P(H|I)
Estimation for single table
Figure (b) can encode precisely the same joint
distribution as in Figure (a)
Bayesian Networks
Nodes = random variables
Edges = direct probabilistic
influence
Pneumonia
Tuberculosis
Lung Infiltrates
X-Ray
Sputum Smear
Network structure encodes independence
assumptions: X-Ray conditionally
independent of Pneumonia given Infiltrates
Bayesian Networks
PT
p t
P(I |P, T )
p t
0.6 0.4
p t
0.2 0.8
0.8 0.2
p t 0.01 0.99

Pneumonia
Tuberculosis
Lung Infiltrates
X-Ray
Sputum Smear
Associated with each node Xi there is a
conditional probability distribution P(Xi|Pai:)
— distribution over Xi for each assignment to
parents
BN Semantics
P
T
I
X
S
conditional
independencies
in BN structure
+
local
probability
models
=
full joint
distribution
over domain
Compact & natural representation:
nodes have  k parents  2k n vs. 2n params
BNs for Query Estimation

Query:
select * from R
where R.A1 = a1 and … and R.Ak = ak
Size(Q) = |R|  PD(a1,…,ak)

P(a1,a2,…an)= P(ai|parents(ai))
Use Bayesian inference algorithm to compute
PD(a1,…,ak)

Algorithm complexity depends on BN
connectivity; efficient in practice
Join Selectivity Estimation
Naïve Approach
Uniform Join Assumption
Purchase
Person
Assuming referential integrity
Size(Purchase
Person) = | Purchase |
Join Selectivity Estimation
Joining Two Tables
Example query Q:
“person.income=high and
purchase.type=luxury”
p = P (person.income=high)
q = P (purchase.type=luxury)
SizeQ = |Purchase|*p*q
Problems:
Correlated Attributes
Type = luxury
Income = high
Income = low
Type = necessity
Purchase
Person
The attributes of the two different tables are often
correlated
Skewed Join
Type = luxury
Income = high
Income = low
Type = necessity
Purchase
Person
The probability that two tuples join with each other can
also be correlated with various attributes.
Join Indicator
R
S
Query: select * from R, S
where R.F = S.K
and R.A = a and S.B = b
P(JF) = prob. randomly chosen tuple from R
joins with a randomly chosen tuple from S
size(Q) = | R | | S | P(JF, a, b)
Statistical Relational Models




Model distribution of attributes across
multiple tables
Allow attribute values to depend on
attributes in the same table (like a
BN)
Allow attribute values to depend on
attributes in other tables along a
foreign key join
Can model the join probability of two
tuples using join indicator variable
Statistical Relational Model


A SRM for a relational database is a
pair(S,θ),which specifies a local
probabilistic model for each of the following
variables
 A variable R.A for each table R and each
attribute A R.*
 A boolean join indicator variable R.JF
For each variable R.X
 S specifies a set of parents Pa(R.X)
 Θ specifies a CPD P(R.X|Pa(R.X))
Example SRM
School
Person
Prestige
Age
Bought-by
Attended
Type
Income
Jschool
Jperson
Type=necessity
false
true
Income = high
0.999, 0.001
false
0.9998, 0.0002
Purchase
0.99, 0.01
true
Universal Foreign Key Closure

Schema: R, S, T.R.F refers
to S, S.F refers to T
stratification: T < S < R
t
s.F2 = t.K
s
r.F1 = s.K
r

Schema: R, S
s1
R.F1 refers to S, R.F2 refers to S
stratification: S < R
r.F1 = s1.K
s2
r.F2 = s2.K
r
Universal Foreign Key Closure

Minimal extension Q+ to a query Q:
Let Q be a keyjoin query over r1,r2,…rk
For each r, if there is an attribute R.A
with parent R.F.B where R.F points to S,
then there is a unique tuple variable s
representing the join r.F=s.K

Proposition: Let Q be a query and let
Q+ be its minimal extension. Then
sizeQ[D]=sizeQ+ [D]
Answering Queries Using SRMs

Construct Query Evaluation BN for Query:
select * from Person, Purchase
where Person.id = Purchase.buyer-id
and Person.Income = high and
Purchase.Type=luxury
School
Person
Prestige
Age
Income
Purchase
Type
Jschool
Compute upward closure of query attributes
by including all parents as well
Jperson
SRM Learning
Strain
Database
Patient
Contact


Learn parameters & qualitative
dependency structure
Extend known techniques for learning
Bayesian networks from data.
Structure selection

Define scoring function:
log-likelihood function
l( θ,S | D)=log P(D | S, θ)
Finding the model that has maximum
log-likelihood given data.

Do the greedy local structure search
Parameter Estimation

The model contains a parameter Θa|x
for each value a of A and each assignment
of values x to X.
Θa|x = P(R.A=a |X=x)
Θa|x = FD(R.A=a,X=x)/FD(X=x)
System Architecture
Query Q
Database
Model
Constructor
Selectivity
Estimator
offline
execution time
Size(Q)
Conclusions
SRM is unique in its ability to handle
select and join operators
 Estimates the high-dimensional joint
distribution using a set of lowerdimensional conditional distributions
To do:
 Incremental maintenance of the SRM
as the database changes
 Joins over non-key attributes

Selected Publications
o “Learning Probabilistic Models of Link Structure”, L. Getoor, N. Friedman, D.
Koller and B. Taskar, JMLR 2002.
o “Probabilistic Models of Text and Link Structure for Hypertext Classification”, L.
Getoor, E. Segal, B. Taskar and D. Koller, IJCAI WS ‘Text Learning: Beyond
Classification’, 2001.
o “Selectivity Estimation using Probabilistic Models”, L. Getoor, B. Taskar and D.
Koller, SIGMOD-01.
o “Learning Probabilistic Relational Models”, L. Getoor, N. Friedman, D. Koller,
and A. Pfeffer, chapter in Relation Data Mining, eds. S. Dzeroski and N.
Lavrac, 2001.
o see also N. Friedman, L. Getoor, D. Koller, and A. Pfeffer, IJCAI-99.
o “Learning Probabilistic Models of Relational Structure”, L. Getoor, N. Friedman,
D. Koller, and B. Taskar, ICML-01.
o “From Instances to Classes in Probabilistic Relational Models”, L. Getoor, D.
Koller and N. Friedman, ICML Workshop on Attribute-Value and Relational
Learning: Crossing the Boundaries, 2000.
o Notes from AAAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2000.
o Notes from IJCAI Workshop on Learning Statistical Models from Relational
Data, eds. L.Getoor and D. Jensen, 2003.
See http://www.cs.umd.edu/~getoor