Towards Estimating the Number of Distinct Values
Download
Report
Transcript Towards Estimating the Number of Distinct Values
Towards Estimating the Number of Distinct Value
Combinations for a Set of Attributes
Xiaohui Yu1, Calisto Zuzarte2, Ken Sevcik1
1University of Toronto
2IBM Toronto Lab
[email protected]
Distinct value combinations
1
2
3
Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
3 distinct value combinations
COLSCARD (COlumn Set CARDinality) = 3
The problem: estimating COLSCARD for a given set of attributes
November 3, 2005
CIKM 2005
2
Motivation
Cardinality estimation for query
optimization, e.g.,
Estimating the size of ( country,city) Hotel
Estimating the size of the aggregation
SELECT sales_date, sales_person,
SUM(sales_quantity) AS unit_sold
FROM sales
GROUP BY sales_date, sales_person
Approximate query answering, e.g., COUNT
queries
November 3, 2005
CIKM 2005
3
Roadmap
Related work
Estimation with known marginal
distributions
Upper/lower bounds
An estimator
Estimation with histograms
Experimental results
Conclusions
November 3, 2005
CIKM 2005
4
Related work
Previous work has focused on the
case of single attribute.
[HÖT88],[HÖT89],[HNSS’95],[HS’98],[CCMN’00]
Sampling approach is used.
Estimation through sampling is difficult
[CCMN’00]
No existing statistical information is
exploited.
November 3, 2005
CIKM 2005
5
Our solution
Considering multiple-attributes
Utilizing existing statistics on individual
attributes
Readily available in most database systems
Does not require access to the data
Granularity of statistics
Exact marginal frequency distributions
Approximate distributions: histograms etc.
November 3, 2005
CIKM 2005
6
Estimation with known marginals
Number of distinct values in attribute Ai,
di (i 1,2,..., m)
frequency vector fi ( f i1 , f i 2 ,..., f id ),
i
Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
di
j 1
f ij 1
f1 (0.6,0.4) f 2 (0.4,0.2,0.4)f3 (0.2,0.2,0.2,0.2,0.2)
November 3, 2005
CIKM 2005
7
The naïve estimator
COLSCARD = min
m
i 1
Number of possible
value combinations
di: the number of distinct
values in attribute Ai
di , N
Sanity bound:
COLSCARD cannot be
greater than the table size
The problem: Some value combinations with low
occurrence probabilities may not appear in the table!
November 3, 2005
CIKM 2005
8
Upper/Lower bounds
Trivial bounds
m
di , N
Upper bound: min
i 1
naïve estimator)
Lower bound: max d 1, d 2 ,..., d m
(the
Tighter bounds?
In the case of two attributes, tighter bounds
are available.
November 3, 2005
CIKM 2005
9
Tighter bounds
A2
A1
N = 10
value freq
a 1
b 1
c 8
Naïve bounds: 3, 9
value freq
1
1
[2, 3]
d
e
f
4
4
2
Lower bound = 2+1+1 = 4
Upper bound = 3+1+1 = 5
November 3, 2005
CIKM 2005
10
Expected number of combinations
Assumptions
1. The data distributions of individual columns are
independent
2. The occurrence of each combination in the table
is independent
f f1 f 2 f m
Each element of f represents the
frequency of a specific value combination.
An estimate of the probability of occurrence
November 3, 2005
CIKM 2005
11
Estimator
The probability of the i-th combination
not appearing in a particular tuple is
(1 f i )
The probability of the i-th combination
not appearing in the table (of size N) is
(1 f i ) N
The expected number of value
combinations is
E[COLSCARD ] M (1 f i )
N
( M j 1 d j )
m
i
November 3, 2005
CIKM 2005
12
Example revisited
Estimate the COLSCARD for attribute set (A1, A2, A3),
given
f1 (0.1,0.3,0.6) f2 (0.01,0.99) f3 (0.05,0.95) N 100
Naïve estimate: 3*2*2 = 12
f f1 f 2 f3 (0.00005,0.00095,0.00495,0.09405,
0.00015,0.00285,0.01485,0.28215,
0.00030,0.00570,0.02970,0.05643)
New estimate: 5.94
November 3, 2005
CIKM 2005
13
Roadmap
Related work
Estimation with known marginal
distributions
Upper/lower bounds
An estimator
Estimation with histograms
Experimental results
Conclusions
November 3, 2005
CIKM 2005
14
Estimation with histograms
Histograms exist on individual
attributes
Two classes of histograms
Partition-based
End-biased
Marginals can be (approximately)
reconstructed from histograms
Optimal histograms in each class?
November 3, 2005
CIKM 2005
15
Optimal histograms
Minimizing the error incurred by
histograms
ERR = |ESThist – ESTexact|
Partition-based histograms
A dynamic programming algorithm
similar to that for V-optimal histogram
construction [Jagadish et al. 98] can be
used.
November 3, 2005
CIKM 2005
16
Optimal end-biased histograms
An end-biased histogram with B buckets
stores
The exact frequencies of B-1 attribute values
The average of the remaining values
Which B-1 values to store exactly?
Most widely used end-biased histograms
store the frequencies of the most frequent
values
Not always optimal for COLSCARD estimation!!
November 3, 2005
CIKM 2005
17
Example
Attributes (A1, A2)
f1 (0.1,0.9)
N=10
case 1: f2 (0.01, 0.02, 0.03, 0.94)
case 2 : f 2' (0.01, 0.29, 0.31, 0.39)
Choose 1 frequency to store exactly
Error
table
Index of the frequency stored
f2
f 2'
November 3, 2005
1
2
3
4
1.68
2.01
2.17
0.15
0.01
1.10
1.09
1.02
CIKM 2005
18
Optimal end-biased histograms
Exhaustive search takes time proportional
to CdBj1
We prove that the optimal choices can be
one of the following
Most frequent values
Least frequent values
A combination of most frequent and least
frequent values
Only need to search both ends
Cost is linear in B, independent of dj!
November 3, 2005
CIKM 2005
19
Roadmap
Related work
Estimation with known marginal
distributions
Upper/lower bounds
An estimator
Estimation with histograms
Experimental results
Conclusions
November 3, 2005
CIKM 2005
20
Experiments – Data sets
Synthetic data
Skew: Zipfian parameter z=0 (uniform) to 4
(highly skewed)
Number of tuples: 10K to 1M
Real data
Cover Type: 581,012 tuples, 10 attributes
Census Income: 32,561 tuples, 14 attributes
Error measure: ratio error
ERR = max{true/est-1, est/true-1}
November 3, 2005
CIKM 2005
21
Effect of data skew
9
N=100K
8
7
6
ERR
5
4
3
2
1
0
z1 = 0,
z2=0
z1 = 0,
z2=2
z1 = 0,
z2=4
Proposed estimator 0.000237 0.000933 0.000982
Naive estimator
0.0516
6.5171
5.9423
z1 = 4,
z2=4
0.0654
8.4921
di=1k
November 3, 2005
CIKM 2005
22
Effect of number of tuples
0.14
z=0
z=2
z=4
0.12
ERR
0.1
0.08
0.06
0.04
0.02
0
1000
10000
100000
1000000
N
November 3, 2005
CIKM 2005
23
Results on real data
(b) Census Income
21
10
(a) Cover Type
2
5
3
19
4
59
31
ERR≤0.05
0.05<ERR≤0.1
0.1<ERR≤0.5
45 pairs
November 3, 2005
0.5<ERR≤1
ERR>1
91 pairs
CIKM 2005
24
Accuracy of end-biased histograms
0.3
0.25
ERR
0.2
0.15
0.1
0.05
0
10
20
30
50
Number of buckets
Results on the “capital-gain” attribute of Census Income data set
November 3, 2005
CIKM 2005
25
Conclusions
Utilizing existing knowledge
maintained in database systems
Proposed upper/lower bounds as well
as an estimator
Considered two cases
exact marginal frequencies
Histograms: optimal histograms
Experimental results show the
effectiveness of the proposed method
November 3, 2005
CIKM 2005
26