Towards Estimating the Number of Distinct Values

Download Report

Transcript Towards Estimating the Number of Distinct Values

Towards Estimating the Number of Distinct Value
Combinations for a Set of Attributes
Xiaohui Yu1, Calisto Zuzarte2, Ken Sevcik1
1University of Toronto
2IBM Toronto Lab
[email protected]
Distinct value combinations
1
2
3
Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
3 distinct value combinations
COLSCARD (COlumn Set CARDinality) = 3
The problem: estimating COLSCARD for a given set of attributes
November 3, 2005
CIKM 2005
2
Motivation
 Cardinality estimation for query
optimization, e.g.,
 Estimating the size of  ( country,city) Hotel
 Estimating the size of the aggregation
SELECT sales_date, sales_person,
SUM(sales_quantity) AS unit_sold
FROM sales
GROUP BY sales_date, sales_person
 Approximate query answering, e.g., COUNT
queries
November 3, 2005
CIKM 2005
3
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
4
Related work
 Previous work has focused on the
case of single attribute.
 [HÖT88],[HÖT89],[HNSS’95],[HS’98],[CCMN’00]
 Sampling approach is used.
 Estimation through sampling is difficult
[CCMN’00]
 No existing statistical information is
exploited.
November 3, 2005
CIKM 2005
5
Our solution
 Considering multiple-attributes
 Utilizing existing statistics on individual
attributes
 Readily available in most database systems
 Does not require access to the data
 Granularity of statistics
 Exact marginal frequency distributions
 Approximate distributions: histograms etc.
November 3, 2005
CIKM 2005
6
Estimation with known marginals
 Number of distinct values in attribute Ai,
di (i  1,2,..., m)
 frequency vector fi  ( f i1 , f i 2 ,..., f id ),
i

Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
di
j 1

f ij  1
f1  (0.6,0.4) f 2  (0.4,0.2,0.4)f3  (0.2,0.2,0.2,0.2,0.2)
November 3, 2005
CIKM 2005
7
The naïve estimator
COLSCARD = min

m
i 1
Number of possible
value combinations
di: the number of distinct
values in attribute Ai
di , N

Sanity bound:
COLSCARD cannot be
greater than the table size
The problem: Some value combinations with low
occurrence probabilities may not appear in the table!
November 3, 2005
CIKM 2005
8
Upper/Lower bounds
 Trivial bounds

m

di , N
 Upper bound: min
i 1
naïve estimator)
 Lower bound: max d 1, d 2 ,..., d m 
(the
 Tighter bounds?
 In the case of two attributes, tighter bounds
are available.
November 3, 2005
CIKM 2005
9
Tighter bounds
A2
A1
N = 10
value freq
a 1
b 1
c 8
Naïve bounds: 3, 9
value freq
1
1
[2, 3]
d
e
f
4
4
2
Lower bound = 2+1+1 = 4
Upper bound = 3+1+1 = 5
November 3, 2005
CIKM 2005
10
Expected number of combinations
 Assumptions
1. The data distributions of individual columns are
independent
2. The occurrence of each combination in the table
is independent
 f  f1  f 2    f m
 Each element of f represents the
frequency of a specific value combination.
 An estimate of the probability of occurrence
November 3, 2005
CIKM 2005
11
Estimator
 The probability of the i-th combination
not appearing in a particular tuple is
(1  f i )
 The probability of the i-th combination
not appearing in the table (of size N) is
(1  f i ) N
 The expected number of value
combinations is
E[COLSCARD ]  M   (1  f i )
N
( M   j 1 d j )
m
i
November 3, 2005
CIKM 2005
12
Example revisited
 Estimate the COLSCARD for attribute set (A1, A2, A3),
given
f1  (0.1,0.3,0.6) f2  (0.01,0.99) f3  (0.05,0.95) N  100
Naïve estimate: 3*2*2 = 12
f  f1  f 2  f3  (0.00005,0.00095,0.00495,0.09405,
0.00015,0.00285,0.01485,0.28215,
0.00030,0.00570,0.02970,0.05643)
New estimate: 5.94
November 3, 2005
CIKM 2005
13
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
14
Estimation with histograms
 Histograms exist on individual
attributes
 Two classes of histograms
 Partition-based
 End-biased
 Marginals can be (approximately)
reconstructed from histograms
 Optimal histograms in each class?
November 3, 2005
CIKM 2005
15
Optimal histograms
 Minimizing the error incurred by
histograms
 ERR = |ESThist – ESTexact|
 Partition-based histograms
 A dynamic programming algorithm
similar to that for V-optimal histogram
construction [Jagadish et al. 98] can be
used.
November 3, 2005
CIKM 2005
16
Optimal end-biased histograms
 An end-biased histogram with B buckets
stores
 The exact frequencies of B-1 attribute values
 The average of the remaining values
 Which B-1 values to store exactly?
 Most widely used end-biased histograms
store the frequencies of the most frequent
values
 Not always optimal for COLSCARD estimation!!
November 3, 2005
CIKM 2005
17
Example
Attributes (A1, A2)
f1  (0.1,0.9)
N=10
case 1: f2  (0.01, 0.02, 0.03, 0.94)
case 2 : f 2'  (0.01, 0.29, 0.31, 0.39)
Choose 1 frequency to store exactly
Error
table
Index of the frequency stored
f2
f 2'
November 3, 2005
1
2
3
4
1.68
2.01
2.17
0.15
0.01
1.10
1.09
1.02
CIKM 2005
18
Optimal end-biased histograms
 Exhaustive search takes time proportional
to CdBj1
 We prove that the optimal choices can be
one of the following
 Most frequent values
 Least frequent values
 A combination of most frequent and least
frequent values
 Only need to search both ends
 Cost is linear in B, independent of dj!
November 3, 2005
CIKM 2005
19
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
20
Experiments – Data sets
 Synthetic data
 Skew: Zipfian parameter z=0 (uniform) to 4
(highly skewed)
 Number of tuples: 10K to 1M
 Real data
 Cover Type: 581,012 tuples, 10 attributes
 Census Income: 32,561 tuples, 14 attributes
 Error measure: ratio error
 ERR = max{true/est-1, est/true-1}
November 3, 2005
CIKM 2005
21
Effect of data skew
9
N=100K
8
7
6
ERR
5
4
3
2
1
0
z1 = 0,
z2=0
z1 = 0,
z2=2
z1 = 0,
z2=4
Proposed estimator 0.000237 0.000933 0.000982
Naive estimator
0.0516
6.5171
5.9423
z1 = 4,
z2=4
0.0654
8.4921
di=1k
November 3, 2005
CIKM 2005
22
Effect of number of tuples
0.14
z=0
z=2
z=4
0.12
ERR
0.1
0.08
0.06
0.04
0.02
0
1000
10000
100000
1000000
N
November 3, 2005
CIKM 2005
23
Results on real data
(b) Census Income
21
10
(a) Cover Type
2
5
3
19
4
59
31
ERR≤0.05
0.05<ERR≤0.1
0.1<ERR≤0.5
45 pairs
November 3, 2005
0.5<ERR≤1
ERR>1
91 pairs
CIKM 2005
24
Accuracy of end-biased histograms
0.3
0.25
ERR
0.2
0.15
0.1
0.05
0
10
20
30
50
Number of buckets
Results on the “capital-gain” attribute of Census Income data set
November 3, 2005
CIKM 2005
25
Conclusions
 Utilizing existing knowledge
maintained in database systems
 Proposed upper/lower bounds as well
as an estimator
 Considered two cases
 exact marginal frequencies
 Histograms: optimal histograms
 Experimental results show the
effectiveness of the proposed method
November 3, 2005
CIKM 2005
26