Towards Estimating the Number of Distinct Values

Transcript Towards Estimating the Number of Distinct Values

Towards Estimating the Number of Distinct Value
Combinations for a Set of Attributes
Xiaohui Yu1, Calisto Zuzarte2, Ken Sevcik1
1University of Toronto
2IBM Toronto Lab
[email protected]
Distinct value combinations
1
2
3
Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
3 distinct value combinations
COLSCARD (COlumn Set CARDinality) = 3
The problem: estimating COLSCARD for a given set of attributes
November 3, 2005
CIKM 2005
2
Motivation
 Cardinality estimation for query
optimization, e.g.,
 Estimating the size of  ( country,city) Hotel
 Estimating the size of the aggregation
SELECT sales_date, sales_person,
SUM(sales_quantity) AS unit_sold
FROM sales
GROUP BY sales_date, sales_person
 Approximate query answering, e.g., COUNT
queries
November 3, 2005
CIKM 2005
3
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
4
Related work
 Previous work has focused on the
case of single attribute.
 [HÖT88],[HÖT89],[HNSS’95],[HS’98],[CCMN’00]
 Sampling approach is used.
 Estimation through sampling is difficult
[CCMN’00]
 No existing statistical information is
exploited.
November 3, 2005
CIKM 2005
5
Our solution
 Considering multiple-attributes
 Utilizing existing statistics on individual
attributes
 Readily available in most database systems
 Does not require access to the data
 Granularity of statistics
 Exact marginal frequency distributions
 Approximate distributions: histograms etc.
November 3, 2005
CIKM 2005
6
Estimation with known marginals
 Number of distinct values in attribute Ai,
di (i  1,2,..., m)
 frequency vector fi  ( f i1 , f i 2 ,..., f id ),
i

Country
City
Hotel Name
Germany
Bremen
Hilton
Germany
Bremen
Best Western
Germany
Frankfurt
InterCity
Canada
Toronto
Four Seasons
Canada
Toronto
Intercontinental
di
j 1

f ij  1
f1  (0.6,0.4) f 2  (0.4,0.2,0.4)f3  (0.2,0.2,0.2,0.2,0.2)
November 3, 2005
CIKM 2005
7
The naïve estimator
COLSCARD = min

m
i 1
Number of possible
value combinations
di: the number of distinct
values in attribute Ai
di , N

Sanity bound:
COLSCARD cannot be
greater than the table size
The problem: Some value combinations with low
occurrence probabilities may not appear in the table!
November 3, 2005
CIKM 2005
8
Upper/Lower bounds
 Trivial bounds

m

di , N
 Upper bound: min
i 1
naïve estimator)
 Lower bound: max d 1, d 2 ,..., d m 
(the
 Tighter bounds?
 In the case of two attributes, tighter bounds
are available.
November 3, 2005
CIKM 2005
9
Tighter bounds
A2
A1
N = 10
value freq
a 1
b 1
c 8
Naïve bounds: 3, 9
value freq
1
1
[2, 3]
d
e
f
4
4
2
Lower bound = 2+1+1 = 4
Upper bound = 3+1+1 = 5
November 3, 2005
CIKM 2005
10
Expected number of combinations
 Assumptions
1. The data distributions of individual columns are
independent
2. The occurrence of each combination in the table
is independent
 f  f1  f 2    f m
 Each element of f represents the
frequency of a specific value combination.
 An estimate of the probability of occurrence
November 3, 2005
CIKM 2005
11
Estimator
 The probability of the i-th combination
not appearing in a particular tuple is
(1  f i )
 The probability of the i-th combination
not appearing in the table (of size N) is
(1  f i ) N
 The expected number of value
combinations is
E[COLSCARD ]  M   (1  f i )
N
( M   j 1 d j )
m
i
November 3, 2005
CIKM 2005
12
Example revisited
 Estimate the COLSCARD for attribute set (A1, A2, A3),
given
f1  (0.1,0.3,0.6) f2  (0.01,0.99) f3  (0.05,0.95) N  100
Naïve estimate: 3*2*2 = 12
f  f1  f 2  f3  (0.00005,0.00095,0.00495,0.09405,
0.00015,0.00285,0.01485,0.28215,
0.00030,0.00570,0.02970,0.05643)
New estimate: 5.94
November 3, 2005
CIKM 2005
13
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
14
Estimation with histograms
 Histograms exist on individual
attributes
 Two classes of histograms
 Partition-based
 End-biased
 Marginals can be (approximately)
reconstructed from histograms
 Optimal histograms in each class?
November 3, 2005
CIKM 2005
15
Optimal histograms
 Minimizing the error incurred by
histograms
 ERR = |ESThist – ESTexact|
 Partition-based histograms
 A dynamic programming algorithm
similar to that for V-optimal histogram
construction [Jagadish et al. 98] can be
used.
November 3, 2005
CIKM 2005
16
Optimal end-biased histograms
 An end-biased histogram with B buckets
stores
 The exact frequencies of B-1 attribute values
 The average of the remaining values
 Which B-1 values to store exactly?
 Most widely used end-biased histograms
store the frequencies of the most frequent
values
 Not always optimal for COLSCARD estimation!!
November 3, 2005
CIKM 2005
17
Example
Attributes (A1, A2)
f1  (0.1,0.9)
N=10
case 1: f2  (0.01, 0.02, 0.03, 0.94)
case 2 : f 2'  (0.01, 0.29, 0.31, 0.39)
Choose 1 frequency to store exactly
Error
table
Index of the frequency stored
f2
f 2'
November 3, 2005
1
2
3
4
1.68
2.01
2.17
0.15
0.01
1.10
1.09
1.02
CIKM 2005
18
Optimal end-biased histograms
 Exhaustive search takes time proportional
to CdBj1
 We prove that the optimal choices can be
one of the following
 Most frequent values
 Least frequent values
 A combination of most frequent and least
frequent values
 Only need to search both ends
 Cost is linear in B, independent of dj!
November 3, 2005
CIKM 2005
19
Roadmap
 Related work
 Estimation with known marginal
distributions
 Upper/lower bounds
 An estimator
 Estimation with histograms
 Experimental results
 Conclusions
November 3, 2005
CIKM 2005
20
Experiments – Data sets
 Synthetic data
 Skew: Zipfian parameter z=0 (uniform) to 4
(highly skewed)
 Number of tuples: 10K to 1M
 Real data
 Cover Type: 581,012 tuples, 10 attributes
 Census Income: 32,561 tuples, 14 attributes
 Error measure: ratio error
 ERR = max{true/est-1, est/true-1}
November 3, 2005
CIKM 2005
21
Effect of data skew
9
N=100K
8
7
6
ERR
5
4
3
2
1
0
z1 = 0,
z2=0
z1 = 0,
z2=2
z1 = 0,
z2=4
Proposed estimator 0.000237 0.000933 0.000982
Naive estimator
0.0516
6.5171
5.9423
z1 = 4,
z2=4
0.0654
8.4921
di=1k
November 3, 2005
CIKM 2005
22
Effect of number of tuples
0.14
z=0
z=2
z=4
0.12
ERR
0.1
0.08
0.06
0.04
0.02
0
1000
10000
100000
1000000
N
November 3, 2005
CIKM 2005
23
Results on real data
(b) Census Income
21
10
(a) Cover Type
2
5
3
19
4
59
31
ERR≤0.05
0.05<ERR≤0.1
0.1<ERR≤0.5
45 pairs
November 3, 2005
0.5<ERR≤1
ERR>1
91 pairs
CIKM 2005
24
Accuracy of end-biased histograms
0.3
0.25
ERR
0.2
0.15
0.1
0.05
0
10
20
30
50
Number of buckets
Results on the “capital-gain” attribute of Census Income data set
November 3, 2005
CIKM 2005
25
Conclusions
 Utilizing existing knowledge
maintained in database systems
 Proposed upper/lower bounds as well
as an estimator
 Considered two cases
 exact marginal frequencies
 Histograms: optimal histograms
 Experimental results show the
effectiveness of the proposed method
November 3, 2005
CIKM 2005
26

Towards Estimating the Number of Distinct Values

Transcript Towards Estimating the Number of Distinct Values

Directory