
CSE 6392 – Data Exploration
and Analysis in Relational
Databases
January 31, 2006
Example Problem
Suppose you had the following tables:
Employee (Gender, Salary)
Employee-Sample (Gender, Salary)
Possible Queries
• Some possible queries to get the average
salary of all females in the company:
1. Select avg(salary) from Employee where gender = “F”
2. Select avg(salary) from Employee-Sample where gender = “F”
3. Select count(*) as C, sum(salary) as S, S/C from Employee-Sample where gender = “F”
• Is there a difference between 2 and 3 in terms
of results? No.
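The claim that queries 2 and 3 return the same result can be checked with a minimal sketch using Python's built-in sqlite3 module. The rows below are hypothetical, and the table is named EmployeeSample (no hyphen) only to avoid quoting the identifier. The division S/C is done in Python here, since referring to the aliases C and S inside the same select list is not portable SQL.

```python
import sqlite3

# Hypothetical sample rows (gender, salary); not data from the slides.
rows = [("F", 50000), ("M", 60000), ("F", 70000), ("M", 55000), ("F", 60000)]

conn = sqlite3.connect(":memory:")
# The slides call this table Employee-Sample; a hyphen-free name avoids quoting.
conn.execute("CREATE TABLE EmployeeSample (gender TEXT, salary REAL)")
conn.executemany("INSERT INTO EmployeeSample VALUES (?, ?)", rows)

# Query 2: let the database compute the average directly.
(avg_direct,) = conn.execute(
    "SELECT avg(salary) FROM EmployeeSample WHERE gender = 'F'"
).fetchone()

# Query 3: fetch the count and the sum, then divide.
c, s = conn.execute(
    "SELECT count(*), sum(salary) FROM EmployeeSample WHERE gender = 'F'"
).fetchone()
avg_from_parts = s / c

print(avg_direct, avg_from_parts)
```

Both values are the same number because avg(salary) is by definition sum(salary) divided by count(*) over the same rows.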
Estimator
• What is an estimator?
– Ex. (count in the sample) * (population size / sample size)
– On the previous slide, 2 and 3 are estimators for 1.
• What is an unbiased estimator?
– Basically, an estimator that is not tilted towards the
lower or higher side of the estimation
• Formally:
– $\hat{x}$ is the estimator for some quantity $x$
– $\hat{x}$ is an unbiased estimator if $E[\hat{x}] = x$.
Unbiased Estimators
• Example
– select count(*) as FC from Employee where
gender = “F”
– select count(*) * (N/n) as EFC from Employee-Sample where gender = “F”
• EFC is an unbiased estimator of FC
• (N/n), the population size over the sample size, is called the ‘ratio scale’
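One way to see what "unbiased" means for EFC is to enumerate every possible sample of a tiny table and average the estimates. The sketch below assumes a hypothetical population of 6 employees and simple random sampling without replacement with n = 3; the slides do not state the sampling scheme, so that part is an assumption.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical population of N = 6 employees (gender column only).
population = ["F", "M", "F", "M", "M", "F"]
N = len(population)
n = 3  # hypothetical sample size

# FC: the exact answer computed on the full table.
true_fc = population.count("F")

# EFC = count(*) * (N/n), computed for every possible sample of size n.
estimates = [
    Fraction(N, n) * sample.count("F")
    for sample in combinations(population, n)
]

# The expected value of EFC is the average over all equally likely samples.
expected_efc = sum(estimates) / len(estimates)

print(true_fc, expected_efc, min(estimates), max(estimates))
```

Individual estimates range from 0 to 6, so any single sample can be off, but their average over all samples equals the true count of 3. That averaging property, not the accuracy of any one sample, is what unbiasedness states.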
Unbiased Estimators (1)
• Example
– select sum(salary) as TFS from Employee
where gender = “F”
– select sum(salary)*(N/n) as ETFS from
Employee-Sample where gender = “F”
• ETFS is an unbiased estimator
• Note: This is important to statisticians, but
secondary for our purposes; we are more
concerned about the error
Unbiased Estimators (2)
• Example
– Select avg(salary) as AFS from Employee
where gender = “F”
– Select count(*) as C, sum(salary) as S,
EAFS=S/C from Employee-Sample where
gender = “F”
• Is EAFS unbiased? Not necessarily. The
ratio of 2 unbiased estimators is not itself
guaranteed to be unbiased (ratio estimation).
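The general point about ratio estimation can be illustrated with a toy case small enough to enumerate. The sketch below is not the EAFS query itself: it assumes a hypothetical 3-row table with two numeric columns x and y, estimates the ratio of their totals from samples of size 2 drawn without replacement, and compares the average of the sample ratios with the true ratio.

```python
from fractions import Fraction
from itertools import combinations

# Hypothetical 3-row table with two numeric columns.
x = [1, 2, 3]
y = [1, 1, 4]
N, n = len(x), 2
scale = Fraction(N, n)  # the N/n scale factor from the earlier slides

true_ratio = Fraction(sum(y), sum(x))

samples = list(combinations(range(N), n))

# Scaled sample sums: each is an unbiased estimator of its column total.
est_y = [scale * sum(y[i] for i in s) for s in samples]
est_x = [scale * sum(x[i] for i in s) for s in samples]

# Ratio of the two unbiased estimators, one value per sample.
ratios = [ey / ex for ey, ex in zip(est_y, est_x)]

mean_est_y = sum(est_y) / len(samples)
mean_est_x = sum(est_x) / len(samples)
expected_ratio = sum(ratios) / len(samples)

print(mean_est_y, mean_est_x, true_ratio, expected_ratio)
```

Here both scaled sums average out to their true totals (6 and 6), yet the ratio averages to 35/36 rather than 1. How large such a bias is, and whether it appears at all, depends on the data and on how the sample is drawn.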
Probability
• Example: roll a die.
How many times will
you get 1, 2, 3, 4, 5 or
6?
[Figure: bar chart with Number on die (1 to 6) on the x-axis; each bar has height 1/6]
Probability Density
• What is the probability that a random number
generator will generate .43 (of numbers between
0 and 1)?
– Answer: 0% (1/infinity)
• What about between .43 and .53?
– Answer: 10% (1/10)
• The total area under the probability density
curve (its integral) = 1.
• Any single number has a 0% probability, but an
interval has a chance.
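The two answers above follow from treating probability as area under the density. A minimal sketch, assuming the random number generator is uniform on [0, 1] so that the density is the constant 1 on that range (the function name is hypothetical):

```python
def uniform_interval_probability(a: float, b: float) -> float:
    """P(a <= X <= b) for X uniform on [0, 1].

    The density is 1 on [0, 1], so the probability is simply the length
    of the part of [a, b] that lies inside [0, 1].
    """
    lo, hi = max(a, 0.0), min(b, 1.0)
    return max(hi - lo, 0.0)


single_point = uniform_interval_probability(0.43, 0.43)  # zero-width interval
interval = uniform_interval_probability(0.43, 0.53)      # width 0.1
whole_range = uniform_interval_probability(0.0, 1.0)     # the full density

print(single_point, round(interval, 10), whole_range)
```

A single point has zero width and therefore zero area, the interval from .43 to .53 has area 0.1, and the whole range has area 1.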
Probability Density Function
Proper distribution if integral = 1


Probability Example
• How many female
employees (out of
50K employees)?
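[Figure: normal distribution curve. y-axis: probability that sample process will give this number; x-axis: number of female employees, running from 0 to 50K, with 9K, 10K and 11K marked around the actual value.]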
Probability Sample
• If we sampled another company where the
actual number of females is 5K, the variance
would decrease.
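The drop in variance can be worked through numerically. This is a sketch under stated assumptions: both companies have 50K employees, the sample has a hypothetical size of n = 500, and rows are drawn independently (with replacement), so the number of females in the sample is binomial and the variance of the scaled count is (N/n)^2 * n * p * (1 - p).

```python
import math


def efc_variance(N: int, n: int, true_count: int) -> float:
    """Variance of count(*) * (N/n), assuming n independent draws."""
    p = true_count / N                   # fraction of females in the company
    sample_count_var = n * p * (1 - p)   # binomial variance of the sample count
    return (N / n) ** 2 * sample_count_var


N = 50_000
n = 500  # hypothetical sample size, not given in the slides

var_10k = efc_variance(N, n, 10_000)
var_5k = efc_variance(N, n, 5_000)

print(round(var_10k), round(var_5k))
print(round(math.sqrt(var_10k)), round(math.sqrt(var_5k)))
```

Under these assumptions the variance falls from about 800,000 to about 450,000. The standard deviation relative to the true count moves the other way (roughly 9% of 10K versus roughly 13% of 5K), which is why absolute and relative error can tell different stories.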
Relative Error
• In Approximate Query Processing, people
use absolute error statistically, but relative
error practically.
$\text{relative error}^2 = \dfrac{(\mathit{ETFC} - \mathit{TFC})^2}{\mathit{TFC}^2}$
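The formula can be evaluated directly. In the sketch below the function name and the input values are hypothetical; the estimates are chosen so that the absolute error is the same in both cases.

```python
def relative_error_squared(estimate: float, truth: float) -> float:
    """(estimate - truth)^2 / truth^2, the squared relative error."""
    return (estimate - truth) ** 2 / truth ** 2


# Same absolute error of 1,000 against two different true values.
large_truth = relative_error_squared(11_000, 10_000)
small_truth = relative_error_squared(6_000, 5_000)

print(round(large_truth, 6), round(small_truth, 6))
```

An absolute error of 1,000 is a 10% relative error when the true value is 10,000 (squared: 0.01) but a 20% relative error when the true value is 5,000 (squared: 0.04).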
Central Limit Theorem
• The main point of this theorem is that it
does not matter how the data was originally
distributed: for a large enough sample, the
distribution of the sample mean (or sum) will be
approximately normal.
• Normal distribution:
$f(x) = \dfrac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$
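A small simulation can make the theorem concrete. The sketch below draws from an exponential distribution, chosen here only because it is clearly not normal, and checks that sample means behave like a normal variable: about 68% of them should land within one standard error of the true mean. The sample size, trial count and seed are arbitrary choices. It also checks numerically that the normal density integrates to about 1.

```python
import math
import random


def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Normal density with mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (
        sigma * math.sqrt(2 * math.pi)
    )


rng = random.Random(0)   # fixed seed so the run is repeatable
sample_size = 50         # arbitrary
trials = 2000            # arbitrary

# Exponential with rate 1 is skewed; its mean and standard deviation are both 1.
means = [
    sum(rng.expovariate(1.0) for _ in range(sample_size)) / sample_size
    for _ in range(trials)
]

# Standard error of the mean: sigma / sqrt(sample size).
std_error = 1.0 / math.sqrt(sample_size)
within_one = sum(abs(m - 1.0) <= std_error for m in means) / trials

# Riemann sum of the standard normal density over [-8, 8].
step = 0.001
area = sum(normal_pdf(-8 + i * step, 0.0, 1.0) for i in range(16000)) * step

print(round(within_one, 2), round(area, 6))
```

Even though each individual draw is strongly skewed, the fraction of sample means within one standard error should come out close to the 68% expected of a normal distribution.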