Empirical Methods in Computer Science
Statistical Methods in
Computer Science
Hypothesis Testing I:
Treatment experiment designs
Ido Dagan
Hypothesis Testing: Intro
When setting up experiments:
Goal: To assess falsifying hypotheses
E.g: treatment has no effect
Goal fails =>
falsifying hypothesis not true (unlikely) =>
our theory survives
Falsifying hypothesis is called null hypothesis, marked H0
We want to check whether the observed result would be unlikely if H0 were true.
Empirical Methods in Computer Science
© 2006-now Gal Kaminka/Ido Dagan
Comparison Hypothesis Testing
A very simple design: treatment experiment
Also known as a lesion study / ablation test
Two populations: control & treatment (finite or infinite)
Assuming they are identical, except for the independent variable
treatment: Ind1 & Ex1 & Ex2 & .... & Exn ==> Dep1
control: Ind0 & Ex1 & Ex2 & .... & Exn ==> Dep2
Treatment condition: Categorical independent variable
What are possible hypotheses?
Hypotheses for a Treatment Experiment
H1: Treatment has effect
H0: Treatment has no effect
Any effect is due to chance
But how do we measure effect?
We know of different ways to characterize data:
Central tendency: mean, median, mode, ....
Dispersion measures (variance, interquartile range, std. dev)
Shape (e.g., kurtosis)
Hypotheses for a Treatment Experiment
H1: Treatment has effect
H0: Treatment has no effect
Any effect is due to chance
Transformed into:
H1: Treatment changes mean of population
H0: Treatment does not change mean of population
Any effect is due to chance
Hypotheses for a Treatment Experiment
H1: Treatment has effect
H0: Treatment has no effect
Any effect is due to chance
Transformed into:
H1: Treatment changes variance of population
H0: Treatment does not change variance of population
Any effect is due to chance
Hypotheses for a Treatment Experiment
H1: Treatment has effect
H0: Treatment has no effect
Any effect is due to chance
Transformed into:
H1: Treatment changes shape of population
H0: Treatment does not change shape of population
Any effect is due to chance
Chance Results for Samples
The problem:
Suppose first we know the mean of control population and sample
the treatment population
We find
mean treatment results = 0.7
mean control = 0.5
How do we know there is a real difference?
Difference could be due to chance – because we measure the value from a
sample and not from the population
In treatment experiment: two populations, null hypothesis H0 states that their
means are equal
The key question:
What is the probability of getting 0.7 in a sample from treatment
population given H0 ?
If low, then we can reject H0
One sample testing: Basics
We begin with a simple case
We are given a known control population P
For example: life expectancy for patients (w/o treatment)
Known parameters (e.g. known mean)
Now we sample the treatment population
Mean = Mt
The question: Was the mean Mt drawn by chance from a population
which behaves the same (mean, variance, ...) as the control
population?
To answer this, must know:
What is the sampling distribution of the mean of P?
Sampling Distributions
Suppose given P we repeat the following:
Draw N sample points, calculate mean M1
Draw N sample points, calculate mean M2
.....
Draw N sample points, calculate mean Mn
The collection of means forms a distribution, too:
The sampling distribution of the mean
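The repeated-draw procedure above can be sketched in a few lines. This is only an illustration: the population (the integers 0..99), the sample size, the number of repetitions, and the function name `sampling_distribution_of_mean` are all assumptions, not part of the slides.

```python
import random
import statistics


def sampling_distribution_of_mean(population, n, repeats, seed=0):
    """Draw `repeats` samples of size `n` and return the list of their means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    means = []
    for _ in range(repeats):
        # Sample with replacement from the population
        sample = [rng.choice(population) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means


# Hypothetical population: the integers 0..99 (population mean 49.5)
population = list(range(100))
means = sampling_distribution_of_mean(population, n=25, repeats=2000)

print(statistics.mean(means))   # centre of the sampling distribution
print(statistics.stdev(means))  # spread of the sample means
```

The list `means` is itself a data set, so it can be summarized or plotted like any other distribution.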
Central Limit Theorem
The sampling distribution of the mean of samples of size N,
of a population with mean M and std. dev. S:
1. Approaches a normal distribution as N increases,
for which:
2. Mean = M
3. Standard Deviation = $\frac{S}{\sqrt{N}}$
This is called the standard error of the sample mean
Regardless of shape of underlying population
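A quick way to see the theorem at work is to sample from a clearly non-normal population and compare the observed spread of the sample means with S divided by the square root of N. The sketch below assumes an exponential population with rate 1 (so M = 1 and S = 1); the sample size and repetition count are arbitrary choices for illustration.

```python
import math
import random
import statistics

rng = random.Random(1)  # fixed seed for reproducibility

# Assumed skewed population: exponential with rate 1 (mean M = 1, std. dev. S = 1)
M, S = 1.0, 1.0
N = 50          # size of each sample
REPEATS = 4000  # number of samples drawn

sample_means = [
    statistics.mean([rng.expovariate(1.0) for _ in range(N)])
    for _ in range(REPEATS)
]

observed_mean = statistics.mean(sample_means)
observed_se = statistics.stdev(sample_means)
predicted_se = S / math.sqrt(N)  # standard error predicted by the theorem

print(observed_mean, observed_se, predicted_se)
```

Even though the population is strongly skewed, the observed values should land close to M and to the predicted standard error.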
So? Why should we care?
We can now examine the likelihood of obtaining the
observed sample mean for the known population
If it is “too unlikely”, then we can reject the null hypothesis
e.g., if likelihood that the mean is due to chance is less than 5%.
The process:
We are given a control population C
Mean Mc and standard deviation Sc
A sample of the treatment population
sample size N, mean Mt and standard deviation St
If Mt is sufficiently different from Mc then we can reject the null hypothesis
Z-test by example
We are given:
Control mean Mc = 1, std. dev. = 0.948
Treatment N=25, Mt = 2.8
We compute:
Standard error = 0.948/5 = 0.19
Z score of Mt = (2.8-population-mean-given-H0)/0.19
= (2.8-1)/0.19 = 9.47
Now we compute the percentile rank of 9.47
This sets the probability of receiving Mt of 2.8 or higher by chance
Under the assumption that the real mean is 1.
Notice: the z-score has standard normal distribution:
Sample mean is normally distributed, subtracted/divided by
constants to obtain Z (maintaining normality): Mean=0, stdev=1.
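The worked example can be reproduced with the standard library. The helper name `one_sample_z` is an assumption made for this sketch. The code keeps the unrounded standard error (0.1896), so it gives a z-score of about 9.49 rather than the 9.47 obtained above with the standard error rounded to 0.19.

```python
import math
from statistics import NormalDist


def one_sample_z(sample_mean, pop_mean, pop_std, n):
    """Return (z, one-tailed p-value) under H0: true mean == pop_mean."""
    standard_error = pop_std / math.sqrt(n)
    z = (sample_mean - pop_mean) / standard_error
    # Probability of a sample mean this high or higher if H0 is true
    p_upper = 1.0 - NormalDist().cdf(z)
    return z, p_upper


z, p = one_sample_z(sample_mean=2.8, pop_mean=1.0, pop_std=0.948, n=25)
print(round(z, 2), p)
```

A z-score this large leaves essentially no probability in the upper tail, so H0 would be rejected at any conventional level.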
One- and two-tailed hypotheses
The Z-test computes the percentile rank of the sample mean
Using percentile table for standard normal distribution
Assumption: drawn from sampling distribution of control population
What kind of null hypotheses are rejected?
Determined by research question in advance
One-tailed hypothesis testing:
H0: Mt = Mc
H1: Mt > Mc
If we receive Z >= 1.645, reject H0
The treatment population mean is most likely higher than Mc, to explain Mt
[Figure: standard normal curve with Z=0 at P50; 95% of the population lies below Z=1.645 (P95)]
One- and two-tailed hypotheses
What kind of null hypotheses are rejected?
Two-tailed hypothesis testing:
H0: Mt = Mc
H1: Mt != Mc
If we receive Z >= 1.96, reject H0.
If we receive Z <= -1.96, reject H0.
[Figure: standard normal curve with Z=0 at P50; the central 95% of the population lies between Z=-1.96 (P2.5) and Z=1.96 (P97.5)]
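The cutoffs 1.645 and 1.96 are percentiles of the standard normal distribution, so they can be recomputed instead of read from a table. A minimal sketch, assuming a significance level of 0.05; the helper names `critical_values` and `reject_h0` are illustrative.

```python
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()  # mean 0, std. dev. 1


def critical_values(alpha=0.05):
    """Return (one-tailed cutoff, two-tailed cutoff) for significance level alpha."""
    one_tailed = STANDARD_NORMAL.inv_cdf(1.0 - alpha)        # P95 for alpha = 0.05
    two_tailed = STANDARD_NORMAL.inv_cdf(1.0 - alpha / 2.0)  # P97.5 for alpha = 0.05
    return one_tailed, two_tailed


def reject_h0(z, alpha=0.05, two_tailed=True):
    """Apply the rejection rule for H1: Mt != Mc (two-tailed) or H1: Mt > Mc (one-tailed)."""
    one, two = critical_values(alpha)
    if two_tailed:
        return abs(z) >= two
    return z >= one


one, two = critical_values()
print(round(one, 3), round(two, 3))  # 1.645 1.96
```

Note that a z-score such as 1.8 is rejected by the one-tailed test but not by the two-tailed one, which is why the choice has to be fixed by the research question in advance.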
Testing Errors
The decision to reject the null hypothesis H0 may lead to errors
Type I error: Rejecting H0 though it is true (false positive)
Type II error: Failing to reject H0 though it is false (false negative)
Classification perspective of false/true-positive/negative
We are worried about the probability of these errors (upper bounds)
$\alpha = \Pr(\text{Type I error})$
$\beta = \Pr(\text{Type II error})$
Normally, alpha is set to 0.05 or 0.01.
This is our rejection criterion for H0 (usually the focus of significance tests)
1-beta is the power of the test (its sensitivity)
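For a one-tailed z-test, the relation between alpha, beta, and power can be computed directly. The sketch below assumes a known population std. dev.; the function name `one_tailed_power` and the numbers (means 0.5 and 0.7, std. dev. 1, N = 25) are hypothetical values chosen for illustration.

```python
import math
from statistics import NormalDist


def one_tailed_power(mu0, mu1, sigma, n, alpha=0.05):
    """Power (1 - beta) of a one-tailed z-test of H0: mean == mu0 when the true mean is mu1."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1.0 - alpha)      # rejection threshold, 1.645 for alpha = 0.05
    shift = (mu1 - mu0) / (sigma / math.sqrt(n))  # true effect in standard-error units
    return 1.0 - std_normal.cdf(z_crit - shift)   # P(reject H0 | true mean is mu1)


power_under_h0 = one_tailed_power(0.5, 0.5, sigma=1.0, n=25)
power_with_effect = one_tailed_power(0.5, 0.7, sigma=1.0, n=25)
beta = 1.0 - power_with_effect  # probability of a Type II error

print(power_under_h0, power_with_effect, beta)
```

When H0 is true the rejection probability equals alpha; with a real but small effect the power here is only about 0.26, so a Type II error is the more likely outcome.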
Two designs for treatment experiments
One-sample: Compare sample to a known population
e.g., compare to specification, known history
Two-sample: Compare two samples, establish whether they
are produced from the same underlying distribution
Two-sample Z-test
Up until now, assumed we know control population mean
But what about cases where this is unknown?
This is called a two-sample case:
We have two samples of populations
Treatment & control
For now, assume we know std of both populations
We want to compare estimated (sample) means
Two-sample Z-test
(assume std known)
Compare the differences of two population means
When samples are independent (e.g. two patient groups)
H0: M1-M2 = d0
H1: M1-M2 != d0 (this is the two-tailed version)
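$z = \frac{(\bar{M}_1 - \bar{M}_2) - d_0}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$, where $\bar{M}_1, \bar{M}_2$ are the sample means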
var(X-Y) = var(X) + var(Y) for independent variables
When we test for equality, d0 = 0
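A minimal sketch of the two-sample z statistic with known population std. devs. The function name `two_sample_z` and all the numbers in the example call are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math
from statistics import NormalDist


def two_sample_z(mean1, mean2, sigma1, sigma2, n1, n2, d0=0.0):
    """Return (z, two-tailed p-value) for H0: M1 - M2 == d0 with known std. devs."""
    # Variances of independent sample means add: var(X - Y) = var(X) + var(Y)
    standard_error = math.sqrt(sigma1 ** 2 / n1 + sigma2 ** 2 / n2)
    z = ((mean1 - mean2) - d0) / standard_error
    p_two_tailed = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p_two_tailed


# Hypothetical treatment and control samples
z, p = two_sample_z(mean1=0.7, mean2=0.5, sigma1=0.3, sigma2=0.4, n1=36, n2=64)
print(round(z, 3), round(p, 4))
```

Here the standard error is the square root of 0.0025 + 0.0025, so z is about 2.83 and H0 would be rejected at the 0.05 level.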
Mean comparison when std unknown
Up until now, assumed we have population std.
But what about cases where std is unknown?
=> Has to be approximated
When N sufficiently large (e.g., N>30)
When population std unknown: Use sample std
Population std is: $\sigma = \sqrt{\frac{SS_X}{N}} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N}}$
Sample std is: $S_X = \sqrt{\frac{SS_X}{N-1}} = \sqrt{\frac{\sum (X_i - \bar{X})^2}{N-1}}$
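The only difference between the two formulas is the divisor (N versus N-1). The sketch below computes both from the sum of squares and checks them against Python's `statistics.pstdev` and `statistics.stdev`; the data values are made up for illustration.

```python
import math
import statistics


def sum_of_squares(xs):
    """SS_X: sum of squared deviations from the mean."""
    mean = statistics.mean(xs)
    return sum((x - mean) ** 2 for x in xs)


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical measurements
ss = sum_of_squares(data)

population_std = math.sqrt(ss / len(data))    # divide by N
sample_std = math.sqrt(ss / (len(data) - 1))  # divide by N - 1

print(population_std, sample_std)
```

The sample std is always slightly larger, and the gap shrinks as N grows.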
The Student's t-test
Z-test works well with relatively large N
e.g., N>30 for central limit theorem
But is less accurate when population std unknown
Std is not a constant anymore
In this case, and small N: t-test is used
t-distribution approaches normal for larger N (~60-120)
[Figure: t-distribution curve centered at t=0 (P50), with thicker tails than the normal; the t-score doesn't distribute normally because its denominator is variable]
t-test:
Performed like z-test with sample std
Compared against t-distribution
Assumes sample mean is normally distributed
Which it is, based on the central limit theorem, though the t-score (based on sample std) is not normally distributed
Requires use of size of sample
N-1 degrees of freedom, a different distribution for each degree
Std decreases as df increases, approaches normal
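A one-sample t-score is computed exactly like the z-score, with the sample std in place of the population std. The sketch below uses only the standard library, which has no t-distribution, so the critical value 2.262 (two-tailed, alpha = 0.05, df = 9) is taken from a standard t table. The sample values, the hypothesized mean of 2.0, and the function name `one_sample_t` are assumptions for illustration.

```python
import math
import statistics


def one_sample_t(sample, mu0):
    """Return (t, degrees of freedom) for H0: population mean == mu0."""
    n = len(sample)
    standard_error = statistics.stdev(sample) / math.sqrt(n)  # uses sample std (N - 1)
    t = (statistics.mean(sample) - mu0) / standard_error
    return t, n - 1


# Hypothetical treatment sample of size 10
sample = [2.1, 2.5, 1.9, 3.0, 2.8, 2.2, 2.6, 3.1, 2.4, 2.7]
t, df = one_sample_t(sample, mu0=2.0)

T_CRITICAL_DF9 = 2.262  # two-tailed 5% cutoff for df = 9, from a t table
reject = abs(t) >= T_CRITICAL_DF9

print(round(t, 2), df, reject)
```

The cutoff 2.262 is larger than the normal cutoff 1.96, which reflects the thicker tails of the t-distribution at small df.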
t-test variations
Available in Excel or statistical software packages
Two-sample and one-sample t-test
Two-tailed, one-tailed t-test
t-test assuming equal and unequal variances
Paired t-test
Same inputs (e.g. before/after treatment), not independent
The t-test is common for testing hypotheses about means
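As one illustration of these variations, a paired t-test can be carried out as a one-sample t-test on the per-input differences. This is a sketch under that assumption; the before/after values and the function name `paired_t` are hypothetical, and the critical value 2.776 (two-tailed, alpha = 0.05, df = 4) comes from a standard t table.

```python
import math
import statistics


def paired_t(before, after):
    """Return (t, df) for H0: the mean of the paired differences is 0."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(after, before)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1


# Hypothetical scores on the same five inputs, before and after treatment
before = [10, 12, 9, 11, 13]
after = [12, 14, 10, 13, 14]
t_paired, df_paired = paired_t(before, after)

T_CRITICAL_DF4 = 2.776  # two-tailed 5% cutoff for df = 4, from a t table
print(round(t_paired, 3), df_paired, abs(t_paired) >= T_CRITICAL_DF4)
```

Working on the differences removes the variation between inputs, which is why the paired design suits non-independent samples.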
Testing variance hypotheses
F-test: compares variances of populations
Z-test, t-test: compare means of populations
Testing procedure is similar
H0: $\sigma_1^2 = \sigma_2^2$
H1: $\sigma_1^2 \neq \sigma_2^2$ OR $\sigma_1^2 > \sigma_2^2$ OR $\sigma_1^2 < \sigma_2^2$
Now calculate $f = \frac{s_1^2}{s_2^2}$, where $s_x$ is the sample std of X
When f is far from 1, the variances are likely different
To determine likelihood (how far), compare to F distribution
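The f ratio itself takes one line to compute. The sketch below returns it together with the two degrees-of-freedom values needed to look up the F distribution; it stops short of the lookup, since the standard library has no F distribution. The sample values and the function name `f_statistic` are hypothetical.

```python
import statistics


def f_statistic(sample1, sample2):
    """Return (f, numerator df, denominator df) for comparing two sample variances."""
    f = statistics.variance(sample1) / statistics.variance(sample2)  # s1^2 / s2^2
    return f, len(sample1) - 1, len(sample2) - 1


# Hypothetical samples: the first is visibly more spread out than the second
sample1 = [4.1, 5.3, 6.8, 3.9, 7.2, 5.0]
sample2 = [5.1, 5.4, 4.9, 5.2, 5.0, 5.3]

f, df1, df2 = f_statistic(sample1, sample2)
print(f, df1, df2)
```

The resulting f would then be compared against an F table (or a statistics package) with df1 and df2 degrees of freedom.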
The F distribution
F is based on the ratio of population and sample variances
$F = \frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2}$
According to H0, the two standard deviations are equal
F-distribution
Two parameters: numerator and denominator degrees-of-freedom
Degrees-of-freedom (here): N-1 of sample
Assumes both variables are normal
Other tests for two-sample testing
There exist multiple other tests for two-sample testing
Each with its own assumptions and associated power
For instance, Kolmogorov-Smirnov (KS) test
Non-parametric estimate of the difference between two distributions
Turn to your friendly statistics book for help
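To make the KS idea concrete: the two-sample KS statistic is usually defined as the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes only that statistic, not its significance level; the function name `ks_statistic` and the sample values are assumptions.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    # Both empirical CDFs are step functions that change only at data points,
    # so it is enough to evaluate the gap at every observed value.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)  # fraction of a that is <= x
        cdf_b = bisect_right(b, x) / len(b)  # fraction of b that is <= x
        d = max(d, abs(cdf_a - cdf_b))
    return d


print(ks_statistic([1, 2, 3], [1, 2, 3]))  # identical samples
print(ks_statistic([1, 2, 3], [4, 5, 6]))  # completely separated samples
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap at all), and makes no assumption about the shape of either distribution.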
Testing correlation hypotheses
We now examine the significance of r
To do this, we have to examine the sampling distribution of r
The distribution of r values we get from different samples
The sampling distribution of r is not easy to work with (how
does it look?)
Fisher's r-to-z transform:
$z(r) = 0.5 \ln \frac{1+r}{1-r}$
Approximately normal sampling distribution (N>10)
• Mean = z(ρ) (of population)
• standard error (independent of r): $\frac{1}{\sqrt{N-3}}$
Testing correlation hypotheses
We now plug these values and do a Z-test
For example:
Let the r correlation coefficient for variables x,y = 0.14
Suppose n = 30
H0: ρ = 0
H1: ρ != 0
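$z_0 = z(0.14) = 0.5 \ln \frac{1+0.14}{1-0.14} = 0.141$
Standard error $= 1/\sqrt{n-3} = 1/\sqrt{27} \approx 0.192$, so $Z = (0.141 - 0)/0.192 \approx 0.73 < 1.96$
Cannot reject H0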
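The example can be checked numerically. The sketch assumes the usual standard error of the Fisher transform, 1/sqrt(n-3), and a two-tailed test at alpha = 0.05; the function names `fisher_z` and `correlation_z_test` are illustrative.

```python
import math


def fisher_z(r):
    """Fisher's r-to-z transform."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))


def correlation_z_test(r, n, rho0=0.0):
    """Z-score for H0: population correlation == rho0, given sample correlation r."""
    standard_error = 1.0 / math.sqrt(n - 3)  # does not depend on r
    return (fisher_z(r) - fisher_z(rho0)) / standard_error


z_r = fisher_z(0.14)
z_score = correlation_z_test(0.14, n=30)
reject = abs(z_score) >= 1.96  # two-tailed test at alpha = 0.05

print(round(z_r, 3), round(z_score, 2), reject)
```

With n = 30 a correlation of 0.14 is well inside the acceptance region, which matches the conclusion that H0 cannot be rejected.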
Treatment Experiments
(single-factor experiments)
Allow comparison of multiple treatment conditions
treatment1: Ind1 & Ex1 & Ex2 & .... & Exn ==> Dep1
treatment2: Ind2 & Ex1 & Ex2 & .... & Exn ==> Dep2
control: Ex1 & Ex2 & .... & Exn ==> Dep3
Compare performance of algorithm A to B to C ....
Control condition: Optional (e.g., to establish baseline)
Cannot use the tests we learned: Why?