Quantitative methods and R – (2)

LING115
December 2, 2009
Two sample t-test
• Check if the means of two samples are
different
– Calculate the difference between the two means
– Normalize it by the standard error
• Which standard error to use becomes an issue since
there are two samples
Two sample t-test – (2)
• If the two sample variances are roughly the same, pool the
two sample variances and then estimate SE
– Weighted average of the two variances, each of which is weighted by
its degrees of freedom (n-1)
$$s_p^2 = \frac{(n_a - 1)\,s_a^2 + (n_b - 1)\,s_b^2}{(n_a - 1) + (n_b - 1)}$$
$$SE = \sqrt{s_p^2 \left( \frac{1}{n_a} + \frac{1}{n_b} \right)}$$
$$t = \frac{\bar{x}_a - \bar{x}_b}{SE}$$
$$df = n_a + n_b - 2$$
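The pooled-variance calculation can be sketched by hand in R. The two samples `a` and `b` are invented for illustration and are not from the course data; the variable names are assumptions as well.

```r
# Hypothetical samples, made up for this sketch
a <- c(5.1, 4.9, 6.2, 5.8, 6.0)
b <- c(4.2, 4.8, 5.0, 4.4, 4.6, 5.2)

na <- length(a)
nb <- length(b)

# Pooled variance: each sample variance is weighted by its degrees of freedom
s2p <- ((na - 1) * var(a) + (nb - 1) * var(b)) / ((na - 1) + (nb - 1))

# Standard error of the difference between the two means
se <- sqrt(s2p * (1 / na + 1 / nb))

# t-score and its degrees of freedom
t_stat <- (mean(a) - mean(b)) / se
df <- na + nb - 2

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df)
```

The result should agree with `t.test(a, b, var.equal = TRUE)`.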
Two sample t-test – (3)
• If the two sample variances are not the same, SE is estimated
by the following formula:
$$SE = \sqrt{\frac{s_a^2}{n_a} + \frac{s_b^2}{n_b}}$$
• Degrees of freedom are calculated differently depending on the
sample size
– If both samples consist of more than 30 data points, we can use the normal distribution
as the distribution of t-scores
– If not, estimate the degrees of freedom by the following formula:
$$df = \frac{U^2}{V^2 / (n_a - 1) + W^2 / (n_b - 1)}$$
$$V = s_a^2 / n_a, \quad W = s_b^2 / n_b, \quad U = V + W$$
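A minimal sketch of the unequal-variance case in R, taking U to be the sum V + W. The samples are hypothetical and the variable names are chosen only for this example.

```r
# Hypothetical samples, made up for this sketch
a <- c(5.1, 4.9, 6.2, 5.8, 6.0)
b <- c(4.2, 4.8, 5.0, 4.4, 4.6, 5.2)

na <- length(a)
nb <- length(b)

# Per-sample variance of the mean
V <- var(a) / na
W <- var(b) / nb
U <- V + W

# Standard error without pooling the variances
se <- sqrt(U)

# Estimated degrees of freedom (usually not a whole number)
df <- U^2 / (V^2 / (na - 1) + W^2 / (nb - 1))

t_stat <- (mean(a) - mean(b)) / se
p_value <- 2 * pt(-abs(t_stat), df)
```

R's `t.test(a, b)` uses this estimate by default, so its reported df should match.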
Two sample t-test in R
$ cd /home/ling115/r
$ R
Two sample t-test in R – (2)
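The original slide content is not preserved in this transcript. As a hedged sketch, a two sample t-test could be run with R's built-in `t.test()`; the data below are invented for illustration.

```r
# Hypothetical samples, made up for this sketch
a <- c(5.1, 4.9, 6.2, 5.8, 6.0)
b <- c(4.2, 4.8, 5.0, 4.4, 4.6, 5.2)

# Variances assumed roughly equal: pooled-variance t-test
pooled <- t.test(a, b, var.equal = TRUE)

# Variances not assumed equal (the default in R)
welch <- t.test(a, b)

# Each result stores the t-score, degrees of freedom and p-value
pooled$statistic
pooled$parameter
pooled$p.value
```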
Paired t-test
• Some data make more sense when paired
– F1 of a set of vowels from males and females
– Difference in frequency of the same set of words
between two corpora
– Can control for the variation due to the factor by
which observations are paired (e.g. what the vowel is)
• Calculate the difference in score for each pair
• Run one sample t-test to see if the mean
difference value is different from zero
Paired t-test in R
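The original slide content is not preserved in this transcript. The sketch below, using invented F1 values paired by vowel, shows that a paired t-test gives the same result as a one sample t-test on the pairwise differences.

```r
# Hypothetical F1 values (Hz) for the same five vowels, made up for this sketch
male   <- c(280, 450, 730, 500, 310)
female <- c(320, 530, 850, 570, 370)

# Difference in score for each pair
d <- female - male

# Paired t-test
paired <- t.test(female, male, paired = TRUE)

# One sample t-test: is the mean difference different from zero?
one_sample <- t.test(d, mu = 0)

paired$statistic
one_sample$statistic
```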
Parametric vs. Nonparametric
• Parametric test (of
statistical significance)
– Assumes normal distribution
– Data are measured in interval
scales
– Makes use of parameters
such as mean, variance
• Nonparametric test
– Does not assume normal
distribution
– Knowledge of parameters is
not necessary
– e.g. Wilcoxon test instead of
t-test
• Shapiro-Wilk test for normality
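A short sketch of both ideas in R, with invented data: `shapiro.test()` checks a sample for normality (a small p-value suggests the sample is not normally distributed), and `wilcox.test()` is a nonparametric alternative to the two sample t-test.

```r
# Hypothetical samples, made up for this sketch
a <- c(5.1, 4.9, 6.2, 5.8, 6.0)
b <- c(4.2, 4.8, 5.0, 4.4, 4.6, 5.2)

# Shapiro-Wilk test for normality on one sample
sw <- shapiro.test(a)
sw$p.value

# Wilcoxon test: compares the two samples without assuming normality
w <- wilcox.test(a, b)
w$p.value
```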
ANOVA
• Data points are grouped by a factor with more
than two levels
– F1 of a set of vowels produced by speakers from five
different dialect groups
– Difference in frequency of the same set of words
among ten corpora
• The goal of Analysis of Variance is to check if the
differences among the means of different groups
are greater than the differences among the
observations in the data set generally
Variance among groups
• Calculate the mean for each group
• Calculate the overall mean of the data pooled from all
groups
• Calculate the squared deviation of each group mean
from the overall mean
• Multiply the squared deviation by the number of data
points in each group so that the amount reflects the
size of each group
• Add up the values (SS_group)
• Normalize SS_group by the degrees of freedom
– df = number of groups minus one
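The steps above can be sketched in R as follows. The F1 values and the three dialect groups are hypothetical, as are the variable names.

```r
# Hypothetical F1 values (Hz) from three dialect groups, made up for this sketch
f1 <- c(310, 330, 325, 400, 420, 410, 500, 480, 505)
dialect <- factor(rep(c("A", "B", "C"), each = 3))

# Mean and size of each group
group_means <- tapply(f1, dialect, mean)
group_sizes <- tapply(f1, dialect, length)

# Overall mean of the data pooled from all groups
grand_mean <- mean(f1)

# Squared deviation of each group mean, weighted by group size, then summed
ss_group <- sum(group_sizes * (group_means - grand_mean)^2)

# Normalize by the degrees of freedom (number of groups minus one)
df_group <- nlevels(dialect) - 1
var_group <- ss_group / df_group
```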
Variance within the entire data
• On first thought, this may be the sample
variance
• But this variance includes variance due to
group difference
• So we want to get the variance of the whole
with the variance due to group difference
removed
Variance within the entire data – (2)
• Sum of squares of error (SS_error)
– Method 1
• Calculate the sum of squared deviations over all data points
(SS_total)
• Subtract the sum of squared deviations due to group
difference (SS_group), i.e. SS_total – SS_group
– Method 2
• Calculate the sum of squared deviations within each group
• Add up the SS
• Normalize SS_error by the degrees of freedom
– df = number of data points minus the number of
groups
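Both methods can be sketched in R and should give the same SS_error. The data and variable names are hypothetical.

```r
# Hypothetical F1 values (Hz) from three dialect groups, made up for this sketch
f1 <- c(310, 330, 325, 400, 420, 410, 500, 480, 505)
dialect <- factor(rep(c("A", "B", "C"), each = 3))

# Method 1: SS_total minus SS_group
ss_total <- sum((f1 - mean(f1))^2)
group_means <- tapply(f1, dialect, mean)
group_sizes <- tapply(f1, dialect, length)
ss_group <- sum(group_sizes * (group_means - mean(f1))^2)
ss_error_1 <- ss_total - ss_group

# Method 2: sum of squared deviations within each group, added up
ss_within <- tapply(f1, dialect, function(x) sum((x - mean(x))^2))
ss_error_2 <- sum(ss_within)

# Normalize by the degrees of freedom (data points minus groups)
df_error <- length(f1) - nlevels(dialect)
var_error <- ss_error_2 / df_error
```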
ANOVA and F-ratio
• If variance among groups is different from
variance within the entire data, we assume
the group means are different
• F = VAR_group / VAR_error
– F = 1 if the two variances are exactly the same
– The farther F is from 1, the less likely it
is that the two variances are the same
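A minimal sketch of the F-ratio in R, computed by hand and then with the built-in `aov()`. The data are hypothetical; the p-value here is the upper-tail probability of the F distribution, which is what R's ANOVA table reports.

```r
# Hypothetical F1 values (Hz) from three dialect groups, made up for this sketch
f1 <- c(310, 330, 325, 400, 420, 410, 500, 480, 505)
dialect <- factor(rep(c("A", "B", "C"), each = 3))

# Variance among groups
group_means <- tapply(f1, dialect, mean)
group_sizes <- tapply(f1, dialect, length)
df_group <- nlevels(dialect) - 1
var_group <- sum(group_sizes * (group_means - mean(f1))^2) / df_group

# Variance within groups (error)
df_error <- length(f1) - nlevels(dialect)
var_error <- sum(tapply(f1, dialect, function(x) sum((x - mean(x))^2))) / df_error

# F-ratio and the probability of an F at least this large
f_ratio <- var_group / var_error
p_value <- pf(f_ratio, df_group, df_error, lower.tail = FALSE)

# The same analysis with R's built-in one-way ANOVA
fit <- aov(f1 ~ dialect)
summary(fit)
```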
F-distribution
• Probability distribution of ratio of variances
– Note that each variance will have its own degrees
of freedom
• F=1 if two variances are the same
• The farther away F is from 1, the less likely it is
that the two variances are the same
• F-distribution is sensitive to whether the
population distribution is normal
F-distribution graph (from wikimedia)
Comparison of variance in R
• var.test(x,y)
• We can also calculate F-ratio by var(x)/var(y)
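A short sketch with invented samples, showing that the F statistic reported by `var.test()` is the same ratio as `var(x)/var(y)`:

```r
# Hypothetical samples, made up for this sketch
x <- c(5.1, 4.9, 6.2, 5.8, 6.0)
y <- c(4.2, 4.8, 5.0, 4.4, 4.6, 5.2)

# F-ratio by hand
f_ratio <- var(x) / var(y)

# F test comparing the two variances
vt <- var.test(x, y)
vt$statistic   # the F-ratio
vt$parameter   # degrees of freedom of each variance
vt$p.value
```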
Two factors
• ANOVA we discussed so far assumed there is a
single factor which divides the data points into
multiple groups
• There may be more than one factor
– Number of adjectives in each sentence
– Length of each sentence
Interaction
• With two factors, assuming they are
meaningful, there are three ways the factors
can affect the value of an observation
– Factor 1
– Factor 2
– Interaction of factor 1 and factor 2
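One way to sketch a two-factor analysis in R is the formula `y ~ factor1 * factor2`, which requests both main effects and their interaction. The data frame below is entirely hypothetical: `score` stands in for whatever is being measured, and the two factors are reduced to two levels each for brevity.

```r
# Hypothetical data: 12 observations, 3 per combination of the two factors
d <- data.frame(
  score = c(10, 12, 11, 14, 15, 13, 16, 18, 17, 25, 27, 26),
  adjectives = factor(rep(c("few", "many"), each = 6)),
  length = factor(rep(rep(c("short", "long"), each = 3), times = 2))
)

# Two-way ANOVA: factor 1, factor 2, and their interaction
fit <- aov(score ~ adjectives * length, data = d)
summary(fit)

# Effect names in the ANOVA table
effects <- rownames(anova(fit))
```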
Repeated measures
• ANOVA discussed so far assumes the data
points are independent of each other
• As in the case of paired t-test, some data
make more sense when observations are
matched