Transcript Methods

10:30–12:00, December 10, 2012
For Survey of Quantitative Research, NORSI
CENTRE FOR INNOVATION, RESEARCH AND COMPETENCE IN THE LEARNING ECONOMY
Session 2:
Basic techniques for innovation data analysis.
Part I: Statistical inferences and comparisons of groups
Taehyun Jung
[email protected]
CIRCLE, Lund University
Objectives of this session
Contents
Correlation
Statistical Inference and Hypothesis Testing
t-Test
Confidence Interval
Chi-square Statistic
Correlation
Scatterplot
A scatterplot displays the strength, direction, and form of the
relationship between two quantitative variables.
– As in any graph of data, look for the overall pattern and for striking
departures from that pattern.
• Form: linear, curved, clusters, no pattern
• Direction: positive, negative, no direction
• Strength: how closely the points fit the “form”
– An important kind of departure is an outlier, an individual value that falls
outside the overall pattern of the relationship.
[Example scatterplots: linear relationship, no relationship, nonlinear relationship]
The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form.
– With a strong relationship, you can get a pretty good estimate of y if you know x.
– With a weak relationship, for any x you might get a wide range of y values.
Correlation
The sample Pearson correlation coefficient r measures the strength of the linear relationship between two quantitative variables:

r = (1/(n − 1)) Σᵢ ((xᵢ − x̄)/sₓ)((yᵢ − ȳ)/s_y)
– r is always a number between -1 and 1.
– r > 0 indicates a positive association.
– r < 0 indicates a negative association.
– Values of r near 0 indicate a very weak linear relationship.
– The strength of the linear relationship increases as r moves away from 0 toward -1
or 1.
– The extreme values r = -1 and r = 1 occur only in the case of a perfect linear
relationship.
– Part of the calculation involves finding z, the standardized score.
• This allows us to compare correlations between data sets where variables are measured in different units or when the variables are different.
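To make this concrete, here is a minimal Python sketch (the data and variable names are made up for illustration) computing r both from the standardized scores and with a library call:

    import numpy as np
    from scipy import stats

    # Hypothetical data: e.g., R&D intensity (x) and log patent count (y) for 8 firms
    x = np.array([1.2, 2.5, 3.1, 4.0, 4.8, 5.5, 6.1, 7.0])
    y = np.array([3.0, 4.5, 4.9, 6.2, 6.8, 7.9, 8.1, 9.4])

    # Direct implementation: r = (1/(n-1)) * sum of products of standardized scores
    n = len(x)
    zx = (x - x.mean()) / x.std(ddof=1)   # z-scores using the sample sd (n - 1)
    zy = (y - y.mean()) / y.std(ddof=1)
    r_manual = (zx * zy).sum() / (n - 1)

    r_scipy, p_value = stats.pearsonr(x, y)   # library equivalent, with a p-value
    print(r_manual, r_scipy)                   # the two agree

Because r works on standardized scores, rescaling x or y (changing units) leaves it unchanged.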
Facts About Correlation
• Correlation makes no distinction between explanatory and response variables.
• r has no units and does not change when we change the units of measurement of x, y, or both.
• Positive r indicates positive association between the variables, and negative r indicates negative association.
• The correlation r is always a number between −1 and 1.
• Cautions
– Correlation requires that both variables be quantitative.
– Correlation does not describe curved relationships between variables, no matter
how strong the relationship is.
– Correlation is not resistant. r is strongly affected by a few outlying observations.
– Correlation is not a complete summary of two-variable data.
r ranges from −1 to +1.
– Strength: how closely the points follow a straight line.
– Direction: positive when individuals with higher x values tend to have higher values of y.
Influential points
Correlations are calculated using means and
standard deviations and thus are NOT
resistant to outliers.
In the illustrated example, moving just one point away from the general trend decreases the correlation from −0.91 to −0.75.
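A small Python sketch with synthetic data (not the actual points behind the slide's figure) shows the same sensitivity: displacing a single observation visibly weakens r:

    import numpy as np

    rng = np.random.default_rng(0)
    x = np.linspace(0, 10, 20)
    y = -2 * x + rng.normal(0, 1.5, size=20)      # strong negative linear trend

    r_before = np.corrcoef(x, y)[0, 1]

    y_moved = y.copy()
    y_moved[-1] += 25                              # drag one point off the trend
    r_after = np.corrcoef(x, y_moved)[0, 1]

    print(round(r_before, 2), round(r_after, 2))   # |r| drops noticeably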
Statistical Inference and Hypothesis Testing
Normal distribution
The normal distribution has the bell-shaped (Gaussian) form.
– arises from the central limit theorem, which states that under mild
conditions, the mean of a large number of random variables independently
drawn from the same distribution is distributed approximately normally,
irrespective of the form of the original distribution
– very tractable analytically:

f(X) = (1/(σ√(2π))) exp(−½ ((X − μ)/σ)²),   X ~ N(μ, σ²)

[Figure: normal density curve, horizontal axis running from μ − 4σ to μ + 4σ]
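A brief Python sketch of the central limit theorem at work (exponential draws are just one convenient non-normal choice; the numbers are illustrative):

    import numpy as np

    rng = np.random.default_rng(42)

    # Draws from a clearly non-normal (right-skewed) distribution
    n, reps = 100, 10_000
    samples = rng.exponential(scale=1.0, size=(reps, n))

    # By the CLT, the means of large samples are approximately normal
    means = samples.mean(axis=1)
    print(means.mean(), means.std())   # close to 1 and 1/sqrt(n) = 0.1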
Testing a hypothesis relating to the population mean
Assumption: X ~ N(μ, σ²)
– Null hypothesis: H0: μ = μ0
– Alternative hypothesis: H1: μ ≠ μ0
– We will suppose that we have observations on a random variable with a normal distribution with unknown mean μ and that we wish to test the hypothesis that the mean is equal to some specific value μ0.
• Suppose that we have a sample of data for the example model and that the sample mean X̄ is μ0 − s.d., one standard deviation below μ0. Would this be evidence against the null hypothesis μ = μ0?
– No, it is not. It is lower than μ0, but we would not expect it to be exactly equal to μ0, because the sample mean has a random component.
– If the null hypothesis is true, the probability of the sample mean being one standard deviation or more above or below the population mean is 31.7%.
• What if the sample mean is four standard deviations above the hypothetical mean?
– The chance of getting such an extreme estimate is only 0.006%.
– We would reject the null hypothesis.
• The usual procedure for making decisions is to reject the null hypothesis if it implies that the probability of getting such an extreme sample mean is less than some (small) probability p.
– For example, reject if the probability of getting such an extreme sample mean is less than 0.05 (5%).
– The 2.5% tails of a normal distribution always begin 1.96 standard deviations from its mean.
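These tail probabilities are easy to verify numerically; a Python sketch using scipy reproduces the figures quoted above:

    from scipy import stats

    # Two-sided probability of a sample mean at least k s.d. from the mean under H0
    for k in (1, 4):
        p = 2 * stats.norm.sf(k)        # sf(k) = P(Z > k), the upper tail
        print(k, p)                      # k=1 -> 0.317 (31.7%), k=4 -> 0.00006 (0.006%)

    print(stats.norm.ppf(0.975))         # 1.96: where the 2.5% upper tail begins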
Decision rule (5% significance level): Reject H0: μ = μ0
– (1) if X̄ > μ0 + 1.96 s.d., or (2) if X̄ < μ0 − 1.96 s.d.
– equivalently, (1) if z = (X̄ − μ0)/s.d. > 1.96, or (2) if z = (X̄ − μ0)/s.d. < −1.96
• Type I error: rejection of H0
when it is in fact true.
• Probability of Type I error: in
this case, 5%
• Significance level (size) of the
test is 5%.
We can of course reduce the risk of making a Type I error by
reducing the size of the rejection region.
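A simulation sketch in Python (assuming the known-σ setup and the 5% two-sided rule above) confirms that the rejection rate under a true H0 matches the significance level:

    import numpy as np

    rng = np.random.default_rng(1)
    mu0, sigma, n, reps = 0.0, 1.0, 25, 100_000

    # Simulate samples for which H0 is true, then apply the 5% decision rule
    xbar = rng.normal(mu0, sigma, size=(reps, n)).mean(axis=1)
    z = (xbar - mu0) / (sigma / np.sqrt(n))   # s.d. of the sample mean = sigma/sqrt(n)
    reject = np.abs(z) > 1.96

    print(reject.mean())   # about 0.05: the Type I error rate equals the size of the test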
t-Test
What if we do not know the standard deviation? The test statistic has a t distribution instead of a normal distribution.

s.d. of X̄ known: the discrepancy between the hypothetical value and the sample estimate, in terms of s.d., is z = (X̄ − μ0)/s.d.
– 5% significance test: reject H0: μ = μ0 if z > 1.96 or z < −1.96

s.d. of X̄ not known: the discrepancy between the hypothetical value and the sample estimate, in terms of the standard error (s.e.), is t = (X̄ − μ0)/s.e.
– 5% significance test: reject H0: μ = μ0 if t > t_crit or t < −t_crit
For a sample of size n, the sample standard deviation s is:

s = √( (1/(n − 1)) Σᵢ (xᵢ − x̄)² )

– n − 1 is the “degrees of freedom.”
– The value s/√n is called the standard error of the mean (SEM).
– Scientists often present their sample results as the mean ± SEM.
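In Python, a minimal sketch with made-up measurements:

    import numpy as np

    x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])   # hypothetical sample
    n = len(x)

    s = x.std(ddof=1)        # sample sd: divides by n - 1, the degrees of freedom
    sem = s / np.sqrt(n)     # standard error of the mean

    print(f"{x.mean():.2f} ± {sem:.2f}")   # mean ± SEM, as often reported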
• When the number of degrees of freedom is large, the t distribution looks very much like a normal distribution (and as the number increases, it converges to the normal distribution).
• Why use the t distribution, then?
– Although the distributions are generally quite similar, the t distribution has longer tails than the normal distribution, the difference being the greater, the smaller the number of degrees of freedom.
– The rejection regions therefore have to start more standard deviations away from zero for a t distribution than for a normal distribution, as the check below shows.
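A quick numerical check (Python/scipy) of how the two-sided 5% critical value shrinks toward the normal 1.96 as the degrees of freedom grow:

    from scipy import stats

    # Rejection regions start further from zero when the df are small
    for df in (5, 10, 19, 60, 1000):
        print(df, round(stats.t.ppf(0.975, df), 3))
    # 5 -> 2.571, 10 -> 2.228, 19 -> 2.093, 60 -> 2.0, 1000 -> 1.962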
Example
• A certain city abolishes its local sales tax on consumer expenditure. A survey of 20 households shows that, in the following month, mean household expenditure increased by $160 and the standard error of the increase was $60.
• We wish to determine whether the abolition of the tax had a significant effect on household expenditure.
– We take as our null hypothesis that there was no effect: H0: μ = 0
– The test statistic is t = (160 − 0)/60 = 2.67
– The critical values of t with 19 degrees of freedom are 2.09 at the 5 percent significance level and 2.86 at the 1 percent level.
– Hence we reject the null hypothesis of no effect at the 5 percent level but not at the 1 percent level.
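The same calculation in Python (a sketch that simply reproduces the numbers in the example):

    from scipy import stats

    t_stat = (160 - 0) / 60            # = 2.67
    df = 20 - 1                        # n - 1 degrees of freedom

    print(stats.t.ppf(0.975, df))      # 2.09: two-sided 5% critical value
    print(stats.t.ppf(0.995, df))      # 2.86: two-sided 1% critical value
    print(2 * stats.t.sf(t_stat, df))  # p-value of about 0.015, between 0.01 and 0.05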
Robustness
The t tests are exactly correct when the population is distributed
exactly normally. However, most real data are not exactly normal.
The t tests are robust to small deviations from normality. This
means that the results will not be affected too much. Factors that
do strongly matter are:
– Random sampling. The sample must be an SRS (simple random sample) from the population.
– Outliers and skewness. They strongly influence the mean and therefore the
t procedures. However, their impact diminishes as the sample size gets
larger because of the Central Limit Theorem.
– Specifically:
• When n < 15, the data must be close to normal and without outliers.
• When 15 ≤ n < 40, mild skewness is acceptable, but not outliers.
• When n > 40, the t statistic will be valid even with strong skewness.
Confidence interval
Confidence interval
• Any hypothesis lying in the interval from μmin to μmax would be compatible with the sample estimate (would not be rejected by it). We call this interval the 95% confidence interval.

[Figure: number line around the sample estimate X̄, marking μmin − 1.96 s.d., μmin, μmax, and μmax + 1.96 s.d.]
• Standard deviation known
– 95% confidence interval: X̄ − 1.96 s.d. < μ < X̄ + 1.96 s.d.
– 99% confidence interval: X̄ − 2.58 s.d. < μ < X̄ + 2.58 s.d.
• Standard deviation estimated by the standard error
– 95% confidence interval: X̄ − t_crit(5%) s.e. < μ < X̄ + t_crit(5%) s.e.
– 99% confidence interval: X̄ − t_crit(1%) s.e. < μ < X̄ + t_crit(1%) s.e.
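A Python sketch of the t-based 95% interval (the sample is made up):

    import numpy as np
    from scipy import stats

    x = np.array([4.2, 5.1, 3.8, 6.0, 5.5, 4.9])   # hypothetical sample
    n = len(x)
    se = x.std(ddof=1) / np.sqrt(n)                 # standard error of the mean

    t_crit = stats.t.ppf(0.975, df=n - 1)           # t_crit(5%), two-sided
    lo, hi = x.mean() - t_crit * se, x.mean() + t_crit * se
    print(f"95% CI: ({lo:.2f}, {hi:.2f})")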
Chi-square statistic
Can we conclude that large firms use patents more
strategically than small firms based on this table?
Use of patents by firm size (cell entries: count, with column % in parentheses)

                      Small Firm        Large Firm         Total
Non-strategic use     160 (91.95%)      1,113 (81.66%)     1,273 (82.82%)
Strategic use          14 (8.05%)         250 (18.34%)       264 (17.18%)
Column total          174 (100.00%)     1,363 (100.00%)    1,537 (100.00%)
Stata command
– . tab dused_ndef largef, col chi
– Pearson chi2(1) = 11.4978 Pr = 0.001
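The same test in Python (a sketch; scipy reproduces the Pearson statistic once the 2×2 continuity correction is switched off):

    import numpy as np
    from scipy import stats

    # Rows: non-strategic / strategic use; columns: small / large firms
    table = np.array([[160, 1113],
                      [ 14,  250]])

    chi2, p, dof, expected = stats.chi2_contingency(table, correction=False)
    print(chi2, p, dof)   # about 11.50, p ~ 0.0007, 1 df, matching the Stata output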
Chi-square hypothesis test
The chi-square statistic (χ²*) measures how far the sample is from what we “expect” to see in a random sample from a population with NO relationship.
– If χ²* is too far from what we expected, we conclude that the sample did not come from a population with no relationship and therefore conclude that the variables must be related in the population.
– H0: There is no relationship between categorical variable A and categorical
variable B.
– Ha: There is some relationship between categorical variable A and
categorical variable B.
This alternative hypothesis is not really one-sided (> or <) or two-sided (≠). It can be called “many-sided” because it allows any kind of relationship between variables A and B to count.
• We want to test the hypothesis that there is no relationship between these two categorical variables (H0).
– To test this hypothesis, we compare actual counts from the sample data with expected counts given the null hypothesis of no relationship.
– The expected count in any cell of a two-way table when H0 is true is: expected count = (row total × column total) / table total
• The chi-square statistic (χ²) is a measure of how much the observed cell counts in a two-way table diverge from the expected cell counts:

χ² = Σ (observed count − expected count)² / expected count
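Computing the expected counts and the statistic by hand for the patent table (a Python sketch mirroring the two formulas above):

    import numpy as np

    observed = np.array([[160, 1113],
                         [ 14,  250]])

    # expected count = row total * column total / table total
    row_totals = observed.sum(axis=1, keepdims=True)
    col_totals = observed.sum(axis=0, keepdims=True)
    expected = row_totals * col_totals / observed.sum()

    chi2 = ((observed - expected) ** 2 / expected).sum()
    print(expected.round(1))   # [[144.1, 1128.9], [29.9, 234.1]]
    print(chi2)                # about 11.50, as before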
Large values of χ²* represent strong deviations from the expected distribution under H0 and provide evidence against H0.
However, since χ²* is a sum, how large a χ²* is required for statistical significance depends on the number of comparisons made.
For the chi-square test, H0 states that there is no association
between the row and column variables in a two-way table. The
alternative is that these variables are related.
If H0 is true, the chi-square statistic has approximately a χ² distribution with (r − 1)(c − 1) degrees of freedom.
The P-value for the chi-square test is the area to the right of the observed statistic: P(χ² ≥ χ²*).
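In Python (a sketch, with df = 1 as for the 2×2 patent table):

    from scipy import stats

    chi2_star, df = 11.4978, 1
    print(stats.chi2.sf(chi2_star, df))   # ~0.0007: area to the right of the statistic
    print(stats.chi2.ppf(0.95, df))       # 3.84: the 5% critical value discussed below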
Significance Level (alpha)
• Probability of rejecting the null hypothesis if H0 is true
– Typically, .05 or .01 significance level
– With a significance level of .05 and 1 df, the critical value is χ² = 3.84; we will reject H0 when χ²* is greater than 3.84 and accept H0 when χ²* is less than 3.84.
– If the null hypothesis is true (if the variables are not related in the population), we will still (incorrectly) reject H0 (conclude that the variables are related in the population) about 5 times (or 1 time) in 100 hypothesis tests.
• A key step in the hypothesis test is deciding how willing we are to make a Type I error. (We must take some chance of rejecting a true null hypothesis or we will have no chance of rejecting a false one.)
– Type I error: incorrectly rejecting the null hypothesis.
– Type II error: incorrectly accepting the null hypothesis.