Sampling and correlation


Sampling
BASIC DEFINITIONS
Population = aggregate set of all cases that conform to some set of specifications
Subpopulation = population stratum = stratum
Population element = single member of population
Census = count of elements in population; determination of characteristics of the population based on info on all members
Sample = selection of some elements of population
A representative sampling plan carries the assurance that, say, 90%
of the time (confidence level) the population estimates based on the
sample differ by no more than 5% (margin of error) from the real value
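To make the confidence level and margin of error concrete, here is a minimal sketch of the sample size they imply. It rests on assumptions that are not in the slides: the quantity estimated is a proportion, the sample is a simple random sample from a large population, and the usual normal approximation n = z² p(1 − p) / e² applies. The function name `sample_size_for_proportion` is made up for this illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(confidence=0.90, margin=0.05, p=0.5):
    """Approximate sample size for estimating a proportion (normal approximation)."""
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # p = 0.5 is the most conservative choice: it maximises p * (1 - p).
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)


n = sample_size_for_proportion(confidence=0.90, margin=0.05)
print(n)  # 271
```

Under these assumptions, roughly 271 elements would be needed for a 90% confidence level and a 5% margin of error; other estimators or sampling plans would give different numbers.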
Nonprobability sampling
• Accidental Samples
  • select the first population elements you encounter. Danger: underrepresentation of minorities, females, etc.
• Quota Sampling
  • accidental sample while taking care to have all strata represented in the sample as in the population. Danger: “own-friends” bias
• Purposive Samples
  • pick cases that are judged to be “typical” of the target population. Danger: judgement ...
Probability sampling
• Simple Random Samples
  • selection based on random numbers such that each population element has an equal and independent probability of being sampled
• Stratified Random Sample
• Cluster Sample
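The first two probability sampling plans can be sketched in a few lines of Python. This is only an illustration: the population of 1000 numbered elements, the strata "A" and "B", and both function names are hypothetical, and the stratified version assumes proportional allocation (the same sampling fraction in every stratum). Elements are drawn without replacement.

```python
import random


def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; each element has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)


def stratified_random_sample(strata, fraction, seed=None):
    """Draw a simple random sample of the same fraction within every stratum."""
    rng = random.Random(seed)
    sample = {}
    for name, members in strata.items():
        # At least one element per stratum, so that no stratum is left out.
        k = max(1, round(fraction * len(members)))
        sample[name] = rng.sample(members, k)
    return sample


population = list(range(1000))  # hypothetical element IDs
srs = simple_random_sample(population, 50, seed=1)

strata = {"A": list(range(0, 800)), "B": list(range(800, 1000))}
strat = stratified_random_sample(strata, 0.05, seed=1)
print(len(srs), {name: len(s) for name, s in strat.items()})
```

The stratified draw guarantees that each stratum appears in the sample in proportion to its size, which a simple random sample achieves only on average.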
Determine a proper sample size
\[ t_{N_1+N_2-2} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left( \frac{1}{N_1} + \frac{1}{N_2} \right)}} \]
\[ s_p^2 = s_{\text{pooled}}^2 = \frac{s_1^2 + s_2^2}{2} \]
Lambda (λ) (Goodman & Kruskal, 1954)
To what extent does prediction of rows (columns) improve if the
column (row) is known in a 2×2 contingency table?
                   Divorce %
Marriage %     Low    High    Total
Low             18       5       23
High             6      19       25
Total           24      24       48

Marriage rate predicted WITHOUT knowledge of divorce rate: High (25 > 23)
\[ \text{Error \%} = \frac{23}{48} \times 100 = 47.9\% \]
Marriage rate predicted WITH knowledge of divorce rate:
High if Divorce % high
Low if Divorce % low
\[ \text{Error \%} = \frac{5 + 6}{48} \times 100 = 22.9\% \]
\[ \text{Improvement} = \frac{47.9\% - 22.9\%}{47.9\%} = 0.522 = \lambda \text{ for predicting marriage rate} \]
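The same calculation can be written as a small function. This is a minimal sketch in plain Python; the function name `goodman_kruskal_lambda` and the layout of the table as a list of rows (marriage rate in the rows, divorce rate in the columns) are choices made for this illustration.

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the row variable from the column variable.

    table[i][j] is the number of cases in row i and column j.
    """
    total = sum(sum(row) for row in table)
    # Without column knowledge: always predict the most frequent row,
    # so every case outside that row is an error.
    errors_without = total - max(sum(row) for row in table)
    # With column knowledge: within each column predict its most frequent row.
    columns = list(zip(*table))
    errors_with = sum(sum(col) - max(col) for col in columns)
    return (errors_without - errors_with) / errors_without


# Rows: marriage rate Low, High; columns: divorce rate Low, High.
table = [[18, 5],
         [6, 19]]
lam = goodman_kruskal_lambda(table)
print(round(lam, 3))  # 0.522
```

Working with error counts (23 and 11) instead of rounded percentages gives 12/23, which rounds to the same 0.522.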
Generalization:
correlation between continuous variables
λ : To what extent does prediction of rows (columns) improve if the
column (row) is known in a 2×2 contingency table?
Generalization
r² : To what extent does prediction of y improve when it is based on x
substituted into the regression line y = ax + b, compared with a
prediction made without knowing that line?
Sum of squares of errors WITHOUT knowing X
\[ \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
Sum of squared errors WITH knowledge of X
\[ \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 \]
Improvement:
\[ r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \]
Correlation between one continuous
variable and one dichotomous variable
Number of puzzles solved by
Autocratic Teams    Democratic Teams
 8                  10
10                  12
 7                   9
11                  11
12                  13
mean = 9.6          mean = 11.0

[Figure: number of puzzles solved (vertical axis, 0 to 14) against leadership style (horizontal axis, 0 and 1), with the regression line]

Regression line
\[ \hat{y} = 9.6 + 1.4x \]
Dummy variable
Autocratic = 0
Democratic = 1
Computation of r²
Team   x_i   y_i    ŷ_i    (y_i − ŷ_i)²   (y_i − ȳ)²
1      0      8     9.6     2.56            5.29
2      0     10     9.6      .16             .09
3      0      7     9.6     6.76           10.89
4      0     11     9.6     1.96             .49
5      0     12     9.6     5.76            2.89
6      1     10    11.0     1.00             .09
7      1     12    11.0     1.00            2.89
8      1      9    11.0     4.00            1.69
9      1     11    11.0     0.00             .49
10     1     13    11.0     4.00            7.29
Sum         103            27.20           32.10

\[ \bar{y} = 10.3 \]
\[ r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{32.10 - 27.20}{32.10} = .15 \]
If we know the leadership style in a team, then we can predict their
productivity 15% better than without that knowledge
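The computation for the ten teams can be checked with a short script. This is a sketch in plain Python, not a reference implementation; the helper names `least_squares_line` and `r_squared` are invented for this illustration, and the line is fitted by ordinary least squares on the 0/1 dummy variable.

```python
x = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]          # 0 = autocratic, 1 = democratic
y = [8, 10, 7, 11, 12, 10, 12, 9, 11, 13]   # puzzles solved per team


def least_squares_line(x, y):
    """Return slope a and intercept b of the least-squares line y = a*x + b."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    a = sxy / sxx
    return a, mean_y - a * mean_x


def r_squared(x, y):
    """Proportional reduction in squared prediction error from knowing x."""
    a, b = least_squares_line(x, y)
    mean_y = sum(y) / len(y)
    ss_without = sum((yi - mean_y) ** 2 for yi in y)                   # predict mean
    ss_with = sum((yi - (a * xi + b)) ** 2 for xi, yi in zip(x, y))    # predict line
    return (ss_without - ss_with) / ss_without


a, b = least_squares_line(x, y)
r2 = r_squared(x, y)
print(round(a, 1), round(b, 1), round(r2, 2))  # 1.4 9.6 0.15
```

With a 0/1 dummy variable the fitted line passes through the two group means, so the intercept is the autocratic mean and the slope is the difference between the means.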
r² depends on the difference between the group averages and the
variance within the groups: a larger difference between the group
averages gives a higher r², and so does a smaller variance within the groups
Relation between r and t-test
H0 : there is no relation between a dichotomous variable x (group)
and a continuous variable y
H0 is not likely to be true if r² is high.
WHAT IS HIGH?
Statistics → if in the population no relation exists between x and y,
then the sampling distribution of
\[ t_{N-2} = \frac{r \sqrt{N-2}}{\sqrt{1-r^2}} \]
is a t-distribution
Student t-distribution versus
Normal distribution
[Figure: density curves of N(0,1), t(100) and t(5) over the range -3 to 3, with the Type I error area marked]
Example: leadership style and productivity
r² for leadership style and productivity equals .15
(N = 10 teams, r = .39)
\[ t_{N-2} = \frac{r \sqrt{N-2}}{\sqrt{1-r^2}} \;\Rightarrow\; t_8 = \frac{.39 \sqrt{8}}{\sqrt{1 - .39^2}} = 1.20 \]
Conclusion: the obtained result is NOT strong enough to
reject the null hypothesis that states that leadership style
and productivity are unrelated …
WHICH IS NOT THE SAME AS “H0 IS TRUE” !!
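The test can be reproduced numerically. In this sketch the function name `t_from_r` is invented for the illustration, and the critical value 2.306 (two-sided, 5% level, 8 degrees of freedom) is an assumption taken from a standard t table; the slides do not state which level was used.

```python
import math


def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom for a correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


t = t_from_r(0.39, 10)

# Assumed critical value: two-sided test at the 5% level with 8 degrees
# of freedom, read from a standard t table (not given in the slides).
T_CRITICAL_8DF = 2.306
reject_h0 = abs(t) > T_CRITICAL_8DF

print(round(t, 2), reject_h0)  # 1.2 False
```

Failing to exceed the critical value only means the data are compatible with H0, not that H0 has been shown to be true.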
Classical approach to t-test
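\[ t_{N_1+N_2-2} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2 \left( \frac{1}{N_1} + \frac{1}{N_2} \right)}} \]
Autocratic teams: \( \bar{y}_1 = 9.6 \), \( s_1^2 = 4.3 \), \( N_1 = 5 \)
Democratic teams: \( \bar{y}_2 = 11.0 \), \( s_2^2 = 2.5 \), \( N_2 = 5 \)
\[ s_p^2 = s_{\text{pooled}}^2 = \frac{s_1^2 + s_2^2}{2} = 3.4 \]
\[ t_{N_1+N_2-2} = \frac{11.0 - 9.6}{\sqrt{3.4 \left( \frac{1}{5} + \frac{1}{5} \right)}} = \frac{1.4}{\sqrt{1.36}} = 1.20 \]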
Attention!
Statistically significant ≠ Theoretically relevant
\[ t_{N-2} = \frac{r \sqrt{N-2}}{\sqrt{1-r^2}} \ge 2 \;\Rightarrow\; \frac{r^2 (N-2)}{1-r^2} \ge 4 \;\Rightarrow\; N \ge \frac{4 (1-r^2)}{r^2} + 2 \]
You can always find a sample size N
for which you get a significant test
result
r = .04 (r² = .0016) with N = 3000 yields t = 2.19
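A short sketch shows how quickly a negligible correlation becomes "significant". It assumes the rough criterion t ≥ 2 as the threshold for significance; the function names `t_from_r` and `min_n_for_significance` are invented for this illustration.

```python
import math


def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom for a correlation r."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


def min_n_for_significance(r, t_threshold=2.0):
    """Approximate smallest N for which t reaches the (assumed) threshold."""
    # Solving r * sqrt(N - 2) / sqrt(1 - r^2) >= t for N.
    return math.ceil(t_threshold ** 2 * (1 - r ** 2) / r ** 2 + 2)


r = 0.04
n_min = min_n_for_significance(r)
t_3000 = t_from_r(r, 3000)
print(n_min, round(t_3000, 2))
```

For r = .04 the required N is about 2500, so a sample of 3000 is already enough, even though the correlation explains only 0.16% of the variance.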