Sampling and correlation
Sampling
BASIC DEFINITIONS
Population = aggregate set of all cases that conform to some set of specifications
Subpopulation = population stratum = stratum
Population element = single member of the population
Census = count of the elements in the population; determination of the characteristics of the population based on information on all members
Sample = selection of some elements of the population
A representative sampling plan carries the assurance that, say, 90% of the time (confidence level) the population estimates based on the sample differ by no more than 5% (margin of error) from the true value.
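What such an assurance means can be checked by simulation. The sketch below is an illustration only and not part of the original slides: the population proportion (0.5), the sample size (271), the number of repetitions and the reading of "5%" as five percentage points are all assumptions chosen for the example.

```python
import random

def coverage(p_true, n, margin, reps, seed=0):
    """Fraction of simulated samples whose estimate lies within `margin` of p_true."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        # Draw n independent yes/no observations from a population
        # in which a share p_true has the characteristic.
        successes = sum(rng.random() < p_true for _ in range(n))
        estimate = successes / n
        if abs(estimate - p_true) <= margin:
            hits += 1
    return hits / reps

# Assumed values, for illustration only.
share_within_margin = coverage(p_true=0.5, n=271, margin=0.05, reps=2000)
print(f"estimates within the margin of error: {share_within_margin:.1%}")
```

With these assumed values roughly nine out of ten simulated samples should land within the margin; a smaller n lowers that share, a larger n raises it.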
Nonprobability sampling
• Accidental Samples
  • select the first population elements you encounter. Danger: underrepresentation of minorities, females, etc.
• Quota Sampling
  • accidental sample while taking care to have all strata represented in the sample as in the population. Danger: "own-friends" bias
• Purposive Samples
  • pick cases that are judged to be "typical" of the target population. Danger: judgement ...
Probability sampling
• Simple Random Samples
  • selection based on random numbers such that each population element has an equal and independent probability of being sampled
• Stratified Random Sample
• Cluster Sample
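A simple random sample can be drawn with a few lines of Python. This is a minimal sketch under stated assumptions: the sampling frame of 1000 numbered elements, the sample size of 50 and the seed are hypothetical, and `random.sample` draws without replacement, so every subset of the requested size is equally likely.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every element has the same chance of selection."""
    rng = random.Random(seed)
    return rng.sample(population, n)

# Hypothetical sampling frame: element IDs 1..1000.
frame = list(range(1, 1001))
sample = simple_random_sample(frame, 50, seed=42)
print(sorted(sample)[:5])
```

A stratified random sample could reuse the same function by calling it once per stratum.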
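Determine a proper sample size
$$t_{N_1+N_2-2} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}, \qquad s_p^2 = s_{\text{pooled}}^2 = \frac{s_1^2 + s_2^2}{2}$$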
Lambda (λ) (Goodman & Kruskal, 1954)
To what extent does prediction of rows (columns) improve if the column (row) is known in a 2×2 contingency table?

                  Divorce %
Marriage %     Low    High    Total
Low             18       5       23
High             6      19       25
Total           24      24       48

Marriage rate predicted WITHOUT knowledge of divorce rate: High (25 > 23)
$$\text{Error \%} = \frac{23}{48} \times 100 = 47.9\%$$
Marriage rate predicted WITH knowledge of divorce rate: High if Divorce % high, Low if Divorce % low
$$\text{Error \%} = \frac{5 + 6}{48} \times 100 = 22.9\%$$
Improvement in predicting marriage rate:
$$\lambda = \frac{47.9\% - 22.9\%}{47.9\%} = 0.522$$
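The same calculation can be written out for any contingency table of counts. The function below is a sketch; its name and the list-of-rows layout are choices made for this example, with rows holding the variable being predicted (marriage rate) and columns the known variable (divorce rate).

```python
def goodman_kruskal_lambda(table):
    """Lambda for predicting the row variable from the column variable.

    `table[i][j]` is the count in row category i and column category j.
    """
    total = sum(sum(row) for row in table)
    # Without column knowledge: always predict the most frequent row category.
    errors_without = total - max(sum(row) for row in table)
    # With column knowledge: predict the most frequent row within each column.
    columns = list(zip(*table))
    errors_with = sum(sum(col) - max(col) for col in columns)
    return (errors_without - errors_with) / errors_without

# Rows: marriage rate (low, high); columns: divorce rate (low, high).
counts = [[18, 5],
          [6, 19]]
lam = goodman_kruskal_lambda(counts)
print(round(lam, 3))  # 0.522
```

Working with error counts (23 and 11) rather than rounded percentages gives the same value, because the common denominator 48 cancels.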
Generalization: correlation between continuous variables
λ: To what extent does prediction of rows (columns) improve if the column (row) is known in a 2×2 contingency table?
Generalization
r²: To what extent does prediction of y improve if this prediction is based on x substituted in the regression line y = ax + b, compared with a prediction made without knowing that line?
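Sum of squared errors WITHOUT knowing X:
$$\sum_{i=1}^{n} (y_i - \bar{y})^2$$
Sum of squared errors WITH knowledge of X:
$$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
Improvement:
$$r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$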
Correlation between one continuous variable and one dichotomous variable

Number of puzzles solved by
Autocratic Teams    Democratic Teams
        8                  10
       10                  12
        7                   9
       11                  11
       12                  13
mean = 9.6          mean = 11.0

[Figure: number of puzzles solved (axis 0 to 14) plotted against leadership style (0, 1), with the regression line]

Dummy variable: Autocratic = 0, Democratic = 1
Regression line: $\hat{y} = 9.6 + 1.4x$
Computation of r²

Team    x_i    y_i    ŷ_i     (y_i − ŷ_i)²    (y_i − ȳ)²
1        0       8     9.6        2.56            5.29
2        0      10     9.6         .16             .09
3        0       7     9.6        6.76           10.89
4        0      11     9.6        1.96             .49
5        0      12     9.6        5.76            2.89
6        1      10    11.0        1.00             .09
7        1      12    11.0        1.00            2.89
8        1       9    11.0        4.00            1.69
9        1      11    11.0        0.00             .49
10       1      13    11.0        4.00            7.29
Sum            103               27.20           32.10

$\bar{y} = 10.3$
$$r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} = \frac{32.10 - 27.20}{32.10} = .15$$
If we know the leadership style in a team, then we can predict their
productivity 15% better than without that knowledge
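The r² for the ten teams can be recomputed directly from the puzzle counts. The sketch below uses only the data from the slides; the variable names are choices made for this example. It relies on the fact that, with a 0/1 dummy variable, the regression line predicts each team at its own group mean.

```python
autocratic = [8, 10, 7, 11, 12]    # dummy x = 0
democratic = [10, 12, 9, 11, 13]   # dummy x = 1

def mean(values):
    return sum(values) / len(values)

y = autocratic + democratic
grand_mean = mean(y)

# Errors without knowing x: every team is predicted at the grand mean.
ss_without = sum((yi - grand_mean) ** 2 for yi in y)

# Errors with x known: each team is predicted at its own group mean,
# which is what the fitted line through a 0/1 dummy gives.
ss_with = sum(
    (yi - mean(group)) ** 2
    for group in (autocratic, democratic)
    for yi in group
)

r_squared = (ss_without - ss_with) / ss_without
print(round(ss_without, 2), round(ss_with, 2), round(r_squared, 2))  # 32.1 27.2 0.15
```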
r² depends on the difference between the group averages and the variance within the groups.
[Figure: two panels, each labelled "Higher r²"]
Relation between r and t-test
H0: there is no relation between a dichotomous variable x (group) and a continuous variable y.
H0 is not likely to be true if r² is high. WHAT IS HIGH?
Statistics: if no relation exists between x and y in the population, then the sampling distribution of
$$t_{N-2} = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}$$
is a t-distribution with N − 2 degrees of freedom.
Student t-distribution versus Normal distribution
[Figure: density curves of N(0,1), t(100) and t(5) over the range −3 to 3, with the Type I error region marked]
Example: leadership style and productivity
r² for leadership style and productivity equals .15 (N = 10 teams, r = .39)
$$t_{N-2} = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}}, \qquad t_8 = \frac{.39\sqrt{8}}{\sqrt{1-.39^2}} = 1.20$$
Conclusion: the obtained result is NOT strong enough to reject the null hypothesis that states that leadership style and productivity are unrelated …
WHICH IS NOT THE SAME AS "H0 IS TRUE" !!
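The step from r to t is easy to reproduce. In the sketch below the function name is a choice made for this example, and the comparison assumes a two-sided test at α = .05, for which the tabulated critical value of the t-distribution with 8 degrees of freedom is 2.306; the slides themselves do not state the significance level used.

```python
import math

def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom for a correlation r from n cases."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

t_value = t_from_r(0.39, 10)

# Assumption: two-sided test at alpha = .05; 2.306 is the tabulated
# critical value of the t-distribution with 8 degrees of freedom.
CRITICAL_T_8 = 2.306

print(round(t_value, 2))            # 1.2
print(abs(t_value) > CRITICAL_T_8)  # False: H0 is not rejected
```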
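Classical approach to t-test
$$t_{N_1+N_2-2} = \frac{\bar{y}_1 - \bar{y}_2}{\sqrt{s_p^2\left(\dfrac{1}{N_1} + \dfrac{1}{N_2}\right)}}$$
Autocratic teams: $\bar{y}_1 = 9.6$, $s_1^2 = 4.3$, $N_1 = 5$
Democratic teams: $\bar{y}_2 = 11.0$, $s_2^2 = 2.5$, $N_2 = 5$
$$s_p^2 = s_{\text{pooled}}^2 = \frac{s_1^2 + s_2^2}{2} = 3.4$$
Taking the difference as democratic minus autocratic, so that t is positive:
$$t_{N_1+N_2-2} = \frac{11.0 - 9.6}{\sqrt{3.4\left(\dfrac{1}{5} + \dfrac{1}{5}\right)}} = \frac{1.4}{\sqrt{1.36}} = 1.20$$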
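The classical computation can be checked against the raw puzzle counts. This is a minimal sketch using Python's standard `statistics` module; the function name is a choice made for this example, and the pooled variance is written in its general weighted form, which reduces to the simple average of the two variances when both groups have the same size, as they do here.

```python
import math
import statistics

autocratic = [8, 10, 7, 11, 12]
democratic = [10, 12, 9, 11, 13]

def pooled_t(group1, group2):
    """Two-sample t with pooled variance, df = N1 + N2 - 2."""
    n1, n2 = len(group1), len(group2)
    var1 = statistics.variance(group1)  # sample variance, divisor N - 1
    var2 = statistics.variance(group2)
    # General pooled variance; equals (var1 + var2) / 2 when n1 == n2.
    pooled = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    # Difference taken as group2 minus group1, as in the worked example.
    diff = statistics.mean(group2) - statistics.mean(group1)
    return diff / math.sqrt(pooled * (1 / n1 + 1 / n2))

t_value = pooled_t(autocratic, democratic)
print(round(t_value, 2))  # 1.2
```

The result agrees with the t value obtained from r, which is the point of the comparison: for one dichotomous and one continuous variable the two routes are the same test.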
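Attention!
Statistically significant ≠ theoretically relevant
$$t_{N-2} = \frac{r\sqrt{N-2}}{\sqrt{1-r^2}} \ge 2 \;\Longleftrightarrow\; \frac{r^2 (N-2)}{1-r^2} \ge 4 \;\Longleftrightarrow\; N \ge \frac{4\,(1-r^2)}{r^2} + 2$$
You can always find a sample size N for which you get a significant test result.
r = .04 (r² = .0016) with N = 3000 yields t = 2.19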
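How large N must be for a given r follows from rearranging the t formula for N. The sketch below assumes a rough cutoff of t = 2 for significance; the function names and that cutoff are choices made for this example, not values fixed by the slides.

```python
import math

def t_from_r(r, n):
    """t statistic with n - 2 degrees of freedom for a correlation r from n cases."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

def n_threshold(r, t_crit=2.0):
    """Sample size at which a correlation r reaches t = t_crit (assumed cutoff)."""
    return t_crit ** 2 * (1 - r ** 2) / r ** 2 + 2

print(round(t_from_r(0.04, 3000), 2))  # 2.19
print(round(n_threshold(0.04)))        # 2498
```

A correlation of .04 explains 0.16% of the variance, yet about 2500 cases are enough to push it past the cutoff.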