Conceptual outline. Pearson linear correlation
What does a researcher want from statistics?
“I had fun, and I got it in
addition to my cool
microscope images!”
“I have done a statistical
analysis of my results, so
now give me my PhD,
pleeeease!”
1. How variable is it?
2. Does “my pet thing” work?
3. Why do the things differ?
4. Why does it fail from time to time?
5. Why do patients have different fates, and where is the hope for them?
6. What would be the outcome of a perturbation?
Generally speaking, all of statistics is about finding
relations between variables.
Basic concepts to understand
• Variability
• Variable
• Relation
• Signal vs. noise
• Factor vs. response (outcome), independent vs. dependent variables
• Statistical test
• Null hypothesis
• Power
• Experimental design
• Distribution
Deterministic vs. stochastic data
Two graph concepts:
Histograms:
show quantities of objects of particular qualities
as variable-height columns
[Figure: histogram; x-axis: Distance in chromosome, b.p. (0 to 14000); y-axis: No of obs (0 to 2400)]
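A minimal sketch of how the columns of a histogram are obtained, assuming equal-width bins. The function name `histogram_counts` and the sample distances are made up for illustration; they are not the data behind the slide.

```python
def histogram_counts(values, lo, hi, n_bins):
    """Count how many values fall into each of n_bins equal-width bins on [lo, hi]."""
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        if lo <= v <= hi:
            # A value equal to hi is folded into the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
    return counts


# Hypothetical distances in b.p., for illustration only
distances = [150, 300, 900, 1200, 2500, 2600, 7000, 13500]
counts = histogram_counts(distances, 0, 14000, 7)  # 7 bins of 2000 b.p. each
print(counts)
```

Each count becomes the height of one column; most of these made-up values land in the first bin, which is the kind of lopsided shape shown on the slide.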
Two graph concepts:
Scatterplots:
show objects arranged by 2 particular qualities
as coordinates
[Figure: Scatterplot (Irisdat 5v*150c) of SEPALWID against SEPALLEN, with fitted line SEPALWID = 3.4189 - 0.0619*x]
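The line drawn through the scatterplot has the form y = intercept + slope * x. A sketch of how such a line can be obtained, assuming an ordinary least-squares fit (the slides do not say which method was used). The function name `fit_line` and the points are illustrative, not the iris data.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations of x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points lying exactly on y = 1 + 2x
intercept, slope = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(intercept, slope)
```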
Two graph concepts:
Histograms vs. scatterplots
[Figure: Matrix Plot (Irisdat 5v*150c); variables: SEPALLEN, SEPALWID, PETALLEN, PETALWID]
Normal distribution
[Figure: histogram of SEPALWID; y-axis: No of obs]
------
+++---
+-+-+-
...
---+++
++++++
Not a normal distribution
[Figure: histogram; x-axis: Distance in chromosome, b.p.; y-axis: No of obs]
• Variance:
Var = Sum((deviation from mean)^2) / (N - 1)
• Standard deviation:
SD = square root of Var
• Skewness:
deviation of the distribution from symmetry
• Kurtosis:
“peakedness” of the distribution
• Standard error:
e.g. for the mean, SE = SD / square root of N
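The definitions above can be sketched in a few lines. This is an illustration under stated assumptions: the variance uses the N - 1 divisor, while skewness and kurtosis are the simple moment-based versions (kurtosis reported as excess over the normal value of 3). Statistical packages often apply small-sample corrections, so their numbers may differ slightly. The name `describe` and the sample values are hypothetical.

```python
import math


def describe(xs):
    """Basic descriptive statistics for a list of numbers."""
    n = len(xs)
    mean = sum(xs) / n
    dev = [x - mean for x in xs]

    var = sum(d ** 2 for d in dev) / (n - 1)  # sample variance
    sd = math.sqrt(var)                       # standard deviation
    se = sd / math.sqrt(n)                    # standard error of the mean

    # Central moments with divisor n (moment-based, uncorrected)
    m2 = sum(d ** 2 for d in dev) / n
    m3 = sum(d ** 3 for d in dev) / n
    m4 = sum(d ** 4 for d in dev) / n
    skewness = m3 / m2 ** 1.5     # 0 for a symmetric distribution
    kurtosis = m4 / m2 ** 2 - 3   # 0 for a normal distribution

    return {"mean": mean, "var": var, "sd": sd, "se": se,
            "skewness": skewness, "kurtosis": kurtosis}


stats = describe([2, 4, 4, 4, 5, 5, 7, 9])  # made-up sample
print(stats)
```

For this sample the skewness is positive (a longer tail toward large values) and the excess kurtosis is slightly negative (flatter than a normal curve).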
Kurtosis: positive
[Figure: histogram of SEPALWID; y-axis: No of obs]
Kurtosis: negative
[Figure: histogram of SEPALWID; y-axis: No of obs]
Skewness
[Figure: histogram of SEPALWID; y-axis: No of obs]
Analysis of correlations
• Simple linear correlation (Pearson r):
r = Mean(CoVar) / (StDev(X) x StDev(Y))
CoVar = (deviation of Xi from mean X) x (deviation of Yi from mean Y)
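The formula for r can be written out directly. A minimal sketch, assuming the same divisor (here N) is used for both the mean covariance and the standard deviations, so that it cancels. The function name `pearson_r` and the data are illustrative.

```python
import math


def pearson_r(xs, ys):
    """Pearson r = Mean(CoVar) / (StDev(X) * StDev(Y))."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Mean of the products of paired deviations
    mean_covar = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return mean_covar / (sd_x * sd_y)


xs = [1, 2, 3, 4, 5]
r_up = pearson_r(xs, [2, 4, 6, 8, 10])    # Y rises with X
r_down = pearson_r(xs, [10, 8, 6, 4, 2])  # Y falls as X rises
print(r_up, r_down)
```

Swapping the arguments gives the same value, which is one way to see that r by itself says nothing about which variable affects which.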
• How to interpret the values of correlations
– Positive: the higher X, the higher Y
– Negative: the higher X, the lower Y
– ~0: no linear relation
Strength (rule of thumb):
– |r| > 0.7: strong
– 0.25 < |r| < 0.7: medium
– |r| < 0.25: weak
• Outliers
• Correlations in non-homogeneous groups
• Nonlinear relations between variables
• Measuring nonlinear relations
• Spurious correlations
• Multiple comparisons and Bonferroni
correction
• Coefficient of determination: r^2
• How to determine whether two correlation
coefficients are significant
• Other correlation coefficients
When should it not work?
[Figure: scatterplot of ASSETS against INCOME with marginal histograms]
• Graphs
• 2D graphs
• Scatterplots w/Histograms
Exploratory examination of correlation matrices
[Figure: Matrix Plot (Irisdat 5v*150c); variables: SEPALLEN, SEPALWID, PETALLEN, PETALWID]
When should it not work?
[Figure: scatterplot of Var2 against NewVar with marginal histograms]
Normalize it!
E.g. NewX = log(X)
[Figure: scatterplot of NewVar2 against NewVar1 with marginal histograms]
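A small sketch of why the log transform matters for strongly skewed variables. On the raw scale a single very large case can dominate r; after NewX = log(X) every case contributes on a comparable scale. The data are made up so that the effect is easy to see, and `pearson_r` is an illustrative helper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical skewed variables spanning several orders of magnitude
x = [1, 10, 100, 1000, 10000]
y = [10, 1, 1000, 100, 100000]

r_raw = pearson_r(x, y)

# NewX = log(X), NewY = log(Y); the base of the logarithm does not change r
new_x = [math.log(v) for v in x]
new_y = [math.log(v) for v in y]
r_log = pearson_r(new_x, new_y)

print(r_raw, r_log)
```

Here the raw coefficient is above 0.99 almost entirely because of the largest case, while the coefficient on the log scale is about 0.82 and reflects all five cases.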
Causality
There is no way to establish from a correlation
which variable affects which.
It is just about a relation.
• Casewise vs. pairwise deletion of missing
data
• How to identify biases caused by
pairwise deletion of missing data
• Pairwise deletion of missing data vs. mean
substitution
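The three ways of handling missing data differ in which cases end up in each correlation. A minimal sketch with a made-up table of three variables, where `None` marks a missing value; the function names are illustrative.

```python
# Each row is one case: (var0, var1, var2); None marks a missing value.
rows = [
    (1.0, 2.0, 5.0),
    (2.0, None, 4.0),
    (3.0, 6.0, None),
    (4.0, 8.0, 2.0),
    (5.0, 10.0, 1.0),
]


def casewise(rows):
    """Casewise deletion: keep only cases with no missing value at all."""
    return [r for r in rows if all(v is not None for v in r)]


def pairwise(rows, i, j):
    """Pairwise deletion: for variables i and j, keep cases where both are present."""
    return [(r[i], r[j]) for r in rows if r[i] is not None and r[j] is not None]


def mean_substitute(rows, j):
    """Mean substitution: replace missing values of variable j by its mean."""
    present = [r[j] for r in rows if r[j] is not None]
    mean_j = sum(present) / len(present)
    return [r[j] if r[j] is not None else mean_j for r in rows]


print(len(casewise(rows)))         # cases shared by every correlation
print(len(pairwise(rows, 0, 1)))   # cases used for var0 vs. var1
print(len(pairwise(rows, 1, 2)))   # cases used for var1 vs. var2
print(mean_substitute(rows, 1))
```

With pairwise deletion each coefficient in the matrix may be based on a different subset of cases, which is the source of the possible bias. Mean substitution keeps all cases but artificially reduces the variability of the filled-in variable.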
StatSoft’s Statistica
• A perfect, almost universal tool for
researchers ranging from “very beginner” to
“advanced professional”.
• Old software with an intricate development
history
• Most of the methods can be found in >1 module
• Most of the modules contain >1 method
• No method is perfect
• No module is complete
• Most of the special modules are unavailable in
the basic “budget” license