PowerPoint Presentation - Asia University

17 Correlation
Chapter17 p399
Semimetric distance – Pearson correlation coefficient or Covariance
$$\mathrm{Var}(x) = s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
What about higher-dimensional data?
- It is useful to have a similar measure of how much the dimensions vary
from their means with respect to each other.
- Covariance is always measured between 2 dimensions.
- Suppose one has a 3-dimensional data set (X, Y, Z); then one can calculate
Cov(X,Y), Cov(X,Z) and Cov(Y,Z).

$$\mathrm{Cov}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
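The variance and covariance definitions above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names and the small data set (X, Y, Z) are made up for the example.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def variance(xs):
    """Sample variance is the covariance of a variable with itself."""
    return covariance(xs, xs)

# Hypothetical 3-dimensional data set (X, Y, Z), invented for illustration.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 6.0, 8.0, 10.0]
Z = [5.0, 4.0, 3.0, 2.0, 1.0]

print(variance(X))       # 2.5
print(covariance(X, Y))  # 5.0
print(covariance(X, Z))  # -2.5
print(covariance(Y, Z))  # -5.0
```

The sign of each covariance shows whether the two dimensions move together (positive) or in opposite directions (negative); its size depends on the units of the data.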
- To compare heterogeneous pairs of variables, define the correlation
coefficient, or Pearson correlation coefficient (PCC), with -1 ≤ r_XY ≤ 1:
$$r_{XY} = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
-1 → perfect anticorrelation
0 → uncorrelated (no linear relationship)
+1 → perfect correlation
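A short sketch of the Pearson correlation coefficient, assuming the usual normalisation by the square root of the product of the variances. The n - 1 factors cancel, so the code works directly with sums of deviations. The function name and the data are illustrative only.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical data: Y rises linearly with X, Z falls linearly with X.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0 * x + 1.0 for x in X]
Z = [-x for x in X]

r_xy = pearson(X, Y)  # close to +1: perfect correlation
r_xz = pearson(X, Z)  # close to -1: perfect anticorrelation
print(r_xy, r_xz)
```

Because the result is dimensionless, pairs of variables measured in different units can be compared on the same scale.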
Semimetric distance – the squared Pearson correlation coefficient
• The Pearson correlation coefficient (PCC) is useful for examining correlations
in the data.
• One may imagine an instance, for example, in which the same transcription
factor (TF) causes both enhancement and repression of expression, so that
positively and negatively correlated genes are equally related.
• A better alternative in that case is the squared Pearson correlation coefficient:
$$r_{sq} = r_{XY}^2 = \frac{[\mathrm{Cov}(X, Y)]^2}{\mathrm{Var}(X)\,\mathrm{Var}(Y)}$$
The squared PCC takes values in the range 0 ≤ r_sq ≤ 1.
0 → uncorrelated vectors
1 → perfectly correlated or anti-correlated vectors
The PCC and the squared PCC are measures of similarity.
Similarity and distance are inversely related:
similarity↑ → distance↓
→ d = 1 - r is typically used as a measure of distance
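The following sketch contrasts the two similarity measures on made-up expression profiles. The distance d = 1 - r is taken from the text; using 1 - r_sq as the corresponding distance for the squared PCC is an assumption made here for illustration. All names and values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical profiles: one gene enhanced, one repressed, by the same factor.
profile = [1.0, 2.0, 3.0, 4.0]
enhanced = [2.0, 4.0, 6.0, 8.0]
repressed = [8.0, 6.0, 4.0, 2.0]

results = {}
for name, other in [("enhanced", enhanced), ("repressed", repressed)]:
    r = pearson(profile, other)
    # Store (r, r_sq, distance from r, assumed distance from r_sq).
    results[name] = (r, r * r, 1.0 - r, 1.0 - r * r)
    print(name, results[name])
```

With d = 1 - r the repressed gene ends up at the largest possible distance (about 2) from the profile, while the squared PCC places both genes at distance about 0, which matches the idea that both are driven by the same regulator.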
Semimetric distance – Pearson correlation coefficient or Covariance
- The resulting r_XY value is greater than 0 if X and Y tend to increase
together, less than 0 if one tends to decrease as the other increases, and 0 if
they are independent.
Remark: r_XY only tests whether there is a linear dependence, Y = aX + b.
- If two variables are independent → low r_XY.
- A low r_XY does not necessarily mean independence; the relation may be non-linear.
- A high r_XY is a sufficient but not necessary condition for variable dependence.
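A small made-up example of the remark above: Y is completely determined by X, yet the Pearson coefficient is zero because the dependence is not linear. The data and function name are illustrative.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Y = X^2 on values of X placed symmetrically around zero.
X = [-2.0, -1.0, 0.0, 1.0, 2.0]
Y = [x * x for x in X]

r = pearson(X, Y)
print(r)  # zero: no linear relation, although Y depends entirely on X
```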
Semimetric distance – the squared Pearson correlation coefficient
• To test for a non-linear relation among the data, one can apply
a transformation by variable substitution.
• Suppose one wants to test the power-law relation u(v) = a v^n.
• Take the logarithm of both sides:
• log u = log a + n log v
• Set Y = log u, b = log a, and X = log v.
• → a linear relation, Y = b + nX
• → log u correlates (n > 0) or anti-correlates (n < 0) with log v
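The substitution above can be tried on synthetic data. In this sketch the values a = 2 and n = 3 are hypothetical, and the least-squares line fit used to recover n and log a is an addition for illustration; the text itself only asks whether log u and log v are linearly related.

```python
from math import log, sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares (slope, intercept) of ys = intercept + slope * xs."""
    m = len(xs)
    x_bar, y_bar = sum(xs) / m, sum(ys) / m
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar

# Synthetic power law u = a * v^n with hypothetical a and n.
a, n = 2.0, 3.0
v = [1.0, 2.0, 4.0, 8.0, 16.0]
u = [a * vi ** n for vi in v]

X = [log(vi) for vi in v]  # X = log v
Y = [log(ui) for ui in u]  # Y = log u

r_raw = pearson(v, u)      # below 1: u is not linear in v
r_log = pearson(X, Y)      # close to 1: Y is linear in X
slope, intercept = fit_line(X, Y)
print(r_raw, r_log, slope, intercept)
```

The slope of the fitted line estimates the exponent n and the intercept estimates log a.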
Semimetric distance – Pearson correlation coefficient or Covariance matrix
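A covariance matrix is simply the collection of all pairwise covariances arranged
as a d x d matrix, where d is the number of dimensions. Entry (i, j) is the
covariance between dimensions i and j, so the diagonal holds the variances and
the matrix is symmetric.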
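For a 3-dimensional data set (X, Y, Z), the standard form of the covariance matrix is

$$C = \begin{pmatrix} \mathrm{Var}(X) & \mathrm{Cov}(X,Y) & \mathrm{Cov}(X,Z) \\ \mathrm{Cov}(Y,X) & \mathrm{Var}(Y) & \mathrm{Cov}(Y,Z) \\ \mathrm{Cov}(Z,X) & \mathrm{Cov}(Z,Y) & \mathrm{Var}(Z) \end{pmatrix}$$

A minimal sketch of building it in Python follows. The function names and the data are hypothetical.

```python
from statistics import mean

def covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    x_bar, y_bar = mean(xs), mean(ys)
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (len(xs) - 1)

def covariance_matrix(columns):
    """Return the d x d matrix whose (i, j) entry is Cov(column i, column j)."""
    return [[covariance(ci, cj) for cj in columns] for ci in columns]

# Hypothetical 3-dimensional data set, one list per dimension.
X = [1.0, 2.0, 3.0, 4.0, 5.0]
Y = [2.0, 4.0, 6.0, 8.0, 10.0]
Z = [5.0, 4.0, 3.0, 2.0, 1.0]

C = covariance_matrix([X, Y, Z])
for row in C:
    print(row)
# [2.5, 5.0, -2.5]
# [5.0, 10.0, -5.0]
# [-2.5, -5.0, 2.5]
```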
Spearman’s rank correlation (SRC)
• One of the problems with using the PCC is that it is susceptible to being skewed
by outliers: a single data point can result in two genes appearing to be correlated,
even when all the other data points suggest that they are not.
• Spearman’s rank correlation (SRC) is a non-parametric measure of correlation
that is robust to outliers.
• SRC is a measure that ignores the magnitude of the changes. The idea of the
rank correlation is to transform the original values into ranks, and then to
compute the correlation between the series of ranks.
• First, sort the values of gene A and of gene B separately in ascending order,
and assign rank 1 to the lowest value. The SRC between A and B is defined as the
PCC between the ranks of A and the ranks of B.
• In case of ties, assign mid-ranks: if two values tie for ranks 5 and 6, both
are assigned rank 5.5.
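The ranking step, including mid-ranks for ties, might be sketched as follows. The function name `midranks` and the sample values are hypothetical.

```python
def midranks(values):
    """Rank values in ascending order (lowest = 1); tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Mean of the 1-based positions i + 1 .. j + 1.
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

# Two values tie for ranks 5 and 6, so both receive 5.5.
print(midranks([0.1, 0.2, 0.3, 0.4, 0.7, 0.7]))  # [1.0, 2.0, 3.0, 4.0, 5.5, 5.5]
print(midranks([30, 10, 20, 20]))                # [4.0, 1.0, 2.5, 2.5]
```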
Spearman’s rank correlation
The SRC can be calculated with the following formula, where x_i and y_i
denote the ranks of the i-th values of X and Y, respectively.
$$r_{SRC}(X, Y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\left[\sum_{i=1}^{n} (x_i - \bar{x})^2\right]\left[\sum_{i=1}^{n} (y_i - \bar{y})^2\right]}}$$
A simpler formula, exact when there are no ties and only approximate when ties are present, is
$$r_{SRC}(X, Y) = 1 - \frac{6 \sum_{i=1}^{n} (x_i - y_i)^2}{n(n^2 - 1)}$$
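The two SRC formulas can be compared on small made-up samples. This sketch assumes mid-ranks for ties; all function names and data are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def midranks(values):
    """Ascending ranks starting at 1; tied values share the mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks

def spearman(xs, ys):
    """SRC defined as the PCC between the two series of ranks."""
    return pearson(midranks(xs), midranks(ys))

def spearman_shortcut(xs, ys):
    """SRC from squared rank differences: 1 - 6 * sum(d^2) / (n (n^2 - 1))."""
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(midranks(xs), midranks(ys)))
    return 1 - 6 * d2 / (n * (n * n - 1))

# No ties: both formulas give the same value (0.8, up to rounding error).
xs, ys = [3, 1, 4, 2, 5], [2, 1, 5, 3, 4]
print(spearman(xs, ys), spearman_shortcut(xs, ys))

# One tie in tx: the shortcut is now only an approximation.
tx, ty = [1, 2, 2, 3, 4], [1, 3, 2, 4, 5]
exact, approx = spearman(tx, ty), spearman_shortcut(tx, ty)
print(exact, approx)
```

In the tied example the two results differ only in the fourth decimal place, which is why the shortcut is often acceptable when ties are rare.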
SRC vs. PCC
Time   Gene A ratio   Gene B ratio   Gene A rank   Gene B rank
0.5    -0.76359       -4.05957       1             1
2       2.276659      -1.7788        6             2
5       2.137332      -0.97433       5             4
7       1.900334      -1.44114       4             3
9       0.932457      -0.87574       3             5
11      0.761866      -0.52328       2             6
PCC(A, B) = 0.633
SRC(A, B) = -0.086
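The two values quoted for this data set can be recomputed from the table with a few lines of Python. The variable and function names are illustrative. The last calculation, which drops the first time point, is an extra check added here to show how much a single point influences the PCC.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Ratios and ranks copied from the table (time points 0.5, 2, 5, 7, 9, 11).
gene_a = [-0.76359, 2.276659, 2.137332, 1.900334, 0.932457, 0.761866]
gene_b = [-4.05957, -1.7788, -0.97433, -1.44114, -0.87574, -0.52328]
rank_a = [1, 6, 5, 4, 3, 2]
rank_b = [1, 2, 4, 3, 5, 6]

pcc = pearson(gene_a, gene_b)  # PCC on the raw ratios
src = pearson(rank_a, rank_b)  # SRC = PCC on the ranks
print(round(pcc, 3), round(src, 3))  # 0.633 -0.086

# Extra check: PCC without the first time point (t = 0.5) comes out negative.
pcc_without_first = pearson(gene_a[1:], gene_b[1:])
print(pcc_without_first)
```

The positive PCC is driven largely by the single low point at time 0.5; the rank-based SRC is close to zero because it ignores how far that point lies from the others.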
Chapter17 p401
Chapter17 p408