
Canadian Bioinformatics Workshops
www.bioinformatics.ca
Module 6
Backgrounder in Statistical Methods
David Wishart
Schedule

Time        | Day 1                                                 | Day 2
8:30-10:30  | Mod. 1: Introduction to Metabolomics                  | Mod. 5: LC-MS Spectral Processing using XCMS
10:30-11:00 | Coffee                                                | Coffee
11:00-12:30 | Mod. 2: Metabolite ID and Quantification Pt. I + Lab  | Mod. 6: Backgrounder in Statistical Methods
12:30-1:30  | Lunch                                                 | Lunch
1:30-3:00   | Mod. 3: Metabolite ID and Annotation – Part II        | Mod. 7: Metabolomic Data Analysis w. MetaboAnalyst
3:00-3:30   | Coffee                                                | Coffee
3:30-5:00   | Mod. 4: Databases for Chemical/Spectral Data          | Mod. 8: Data Integration & Applications
5:00-6:30   | Dinner                                                | Survey & Close Out
6:30-9:00   | Integrated Assignment                                 |
Learning Objectives
• Learn about distributions and significance
• Learn about univariate statistics (t-tests and ANOVA)
• Learn about correlation and clustering
• Learn about multivariate statistics (PCA and PLS-DA)
Statistics
• "There are three kinds of lies: lies, damned lies, and statistics" - Benjamin Disraeli
• "98% of all statistics are made up" - Unknown
• "Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital" - Aaron Levenstein
• Statistics is the mathematics of impressions
Distributions &
Significance
Univariate Statistics
• Univariate means a single variable
• If you measure a population using some single measure such as height, weight, test score, or IQ, you are measuring a single variable
• If you plot that single variable over the whole population, measuring the frequency with which each value occurs, you will get the following:
A Bell Curve
[Figure: bell curve; x-axis: Height, y-axis: # of each]
Also called a Gaussian or Normal Distribution
Features of a Normal Distribution
• Symmetric distribution
• Has an average or mean value (μ) at the centre
• Has a characteristic width called the standard deviation (σ)
• Most common type of distribution known
Normal Distribution
• Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution
• The larger the set of measurements, the more "normal" the curve
• The minimum set of measurements needed to get a normal distribution is about 30-40
Gaussian Distribution

P(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / 2σ²)

[Figure: normal curve with the x-axis marked at μ−3σ, μ−2σ, μ−σ, μ, μ+σ, μ+2σ, μ+3σ]
Some Equations

Mean:               μ = Σxᵢ / N
Variance:           σ² = Σ(xᵢ − μ)² / N
Standard Deviation: σ = √( Σ(xᵢ − μ)² / N )
Standard Deviations (Z-values)

μ ± 1.0 S.D.  0.683     |  > μ + 1.0 S.D.  0.158
μ ± 2.0 S.D.  0.954     |  > μ + 2.0 S.D.  0.023
μ ± 3.0 S.D.  0.9972    |  > μ + 3.0 S.D.  0.0014
μ ± 4.0 S.D.  0.99994   |  > μ + 4.0 S.D.  0.00003
μ ± 5.0 S.D.  0.999998  |  > μ + 5.0 S.D.  0.000001

(fraction within the range | fraction beyond, upper tail)
[Figure: the normal curve and its equation, P(x) = (1 / (σ√(2π))) · e^(−(x−μ)² / 2σ²), marked from μ−3σ to μ+3σ]
Significance
• Based on the Normal Distribution, the probability that something is >1 SD away (larger or smaller) from the mean is 32%
• Based on the Normal Distribution, the probability that something is >2 SD away (larger or smaller) from the mean is 5%
• Based on the Normal Distribution, the probability that something is >3 SD away (larger or smaller) from the mean is 0.3% (see the sketch below)
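A quick numerical check of these three probabilities; a minimal sketch using scipy (not part of the original slides):

```python
from scipy.stats import norm

# Probability of lying more than k standard deviations from the mean
# (larger or smaller, i.e. both tails) under a normal distribution
for k in [1, 2, 3]:
    p = 2 * norm.sf(k)          # sf(k) = 1 - cdf(k) = upper-tail area
    print(f">{k} SD away: {p:.4f}")
# >1 SD: 0.3173 (~32%), >2 SD: 0.0455 (~5%), >3 SD: 0.0027 (~0.3%)
```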
Significance
• In a test with a class of 400 students, if you score the average you typically receive a "C"
• In a test with a class of 400 students, if you score 1 SD above the average you typically receive a "B"
• In a test with a class of 400 students, if you score 2 SD above the average you typically receive an "A"
The P-value
• The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed
• One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01
• When the null hypothesis is rejected, the result is said to be statistically significant
P-value
• If the average height of an adult (M+F) human is 5’ 7” and the standard deviation is 5”, what is the probability of finding someone who is more than 6’ 10”?
• If you choose an α of 0.05, is a 6’ 11” individual a member of the human species?
• If you choose an α of 0.01, is a 6’ 11” individual a member of the human species? (see the sketch below)
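A sketch of the arithmetic behind these questions, with heights converted to inches (5'7" = 67", 6'10" = 82" = 3 SD above the mean, 6'11" = 83"):

```python
from scipy.stats import norm

mu, sd = 67, 5                       # mean 5'7", SD 5", in inches
for height in (82, 83):              # 6'10" and 6'11"
    z = (height - mu) / sd
    p = norm.sf(z)                   # upper-tail probability
    print(f"{height} in: z = {z:.1f}, p = {p:.5f}")
# 82 in: z = 3.0, p = 0.00135;  83 in: z = 3.2, p = 0.00069
# Both p-values fall below alpha = 0.05 and below alpha = 0.01
```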
P-value
• If you flip a coin 20 times and the coin turns up heads 14/20 times, the probability of a result at least this extreme is 60,460/1,048,576 ≈ 0.058
• If you choose an α of 0.05, is this coin a fair coin?
• If you choose an α of 0.10, is this coin a fair coin? (see the sketch below)
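The exact binomial arithmetic behind those numbers (60,460 of the 2²⁰ = 1,048,576 equally likely outcomes have 14 or more heads):

```python
from math import comb

# P(14 or more heads in 20 flips of a fair coin)
p = sum(comb(20, k) for k in range(14, 21)) / 2**20
print(p)        # 0.0577
# p > 0.05: cannot call the coin unfair at alpha = 0.05
# p < 0.10: the coin looks unfair at alpha = 0.10
```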
Mean, Median & Mode
[Figure: skewed distribution with the mode at the peak, the mean pulled toward the tail, and the median between them]
Mean, Median, Mode
• In a Normal Distribution the mean, mode and median are all equal
• In skewed distributions they are unequal
• Mean - the average value, affected by extreme values in the distribution
• Median - the "middlemost" value, usually halfway between the mode and the mean
• Mode - the most common value
Different Distributions
[Figure: a unimodal distribution (one peak) and a bimodal distribution (two peaks)]
Other Distributions
• Binomial Distribution
• Poisson Distribution
• Extreme Value Distribution
• Skewed or Exponential Distribution
Binomial Distribution

P(x) is given by the terms of the expansion of (p + q)ⁿ, with coefficients from Pascal's triangle:

1
1 1
1 2 1
1 3 3 1
1 4 6 4 1
1 5 10 10 5 1
Poisson Distribution

P(x) = μˣ e^(−μ) / x!

[Figure: Poisson distributions for μ = 0.1, 1, 2, 3 and 10; x-axis: x, y-axis: proportion of samples P(x)]
Extreme Value Distribution
• Arises from sampling the extreme end of a normal distribution
• A distribution which is "skewed" due to its selective sampling
• Skew can be either right or left
[Figure: extreme value distribution compared with a Gaussian distribution]
Skewed Distribution
• Resembles an exponential or Poisson-like distribution
• Lots of extreme values (outliers) far from the mean or mode
• Hard to do useful statistical tests with this type of distribution
[Figure: skewed distribution with outliers in the long tail]
Fixing a Skewed Distribution
• A skewed distribution or exponentially decaying distribution can be transformed into a "normal" or Gaussian distribution by applying a log transformation
• This brings the outliers a little closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian (see the sketch below)
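A minimal sketch of the effect, using simulated log-normal (i.e. right-skewed) concentrations; the numbers are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.lognormal(mean=9.5, sigma=1.0, size=500)   # skewed "concentrations"

# Raw scale: outliers drag the mean well above the median
print(f"raw: mean = {x.mean():.0f}, median = {np.median(x):.0f}")

# Log scale: mean ~ median, distribution is roughly Gaussian
logx = np.log(x)
print(f"log: mean = {logx.mean():.2f}, median = {np.median(logx):.2f}")
```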
Log Transformation
[Figure: histogram of exp't B on a linear scale (V5, 0-64,000) is skewed; after log transformation (V4, 8.0-16.0) the distribution is normal]
Log Transformation on Real Data
Distinguishing 2 Populations
[Figure: height distributions (# of each vs. Height) for Normals and Leprechauns: two well-separated bell curves] Are they different?

What about these 2 Populations?
[Figure: two strongly overlapping height distributions] Are they different?
Student’s t-Test
• Also called the t-Test
• Used to determine if 2 populations are different
• Formally allows you to calculate the probability that 2 sample means are the same
• If the t-Test statistic gives you p=0.4, and α is 0.05, then the 2 populations are the same
• If the t-Test statistic gives you p=0.04, and α is 0.05, then the 2 populations are different
• Paired and unpaired t-Tests are available; paired is used for "before & after" expts. while unpaired is for 2 randomly chosen samples (see the sketch below)
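Both variants are one-liners in scipy; a sketch on simulated data (the population parameters are invented for illustration):

```python
import numpy as np
from scipy.stats import ttest_ind, ttest_rel

rng = np.random.default_rng(1)

# Unpaired: 2 randomly chosen samples from 2 populations
normals     = rng.normal(170, 8, size=40)    # heights, cm
leprechauns = rng.normal(100, 8, size=40)
t, p = ttest_ind(normals, leprechauns)
print(f"unpaired: p = {p:.1e}")              # p << 0.05 -> different

# Paired: "before & after" measurements on the same subjects
before = rng.normal(5.0, 1.0, size=20)
after  = before + rng.normal(0.5, 0.5, size=20)
t, p = ttest_rel(before, after)
print(f"paired:   p = {p:.4f}")
```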
Student’s t-Test
• A t-Test can also be used to determine whether 2 clusters are different if the clusters follow a normal distribution
[Figure: two clusters in a Variable 1 vs. Variable 2 scatter plot]
What if the Distributions are not Normal?
Mann-Whitney U-Test
• Also called the Wilcoxon Rank Sum Test
• Used to determine if 2 non-normally distributed populations are different
• More powerful and robust than the t-test
• Formally allows you to calculate the probability that 2 sample medians are the same
• If the U-Test statistic gives you p=0.4, and α is 0.05, then the 2 populations are the same
• If the U-Test statistic gives you p=0.04, and α is 0.05, then the 2 populations are different (see the sketch below)
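A sketch using scipy on simulated skewed (non-normal) data; the values are invented for illustration:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(2)
group1 = rng.lognormal(2.0, 0.8, size=30)    # skewed concentrations
group2 = rng.lognormal(2.6, 0.8, size=30)

u, p = mannwhitneyu(group1, group2, alternative='two-sided')
print(f"U = {u:.0f}, p = {p:.4f}")
# p < alpha (0.05) -> the 2 populations are different
```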
Distinguishing 3+ Populations
[Figure: height distributions (# of each vs. Height) for Normals, Leprechauns and Elves: three well-separated bell curves] Are they different?

Distinguishing 3+ Populations
[Figure: three strongly overlapping height distributions] Are they different?
ANOVA
• Also called Analysis of Variance
• Used to determine if 3 or more populations are different; it is a generalization of the t-Test
• Formally, ANOVA provides a statistical test (by looking at group variance) of whether or not the means of several groups are all equal
• Uses an F-measure to test for significance
• There are 1-way, 2-way, 3-way and n-way ANOVAs; the most common is 1-way, which is concerned only with whether any of the 3+ populations are different, not which pair is different (see the sketch below)
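A 1-way ANOVA sketch using scipy on three simulated populations (parameters invented for illustration):

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(3)
normals     = rng.normal(170, 8, size=30)    # heights, cm
leprechauns = rng.normal(100, 8, size=30)
elves       = rng.normal(150, 8, size=30)

F, p = f_oneway(normals, leprechauns, elves)
print(f"F = {F:.1f}, p = {p:.1e}")
# A significant p says *some* group differs, not which pair
```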
ANOVA
• ANOVA can also be used to determine whether 3+ clusters are different -- if the clusters follow a normal distribution
[Figure: three clusters in a Variable 1 vs. Variable 2 scatter plot]
Distinguishing N Populations (False Discovery Rate)
• Suppose you performed 100 different t-tests, and found 20 results with a p value of <0.05
• What are the odds that one of these findings is going to be false? Roughly 20 × 0.05 = 1.00
• How many of these 20 tests are likely false positives? 20 × 0.05 = 1
• To correct for this you try to choose those results with a p value < 0.05/20, or p < 0.0025
Example (Some Weather Predictions)

P = 0.08   It will rain          P = 0.09   It will hail
P = 0.05   It will be sunny      P = 0.02   Lightning
P = 0.06   It will be foggy      P = 0.16   Thunder
P = 0.02   It’ll be cloudy       P = 0.001  Eclipse
P = 0.05   It will snow          P = 0.09   Tornado
P = 0.07   It will be windy      P = 0.18   Hurricane
P = 0.06   It will be calm       P = 0.05   Sleet

100% certainty it will do something tomorrow
Only one prediction is significant with FDR or Bonferroni correction (Eclipse) - checked in the sketch below
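Both corrections can be verified with the 14 p-values above; a short numpy sketch (the Benjamini-Hochberg FDR step-up rule is written out rather than taken from a library):

```python
import numpy as np

pvals = np.array([0.08, 0.05, 0.06, 0.02, 0.05, 0.07, 0.06,    # rain..calm
                  0.09, 0.02, 0.16, 0.001, 0.09, 0.18, 0.05])  # hail..sleet
alpha, m = 0.05, len(pvals)

# Bonferroni: compare each p against alpha/m = 0.0036
print("Bonferroni:", np.sum(pvals < alpha / m))        # 1 (the eclipse)

# Benjamini-Hochberg FDR: largest rank k with p(k) <= (k/m)*alpha
p_sorted = np.sort(pvals)
ok = np.where(p_sorted <= np.arange(1, m + 1) / m * alpha)[0]
print("BH FDR:    ", ok.max() + 1 if ok.size else 0)   # 1 (the eclipse)
```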
Normalization/Scaling
• What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)? We would get the following result:
[Figure: the biased population's height distribution shifted relative to the other]
Normalization
• Normalization adjusts for systematic bias in the measurement tool
• After normalization we would get:
[Figure: the two height distributions aligned]
Normalization
• Normalization also has other meanings in statistics...
• When working with univariate and multivariate statistics, normalization also means making the distribution look normal or Gaussian
• A key assumption in most statistical modeling is that the population is "normal" or Gaussian
Log Transformation for Normalization
[Figure: histogram of exp't B on a linear scale (V5, 0-64,000) is skewed; after log transformation (V4, 8.0-16.0) the distribution is normal]
Data Comparisons & Dependencies

Data Comparisons
• In many kinds of experiments we want to know what happened to a population "before" and "after" some treatment or intervention
• In other situations we want to measure the dependency of one variable against another
• In still others we want to assess how the observed property matches the predicted property
• In all cases we will measure multiple samples or work with a population of subjects
• The best way to view this kind of data is through a scatter plot
A Scatter Plot
Scatter Plots
• If there is some dependency between the two variables, or if there is a relationship between the predicted and observed variable, or if the "before" and "after" treatments led to some effect, then it is possible to see some clear patterns in the scatter plot
• This pattern or relationship is called correlation
Correlation
[Figure: scatter plots showing "+" correlation, no correlation, and "-" correlation]

Correlation
[Figure: scatter plots showing high, low, and perfect correlation]
Correlation Coefficient

r = Σ(xᵢ − μx)(yᵢ − μy) / √( Σ(xᵢ − μx)² · Σ(yᵢ − μy)² )

[Figure: scatter plots with r = 0.85, r = 0.4, and r = 1.0]
Correlation Coefficient
• Sometimes called the coefficient of linear correlation or Pearson product-moment correlation coefficient
• A quantitative way of determining what model (or equation or type of line) best fits a set of data
• Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies (see the sketch below)
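The formula above translates directly into numpy; a sketch on simulated correlated data, checked against the library routine:

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=50)
y = 0.8 * x + rng.normal(scale=0.5, size=50)     # correlated toy data

# Pearson r straight from the formula on the slide
num = np.sum((x - x.mean()) * (y - y.mean()))
den = np.sqrt(np.sum((x - x.mean())**2) * np.sum((y - y.mean())**2))
print(num / den, np.corrcoef(x, y)[0, 1])        # the two values agree
```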
Correlation Coefficient vs. Coefficient of Determination
• R (correlation coefficient) vs. R² (coefficient of determination)
• R and R² are very different
• Do not confuse R with R²
• Do not call R² a correlation coefficient – THIS IS WRONG
• Avoid using R² in discussions or comparisons in scientific papers
Significance of Correlation
[Figure: one scatter plot with r = 0.85, another with r = 0.99] Is this significant?

Significance & Correlation
[Figure: adding 2 more points to the plot changes r = 0.99 to r = 0.05]
Tricks to Getting Good (but Meaningless) Correlation Coefficients
• Use only data at extreme ends of the curve or line [Figure: r = 0.95] Is this significant?
• Use only a small number of "good" data points [Figure: r = 0.95] Is this significant?
Student’s t-Test (Again)
• The t-Test can also be used to assess the statistical significance of a correlation
• It specifically determines whether the slope of the regression line is statistically different from 0
• As might be expected, more points in a scatter plot lead to more confidence in the correlation (see the sketch below)
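scipy's pearsonr performs exactly this kind of test; a sketch on simulated data showing that more points shrink the p-value for the same underlying correlation:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
for n in (8, 200):
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(size=n)
    r, p = pearsonr(x, y)   # p: is the correlation different from 0?
    print(f"n = {n:3d}: r = {r:.2f}, p = {p:.4f}")
```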
Correlation and Outliers
[Figure: a good linear fit ruined by one outlying point - experimental error or something important?]
A single "bad" point can destroy a good correlation
Outliers
• Can be both "good" and "bad"
• When modeling data you don't like to see outliers (suggests the model is bad)
• Often a good indicator of experimental or measurement errors -- only you can know!
• When plotting metabolite concentration data you do like to see outliers
• A good indicator of something significant
Detecting Clusters
[Figure: Height vs. Weight scatter plot showing two apparent clusters]

Is it Right to Calculate a Correlation Coefficient?
[Figure: a single regression line through all the Height vs. Weight data, r = 0.73]

Or is There More to This?
[Figure: the same data separated into male and female clusters]
Clustering Applications in
Bioinformatics
• Metabolomics and Cheminformatics
• Microarray or GeneChip Analysis
• 2D Gel or ProteinChip Analysis
• Protein Interaction Analysis
• Phylogenetic and Evolutionary Analysis
• Structural Classification of Proteins
• Protein Sequence Families
Clustering
• Definition - a process by which objects that are logically similar in characteristics are grouped together
• Clustering is different from Classification
• In classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined
• Clustering helps in classification
Clustering Requires...
• A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
• A threshold value with which to decide whether an object belongs with a cluster
• A way of measuring the "distance" between two clusters
• A cluster seed (an object to begin the clustering process)
Clustering Algorithms
• K-means or Partitioning Methods - divides a set of N objects into M clusters -- with or without overlap
• Hierarchical Methods - produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
• Self-Organizing Feature Maps - produces a cluster set through iterative "training"
K-means or Partitioning Methods
• Make the first object the centroid for the first cluster
• For the next object, calculate the similarity to each existing centroid
• If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; else use the object to start a new cluster
• Return to step 2 and repeat until done (see the sketch below)
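A sketch of the threshold-based partitioning procedure described above (note: classic k-means instead fixes the number of clusters and iterates; this sequential variant follows the slide's steps, with Euclidean distance standing in for "similarity"):

```python
import numpy as np

def partition(objects, threshold):
    """Join each object to the nearest cluster centroid if it is
    within the threshold; otherwise start a new cluster."""
    clusters = [[objects[0]]]              # first object seeds cluster 1
    centroids = [np.asarray(objects[0], float)]
    for obj in objects[1:]:
        d = [np.linalg.norm(obj - c) for c in centroids]  # dist to centroids
        i = int(np.argmin(d))
        if d[i] < threshold:               # similar enough: join...
            clusters[i].append(obj)
            centroids[i] = np.mean(clusters[i], axis=0)   # ...update centroid
        else:                              # else: start a new cluster
            clusters.append([obj])
            centroids.append(np.asarray(obj, float))
    return clusters

waves = np.array([[500.], [510.], [620.], [495.], [630.]])  # wavelengths, nm
print([len(c) for c in partition(waves, threshold=50)])     # [3, 2]
```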
K-means or Partitioning Methods
[Figure: coloured spectra joined to an initial cluster one at a time; an object is tested and joined when its wavelength is within the threshold, λT = λcentroid ± 50 nm]
Hierarchical Clustering
• Find the two closest objects and merge them into a cluster
• Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
• If more than one cluster remains, return to step 2 until finished (see the sketch below)
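A sketch using scipy's hierarchical clustering on simulated expression profiles (the data and the choice of 2 clusters are illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(6)
# 6 samples x 5 metabolites, forming two groups
X = np.vstack([rng.normal(0, 1, (3, 5)), rng.normal(4, 1, (3, 5))])

Z = linkage(X, method='average')       # repeatedly merge the closest pair
labels = fcluster(Z, t=2, criterion='maxclust')  # cut the tree into 2 clusters
print(labels)                          # e.g. [1 1 1 2 2 2]
```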
Hierarchical Clustering
[Figure: spectra pairwise compared; the closest pair is merged first, then the next closest, using the rule λT = λobs ± 50 nm]
Hierarchical Clustering
[Figure: building a dendrogram over metabolites A-F: find the 2 most similar metabolite expression levels or curves, then find the next closest pair of levels or curves, and iterate; the final result is displayed as a heat map]
Multivariate Statistics
• Multivariate means multiple variables
• If you measure a population using multiple measures at the same time such as height, weight, hair colour, clothing colour, eye colour, etc. you are performing multivariate statistics
• Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods
A Typical Metabolomics Experiment
A Metabolomics Experiment
• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments are measuring multiple variables, and so metabolomic data are inherently multivariate data
• Metabolomics requires multivariate statistics
Multivariate Statistics – The Trick
• The key trick in multivariate statistics is to find a way that effectively reduces the multivariate data into univariate data
• Once done, you can then apply the same univariate concepts such as p-values, t-Tests and ANOVA tests to the data
• The trick is dimensional reduction
Dimension Reduction & PCA
• PCA – Principal Component Analysis
• A process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
• Reduces 1000's of variables to 2-3 key features
[Figure: scores plot]
Principal Component Analysis
[Figure: hundreds of peaks reduced to 2 components; PC1 vs. PC2 scores plot separating the PAP, ANIT and Control groups]
PCA captures what should be visually detectable
If you can't see it, PCA probably won't help
Visualizing PCA
• PCA of a "bagel"
• One projection produces a wiener
• Another projection produces an "O"
• The "O" projection captures most of the variation and has the largest eigenvector (PC1)
• The wiener projection is PC2 and gives depth info
PCA - The Details
• PCA involves the calculation of the eigenvalue (singular value) decomposition of a data covariance matrix
• PCA is an orthogonal linear transformation
• PCA transforms data to a new coordinate system so that the greatest variance of the data comes to lie on the first coordinate (1st PC), the second greatest variance on the 2nd PC, etc. (see the sketch below)

[Diagram: a data matrix of samples s1, s2, s3 ... sk by variables x1, x2, x3 ... xn is decomposed into scores t1, t2 ... tm (eigenvectors: uncorrelated, orthogonal) and loadings p1, p2 ... pk]

scores = loadings × data: t1 = p1x1 + p2x2 + p3x3 + ... + pnxn
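A minimal numpy sketch of exactly this recipe (covariance matrix, eigen-decomposition, scores = data × loadings) on random data:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.normal(size=(50, 4))            # 50 samples x 4 variables
X = X - X.mean(axis=0)                  # centre each variable

C = np.cov(X, rowvar=False)             # data covariance matrix
evals, evecs = np.linalg.eigh(C)        # eigenvalue decomposition
order = np.argsort(evals)[::-1]         # sort PCs by variance explained
loadings = evecs[:, order]              # p: one orthogonal loading vector per PC

scores = X @ loadings                   # t_i = p1*x1 + p2*x2 + ... per sample
print(scores[:, :2].shape)              # (50, 2): the 2-component scores plot
```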
Visualizing PCA
• Airport data from
USA
• 5000 “samples”
• X1 - latitude
• X2 - longitude
• X3 - altitude
• What should you
expect?
Data from Roy Goodacre (U of Manchester)
Visualizing PCA
PCA is equivalent to K-means clustering
K-means Clustering
[Figure: coloured spectra joined to an initial cluster one at a time; an object is tested and joined when its wavelength is within the threshold, λT = λcentroid ± 50 nm]
PCA Clusters
• Once dimensional reduction has been achieved, you obtain clusters of data that are mostly normally distributed, with means and variances (in PCA space)
• It is possible to use t-Tests and ANOVA tests to determine if these clusters or their means are significantly different or not
PCA and ANOVA
• ANOVA can also be used to determine whether 3+ clusters are different if the clusters follow a normal distribution
[Figure: three clusters in a PC 1 vs. PC 2 scores plot]
PCA Plot Nomenclature
• PCA generates 2 kinds of plots: the scores plot and the loadings plot
• The scores plot (on the right) plots the data using the main principal components
PCA Loadings Plot
• The loadings plot shows how much each of the variables (metabolites) contributed to the different principal components
• Variables at the extreme corners contribute most to the scores plot separation
PCA Details/Advice
• In some cases PCA will not succeed in identifying any clear clusters or obvious groupings no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished
• As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, then it is probably not worthwhile using other statistical techniques to try to separate them
PCA vs. PLS-DA
• PLS-DA: Partial Least Squares Discriminant Analysis
• PLS-DA is a supervised classification technique while PCA is an unsupervised clustering technique
• PLS-DA uses "labeled" data while PCA uses no prior knowledge
• PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained (see the sketch below)
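One common implementation of PLS-DA is PLS regression against the class labels; a sketch using scikit-learn's PLSRegression on simulated 2-class data (data and class shift are invented for illustration):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(8)
# 40 samples x 100 peaks, two classes ("labeled" data)
X = np.vstack([rng.normal(0.0, 1, (20, 100)),
               rng.normal(0.5, 1, (20, 100))])
y = np.array([0.0] * 20 + [1.0] * 20)

pls = PLSRegression(n_components=2).fit(X, y)   # supervised: uses the labels
scores = pls.transform(X)                       # components rotated toward
print(scores.shape)                             # class separation: (40, 2)
```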
PLS-DA Validation
• PLS-DA results are essentially prediction models or class predictors
• These models need to be validated and assessed to make sure they are not over-trained or over-fitted
• There are several routes to assessing the quality and robustness of the model – R²/Q² assessments and permutation testing
Validating PLS-DA with Q² & R²
• The performance of a PLS-DA model can be quantitatively evaluated in terms of an R² and/or a Q² value
• R² is the correlation index and refers to the goodness of fit or the explained variation (range = 0-1)
• Q² refers to the predicted variation or quality of prediction (range = 0-1)
• Typically Q² and R² track very closely together
PLS-DA R²
• R² is a quantitative measure (with a maximum value of 1) that indicates how well the PLS-DA model is able to mathematically reproduce the data in the data set
• A poorly fit model will have an R² of 0.2 or 0.3, while a well-fit model will have an R² of 0.7 or 0.8
PLS-DA Q²
• To guard against over-fitting, the value Q² is commonly determined. Q² is usually estimated by cross validation or permutation testing to assess the predictive ability of the model relative to the number of principal components used in the model
• Generally a Q² > 0.5 is considered good, while a Q² of 0.9 is outstanding
Validating PLS-DA (Permutation)
[Figure: the labelled data are fit with PLS-DA/SVM, then the labels are permuted and refit; the separation scores of the real and permuted fits are compared - sketched below]
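A sketch of that permutation scheme; here the "separation score" is simply the PLS model's R² against the labels (an assumption for illustration, not the slides' exact metric):

```python
import numpy as np
from sklearn.cross_decomposition import PLSRegression

def separation(X, y):
    return PLSRegression(n_components=2).fit(X, y).score(X, y)  # R^2 vs labels

rng = np.random.default_rng(9)
X = np.vstack([rng.normal(0.0, 1, (20, 50)),
               rng.normal(0.8, 1, (20, 50))])
y = np.array([0.0] * 20 + [1.0] * 20)

real = separation(X, y)
# Refit on randomly permuted labels; if the real score doesn't clearly
# beat the permuted ones, the apparent separation is not real
perm = [separation(X, rng.permutation(y)) for _ in range(100)]
p = (1 + sum(s >= real for s in perm)) / 101
print(f"separation = {real:.2f}, permutation p = {p:.3f}")
```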
Other Supervised Classification Methods
• SIMCA – Soft Independent Modeling of Class Analogy
• OPLS – Orthogonal Projection of Latent Structures
• Support Vector Machines
• Random Forest
• Naïve Bayes Classifiers
• Neural Networks
Breaching the Data Barrier
• Unsupervised Methods: PCA, K-means clustering, Factor Analysis
• Supervised Methods: PLS-DA, LDA, PLS-Regression
• Machine Learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets
Data Analysis Progression
• Unsupervised Methods
  – PCA or cluster to see if natural clusters form or if data separates well
  – Data is "unlabeled" (no prior knowledge)
• Supervised Methods/Machine Learning
  – Data is labeled (prior knowledge)
  – Used to see if data can be classified
  – Helps separate less obvious clusters or features
• Statistical Significance
  – Supervised methods always generate clusters -- this can be very misleading
  – Check if clusters are real by label permutation
Note of Caution
• Supervised classification methods are powerful
  – Learn from experience
  – Generalize from previous examples
  – Perform pattern recognition
• Too many people skip the PCA or clustering steps and jump straight to supervised methods
• Some get great separation and think the job is done - this is where the errors begin...
• Too many don't assess significance using permutation testing or n-fold cross validation
• If separation isn't at least partially obvious by eyeballing your data, you may be treading on thin ice