Transcript Slide 1
Canadian Bioinformatics
Workshops
www.bioinformatics.ca
Lecture 3
Univariate Analyses: Discrete Data
MBP1010
†
Dr. Paul C. Boutros
Winter 2014
DEPARTMENT OF
MEDICAL BIOPHYSICS
†
Aegeus, King of Athens, consulting the Delphic Oracle. High Classical (~430 BCE)
This workshop includes material
originally developed by Drs. Raphael Gottardo,
Sohrab Shah, Boris Steipe and others
Course Overview
• Lecture 1: What is Statistics? Introduction to R
• Lecture 2: Univariate Analyses I: continuous
• Lecture 3: Univariate Analyses II: discrete
• Lecture 4: Multivariate Analyses I: specialized models
• Lecture 5: Multivariate Analyses II: general models
• Lecture 6: Sequence Analysis
• Lecture 7: Microarray Analysis I: Pre-Processing
• Lecture 8: Microarray Analysis II: Multiple-Testing
• Lecture 9: Machine-Learning
• Final Exam (written)
Lecture 3: Univariate Analyses II: Discrete Data
bioinformatics.ca
How Will You Be Graded?
• 9% Participation: 1% per week
• 56% Assignments: 8 x 7% each
• 35% Final Examination: in-class
• Each individual will get their own, unique assignment
• Assignments will all be in R, and will be graded according
to computational correctness only (i.e. does your R script
yield the correct result when run)
• Final Exam will include multiple-choice and written
answers
Course Information Updates
• Website will have up to date information, lecture notes,
sample source-code from class, etc.
• http://medbio.utoronto.ca/students/courses/mbp1010/mbp_1010.html
• Tutorials are Thursdays 13:00-15:00 in 4-204 TMDT
• New TA (focusing on bioinformatics component) will be
Irakli (Erik) Dzneladze
• Assignment #1 is released today, due on January 30
• Assignment #2 will be released on January 31, due Feb 7
• Updated course-schedule on website
House Rules
• Cell phones to silent
• No side conversations
• Hands up for questions
Review From Lecture #1
Population vs. Sample
All MBP Students = Population
MBP Students in 1010 = Sample
How do you report statistical information?
P-value, variance, effect-size, sample-size, test
Why don’t we use Excel/spreadsheets?
Input errors, reproducibility, wrong results
Review From Lecture #2
Define discrete data
Gaps on the number-line: only distinct, countable values can occur
What is the central limit theorem?
A random variable that is the sum of many small, independent
random variables is approximately normally distributed
Theoretical vs. empirical quantiles
Probability vs. percentage of values less than p
Components of a boxplot?
25% - 1.5 IQR, 25%, 50%, 75%, 75% + 1.5 IQR
Boxplot
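Descriptive statistics can be intuitively summarized in a boxplot.
> boxplot(x)
[Figure: boxplot labelled, top to bottom: upper whisker (1.5 x IQR), 75% quantile, median, 25% quantile, lower whisker (1.5 x IQR); the box spans the IQR]
Everything more than 1.5 x IQR above the 75% quantile or below
the 25% quantile is considered an "outlier".
IQR = interquartile range = 75% quantile – 25% quantile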
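The boxplot components can also be computed by hand. A minimal sketch, assuming simulated data: the vector x and the planted extreme value 6 are illustrative and not from the lecture.

```r
# Assumption: simulated data with one planted extreme value (6)
set.seed(1)
x <- c(rnorm(50), 6)

# Quartiles and interquartile range
q   <- quantile(x, c(0.25, 0.50, 0.75))
iqr <- IQR(x)

# Fences: 1.5 x IQR beyond the 25% and 75% quantiles
lower_fence <- unname(q[1] - 1.5 * iqr)
upper_fence <- unname(q[3] + 1.5 * iqr)

# Values outside the fences are the points a boxplot draws individually
outliers <- x[x < lower_fence | x > upper_fence]
print(outliers)
```

Note that boxplot() itself works from the hinges returned by fivenum(), which can differ slightly from quantile() for small samples.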
Review From Lecture #2
How can you interpret a QQ plot?
Compares two samples or a sample and a
distribution. Straight line indicates identity.
What is hypothesis testing?
Confirmatory data-analysis; test null hypothesis
What is a p-value?
Evidence against the null; the probability of seeing
a value at least as extreme by chance alone
if the null hypothesis is true
Review From Lecture #2
Parametric vs. non-parametric tests
Parametric tests have distributional assumptions
What is the t-statistic?
Signal:Noise ratio
Assumptions of the t-test?
Data sampled from normal distribution;
independence of replicates; independence of
groups; homoscedasticity
Flow-Chart For Two-Sample Tests
Is Data Sampled From a Normally-Distributed Population?
  Yes → Equal Variance (F-Test)?
    Yes → Homoscedastic T-Test
    No → Heteroscedastic T-Test
  No → Sufficient n for CLT (>30)?
    Yes → Equal Variance (F-Test)? (continue as above)
    No → Wilcoxon U-Test
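The flow-chart can be sketched as a small R function. This is only an illustration under stated assumptions: the function name two_sample_test is hypothetical, the Shapiro-Wilk test stands in for the normality question (the slide does not name a test), and 0.05 is an assumed threshold throughout.

```r
# Hypothetical helper that walks the flow-chart for two numeric samples
two_sample_test <- function(x, y, alpha = 0.05) {
  # Assumption: Shapiro-Wilk answers "sampled from a normal population?"
  is_normal <- shapiro.test(x)$p.value > alpha &&
               shapiro.test(y)$p.value > alpha
  # Sufficient n for the central limit theorem (>30 per group)
  enough_n <- length(x) > 30 && length(y) > 30

  # Neither normal nor large enough: fall back to the Wilcoxon U-test
  if (!is_normal && !enough_n) {
    return(wilcox.test(x, y))
  }

  # F-test for equal variance picks the homoscedastic or heteroscedastic t-test
  equal_var <- var.test(x, y)$p.value > alpha
  t.test(x, y, var.equal = equal_var)
}

# Usage with simulated data (illustrative values only)
set.seed(42)
a <- rnorm(40, mean = 0, sd = 1)
b <- rnorm(40, mean = 1, sd = 1)
res <- two_sample_test(a, b)
print(res$method)
```

Chaining preliminary tests like this is a teaching device; it follows the chart literally rather than recommending it as an analysis strategy.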
Topics For This Week
• Correlations
• ceRNAs
• Attendance
• Common discrete univariate analyses
Power, error rates and decision
Power calculation in R:
> power.t.test(n = 5, delta = 1, sd = 2,
+     alternative = "two.sided", type = "one.sample")
     One-sample t test power calculation
              n = 5
          delta = 1
             sd = 2
      sig.level = 0.05
          power = 0.1384528
    alternative = two.sided
Other tests are available – see ??power.
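The same function can be run the other way around: leave n out and supply a target power. A minimal sketch, assuming a target power of 0.8 (a common convention, not a value from the lecture):

```r
# Solve for the sample size needed to reach 80% power
# with the same effect size (delta = 1) and spread (sd = 2)
res <- power.t.test(delta = 1, sd = 2, power = 0.8,
                    alternative = "two.sided", type = "one.sample")

# n is returned as a fraction; round up to whole replicates
n_needed <- ceiling(res$n)
print(n_needed)
```

Exactly one of n, delta, sd, power and sig.level must be left unspecified (NULL), and power.t.test solves for it.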
Power, error rates and decision
PR(False Negative) = PR(Type II error)
PR(False Positive) = PR(Type I error)
[Figure axis labels: μ0, μ1]
Problem
When we measure more than
one variable for each
member of a population, a
scatter plot may show us that
the values are not completely
independent: there is e.g. a
trend for one variable to
increase as the other
increases.
Regression analyses assess
the dependence.
Examples:
• Height vs. weight
• Gene dosage vs.
expression level
• Survival analysis:
probability of death vs. age
Correlation
When one variable depends on
the other, the variables are to
some degree correlated.
(Note: correlation need not
imply causality.)
In R, the function cov()
measures covariance and cor()
measures the Pearson
coefficient of correlation (a
normalized measure of
covariance).
Pearson's coefficient of
correlation values range
from -1 to 1, with 0 indicating
no correlation.
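To see what "normalized measure of covariance" means, the Pearson coefficient can be rebuilt from cov() and sd(). A minimal sketch on simulated data (the variables and the slope 0.5 are illustrative):

```r
# Assumption: simulated, loosely related variables
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)

# Covariance depends on the units of x and y
cov_xy <- cov(x, y)

# Dividing by both standard deviations rescales it to the range -1 to 1
r_manual <- cov_xy / (sd(x) * sd(y))

# Should agree with the built-in Pearson coefficient
print(c(manual = r_manual, builtin = cor(x, y)))
```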
Pearson's Coefficient of Correlation
How to interpret the correlation
coefficient:
Explore varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.99;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9999666
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.8;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.9661111
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.4;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.6652423
Pearson's Coefficient of Correlation
Varying degrees of randomness ...
> x<-rnorm(50)
> r <- 0.01;
> y <- (r * x) + ((1-r) * rnorm(50));
> plot(x,y); cor(x,y)
[1] 0.01232522
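Note that r in these simulations is a mixing weight, not the correlation itself. Since x and the noise are independent with unit variance, the expected correlation works out to r / sqrt(r^2 + (1-r)^2). A sketch comparing this with simulation; the helper name expected_cor and the larger sample size are assumptions added for illustration:

```r
# Expected Pearson correlation for y = r*x + (1-r)*noise,
# with x and noise independent standard normals
expected_cor <- function(r) r / sqrt(r^2 + (1 - r)^2)

set.seed(2014)
n <- 10000  # larger than the 50 points on the slides, to reduce sampling noise

for (r in c(0.99, 0.8, 0.4, 0.01)) {
  x <- rnorm(n)
  y <- (r * x) + ((1 - r) * rnorm(n))
  cat(sprintf("r = %.2f  observed = %.3f  expected = %.3f\n",
              r, cor(x, y), expected_cor(r)))
}
```

With only 50 points, as on the slides, the observed values scatter more widely around the expected ones.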
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # periodic ...
> y <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.3438495
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # polynomial ...
> y <- (r * x*x) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] -0.5024503
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # exponential
> y <- (r * exp(5*x)) + ((1-r) * rnorm(50))
> plot(x,y); cor(x,y)
[1] 0.6334732
Pearson's Coefficient of Correlation
Non-linear relationships ...
> x<-runif(50,-1,1)
> r <- 0.9
> # circular ...
> a <- (r * cos(x*pi)) + ((1-r) * rnorm(50))
> b <- (r * sin(x*pi)) + ((1-r) * rnorm(50))
> plot(a,b); cor(a,b)
[1] 0.04531711
Correlation coefficient
Other Correlations
• There are many other types of correlations
• Spearman’s correlation
• rho
• Kendall’s correlation
• Tau
• Spearman is a Pearson correlation computed on ranked values
• Spearman rho = 1 means a perfectly monotonic (increasing) relationship
• Pearson R = 1 means a perfectly linear (increasing) relationship
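The difference shows up clearly on a relationship that is monotonic but not linear. A minimal sketch, reusing the exponential shape from the earlier simulations but without added noise (an assumption made so the rank correlations are exact):

```r
set.seed(1)
x <- runif(50, -1, 1)
y <- exp(5 * x)   # monotonic but strongly non-linear, no noise added

# Pearson measures linear association, so it falls short of 1
pearson <- cor(x, y)

# Rank-based coefficients only care about ordering
spearman <- cor(x, y, method = "spearman")
kendall  <- cor(x, y, method = "kendall")

print(c(pearson = pearson, spearman = spearman, kendall = kendall))
```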
When Do We Use Statistics?
• Ubiquitous in modern biology
• Every class I will show a use of statistics in a (very, very)
recent Nature paper.
January 9, 2014
Non-Small Cell Lung Cancer 101
Lung Cancer: 15% 5-year survival
  Non-Small Cell: 80% of lung cancer
    Adenocarcinomas
    Squamous Cell Carcinomas
    Large Cell (and others)
  Small Cell
Non-Small Cell Lung Cancer 102
Stage I: Local Tumour Only
Stage II: Local Lymph Nodes
Stage III: Distal Lymph Nodes
Stage IV: Metastasis
IA = small tumour; IB = large tumour
General Idea: HMGA2 is a ceRNA
What are ceRNAs?
Salmena et al. Cell 2011
Test Multiple Constructs for Activity
What Statistical Analysis Did They Do?
• No information given in main text!
• Figure legend says:
“Values are technical triplicates, have been performed
independently three times, and represent mean +/- standard
deviation (s.d.) with propagated error.”
• In supplementary they say:
“Unless otherwise specified, statistical significance was
assessed by the Student’s t-test”
• So, what would you do differently?
Attendance Break
Let’s Go Back to Discrete vs. Continuous
• Definition?
• Let’s take a few examples of discrete univariate statistical
analyses in biology and write them down here:
• Cell counts
• Embryo pigmentation yes/no with morpholino
• SNP calling
• Immunohistochemistry
• Colony formations
Four Main Discrete Univariate Tests
• Hypergeometric test
• Is a sample randomly selected from a fixed population?
• Proportion test
• Are two proportions equivalent?
• Fisher’s Exact test
• Are two binary classifications associated?
• (Pearson’s) Chi-Squared Test
• Are paired observations on two variables independent?
Hypergeometric Test
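• Is a sample randomly selected from a fixed population?
• Closer to discrete mathematics than statistics
• Technically: sampling without replacement
P(x) = \frac{\binom{s}{x}\binom{N-s}{n-x}}{\binom{N}{n}}
(N = population size, s = successes in the population, n = sample size, x = successes in the sample)
• In R: ?phyper
• Classic example: marbles
• Less classic: poker
[Figure: marbles; population N: 5/24 are yellow; sample n: 1/6 sampled are yellow]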
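The marble numbers can be plugged straight into R. A minimal sketch, assuming the figure means 24 marbles of which 5 are yellow, and a draw of 6 of which 1 is yellow; the variable names are illustrative.

```r
n_pop    <- 24  # marbles in the bag
n_yellow <- 5   # yellow marbles in the bag
n_draw   <- 6   # marbles sampled without replacement
x_yellow <- 1   # yellow marbles in the sample

# Probability of exactly 1 yellow, from the binomial-coefficient formula
p_manual <- choose(n_yellow, x_yellow) *
            choose(n_pop - n_yellow, n_draw - x_yellow) /
            choose(n_pop, n_draw)

# Same quantity from the built-in density; R's arguments are
# m = successes in population, n = failures in population, k = draws
p_exact <- dhyper(x_yellow, m = n_yellow, n = n_pop - n_yellow, k = n_draw)

# Cumulative probability of 1 or fewer yellow marbles
p_at_most <- phyper(x_yellow, m = n_yellow, n = n_pop - n_yellow, k = n_draw)

print(c(manual = p_manual, dhyper = p_exact, at_most = p_at_most))
```

Watch the argument names: in phyper() and dhyper(), n is the number of failures in the population, not the sample size.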
Hypergeometric Test: Biological Example
• Classic example in genomics: pathway analysis
• I do a screen and identify n genes associated with something
• Are those n genes biased towards a pathway?
• Well, a pathway contains m genes
• So are my n genes a random selection with respect to the m pathway genes? Hypergeometric test!
• Similar example: drug screening
• I test 1000 drugs to see which ones kill a cell-line
• 100 of these are kinase inhibitors
• 100 drugs kill my cell-line
• 30 of these are kinase inhibitors
• Did I find more kinase inhibitors than expected by chance?
• Let’s do the calculation
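One way to do the drug-screening calculation in R, as a sketch: the numbers come from the slide, the variable names are illustrative, and the one-sided "at least 30" framing is an assumption about what "more than expected" should mean.

```r
total_drugs       <- 1000  # drugs tested
kinase_inhibitors <- 100   # of which are kinase inhibitors
hits              <- 100   # drugs that kill the cell-line
kinase_hits       <- 30    # hits that are kinase inhibitors

# Expected number of kinase inhibitors among the hits by chance
expected_hits <- hits * kinase_inhibitors / total_drugs

# P(X >= 30): subtract 1 because lower.tail = FALSE gives P(X > q)
p_enriched <- phyper(kinase_hits - 1,
                     m = kinase_inhibitors,
                     n = total_drugs - kinase_inhibitors,
                     k = hits,
                     lower.tail = FALSE)

print(expected_hits)
print(p_enriched)
```

Forgetting the "- 1" is a common slip: it silently turns "30 or more" into "more than 30".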
Hypergeometric: Venn Diagram Overlap
Let’s pretend X and Y are sets
of genes (or drugs, etc.) found
in two separate experiments.
We want to know, is there more
overlap than expected by
chance? To do this:
Total Balls: total number of genes considered (but a gene must be
analyzed in both experiments: exclude those studied in only one)
Black Balls: all genes found in experiment X
White Balls: all genes not found in experiment X
Sample: all genes found in experiment Y
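The ball-and-urn mapping above translates into a short function. A sketch only: the function name overlap_test and the toy gene sets are hypothetical, invented for illustration.

```r
# Hypothetical helper: is the X/Y overlap larger than expected by chance?
overlap_test <- function(found_x, found_y, universe) {
  # Only genes analyzed in both experiments count
  found_x <- intersect(found_x, universe)
  found_y <- intersect(found_y, universe)
  overlap <- length(intersect(found_x, found_y))

  # Black balls = found in X, white balls = not found in X, sample = found in Y
  phyper(overlap - 1,
         m = length(found_x),
         n = length(universe) - length(found_x),
         k = length(found_y),
         lower.tail = FALSE)
}

# Toy example with made-up gene names
universe <- paste0("gene", 1:200)
set_x <- universe[1:40]
set_y <- universe[31:60]   # shares gene31..gene40 with set_x
p_overlap <- overlap_test(set_x, set_y, universe)
print(p_overlap)
```

The choice of universe matters a great deal: inflating it with genes that were never measured makes any overlap look more significant.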
Proportion Test
• Are two proportions equivalent?
• Example: is the fraction of people who play hockey in MBP
different from the fraction who play hockey in
Mathematics?
• Mathematics: 12/85
• MBP: 24/135
• In R: prop.test
• Only useful for two-group studies
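The hockey example can be run directly. A minimal sketch using the counts from the slide; prop.test applies a continuity correction by default.

```r
# Hockey players / department size: Mathematics first, then MBP
players <- c(Mathematics = 12, MBP = 24)
totals  <- c(Mathematics = 85, MBP = 135)

res <- prop.test(x = players, n = totals)

# Estimated proportions and the p-value for "are they equal?"
print(res$estimate)
print(res$p.value)
```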
Proportion Test: Biological Example
• Does the frequency of TP53 mutations differ between
prostate cancer patients who will suffer a recurrence and
those who will not?
• 12/150 patients whose tumours recur have mutated TP53
• 50/921 patients whose tumours do not recur have
mutated TP53
• P-value guesses?
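Once guesses are in, the TP53 comparison can be checked in R. A sketch with the counts from the slide; the variable names are illustrative.

```r
# Patients with mutated TP53 / patients in each group
mutated <- c(recurrence = 12, no_recurrence = 50)
total   <- c(recurrence = 150, no_recurrence = 921)

res <- prop.test(x = mutated, n = total)

# Mutation frequency in each group, then the test result
print(res$estimate)
print(res$p.value)
```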
Fisher’s Exact Test
• Are two binary categorizations associated?
• Based on a contingency table
• What are these? Have we seen any before?
• In R: ?fisher.test
• Classic example: drinking tea
Dr. Muriel Bristol claimed to be able to taste whether tea or milk
was added first to a cup. Dr. Ronald Fisher didn’t believe her.
        Milk  Tea
Milk    4     0
Tea     0     4
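The tea-tasting table can be tested in R. A minimal sketch; the slide does not say which axis is the truth and which is the guess, so the row and column labels below are an assumption (the p-value is the same either way).

```r
# Assumption: rows = what was actually poured first, columns = her guess
tea <- matrix(c(4, 0,
                0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(truth = c("Milk", "Tea"),
                              guess = c("Milk", "Tea")))

# Two-sided test: any association between truth and guess
res_two_sided <- fisher.test(tea)

# One-sided test: does she do better than chance?
res_greater <- fisher.test(tea, alternative = "greater")

print(res_two_sided$p.value)
print(res_greater$p.value)
```

With 8 cups there are choose(8, 4) = 70 ways to pick the four "milk first" cups, and only one is fully correct, which is where the one-sided value of 1/70 comes from.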
Fisher’s Exact Test: Biological Example
• You can use this any time you form a contingency table
• Any time you make predictions (biomarkers)
• Any time you compare two binary phenomena
• Examples?
(Pearson’s) Chi-Squared Test
• Are two variables independent?
• There are a lot of different chi-squared tests. Why?
• Pearson
• Yates
• McNemar
• Portmanteau test
• In R: ?chisq.test
• You can think of it as a multiple-category Fisher’s test
• The assumptions break down if the expected count is <5 in a cell
Chi-Squared Test: Biological Example
• Comparing sex across different tumour subtypes
                          Male  Female
Adenocarcinoma            250   192
Squamous Cell Carcinoma   202   261
Small Cell Carcinoma      15    9
Neuroendocrine            12    10
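This table can be passed to chisq.test as a matrix. A minimal sketch using the counts above; the object names are illustrative.

```r
# Counts of male and female patients per tumour subtype
sex_by_subtype <- matrix(c(250, 192,
                           202, 261,
                            15,   9,
                            12,  10),
                         ncol = 2, byrow = TRUE,
                         dimnames = list(
                           subtype = c("Adenocarcinoma",
                                       "Squamous Cell Carcinoma",
                                       "Small Cell Carcinoma",
                                       "Neuroendocrine"),
                           sex = c("Male", "Female")))

res <- chisq.test(sex_by_subtype)

# Check the expected counts before trusting the p-value
print(res$expected)
print(res$p.value)
```

A significant result only says that sex and subtype are not independent somewhere in the table; res$residuals helps to see which cells drive it.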