Transcript Lecture 15

Lecture 15
Wrap up of class
What we intended to do and what we have
done:
• Topics:
• What is the Biological Problem at hand?
• Types of data: micro-array, proteomic, RNAseq, GWAS
• Why and when does one use them?
Sources of Variation led us to our next topic
• Statistical issues concerning:
•
a. Normalization of data
•
b. Stochastic error versus systematic errors
Normalization
• VERY important we realize WHY we normalize
data as opposed to HOW to normalize data. I
am including Background correction along
with Normalizing here.
• The pros and Cons of normalizing vs not.
• What theoretically Normalizing is supposed to
do and WHAT it actually does.
Statistical topics:
•
•
•
•
LOESS
Quantile Normalization
Tukey Bi-weight
Wilcoxon Signed Rank test
Now come the QUESTIONS OF INTEREST:
What are the genes that are different for the
healthy versus diseased cells?
• –Gene discovery, differential expression
Is a specified group of genes all up-regulated in a
specified condition?
• –Gene set differential expression
• Did not get time for this too much but can be
included in Clustering after DE
Tests we talked about:
For 2 conditions:
• Pooled t test
• Welch’s t test
• Wilcoxon Rank Sum Test
• PermutationTest
• Bootstrap t test.
• EB Bayes Test
Announcement
• I am totally voice-less today
• So we will present as follows:
– Andrew
– Cameron
– Lili
– Huinan
– Ben
– Amit
– Xin
Contd…
•
•
•
•
•
•
•
David
Chongjin
Jie
Jeff
Miaoru
Jillian
Jeff
Tests contd
•
•
•
•
For multiple Conditions
ANOVA F test
Kruskal Wallis Test
EB Bayes Test
Multiplicity:
• The question of multiplicity adjustment, FWE,
PCE or FDR?
• Bonferroni corrections,
• False Discovery Rates, FDR
• Sequential Bonferroni, the Holm adjustment
• Bootstrapping, Permutation adjustments
Class discovery, clustering
• To do clustering we need a distance metric and a
linkage method.
• We can have hierarchical or non-hierarchical
clustering.
•
• Non-hierarchical Clustering: Partitioning Methods
(need to know number of clusters0
• Hierarchical Clustering: Produces trees (produces
tree-diagram)
Distance and Linkages
•
•
•
•
•
•
•
•
•
•
•
Distance:
Eucledean
Manhattan
Mahalanobis
Correlation
Linkages:
Complete
Singles
Centroid
Average
Class prediction, classification
• •Are there tumour sub-types not previously
identified? Do my genes group into previously
undiscovered pathways?
LDA
• Feature Selection: gene filtering
– Differential Expression
– PCA
– Penalized Least Square
•
• Choosing the rules
– Parametric ones:
•
•
•
•
•
Liklihood
Linear Discriminant Rule
Mahalanobis rule
Posterior Probability Rule
The General Classification Rule (using cost of mis-classification and
priors)
Misclassifications
– Non-parametric ones
• K-NN
• Estimating Misclassification rates
– Resubstitution
– Hold-out Samples
– Cross validation/Jack-knife
This is just the beginning of this
journey
• Remember you still have loads to learn
• You have to keep reading and be willing to
incorporate new ideas
• Thanks a bunch for sharing this journey with
me!