Transcript Slide 1
Bioinformatics: gene expression basics
Ollie Rando, LRB 903
Experimental Cycle
• Biological question (hypothesis-driven or explorative)
• Experimental design
• Microarray experiment
• Image analysis
• Pre-processing: normalization
• Quality measurement at each step (Pass: continue; Failed: back to the experiment)
• Analysis: estimation, testing, clustering, discrimination
• Biological verification and interpretation

"To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of." – Ronald Fisher
DNA Microarray
From experiment to data
Microarrays & Spot Colour
Microarray Analysis Examples
[Figure: example array images from different tissues, each labeled with a count – Lung 20,224; Liver 37,807; Prostate 7,971; Skin 3,043; Brain 67,679; Heart 9,400; Colon 4,832; Bone 4,832; Liver Tumor]
Raw data are not mRNA concentrations
• tissue contamination
• RNA degradation
• amplification efficiency
• reverse transcription efficiency
• hybridization efficiency and specificity
• clone identification and mapping
• PCR yield, contamination
• spotting efficiency
• DNA support binding
• other array-manufacturing-related issues
• image segmentation
• signal quantification
• “background” correction
Scatterplot
[Figure: scatterplots of the two channel intensities, raw scale vs. log scale]
Message: look at your data on a log scale!
MA Plot
M = log2(R/G) = log2 R − log2 G
A = 1/2 log2(RG) = (log2 R + log2 G) / 2
Median centering
One of the simplest strategies is to bring all "centers" of the array data to the same level.
Assumption: the majority of genes are unchanged between conditions.
Divide all expression measurements of each array by that array's median; on the log scale this centers the log signal at 0.
The median is more robust to outliers than the mean.
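A minimal sketch of median centering, assuming a log2-transformed genes × arrays matrix in NumPy (the variable names are illustrative, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(0)
log_expr = rng.normal(loc=8.0, scale=2.0, size=(1000, 4))  # 1000 genes x 4 arrays (toy data)
log_expr[:, 1] += 1.5                                       # simulate one systematically brighter array

# Subtracting each array's median log-signal is equivalent to dividing the raw
# intensities by the median; afterwards every array is centered at 0.
centered = log_expr - np.median(log_expr, axis=0, keepdims=True)
print(np.median(centered, axis=0))  # ~[0, 0, 0, 0]
```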
Problem of median-centering
Median-centering is a global method. It does not adjust for local effects, intensity-dependent effects, print-tip effects, etc.
[Figure: scatterplot of the log signals (log Red vs. log Green) after median-centering, and an M-A plot of the same data, where M = log Red − log Green and A = (log Red + log Green) / 2]
Lowess normalization
Fit a local (lowess) estimate of M as a function of A = (log Red + log Green) / 2, then use the estimate to bend the banana straight.
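A rough sketch of lowess normalization on an MA plot, assuming two raw channel vectors `red` and `green` (illustrative names) and using the lowess smoother from statsmodels, one of several possible implementations:

```python
import numpy as np
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(1)
green = rng.lognormal(mean=8, sigma=1, size=5000)
red = green * 2 ** (0.1 * np.log2(green) - 1.0)   # simulate an intensity-dependent dye bias

M = np.log2(red) - np.log2(green)                  # log-ratio
A = 0.5 * (np.log2(red) + np.log2(green))          # average log-intensity

# Fit the intensity-dependent trend M ~ f(A) and subtract it, so the bulk of
# genes (assumed unchanged) ends up centered at M = 0 across all intensities.
trend = lowess(M, A, frac=0.3, return_sorted=False)
M_normalized = M - trend
```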
Summary I
• Raw data are not mRNA concentrations
• We need to check data quality on different levels
  – Probe level
  – Array level (all probes on one array)
  – Gene level (one gene on many arrays)
• Always log your data
• Normalize your data to avoid systematic (non-biological) effects
• Lowess normalization straightens the banana
OK, so I’ve got a gene list with expression changes: now what?

Gene       Expression change
YPL171C    7.743877387
YBR008C    6.390877387
YFL056C    5.740877387
YKL086W    5.408877387
YOL150C    4.831877387
YOL151W    4.760877387
YFL057C    4.725877387
YKL071W    4.172877387
YLR327C    4.167877387
YLL060C    4.130877387
YLR460C    4.063877387
YML131W    4.047877387
YDL243C    4.031877387
YKR076W    3.942877387
YOR374W    3.937877387
“Huh. Turns out the standard names for the
most upregulated genes all start with ‘HSP’,
or ‘GAL’ … I wonder if that’s real …”
Gene Ontology
• Organization of curated biological knowledge
– 3 branches: biological process, molecular function, cellular component
Hypergeometric Distribution
• Probability of observing x or more genes in a cluster of n genes with a common annotation
  – N = total number of genes in the genome
  – M = number of genes with the annotation
  – n = number of genes in the cluster
  – x = number of genes in the cluster with the annotation
• Multiple hypothesis correction required if testing multiple functions (Bonferroni, FDR, etc.)
• Additional genes in clusters with strong enrichment may be related
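A small sketch of the test, assuming SciPy is available; the counts below are made-up numbers for illustration:

```python
from scipy.stats import hypergeom

N = 6000   # total genes in the genome
M = 120    # genes carrying the annotation (e.g. a GO term)
n = 278    # genes in the cluster / upregulated set
x = 18     # annotated genes observed in the cluster

# P(X >= x): SciPy's parameter order is (k, population size, number annotated,
# sample size), so the survival function at x - 1 gives the enrichment p-value.
p_value = hypergeom.sf(x - 1, N, M, n)
print(p_value)
```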
Kolmogorov-Smirnov test
• Hypergeometric test requires "hard calls" – e.g., this list of 278 genes is my upregulated set
• But say all 250 genes involved in oxygen consumption go up ~10–20% each – this would likely not show up
• The KS test asks whether the *distribution* of values for a given gene set (GO category, etc.) deviates from your dataset's background, and is nonparametric
• Cumulative Distribution Function (CDF) plot: [figure]
• Gene Set Enrichment Analysis: http://www.broadinstitute.org/gsea/
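A hedged sketch of the two-sample KS idea, comparing the distribution of expression changes for a gene set against all other genes; the data below are simulated for illustration:

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(2)
changes = rng.normal(0, 1, 6000)        # log2 fold change for every gene
geneset = np.arange(250)                # e.g. the "oxygen consumption" genes
changes[geneset] += 0.2                 # a modest, coordinated shift

in_set = changes[geneset]
background = np.delete(changes, geneset)

# Detects the distribution shift even though no single gene would pass a "hard call".
stat, p = ks_2samp(in_set, background)
print(stat, p)
```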
GO term Enrichment Tools
• SGD’s & Princeton’s GoTermFinder
– http://go.princeton.edu
• GOLEM (http://function.princeton.edu/GOLEM)
• HIDRA
Sealfon et al., 2006
Supervised analysis
= learning from examples, classification
– We have already seen groups of healthy and sick
people. Now let’s diagnose the next person walking
into the hospital.
– We know that these genes have function X (and
these others don’t). Let’s find more genes with
function X.
– We know many gene-pairs that are functionally
related (and many more that are not). Let’s extend
the number of known related gene pairs.
Known structure in the data needs to be
generalized to new data.
Unsupervised analysis
= clustering
– Are there groups of genes that behave similarly in
all conditions?
– Disease X is very heterogeneous. Can we identify
more specific sub-classes for more targeted
treatment?
No structure is known. We first need to find it.
Exploratory analysis.
Supervised analysis
Calvin, I still don’t know the
difference between cats and
dogs …
Oh, now I get it!!
Class 1: cats
Don’t worry!
I’ll show you once
more:
Class 2: dogs
Unsupervised analysis
Calvin, I still don’t know the
difference between cats and
dogs …
I don’t know it either.
Let’s try to figure it out
together …
Supervised analysis: setup
• Training set
– Data: microarrays
– Labels: for each one we know if it falls into our class
of interest or not (binary classification)
• New data (test data)
– Data for which we don’t have labels.
– E.g. genes without known function
• Goal: Generalization ability
– Build a classifier from the training data that is good
at predicting the right class for the new data.
One microarray, one dot
Think of a space with #genes dimensions (yes, it's hard to picture for more than 3). Each microarray corresponds to a point in this space. If gene expression is similar under some conditions, the points will be close to each other; if gene expression overall is very different, the points will be far apart.
[Figure: points plotted against axes "expression of gene 1" and "expression of gene 2"]
Which line separates best?
[Figure: four candidate separating lines, A–D]
No sharp knife, but a …
Support Vector Machines
• Maximal-margin separating hyperplane
• The data points closest to the separating hyperplane are the support vectors
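A minimal sketch of a linear (maximal-margin) SVM on toy 2-D data, standing in for "one microarray = one point"; scikit-learn is assumed available and the class labels are invented:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(3)
class0 = rng.normal(loc=[-2, -2], scale=1.0, size=(20, 2))   # e.g. "healthy" arrays
class1 = rng.normal(loc=[2, 2], scale=1.0, size=(20, 2))     # e.g. "sick" arrays
X = np.vstack([class0, class1])
y = np.array([0] * 20 + [1] * 20)

clf = SVC(kernel="linear", C=1.0).fit(X, y)

print(clf.support_vectors_)        # the training points closest to the hyperplane
print(clf.predict([[0.5, 0.3]]))   # classify a new, unlabeled point
```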
How well did we do?
Training error: how well
do we do on the data we
trained the classifier on?
But how well will we do in
the future, on new data?
Test error: How well does
the classifier generalize?
[Figure: the same classifier (= line) applied to new data from the same classes]
The classifier will usually perform worse than before: test error > training error
Cross-validation
[Figure: split the data into a training set and a test set; train the classifier on the training portion (training error) and evaluate it on the held-out test portion (test error)]
K-fold Cross-validation (here for K = 3)
Step 1: Train | Train | Test
Step 2: Train | Test | Train
Step 3: Test | Train | Train
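A sketch of K-fold cross-validation (K = 3) using scikit-learn's KFold with a linear SVM; the expression matrix and labels are simulated:

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 100))           # 60 arrays x 100 genes (toy data)
y = np.array([0] * 30 + [1] * 30)        # class labels
X[y == 1, :5] += 1.0                     # make the classes separable on a few genes

accuracies = []
for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(X):
    clf = SVC(kernel="linear").fit(X[train_idx], y[train_idx])
    accuracies.append(clf.score(X[test_idx], y[test_idx]))   # accuracy on the held-out fold

print(np.mean(accuracies))   # cross-validated estimate of generalization performance
```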
Additional supervised approaches might depend on your goal: cell cycle analysis
Clustering
• Let the data organize itself
• Reordering of genes (or conditions) in the
dataset so that similar patterns are next to
each other (or in separate groups)
• Identify subsets of genes (or experiments) that
are related by some measure
Quick Example
[Figure: expression matrix with genes as rows and conditions as columns, reordered by clustering]
Why cluster?
• “Guilt by association” – if unknown gene X is
similar in expression to known genes A and B,
maybe they are involved in the same/related
pathway
• Visualization: datasets are too large to be able
to get information out without reorganizing
the data
Clustering Techniques
• Algorithm (Method)
  – Hierarchical
  – K-means
  – Self-Organizing Maps
  – QT-Clustering
  – NNN
  – …
• Distance Metric
  – Euclidean (L2)
  – Pearson Correlation
  – Spearman Correlation
  – Manhattan (L1)
  – Kendall's tau
  – …
Distance Metrics
• Choice of distance measure is important for most clustering techniques
• Pair-wise metrics – compare vectors of numbers
  – e.g. genes x & y, each with n measurements
• Euclidean distance: d(x, y) = sqrt( Σ_i (x_i − y_i)² )
• Pearson correlation: r = Σ_i (x_i − x̄)(y_i − ȳ) / ( sqrt(Σ_i (x_i − x̄)²) · sqrt(Σ_i (y_i − ȳ)²) )
• Spearman correlation: the Pearson correlation computed on the ranks of x and y
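A small sketch computing the three measures for two toy expression profiles with NumPy and SciPy (the vectors are invented for illustration):

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])   # gene x across 5 conditions
y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])   # gene y: roughly 2x of x, different magnitude

euclidean = np.sqrt(np.sum((x - y) ** 2))  # sensitive to absolute magnitude
pearson_r, _ = pearsonr(x, y)              # linear agreement of the profile shapes
spearman_r, _ = spearmanr(x, y)            # rank (monotonic) agreement

# Correlation-based "distances" are often taken as 1 - r, so perfectly
# correlated profiles have distance 0 regardless of absolute level.
print(euclidean, 1 - pearson_r, 1 - spearman_r)
```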
[Figure: example profiles illustrating how Euclidean distance, Pearson correlation, and Spearman correlation behave differently]
Hierarchical clustering
• Imposes (pair-wise) hierarchical structure on all of the data
• Often good for visualization
• Basic Method (agglomerative):
  1. Calculate all pair-wise distances
  2. Join the closest pair
  3. Calculate the pair's distance to all others
  4. Repeat from 2 until all are joined
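A hedged sketch of agglomerative clustering on a toy expression matrix with SciPy; the linkage method and metric are illustrative choices (the linkage variants are discussed below):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(5)
expr = np.vstack([rng.normal(0, 1, (10, 6)),     # 10 genes around one pattern
                  rng.normal(3, 1, (10, 6))])    # 10 genes around another

# Average linkage with a correlation-based distance is a common choice for
# expression data; "single" and "complete" are the other variants below.
Z = linkage(expr, method="average", metric="correlation")

labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into 2 clusters
print(labels)
# scipy.cluster.hierarchy.dendrogram(Z) would draw the familiar clustering tree.
```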
[Figure: agglomerative joining illustrated step by step across several slides]
HC – Interior Distances
• Three typical variants to calculate interior
distances within the tree
– Average linkage: mean/median over all possible
pair-wise values
– Single linkage: minimum pair-wise distance
– Complete linkage: maximum pair-wise distance
Hierarchical clustering: problems
• Hard to define distinct clusters
• Genes assigned to clusters on the basis of all
experiments
• Optimizing node ordering hard (finding the optimal
solution is NP-hard)
• Can be driven by one strong cluster – a problem for gene expression because data in row space is often highly correlated
Cluster analysis of combined yeast data sets
Eisen M B et al. PNAS
1998;95:14863-14868
©1998 by The National Academy of Sciences
To demonstrate the biological origins of patterns seen in Figs. 1 and 2, data from Fig. 1 were
clustered by using methods described here before and after random permutation within rows
(random 1), within columns (random 2), and both (random 3).
Eisen M B et al. PNAS 1998;95:14863-14868
©1998 by The National Academy of Sciences
Hierarchical Clustering: Another Example
• Expression of tumors hierarchically clustered
• Expression groups by clinical class
Garber et al.
K-means Clustering
• Groups genes into a pre-defined number of independent clusters
• Basic algorithm:
  1. Define k = number of clusters
  2. Randomly initialize each cluster with a seed (often with a random gene)
  3. Assign each gene to the cluster with the most similar seed
  4. Recalculate all cluster seeds as means (or medians) of the genes assigned to the cluster
  5. Repeat 3 & 4 until convergence (e.g. no genes move, means don't change much, etc.)
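A sketch of k-means on a simulated expression matrix using scikit-learn, which implements the seed / assign / recompute loop above:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(6)
expr = np.vstack([rng.normal(-2, 0.5, (30, 8)),    # three groups of genes with
                  rng.normal(0, 0.5, (30, 8)),     # distinct mean profiles
                  rng.normal(2, 0.5, (30, 8))])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(expr)  # k must be chosen up front

print(km.labels_[:10])             # hard assignment: each gene belongs to exactly one cluster
print(km.cluster_centers_.shape)   # (3, 8): one mean profile ("seed") per cluster
```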
K-means example
[Figure: successive rounds of assignment and seed recalculation shown across several slides]
K-means: problems
• Have to set k ahead of time
  – Ways to choose an "optimal" k: minimize within-cluster variation compared to random data or held-out data
• Each gene belongs to exactly 1 cluster
• One cluster has no influence on the others (one-dimensional clustering)
• Genes assigned to clusters on the basis of all experiments
Clustering “Tweaks”
• Fuzzy clustering – allows genes to be “partially”
in different clusters
• Dependent clusters – consider between-cluster
distances as well as within-cluster
• Bi-clustering – look for patterns across subsets of
conditions
– Very hard problem (NP-complete)
– Practical solutions use heuristics/simplifications that
may affect biological interpretation
Cluster Evaluation
• Mathematical consistency
– Compare coherency of clusters to background
• Look for functional consistency in clusters
– Requires a gold standard, often based on GO,
MIPS, etc.
• Evaluate likelihood of enrichment in clusters
– Hypergeometric distribution, etc.
– Several tools available
More Unsupervised Methods
• Search-based approaches
– Starting with a query gene/condition, find most
related group
• Singular Value Decomposition (SVD) & Principal
Component Analysis (PCA)
– Decomposition of the data matrix into "patterns", "weights", and "contributions"
– The real names are "principal components", "singular values", and "left/right eigenvectors"
– Used to remove noise, reduce dimensionality, identify
common/dominant signals
SVD (& PCA)
• SVD is the method, PCA is performing SVD on centered
data
• Projects data into another orthonormal basis
• New basis ordered by variance explained
X (original data matrix) = U · S · Vᵗ
with S the diagonal matrix of singular values, the rows of Vᵗ the "eigen-genes", and the columns of U the "eigen-conditions".
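A minimal SVD sketch with NumPy on a simulated genes × conditions matrix; PCA would be the same operation after centering the data:

```python
import numpy as np

rng = np.random.default_rng(7)
pattern = np.sin(np.linspace(0, 2 * np.pi, 12))      # one dominant pattern across 12 conditions
weights = rng.normal(size=200)                        # how strongly each of 200 genes follows it
X = np.outer(weights, pattern) + rng.normal(0, 0.3, (200, 12))

U, s, Vt = np.linalg.svd(X, full_matrices=False)

# Singular values are ordered, so the first components explain the most variance;
# rows of Vt are patterns across conditions, columns of U are the gene weights.
variance_explained = s**2 / np.sum(s**2)
print(variance_explained[:3])                         # the first component should dominate

# Keeping only the top component removes noise / reduces dimensionality.
X_denoised = s[0] * np.outer(U[:, 0], Vt[0])
```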
OK, so all that’s fine. Let’s give it a shot
• Say we’ve run a gene expression array for changes in
gene expression when chromatin protein X is deleted
• What GO categories show differential expression?
• What TF binding sites regulate these genes?
• I think this protein will affect genes near the ends of
the chromosomes – how do I check?
• I bet TATA-containing genes are disproportionately
affected, so let’s check.
• I think this protein is involved in stress response – let’s
compare it to a stress response dataset
Where do we go for relevant datasets?
• GO: see previous
• Yeast genomic annotations: Saccharomyces
Genome Database
• Potential regulatory sites – MEME: http://meme.sdsc.edu/meme4_3_0/cgi-bin/meme.cgi
• TATA box data for yeast: Basehoar … Pugh,
Cell, 2004
• Stress response: Gasch et al