Transcriptome - Nematode bioinformatics. Analysis tools and data

Microarray analysis
Quantitation of Gene Expression
Expression Data to Networks
Reading: Ch 16
BIO520 Bioinformatics
Jim Lund
Microarray data
• Image quantitation.
• Normalization
• Find genes with significant expression
differences
• Annotation
• Clustering, pattern analysis, network
analysis
Sources of Non-Biological Variation
• Dye bias: differences in heat and light sensitivity,
efficiency of dye incorporation
• Differences in the amount of labeled cDNA
hybridized to each channel in a microarray
experiment (Channel is used to refer to a
combination of a dye and a slide.)
• Variation across replicate slides
• Variation across hybridization conditions
• Variation in scanning conditions
• Variation among technicians doing the lab work.
Factors which impact on the signal level
• Amount of mRNA
• Labeling efficiencies
• Quality of the RNA
• Laser/dye combination
• Detection efficiency of photomultiplier or CCD
M vs. A Plot
[Figure: Hela and HepG2 samples]
M = Log Red - Log Green
A = (Log Green + Log Red) / 2
M vs. A plots of chip pairs: before normalization
M vs. A plots of chip pairs: after quantile normalization
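A minimal Python sketch of the M and A calculation for a two-color array. Base-2 logs, the function name and the spot intensities are assumptions for illustration; the slide only says "Log".

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from its red and green intensities."""
    log_r = math.log2(red)    # base 2 is an assumption; the slide only says "Log"
    log_g = math.log2(green)
    m = log_r - log_g         # M: log ratio of the two channels
    a = (log_r + log_g) / 2   # A: average log intensity
    return m, a

# Hypothetical (red, green) intensities for three spots.
spots = [(1024.0, 256.0), (512.0, 512.0), (64.0, 256.0)]
ma = [ma_values(r, g) for r, g in spots]
for (r, g), (m, a) in zip(spots, ma):
    print(f"red={r:7.1f} green={g:7.1f}  M={m:+.2f}  A={a:.2f}")
```

A spot with equal signal in both channels has M = 0, so a well-normalized M vs. A plot should be centered on the M = 0 line across the range of A.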
Types of normalization
• To total signal (linear normalization)
• LOESS (LOcally WEighted polynomial
regreSSion).
• To “house keeping genes”
• To genomic DNA spots (Research
Genetics) or mixed cDNA’s
• To internal spikes
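The simplest entry in the list above, normalization to total signal, can be sketched as below. This is only an illustration: the function name and intensities are hypothetical, and it assumes one global scale factor is enough, which LOESS does not assume.

```python
def scale_to_total_signal(reference, channel):
    """Linearly rescale `channel` so its total signal matches `reference`.

    Returns the rescaled intensities and the scale factor used.
    """
    factor = sum(reference) / sum(channel)
    return [value * factor for value in channel], factor

# Hypothetical intensities: the green channel is uniformly half as bright.
red = [100.0, 200.0, 300.0]
green = [50.0, 100.0, 150.0]
green_scaled, factor = scale_to_total_signal(red, green)
print(factor, green_scaled)
```

A single factor corrects an overall brightness difference but not an intensity-dependent dye bias; that case is what LOESS normalization is meant to address.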
Microarray analysis
• Data exploration: expression of gene X?
• Statistical analysis: which genes show
large, reproducible changes?
• Clustering: grouping genes by
expression pattern.
• Knowledge-based analysis: Are amine
synthesis genes involved in this
experiment?
Fold change: the crudest method of finding
differentially expressed genes
[Figure: Hela vs. HepG2, with the two regions of >2-fold expression change marked]
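A fold-change filter of this kind can be sketched in a few lines of Python. The gene names and expression values are made up for illustration, and the use of base-2 log ratios is an assumption.

```python
import math

def fold_change_hits(expr_a, expr_b, threshold=2.0):
    """Return {gene: log2 ratio} for genes changed more than `threshold`-fold.

    expr_a and expr_b map gene name -> normalized expression (> 0).
    A positive log ratio means higher expression in expr_b.
    """
    cutoff = math.log2(threshold)
    hits = {}
    for gene, value_a in expr_a.items():
        log_ratio = math.log2(expr_b[gene] / value_a)
        if abs(log_ratio) > cutoff:
            hits[gene] = log_ratio
    return hits

# Hypothetical normalized expression values.
hela = {"geneA": 100.0, "geneB": 100.0, "geneC": 400.0}
hepg2 = {"geneA": 450.0, "geneB": 150.0, "geneC": 80.0}
hits = fold_change_hits(hela, hepg2)
print(hits)
```

The filter ignores replicate variability entirely, which is why it is the crudest method: a noisy gene can pass the cutoff by chance.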
What do we mean by differentially
expressed?
• Statistically, our gene is different from the
other genes.
[Figure: distribution of measurements for the gene of interest (probability of a given value of the ratio vs. log ratio), compared with the distribution of average ratios for all genes (number of genes)]
Finding differentially expressed genes
What affects our certainty that a gene is up- or down-regulated?
• Number of sample points
• Difference in means
• Standard deviations of the samples
[Figure: probe signal for Sample A and Sample B]
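All three factors appear in a two-sample t statistic. The sketch below uses the unequal-variance (Welch) form as an assumption; the slides do not say which form is intended, and the replicate values are invented.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """t statistic for the difference in means of two replicate sets."""
    mean_diff = statistics.mean(sample_a) - statistics.mean(sample_b)
    # The standard error shrinks with more replicates and grows with the variance.
    std_err = math.sqrt(
        statistics.variance(sample_a) / len(sample_a)
        + statistics.variance(sample_b) / len(sample_b)
    )
    return mean_diff / std_err

# Hypothetical log-scale probe signals, three replicates per sample.
a = [1.0, 2.0, 3.0]
b = [4.0, 5.0, 6.0]
t_three = welch_t(a, b)
# Same spread and same means, but twice as many replicates.
t_six = welch_t(a * 2, b * 2)
print(round(t_three, 3), round(t_six, 3))
```

With the means and spread held fixed, doubling the number of replicates gives a larger |t|, which is the sense in which the number of sample points affects certainty.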
Practical views on statistics
• With appropriate biological replicates, it is possible to
select statistically meaningful genes/patterns.
• Sensitivity and selectivity are inversely related - e.g.
increased selection of true positives WILL result in more
false positives and fewer false negatives.
• False negatives are lost opportunities, false positives
cost $’s and waste time.
• Even with conservative statistics, a typical set of
experiments yields more genes/pathways/patterns than one
can sensibly follow, so use conservative statistics to
protect against false positives when designing follow-on
experiments.
Statistical Tests
• Student’s t-test
– Correct for multiple testing! (Holm-Bonferroni)
• False discovery rate.
• Significance Analysis of Microarrays (SAM)
– http://www-stat.stanford.edu/~tibs/SAM/
• ANOVA
• Principal components analysis
• Special methods for periodic patterns in data.
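The two multiple-testing approaches named above can be sketched as follows: the Holm-Bonferroni step-down procedure and, as one common way to control the false discovery rate, the Benjamini-Hochberg step-up procedure. The choice of Benjamini-Hochberg is an assumption (the slide only says "false discovery rate"), and the p-values are invented.

```python
def holm_bonferroni(pvalues, alpha=0.05):
    """Return a reject/keep flag per p-value (Holm step-down procedure)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    reject = [False] * m
    for rank, i in enumerate(order):       # rank 0 = smallest p-value
        if pvalues[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break                          # stop at the first failure
    return reject

def benjamini_hochberg(pvalues, alpha=0.05):
    """Return a reject/keep flag per p-value (Benjamini-Hochberg step-up)."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    largest = 0                            # largest rank passing its threshold
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            largest = rank
    reject = [False] * m
    for rank, i in enumerate(order, start=1):
        reject[i] = rank <= largest
    return reject

# Hypothetical per-gene p-values from t-tests.
pvals = [0.01, 0.04, 0.03, 0.005]
holm = holm_bonferroni(pvals)
bh = benjamini_hochberg(pvals)
print(holm, bh)
```

On these values Holm-Bonferroni keeps two genes while Benjamini-Hochberg keeps all four, which illustrates the usual trade: controlling the false discovery rate is less strict than controlling the chance of any false positive.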
Volcano plot: log(fold change) vs. p-value
[Figure: scatter plot showing genes with significant p-values]
Pattern finding
• In many cases, the patterns of differential expression
are the target (as opposed to specific genes)
– Clustering or other approaches for pattern
identification - find genes which behave similarly
across all experiments or experiments which behave
similarly across all genes
– Classification - identify genes which best distinguish
2 or more classes.
• The statistical reliability of the pattern or classifier is still
an issue and similar considerations apply - e.g. cluster
analysis of random noise will produce clusters which
will be meaningless….
What is clustering?
• Group similar objects together.
– Genes with similar expression patterns.
• Objects in the same cluster (group) are more
similar to each other than objects in different
clusters.
Clustering
• What is clustering?
• Similarity/distance metrics
• Hierarchical clustering algorithms
– Made popular by Stanford, e.g. [Eisen et al. 1998]
• K-means
– Made popular by many groups, e.g. [Tavazoie
et al. 1999]
• Self-organizing map (SOM)
– Made popular by Whitehead, e.g. [Tamayo et al.
1999]
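As an illustration of one of the algorithms listed above, here is a minimal k-means sketch. It is not the method of any cited paper: the initialization from the first k profiles, the fixed iteration count and the expression profiles are all simplifying assumptions.

```python
import math

def kmeans(points, k, iterations=20):
    """Cluster `points` (lists of floats) into k groups.

    Returns (assignment, centroids), where assignment[i] is the cluster
    index of points[i].
    """
    # Naive initialization from the first k points (an assumption; real
    # tools usually pick random or spread-out starting centroids).
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(range(k), key=lambda c: math.dist(p, centroids[c]))
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical expression profiles: 4 genes x 3 experiments.
profiles = [[1.0, 1.1, 0.9], [5.0, 5.2, 4.9], [1.2, 0.9, 1.0], [4.8, 5.1, 5.0]]
labels, centers = kmeans(profiles, k=2)
print(labels)
```

Unlike hierarchical clustering, k-means needs the number of clusters up front and uses Euclidean distance here, so it groups genes by magnitude as well as direction.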
Typical Tools
• SAM (Significance Analysis of
Microarrays), Stanford
• GeneSpring
• Affymetrix GeneChip Operating System
(GCOS)
• Cluster/Treeview
• R statistics package microarray analysis
libraries.
How to define similarity?
[Figure: raw matrix (n genes × p experiments) converted to a similarity matrix (n genes × n genes); X and Y are two gene rows]
• Similarity metric:
– A measure of pairwise similarity or dissimilarity
– Examples:
• Correlation coefficient
• Euclidean distance
Similarity metrics
• Euclidean distance
$$d(X, Y) = \sqrt{\sum_{j=1}^{p} (X[j] - Y[j])^2}$$
Euclidean clustering = magnitude & direction
• Correlation coefficient
$$r(X, Y) = \frac{\sum_{j=1}^{p} (X[j] - \bar{X})(Y[j] - \bar{Y})}{\sqrt{\sum_{j=1}^{p} (X[j] - \bar{X})^2 \, \sum_{j=1}^{p} (Y[j] - \bar{Y})^2}}, \quad \text{where } \bar{X} = \frac{1}{p} \sum_{j=1}^{p} X[j]$$
Correlation clustering = direction
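A short Python sketch of the two similarity metrics, written for two gene profiles X and Y measured over p experiments. The function names and profile values are illustrative assumptions.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences over the p experiments."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def correlation(x, y):
    """Pearson correlation coefficient between two expression profiles."""
    p = len(x)
    mean_x = sum(x) / p
    mean_y = sum(y) / p
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    return numerator / denominator

# Hypothetical profiles: same direction of change, ten-fold different magnitude.
gene_x = [1.0, 2.0, 3.0, 4.0]
gene_y = [10.0, 20.0, 30.0, 40.0]
dist = euclidean_distance(gene_x, gene_y)
corr = correlation(gene_x, gene_y)
print(round(dist, 2), round(corr, 2))
```

The two genes are far apart by Euclidean distance but perfectly correlated, which is the difference between clustering on magnitude and direction versus direction alone.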
Sporulation-example
Self-organizing maps (SOM)
[Kohonen 1995]
• Basic idea:
– map high dimensional data onto a 2D
grid of nodes
– Neighboring nodes are more similar
than points far away
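The basic idea can be sketched as a small training loop. This is a simplified illustration, not Kohonen's full algorithm as published: the linear decay schedules, the Gaussian neighborhood, the default parameters and the data are all assumptions.

```python
import math
import random

def best_matching_unit(weights, x):
    """Grid coordinates of the node whose weight vector is closest to x."""
    return min(weights, key=lambda node: math.dist(weights[node], x))

def train_som(data, rows, cols, steps=500, lr0=0.5, radius0=None, seed=0):
    """Fit a rows x cols grid of weight vectors to `data`."""
    rng = random.Random(seed)
    dim = len(data[0])
    radius0 = radius0 or max(rows, cols) / 2
    # One weight vector per grid node, initialized from randomly chosen samples.
    weights = {(r, c): list(rng.choice(data)) for r in range(rows) for c in range(cols)}
    for step in range(steps):
        frac = step / steps
        lr = lr0 * (1 - frac)                    # learning rate decays to 0
        radius = radius0 * (1 - frac) + 1e-9     # neighborhood shrinks over time
        x = rng.choice(data)
        bmu = best_matching_unit(weights, x)
        for node, w in weights.items():
            grid_d2 = (node[0] - bmu[0]) ** 2 + (node[1] - bmu[1]) ** 2
            influence = math.exp(-grid_d2 / (2 * radius ** 2))
            # Pull the node toward the sample; grid neighbors of the BMU move most.
            for j in range(dim):
                w[j] += lr * influence * (x[j] - w[j])
    return weights

# Hypothetical expression profiles scaled to the range 0..1.
data = [[0.0, 0.1, 0.0], [0.1, 0.0, 0.1], [1.0, 0.9, 1.0], [0.9, 1.0, 0.9]]
som = train_som(data, rows=2, cols=2)
for profile in data:
    print(profile, "->", best_matching_unit(som, profile))
```

Because each update also moves the grid neighbors of the winning node, nodes that sit next to each other on the grid end up with similar weight vectors.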
Self-organizing maps (SOM)
SOM Clusters
Things learned from microarray
gene expression experiments
• Pathways not known to be involved
– Ontology?
• Novel genes involved in a known
pathway
• “like” and “unlike” tissues
Transcription Factors
Regulatory Networks
• Identify co-regulated genes
• Search for common motifs
(transcription factor binding sites)
– Evaluate known motifs/factors
– Search for new ones.
• Programs: MEME, etc.
mRNA-protein Correlation
• YPD: should have relevant data
– will yeast be typical?
• Electrophoresis 18:533
– 23 proteins on 2D gels
– r = 0.48 for mRNA vs. protein
• Post transcriptional and post
translational regulation important!
Other microarray formats
• Single nucleotide polymorphism (SNP) chips
– Oligos with each of 4 nt at each SNP.
• Chromosomal IP chips (ChIP:chip)
– Determine transcription factor binding sites
– Promoter DNA on the chip.
• Alternative splicing chips
– Long oligos, covering alternatively spliced
exons, or all exons.
• Genome tiling chips
ChIP:chip--Identification of
Transcription Factor Binding Sites
• Cross link transcription factors to DNA with
formaldehyde
• Pull out transcription factor of interest via
immunoprecipitation with an antibody or by
tagging the factor of interest with an
isolatable epitope (e.g. GST fusion).
• Fractionate the DNA associated with the
transcription factor, reverse the cross links,
label and hybridize to an array of promoter
DNA.
• Brown et al. (2001) Nature 409:533-8
ChIP:chip
Analysis of TF
Binding Sites
On to Proteomics
DNA → RNA → Protein