Gene Expression and DNA Chips

Based on slides by Ron Shamir
http://www.bio.davidson.edu/courses/genomics/chip/chip.html
Monitoring Gene Expression
• Goal: Simultaneous measurement of expression
levels of all genes in one experiment.
• 2 fundamental biological assumptions:
– Transcription level indicates genes’ regulation.
– Only genes that contribute to organism fitness are
expressed.
=> Detecting changes in a gene’s expression level
provides clues on the function of its product
Factors controlling expression
[Diagram] DNA → transcription → pre-mRNA → splicing → mature mRNA → translation → protein
Control points shown: chromatin remodeling, alternative splicing, RNA interference / degradation, post-translational modifications
Hybridization
• DNA double strands form
by “gluing” of
complementary single
strands
• Complementarity rule: A-T,
G-C
TGAGGC
| | | | | |
ACTCCG
Use probe to identify if target contains
a particular sequence
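The complementarity rule can be sketched in a few lines of Python. This is an illustrative sketch only: the function names `complement` and `probe_hits` are hypothetical, and it models perfect A-T / G-C pairing position by position, ignoring strand direction just as the diagram does.

```python
# Base-pairing rule from the slide: A-T, G-C.
PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement(seq):
    """Return the position-by-position complement of a DNA string."""
    return "".join(PAIR[base] for base in seq)

def probe_hits(probe, target):
    """Return start positions in `target` where `probe` would pair perfectly."""
    site = complement(probe)  # the sequence the probe "glues" to
    width = len(site)
    return [i for i in range(len(target) - width + 1)
            if target[i:i + width] == site]

print(complement("TGAGGC"))                # ACTCCG, the pair shown on the slide
print(probe_hits("TGAGGC", "GGACTCCGTT"))  # [2]
```

A non-empty result means the target contains the sequence the probe was designed to detect.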
DNA chips / Microarrays
• Perform thousands of
hybridizations in a single
experiment
• Variants:
– Oligonucleotide arrays
– cDNA microarrays
• Another distinction
– Single channel
– Dual channel
• Allow global view of cellular
processes: Monitor transcription
levels of numerous/all genes
simultaneously.
Oligonucleotide Arrays
http://www.affymetrix.com/corporate/media/image_library/
A single feature on the chip
Hybridization on chip
Reading off a chip
Manufacturing a GeneChip Array
Many probes for a single gene
cDNA Microarrays
For Flash animation of the technology, see
http://www.bio.davidson.edu/Courses/genomics/chip/chip.html
cDNA Microarrays (2)
Affymetrix oligo arrays vs cDNA microarrays
Affymetrix oligo arrays:
•Short oligos
•Low specificity
•High density
•Many probes per gene
•Synthetic oligos
•Absolute exp values
•Yield problems
•“turnkey” solutions
•Price: +++
cDNA microarrays:
•Long oligos
•High specificity
•Lower density
•One probe per gene
•Probes: cDNAs
•Relative exp values
•Spotting problems
•Custom solutions
•Price: ++
… and other technologies
• Agilent:
– In situ synthesized arrays using ink-jet
technology
– 60-mer arrays: more specific than Affy’s
– Allows custom design without expensive masks
– Differential measurements: target vs
reference
• Nimblegen
• Illumina
Comparative genomic hybridization (CGH) microarrays
(almost identical to gene expression arrays, but genomic DNA is hybridized instead of mRNA)
[Diagram labels: cells of interest; reference sample; isolate genomic DNA; fluorescently labeled; known DNA sequences; glass slide]
Chromosomes with varying copy number fluctuations from analysis of the tumor cell line SK-BR-3 as
compared with the normal reference
Robert Lucito et al. Genome Res. 2003; 13: 2291-2305
Single nucleotide polymorphism (SNP) detection
SNP: single base sequence variation
Target sequence (N = A/G): GCCATGCANGAGTTACTACAGTAGC
PM +4, Allele A: CGGTACGTTCTCAATGATGTCATCG
MM +4, Allele A: CGGTACGTTCTCTATGATGTCATCG
PM +4, Allele B: CGGTACGTCCTCAATGATGTCATCG
MM +4, Allele B: CGGTACGTCCTCTATGATGTCATCG
(Affymetrix Human Mapping 500K Array)
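The four probes listed for this target can be reproduced with a short Python sketch. It rests on assumptions read off the example rather than stated on the slide: PM and MM are taken to mean perfect-match and mismatch probes, probes are written as the position-by-position complement of the target, and "+4" is taken to mean that the MM probe carries its deliberate mismatch 4 bases after the SNP. The function name `snp_probes` is hypothetical.

```python
# Complement table for the A-T / G-C pairing rule.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def snp_probes(target, allele, mm_offset=4):
    """Return (PM, MM) probes for one allele of a target whose SNP is marked 'N'.

    Assumption: the MM probe differs from the PM probe at `mm_offset` bases
    after the SNP, where the probe base is swapped for its complement.
    """
    snp_pos = target.index("N")
    pm = target.replace("N", allele).translate(COMPLEMENT)
    mm_pos = snp_pos + mm_offset
    mm = pm[:mm_pos] + pm[mm_pos].translate(COMPLEMENT) + pm[mm_pos + 1:]
    return pm, mm

target = "GCCATGCANGAGTTACTACAGTAGC"
for name, base in (("A", "A"), ("B", "G")):
    pm, mm = snp_probes(target, base)
    print(f"Allele {name}: PM {pm}  MM {mm}")
```

Comparing PM and MM intensities for each allele is what lets the array distinguish the two variants of the SNP.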
Remember Gene Transcription?
[Diagram: double-stranded DNA (G A T T A C A . . . paired with C T A A T G T . . ., 5’ and 3’ ends marked) bound by transcription factors (proteins) and RNA polymerase (protein)]
Transcription factors recognize transcription factor binding sites and bind to them, forming a complex.
RNA polymerase binds the complex.
(eukaryotes)
Using microarrays to measure
protein-DNA interactions
ChIP-chip:
Chromatin immunoprecipitation chip
(microarray)
(antibodies bind transcription
factor of interest)
(TF-bound sequences
hybridized to microarray)
Simon et al., Cell 2001
Mapping transcription factor binding
sites in yeast with ChIP-chip
Harbison C., Gordon B., et al. Nature 2004
Dynamic role of transcription
factors
Harbison C., Gordon B., et al. Nature 2004
Other microarray applications:
Competitive growth assays
[Diagram: deletion strains yfg1Δ, yfg2Δ, yfg3Δ, …, each tagged with its own barcode (CTAACTC, TCGCGCA, TCATAAT, …); rich media; growth 6hrs in minimal media]
Harvest and label genomic DNA
Measuring relative fitness
with a barcode microarray
Oligo barcodes matching each
strain are also spotted on a DNA
microarray
Protein Microarrays
• Protein microarrays are lagging behind DNA
microarrays
• Same idea but immobilized elements are proteins
instead of nucleic acids
• The number of elements (proteins) on current protein
microarrays is limited (approx. 500)
• Antibodies for high density microarrays have
limitations (cross-reactivities)
• Aptamers or engineered antibodies/proteins may be
viable alternatives
(Aptamers: RNAs that bind proteins with high specificity and affinity)
Applications
Screening for:
• Small molecule
targets
• Post-translational
modifications
• Protein-protein
interactions
• Protein-DNA
interactions
• Enzyme assays
• Epitope mapping
High-throughput proteomic analysis
Label all Proteins in Mixture
Haab et al. Genome Biology 2000;1:1-22
Cytokine Specific Microarray
(Microarray version of ELISA)
[Diagram: detection system built from a capture MAb, the antigen (cytokine or marker protein) and a biotinylated MAb; spots shown for IL-1, IL-6, IL-10, VEGF and MIX]
Tissue Microarrays
• Printing on a slide tiny amounts of tissue
• Array many patients in one slide (e.g. 500)
• Process all at once (e.g.
immunohistochemistry)
• Works with archival tissue (paraffin blocks)
Tissue Microarray
Alizadeh et al. J Pathol 2001;195:41-52
How Gene Expression Data
Looks
Entries of the Raw Data matrix:
• Ratio values
• Absolute values
• Distributions…
• Column = experiment/condition’s profile
• Row = gene’s expression pattern / fingerprint vector
[Figure: “Raw Data” matrix of expression levels; rows are genes, columns are conditions]
Data Preprocessing
•Input: Real-valued raw data matrix.
•Compute the similarity matrix (dot product/correlation/…)
From the Raw Data matrix we compute the similarity matrix S. Sij reflects the similarity of the expression patterns of gene i and gene j.
[Figure: the “Raw Data” matrix of expression levels (genes × conditions) next to the gene × gene similarity matrix]
DNA chips: Applications
• Deducing functions of unknown genes
(similar expression pattern → similar function)
• Identifying disease profiles
• Deciphering regulatory mechanisms
(co-expression → co-regulation).
• Classification of biological conditions
• Genotyping
• Drug development
• …
Analysis requires clustering of genes/conditions.
Pearson Correlation Coefficient,
r. Values are in [-1,1] interval
• Gene expression over d experiments is a
vector in R^d, e.g. for gene C: (0, 3, 3.58, 4, 3.58, 3)
• Given two vectors X and Y that contain N
elements, we calculate r as follows:
Cho & Won, 2003
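A minimal Python sketch of the coefficient, assuming the standard definition (covariance of X and Y divided by the product of their standard deviations). The function name `pearson` and the second expression vector are hypothetical; only gene C comes from the slide.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length vectors."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two vectors of the same length (at least 2)")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

gene_c = [0, 3, 3.58, 4, 3.58, 3]        # from the slide
gene_x = [0, 1.5, 1.8, 2.1, 1.7, 1.4]    # hypothetical second profile
print(round(pearson(gene_c, gene_x), 3))
print(round(pearson(gene_c, [-v for v in gene_c]), 3))  # mirrored profile gives -1.0
```

A vector with zero variance makes the denominator zero, so constant expression profiles need to be filtered out before calling this function.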
Intuition for Pearson
Correlation Coefficient
r(v1,v2) close to 1: v1, v2 highly correlated.
r(v1,v2) close to -1: v1, v2 anti-correlated.
r(v1,v2) close to 0: v1, v2 not correlated.
Pearson Correlation and p-Values
When the entries of v1, v2 are normally distributed,
we can assign (and efficiently compute)
p-Values for a given result.
These p-Values are determined by the Pearson
correlation coefficient, r, and the
dimension, d, of the vectors.
For same r, vectors of higher dimension will
be assigned more significant (smaller) p-Value.
Spearman Rank Order Coefficient
(a close relative of Pearson,
non parametric)
• Replace each entry xi by its rank in vector x.
• Then compute Pearson correlation coefficients of
rank vectors.
• Example: X = Gene C = (0, 3.00, 3.41, 4, 3.58, 3.01)
Y = Gene D = (0, 1.51, 2.00, 2.32, 1.58, 1)
• Ranks(X)= (1,2,4,6,5,3)
• Ranks(Y)= (1,3,5,6,4,2)
• Ties should be taken care of, but: (1) rare
(2) can randomize (small effect)
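The two steps (rank, then Pearson on the ranks) can be sketched in Python using the Gene C / Gene D example. The function names are hypothetical, and ties are broken by position here, which is a simplifying assumption rather than the slide's prescription.

```python
import math

def ranks(values):
    """Rank of each entry (1 = smallest). Ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the rank vectors."""
    return pearson(ranks(x), ranks(y))

gene_c = [0, 3.00, 3.41, 4, 3.58, 3.01]
gene_d = [0, 1.51, 2.00, 2.32, 1.58, 1]
print(ranks(gene_c))   # [1, 2, 4, 6, 5, 3]
print(ranks(gene_d))   # [1, 3, 5, 6, 4, 2]
print(round(spearman(gene_c, gene_d), 4))   # 0.8857
```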
From Pearson Correlation
Coefficients to a Gene Network
• Compute correlation coefficient for all
pairs of genes (what about missing data?)
• Choose p-Value threshold.
• Put an edge between gene i and gene j iff the
p-Value is more significant (smaller) than the threshold.
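A rough Python sketch of this construction. Two things are assumptions, not the slide's method: the p-value comes from an exact one-sided permutation test (it replaces the normal-theory p-value and is only feasible for very short vectors), and an edge is added when the p-value is at most the threshold. All names and the three toy profiles are hypothetical.

```python
import math
from itertools import combinations, permutations

def pearson(x, y):
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def permutation_p_value(x, y):
    """Fraction of orderings of y whose correlation with x is at least the observed one."""
    observed = pearson(x, y)
    shuffles = list(permutations(y))
    hits = sum(1 for shuffled in shuffles if pearson(x, shuffled) >= observed - 1e-12)
    return hits / len(shuffles)

def build_network(profiles, threshold):
    """Return edges (gene_i, gene_j) whose p-value is at most `threshold`."""
    edges = []
    for g1, g2 in combinations(sorted(profiles), 2):
        if permutation_p_value(profiles[g1], profiles[g2]) <= threshold:
            edges.append((g1, g2))
    return edges

profiles = {
    "A": [1, 2, 3, 4, 5],
    "B": [2, 4, 6, 8, 10.5],
    "C": [3, 1, 4, 2, 3],
}
print(build_network(profiles, threshold=0.05))
```

Missing data is not handled; one simple option would be to drop, for each pair, the conditions in which either gene lacks a measurement.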
Clustering: Objective
Group elements (genes) to clusters
satisfying:
• Homogeneity: Elements inside a cluster
are highly similar to each other.
• Separation: Elements from different
clusters have low similarity to each other.
The Clustering Bazaar
Hierarchical clustering
An Alternative View
Form a tree-hierarchy of the input
elements satisfying:
• More similar elements are placed closer
along the tree.
• Or: Tree distances reflect element
similarity
•Note: No explicit partition into clusters.
Hierarchical Representations
Hierarchical Representations (2)
Dendrogram: rooted tree, usually
binary, and all root-leaf distances
are equal
[Figure: example dendrogram on four elements (1-4); height axis labels 2.8, 4.5, 5.0]
Neighbor Joining Algorithm
Saitou & Nei, 87
• Input: Distance matrix Dij; Initially each
element is a cluster.
• Find min element Drs in D; merge clusters r,s
• Delete elts. r,s, add new elt. t with
Dit=Dti=(Dir+ Dis – Drs)/2
• Repeat
• Present the hierarchy as a tree with similar
elements near each other
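One iteration of the procedure as the slide states it can be sketched as follows. This follows the slide's simplified description (merge the pair with the smallest raw distance); the published Saitou & Nei method selects the pair with a corrected criterion instead, which is not shown here. The function name and the dictionary layout are hypothetical.

```python
def merge_closest(dist):
    """One merge step on a dict {frozenset({a, b}): distance}.

    Returns the new cluster label t and the updated distance dict, using
    D(i,t) = (D(i,r) + D(i,s) - D(r,s)) / 2 from the slide.
    """
    closest = min(dist, key=dist.get)
    r, s = sorted(closest)
    d_rs = dist[closest]
    t = f"({r},{s})"
    others = {x for key in dist for x in key} - {r, s}
    # Keep distances that do not involve r or s, then add distances to t.
    updated = {k: v for k, v in dist.items() if r not in k and s not in k}
    for i in others:
        d_ir = dist[frozenset((i, r))]
        d_is = dist[frozenset((i, s))]
        updated[frozenset((i, t))] = (d_ir + d_is - d_rs) / 2
    return t, updated

dist = {
    frozenset(("a", "b")): 2,
    frozenset(("a", "c")): 6,
    frozenset(("b", "c")): 6,
}
t, dist = merge_closest(dist)
print(t, dist[frozenset(("c", t))])   # (a,b) 5.0
```

Calling `merge_closest` repeatedly until one cluster remains, and recording each merge, yields the tree.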
Hierarchical Clustering: Average Linkage
Sokal & Michener 58, Lance & Williams 67
• Input: Distance matrix Dij; Initially
each element is a cluster. nr- size of
cluster r
• Find min element Drs in D; merge
clusters r,s
• Delete elts. r,s, add new elt. t with
Dit=Dti=nr/(nr+ns)•Dir+ ns/(nr+ns) • Dis
• Repeat
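The average linkage loop can be sketched in Python as below. The function name, the cluster labelling scheme and the toy distances are hypothetical; the update line is the size-weighted formula from the slide.

```python
def average_linkage(dist, labels):
    """Agglomerative clustering on a dict {frozenset({a, b}): distance}.

    Returns the list of merges as (cluster_r, cluster_s, merge_distance).
    """
    size = {label: 1 for label in labels}
    dist = dict(dist)
    merges = []
    while len(size) > 1:
        closest = min(dist, key=dist.get)
        r, s = sorted(closest)
        merges.append((r, s, dist[closest]))
        n_r, n_s = size.pop(r), size.pop(s)
        t = f"({r},{s})"
        updated = {k: v for k, v in dist.items() if r not in k and s not in k}
        for i in size:  # every remaining cluster other than r and s
            d_ir = dist[frozenset((i, r))]
            d_is = dist[frozenset((i, s))]
            updated[frozenset((i, t))] = (n_r * d_ir + n_s * d_is) / (n_r + n_s)
        size[t] = n_r + n_s
        dist = updated
    return merges

pairs = {("a", "b"): 1, ("a", "c"): 4, ("b", "c"): 6,
         ("a", "d"): 9, ("b", "d"): 11, ("c", "d"): 8}
dist = {frozenset(p): d for p, d in pairs.items()}
for merge in average_linkage(dist, ["a", "b", "c", "d"]):
    print(merge)
```

In this toy run the final merge distance equals the plain average of the three original distances to d (9, 11 and 8), which is what the size weights are designed to preserve.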
A General Framework
Lance & Williams 67
• Input: Distance matrix Dij; Initially each
element is a cluster.
• Find min element Drs in D, merge clusters
r,s
• Delete elts. r,s, add new elt. t with
Dit=Dti=αr•Dir + αs•Dis + γ•|Dir-Dis|
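A small Python sketch of how this framework covers several linkage rules through its coefficients, written here as `alpha_r`, `alpha_s` and `gamma` (weights on Dir, Dis and |Dir-Dis|). The function name is hypothetical; the parameter choices shown are the well-known ones for single, complete and average linkage.

```python
def lance_williams_update(d_ir, d_is, alpha_r, alpha_s, gamma):
    """Distance from cluster i to the merged cluster t = r + s."""
    return alpha_r * d_ir + alpha_s * d_is + gamma * abs(d_ir - d_is)

d_ir, d_is = 3.0, 7.0
n_r, n_s = 2, 1   # cluster sizes, needed only for average linkage

single = lance_williams_update(d_ir, d_is, 0.5, 0.5, -0.5)    # min(d_ir, d_is)
complete = lance_williams_update(d_ir, d_is, 0.5, 0.5, 0.5)   # max(d_ir, d_is)
average = lance_williams_update(d_ir, d_is,
                                n_r / (n_r + n_s), n_s / (n_r + n_s), 0.0)
print(single, complete, round(average, 3))   # 3.0 7.0 4.333
```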
Hierarchical clustering of GE data
Eisen et al., PNAS 1998
• Growth response: Starved human fibroblast cells, added serum
• Monitored levels of 8600 genes over 13 timepoints using cDNA microarrays
• tij - fluorescence levels of target gene i in condition j; rij – same for reference
• Dij = log(tij/rij)
• D*ij = [Dij – E(Di)]/std(Di)
• Similarity of genes k,l: Skl = (Σj D*kj • D*lj)/Ncond
• Applied average linkage method
• Ordered leaves by increasing subtree weight: average expression level, time of maximal induction, other criteria
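The log-ratio, standardization and similarity steps can be sketched in Python as follows. Assumptions: log base 2 (the base cancels out after standardization), std taken as the population standard deviation so that a gene's similarity to itself is 1, and hypothetical function names and intensity values.

```python
import math

def standardized_log_ratios(target, reference):
    """D*_ij for one gene: log ratios, centered and scaled across conditions."""
    log_ratios = [math.log2(t / r) for t, r in zip(target, reference)]
    n = len(log_ratios)
    mean = sum(log_ratios) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in log_ratios) / n)
    return [(v - mean) / std for v in log_ratios]

def similarity(d_star_k, d_star_l):
    """S_kl = (sum over conditions of D*_kj * D*_lj) / N_cond."""
    return sum(a * b for a, b in zip(d_star_k, d_star_l)) / len(d_star_k)

# Hypothetical fluorescence levels over four conditions.
reference = [100.0, 100.0, 100.0, 100.0]
gene_k = standardized_log_ratios([100.0, 200.0, 400.0, 200.0], reference)
gene_l = standardized_log_ratios([90.0, 150.0, 300.0, 210.0], reference)
print(round(similarity(gene_k, gene_k), 3))   # 1.0
print(round(similarity(gene_k, gene_l), 3))
```

With these choices Skl is the Pearson correlation of the two log-ratio profiles.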
Comments
• Distinct measurements of same genes
cluster together
• Genes of similar function cluster together
• Many cluster-function specific insights
• Interpretation is a REAL biological
challenge
More on hierarchical methods
• All methods described above –
agglomerative
• An alternative approach: Divisive
• Advantages:
– gives a single coherent global picture
– Intuitive for biologists (from phylogeny)
• Disadvantages:
– no single partition; no specific clusters
– Forces all elements to fit a tree
hierarchy
Non-Hierarchical Clustering
Clustering: Objective
Group elements (genes) to clusters
satisfying:
• Homogeneity: Elements inside a cluster
are highly similar to each other.
• Separation: Elements from different
clusters have low similarity to each other.
•Needs formal objective functions
•Most useful versions are NP-hard.
K-means clustering
MacQueen, 65
• Initialize an arbitrary partition P into k
clusters C1 ,…, Ck.
• For cluster Cj, element i ∉ Cj, EP(i, Cj) =
cost of soln. if i is moved to cluster Cj.
Pick EP(r, Cs) that is minimum; move r to
cluster Cs if the new partition is better
than P
• Repeat until no improvement possible
• Requires knowledge of k
K-means variations
• Input: vector vi for each element i
• Compute a centroid cp for each cluster Cp,
e.g., gravity center = average vector
• Solution cost: Σ_p Σ_{i ∈ Cp} d(vi,cp)
• EP(i,j)= change in soln. cost if i is moved to
cluster Cj.
• Parallel version: move each elt. to the cluster
with the closest centroid simultaneously
• Sequential version: one elt. each time
• “moving centers” approach
• Objective = homogeneity only (k fixed)
• Variations for changing k
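The parallel version (every element moves to its closest centroid, then centroids are recomputed) can be sketched in Python as below. Assumptions: d is squared Euclidean distance, the first k points serve as initial centroids to keep the run deterministic, and all names and data points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100):
    """Parallel k-means. Returns (cluster index per point, centroids)."""
    centroids = [list(p) for p in points[:k]]   # assumption: first k points as seeds
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignment == assignment:   # no element moved: converged
            break
        assignment = new_assignment
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:   # keep the old centroid if a cluster becomes empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

points = [(0, 0), (10, 10), (0, 1), (1, 0), (10, 11), (11, 10)]
assignment, centroids = kmeans(points, k=2)
print(assignment)   # [0, 1, 0, 0, 1, 1]
print(centroids)
```

The result depends on the initial centroids, so in practice the procedure is usually restarted from several seeds and the lowest-cost solution is kept.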
Gene Expression Omnibus
(GEO)
• NCBI’s repository for gene expression data
• The EU database is ArrayExpress
• Both databases exchange data (GenBank’s
model)
• Basic entities
– Series – a deposited experiment that wasn’t
processed yet, but the data is available
– Dataset – processed and manually curated
– Platform – a microarray platform (e.g., Affymetrix
HG-U133A chips)
– Profiles – the expression of a gene in an experiment
Novartis GNF
Promoter analysis
Position Weight Matrix (PWM)
a.k.a Position Specific Scoring Matrix (PSSM)
Example:
Position   1     2     3     4     5     6
A          0.1   0.8   0     0.7   0.2   0
C          0     0.1   0.5   0.1   0.4   0.6
G          0     0     0.5   0.1   0.4   0.1
T          0.9   0.1   0     0.1   0     0.3
Need to set score threshold
Sequence                  Score
ATGCAGGATACACCGATCGGTA    0.0605
GGAGTAGAGCAAGTCCCGTGA     0.0605
AAGACTCTACAATTATGGCGT     0.0151
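A Python sketch of scanning a sequence with this matrix. It assumes the score of a 6-base window is the product of the per-position probabilities and that the value reported for a sequence is its best window; under those assumptions the three scores of the example are reproduced. Function names are hypothetical.

```python
import math

# Position weight matrix from the example (rows A, C, G, T; six positions).
PWM = {
    "A": [0.1, 0.8, 0.0, 0.7, 0.2, 0.0],
    "C": [0.0, 0.1, 0.5, 0.1, 0.4, 0.6],
    "G": [0.0, 0.0, 0.5, 0.1, 0.4, 0.1],
    "T": [0.9, 0.1, 0.0, 0.1, 0.0, 0.3],
}
WIDTH = 6

def window_score(window):
    """Product of the matrix entries selected by each base of the window."""
    return math.prod(PWM[base][pos] for pos, base in enumerate(window))

def best_hit(sequence):
    """Return (score, start) of the best-scoring window in the sequence."""
    return max((window_score(sequence[i:i + WIDTH]), i)
               for i in range(len(sequence) - WIDTH + 1))

for seq in ("ATGCAGGATACACCGATCGGTA",
            "GGAGTAGAGCAAGTCCCGTGA",
            "AAGACTCTACAATTATGGCGT"):
    score, start = best_hit(seq)
    print(seq, seq[start:start + WIDTH], round(score, 4))
```

A promoter would be reported as containing a putative binding site when its best score passes the chosen threshold.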
Computational approaches to
promoter analysis
• Look for overrepresented BSs in groups of
promoters
– Obtained by clustering expression profiles
– Of genes with a common known function (e.g. from
GO annotations)
– From ChIP-chip data – requires knowledge of the TF,
and an antibody.
– Use a combination of sources
• De-novo or using known TF signatures
Location analysis
Ren et al., Science 290:2306-2309 (2000).
ATM-dependent Transcriptional
Response to Ionizing Radiation
• DNA damage response modulates many
signaling pathways, including lesion
processing, repair, cell cycle checkpoints
and apoptotic pathway.
• ATM protein kinase is a master regulator
of cellular response to double strand
breaks.
Goal: identify the transcriptional network.
S. Rashi, R. Elkon, N. Weizman, C. Linhart, N. Amariglio, N. Orlev, G. Sternberg, A.
Barzilai, Y. Shiloh
Experimental Design
Gene expression profiles:
wild-type and Atm-/- mice ± ionizing radiation.
Thymus tissue, time points: 0, 30 min, 120 min
Filtering ‘responding genes’
1206 genes whose expression level is changed by >1.75 fold
Clustering
6 main clusters generated by the CLICK algorithm
Promoter Analysis
NF-κB and p53 found by PRIMA analysis
Major Gene Clusters – Irradiated Thymus
Atm-dependent responding genes:
The genes respond to radiation only in wild type
Major Gene Clusters – Irradiated Thymus
Atm-dependent 2nd wave of responding
genes
Major Gene Clusters – Irradiated Thymus
Similar response in both genotypes
Transcription Factors
Clues are in the promoters
[Diagram: ATM at the top; a hidden layer of transcription factors (TF-A ?, TF-B ?, TF-C ?, TF-D ?, p53 ?); an observed layer of genes g1-g13]
R. Elkon, C. Linhart, Y. Shiloh
PRIMA: PRomoter Integration in
Microarray Analysis
• Assumption: Co-expression → Transcriptional
co-regulation → common cis-regulatory
promoter elements
• Step 1: Identification of co-expressed genes
using microarray technology and clustering
algorithms
• Step 2: Computational identification of
transcription factors whose binding site
signatures are significantly over-represented
among promoters of co-expressed genes
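Step 2 can be illustrated with a generic over-representation test in Python. This is a hedged sketch using a hypergeometric tail, not necessarily the statistic PRIMA itself computes. The function name, argument names and counts are hypothetical, and the background set is assumed to include the cluster's genes.

```python
from math import comb

def over_representation(cluster_hits, cluster_size, background_hits, background_size):
    """Return (enrichment factor, hypergeometric p-value) for one binding-site signature.

    cluster_hits:    promoters in the co-expressed cluster that contain the site
    background_hits: promoters in the whole background set that contain the site
    """
    factor = (cluster_hits / cluster_size) / (background_hits / background_size)
    # P(X >= cluster_hits) when drawing cluster_size promoters without replacement.
    upper = min(background_hits, cluster_size)
    tail = sum(
        comb(background_hits, k) * comb(background_size - background_hits, cluster_size - k)
        for k in range(cluster_hits, upper + 1)
    )
    return factor, tail / comb(background_size, cluster_size)

factor, p_value = over_representation(cluster_hits=3, cluster_size=3,
                                      background_hits=4, background_size=10)
print(round(factor, 2), round(p_value, 4))   # 2.5 0.0333
```

When many signatures are tested against many clusters, the resulting p-values need a multiple-testing correction before being called significant.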
PRIMA - Results
Transcription factor   Enrichment factor   P-value
NF-κB                  5.1                 3.8x10^-8
p53                    4.2                 9.6x10^-7
Hypothesis: NF-κB and p53 mediate the late response to DNA damage.
Molecular Classification of Cancer:
Class Discovery and Class Prediction
by Gene Expression Monitoring
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller
H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES.
Science 286 (Oct 1999) 531-537
Computational paper: Class Prediction and Discovery Using Gene Expression Data
Donna K. Slonim, Pablo Tamayo, Jill P. Mesirov, Todd R. Golub, Eric S. Lander Proc.
RECOMB 2000
ppt Source: Elashof-Horvath UCLA course, Statistical Analysis of DNA Microarray
Data http://www.genetics.ucla.edu/horvathlab/Biostat278/Biostat278.htm
Background: Cancer Classification
• Cancer classification is central to cancer treatment;
• Traditional cancer classification methods: by sites;
by morphology, etc;
• Limitations of morphology classification: tumors of
similar histopathological appearance can have
significantly different clinical courses and response
to therapy;
• Traditionally cancer classification relied on specific
biological insights
• Challenges:
– finer classification of morphologically similar tumors at the
molecular level;
– systematic and unbiased approaches;
Background: Cancer Classification
(Continued)
Three challenges:
• Class prediction (classification) :
assignment of particular tumor
samples to already-defined classes.
• Feature selection : Identify the most
informative genes for prediction
• Class discovery : defining previously
unrecognized tumor subtypes ( =
clusters)
Background: Leukemia
• Acute leukemia: variability in clinical outcome and subtle
differences in nuclear morphology
• Subtypes: acute lymphoblastic leukemia (ALL) or acute
myeloid leukemia (AML);
• ALL subcategories: T-lineage ALL and B-lineage ALL;
• Particular subtypes of acute leukemia have been found
to be associated with specific chromosomal
translocations;
• No single test is currently sufficient to establish the
diagnosis, but a combination of different tests in
morphology, histochemistry and immunophenotyping etc.
• Although usually accurate, leukemia classification
remains imperfect and errors do occur;
Objective
• Develop a systematic approach to cancer
classification based on gene expression data
from microarray
• Use leukemia as test case
Method: Biological Samples & microarrays
• Learning set: 38 bone marrow samples (27 ALL, 11 AML) obtained from acute leukemia patients at the time of diagnosis;
• Test set: 34 leukemia samples (24 bone marrow and 10 peripheral blood samples);
• RNA from cells hybridized to high-density Affymetrix oligo arrays (6817 human genes)
Feature selection
50 genes most highly correlated with the AML-ALL distinction:
Class predictor
• Prediction of new samples: 36 of 38 samples were assigned as either AML or ALL; the remaining 2 were uncertain.
• All predictions agree with patients’ clinical diagnosis.