Analysis of microarray data


Analysis of microarray
data
Introduction
• Microarrays are chips which measure
whether genes are switched on or off in
cells.
• They can be used to detect sets of
genes responsible for genetic diseases
such as cancer.
• This lecture:
– introduce microarray technology
– discuss a few applications
– introduce statistical and computational
techniques for analysing microarray data
Gene expression
• All cells in an organism have the same
genomic DNA.
• Distinct cellular identities are due to
differences in gene expression (= transcription
& translation of gene).
• Whether a gene is transcribed is often
determined by the presence/absence of other
gene products (esp. proteins) …
• … so genes interact in complex networks:
gene A switches on B, which turns off C which
upregulates (increases) A, …
• Hence perturbations to a single gene can lead to
changes in expression of many genes.
Functional genomics
• Next step after sequencing of human genome:
understand connection between DNA sequence
& phenotypic (actual) characteristics of
organism.
• This is complex, because proteins and genes
act in highly connected networks and
signalling pathways in an orchestrated
manner.
• Traditionally molecular biology has worked on
a “one gene one function” basis & experiments
tend to study the effects of a single gene/ few
genes at a time, but...
Microarray chips
• …microarrays can measure many genes at
once.
• Microarray chips are commonly glass slides
with a matrix of spots printed (using eg. dot
matrix technology) on to them.
• A spot contains millions of identical molecules
of DNA or oligonucleotide (the probes),
which will bind a specific DNA sequence, such
as the cDNA of a gene.
• The glass slides can contain 1000s of spots,
each recognising a different sequence, eg. one
spot for every gene in the human genome.
Microarray experiments
• Since almost all mRNA is translated into protein, the total
mRNA of a cell ≈ the genes it expresses.
• Mash up cells and extract mRNA.
• Reverse transcribe RNA → cDNA (can be
heated to make single-stranded).
• Label cDNA from reference cells green (Cy3)
and cDNA from target cells red (Cy5).
• Hybridise (wash on equal amounts of target &
reference sample & allow to bind to probes
which have complementary bases) both
samples, reference and target, to a single
microarray chip.
Results of microarray
experiments
• The spot for gene 1 =
– red if more mRNA 1 in target cells
– green if more mRNA 1 in reference cells
– yellow if same in both
• Actually, images of red & green
fluorescence are taken separately using
laser & scanner & their intensities are
measured using image software.
• Data often expressed as a matrix of
relative expression levels $= \frac{\text{intensity red}}{\text{intensity green}}$,
indexed by genes and target samples.
Microarray data
Red (Cy5) and green (Cy3) images are overlaid; each spot
corresponds to a gene.
Microarray data
• Reason for using relative intensities: process
of printing of spots on to chips does not give a
reliable fixed number of molecules, so the
intensity measurements (which correspond to
the amount of bound sample cDNA) represent
not only the level of expression of the gene,
but also the peculiarities of the chip.
• There are some disadvantages to not having the
absolute gene expression values, eg.
confidence limits on the microarray
measurement depend heavily on the actual
values.
Principal uses of chips
• Genome-scale expression analysis
– Differentiation
– Response to environmental factors
– Disease states
– Effect of drugs
• Detection of sequence variation
Applications of
microarrays - yeast
• The fact that we can only reliably measure
relative gene expression means that
microarrays tend to be used for comparative
experiments:
• Eg. “what changes in gene expression arise
when yeast is in anaerobic v. aerobic
conditions?” - DeRisi et al. (1997) Science, v. 278,
pp680-686
• Spot arrays with complementary DNAs to all
genes from the yeast genome (the probes).
• Approx. 6400 probes
Applications of
microarrays - yeast
• Reverse transcribe mRNA from yeast cells
harvested at various time points as conditions
are varied from anaerobic to aerobic (start
fermentation in sugary solution and allow
yeast to deplete sugar).
• 7 time points (2hr intervals, the first at 9 hrs after
being placed in sugary medium)
• Let sample from first time point be “reference”
(totally anaerobic, lots of sugar).
• Label reference cDNA with green dye (Cy3)
and other sample cDNA (later time points)
with red dye (Cy5).
Applications of
microarrays - yeast
• Hybridize mixture of equal quantities of
reference sample and one of the later-time
samples (also do timepoint 1 against itself as
control test) to a microarray chip- one
experiment/ chip per timepoint.
• Take images of red and green fluorescence,
measure intensities, process (details of this
later in lecture) and create a matrix, M, with
entries
$M_{ij} = \frac{\text{intensity red}}{\text{intensity green}}$
at the spot representing gene i in the chip containing sample j
(jth timepoint).
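A minimal sketch of how the matrix M could be assembled, assuming the red and green intensities are already available as one row per gene and one column per chip. The function name `ratio_matrix` and the intensity values are made up for illustration and are not from the paper.

```python
def ratio_matrix(red, green):
    """Return M with M[i][j] = red[i][j] / green[i][j].

    red and green are lists of rows: one row per gene i,
    one column per chip (sample) j.
    """
    return [
        [r / g for r, g in zip(red_row, green_row)]
        for red_row, green_row in zip(red, green)
    ]


# Hypothetical intensities for 3 genes on 2 chips (illustrative only).
red = [[200.0, 800.0], [150.0, 75.0], [300.0, 300.0]]
green = [[100.0, 100.0], [150.0, 150.0], [300.0, 150.0]]

M = ratio_matrix(red, green)
# M[0][1] is the red/green ratio for gene 0 on chip 1.
```

A ratio above 1 means more signal from the target sample (red), and a ratio below 1 means more signal from the reference (green).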
Applications of
microarrays - yeast
• Look for genes that are differentially
expressed in aerobic and anaerobic conditions.
• Find that when sample at initial timepoint is
compared to itself, 99% correlation between
intensity values.
• Timepoint 1 v. timepoint 2: 95% of genes
have < 1.5-fold difference in expression; correlation of 98% between data at 2
timepoints
• Timepoint 1 v. timepoint 7: c. 1700 genes out
of 6400 had > 2-fold difference in expression; some genes had much higher ratio.
• Authors could infer properties of signalling
pathways involved in the shift in metabolism.
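The two summaries quoted above (number of genes beyond a fold-change threshold, and correlation between two sets of intensities) can be sketched as follows. This assumes that an "n-fold difference" counts changes in either direction (ratio above n or below 1/n). The function names and data are illustrative only.

```python
import math


def count_fold_changed(ratios, fold=2.0):
    """Count genes whose red/green ratio is above `fold` or below 1/`fold`."""
    return sum(1 for r in ratios if r > fold or r < 1.0 / fold)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


# Hypothetical ratios for 5 genes at one timepoint.
ratios = [1.1, 0.4, 2.5, 0.9, 3.0]
n_changed = count_fold_changed(ratios, fold=2.0)  # 0.4, 2.5 and 3.0 qualify

# Hypothetical intensities of the same spots in the two channels.
r = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```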
Applications of
microarrays- cancer
• Take a set of patients with a certain type of
cancer and a set of control patients with no
cancer. Take cells from the tumour (cancer patients)
or from the corresponding region (controls). Extract mRNA,
make cDNA and dye one of the samples from
a control patient green; all other samples red.
• Make/ buy a chip with human genes- as many
as possible/ those thought to be relevant for
cancer.
• Hybridise mixture of reference sample (green)
and one of the other target samples to each
chip.
Applications of
microarrays- cancer
• Process data and statistically analyse to find
genes which have significantly higher/ lower
expression in cancer cells than in normal cells.
These genes are likely to be important in
causing cancer/ effects of cancer.
• Can also cluster data to discover different
subclasses of cancer, eg.
Alizadeh et al. (2000) Nature, v. 403,
pp503-511
• A cancer of the immune cells (lymphoma) is
clinically diverse: 40% of patients respond well
to therapy and have good survival. Authors
used hierarchical clustering (see later) to
discover two new subclasses of the cancer,
classified based on gene expression profiles.
Applications of
microarrays- cancer
• Thinking of the relative gene expression
values (in fact intensities) of the different
samples (patients) as a vector, the authors
were able to cluster the data.
• Microarray profiling of tumours can be used to
classify tumours into subclasses (with eg.
survival implications) of already known tumour
types.
Different kinds of
microarray
• cDNA versus oligonucleotide
• We have discussed so far gene expression
microarrays, but also:
– Sequencing chips: contain as probes, all possible
sequences of a given length k (typ. k=8-10 bases
long). Mark target sample with fluorescent dye and
hybridise. The spots with fluorescence are where
target bound. The corresponding sequence is part of
the target spectrum (=set of k-base sequences in
target). Then use computers to assemble whole
sequence. Target cannot be too long (eg. 150-200
bps if k=8).
– Can be used for looking for gene mutations/
polymorphism.
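The "target spectrum" of a sequencing chip can be sketched as the set of all k-base substrings of the target. The function name `spectrum` and the toy sequence are illustrative. Reassembling the sequence from the spectrum is not shown.

```python
def spectrum(target, k):
    """Return the set of all length-k substrings (k-mers) of `target`."""
    return {target[i:i + k] for i in range(len(target) - k + 1)}


# Toy target: 7 bases, so 5 positions for a 3-mer.
kmers = spectrum("ATGCATG", 3)
# "ATG" occurs twice but appears once in the set, so only 4 spots light up.
```

Because repeated k-mers collapse to a single spot, longer targets lose more positional information, which is one way to see why the target cannot be too long for a given k.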
Analysis of microarray
data
• Data is a matrix, $M_{ij}$, of (absolute or, usually)
relative expression values of gene i in
condition j. Often presented as log2 values,
since this means that downregulation of a gene
(eg. ratio ½) is not squashed into the interval
(0,1), but takes values (eg. -1) in $(-\infty, 0)$.
• Pre-processing: There are several sources
of variation in intensity in microarray
experiments other than differences in gene
expression between samples. These are
thought of as noise and we want to remove
them, by pre-processing. First subtract
background intensity, which is due to binding
to wrong spot, etc. (this is usually done by the
image processing software).
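A minimal sketch of background subtraction followed by the log2 transform for a single spot. Returning `None` when a channel is not above background is an assumption made here for illustration, not something the lecture specifies. All names and numbers are hypothetical.

```python
import math


def log2_ratio(red, green, red_bg=0.0, green_bg=0.0):
    """Background-corrected log2(red/green) for one spot.

    Returns None if either corrected intensity is not positive
    (assumed handling: the spot is treated as unmeasurable).
    """
    r = red - red_bg
    g = green - green_bg
    if r <= 0 or g <= 0:
        return None
    return math.log2(r / g)


down = log2_ratio(250.0, 450.0, red_bg=50.0, green_bg=50.0)  # ratio 1/2
up = log2_ratio(850.0, 150.0, red_bg=50.0, green_bg=50.0)    # ratio 8
missing = log2_ratio(40.0, 450.0, red_bg=50.0, green_bg=50.0)
```

On the log2 scale a halving (-1) and a doubling (+1) are symmetric about 0.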
Analysis of microarray
data
• Normalization: Another source of noise is
due to differences in labelling and detection
efficiencies for the fluorescent labels and in
the amount of RNA between the 2 samples
(red/green). Normalization tries to get rid of
this by dividing all the ratios by an appropriate
constant to make the mean or the median of
the ratios = 1 (mean/median centring
respectively). If the data is in log form, simply
subtract a constant.
• Assumption is that on average across all (or a
chosen subset of) genes the levels of mRNA
produced will be the same in the two samples.
• Alternatively use scatter plot of intensity green
v. red & normalize to make slope=1.
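Median centring on log2 data can be sketched as below, assuming the log ratios for one chip are held in a plain list. The function name `median_centre` and the values are illustrative.

```python
import statistics


def median_centre(log_ratios):
    """Subtract the median so the centred log ratios have median 0.

    Returns the centred values and the constant c that was subtracted.
    In ratio form this is the same as dividing every ratio by 2**c.
    """
    c = statistics.median(log_ratios)
    return [x - c for x in log_ratios], c


# Hypothetical log2 ratios for 5 genes on one chip.
centred, c = median_centre([0.5, 1.5, -0.5, 2.5, 0.5])
```

The median is less sensitive than the mean to a few strongly regulated genes, which is one reason to prefer median centring when a small set of genes changes a lot.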
Analysis of microarray
data
• Normalized data: $\log_2(R/G) - c = \log_2(R/(Gk))$,
where R & G are the red & green intensities minus
the respective backgrounds and $c = \log_2 k$ is the
normalization constant.
• Filtering: This is the process of working out
which genes are differentially expressed
across the different conditions (eg. timepoints
of the yeast experiment or cancer v. non-cancer) and removing from the dataset those
genes which don’t vary. We will discuss this in
detail later.
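As a rough illustration only: the simplest possible filter keeps a gene if its log2 ratio reaches some threshold in at least one condition. This crude fold-change cut-off is an assumption standing in for the statistical methods covered later. The function name, threshold and data are hypothetical.

```python
def filter_genes(log_matrix, threshold=1.0):
    """Keep genes whose |log2 ratio| reaches `threshold` in any condition.

    log_matrix maps gene name -> list of log2 ratios, one per condition.
    A threshold of 1.0 corresponds to a 2-fold change up or down.
    """
    return {
        gene: row
        for gene, row in log_matrix.items()
        if max(abs(x) for x in row) >= threshold
    }


# Hypothetical log2 ratios for 3 genes across 3 conditions.
data = {
    "geneA": [0.1, -0.2, 0.3],   # never reaches 2-fold: removed
    "geneB": [0.0, 1.5, 2.0],    # upregulated: kept
    "geneC": [-1.2, 0.1, 0.0],   # downregulated: kept
}
kept = filter_genes(data, threshold=1.0)
```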
Analysis of microarray
data
• Clustering: If you view the expression values
of a single gene across different samples
(rows of the expression matrix) as a vector
then the genes can be clustered based on the
similarity of the vectors. Likewise, using the
columns of the matrix, the samples can be
clustered. This helps eg. to classify cancers/
find genes which are in same network as each
other or have similar functions.
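A minimal sketch of clustering genes by the similarity of their expression vectors. It uses agglomerative merging with average linkage and a correlation-based distance, and stops at a chosen number of clusters rather than building the full tree. These choices, the function names and the data are assumptions for illustration, not the method of any paper cited here.

```python
import math


def correlation_distance(x, y):
    """1 - Pearson correlation; assumes neither vector is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return 1.0 - cov / (sd_x * sd_y)


def agglomerate(vectors, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    vectors maps name -> expression vector. The distance between two
    clusters is the average pairwise distance (average linkage).
    """
    clusters = [[name] for name in vectors]
    while len(clusters) > n_clusters:
        best = None  # (distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                dists = [
                    correlation_distance(vectors[p], vectors[q])
                    for p in clusters[a]
                    for q in clusters[b]
                ]
                d = sum(dists) / len(dists)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters


# Hypothetical profiles: genes 1-2 rise together, genes 3-4 fall together.
profiles = {
    "gene1": [1.0, 2.0, 3.0, 4.0],
    "gene2": [2.0, 4.1, 6.0, 8.2],
    "gene3": [4.0, 3.0, 2.0, 1.0],
    "gene4": [8.0, 6.1, 4.0, 2.1],
}
clusters = agglomerate(profiles, n_clusters=2)
```

Passing the columns of the matrix (one vector per sample) instead of the rows clusters the samples in the same way.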
Conclusions
• We have described microarray chips for
analysing gene expression.
• We have mentioned three key areas of
analysis
– Normalization
– Filtering
– Clustering
• In the next session we will cover
statistical methods necessary for
filtering microarray data.