Analysis of Gene Expression
Overview
• Genome analysis tells us what genes are present, but before we can determine the organism's phenotype, we need to know how those genes are expressed: under what conditions, in what tissues, how much gene product is made, etc.
– Understanding and curing diseases is also tied to the analysis of which genes are expressed in disease states.
• There is some progress in deciphering gene control signals, but right now it is more profitable to do experiments and use bioinformatic tools to analyze their results. Later, with some luck, we will be able to make better predictions about gene expression from sequence data alone.
• What we are going to cover:
– Microarray data (messenger RNA = "transcriptome")
– SAGE (mRNA)
– Two-dimensional electrophoresis and mass spectrometry (proteins = "proteome")
– Data normalization: picking out signal from noise
– Cluster analysis: how to decide which genes are co-expressed
Microarrays
• DNA microarrays and DNA chips are essentially the same thing: a set of DNA molecules attached to a solid substrate in an array of very small spots.
– Affymetrix is a company that sells microarray chips attached to a silicon substrate.
– Many microarrays are homemade: DNA spotted onto glass microscope slides.
• Microarrays work by hybridization: cDNA made from mRNA is labelled with a fluorescent tag, then hybridized with the array. After washing, only complementary sequences remain bound. A laser scanner excites each spot in turn, and the amount of fluorescence is read. The level of fluorescence is proportional to the amount of mRNA present in the original prep.
• Originally, cDNA from each gene was used to make the array. Later, synthetic oligonucleotides were used, and today, 50-60 nt synthetic oligonucleotides based on the gene sequences seem to be the standard.
• In most cases, RNAs from two different conditions are compared (experimental vs. control). The two cDNAs derived from the RNAs are labelled with Cy3, a green-fluorescing dye, and Cy5, a red-fluorescing dye.
– If the two RNAs are present in equal amounts, you get a yellow spot; otherwise red or green predominates.
SAGE
• Serial Analysis of Gene Expression.
• The basis of this technique is that a gene can be uniquely identified using only a small (10-30 nt) piece from the 3' end (which is not translated).
• These tags are extracted (from cDNA), then concatenated into long molecules that are amplified with PCR (or cloned) and sequenced.
• The number of times each tag appears is proportional to the amount of its mRNA present.
• There is much SAGE data in NCBI.
Digital Differential Display
• DDD is based on data from EST experiments. The NCBI UniGene database collects the ESTs for each gene separately. The proportion of ESTs from a given gene can be compared between experimental treatments.
• This is obviously limited to well-studied species.
RNA-Seq
• This is a new method, published in 2008. It is probably the method of choice today for analyzing RNA content. It is also called whole transcriptome shotgun sequencing.
• Very simple: isolate messenger RNA, break it into 200-300 base fragments, reverse transcribe, then perform large-scale sequencing using 454, Illumina, or another massively parallel sequencing technology.
– RNA sequences are then compared to genomic sequences to find which gene is expressed, and also to find exon boundaries.
– Exon boundaries are a problem with very short reads: you might only have a few bases of overlap with one of the exons.
• As with all RNA methods, which RNAs are present depends on the tissue analyzed and on external conditions like environmental stress or disease state.
• RNA-Seq gives information on copy number over a much wider range than microarrays. It also detects SNPs.
Two-Dimensional Gel Electrophoresis
• 2D gels are a way of separating proteins into individual spots that can be individually analyzed.
– Proteins are first separated by their isoelectric point and then by their molecular weight.
– Individual protein spots can then be identified using mass spectrometry.
• Isoelectric focusing separates proteins by their isoelectric point, the pH at which the net surface charge is zero.
– IEF uses a mixture of ampholytes, chemical compounds that contain both acidic and basic groups.
– When an electric field is applied, the ampholytes move to a position in the gel where their net charge is zero, and in the process they set up a pH gradient.
– Proteins also move down the pH gradient until they reach a pH where they have no net charge, their isoelectric point.
– Isoelectric focusing is thus an equilibrium process: proteins move to a stable position and stay there. (But in practical terms, the gradient breaks down over time.)
SDS-PAGE
• SDS-PAGE is a method for separating proteins according to their molecular weight.
– SDS = sodium dodecyl sulfate (a.k.a. sodium lauryl sulfate), a detergent that unfolds proteins and coats them in charged molecules so that their charge-to-mass ratio is essentially identical.
• "Native" gel electrophoresis uses undenatured proteins, which vary greatly in charge-to-mass ratio.
– SDS denaturation isn't perfect: some proteins behave anomalously.
– PAGE = polyacrylamide gel electrophoresis.
2D gels
• First, isoelectric focusing is performed on a protein sample, running the proteins through a narrow tube or strip of acrylamide.
• Then the IEF gel is placed on top of an SDS gel, allowing the proteins to be separated by their molecular weight at right angles to the isoelectric point separation.
• Then the gel is stained with a general protein stain such as Coomassie Blue or silver stain.
– A Western blot involves transferring the separated proteins onto a membrane, where specific proteins can be identified by antibody binding.
• A couple of issues:
– While a cell might contain up to 100,000 proteins, at best only 3000 spots can be resolved.
– Proteins expressed at a low level (such as regulatory proteins) don't show up well: spot size is proportional to the amount present in the cell.
– Special techniques are needed for membrane proteins, which aren't easily solubilized by the usual techniques.
– Comparing spots between 2D gels requires image analysis software (and well-standardized running conditions).
What SDS used to mean
• Students for a Democratic Society, a left-wing radical group from the 1960s and early 1970s. The Weathermen were an even more radical offshoot.
Mass Spectrometry
• The general principle of mass spectrometry is that if you ionize a group of atoms or molecules, you can separate them on the basis of charge-to-mass ratio by accelerating them in an electric field in a vacuum.
– The original mass spectrometers were used to separate isotopes, based on slightly different masses.
• During the ionization process, proteins tend to break up in characteristic ways, producing "molecular ions" whose molecular weights can be measured very precisely.
• Since you are generally working with an already sequenced genome, you can predict the size of the fragments that will be generated by any gene. Thus you can identify the gene product by matching the actual fragments with the list of predicted fragments.
More Mass Spectrometry
• For most protein work, the proteins are first digested into small fragments (say, 5-10 amino acids), separated by HPLC (high-performance liquid chromatography), and then run individually through the mass spec.
– Protein sequencing and older protein identification methods also start with proteolytic digestion.
– Endopeptidases that digest at known sites are used, such as trypsin (cleaves after Lys or Arg) and chymotrypsin (cleaves after Phe, Trp, or Tyr).
• Ionizing the peptide needs to be done rather gently. One common technique is MALDI (matrix-assisted laser desorption/ionization). The proteins are mixed with the matrix molecules, which efficiently absorb the UV laser energy and encourage ionization of the proteins. When irradiated with the laser, the matrix molecules vaporize along with the protein, but their small size makes them easy to detect and ignore.
• Time-of-flight mass spectrometry is generally used (so the whole thing is MALDI-TOF). The molecular ions are accelerated in an electric field, and the time it takes them to cross a chamber of known length is determined by their mass (actually, their mass-to-charge ratio). This technique works well for the wide range of sizes seen with peptides.
• Sample comparisons can be done by labeling one sample with a heavy, stable isotope such as 13C or 15N. The samples are mixed before 2D electrophoresis and they co-migrate on the gel. However, mass spec can easily resolve them.
Methods for Microarray Analysis
• Microarray data is subject to a lot of potential errors. Dealing with them falls into 3 main categories: replication, background subtraction, and data normalization.
• Replication of each experimental data point is essential. There is a lot of variation between spot intensities in a typical experiment, especially with home-made microarrays.
• The background fluorescence level needs to be subtracted from all data points. Since the background is not necessarily uniform, this can lead to spots with negative intensities (which can be set to zero).
• Data normalization means attempting to bring the variance of the expression level to a constant value. It has been observed that the variance tends to increase with stronger signals. A way to correct for that is to include a multiplicative error term as well as an additive error term in statistical calculations.
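A minimal sketch of the last two steps (assuming the spot and local background intensities are held in NumPy arrays; the function and array names are illustrative, and the log transform is one common way to tame the multiplicative error, not the only one):

```python
import numpy as np

def preprocess(spot_intensity, local_background):
    """Subtract the (non-uniform) local background, clip negative
    intensities to zero, and log-transform to stabilize variance."""
    corrected = spot_intensity - local_background
    corrected = np.clip(corrected, 0, None)   # negative spots -> 0
    return np.log2(corrected + 1)             # +1 avoids log2(0)
```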
Expression Level Ratios
• Most microarray experiments compare 2 conditions, using red and green dyes. Thus each gene sequence gives data that is a ratio of red to green.
– The problem is, when plotted on a regular linear graph, the distance between 1/2 and 1 is much smaller than the distance between 1 and 2, even though they express the same (but inverse) ratios.
– The solution is to take the base 2 logarithm of the red/green ratio. Since log2(x) = -log2(1/x), increases and decreases give similar ranges.
• Similarly, the overall expression level can be expressed as the geometric mean of the red and green signals: the square root of red times green. However, taking the logarithm of this spreads the data out better.
• Other data manipulations can further improve appearances.
$$\text{geometric mean} = \sqrt{R \cdot G}$$
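A minimal sketch of these transforms (NumPy; the intensity values are made-up examples):

```python
import numpy as np

red = np.array([1200.0, 300.0, 800.0])   # Cy5 intensities (illustrative)
green = np.array([600.0, 600.0, 800.0])  # Cy3 intensities (illustrative)

log_ratio = np.log2(red / green)   # symmetric: 2-fold up -> +1, 2-fold down -> -1
level = np.sqrt(red * green)       # geometric mean = overall expression level
log_level = np.log2(level)         # log spreads the data out better
```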
Data Analysis
• We now wish to tackle the problem of making sense of masses of gene expression data. Thousands of genes tested under several conditions, at different time points, all going up and down in expression level: it is just too much to understand at a glance.
• The usual question is whether groups of genes show similar patterns of expression, suggesting a common regulatory pathway.
• The mathematical techniques we will discuss are various forms of cluster analysis, along with some approaches to optimization problems: how can you efficiently find the best solution to a problem when you can't try all possibilities?
• First, we will look at a useful visualization technique, principal components analysis.
• centroid = the center position, or center of mass
Principal Components Analysis
• In typical microarray experiments, you make several measurements on the expression level of each gene. You would like to know whether you can separate out different groups of genes.
– There is also an issue of whether the data points are really independent of each other.
• PCA is a method of standardizing and rotating data so most of the variance can be plotted in 2 or 3 dimensions.
• I want to credit Michael Palmer of the Oklahoma State University Ecology program for a lot of the discussion below: it's a hard topic to describe without resorting to matrix algebra.
• In this example, 3 experimental treatments (x1, x2, and x3) have been examined with 25 or so parameters (genes). You want to know whether there are groups of genes with a common expression pattern among the 3 treatments.
• The first step is to normalize the data, by subtracting the mean and dividing by the standard deviation. This gives every measurement a mean of 0 and a variance of 1. The centroid of the whole data set is now at the origin.
• The data points are still in the same relative positions.
More PCA
• The next step is to generate a new primary axis, which maximizes the variance along its length. This is done by minimizing the sum of the squared distances from each point to this line. The axis goes through the centroid.
• Next, a second axis is drawn at right angles to the first one, also passing through the centroid. It maximizes all the remaining variance along its length.
• This process can be repeated until all variance has been accounted for, if you like.
• However, most PCA is used for data display, so 2 or 3 dimensions are all that is necessary. Usually, the percentage of the total variance that is accounted for by each axis is reported.
• All of this is done with matrix algebra. Specifically, linear combinations of the data are generated using eigenvalues and eigenvectors. But we aren't going to discuss that here.
• Quite often, different groups of genes stand out.
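A minimal sketch of the whole procedure in Python (NumPy only; the genes-by-treatments matrix X is a hypothetical input, and an eigendecomposition of the covariance matrix stands in for the matrix algebra just mentioned):

```python
import numpy as np

def pca(X, n_components=2):
    """X: rows = genes, columns = experimental treatments."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)  # mean 0, variance 1: centroid at origin
    cov = np.cov(Z, rowvar=False)             # covariance between treatments
    eigvals, eigvecs = np.linalg.eigh(cov)    # axes = eigenvectors of the covariance
    order = np.argsort(eigvals)[::-1]         # largest variance first
    scores = Z @ eigvecs[:, order[:n_components]]  # rotate onto the new axes
    explained = eigvals[order] / eigvals.sum()     # fraction of variance per axis
    return scores, explained
```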
Cluster Analysis: Distance Measures
• Before we can determine whether genes fall into co-expressed clusters, we need to define a way of quantitating how similarly they behave under different experimental treatments. That is, we need to define a measure of distance.
• A common method is to use the Euclidean distance. Recalling the Pythagorean theorem, a² + b² = c², the distance between 2 points in a plane is √((x₁-x₂)² + (y₁-y₂)²). This can be generalized to multiple dimensions, with each experimental treatment considered as a separate dimension.
– The Euclidean distance is easy to calculate, and in very wide usage, but it makes no sense: what do the results of different experiments have to do with distance in multidimensional space?
– Also, you get different distances when ratios are used compared to using log ratios.
$$d_{AB} = \sqrt{\sum_{i=1}^{N} \left(X_{i,A} - X_{i,B}\right)^2}$$
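A minimal sketch of this formula (NumPy; each profile is one gene's expression vector across the N treatments):

```python
import numpy as np

def euclidean_distance(profile_a, profile_b):
    """Distance between two genes' expression profiles."""
    return np.sqrt(np.sum((profile_a - profile_b) ** 2))
```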
More Distance Measures
• Measuring the degree to which two genes are correlated, whether or not they are expressed at the same level, can be done with the Pearson correlation coefficient.
– For each experiment, the mean and standard deviation are calculated.
– Then the distances from each data point to the mean are calculated, and divided by the standard deviation.
– Then these numbers are multiplied to get the distance between 2 treatments, and summed over all the genes.
• There are lots more possibilities for measuring distance. It is worth noting that sometimes the distance measure can affect the final results, which should cause a bit of skepticism or anxiety, depending on your relationship with the data.
$$r_{AB} = \frac{1}{N-1} \sum_{i=1}^{N} \left(\frac{X_{i,A} - \bar{X}_A}{s_A}\right) \left(\frac{X_{i,B} - \bar{X}_B}{s_B}\right)$$
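A matching sketch of the Pearson measure; converting it to a distance with 1 - r is a common choice, though the source does not specify one:

```python
import numpy as np

def pearson_distance(profile_a, profile_b):
    """1 - r: identical profiles give 0, perfectly anti-correlated give 2."""
    za = (profile_a - profile_a.mean()) / profile_a.std(ddof=1)
    zb = (profile_b - profile_b.mean()) / profile_b.std(ddof=1)
    r = np.sum(za * zb) / (len(profile_a) - 1)   # the formula above
    return 1 - r
```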
Clustering Techniques
• UPGMA is a commonly used simple technique. It is an example of a hierarchical clustering technique, in which objects (experimental treatments) are sequentially combined into larger groups. (You can also sequentially partition a large group into successively smaller groups.) The result of hierarchical clustering is usually a tree.
– There are a few common variants (all three are sketched in code below). Recall that when two groups were combined, UPGMA took the distance between the groups as the average distance between all members. Thus UPGMA is sometimes called average linkage clustering.
– Single linkage clustering takes the distance between 2 clusters as the minimum distance between any two members.
– Complete linkage clustering uses the maximum distance between any 2 members.
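A minimal sketch using SciPy's hierarchical clustering, which supports all three linkage variants (the random expression matrix is placeholder data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram

expression = np.random.rand(20, 5)  # 20 genes x 5 treatments (placeholder)
tree = linkage(expression, method="average", metric="euclidean")  # UPGMA
# method="single" or method="complete" give the other linkage variants.
dendrogram(tree)  # draws the tree (displaying it requires matplotlib)
```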
K-means Clustering
• K-means clustering is a partitional clustering technique. You specify the number of clusters (k) you want, and the data points are divided up so that each point is assigned to the cluster whose centroid is closest.
– This is done by randomly assigning data points to clusters and calculating their centroids.
– Then, each data point's distance to these centroids is calculated and the data points are re-assigned to the closest centroid.
– This process is repeated until no more changes in cluster assignment occur. (The loop is sketched after this slide.)
– You can try various values of k and test the results statistically, using the sum of squared distances to the centroids as an error function.
• Self-organizing maps (SOM) work very similarly to this, except that the centroids are linked to each other, and when you move one, they all shift a bit. How much they move depends on how close they are and how far into the clustering process you are: as things proceed, there is less linkage between the centroids.
– You can visualize this as starting with a set of centroids arranged on a grid, then moving them around to minimize distances between individual data points and their nearest centroid.
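A minimal sketch of the k-means loop (NumPy; it ignores the edge case where a cluster becomes empty):

```python
import numpy as np

def kmeans(points, k, seed=0):
    """points: (n_genes, n_treatments) array; returns labels and centroids."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(points))  # random initial assignment
    while True:
        centroids = np.array([points[labels == j].mean(axis=0) for j in range(k)])
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)       # re-assign to the closest centroid
        if np.array_equal(new_labels, labels):  # stop when nothing changes
            return labels, centroids
        labels = new_labels
```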
Example of Clustering in a Microarray Experiment
• Each row is a gene, with the degree of redness proportional to the degree of expression.
• Genes have been rearranged to match the clusters.
• Clustering was done for genes and also for treatments.
• The actual expression profile for one cluster is shown.
Example of Various Expression Profiles
• The data here were clustered using self-organizing maps, set to create 9 clusters.
• Note that some profiles are very similar, implying that it would be worthwhile to try fewer clusters.
Optimization Procedures
• We have seen many techniques where it is impractical to try out all possible solutions.
– In computational complexity theory, it is generally practical to solve problems that work in "polynomial time". That is, if there are N basic objects, the time and computer resources needed to solve the problem are proportional to N, N², N³, etc. (log N also fits in here). These problems are said to have O(N²) (etc.) proportionality.
– On the other hand, some problems grow exponentially with the number of objects to study. They have O(2^N) proportionality. These require far more resources to solve, and are generally the source of the "more steps than there are particles in the Universe" problems.
• This is where optimization techniques are used.
• More theory. Problems that can be solved in polynomial time are P problems. Problems that need exponential time are NP problems. (This is a bit of a generalization.)
– There is a class of NP problems that need exponential time to solve, but for which any proposed solution can be checked in polynomial time. These are NP-complete problems. Most of the nasty bioinformatic problems are of this nature.
– There is a major debate in the computer science community about NP-complete problems: can some clever algorithm be found that will convert them to P problems (is P = NP)?
– Theory says that if one NP-complete problem is converted to P, they all can be: one of the current Holy Grail issues in computer science.
– There is another class of problems, called NP-hard, which means they require at least as much time and resources to solve as the worst NP-complete problem.
Simulated Annealing
• Simulated annealing is one of the simplest ways of optimizing a solution to a large problem where an exhaustive search won't work. Imagine the set of all possible solutions as having multiple local maxima. It is easy to get caught in a local maximum and completely miss better solutions.
• It is based on an analogy with slowly cooling a molten material to produce the best possible crystals:
– If you reduce the temperature quickly, many random defects appear as the individual molecules freeze into sub-optimal positions.
– However, if you reduce the temperature slowly, you give the molecules a chance to try many possibilities before settling into the best position. Even if a molecule has settled into place, the overall temperature is high enough that it has a definite chance of leaving that position and trying another.
• In simulated annealing, you start with a possible solution and generate a similar, but randomly chosen, solution. You replace the old solution with the new solution if it is better than the old one or (and this is very important) if it is only somewhat worse than the old one.
– A worse solution is assigned a probability of replacing the old solution based on how much worse it is and on the current "temperature".
– The process goes through many iterations, with the temperature slowly being lowered.
– Note that we are trying to find the minimum energy state. For clustering, this means the cluster membership that minimizes the sum of squared distances to the cluster centroids.
More Simulated Annealing
• What happens is that in the beginning the temperature is high, and the clusters change almost randomly. As time goes on and the temperature drops, the clusters change less and less.
• The "acceptance function" determines whether a new solution replaces the old one. It is usually based on the Boltzmann distribution, which describes the probability that a gas molecule has a specified energy at a given temperature.
– In practical terms, you first score the old solution and the new solution (i.e., the sum of squared distances to the centroid of each cluster), and subtract them, giving the difference Δ.
– If the new solution is better than the old one, replace the old with the new.
– Otherwise, if the new solution is worse than the old one, you calculate the "energy probability" $e^{-\Delta/T}$.
– Choose a random number between 0 and 1. If the energy probability is greater than this random number, replace the old solution with the new one. If it's less, keep the old solution.
$$e^{-\Delta/T} > R(0,1)$$
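A minimal sketch of this acceptance rule inside a cooling loop (Python; the score and neighbor functions are placeholders you would supply, e.g. the sum of squared distances and a random cluster re-assignment):

```python
import math, random

def anneal(solution, score, neighbor, t_start=10.0, t_min=0.01, cooling=0.995):
    """score(s): lower is better; neighbor(s): a randomly perturbed copy of s."""
    t = t_start
    current, current_score = solution, score(solution)
    while t > t_min:
        candidate = neighbor(current)
        delta = score(candidate) - current_score      # positive = worse
        if delta < 0 or math.exp(-delta / t) > random.random():
            current, current_score = candidate, current_score + delta
        t *= cooling                                  # slowly lower the temperature
    return current
```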
Genetic Algorithm
• The genetic algorithm is another technique for finding an optimum solution in situations where it is impossible to check all possibilities.
• It is a subclass of evolutionary algorithms, based on ideas taken from biology. The idea is that evolution has found solutions to many of these problems by random changes coupled with natural selection.
– A few other subtypes, which we won't otherwise discuss:
• Particle swarm optimization. Based on animal flocking behavior.
• Ant colony optimization. Based on ants foraging, communicating by pheromones, and forming a path.
• Invasive weed optimization. Based on weeds finding suitable environments to grow and reproduce.
• Harmony search. Based on how musicians search for better harmonies.
Data Coding
• Genetic algorithms start with a population of random solutions to a problem, arranged in a linear array (i.e., like a chromosome).
– Each chromosome is scored according to a fitness function.
– These solutions undergo mutations (small changes), selection of individuals to be parents of the next generation, and recombination between pairs of individuals.
– Multiple iterations of this process are performed.
– The process is terminated when there is convergence on a common solution or after a fixed maximum number of generations.
• Encoding example. We have 10 objects that we wish to put into 3 groups. The 10 objects correspond to positions 1-10 in an array, and which group each object is in is designated by the number at that array position. Four such chromosomes (A-D) are shown below.
Position:  1  2  3  4  5  6  7  8  9  10
A:         1  2  2  1  3  1  3  2  2  1
B:         2  3  2  3  1  1  2  3  1  2
C:         3  1  1  3  1  2  1  2  1  3
D:         1  1  3  1  2  3  1  2  1  2
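In code, each chromosome is just an array of group labels. A minimal sketch of building a random starting population (Python; the constants match the example above):

```python
import random

N_OBJECTS, N_GROUPS, POP_SIZE = 10, 3, 4

# One chromosome = one candidate clustering: position i holds object i's group.
population = [[random.randint(1, N_GROUPS) for _ in range(N_OBJECTS)]
              for _ in range(POP_SIZE)]
# Chromosome A above corresponds to [1, 2, 2, 1, 3, 1, 3, 2, 2, 1].
```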
Parent Selection
• Fitness. For clustering, the fitness function is usually the sum of the squares of the distances from each member to the centroid (the center position, or center of mass) of its cluster.
– Fitnesses can be converted to relative fitnesses by ranking them, with the chromosome (set of clusters) with the smallest sum of squared distances given a fitness of 1.0 and all others assigned proportionally lower relative fitness values.
• Selection of parents is generally done by favoring the best individuals of the current generation.
– All chromosomes should have a chance of being included in the next generation. A simple way to do this is the "Roulette Wheel" method (sketched after this list):
• Assign each chromosome a segment of the wheel proportional to its relative fitness,
• then choose a random number on the wheel and take the chromosome corresponding to it.
– In order not to lose the best of the current generation, you can also directly copy (i.e., clone) the best member(s) of the current generation into the next generation. This is called "elitism".
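A minimal sketch of roulette-wheel selection (Python; assumes the fitnesses are already relative, so larger = better):

```python
import random

def roulette_select(population, fitnesses):
    """Pick one parent with probability proportional to relative fitness."""
    spin = random.uniform(0, sum(fitnesses))   # random point on the wheel
    running = 0.0
    for chromosome, fitness in zip(population, fitnesses):
        running += fitness                     # segment width = relative fitness
        if running >= spin:
            return chromosome
    return population[-1]                      # guard against float rounding
```

In practice, Python's random.choices(population, weights=fitnesses) does the same spin in a single call.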
Recombination and Mutation
• Once 2 parents are chosen, you need to produce offspring from them.
– You can choose to have 1, 2, or more offspring from each pair of parents, but you want to keep the number of chromosomes constant.
• Recombination (crossing over). For the coding method we are using, it is best to think of these chromosomes as circular, because the data points are randomly assigned to groups.
– So, we use 2 crossover points, chosen randomly.
– Copy the data from parent 1 up to the first crossover point, then the data from parent 2 up to the second crossover point, then finish with data from parent 1. (A sketch follows below.)
• Mutation. After crossing over has generated new offspring, randomly change a small number of data points from one cluster to another.
• Do this process of selecting parents, generating offspring, and mutating them until enough new-generation members have been generated.
• Then, score them all for fitness and repeat the process.
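A minimal sketch of these two operators (Python lists; group labels 1-3 as in the encoding example):

```python
import random

def two_point_crossover(parent1, parent2):
    """Two random cut points; the middle segment comes from parent 2."""
    a, b = sorted(random.sample(range(len(parent1)), 2))
    return parent1[:a] + parent2[a:b] + parent1[b:]

def mutate(chromosome, n_groups=3, rate=0.05):
    """Randomly re-assign a few positions to a (possibly different) group."""
    return [random.randint(1, n_groups) if random.random() < rate else g
            for g in chromosome]
```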