
Generation of patterns from gene
expression by assigning confidence to
differentially expressed genes
Elisabetta Manduchi, Gregory R. Grant,
Steven E. McKenzie, G. Christian Overton,
Saul Surrey, Christian J. Stoeckert
Presented by Keith Betts
Goal:
• Provide tools to aid in the analysis of data
collected from highly parallel gene
expression experiments.
• Generate descriptive and dependable
expression patterns representing the
differential expression of genes across cell
types.
• Identify those genes that are ‘most likely’ to
be differentially expressed.
• Transform typical ‘raw’ input into an easily
interpretable list of patterns.
Input: normalized intensities (rows: gene tags; columns: replicates of groups G0 to G3)

gene tag   G0 I     G0 II    G0 III   G1 I     G1 II    G2 I     G2 II    G2 III   G3 I     G3 II
1          0.0114   0.0328   0.0151   0.006    0.0236   0.0436   0.564    0.892    0.0639   0.249
2          0.005    0.0131   0.0061   0.0041   0.0364   0.0296   0.883    0.7      0.0199   0.105
3          0.0629   0.234    0.0431   0.227    0.212    0.0105   0.14     0.0243   0.0117   0.0907
4          0.025    0.06     0.0264   0.15     0.266    0.0134   0.186    0.0851   0.0172   0.0112

Output: expression patterns (one level per group G1 to G3)

gene tag   G1   G2   G3
1          0    7    2
2          0    8    1
3          2    -1   -1
4          3    1    0
What is PaGE (Patterns from Gene Expression)?
• PaGE is free, downloadable Perl software (tested
mainly on Unix systems) that can be used as a
statistical test for genes differentially expressed
between two experimental conditions, given replicated
experiments.
• Available at:
http://www.cbil.upenn.edu/PaGE/
Methods and Algorithm
• Input consists of normalized data (the
normalization procedure depends on the
kind of experiments conducted)
• The input normalized intensities are
subjected to preprocessing steps.
Methods and Algorithm Cont.
• In each gene tag’s expression pattern there
will be one symbol for each homotypic group
(set of samples of the same type).
• For each homotypic group and for each gene
tag, compute the average intensity of that tag
over the samples in the group that have values
for that tag.
• This average will represent the intensity of
that tag at that group.
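The per-group averaging can be sketched in a few lines of Python. This is only an illustration, not PaGE's own Perl code: the function name `group_averages`, the dictionary layout, and the use of `None` for a missing value are all assumptions made for the example.

```python
from statistics import mean

def group_averages(intensities, groups):
    """Average one tag's intensities within each homotypic group.

    intensities: dict mapping sample name -> intensity, or None if missing
    groups: dict mapping group name -> list of sample names in that group
    Samples without a value for the tag are skipped (assumed convention).
    """
    averages = {}
    for group, samples in groups.items():
        values = [intensities[s] for s in samples
                  if intensities.get(s) is not None]
        # A group with no usable values gets no average at all.
        averages[group] = mean(values) if values else None
    return averages

# Hypothetical tag with one missing replicate in group G0.
groups = {"G0": ["G0 I", "G0 II", "G0 III"], "G1": ["G1 I", "G1 II"]}
tag = {"G0 I": 0.025, "G0 II": 0.06, "G0 III": None,
       "G1 I": 0.15, "G1 II": 0.266}
averages = group_averages(tag, groups)
```

Here the G0 average is taken over two samples only, because the third has no value for the tag.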
Two Stage Approach
First:
Attach an ordered list of real numbers to each
tag.
Second:
Bin the numbers in this list, resulting in a
pattern of integers.
First Stage
• Fix an ordering of the groups in the
collection.
• Attach to each tag the ordered list of real
numbers obtained by dividing each of its
non-reference group intensities by its
reference group intensity (alternatively, the
median of its group intensities can serve as
the denominator).
• The result is a list of ratios attached to the tag.
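A minimal sketch of the first stage follows. The slides mention both a reference group and the median of the group intensities, so the hypothetical function `ratio_list` accepts either; its name, arguments, and the example values are assumptions for illustration.

```python
from statistics import median

def ratio_list(averages, order, reference=None):
    """Return the ordered list of ratios for one tag.

    averages: dict mapping group name -> average intensity of the tag
    order: the fixed ordering of the groups
    reference: name of the reference group; if None, every group
               intensity is divided by the median of the group intensities
    """
    if reference is not None:
        denom = averages[reference]
        # Only non-reference groups contribute a ratio.
        return [averages[g] / denom for g in order if g != reference]
    denom = median(averages[g] for g in order)
    return [averages[g] / denom for g in order]

# Hypothetical group averages for one tag.
averages = {"G0": 0.5, "G1": 2.0, "G2": 1.0}
order = ["G0", "G1", "G2"]
to_reference = ratio_list(averages, order, reference="G0")
to_median = ratio_list(averages, order)
```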
Second Stage
• For each non-reference group, partition the
range of ratios into disjoint subintervals (bins).
• Number the bins using consecutive integers
-m, …, 0, …, m (where 0 corresponds to ratio 1).
• Attach the ordered list of integers to each
gene tag.
Example
• For group $i$: divide the range into
$m_i + n_i + 1$ bins $B_{i,j}$.
• The list of ratios from the first stage for a
certain gene tag is $(r_1, r_2, \ldots, r_l)$.
• Each $r_i$ belongs to exactly one of the bins
$B_{i,j}$.
• The expression pattern associated with this
tag is then $(j_1, j_2, \ldots, j_l)$.
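The second stage amounts to a bin lookup per group. The sketch below assumes each group's bins are described by a sorted list of interior boundaries and that $m$ bins lie below the bin labelled 0; the boundary values and the function names `level` and `pattern` are hypothetical.

```python
from bisect import bisect_right

def level(ratio, edges, m):
    """Map a ratio to its integer bin label.

    edges: increasing interior bin boundaries for one group
           (m + n boundaries give m + n + 1 bins)
    m: number of bins below the bin labelled 0
    """
    return bisect_right(edges, ratio) - m

def pattern(ratios, edges_per_group, m_per_group):
    """Bin every ratio of a tag with its own group's boundaries."""
    return tuple(level(r, e, m)
                 for r, e, m in zip(ratios, edges_per_group, m_per_group))

# Hypothetical boundaries: bins labelled -2, -1, 0, 1, 2; ratio 1 falls in bin 0.
edges = [0.25, 0.5, 2.0, 4.0]
example = pattern((3.0, 0.3, 1.0), [edges, edges, edges], [2, 2, 2])
```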
Choose level cutoffs
• Suppose we are taking ratios to a reference
homotypic group (group 0) and are focusing
on a fixed group (group i).
• Suppose also that we have replicate
experiments for each of the two groups.
• Concentrate on up-regulation
Goal
• Goal is to achieve a certain degree of
confidence in the assertion:
‘this gene is up-regulated at group i as
compared to the reference group’
• Each gene will have a distribution of
intensities in a group, whose mean will be
called ‘the true mean intensity of the gene at
that group’
• Denote the random variable giving the
intensity of gene $g$ at group $j$ by $X_{g,j}$, and
denote its mean and standard deviation by
$\mu_{g,j}$ and $\sigma_{g,j}$.
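False Positive Rate
$$\mathrm{Prob}\left( \bar{X}_{g,i} / \bar{X}_{g,0} > C_i \;\middle|\; \mu_{g,i} / \mu_{g,0} < 1 \right)$$
Here $\bar{X}_{g,j}$ is the average observed intensity of gene $g$ at group $j$
and $\mu_{g,j}$ is its true mean intensity.
• Claim that the events
$$\frac{\bar{X}_{g,i} / \mu_{g,i}}{\bar{X}_{g,0} / \mu_{g,0}} > C_i \quad \text{and} \quad \mu_{g,i} / \mu_{g,0} < 1$$
are independent.
• Seek $C_i$ as small as possible such
that:
$$\mathrm{Prob}\left( \frac{\bar{X}_{g,i} / \mu_{g,i}}{\bar{X}_{g,0} / \mu_{g,0}} > C_i \right) < s\%$$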
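Approximate the distribution of $\bar{X}_{g,j} / \mu_{g,j}$ for $j = 0, i$ by the values
$$\frac{X_{g,j,k} / \bar{X}_{g,j} - 1}{\sqrt{t_j - 1}} + 1$$
where $X_{g,j,k}$ is the intensity of gene $g$ in replicate $k$ of group $j$
and $t_j$ is the number of replicates in group $j$.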
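The approximation above can be computed directly from the replicate intensities of one gene in one group. In this sketch the function name `scaled_deviations` is hypothetical, and $t_j$ is read as the number of replicates in the group, which is an assumption. How the resulting values are pooled into a density is not stated on the slides and is left out.

```python
from math import sqrt
from statistics import mean

def scaled_deviations(replicates):
    """Return ((X_k / mean) - 1) / sqrt(t - 1) + 1 for each replicate X_k.

    replicates: intensities of one gene in one group
    t is taken to be the number of replicates (assumed reading of t_j).
    """
    t = len(replicates)
    if t < 2:
        raise ValueError("need at least two replicates")
    avg = mean(replicates)
    return [((x / avg) - 1.0) / sqrt(t - 1) + 1.0 for x in replicates]

# Hypothetical gene with two replicates.
values = scaled_deviations([1.0, 3.0])
```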
• Compute the desired $C_i$ through integration.
• If $f_j$ ($j = 0, i$) is the density function for
$\bar{X}_{g,j} / \mu_{g,j}$, and $C$ is fixed, then evaluate
the false positive rate using
$$\iint_{s/t > C} f_0(t)\, f_i(s)\, ds\, dt$$
• If this is above the desired false positive
rate, then C is raised and the integral is
recalculated.
• Repeat process until the desired false
positive rate is attained.
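The raise-and-recompute loop can be sketched as follows. Instead of integrating density functions, this example counts pairs drawn from two empirical samples, which is a discrete stand-in chosen for brevity. The step size, the starting value, and the names `false_positive_rate` and `find_cutoff` are assumptions, not PaGE's actual procedure.

```python
def false_positive_rate(samples_0, samples_i, C):
    """Estimate Prob(s / t > C) with s from samples_i and t from samples_0.

    All values are assumed positive, so s / t > C is tested as s > C * t.
    """
    hits = sum(1 for t in samples_0 for s in samples_i if s > C * t)
    return hits / (len(samples_0) * len(samples_i))

def find_cutoff(samples_0, samples_i, rate, start=1.0, step=0.01, max_C=1000.0):
    """Raise C from `start` until the estimated rate drops below `rate`."""
    C = start
    while false_positive_rate(samples_0, samples_i, C) >= rate:
        C += step
        if C > max_C:
            raise RuntimeError("no cutoff found below max_C")
    return C

# Hypothetical empirical samples for the reference group and group i.
samples_0 = [0.5, 1.0, 1.5]
samples_i = [0.5, 1.0, 1.5]
cutoff = find_cutoff(samples_0, samples_i, rate=0.2)
```

A smaller `step` gives a tighter cutoff at the cost of more evaluations; a bisection search would do the same job faster, since the estimated rate never increases as C grows.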
Down-regulation
• Proceed in similar manner
• Seek $c_i$ as large as possible such that:
$$\mathrm{Prob}\left( \frac{\bar{X}_{g,i} / \mu_{g,i}}{\bar{X}_{g,0} / \mu_{g,0}} < c_i \right) < s\%$$
• Once the $C_i$'s and the $c_i$'s are determined for
each non-reference group $i$, if the ratio of the
average intensity of a gene tag at group $i$
to the average intensity of the same gene
tag at the reference group is between $C_i$ and
$C_i^2$, we say that the gene tag is up-regulated
one level at this group as compared to the
reference group.
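• One can now estimate the probability
$$\mathrm{Prob}(\text{not up} \mid \text{predicted up}) = \frac{\mathrm{Prob}(\text{not up}) \cdot \mathrm{Prob}(\text{predicted up} \mid \text{not up})}{\mathrm{Prob}(\text{predicted up})} \le \frac{\mathrm{Prob}(\text{predicted up} \mid \text{not up})}{\mathrm{Prob}(\text{predicted up})}$$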
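The bound above turns into a confidence figure once Prob(predicted up) is estimated. The sketch below assumes it can be estimated as the fraction of tags predicted up, and takes Prob(predicted up | not up) to be the chosen false positive rate. The function name and the counts in the example are illustrative only.

```python
def confidence_lower_bound(false_positive_rate, n_predicted_up, n_tags):
    """Lower bound on Prob(up | predicted up).

    Uses Prob(not up | predicted up)
         <= Prob(predicted up | not up) / Prob(predicted up),
    with Prob(predicted up) estimated as n_predicted_up / n_tags
    (an assumption made for this sketch).
    """
    if n_predicted_up == 0:
        raise ValueError("no tags predicted up")
    prob_predicted_up = n_predicted_up / n_tags
    bound = false_positive_rate / prob_predicted_up
    # The bound can exceed 1 when few tags are predicted; clamp at 0.
    return max(0.0, 1.0 - bound)

# Illustrative values: 1% false positive rate, 37 of 540 tags predicted up.
conf = confidence_lower_bound(0.01, 37, 540)
```

With these illustrative numbers the bound comes out at roughly 0.85. Lowering the false positive rate raises the confidence only if the number of predicted tags does not shrink proportionally.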
• As a consequence of this approach, when we see a
level different from 0, we have a certain
confidence in the gene tag being up-regulated or
down-regulated as compared to the reference
group.
• However, when we see a 0 there is no confidence
implied.
• We can only take 0 to mean that we do not have
enough evidence to support a change in level.
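The level assignment can be sketched as below. The slides only state that a ratio between $C_i$ and $C_i^2$ counts as one level up. Extending this to level $k$ for ratios between $C_i^k$ and $C_i^{k+1}$, mirroring it for down-regulation with powers of $c_i$, and the treatment of boundary values are all assumptions of this example.

```python
def regulation_level(ratio, C_up, c_down):
    """Assign an integer level to a ratio of group averages.

    C_up > 1 is the up-regulation cutoff, 0 < c_down < 1 the
    down-regulation cutoff. Level 0 means neither cutoff was reached,
    i.e. not enough evidence for a change.
    """
    if not (C_up > 1.0 and 0.0 < c_down < 1.0 and ratio > 0.0):
        raise ValueError("expected C_up > 1, 0 < c_down < 1, ratio > 0")
    k = 0
    # Count how many powers of C_up the ratio reaches.
    while ratio >= C_up ** (k + 1):
        k += 1
    if k:
        return k
    # Otherwise count how many powers of c_down the ratio falls below.
    while ratio <= c_down ** (k + 1):
        k += 1
    return -k

# Hypothetical cutoffs for one group.
levels = [regulation_level(r, 2.0, 0.5) for r in (5.0, 3.0, 1.0, 0.3, 0.2)]
```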
Results
Application to an erythroid development
nylon filter dataset
Background
• Erythroid development dataset contains 5
homotypic groups representing an
erythroleukemic cell line and normal cells
under different conditions
• There are replicate data for each of the
groups.
Background Continued
• The groups are:
1. CD34 positive cells
2. Human adult erythroblasts
3. Cord erythroblasts
4. HEL cells
5. HEL cells treated with hemin
Application
• Available replicates
Two CD34
Three adult erythroblasts
Two cord blood erythroblasts
Three HEL
Two HEL + hemin
• The value of d is set at 15
• Only the moderate to highly abundant
mRNA classes are likely to have given
hybridization signals above background on
the filter array.
• Set the HEL group as reference
Two approaches
• PaGE was run once merging the adult and
the cord erythroblasts into one group with
five replicates.
• PaGE was run a second time keeping the
adult and cord erythroblasts in separate
groups.
Performance
• Running time was always under 90 seconds
on an UltraSPARC IIi CPU at
300 MHz with 128 MB RAM.
Adult and Cord Merged Results
• Total of 18,123 clones
• 540 were above the minimum useful value
in every group
• 5,063 were above the minimum useful value
in at least one group.
Merged Results Cont.
• For s% = 1% (false positive rate):
• 5 levels for CD34 (0 to 4)
• 10 levels for erythroblasts (-1 to 8)
• 6 levels for HEL + hemin (-1 to 4)
Findings
• Clones representing the same gene were
usually found to have identical or very
similar patterns.
• Clones representing genes whose
expression is known in these cells presented
patterns compatible with what was
expected.
s%      CD34 Conf.   CD34 #   Ery. Conf.   Ery. #   HEL+h Conf.   HEL+h #
1.00%   85%          37       81%          28       89%           47
0.10%   97%          17       92%          7        94%           9
0.01%   98%          3        99%          5        99%           3
New Application
• Which genes are differentially expressed
between normal and leukemic cells?
• Which genes are induced by hemin to
adopt a normal expression pattern?
Findings
• Having more genes available to start with
led to more genes identified as differentially
expressed but at lower confidence.
• At similar confidence levels, starting with
more genes did not necessarily lead to more
genes identified as differentially expressed
between normal and HEL cells.