Transcript quals

Multiple Species Gene Finding
Sourav Chatterji
[email protected]

Predicting Replication Origins in Yeast.


Comparative GeneFinding using SLAM.




Breier AM, Chatterji S, Cozzarelli NR. Genome
Biol. 2004;5(4):R22.
Paired Splice Site Detection in SLAM.
Zhao X, Huang H, Speed TP. Proceedings of
RECOMB 2004; 68-75.
Rat Genome Sequencing Consortium. Nature.
2004 Apr 1;428(6982):493-521.
Multiple Species GeneFinding.


Chatterji S, Pachter L. Proceedings of RECOMB
2004; 187-193.
Evidence Based GeneFinding - Work in Progress.
State of the Genomes

5 Roundworm Genomes



11 Fruitfly Genomes




C.elegans and C.briggsae completed.
3 other worm genomes in progress.
D. Melanogaster - completed
Seuqencing of 7 genomes in progress
3 more genomes in pipleline
18 Mammalian Genomes




Human, Mouse, and Rat Genomes Published
Sequencing of 6 genomes in progress.
9 other genomes in pipeline
4 primate genomes : Orangutan, Macaque,
Chimpanzee and Human.
Outline

GeneFinding by Gibbs Sampling





Ab-Initio GeneFinding in Vertebrates
Gibbs Sampling in HMMs
Gene finding by Gibbs sampling
Results
Evidence Based Multiple Species GeneFinding




Evidence based GeneFinding
ExonAligner : An Exon Alignment Program
Initial Results
Proposals for Future Work
Gene Finding in Vertebrates

Single organism gene finding:
GENSCAN/GENIE/SNAP……



Based on generalized HMMs
Viterbi Sequence (Gene Annotation).
High Sensitivity/Low Specificity
Gene Finding in Vertebrates

Single organism gene finding:
GENSCAN/GENIE/SNAP……





Based on generalized HMMs
Viterbi Sequence (Gene Annotation).
High Sensitivity/Low Specificity
Conserved regions among related
species more likely to be functional
than divergent regions.
IDEA: Comparative-based gene
finding
Comparative (Pairwise) Gene
Finding
 ROSETTA : Global alignment followed by gene finding
[Batzoglou and Pachter et al., 1999]
 SLAM : Simultaneous Global alignment and gene finding
 TWINSCAN : Blast alignment followed by gene finding
 AGENDA : Global alignment followed by gene finding
[Alexandersson et al. 2001]
[Korf et al. 2001]
[Rinner and Morgenstern, 2002]
 DOUBLESCAN : Simultaneous alignment and gene finding [Meyer and Durbin, 2002]
 SGP2 : Blast alignment followed by gene finding
[Parra et al. 2003]
The Good News

Gene structure




Number of exons conserved (86% human/mouse)
Exons have similar lengths (91% identical, remainder almost
all differ by a multiple of 3)
Intron lengths are divergent (~1% identical length)
Sequence similarity


Exons highly conserved (both amino acids & DNA)
Intron sequences dissimilar
Waterston et al., 2003
The Bad News


Difficult to generalize many pairwise methods to multiple
sequence methods
Alignment



Long Conserved Non Coding Sequences


Exons may be misaligned (much shorter than introns)
Multiple sequence alignment is much harder than pairwise
sequence alignment
Confuse methods that rely on conservation in a naive way
Missing Sequence
Multiple Species Comparative Gene Finding
(with Alignment)
 McAuliffe et al. (2004), Siepel et al. (2004)
Multiple Species Comparative Gene Finding
(with Alignment)
 McAuliffe et al. (2004), Siepel et al. (2004)
Multiple Species Comparative Gene Finding
(without Alignment)
Gibbs Sampling for Biological
Sequence Analysis

Introduced by Lawrence et al. 1993


Extensions




Motif Detection
Multiple Motifs in a Sequence
Multiple Types of Motifs
Phylogenetic Relationships between Sequences
Applications


Alignment
Linkage Analysis
Gibbs Sampling



Aim : To sample from the joint distribution
p(x1,x2,…,xn) when it is easy to sample from
the conditional distributions
p(xi | x1,…xi-1,xi+1,…,xn)
but not from the joint distribution.
Method: Iteratively sample xit from the
conditional distribution
p(xi | x1t,…xi-1t,xi+1t-1 ,…,xnt-1)
Theorem : For discrete distributions, the
distribution of (x1t,x2t …,xnt) converges to
p(x1,x2,…,xn)
Connection to HMMs
Z1
qt
qs
Z2
qs
qs
qt
qt
Y1
Zm
Ym
Y2
 qt = output probabilities
 qs = transition probabilities
 Difficult to sample from P(q,Z | Y)
 Easy to sample q from P(q | Z,Y)
 Easy to sample Z from P(Z | q,Y)
The Motif Finding Problem
?
?
?
?
PSSM
P1 P2 P3 P4 P5
A ? ? ? ? ?
 Fixed width unknown motif.
 1 motif per sequence, unknown location.
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
The Motif Finding HMM
?
?
q
?
?
PSSM
BG
Z
P1
Z1
Z2
Y1
Y2
 q : PSSM parameters
 Y : Observed sequences
 Z : Alignment
P5
…
BG
P1 P2 P3 P4 P5
Zm-1
Zm
Ym-1
Ym
Y
A ? ? ? ? ?
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
Gibbs Sampling for Motif Detection
?
?
q
?
?
PSSM
BG
Z
P1
Z1
Z2
Y1
Y2
P5
…
BG
P1 P2 P3 P4 P5
Zm-1
Zm
Ym-1
Ym
Y
 Sample q from P(q | Z,Y) [sample PSSM from alignment]
 Sample Z from P(Z | q,Y) [find positions from PSSM]
 Samples from P(Z,q|Y)
A ? ? ? ? ?
C ? ? ? ? ?
G ? ? ? ? ?
T ? ? ? ? ?
Gibbs Sampling for HMMs


N sequences independently generated
by an HMM.
Three types of random variables



q : Parameters
Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Gibbs Sampling for HMMs


N sequences independently generated
by an HMM.
Three types of random variables



q : Parameters Zi1,Zi2,…,Zim
Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Yi1,Yi2,…,Yim
Gibbs Sampling for HMMs


N sequences independently generated
by an HMM.
Three types of random variables





q : Parameters
Z = Z1,Z2,…,Zi …,ZN : hidden variables
Y = Y1,Y2,…,Yi …,YN : observed variables
Aim: To Sample from the distribution
P(Z,q|Y)
Iterations of a Gibbs Sampler


Sample Zi from p(Zi| Y,q) ,
Sample q from p(q | Y,Z)
E100
E00
E01
E02
Intron0
EI0
EI1
E200
E10
Ek00
E11
E12
E20
Intron1
EI2
Single
Exon
IG
E21
E22
Intron2
ET0
ET1
ET2
Gibbs Sampling for Gene Finding
Gibbs Sampling for Gene Finding
Initial Predictions
Gibbs Sampling for Gene Finding
Sample Z1 from P(Z1 | Z[-1] , Y)
Gibbs Sampling for Gene Finding
Sample Z2 from P(Z2 | Z[-2] , Y)
Learning the Number of
Exon Classes
Learning the Number of
Exon Classes
Find Significant Hits Among Peptides
Learning the Number of
Exon Classes
Each Connected Component forms a Class of Genes
Testing



1.6 Mb Data from the NISC Comparative
Sequencing Project
Divided into large genomic regions (100-200
kB) some of which contained multiple genes
Selection Criteria

4 mammals roughly equidistant from each other


Human, Mouse/Rat, Dog/Cat, Pig/Cow.
Available RefSeq annotations with no obvious
alternative splicing
Results
Nucl. Sn Nucl. Sp Exon Sn Exon Sp
Gibbs
0.897
0.886
0.714
0.628
Genscan
0.911
0.548
0.777
0.518
Twinscan 0.692
0.856
0.440
0.513
SLAM
0.881
0.632
0.527
0.791
Robustness Results
Genscan(before)
Nucl.
Sn
0.939
0.885
0.911
Genscan(after)
0.866 0.652 0.748 0.594
Gibbs(before)
Gibbs(after)
Nucl.
Sp
0.950
0.910
0.680
Exon
Sn
0.763
0.740
0.771
Exon
Sp
0.735
0.703
0.612
Twinscan(before) 0.694
SLAM(before)
0.895 0.465 0.604
0.665 0.853 0.465 0.598
0.927 0.911 0.718 0.566
SLAM(after)
0.438 0.936 0.250 0.646
Twinscan(after)
Conclusions

Efficient


Running time O(kNL)
Memory requirements O(L)




Converges rapidly.
No Alignment Required !!
Symmetric Prediction for All Species


k=#iterations,N=#sequences, L=max. length
Application : rapid comparative based annotation
of newly sequenced genomes
Robust


Rearrangements
Draft Quality Sequence
Outline

GeneFinding by Gibbs Sampling





Ab-Initio GeneFinding in Vertebrates
Overview of Gibbs Sampling
Gene finding by Gibbs sampling
Results
Evidence Based Multiple Species GeneFinding




Evidence based GeneFinding
ExonAligner : An Exon Alignment Program
Initial Results
Proposals for Future Work
Evidence Based GeneFinding
Evidence
Annotation

Procrustes : cDNA Evidence, DP based Spliced Alignment
[Gelfland et al.
1996]


Genewise : Protein evidence, combines genefinding HMM with protein
profile HMM, part of the ENSEMBL pipeline. [Birney at al. 1996]
Projector : Evidence from orthologous genes in related species, uses
pair HMM based model. [Meyer and Durbin 2004]
Evidence Based GeneFinding

Large scale sequencing efforts.

8 Drosophila genomes very soon


9 mammalian genomes by early 2005



Use well annotated genomes as evidence.
Draft Quality Genomes


Human genome well annotated.
Aim : Rapid annotation of newly sequenced
genomes.


D. Melanogaster well annotated.
Robustness for Sequencing Errors
Using sequences from multiple species will result in
more accurate annotations.
Will also give us high quality multiple alignments.

Data to study the evolution of genomes.
Evidence Based
Multiple Species GeneFinding


Basic Idea : Use annotations from a reference
genome (e.g. D.melanogaster or H. sapiens) as
evidence to annotate newly sequenced genomes.
Use Whole Genome Homology Maps (courtesy Colin
Dewey)



Project exons from reference genome into every other
genome.
Join projections to get multiple alignments.
Use orthologous sequences from multiple species to
get more accurate annotations.


Produce annotations with all supporting evidence.
Exploit phylogenetic relationships among the species.
Projecting Annotated Exons
Annotation
Homology Map
ExonAligner
ExonAligner : An Exon Alignment Program
Evidence
Target

Mixture of global and local alignment.



Exploit the property that they code for homologous
proteins.


Special Dynamic Programming Matrix
Robust.



Penalize overhanging ends in Evidence.
Overhanging ends in Target is OK.
Sequencing Errors.
Phase Shifts.
Chaining Algorithm for large sequences.
The Dynamic Programming Matrix
The figure only shows edges into the black node. The red edges represent
non-codon gaps, i.e. gaps caused by phase shifts/sequencing errors
and are of length which is not a multiple of 3. They are heavily penalized.
Chaining Algorithms

Widely used in large scale alignment
algorithms.






MUMer [Salzberg et al. 1999]
AVID [Bray et al. 2002]
LAGAN [Brudno et al. 2003]
Step 1: Find good local alignments or
fragments.
Step 2: Select a consistent subset of
fragments for chaining and call these
fragments anchors.
Step 3: Join the anchors to get an alignment.
The ExonAligner Chaining Algorithm

Construction of Fragments.



Translate target sequence in the 3 frames.
Find significant hits with (translated) evidence and
use them as fragments.
Selection of Anchors.

Construct weighted DAG from fragments.



Weigh edges by using dynamic programming.
Use nodes in the shortest path in the DAG as
anchors.
Use dynamic programming to join anchors
together.
Exploiting Phylogeny
Recoverable Exons/Genes
?
Recoverable Exons/Genes
Project Using Exon Aligner
Preliminary Results

Created a Homology map of Human, Chimp,
Rat, Mouse and Chicken genomes




266836 exon cliques.
45543 non-convex exon cliques.
27502 of these recoverable.
Used ExonAligner to map 3300 human Refseq
genes into the chimp genome.

Robustness of Algorithm critical.


500 of the 42662 exons had non-codon gaps.
These alignments will in turn be used to learn
parameters for ExonAligner.

Extrapolate parameters for other species.
An Illustrative Example

RefSeq Gene NM_030575



Potential orthologous gene in chimp



Single exon gene, 221 a.a. protein.
No orthologous gene found by Genewise.
2 non-codon gaps in alignment (1 insertion and 1
deletion separated by 60 nt).
212 out of 221 amino acids are matches.
Is this a real ortholog?


Phase Shift/Sequencing Error?
Find orthologous genes in other species and use
multiple alignment.
Future Work

Extend ExonAligner for Multiple Species

Robust realignment



Take into account codon structure
Robust for phase shifts/sequencing errors
Annotation with supporting evidence.



Basic Evidence e.g. RefSeq Gene Annotation
Multiple Alignment with Orthologous Features
Score : Statistical Significance of the Feature
Future Work

Comprehensive Annotation Program



Rapid annotation of Drosophila and
Mammalian genomes


Put Evidence Based and Ab initio methods
together
Try to use alignment/homology in Gibbs Sampler
Berkeley AAA group for Drosophila genomes.
Study the evolution of genes

Find human specific genes