Computational Gene Finding - Department of Computer Science • NJIT

Download Report

Transcript Computational Gene Finding - Department of Computer Science • NJIT

Computational
Gene Finding
CIS786 Intro to Comp Biol
Instructor: Dr. Barry Cohen
Greg Voronin
Hui Zhao
Xueyi(Judy) Xiao
The Challenge
Presented By Greg Voronin
Generate predictions of gene locations
from primary genomic sequence by
computational means
 Two principle means:

– Database searching
– Statistical Methods
The Biological Model
The Computational Model
Representing the biology in a
framework amenable to
mathematical/statistical methods
 Exon classification, sequence features,
signal profiles

– What is an exon and what properties does
the sequence of an exon hold?
– How is an exon recognized and
processed?
Exon Classification Scheme
The Nature of The Data

What is the primary genomic sequence?
• “Nor is the available sequence a single continuous and exact
sequence for each chromosome… [ the HGP ] is represented
by a set of sequences that cover the genome is a statistical
sense but have a very large number of gaps.”
– Many genes are as large or larger than the contigs
in the HGP
– Finding genes will depend on the accuracy of the
scaffold of their contigs
Back to Beginning

What is a gene?
– A biological model, a mathematical model and
computational representation
The programs we evaluate take these factors into account
in their underlying model
MZEF
Michael Zhang’s Exon Finder
 Utilizes quadratic discriminant analysis
(QDA) to classify sequence into gene
and non-gene groups

– QDA is a multivariate statistical pattern
recognition method
– “Draws” a curved boundary between
groups of different classes
QDA
Key Elements of QDA

Entities are represented by an n-dimensional
vector of feature values
 Two classes of entities are categorized by
their respective multinormal distribution
– Each class has its own mean vector
– The mean of each feature

An appropriate distance function is central to
the calculation of the posterior probabillity of
group membership of a given unknown entity
given its specific feature vector.
Mahalanobis Distance

The actual posterior probabillity function
is more complex, but this is the distance
component:
( x – mi
)T
Si ( x – mi )
-1
MZEF Specifics

MZEF uses the following features:
– Exon length, exon-intron transition,
branch site score, 3’ss score, exon
score, strand score, frame score, 5’ss
score, intron-exon transition
9 dimensional feature vector
 Training sets of known exons and
“non-exons” are used to establish the
class characterisitics

– Supervised learning
…GATC… to Gene
Cells recognize genes from DNA
sequence.
Can we??
The Hidden Markov Model Method
HMMgene Presented By Hui Zhao
HMMs are Statistical Models

Definition:
– Any mathematical construct that attempts to
parameterize a random process

Example: A normal distribution
–
–
–
–

Assumptions
Parameters
Estimation
Usage
HMMs are just a little more complicated…
Primary HMM Assumptions

Observations are ordered
 Random processes can be represented by a
stochastic finite state machine with emitting states
– transition probabilities and emission probabilities.
How do we find the model probabilities?

This is called training
 We start with an architecture and a set of observed
sequences
 The training process iteratively alters its
parameters to fit the training set
 The trained model will assign the training
sequences high probability
– but can it generalize?
HMM Usage – two major tasks

Evaluate the probability of an observed sequence
given the model (Forward)

Find the most likely path through the model for a
given observation sequence (Viterbi)
Gene Finding:
An Ideal HMM Application

Our Objective:
– To find the coding and non-coding regions of
an unlabeled string of DNA nucleotides

Our Motivation:
– Assist in the annotation of genomic data
produced by genome sequencing methods
– Gain insight into the mechanisms involved in
transcription, splicing and other processes
Why HMMs might be a good fit for
Gene Finding





The observations within a sequence are ordered
A DNA sequence is a set of ordered observations
Designing the architecture is straight forward:
Easy to measure success
Training data is available from various genome
annotation projects
A HMM genefinder

States represent standard gene features: intergenic
region, exon, intron, perhaps more (promotor,
5’UTR, 3’UTR, Poly-A,..).

Observations are things like state-dependent base
composition.

In a HMM, length of each state must be included as
well.

Finally, reading frames and both strands must be
dealt with.
Several problems can occur
5’
correct gene structure
extended exon
missing exon
additional exon
missing intron
extended gene model
3’
HMMgene
Krogh (1997) In Proc. 5th Conf. Intel. Sys. Mol Biol. pp179-186
•Predicts whole genes in any given stretch of DNA
•Uses Hidden Markov Models (HMM) to maximize
probability of accurate prediction
•This allows confidence levels to be determined and
"Best Prediction" as well as potential alternative
splicing predictions
•Outputs splice sites, start and stop codons,
alternative predictions
•Trained for human and C. elegans
HMMGene





Uses an extended HMM called a CHMM
CHMM = HMM with classes
Takes full advantage of being able to modify the
statistical algorithms
Uses high-order states
Trains everything at once
How does HMMGene work?
1) 5th order HMM assumes:
P(xi | xi-1,xi-2, xi-3, xi-4, xi-5) is different in Introns, Exons, etc..
e.g: P(G, I | A,C,G,G,T)  P(G, E | A,C,G,G,T)
2) Construct the model
2. How does HMMGene work?
3) In a CHMM states emit a  nucl.  pair
 class label 


G  G 
e. g.   or  
 I  E
4) Use Viterbi (n-best) to find
a path through the CHMM = a labeled gene
5) Use the forward algorithm to measure P(gene | model)
–using n-best.
A DNA sequence containing one gene. For each nucleotide its label is
written below. The coding regions are labeled ‘C’, the introns ‘I’, and the
intergenic regions ‘0’. HMMGene calls these class labels in a CHMM.
HMMGene

Does not use the standard ML method which optimizes the
probability of the observed sequence – instead it maximizes
the probability of the correct prediction.

Only one conference paper describes the algorithm. There
is a web site to run the algorithm, and it's performance has
been compared to other algorithms.

No complete description of the algorithm is available – in
the 1997 paper the author states "… the details of
HMMGene will be described elsewhere (in prep)" – but
unfortunately the detailed paper has not been published.
HMMgene
http://www.cbs.dtu.dk/services/HMMgene/)
HMMgene and HMM Disadvantages

Markov Chains
P(x)
…
P(y)
– States should be independent
– P(y) must be independent of P(x) -usually not true

Local maxima
– Model may not converge the optimal parameter set

Over-fitting
– More training is not always good-set may be too small
Summary
• HMMgene finds whole genes in anonymous DNA
with correctly spliced exons.
• It can predict several whole or partial genes in one
sequence.
•If some features of a sequence are known, such as
hits to ESTs, proteins, or repeat elements, these
regions can be locked as coding or non-coding and
then the program will find the best gene structure
under these constraints.
GENSCAN (v1.0)
Presented By Xueyi (Judy) Xiao

A computer program identifying complete exon & intron
structures of genes in genomic DNA.

Developed by Chris Burge (Burge 1997), in the research
group of Samuel Karlin, Dept of Mathematics, Stanford Univ.

Original server @Stanford  New server @MIT (seq_len <= 500 kb);
Servers are also maintained by the Pasteur Institute, Paris and by the
GENSCAN web server at DKFZ/EMBnet, Heidelberg

Implementations



web server http://genes.mit.edu/GENSCAN.html
email server http://genes.mit.edu/GENSCANM.html
local copy downloaded under a license agreement
How does It Work?

Designed to predict complete gene structures
 Introns and exons
 Promoter sites
 Polyadenylation signals

Larger predictive scope
 Partial and Complete genes
 Multiple genes separated by intergenic DNA in a seq
 Consistent sets of genes on either/both DNA strands

Not use similarity-based methods

Based on a general probabilistic model of
genomic sequences composition and gene structure
Model of Genomic Sequence Structure
Fig. 3, Burge and Karlin 1997
Input
http://genes.mit.edu/GENSCAN.html
Output
Graphic View
Initial
Exon
Internal
Exon
Terminal
Exon
Single-Exon
gene
Optimal Exon
Suboptimal Exon
Is It Good?

Accuracy:
Substantially higher accuracies when tested on standardized
sets of human & vertebrate genes, with 75-80% of exons
identified exactly.

Reliability:
Able to indicate fairly accurately the reliability of each predicted
exon.

Consistency:
Consistently high levels of accuracy, for seqs of differing C+G
content and for distinct groups of vertebrates.
Why not Perfect?

Gene Number
usually approximately correct, but may not

Organism
primarily for human/vertebrate seqs; maybe lower accuracy for
non-vertebrates. ‘Glimmer’ & ‘GeneMark’ for prokaryotic or yeast seqs

Exon and Feature Type
Internal exons > Initial or Terminal exons;
Exons > Polyadenylation or Promoter signals(‘NNPP’)

Biases in Test Set
The Burset/Guigó (1996) dataset:
 toward short genes with relatively simple exon/intron structure
The Rogic (2001) dataset:




DNA seqs: GenBank r-111.0 (04/1999 <- 08/1997);
source organism specified;
consider genomic seqs containing exactly one gene;
seqs>200kb were discarded; mRNA seqs and seqs containing pseudo genes or
alternatively spliced genes were excluded.
What are They doing NOW?
The research group @MIT
is currently developing another program,
GenomeScan, which is more accurate
when a moderate or closely related
protein seq is available.
TEST OF METHODS

Sample Tests reported by Literature
 Test on the set of 570 vertebrate gene seqs (Burset&Guigo 1996) as
a standard for comparison of gene finding methods.
 Test on the set of 195 seqs of human, mouse or rat origin (named
HMR195) (Rogic 2001).

Self-Test done by our group
 Dataset: Intron-less(Single-exon), -rich(Multi-exon), -poor(Random)
 Organism: Human
 Methods: all of the three
 Steps
Where to get the dataset for Self-Test?
http://www.ncbi.nlm.nih.gov/genome/guide/human/
Accuracy Measures
Sensitivity vs. Specificity (adapted from Burset&Guigo 1996)
TP
FP
TN
FN
TP
FN
TN
Actual
Predicted
Actual
No Coding / Coding
Predicted
Coding / No Coding
TP
FP
FN
TN
Sensitivity
(Sn)
Fraction of actual coding regions that are correctly
predicted as coding
Specificity
(Sp)
Fraction of the prediction that is actually correct
Correlation
Coefficient (CC)
Combined measure of Sensitivity & Specificity
Range: -1 (always wrong)  +1 (always right)
Results: Accuracy Statistics
Table: Relative Performance (adapted & added from Rogic 2001)
Test By Rogic 2001
Programs
# of
seq
Nucleotide
accuracy
Multi-Exon
# of
ESn
Seq
ESp
Single-Exon
# of
ESn ESp
Seq
ESn
ESp
195(3) 0.95 0.90 0.91
0.70
0.70
5
0.57
0.63
5
0.60 0.50
HMMgene 195(5) 0.93 0.93 0.91
0.76
0.77
5
0.42
0.42
5
0.60 0.30
0.58
0.59
5
0.76
0.62
5
0.40 0.40
MZEF
Sp
Exon
accuracy
CC
Genscan
Sn
Self-Test 2002
119(8) 0.70 0.73 0.66
# of seqs - number of seqs effectively analyzed by each program;
in parentheses is the number of seqs where the absence of gene
was predicted;
Sn -nucleotide level sensitivity; Sp - nucleotide level specificity;
CC - correlation coefficient;
ESn - exon level sensitivity; ESp - exon level specificity
Testing ‘Random’ Sequences
Presented By Greg Voronin

These gene finding programs model
statistical trends and properties
– Can they be fooled by ‘random’ sequences
– Generate a preliminary measure of accuracy
Java program written to generate ‘random’
sequences of a,t,g,c
 3 groups of sequences 5k, 10k & 30K
 Sent to BLAST then GeneMachine

Testing Results

BLAST:
bit score
– 5k
42
– 10k
44
– 30k
42

E-value
5.7
3.0
8.7
GeneMachine:
– MZEF
– GenScan
– HMMgene
5k
1
3
7
10k
5
11
11
30K
14
26
42
New directions
Presented By Hui Zhao
•Computational Gene Finding has rapidly evolved
since it started 20 years ago.
•The advent of full-length genomic sequences has
provided data and increased the requirements.
•Gene annotation has direct medical implications on
the design of pharmaceuticals and the understanding of
the genetic component of diseases.
•Gene finding remains largely an unsolved problem.
New directions
•The growing quantities of training data for the models
should improve their performance.
•Algorithms that combine the inputs from several
models in a weighted voting scheme should be
considered to try to get the best from all of the
methods.
•Many other AI approaches can be used to meet this
challenge including decision trees, neural networks
and rule-based systems
Challenges and Discoveries Ahead

Eukaryotic gene finding continues to be an active
and important area – more research is required
into algorithms with greater accuracy

Expertise in computational biology is also
required – which means training in both:
computer science and molecular biology

More classes like this…