Pattern Recognition in Biological Sequences

Transcript Pattern Recognition in Biological Sequences

Gene Prediction
in silico
Nita Parekh
BIRC, IIIT, Hyderabad
Goal
The ultimate goal of molecular cell biology is to
understand the physiology of living cells in terms
of the information that is encoded in the genome
of the cell
How computational approaches can help
in achieving this goal ?
A gene codes for a protein
DNA
CCTGAGCCAACTATTGATGAA
transcription
mRNA
CCUGAGCCAACUAUUGAUGAA
translation
Protein
PEPTIDE
What is Computational Gene Finding?
Given an uncharacterized DNA sequence, find:
– Which region codes for a protein?
– Which DNA strand is used to encode the
gene?
– Which reading frame is used in that strand?
– Where does the gene start and end?
– Where are the exon-intron boundaries in
eukaryotes?
– (optionally) Where are the regulatory
sequences for that gene?
Search space - 2-5% of Genomic DNA
(~ 100 – 1000 Mbp)
Need for Computational Gene Prediction


It is the first step towards getting at
the function of a protein.
It also helps accelerate the annotation
of genomes.
Deoxyribonucleic acid (DNA)
– is a blueprint of the cell
Composed of four basic units
- called nucleotides
Each nucleotide contains
- a sugar, a phosphate and
one of the 4 bases:
Adenine(A), Thymine(T),
Guanine(G), Cytosine(C)
For all computational purposes, a DNA
sequence is considered to be a string on a
4-letter alphabet: A, T, G, C
ACGCTGAATAGC
The aim is to find grammar & syntax rules of
DNA language based on the 4-letter alphabet,
- similar to English Grammar to form
meaningful sentences
Biological Sequences
Order of occurrence of bases:
not completely random
- Different regions of the genome exhibit
different patterns of the four bases, A, T, G, C
e.g., protein coding regions, regulatory regions, intron/exon
boundaries, repeat regions, etc.
Aim: identifying these various patterns to infer
their functional roles
Assumption in biological sequence analysis:
- strings carrying information will be different
from random strings
If a hidden pattern can be identified in a
string, it must be carrying some functional
information
Example
This is a lectre
bioinformatics
on
This is a lecture on bioinformatics
asjd lkjfl jdjd sjftye
nvcrow nzcdjhsp
asjd lkjfl jdjd sjftye nvcrow nzcdjhspu
Frequency of letters
A.
B.
C.
D.
E.
F.
G.
H.
I.
J.
K.
L.
M.
7.3%
0.9%
3.0%
4.4%
13.0%
2.8%
1.6%
3.5%
7.4%
0.2%
0.3%
3.5%
2.5%
N. 7.8%
O. 7.4%
P. 2.7%
Q. 0.3%
R. 7.7%
S. 6.3%
T. 9.3%
U. 2.7%
V. 1.3%
W. 1.6%
X. 0.5%
Y. 1.9%
Z. 0.1%
Other statistics


Frequencies of the most common first letter
of a word, last letter of a word, doublets,
triplets etc.
20 most used words in
– Written English
the of to in and a for was is that on at he
with by be it an as his
– Spoken English
the and I to of a you that in it is yes was
this but on well he have for
Parallels in DNA language
ATGGTGGTCATGGCGCCCCGAACCCTC
TTCCTGCTGCTCTCGGGGGCCCTGACC
CTGACCGAGACCTGGGCGGGTGAGTG
CGGGGTCAGGAGGGAAACAGCCCCTG
CGCGGAGGAGGGAGGGGCCGGCCCG
GCGGG
GTCTCAACCCCTCCTCGCCCCCAGGCT
CCCACTCCATGAGGTATTTCAGCGCCG
CCGTGTCCCGGCCCGGCCGCGGGGAG
CCCCGCTTCATCGCCATGGGCTACGTG
GACGACACGCAGTTCGTGCGGTTC
Parallels in DNA language
ATG GTG GTC ATG GCG CCC CGA ACC
CTC TTC CTG CTG CTC TCG GGG GCC
CTG ACC CTG ACC GAG ACC TGG GCG
GGT GAG TGC GGG GTC AGG AGG GAA
ACA GCC CCT GCG CGG AGG AGG GAG
GGG CCG GCC CGG CGG…
GTC TCA ACC CCT CCT CGC CCC CAG
GCT CCC ACT CCA TGA GGT ATT TCA
GCG CCG CCG TGT CCC GGC CCG GCC
GCG GGG AGC CCC GCT TCA TCG CCA
TGG GCT ACG TGG ACG ACA CGC AGT
TCG TGC GGT TC…
This task needs to be automated because of
the large genome sizes:
Smallest genome:
Mycoplasma genitalium 0.5 x 106 bp
Human genome: 3 x 109 bp (not the largest)
Finding genes in Prokaryotes
• each gene is one continuous stretch of
bases
• most of the DNA sequence codes for protein
(70% of the H.influenzea bacterium genome is coding)
Finding genes in Prokaryotes
Gene prediction in prokaryotes is considerably
simple and involves:
• identifying long reading frames
• using codon frequencies
Finding genes in Eukaryotes
• the coding region is usually discontinuous
• composed of alternating stretches of exons
and introns
• Only 2-3 % of the human genome (~3 x
109bp) codes for proteins
Finding genes in Eukaryotes
Gene finding problem complicates:

due to the existence of interweaving exons
and introns – stop codons may exist in intronic
regions making it difficult to identify correct ORF

a gene region may encode many proteins –

Exon length need not be multiple of three –

Gene may be intron-less (single-exon genes)

Relatively low gene density - only 2 - 5% of
due to alternative splicing
resulting in frameshift between exons
the human genome codes for proteins
Methods for Identifying Coding Regions
 Finding Open Reading Frames (ORFs)
 Homology Search
•
DNA vs. Protein Searches
 Content-based methods:
•
Coding statistics, viz., codon usage bias,
periodicity in base occurrence, etc.
 Signal-based methods:
•
•
CpG islands
Start/Stop signals, promoters, poly-A sites,
intron/exon boundaries, etc.
 Integration of these methods
Finding Open Reading Frames (ORF)




Once a gene has been sequenced it is
important to determine the correct open
reading frame (ORF).
Every region of DNA has six possible reading
frames, three in each direction
The reading frame that is used determines
which amino acids will be encoded by a gene.
Typically only one reading frame is used in
translating a gene, and this is often the
longest open reading frame
Finding Open Reading Frames (ORF)



Detecting a relatively long sequence
deprived of stop codons indicate a coding
region
An open reading frame starts with a start
codon (atg) in most species and ends with a
stop codon (taa, tag or tga)
Once the open reading frame is known the
DNA sequence can be translated into its
corresponding amino acid sequence using
the genetic code
The codons are triplet of bases
The Genetic Code
Finding Open Reading Frames (ORF)
Consider the following sequence of DNA:
5´ TCAATGTAACGCGCTACCCGGAGCTCTGGG
CCCAAATTTCATCCACT 3´ “Forward Strand”
Its complementary Strand is:
3´ AGTTACATTGCGCGATGGGCCTCGAGACCCGGG
TTTAAAGTAGGTGA 5´
“Reverse Strand”
The DNA sequence can be read in six reading
frames - three in the forward and three in the
reverse direction depending on the start
position
Finding Open Reading Frames (ORF)
5´ TCAATGTAACGCGCTACCCGGAGCTCTG
GGCCCAAATTTCATCCACT 3´
Three reading frames in the forward direction:
Start codon
1.
TCA ATG TAA CGC GCT ACC CGG AGC
TCT GGG CCC AAA TTT CAT CCA CT
2.
CAA TGT AAC GCG CTA CCC GGA GCT
CTG GGC CCA AAT TTC ATC CAC T
3.
AAT GTA ACG CGC TAC CCG GAG CTC
TGG GCC CAA ATT TCA TCC ACT
Finding Open Reading Frames (ORF)
3´ AGTTACATTGCGCGATGGGCCTCGAGACC
CGGGTTTAAAGTAGGTGA 5´
Three reading frames in the reverse direction:
1.
AG TTA CAT TGC GCG ATG GGC CTC
GAG ACC CGG GTT TAA AGT AGG TGA
2.
A GTT ACA TTG CGC GAT GGG CCT CGA
GAC CCG GGT TTA AAG TAG GTG
3.
AGT TAC ATT GCG CGA TGG GCC TCG
AGA CCC GGG TTT AAA GTA GGT
stop codon
Start codon
Finding Open Reading Frames (ORF)
In this case the longest open reading frame
(ORF) is the 3rd reading frame of the
complementary strand :
AGT TAC ATT GCG CGA TGG GCC TCG
AGA CCC GGG TTT AAA GTA
When read 5´ to 3´, the longest ORF is:
ATG AAA TTT GGG CCC AGA GCT CCG
GGT AGC GCG TTA CAT TGA
Finding Long ORFs



First step to distinguish between a coding
and a non-coding region is to look at the
frequency of stop codons
Sequence
search)
similarity
search
(database
When no sequence similarity is found, an
ORF can still be considered gene-like
according to some statistical features:
v the three-base periodicity
v higher G+C content
v signal sequence patterns
Finding Long ORFs
Once a long ORF/ all ORFs above a certain
threshold are identified,
- these ORF sequences are called putative
coding sequences
- translate each ORF using the Universal
Genetic code to obtain amino acid sequence
- search against the protein database for
homologs
Finding genes in Prokaryotes
Drawbacks:

The addition or deletion of one or more
bases will cause all the codons scanned to
be different
 sensitive to frame shift errors


Fails to identify very small coding regions
Fails to identify the occurrence of
overlapping long ORFs on opposite DNA
strands (genes and ‘shadow genes’)
Web-based tools
ORF Finder (NCBI)
http://www.ncbi.nih.gov/gorf/gorf.html
EMBOSS
getorf - Finds and extracts open reading
frames
plotorf - Plot potential open reading frames
Sixpack - Display a DNA sequence with 6frame translation and ORFs
http://www.hgmp.mrc.ac.uk/Software/EMB
OSS/Apps/getorf.html
Homology Search
This involves Sequence-based Database
Searching
• DNA Database searching
• Protein Database searching
Homology Search
Why search databases?
When one obtains a new DNA sequence,
one needs to know:



whether it already exists in the databanks
whether it has any homologous
sequences (i.e., sequences derived from a
common ancestry) in the databases
Given a putative coding ORF, search for
homologous proteins – proteins similar in
their folding or structure or function.
Homology Search
DNA vs. Protein Searches
Use protein for database
similarity searches whenever
possible
Homology Search
Three main search tools used for database
search:
•
BLAST - algorithm by Karlin & Altschul
http://www.ncbi.nlm.nih.gov/BLAST/
•
FastA - algorithm by Pearson & Lipman
http://www.ebi.ac.uk/fasta33/
•
Smith-Waterman (SW) algorithm
- dynamic programming algorithm
Limitations of Homology Search
Only limited number of genes are available in
various databases.
 Currently only 50% of the sequences are
found to be similar to previously known
sequences.

It should always be kept
similarity-based methods are
as the databases that are
apparent homology can be
times
in mind that
only as reliable
searched, and
misleading at
Content-based Methods
At the core of all gene identification programs
– there exist one or more coding measures
A coding statistic - a function that computes
the likelihood that the sequence is coding for
a protein.
A good knowledge of core coding statistics is
important to understand how gene
identification programs work.
Classification of Coding Measures
Coding statistics measure
• base compositional bias
• periodicity in base occurrence
• codon usage bias
Main distinction is between
• measures dependent of a model of
coding DNA
• measures independent of such a model.
Model dependent coding statistics capture
the specific features of coding DNA:
 Unequal usage of codons in the coding
regions - a universal feature of the genomes
 Dependencies between nucleotide positions
 Base compositional bias between codon
positions
requires a representative sample of coding
DNA from the species under consideration to
estimate the model's parameters
-
Markov Models
Dependencies between nucleotide positions in coding
regions
- can be explicitly described by means of Markov Models
In Markov Models
- the probability of a nucleotide at
a particular codon position depends
on the nucleotide(s) preceding it.
Probability of a DNA sequence of
a st  P( xi  t | xi 1  s)
length L:
transition probabilities
L
P ( x )  P ( x 1 ) a xi  1 , xi
i 2
Markov Models
Table III: Probabilities of the four nucleotides at the
different codon positions conditioned to the nucleotide in
the preceding codon position
Model independent coding statistics
capture only the “universal” features of
coding DNA:
 Position Asymmetry – how asymmetric is
the distribution of nucleotides at the 3 triplet
positions
 Periodic Correlation - correlations between
nucleotide positions
- do not require a sample of coding DNA
Signal-based Methods
Signal – a string of DNA recognized by the cellular
machinery
GT
AG
Signals for gene identification
There are many signals associated with
genes, each of which suggests but
does not prove the existence of a gene
Most of these signals can be modeled
using weight matrices
Signals for gene identification
 CpG Islands – identify the 2% of the
genome that codes for proteins
 Start & Stop Codons – signifies the start
& end of a coding region
 Transcription Start Site – to identify the
start of coding region
 Donor & Acceptor Sites - signifies the
start & end of intronic regions
 Cap Site – found in the 5’ UTR
Signals for gene identification
 Promoters – to initiate transcription
(found in 5’ UTR region)
 Enhancers – regulates gene expression,
(found in 5’ or 3’ UTR regions, intronic
regions, or up to few Kb away from the gene)
 Transcription Factor Binding Sites –
short DNA sequences where proteins bind
to initiate transcription /translation
process
 Poly-A Site – identify the end of coding
region (found in 3’ UTR region)
Promoter Detection
Not all ORFs are genes
True coding regions have specific sequences
upstream of the start site known as
promoters where the RNA polymerase binds
to initiate transcription, e.g., in E. coli:
Consensus
patterns
 No two patterns are identical
 All genes do not have these patterns
Positional Weight Matrix
for TATA box
1
2
3
4
5
6
A
2
95
26
59
51
1
C
9
2
14
13
20
3
G
10
1
16
15
12
0
T
79
2
44
13
17
96
Complications in Gene Prediction
The problem of gene identification is further
complicated in case of eukaryotes by the vast
variation that is found in the structure of genes.
On an average, a vertebrate gene is 30Kb long.
Of this, the coding region is only about 1Kb.
The coding region typically consists of 6 exons,
each about 150bp long.
These are average statistics
Complications in Gene Prediction
Huge variations from the average are observed
Biggest human gene, dystrophin is 2.4Mb long.
Blood coagulation human factor VIII gene is ~ 186Kb. It has
26 exons with sizes varying from 69 bp to 3106 bp and its
25 introns range in size from 207 to 32,400 bp.
An average 5’ UTR is 750bp long, but it can be longer and
span several exons (for e.g., in MAGE family).
On an average, the 3’ UTR is about 450bp long, but for e.g.,
in case of the gene for Kallman’s syndrome, the length
exceeds 4Kb
Some facts about human genes






Comprise about 3% of the genome
Average gene length: ~ 8,000 bp
Average of 5-6 exons/gene
Average exon length: ~200 bp
Average intron length: ~2,000 bp
~8% genes have a single exon
Some exons can be as small as 1 or 3 bp
Complications in Gene Prediction
In higher eukaryotes the gene finding becomes far more
difficult because
• It is now necessary to combine multiple ORFs to obtain
a spliced coding region.
• Alternative splicing is not uncommon,
• Exons can be very short, and introns can be very long.
Given the nature of genomic sequence in humans, where
large introns are known to exist, there is definitely a need
for highly specific gene finding algorithms.
Gene prediction programs
GENSCAN: http://genes.mit.edu/GENSCAN.html
Probabilistic model based on HMM. The different states of
the model correspond to different functional units on a
gene, e.g., promoter region, exon, intron etc.
It uses a homogenous 5th order Markov model for noncoding regions and a 3-periodic (inhomogenous) 5th order
Markov model for coding regions
Signals are modeled by weight matrixes, weight arrays and
maximal dependence decomposition techniques.
Species: Vertebrates, Maize Arabidopsis, Trained on human
genes, Accuracy lower for non-vertebrates
Gene prediction programs
Fgene: Uses Dynamic Programming to find the optimal
combinations of exons, promoters, and polyA sites detected
by a pattern recognition algorithm.
Species: Human, Drosophila, Nematode, Yeast, Plant
http://dot.imgen.bcm.tmc.edu.9331/gene-finder/gf.html
GrailExp: Uses HMM as the underlying computational
technique to determine its genomic predictions. Uses Neural
Network to combine the information from various gene
finding signals
Species: Human, Mouse
http://compbio.ornl.gov/grailexp/

Pattern Recognition in Biological Sequences

Transcript Pattern Recognition in Biological Sequences

Directory