Transcript cha2

Promoter Prediction in E.coli using ANN
A.Krishnamachari
Bioinformatics Centre, JNU
[email protected]
Definition of Bioinformatics
• Systematic development and application
of Computing and Computational
solution techniques to biological data to
investigate biological process and make
novel observations
Research in Biology
General approach
Bioinformatics era
Organism
Functions
Cell
Chromosome
DNA
Sequences
intergenic
TSS
TF
-35
-10
TF -> Transcription Factor Sites
TSS->Transcription Start Sites
RBS -> Ribosome Binding sites
CDS - > Coding Sequence (or) Gene
RBS
CDS
Statement of the problem
• Given a set of known sequences
pertaining to a specific biological feature ,
develop a computational method to search
for new members or sequences
Computational Methods
• Pattern Recognition
• Pattern classification
• Optimisation Methods
Sequence Analysis
AND
Prediction Methods
• Consensus
• Position Weight Matrix (or) Profiles
• Machine Learning Methods
– Neural Networks
– Markov Models
– Support Vector Machines
– Decision Tree
– Optimization Methods
Statistical Learning
Consensus sequence
TA TAA T
49%,54% and 58%
14 sites out of 291 sequences [Lisser and Margalitt]
Mismatches but which one?
Describing features using
frequency matrices
• Goal: Describe a sequence feature (or
motif) more quantitatively than possible
using consensus sequences
• Need to describe how often particular
bases are found in particular positions in
a sequence feature
Describing features using
frequency matrices
• Definition: For a feature of length m
using an alphabet of n characters, a
frequency matrix is an n by m matrix in
which each element contains the
frequency at which a given member of
the alphabet is observed at a given
position in an aligned set of sequences
containing the feature
Frequency matrices (continued)
• Three uses of frequency matrices
– Describe a sequence feature
– Calculate probability of occurrence of feature
in a random sequence
– Calculate degree of match between a new
sequence and a feature
Frequency Matrices, PSSMs,
and Profiles
• A frequency matrix can be converted
to a Position-Specific Scoring Matrix
(PSSM) by converting frequencies to
scores
• PSSMs also called Position Weight
Matrixes (PWMs) or Profiles

Methods for converting
frequency matrices to PSSMs
• Using log ratio of observed to expected
score(i)  log m( j,i)/ f ( j)
where m(j,i) is the frequency of character j
observed at position i and f(j) is the overall
frequency of character j (usually in some
large set of sequences)
• Using amino acid substitution matrix (Dayhoff
similarity matrix) [see later]
Finding occurrences of a
sequence feature using a Profile
• As with finding occurrences of a
consensus sequence, we consider all
positions in the target sequence as
candidate matches
• For each position, we calculate a score by
“looking up” the value corresponding to the
base at that position
Positions (Columns in alignment)
Nucleotide
s
1
2
3
4
5
A
x11
x21
x31
x41
x51
T
x12
x22
x32
x42
x52
G
x13
x23
x33
x43
x53
C
x14
x24
x34
x44
x54
TAGCT AGTGC
V1
x12 + x21 + x33 + x44 + x52
if V1 is above a threshold it is a site
Building a PSSM
Set of Aligned
Sequence
Features
Expected
frequencies of
each sequence
element
PSSM
builder
PSSM
Searching for sequences related to a
family with a PSSM
Set of
Aligned
Sequence
Features
Expected
frequencies
of each
sequence
element
PSSM
builder
PSSM
Threshold
Set of
Sequences
to search
PSSM
search
Sequences that match above
threshold
Positions and scores of
matches
Consensus sequences
vs.
frequency matrices
• consensus sequence or a frequency
matrix which one to use?
– If all allowed characters at a given position
are equally "good", use IUB codes to create
consensus sequence
• Example: Restriction enzyme recognition sites
– If some allowed characters are "better" than
others, use frequency matrix
• Example: Promoter sequences
Consensus sequences
vs.
frequency matrices
• Advantages of consensus sequences:
smaller description, quicker comparison
• Disadvantage: lose quantitative
information on preferences at certain
locations
Linear Classification Problems
Measure2
Measure1
Nonlinear Classification Problem
Measure 2
Measure 1
(Artificial) Neural Network
What Is A Neural Network ?
A computational construct based on
biological neuron
. A neural network can:
•Learn by adapting its synaptic weights to
changes in the surrounding environments;
•handle imprecise, fuzzy, noisy, and
probabilistic information; and
•generalize from known tasks or examples to
unknown ones.
Artificial neural networks (ANNs) attempt to
mimic some,or all of these characteristics.
Neural Network
• Characterised by:
- its pattern of connections between the
neurons (Network Architecture)
- its method of determining the weights on
the connections (training or learning
algorithm)
Why Neural Network:Applications
-Little or no incomplete understanding of
the problem to be solved (very little theory)
-Abundant data available
Neural Networks: Applications
•
•
•
•
•
•
Pattern classification
Speech synthesis and recognition
Image compression
Clustering
Medical Diagnosis
Manufacturing
Neural Network:Bioinformatics
•
•
•
•
•
Binding sites prediction
Protein Secondary Structure prediction
Protein folds
Micro array data clustering
Gene prediction
Neural Networks
• Supervised Learning
• Unsupervised Learning
Output
inputs
Layer 1 (input)
1
Layer 2 (output)
W1,3
3
2
W2,3
Direction of information flow
Summation Operation
xi * wij=x1*w1+x2*w2+x3w3….+xnWnj
Thresholding functions
Output = 0 if
Output = 1 if
Output
1
1
x*w < T
x*w >T
Output 1
Threshold=0
0 

0
T
Output
Input
Logistic Transfer function
1
Output =
-
1+e
Weight updates
W(k+1) = w(k)+ µ[T9k) – w(k)x(k)]x(k)
for 0 ≤ k≤ N-1
Learning Concepts
• Generally
– the target output is 1 for +ve
– The target output is 0 for –ve
– But practically (0.9, 0.1) combination
• Stopping criterion
Based on certain epochs or cycles
Based on certain error estimates
Nucleic Acid
A T G C
1
2
Position
In a sequence
Of K nucleotides
K-1
K
let the following binary values represent each base
A="0001
C="0010
G="0100
T="1000
then
G=4
A or C = "0011 = 3
A,G or T = "1101 = 13
etc.
Nucleic Acid
1A
2G
A T G
C
0 0
0 1
1
0
0
0
Position
In a sequence
Of K nucleotides
K-1 G 0
K T 1
1
0 0
0
0 0
Wi,j
P=50
N=50
TP + TN =100
Note: Training and Test sequences are fixed length
1 2 …………50
1 2 …………50
n=10
N=500
TRAINING
CGTAGCTATAGTGGG
TTTAAACCCAAGAAT
TATGGAATTTGGAAG
TTTAGGATAGCACAG
GATAAGGCCTAGATA
TTTATGCATGAGATG
TEST
CCTGAACTGAG
ATGATATATA A
GTGAAATTCCG
Prediction
Method
Input
Output
Input Layer
-1
Hidden layer
Output Layer
Error
function
A
(local)
weight
B
(global)
Learning
Recognition
<x,f(x)>
x
NN
NN
h(x)
Example
Disease A diagnosis:
x – gene expression data
(vector of numbers)
f(x) – A positive / A negative (boolean 0/1)
<x,f(x)> - set of known values
Evaluation Mechanism
• Sensitivity =
TP
TP + FN
Specificity =
TN
TN + FP
C
=
(TPxTN – FPxFN)
(TP+FP)x(FP+TN)x(TN+FN)x(FN+TP)
Cross - Validation
• Benchmarking the Network Performance
• Step:1
Divide the training set into “N” partitions
• Step: 2
Train the “N-1” partitions and
Test the Left out
• Step: 3
Evaluate the Performance
1. Nucleic Acids Res. 1992 Aug 25;20(16):4331-8.
An assessment of neural network and statistical approaches for prediction of E. coli
promoter sites.
Horton PB, Kanehisa M.
2. Nucleic Acids Res. 1994 Jun 11;22(11):2158-65.
Analysis of E.coli promoter structures using neural networks.
Mahadevan I, Ghosh I.
O’Neill MC
Training back-propagation neural networks to define and detect DNA-binding sites.
Nucleic Acids Res. 1991 Jan 25;19(2):313-8.
Escherichia coli promoters: neural networks develop distinct descriptions in learning
to search for promoters of different spacing classes.
Nucleic Acids Res. 1992 Jul 11;20(13):3471-7.
A general procedure for locating and analyzing protein-binding sequence motifs
in nucleic acids.
Proc Natl Acad Sci U S A. 1998 Sep 1;95(18):10710-5.
Pedersen AG, Engelbrecht Investigations of Escherichia coli
promoter sequences with artificial neural networks: new signals discovered
upstream of the transcriptional startpoint.
Proc Int Conf Intell Syst Mol Biol. 1995;3:292-9.
Basic Gene Grammars and DNA-ChartParser for language processing of
Escherichia coli promoter DNA sequences
Siu-wai Leung, Chris Mellish and Dave Robertson
E.Coli PROMOTER DATA
Earlier Studies
• Data set corresponding to 17 spacing
class
• Very high threshold values
Network
Model
Performance
pBR322(CW)
17-1
34/35
-5,
125,805,1021,12
339,477,1584,16 26,4278
57,1970,4130
17-2
33/35
….
….
17-3
34/35
17-4
32/35
-5, 1584,
1970,4130
125,807,1226,42
78
Poll(4/4)
pBR322(CCW)
New
New
New
New
NEW
Disadvantages of MLP
• i) Slow convergence,
• ii) Training relies heavily on the choice of
the number of hidden layers and
• iii) Mode of prediction is generally based
on high threshold values.
Improvements in Prediction
• Pre processing of the data based on DNA
structure
• Clean Model
Structural Atlas of E.coli
: J Mol Biol. 2000 Jun 16;299(4):907-30.
A DNA structural atlas for Escherichia coli.
Pedersen AG, Jensen LJ, Brunak S, Staerfeldt HH, Ussery DW.
1) The size of the positive data set may be increased by incorporating point
mutations in non-sensitive positions
2) Negative data sets are generated by several ways i.e.
a) Shuffling or randomising the positive data set. This does not destroy
the the correlations between the bases completely.
b) Using random sequences with a biased composition,
c) Extracting the sequences from gene coding segments.
3) For the learning phase, the number of positive and negative input vectors
are not generally proportionate and there is no standard prescription
4) Convergence factor and predictive ability depend on the size and the
number of input vectors .