X - TIGP Bioinformatics Program

Neural Networks in Bioinformatics
I-Fang Chung
Institute of Bioinformatics, YM
4-27-2006
Experience and Education
• 1989-2000 Electrical and Control Engineering in NCTU
• 2000-2003 (Postdoc) ECE: Laboratory of Intelligent Control
• 2003-2004 (Postdoc) Laboratory of DNA Information Analysis
of Human Genome Center, Institute of Medical Science, Tokyo
University
• 2004-now Institute of Bioinformatics, Yang-Ming
Outline
• Motivation
• To solve one problem in bioinformatics
– Identification of RNA-Interacting Residues in
Protein
• Current projects
Neural Networks
• Neural networks are constructed to resemble the behavior of
neurons in the human brain
• Characterized by the ability to learn, recall, and generalize from
training patterns
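[Figure: a single neuron i with inputs x1, x2, …, xm, weights wi1, wi2, …, wim, net input neti, activation function a(.), and output yi on the output path]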
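The neuron in the figure can be sketched in a few lines of Python. This is a minimal illustration, assuming that neti is the weighted sum of the inputs and that the activation a(.) is a logistic sigmoid; the slide does not say which activation is used, and the name `neuron_output` is made up for this example.

```python
import math

def neuron_output(x, w):
    """Output y_i of one neuron: y_i = a(net_i), with net_i = sum_j w_ij * x_j."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    net = sum(w_j * x_j for w_j, x_j in zip(w, x))  # net_i, the weighted sum
    # a(.) is assumed to be the logistic sigmoid (an assumption, not from the slide)
    return 1.0 / (1.0 + math.exp(-net))

# Made-up inputs and weights; here net_i = 0.5 + 0.0 - 0.5 = 0.0
y = neuron_output([1.0, 0.0, 1.0], [0.5, -0.3, -0.5])
print(y)  # 0.5
```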
Neural Networks (cont’d)
• Good at tasks such as pattern matching, classification,
function approximation, and data clustering
• Good at tasks in bioinformatics such as coding region
recognition, protein structure prediction, and gene clustering
[Figure: layered network with inputs x1, x2, …, xn, weight layers v and w, and output y]
Basic Principles of Discrimination
• Each object is associated with a class label (or response) Y ∈ {1, 2, …,
K} and a feature vector (vector of predictor variables) of G
measurements: X = (X1, …, XG)
Aim: predict Y from X.
[Figure: objects sorted into predefined classes {1, 2, …, K}. A labelled object has Y = class label = 2 and X = feature vector {colour, shape}. For a new object with X = {red, square}, the classification rule must answer Y = ?]
Example
[Figure: building a classification rule from gene expression arrays]
• Learning set
– Objects: arrays
– Feature vectors: gene expression
– Predefined classes (clinical outcome): bad prognosis (recurrence < 5 yrs), good prognosis (recurrence > 5 yrs)
• Classification rule
– Applied to a new array with unknown class (in the figure: good prognosis, metastasis > 5)
Reference
L. van’t Veer et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, Jan.
Design Issues
[Figure: flowchart of design issues for neural network applications. Labels in the figure:]
• Human brain
• Domain knowledge, e.g. biology (molecule, chemistry)
• Problem definition (desired input/output mapping)
• Molecular structure (DNA: ATGCGCTC; Protein: MASSTFYI)
• Training data sets, testing data sets
• Pre-processing: feature representation (knowledge extraction), input encoding
• Neural network: network architecture, learning algorithm, parameter adjustment
• Post-processing: output encoding
• System evaluation
• Applications: sequence discrimination, feature detection, classification, structure prediction
Prediction of Protein 2nd Structures
Adopted from Qian and Sejnowski, 1988
• Sliding window concept
– Take a segment (window) of consecutive residues of the sequence as input
– Predict only for the central position of the window: which
secondary structure (2-D info.) occurs there
[Figure: a sliding window of size w moving along Chain_1, Chain_2, Chain_3, …, with network inputs x1, x2, x3 and outputs y1, y2, y3]
Binary Bit Encoding Method
• Input encoding for each input pattern, e.g.
000001000000000000000
– Unary encoding scheme for protein sequence
– 21 binary bits: one for each of the 20 amino acid types, plus 1 bit
for window positions that overlap a chain terminus
• Input layer with multiple input patterns
– A window of ‘w’ consecutive residues is considered
– ‘21 * w’ units for sequence only
• Output layer with 3 units
– Describes which secondary structure (2-D info.) occurs
(‘1, 0, 0’ for helix, ‘0, 1, 0’ for sheet, ‘0, 0, 1’ for coil)
• One hidden layer for non-linear 2-class pattern
classification
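The unary encoding and the sliding window described above can be sketched as follows. This is an illustrative sketch only: the ordering of the 20 amino acid letters, the use of the last bit for positions beyond a chain end, and the helper names `encode_residue` and `encode_window` are assumptions, not taken from Qian and Sejnowski.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed ordering of the 20 amino acid types
PAD = "-"                             # assumed symbol for positions beyond a chain end
SYMBOLS = AMINO_ACIDS + PAD           # 21 symbols, so 21 bits per residue

def encode_residue(symbol):
    """Unary (one-hot) code: 21 bits with a single 1."""
    bits = [0] * len(SYMBOLS)
    bits[SYMBOLS.index(symbol)] = 1
    return bits

def encode_window(sequence, center, w):
    """Concatenate the codes of the w residues centred on `center` (w must be odd)."""
    half = w // 2
    vector = []
    for pos in range(center - half, center + half + 1):
        inside = 0 <= pos < len(sequence)
        vector.extend(encode_residue(sequence[pos] if inside else PAD))
    return vector

# Window of 5 residues centred on the first residue of the example protein string
x = encode_window("MASSTFYI", center=0, w=5)
print(len(x))  # 105, i.e. 21 * w input units
```

Each window yields one input pattern for the network; the target for that pattern is the structure class of the central residue only.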
More Complex NN Structure: PHD
Multiple sequence alignment (MSA) is a way to compare multiple
sequences; the result is called an alignment profile.
Breakthrough: use evolutionary information from the MSA instead of a single sequence
Adopted from Rost and Sander, 1993
Outline
• Motivation
• To solve one problem in bioinformatics
– Identification of RNA-Interacting Residues in
Protein
• Current projects
Identification of RNA-Interacting
Residues in Protein
• Task
– Predicting putative RNA-interacting sites within a
protein chain
– Given a protein sequence, find the RNA-binding positions (residues)
• Method
– Using a feedforward neural network based on
sequence profiles
– Analyzing and qualifying a large set of the network
weights trained on sequence profiles
Data Generation
• Source: Protein Data Bank (PDB)
– Collect protein-RNA complexes solved by X-ray crystallography at a
resolution of ≤ 3.0 Å
• Remove redundant protein structures with sequence identity
over 70%
• 86 non-homologous protein chains (21990 residues)
• Residues in interaction sites
– The closest distance between atoms of the protein residue and the
partner RNA is less than 7 Å
– Hydrogen bonds, stacking, electrostatic, hydrophobic, and
van der Waals interactions are considered
• Residues in interaction sites: 21.7% (4782)
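The distance criterion above could be applied roughly as follows. This is a sketch under stated assumptions: atom coordinates have already been parsed from the PDB file into lists of (x, y, z) tuples, `label_interaction_sites` is a hypothetical helper, and only the 7 Å cutoff comes from the slide.

```python
import math

CUTOFF = 7.0  # Å, the cutoff given on the slide

def label_interaction_sites(residue_atoms, rna_atoms, cutoff=CUTOFF):
    """Return 1 for each residue whose closest atom-atom distance to the RNA is below `cutoff`.

    residue_atoms: one list of (x, y, z) tuples per residue (hypothetical input format)
    rna_atoms:     list of (x, y, z) tuples for the partner RNA
    """
    labels = []
    for atoms in residue_atoms:
        closest = min(math.dist(a, r) for a in atoms for r in rna_atoms)
        labels.append(1 if closest < cutoff else 0)
    return labels

# Toy coordinates (made up): the first residue is 5 Å from the RNA atom, the second about 20.6 Å
residues = [[(0.0, 0.0, 0.0)], [(20.0, 0.0, 0.0)]]
rna = [(0.0, 5.0, 0.0)]
print(label_interaction_sites(residues, rna))  # [1, 0]
```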
Classifier
[Figure: a window of size w slides along Chain_1, Chain_2, Chain_3, …; input labels x1, x2, x3 with the annotations amino acids, 2D info., and appearance probability; outputs y1, y2]
PSSM
• Position-Specific Iterated BLAST (PSI-BLAST)
– A strong measure of residue conservation at a given location
• Position-specific scoring matrix (PSSM)
– A 20-dimensional vector representing probabilities of conservation
against mutations to 20 different amino acids, including itself
• Positions important to a protein's function tend to be conserved
in the course of evolution
Experimental Results (cont’d)
• Agreement with structural studies of protein-RNA interactions
– Arg, Lys, Ser, Thr, Asp and Glu prefer to take part in hydrogen bonding
– Phe and Ser are frequently found in van der Waals and
stacking interactions
• Some conflicting situations
– Ala, Leu and Val are known to be less-preferred types in interactions
– Asn is typically thought of as one of the most preferred amino acid types in
hydrogen bonding
Adopted from Jeong and Miyano, 2006
Saliency Factor
• Objective: define a matrix that represents the importance of the
presence of specific residues at specific positions
• Step 1: normalization of the weight x_ij for each input unit a_ij,
where w_ijk is the weight between input unit a_ij and hidden unit k

$$x_{ij} = \frac{1}{H} \sum_{k=1}^{H} |w_{ijk}|$$

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
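Step 1 amounts to averaging the absolute input-to-hidden weights over the H hidden units. A minimal sketch, assuming the trained weights are available as a nested list indexed `w[i][j][k]` (window position, residue symbol, hidden unit); the function name and the weight values are made up for illustration.

```python
def normalize_weights(w):
    """x[i][j] = (1/H) * sum_k |w[i][j][k]| for a nested list w of shape M x N x H."""
    return [[sum(abs(w_ijk) for w_ijk in w_ij) / len(w_ij) for w_ij in w_i]
            for w_i in w]

# Made-up weights for M = 1 window position, N = 2 residue symbols, H = 2 hidden units
w = [[[0.5, -1.5], [2.0, 0.0]]]
x = normalize_weights(w)
print(x)  # [[1.0, 1.0]]
```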
Saliency Factor (cont’d)
• Weight conservation: the amount of weight
information represented at each position i in
the given window, defined as the difference
between the maximum entropy and the
entropy of the observed weight distribution

$$R_i = \log X_i - E_i, \quad \text{where } X_i = \sum_{j=1}^{N} x_{ij} \text{ and } E_i = -\sum_{j=1}^{N} \frac{x_{ij}}{X_i} \log \frac{x_{ij}}{X_i}$$

• Saliency factor of residue j at window position i

$$f_{ij} = \frac{x_{ij}}{X_i} R_i$$

• New input

$$p_{ij} = a_{ij} f_{ij}$$

M: the window size, 1 ≤ i ≤ M
N: the # of distinct residue symbols, 1 ≤ j ≤ N
H: the # of hidden units, 1 ≤ k ≤ H
Adopted from Jeong and Miyano, 2006
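The saliency computation can be sketched as below. The sketch follows the slide's formulas as read here: R_i = log X_i - E_i, with E_i taken to be the entropy of the distribution x_ij / X_i, f_ij = (x_ij / X_i) R_i, and p_ij = a_ij f_ij. The sign convention for E_i, the natural logarithm, and the function names are assumptions, since the slide does not state them.

```python
import math

def saliency_factors(x):
    """f[i][j] = (x_ij / X_i) * R_i for an M x N matrix x of normalized weights."""
    f = []
    for x_i in x:
        X_i = sum(x_i)
        p = [x_ij / X_i for x_ij in x_i]                         # observed weight distribution
        E_i = -sum(p_j * math.log(p_j) for p_j in p if p_j > 0)  # entropy, taking 0 * log 0 = 0
        R_i = math.log(X_i) - E_i                                # weight conservation as read from the slide
        f.append([p_j * R_i for p_j in p])
    return f

def weighted_inputs(a, f):
    """New inputs p_ij = a_ij * f_ij."""
    return [[a_ij * f_ij for a_ij, f_ij in zip(a_i, f_i)] for a_i, f_i in zip(a, f)]

# Made-up normalized weights for M = 2 window positions and N = 2 residue symbols
x = [[1.0, 1.0], [3.0, 1.0]]
f = saliency_factors(x)
p = weighted_inputs([[1.0, 0.0], [0.5, 0.5]], f)
print(f)
print(p)
```

Within one window position the factors are proportional to x_ij, so a residue symbol with three times the normalized weight receives three times the saliency.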
Notations
• Four measures are defined:
– True Positive (TP):
the number of correctly predicted interaction sites
– True Negative (TN):
the number of correctly predicted non-interaction sites
– False Positive (FP):
the number of incorrectly predicted interaction sites
– False Negative (FN):
the number of incorrectly predicted non-interaction sites
– Example (1: positive, 0: negative)
0101000010011001111000 ← Observed
1100001110001111110011 ← Predicted
[Figure: positions in the example are marked as TP, TN, FP, or FN]
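The four counts for the Observed/Predicted example can be tallied with a short sketch; the function name `confusion_counts` is made up for illustration.

```python
def confusion_counts(observed, predicted):
    """Count TP, TN, FP, FN for two equal-length strings of '0'/'1' labels."""
    if len(observed) != len(predicted):
        raise ValueError("label strings must have the same length")
    pairs = list(zip(observed, predicted))
    tp = pairs.count(("1", "1"))
    tn = pairs.count(("0", "0"))
    fp = pairs.count(("0", "1"))  # predicted interaction, none observed
    fn = pairs.count(("1", "0"))  # observed interaction, not predicted
    return tp, tn, fp, fn

observed = "0101000010011001111000"
predicted = "1100001110001111110011"
print(confusion_counts(observed, predicted))  # (6, 6, 7, 3)
```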
Measuring Performance
• Total accuracy: $(TP + TN) / (TP + TN + FP + FN)$
– Fraction of all sites, interaction and non-interaction, that are predicted correctly
• Accuracy (Specificity): $TP / (TP + FP)$
– The fraction of predicted interaction sites that are correct
• Coverage (Sensitivity): $TP / (TP + FN)$
– The fraction of true interaction sites that are predicted
• Matthews correlation coefficient (MCC):
– Takes into account both under- and over-predictions
– Ranges between 1 (perfect prediction) and -1 (completely wrong prediction)

$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
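The four measures can be computed from the counts as in the sketch below. The function name is made up, and the counts TP = 6, TN = 6, FP = 7, FN = 3 are those tallied by hand from the Observed/Predicted strings on the Notations slide.

```python
import math

def performance(tp, tn, fp, fn):
    """Total accuracy, accuracy (TP/(TP+FP)), coverage (TP/(TP+FN)), and MCC.

    Assumes TP + FP > 0 and TP + FN > 0 so the two ratios are defined.
    """
    total = tp + tn + fp + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "total_accuracy": (tp + tn) / total,
        "accuracy": tp / (tp + fp),
        "coverage": tp / (tp + fn),
        # By convention MCC is set to 0 when the denominator is 0 (an assumption here)
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

m = performance(tp=6, tn=6, fp=7, fn=3)
for name, value in m.items():
    print(f"{name}: {value:.3f}")
```

For these counts the MCC is 15/117, about 0.128, which shows how a prediction with a total accuracy above 50% can still correlate only weakly with the observed labels.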
Receiver Operating Characteristic (ROC) Curve
[Figure: ROC curves comparing our method with ATGpr]
Experimental Results
Adopted from Jeong and Miyano, 2006
Experimental Results (cont’d)
Adopted from Jeong and Miyano, 2006
Experimental Results (cont’d)
[Figure legend: interaction, overpredicted, underpredicted, non-interaction]
Adopted from Jeong and Miyano, 2006
References
• E. Jeong, I-F. Chung, and S. Miyano, “Prediction of Residues in Protein-RNA
Interaction Sites by Neural Networks,” Proc. of the 14th International
Conference on Genome Informatics, pp. 506-507, 2003.
• E. Jeong, I-F. Chung, and S. Miyano, “A Neural Network Method for
Identification of RNA-Interacting Residues in Protein,” Proc. of the 4th
International Workshop on Bioinformatics and Systems Biology, pp. 105-116,
2004.
• E. Jeong and S. Miyano, “A weighted profile based method for protein-RNA
interacting residue prediction,” Trans. on Comput. Syst. Biol., IV, LNBI 3939,
pp. 123-139, 2006.
Current Projects
• To discover the relationship between protein sequence
and protein structure
– To identify RNA-interacting residues in proteins
– To perform protein metal-binding residue prediction
– To predict phosphorylation sites
• Microarray data analysis
– Significant gene selection, clustering, classification
• Prediction of polymorphic short tandem repeats
Mini-Workshop: Knowledge Discovery
Techniques for Bioinformatics
Dr. Limsoon Wong
Hierarchy of Protein Structure
2nd structure prediction
3rd structure prediction
Protein Secondary Structures
Alpha helix
Parallel beta sheet
Anti-parallel beta sheet
loop