Essential Bioinformatics and Biocomputing (LSM2104

Transcript Essential Bioinformatics and Biocomputing (LSM2104

Lecture 3: Protein Families
and Family Prediction Methods
Prof. Chen Yu Zong
Tel: 6874-6877
Email: [email protected]
http://xin.cz3.nus.edu.sg
Room 07-24, level 7, SOC1,
National University of Singapore
Protein Evolution:
SARS coronavirus as an example
2
SARS Coronavirus
A novel coronavirus
Identified as
the cause of
severe respiratory
syndrome (SARS )
3
SARS Infection
How SARS
coronavirus
enters
a cell and
reproduce
4
Protein Evolution
Generation of different species
5
Protein Families
• Sequence alignment-based families.
– Based on Principle of Sequence-structure-function-relationship.
– Derived by multiple sequence alignment
– Database: PFAM (Nucleic Acids Res. 30:276-280)
• Structure-based families.
– Derived by visual inspection and comparison of structures
– Database: SCOP (J. Mol. Biol. 247, 536-540)
• Functional Families.
– Databases:
• G-protein coupled receptors: GPCRDB (Nucleic Acids Res. 29: 346349), ORDB (Nucleic Acids Res. 30:354-360)
• Nuclear receptors: NucleaRDB (Nucleic Acids Res. 29: 346-349)
• Enzymes: BRENDA (Nucleic Acids Res. 30, 47-49)
• Transporters: TC-DB (Microbiol Mol Biol Rev. 64:354-411)
• Ligand-gated ion channels: LGICdb (Nucleic Acids Res. 29: 294295)
• Therapeutic targets: TTD (Nucleic Acids Res. 30, 412-415)
• Drug side-effect targets: DART (Drug Safety 26: 685-690)
6
Protein Families
Sequence families
=\= Structural families
=\= Functional families
Sequence similar, structure different
Sequence different, structure similar
Sequence similar, function different (distantly related proteins)
Sequence different, function similar
Homework: find examples
7
Protein Family Prediction Methods
Sequence alignment-based families:
• Multiple sequence alignment (HMM): HMMER;
JMB 235, 1501-153; JMB 301, 173-190
Structure-based families:
• Visual inspection and comparison of structures
Functional Families.
• Statistical learning methods:
– Neural network: ProtFun (Bioinformatics, 19:635-642)
– Support vector machines: SVMProt (Nucleic Acids Res., 31: 3692-3697)
8
Sequence Comparison as a
Mathematical Problem:
Example:
Sequence a: ATTCTTGC
Sequence b: ATCCTATTCTAGC
Best Alignment:
ATTCTTGC
ATCCTATTCTAGC
/|\
gap
Bad Alignment:
AT TCTT
GC
ATCCTATTCTAGC
/|\
/|\
gap
gap
Construction of many alignments => which is the best?
9
How to rate an alignment?
• Match: +8 (w(x, y) = 8, if x = y)
• Mismatch: -5 (w(x, y) = -5, if x ≠ y)
• Each gap symbol: -3 (w(-,x)=w(x,-)=-3)
C - - - T T A A C T
C G G A T C A - - T
+8 -3
-3
-3 +8 -5 +8 -3
-3
+8 = +12
Alignment score
10
Alignment Graph
Sequence a: CTTAACT
Sequence b: CGGATCAT
C G G A
C
T
T
C
A
T
C---TTAACT
CGGATCA--T
T
A
A
C
T
11
An optimal alignment
-- the alignment of maximum score
• Let A=a1a2…am and B=b1b2…bn .
• Si,j: the score of an optimal alignment
between
a1a2…ai and b1b2…bj
• With proper initializations, Si,j can be
computed
si 1, j  w(ai ,)
as follows.

si , j  maxsi , j 1  w(, b j )
s
 i 1, j 1  w(ai , b j )
12
Computing Si,j
j
w(ai,bj)
w(ai,-)
i
w(-,bj)
Sm,n
13
Initializations
0
C
-3
T
-6
T
-9
C
G
G
A
T
C
A
T
-3
-6
-9 -12 -15 -18 -21 -24
A -12
A -15
C -18
T -21
14
S3,5 = ？
C
G
G
A
T
C
A
T
0
-3
-6
-9
-12 -15 -18 -21 -24
C
-3
8
5
2
-1
-4
-7
T
-6
5
3
0
-3
7
4
T
-9
2
0
-2
-5
?
-10 -13
1
-2
A -12
A -15
C -18
T -21
15
S3,5 = ？
C
G
G
A
T
C
A
T
0
-3
-6
-9
-12 -15 -18 -21 -24
C
-3
8
5
2
-1
-4
-7
T
-6
5
3
0
-3
7
4
1
-2
T
-9
2
0
-2
-5
5
-1
-4
9
A -12 -1
-3
-5
6
3
0
7
6
A -15 -4
-6
-8
3
1
-2
8
5
C -18 -7
-9
-11
0
-2
9
6
3
T -21 -10 -12 -14 -3
8
6
4
14
-10 -13
optimal
score
16
C T T A A C – T
C G G A T C A T
8 – 5 –5 +8 -5 +8 -3 +8 = 14
C G G A
T
C
A
T
0
-3
-6
-9
-12 -15 -18 -21 -24
C
-3
8
5
2
-1
-4
-7
T
-6
5
3
0
-3
7
4
1
-2
T
-9
2
0
-2
-5
5
-1
-4
9
A -12 -1
-3
-5
6
3
0
7
6
A -15 -4
-6
-8
3
1
-2
8
5
C -18 -7
-9
-11
0
-2
9
6
3
T -21 -10 -12 -14 -3
8
6
4
14
-10 -13
17
Global Alignment vs. Local Alignment
• global alignment:
• local alignment:
18
An optimal local alignment
• Si,j: the score of an optimal local alignment
ending at ai and bj
• With proper initializations, Si,j can be
computed
as follows. 0
si , j
s

w
(
a
,

)
i

1
,
j
i

 maxsi , j 1  w(, b j )
s
i 1, j 1  w( ai , b j )


19
local alignment
Match: 8
Mismatch: -5
Gap symbol: -3
0
C
0
G
0
G
0
A
0
T
0
C A T
0 0 0
C
0
8
5
2
0
0
8
5
2
T
0
5
3
0
0
8
5
3
13
T
0
2
0
0
0
8
5
2
11
A
0
0
0
0
8
5
3
?
A
0
C
0
T
0
20
A – C - T
A T C A T
8-3+8-3+8 = 18
C G
0 0 0
local alignment
G
0
A
0
T
0
C A T
0 0 0
C
0
8
5
2
0
0
8
5
2
T
0
5
3
0
0
8
5
3
13
T
0
2
0
0
0
8
5
2
11
A
0
0
0
0
8
5
3
13 10
A
0
0
0
0
8
5
2
11
8
C
0
8
5
2
5
3
13 10
7
T
0
5
3
0
2
13 10
8
The
best
score
18
21
Multiple sequence alignment (MSA)
• The multiple sequence alignment problem is to
simultaneously align more than two sequences.
Seq1: GCTC
GC-TC
Seq2: AC
A---C
Seq3: GATC
G-ATC
22
How to score an MSA?
• Sum-of-Pairs (SP-score)
GC-TC
Score
GC-TC
Score
A---C
G-ATC
+
A---C
GC-TC
= Score G-ATC
+
A---C
Score
G-ATC
23
Functional Classification by SVM
• A protein is classified as either belong (+) or not belong (-) to a
functional family
• By screening against all families, the function of this protein can be
identified (example: SVMProt)
• What is SVM? Support vector machines, a machine learning method,
learning by examples, statistical learning, classify objects into one of
the two classes.
• Advantage of SVM: Diversity of class members (no racial
discrimination). Use of sequence-derived physico-chemical features
as basis for classification. Suitable for functional family
classifications.
24
SVM References
• C. Burges, "A tutorial on support vector machines for pattern
recognition", Data Mining and Knowledge Discovery, Kluwer
Academic Publishers,1998 (on-line).
• R. Duda, P. Hart, and D. Stork, Pattern Classification, John-Wiley,
2nd edition, 2001 (section 5.11, hard-copy).
• S. Gong et al. Dynamic Vision: From Images to Face Recognition,
Imperial College Pres, 2001 (sections 3.6.2, 3.7.2, hard copy).
• Online lecture notes
25
Goal:
Introduction to Machine Learning
To “improve” (gaining knowledge, enhancing
computing capability)
Tasks:
•Forming concepts by data generalization.
•Compiling knowledge into compact form
•Finding useful explanations for valid concepts.
•Clustering data into classes.
Reference:
Machine Learning in Molecular Biology Sequence Analysis .
Internet links:
http://www.ai.univie.ac.at/oefai/ml/ml-resources.html
26
Introduction to Machine Learning
Category:
•
Inductive learning.
•
•
Analytic learning.
•
•
Use of existing knowledge to derive new useful concepts
(explanation based learning).
Connectionist learning.
•
•
Forming concepts from data without a lot of knowledge from
domain (learning from examples).
Use of artificial neural networks in searching for or representing
of concepts.
Genetic algorithms.
•
To search for the most effective concept by means of Darwin’s
“survival of the fittest” approach.
27
Machine Learning Methods
Inductive learning:
Concept learning and example-based learning
Concept
learning:
28
Machine Learning Methods
Analytic
learning:
29
Machine Learning Methods
Neural network:
30
Machine Learning Methods
Genetic algorithms:
Pattern
Strength
Classification
31
32
SVM
33
SVM
34
SVM
35
SVM
36
SVM
37
SVM
38
SVM
39
SVM
40
SVM
41
SVM
42
SVM
43
SVM for Classification of Proteins
How to represent a protein?
• Each sequence represented by specific feature vector assembled
from encoded representations of tabulated residue properties:
– amino acid composition
– Hydrophobicity
– normalized Van der Waals volume
– polarity,
– Polarizability
– Charge
– surface tension
– secondary structure
– solvent accessibility
• Three descriptors, composition (C), transition (T), and distribution
(D), are used to describe global composition of each of these
properties.
Nucleic Acids Res., 31: 3692-369744
SVM for Classification of Proteins
Descriptors for amino acid composition of protein:
C=(53.33, 46.67)
T=(51.72)
D=(3.33, 16.67, 40.0, 66.67, 96.67, 6.67, 26.67, 60.0, 76.67, 100.0)
Nucleic Acids Res., 31: 3692-3697
45

Essential Bioinformatics and Biocomputing (LSM2104

Transcript Essential Bioinformatics and Biocomputing (LSM2104

Directory