Machine Learning - Institute of Microbial Technology

Support Vector Machine and its
Applications
G P S Raghava
Institute of Microbial Technology
Sector 39A, Chandigarh, India
Email: [email protected]
Web: http://www.imtech.res.in/raghava/
http://crdd.osdd.net/
1
Why Machine Learning ?
• Similarity based methods
• Linear separations
• Statistical methods (static)
• Unable to handle non-linear data
2
Supervised & Unsupervised
• Learn an unknown function f(X) = Y, where X
is an input example and Y is the desired
output.
• Supervised learning implies we are given a
training set of (X, Y) pairs by a “teacher”
• Unsupervised learning means we are only
given the Xs and some (ultimate) feedback
function on our performance.
3
Concept learning or classification
• Given a set of examples of some
concept/class/category, determine if a given example
is an instance of the concept or not
• If it is an instance, we call it a positive example
• If it is not, it is called a negative example
• Or we can make a probabilistic prediction (e.g.,
using a Bayes net)
4
Supervised concept learning
• Given a training set of positive and negative
examples of a concept
• Construct a description that will accurately classify
whether future examples are positive or negative
• That is, learn some good estimate of function f given
a training set {(x1, y1), (x2, y2), ..., (xn, yn)} where each
yi is either + (positive) or - (negative), or a probability
distribution over +/-
5
Major Machine Learning Techniques
• Artificial Neural Networks
• Hidden Markov Model
• Nearest Neighbour Methods
• Support Vector Machines
6
Introduction to Neural Networks
• Neural network: information processing paradigm inspired by
biological nervous systems, such as our brain
• Structure: large number of highly interconnected processing
elements (neurons) working together
• Like people, they learn from experience (by example)
• Neural networks are configured for a specific application
7
Neural networks to the rescue
• Neural networks are configured for a
specific application, such as pattern
recognition or data classification, through
a learning process
• In a biological system, learning involves
adjustments to the synaptic connections
between neurons
→ The same holds for artificial neural networks (ANNs)
8
Where can neural network systems help
• when we can't formulate an algorithmic
solution.
• when we can get lots of examples of the
behavior we require.
‘learning from experience’
• when we need to pick out the structure
from existing data.
9
Mathematical representation
The neuron calculates a weighted sum of inputs and
compares it to a threshold. If the sum is higher than
the threshold, the output is set to 1, otherwise to -1.
Non-linearity
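The threshold rule described above can be sketched in a few lines of Python. This is only an illustration: the function name `threshold_neuron` and the input, weight, and threshold values are assumptions chosen for the example, not values from the slides.

```python
from typing import Sequence


def threshold_neuron(inputs: Sequence[float],
                     weights: Sequence[float],
                     threshold: float) -> int:
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else -1."""
    if len(inputs) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return 1 if weighted_sum > threshold else -1


# Illustrative values only (hypothetical, not from the slides).
weights = [0.5, 0.9, 0.25]
print(threshold_neuron([1.0, 0.0, 1.0], weights, threshold=0.7))  # weighted sum is 0.75
print(threshold_neuron([0.0, 0.0, 1.0], weights, threshold=0.7))  # weighted sum is 0.25
```

The hard step from -1 to 1 is what makes the unit non-linear: a small change in the weighted sum near the threshold flips the output.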
10
Artificial Neural Networks
• Layers of nodes
  • Input is transformed into numbers
  • Weighted averages are fed into nodes
• High or low numbers come out of nodes
  • A threshold function determines whether high or low
• Output nodes will “fire” or
11
A widely used machine learning
approach: Markov models
• Markov chain models (1st order, higher order and
inhomogeneous models; parameter estimation; classification)
• Interpolated Markov models (and back-off models)
• Hidden Markov models (forward, backward and Baum-Welch
algorithms; model topologies; applications to gene
finding and protein family modeling)
13
Example of Markov Model
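[Figure: two-state diagram with states ‘Rain’ and ‘Dry’; arrows labelled with the transition probabilities 0.3, 0.7, 0.2 and 0.8]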
• Two states : ‘Rain’ and ‘Dry’.
• Transition probabilities: P(‘Rain’|‘Rain’)=0.3 ,
P(‘Dry’|‘Rain’)=0.7 , P(‘Rain’|‘Dry’)=0.2,
P(‘Dry’|‘Dry’)=0.8
• Initial probabilities: say P(‘Rain’)=0.4 , P(‘Dry’)=0.6 .
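As a worked example, the probability of a particular state sequence under this model is the initial probability of the first state multiplied by the transition probabilities along the sequence. The sketch below uses the numbers from the slide; the function name `sequence_probability` and the chosen sequence are illustrative assumptions.

```python
# Parameters copied from the slide.
initial = {"Rain": 0.4, "Dry": 0.6}
transition = {  # transition[previous][next] = P(next | previous)
    "Rain": {"Rain": 0.3, "Dry": 0.7},
    "Dry": {"Rain": 0.2, "Dry": 0.8},
}


def sequence_probability(states):
    """Probability of observing the given state sequence in a 1st-order Markov chain."""
    if not states:
        raise ValueError("state sequence must not be empty")
    prob = initial[states[0]]
    for previous, current in zip(states, states[1:]):
        prob *= transition[previous][current]
    return prob


# P(Dry, Dry, Rain, Rain) = 0.6 * 0.8 * 0.2 * 0.3
p = sequence_probability(["Dry", "Dry", "Rain", "Rain"])
print(f"{p:.4f}")  # 0.0288
```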
14
k Nearest-Neighbors Problem
• Example-based learning
• Weight for examples
• Closest examples for decision
• Time consuming
• Fails in the absence of sufficient examples
• Performance depends on closeness
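A minimal sketch of the idea, assuming Euclidean distance and an unweighted majority vote (the slides do not specify either choice); the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k closest training examples.

    `train` is a list of (feature_vector, label) pairs.
    """
    if not 1 <= k <= len(train):
        raise ValueError("k must be between 1 and the number of training examples")
    # Every prediction scans the whole training set, which is why the
    # method becomes time consuming on large datasets.
    nearest = sorted(train, key=lambda example: math.dist(example[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Toy data for illustration only.
train = [
    ((1.0, 1.0), "+"), ((1.2, 0.8), "+"), ((0.9, 1.1), "+"),
    ((5.0, 5.0), "-"), ((5.2, 4.9), "-"),
]
print(knn_predict(train, (1.1, 1.0), k=3))  # the three nearest examples are all "+"
```

With only two negative examples in the toy set, a large k would always vote "+", which illustrates why the method fails when sufficient examples are absent.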
15
SVM: Support Vector Machine
• Support vector machines (SVM) are a group of
supervised learning methods that can be
applied to classification or regression. Support
vector machines represent an extension to
nonlinear models of the generalized portrait
algorithm developed by Vladimir Vapnik. The
SVM algorithm is based on statistical
learning theory and the Vapnik-Chervonenkis
(VC) dimension introduced by Vladimir Vapnik
and Alexey Chervonenkis in the early 1970s; the
nonlinear (kernel) formulation dates from 1992.
Classification Margin
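• Distance from an example x to the separator is $r = \frac{w^T x + b}{\lVert w \rVert}$
• Examples closest to the hyperplane are support vectors.
• Margin ρ of the separator is the width of separation between classes.
[Figure: separating hyperplane with the distance r and the margin ρ marked]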
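The distance of an example from the separating hyperplane, r = (wᵀx + b) / ‖w‖, can be checked numerically. The sketch below is illustrative: the function name `signed_distance` and the values of w, b and x are assumptions picked so the arithmetic is easy to follow.

```python
import math


def signed_distance(w, b, x):
    """Signed distance of point x from the hyperplane w.x + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0:
        raise ValueError("w must not be the zero vector")
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0, so ||w|| = 5.
w, b = [3.0, 4.0], -5.0
print(signed_distance(w, b, [3.0, 4.0]))  # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(w, b, [0.0, 0.0]))  # -5 / 5 = -1.0
```

The sign tells which side of the hyperplane the example lies on; the support vectors are the examples with the smallest absolute distance.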
17
Maximum Margin Classification
• Maximizing the margin is good
• Implies that only support vectors are important;
other training examples are ignorable.
18
SVM implementations
• SVMlight
  • Simple text data format
  • Fast, C routines
• bsvm
  • Multi-class support
• LIBSVM
  • GUI: svm-toy
• SMO
  • Less optimization
  • Fast
  • Implemented in Weka
Differences: available kernel functions, optimization, multi-class support, user interfaces
Subcellular Locations
PREDICTION OF PROTEINS TO BE LOCALIZED IN MITOCHONDRIA (MITPRED)
Mitochondrial located proteins (positive dataset)
>Mit1
DRLVRGFYFLLRRMV
SHNTVSQVWFGHRYS
Non-mitochondrial located proteins (negative dataset)
>Non-Mit1
KNRNTKVGSDRLVRG
WFGHRYSMVHS
fasta2sfasta.pl program
>Mit1 ##DRLVRGFYFLLRRMVSHNTVSQVWFGHRYS
>Mit2 ##RMVKNRNTKVGDRLVRGFYFLLRR
>Non-Mit1 ##KNRNTKVGSDRLVRGWFGHRYSMVHS
>Non-Mit2 ##LVRGFYFLLRRMVKNRNSHRVSQ
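The conversion from multi-line FASTA to one record per line can be sketched in Python. This is not the fasta2sfasta.pl script itself: the function name `fasta_to_single_line` is hypothetical, and the `>name ##SEQUENCE` layout is assumed from the example output shown on the slide.

```python
def fasta_to_single_line(fasta_text):
    """Collapse multi-line FASTA records into '>name ##SEQUENCE' strings."""
    records = []
    name, chunks = None, []
    for raw_line in fasta_text.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                sequence = "".join(chunks)
                records.append(f">{name} ##{sequence}")
            name, chunks = line[1:], []
        elif name is not None:
            chunks.append(line)
    if name is not None:
        sequence = "".join(chunks)
        records.append(f">{name} ##{sequence}")
    return records


fasta = """>Mit1
DRLVRGFYFLLRRMV
SHNTVSQVWFGHRYS
"""
for record in fasta_to_single_line(fasta):
    print(record)  # >Mit1 ##DRLVRGFYFLLRRMVSHNTVSQVWFGHRYS
```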
pro2aac.pl program
# Amino Acid Composition of Mit proteins
#A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
0.0,0.0,3.3,0.0,10.0,6.7,6.7,0.0,0.0,10.0,3.3,3.3,0.0,3.3,16.7,10.0,3.3,13.3,3.3,6.7
0.0,0.0,4.2,0.0,8.3,8.3,0.0,0.0,8.3,12.5,4.2,8.3,0.0,0.0,25.0,0.0,4.2,12.5,0.0,4.2
# Amino Acid Composition of Non-Mit proteins
#A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y
0.0,0.0,3.8,0.0,3.8,11.5,7.7,0.0,7.7,3.8,3.8,7.7,0.0,0.0,15.4,11.5,3.8,11.5,3.8,3.8
0.0,0.0,0.0,0.0,8.7,4.3,4.3,0.0,4.3,13.0,4.3,8.7,0.0,4.3,21.7,8.7,0.0,13.0,0.0,4.3
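Amino acid composition is the percentage of each of the 20 residues in a sequence. The sketch below is a stand-in for pro2aac.pl, not its actual code: the function name `amino_acid_composition` is hypothetical, and it assumes percentages are taken over the full sequence length and rounded to one decimal, which reproduces the Mit1 row shown above.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # column order used on the slide


def amino_acid_composition(sequence):
    """Return the percentage of each of the 20 amino acids, rounded to one decimal."""
    sequence = sequence.upper()
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [round(100.0 * sequence.count(aa) / length, 1) for aa in AMINO_ACIDS]


# Mit1 sequence from the slide (30 residues).
composition = amino_acid_composition("DRLVRGFYFLLRRMVSHNTVSQVWFGHRYS")
print(",".join(f"{value:.1f}" for value in composition))
```

For Mit1, arginine (R) occurs 5 times in 30 residues, giving 16.7, which matches the fifteenth column of the first row.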
col2svm.pl program
-1 1:0.0 2:0.0 3:3.8 4:0.0 5:3.8 6:11.5 7:7.7 8:0.0 9:7.7 10:3.8 11:3.8 12:7.7 13:0.0 14:0.0 15:15.4 16:11.5 17:3.8 18:11.5 19:3.8 20:3.8
-1 1:0.0 2:0.0 3:0.0 4:0.0 5:8.7 6:4.3 7:4.3 8:0.0 9:4.3 10:13.0 11:4.3 12:8.7 13:0.0 14:4.3 15:21.7 16:8.7 17:0.0 18:13.0 19:0.0 20:4.3
+1 1:0.0 2:0.0 3:3.3 4:0.0 5:10.0 6:6.7 7:6.7 8:0.0 9:0.0 10:10.0 11:3.3 12:3.3 13:0.0 14:3.3 15:16.7 16:10.0 17:3.3 18:13.3 19:3.3 20:6.7
+1 1:0.0 2:0.0 3:4.2 4:0.0 5:8.3 6:8.3 7:0.0 8:0.0 9:8.3 10:12.5 11:4.2 12:8.3 13:0.0 14:0.0 15:25.0 16:0.0 17:4.2 18:12.5 19:0.0 20:4.2
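Each line of this format is a class label followed by `index:value` pairs, with indices starting at 1. A minimal sketch of the conversion, assuming a hypothetical helper `to_svmlight_line` (this is not the col2svm.pl script):

```python
def to_svmlight_line(label, features):
    """Format one example as '<label> 1:<v1> 2:<v2> ...' (1-based feature indices)."""
    if label not in (1, -1):
        raise ValueError("label must be +1 or -1 for binary classification")
    pairs = " ".join(f"{index}:{value}" for index, value in enumerate(features, start=1))
    return f"{label:+d} {pairs}"


# First three composition values of a positive example, for illustration.
print(to_svmlight_line(+1, [0.0, 0.0, 3.3]))  # +1 1:0.0 2:0.0 3:3.3
```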
SVM-input file
+1 1:0.0 2:0.0 3:3.3 4:0.0 5:10.0 6:6.7 7:6.7 8:0.0 9:0.0 10:10.0 11:3.3 12:3.3 13:0.0 14:3.3 15:16.7 16:10.0 17:3.3 18:13.3 19:3.3 20:6.7
+1 1:0.0 2:0.0 3:4.2 4:0.0 5:8.3 6:8.3 7:0.0 8:0.0 9:8.3 10:12.5 11:4.2 12:8.3 13:0.0 14:0.0 15:25.0 16:0.0 17:4.2 18:12.5 19:0.0 20:4.2
-1 1:0.0 2:0.0 3:3.8 4:0.0 5:3.8 6:11.5 7:7.7 8:0.0 9:7.7 10:3.8 11:3.8 12:7.7 13:0.0 14:0.0 15:15.4 16:11.5 17:3.8 18:11.5 19:3.8 20:3.8
-1 1:0.0 2:0.0 3:0.0 4:0.0 5:8.7 6:4.3 7:4.3 8:0.0 9:4.3 10:13.0 11:4.3 12:8.7 13:0.0 14:4.3 15:21.7 16:8.7 17:0.0 18:13.0 19:0.0 20:4.3
Test file
Training file
+1 1:0.0 2:0.0 3:4.2 4:0.0 5:8.3 6:8.3 7:0.0 8:0.0 9:8.3 10:12.5 11:4.2 12:8.3 13:0.0 14:0.0 15:25.0 16:0.0 17:4.2 18:12.5 19:0.0 20:4.2
-1 1:0.0 2:0.0 3:3.8 4:0.0 5:3.8 6:11.5 7:7.7 8:0.0 9:7.7 10:3.8 11:3.8 12:7.7 13:0.0 14:0.0 15:15.4 16:11.5 17:3.8 18:11.5 19:3.8 20:3.8
+1 1:0.0 2:0.0 3:3.3 4:0.0 5:10.0 6:6.7 7:6.7 8:0.0 9:0.0 10:10.0 11:3.3 12:3.3 13:0.0 14:3.3 15:16.7 16:10.0 17:3.3 18:13.3 19:3.3 20:6.7
-1 1:0.0 2:0.0 3:0.0 4:0.0 5:8.7 6:4.3 7:4.3 8:0.0 9:4.3 10:13.0 11:4.3 12:8.7 13:0.0 14:4.3 15:21.7 16:8.7 17:0.0 18:13.0 19:0.0 20:4.3
svm_learn training-file model
svm_classify test-file model result
The result file contains one numeric value per test example. By varying the
threshold applied to this value, we can evaluate the model's performance.
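Evaluating performance at several thresholds can be sketched as follows. The scores and labels below are made up for illustration (in practice the scores would be read from the result file, one value per line), and the function names are hypothetical. Sensitivity and specificity are used here as the performance measures, which is an assumption; the slides do not name a specific measure.

```python
def confusion_at_threshold(scores, labels, threshold):
    """Count TP, FP, TN, FN when scores above the threshold are called positive."""
    tp = fp = tn = fn = 0
    for score, label in zip(scores, labels):
        predicted_positive = score > threshold
        if predicted_positive and label == 1:
            tp += 1
        elif predicted_positive and label == -1:
            fp += 1
        elif label == -1:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def sensitivity_specificity(scores, labels, threshold):
    """Return (sensitivity, specificity) at the given threshold."""
    tp, fp, tn, fn = confusion_at_threshold(scores, labels, threshold)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return sensitivity, specificity


# Hypothetical classifier outputs and true labels, for illustration only.
scores = [1.2, 0.4, -0.1, -0.8, 0.3, -1.5]
labels = [1, 1, 1, -1, -1, -1]
for threshold in (-0.5, 0.0, 0.5):
    sens, spec = sensitivity_specificity(scores, labels, threshold)
    print(f"threshold={threshold:+.1f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Raising the threshold trades sensitivity for specificity, so reporting both over a range of thresholds gives a fuller picture than a single cut-off.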
SVM_light training/testing pattern
Output Input (frequency)
0.902 1:3 2:8 3:6 4:4 5:0 6:0 7:2
0.897 1:3 2:5 3:6 4:7 5:0 6:0 7:2
0.545 1:3 2:7 3:5 4:6 5:0 6:0 7:2
0.850 1:6 2:4 3:6 4:5 5:2 6:0 7:1
0.408 1:6 2:9 3:2 4:4 5:3 6:2 7:1
0.019 1:4 2:8 3:4 4:5 5:1 6:1 7:1
0.834 1:3 2:7 3:2 4:9 5:0 6:1 7:1
0.323 1:3 2:9 3:3 4:6 5:0 6:2 7:1
0.862 1:8 2:2 3:5 4:6 5:4 6:0 7:2
0.284 1:9 2:2 3:3 4:7 5:4 6:0 7:1
1.341 1:5 2:6 3:4 4:6 5:2 6:0 7:1
svm_learn train.svm model
svm_classify test model output
Options
-z c  for classification
-z r  for regression
-t 0  linear kernel
-t 1  polynomial kernel
-t 2  RBF kernel
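When many models are trained with different options, it can help to assemble the command lines programmatically. The sketch below only builds the argument lists and does not run anything; the helper names are hypothetical, and it assumes the `svm_learn` and `svm_classify` argument order shown on the slide, with options placed before the file names.

```python
def build_svm_learn_command(train_file, model_file, task="c", kernel=0):
    """Build the argument list for svm_learn using the -z and -t options above."""
    if task not in ("c", "r"):
        raise ValueError("task must be 'c' (classification) or 'r' (regression)")
    if kernel not in (0, 1, 2):
        raise ValueError("kernel must be 0 (linear), 1 (polynomial) or 2 (RBF)")
    return ["svm_learn", "-z", task, "-t", str(kernel), train_file, model_file]


def build_svm_classify_command(test_file, model_file, output_file):
    """Build the argument list for svm_classify."""
    return ["svm_classify", test_file, model_file, output_file]


command = build_svm_learn_command("train.svm", "model", task="r", kernel=2)
print(" ".join(command))  # svm_learn -z r -t 2 train.svm model
```

If the binaries are installed and on the PATH, a list built this way could be passed to `subprocess.run(command, check=True)`.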
Important Points in Developing New
Method
• Importance of problem
• Acceptable dataset
• Dataset should be large
• Recently used in any other study
• Realistic, balanced & independent
• Level of redundancy
• Develop standalone and/or web service
• Cross-validation (Benchmarking)
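Cross-validation splits the dataset into k parts and uses each part once for testing while training on the rest. A minimal sketch of the index bookkeeping, assuming 5 folds as the default (the slides do not state k) and a hypothetical helper `k_fold_indices`:

```python
def k_fold_indices(n_examples, k=5):
    """Return a list of (train_indices, test_indices) pairs for k-fold cross-validation."""
    if not 2 <= k <= n_examples:
        raise ValueError("k must be between 2 and the number of examples")
    # Assign example i to fold i % k; no shuffling or class stratification here.
    folds = [list(range(start, n_examples, k)) for start in range(k)]
    splits = []
    for test_fold, test_indices in enumerate(folds):
        train_indices = [
            index
            for fold_number, fold in enumerate(folds)
            if fold_number != test_fold
            for index in fold
        ]
        splits.append((train_indices, test_indices))
    return splits


for train_indices, test_indices in k_fold_indices(10, k=5):
    print("test:", test_indices, "train size:", len(train_indices))
```

For real datasets the examples would normally be shuffled first, and redundancy between folds (highly similar sequences in training and test sets) should be controlled, as the points above suggest.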
24
Important Points in Developing New
Method (Cont.)
• Integrate BLAST with ML techniques
• Using PSIBLAST profile
• Discover exclusive domain/motif present or
absent in proteins.
• Features from proteins (fixed length pattern)
• Amino acid composition (split composition)
• Dipeptide composition (higher order)
• Pseudo amino acid composition
• PSSM composition
• Select significant compositions
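Dipeptide composition extends amino acid composition to adjacent residue pairs, giving a fixed-length vector of 400 values for a protein of any length. A minimal sketch, assuming percentages over all overlapping pairs; the function name `dipeptide_composition` is hypothetical.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]  # 400 pairs


def dipeptide_composition(sequence):
    """Return the percentage of each of the 400 dipeptides among all overlapping pairs."""
    sequence = sequence.upper()
    total_pairs = len(sequence) - 1
    if total_pairs < 1:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for position in range(total_pairs):
        pair = sequence[position:position + 2]
        if pair in counts:  # pairs with non-standard residues are skipped
            counts[pair] += 1
    return [100.0 * counts[dipeptide] / total_pairs for dipeptide in DIPEPTIDES]


vector = dipeptide_composition("RMVKNRNTKVGDRLVRGFYFLLRR")
print(len(vector))  # 400
```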
25
Important Information in Manual for
Developers
26
Creation of Pattern
• Fix the length of pattern
  • For example protein (composition)
  • Represent segment by vector
Example of Features generation
29
GPSR: A Resource for Genomics Proteomics
and Systems Biology
Small programs as building unit
• Why PERL?
• Why not BioPerl?
• Why not PERL modules?
• Advantage of independent programs
  • Language independent
  • Can be run independently
31
Modelling of Immune System for Designing Epitope-based Vaccines
Adaptive Immunity (Cellular Response): T-helper Epitopes
Propred: for promiscuous MHC II binders
MMBpred: for high affinity mutated binders
MHC2pred: SVM based method
MHCBN: A database of MHC/TAP binders and non-binders
Pcleavage: for proteasome cleavage sites
Adaptive Immunity (Cellular Response): CTL Epitopes
Adaptive Immunity (Humoral Response): B-cell Epitopes
Innate Immunity: Pathogen Recognizing Receptors and ligands
Signal transduction in Immune System
TAPpred: for predicting TAP binders
Propred1: for promiscuous MHC I binders
CTLpred: Prediction of CTL epitopes
BCIpep: A database of B-cell epitopes
ABCpred: for predicting B-cell epitopes
ALGpred: for allergens and IgE epitopes
HaptenDB: A database of haptens
PRRDB: A database of PRRs & ligands
Antibp: for anti-bacterial peptides
Cytopred: for classification of Cytokines
35
Computer-Aided Drug Discovery
Searching Drug Targets: Bioinformatics
Comparative genomics
Genome Annotation
FTGpred: Prediction of Prokaryotic genes
EGpred: Prediction of eukaryotic genes
GeneBench: Benchmarking of gene finders
SRF: Spectral Repeat finder
Subcellular Localization Methods
PSLpred: localization of prokaryotic proteins
ESLpred: localization of Eukaryotic proteins
HSLpred: localization of Human proteins
MITpred: Prediction of Mitochondrial proteins
TBpred: Localization of mycobacterial proteins
GWFASTA: Genome-Wide FASTA Search
GWBLAST: Genome wide BLAST search
COPID: Composition based similarity search
LGEpred: Gene from protein sequence
Prediction of druggable proteins
Nrpred: Classification of nuclear receptors
GPCRpred: Prediction of G-protein-coupled receptors
GPCRsclass: Amine type of GPCR
VGIchan: Voltage gated ion channel
Pprint: RNA interacting residues in proteins
GSTpred: Glutathione S-transferases proteins
Protein Structure Prediction
APSSP2: protein secondary structure prediction
Betatpred: Consensus method for β-turns prediction
Bteval: Benchmarking of β-turns prediction
BetaTurns: Prediction of β-turn types in proteins
Turn Predictions: Prediction of α/β/γ-turns in proteins
GammaPred: Prediction of γ-turns in proteins
BhairPred: Prediction of Beta Hairpins
TBBpred: Prediction of transmembrane beta barrel proteins
SARpred: Prediction of surface accessibility (real accessibility)
PepStr: Prediction of tertiary structure of Bioactive peptides
36