Transcript Promoter

Optimization of SVM Parameters
for Promoter Recognition
in DNA Sequences
Robertas Damaševičius
Software Engineering Department,
Kaunas University of Technology
Studentų 50-415, Kaunas, Lithuania
Email: [email protected]
Data: genetic (DNA) sequences



Meaning: represent genetic information stored in
DNA molecule in symbolic form
Syntax: 4-letter alphabet {A, C, G, T}
Complexity: numerous layers of information






protein-coding genes
regulatory sequences
mRNA sequences responsible for protein structure
directions from DNA packaging and unwinding, etc.
Motivation: over 95% - “junk DNA” (biological
function is not fully understood)
Aim: identify structural parts of DNA

introns, exons, promoters, splice sites, etc.
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
2
What are promoters?



Promoter: a regulatory region of DNA located upstream
of a gene, providing a control point for gene transcription
Function: by binding to promoter, specific proteins
(Transcription Factors) can either promote or repress
the transcription of a gene
Structure: promoters contain binding sites or “boxes” –
short DNA subsequences, which are (usually) conserved
Gene
Promoter
exon1
intron
exon2
intron exon3
Start
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
Stop
3
Promoter recognition problem
Multitude of promoter “boxes” (nucleotide patterns)



(a)
(b)
(c)
TATA, Pribnow, Gilbert, DPE, E-box, Y-box, …
“Boxes” within a species are conserved, but there are
many exceptions to this rule
Exact pattern = TACACC
CAATGCAGGATACACCGATCGGTA
Pattern with mismatches = TACACC + 1 mismatch
CAATGCAGGATTCACCGATCGGTA
Degenerate pattern = TASDCC (S={C,G}, D={A,G,T})
CAATGCAGGATAGTCCGATCGGTA
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
4
Support Vector Machine (SVM)


g x j   sgn    i yi K xi , x j   b 
 x SV

i
xi  X are training data vectors, x j  X are unknown data vectors
yi  Y, Y  1,1 is a target space
K xi , x j  is the kernel function.
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
5
Quality of classification

Training data


Mapping of data into feature space



linear, polynomial, RBF, sigmoid
Kernel function parameters
SVM learning parameters


Orthogonal, single nucleotide, nucleotide grouping, ...
Selection of an optimal kernel function


size of dataset, generation of negative examples, imbalanced
datasets
Regularization parameter, Cost factor
Selection of SVM parameter values – an optimization
problem
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
6
SVM optimization strategies

Kernel optimization



Parameter optimization




Putting additional parameters
Designing new kernels
Learning parameters only
Kernel parameters only
Learning & kernel parameters
Optimization decisions


Optimization method
Objective function
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
7
SVM (hyper)parameters

Kernel parameters

Learning parameters
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
8
SVM parameter optimization methods
Method
Advantages
Disadvantages
Random Simplicity.
search
Depends on selection of random
points and their distribution.
Very slow as the size of the parameter
space increases
Grid
Simplicity.
search A starting point is not
required.
Box-constraints for grid are necessary.
No optimality criteria for the solution.
Computationally expensive for a large
number of parameters.
Solution depends upon coarseness of
the grid.
Nelder- Few function
Mead
evaluations.
Good convergence and
stability.
Can fail if the initial simplex is too
small.
No proof of convergence.
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
9
Dataset

Drosophila sequence datasets:




Datasets for SVM classifier:



Promoter dataset: 1842 sequences, each 300 bp length, from -250
bp to +50 bp with regards to the gene transcription site location
Intron dataset: 1799 sequences, each 300 bp length
Coding sequence (CDS) dataset: 2859 sequences, each 300 bp
length
Training file: 1260 examples (372 promoters, 361 introns, 527 CDS)
Test file: 6500 examples (1842 promoters, 1799 introns, 2859 CDS)
Datasets are unbalanced:


29.5% promoters vs. 70.5% non-promoters in the training dataset
28.3% promoters vs. 71.7% non-promoters in the test dataset
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
10
Classification requisites

Feature mapping: orthogonal
{ A  0001, C  0010, G  0100, T  1000}

Kernel function: power series kernel
xi , x j    ak xi
n
K npow
T
xj c

k
k 1

Metrics:




Specificity (SPC) SPC  nTN
 100%
nFP  nTN
Sensitivity (TPR)
nTP
TPR 
100%
nTP  n FN
SVM classifier: SVMlight
SVM parameter optimization method:

Modified Nelder-Mead (downhill simplex)
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
11
Modification of Nelder-Mead

Optimization time problem:



Call to SVM training and testing function is very time-costly
for large datasets
Requires many evaluations of objective function
Modifications:


Function value caching
Normalization after reflection step
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
12
Classification results
Kernel
No. of
Type of
Classification
optimized optimized
evaluation metric
parameters parameters Specificity Sensitivity
(SPC)
(TPR)
Linear
-
none
84.83%
58.25%
Linear
3
learning
91.23%
81.38%
Polynomial
-
none
81.81%
44.90%
Polynomial
6
learning +
kernel
87.64%
67.48%
Power series (2)
3
kernel
94.85%
89.69%
Power series (3)
4
kernel
94.92%
89.95%
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
13
ROC plot
100
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
14
Conclusions




SVM classifier alone can not achieve satisfactory
classification results for a complex unbalanced dataset
SVM parameter optimization can improve classification
results significantly
Best results can be achieved when SVM parameter
optimization is combined with kernel function modification
Power series kernel is particularly suitable for optimization
because of a larger number of kernel parameters
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
15
Ongoing work and future research





Application of SVM parameter optimization for splice site
recognition problem [presented in CISIS’2008]
Selection of rules for optimal DNA sequence mapping to
the feature space [accepted to WCSB’2008]
Analysis of the relationships between data characteristics
and classifier behavior [accepted to IS’2008]
Automatic derivation of formal grammars rules [accepted
to KES’2008]
Structural analysis of sequences using SVM with grammar
inference [accepted to ITA’2008]
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
16
Thank You.
Any questions?
Continuous Optimization and Knowledge-Based Technologies – EurOPT’2008
17