
Final Report (30% of the final score)
Bin Liu, PhD, Associate Professor
Contents

There are two parts: project + report

Project (remote homology detection)
Report

Review the methods for remote homology detection.
Point out their advantages and disadvantages.
How did you do the experiments? Give information for each step.
What are your results?
What are the advantages, disadvantages, and novelty of your methods?
Protein Remote Homology Detection

Background

Problem definition: a classification problem:
f : X → {H, N}
[Figure: The schematic plot of the hierarchy for the SCOP database. Levels shown: class, fold, superfamily, family, annotated "sequence similarity from high to low". Node labels: All α (class); Long alpha-hairpin, Globin-like (fold); alpha-helical ferredoxin; Dihydropyrimidine dehydrogenase, N-terminal domain; Fumarate reductase/Succinate dehydrogenase iron-sulfur protein; Truncated hemoglobin; Nerve tissue minihemoglobin; Phycocyanin-like phycobilisome proteins; ……]
Overview
Dataset



http://noble.gs.washington.edu/proj/svmpairwise/
54 families and 4352 proteins.
For more information about the dataset, refer to: Li Liao and William Stafford Noble. "Combining pairwise sequence similarity and support vector machines for detecting remote protein evolutionary and structural relationships." Journal of Computational Biology. 10(6):857-868, 2003.
Dataset

Tab-delimited table

0 = not present;
1 = positive train;
2 = negative train;
3 = positive test;
4 = negative test.
Feature extraction


Extract features from the protein sequences, which can be found in the “Sequence file” in the supplementary material.
Use your imagination to extract features that can capture the characteristics of the protein sequences.
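As a starting point for brainstorming, one of the simplest sequence features is the amino acid composition. The sketch below is only an illustration: the function name and the example sequence are assumptions, not part of the course material.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def amino_acid_composition(sequence):
    """Return a 20-dimensional vector of residue frequencies.

    Characters outside the 20 standard residues (e.g. X) are ignored.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]

# Hypothetical example sequence.
features = amino_acid_composition("IVEGQDAEVGLSPW")
```

Every protein maps to a vector of the same length regardless of sequence length, which is what most classifiers expect as input.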
Dataset construction


Based on the supplementary files “Tab-delimited table” and “Sequence file”, the training sets and test sets can be constructed.
There are 54 datasets in total (one per family).
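A minimal sketch of how the label codes could be turned into per-family training and test sets. The exact file layout is an assumption here (a header row of family names, then one row per protein with an ID followed by one code per family); check the real file before reusing this. The function name and toy rows are hypothetical.

```python
def build_family_datasets(table_lines):
    """Split protein IDs into train/test sets per family from label codes.

    Assumed layout (hypothetical): the first row is a header whose columns
    after the first are family names; each later row is a protein ID
    followed by one code (0-4) per family.
    """
    header = table_lines[0].rstrip("\n").split("\t")
    families = header[1:]
    datasets = {
        fam: {"train_pos": [], "train_neg": [], "test_pos": [], "test_neg": []}
        for fam in families
    }
    code_to_split = {"1": "train_pos", "2": "train_neg",
                     "3": "test_pos", "4": "test_neg"}
    for line in table_lines[1:]:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        protein_id, codes = fields[0], fields[1:]
        for fam, code in zip(families, codes):
            split = code_to_split.get(code.strip())
            if split is not None:  # code 0 means "not present"
                datasets[fam][split].append(protein_id)
    return datasets

# Toy table with two made-up families and four made-up proteins.
toy_table = [
    "protein\tfamA\tfamB",
    "p1\t1\t2",
    "p2\t2\t0",
    "p3\t3\t4",
    "p4\t4\t1",
]
datasets = build_family_datasets(toy_table)
```

The protein IDs in each split can then be looked up in the sequence file to extract features.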
Classifiers

You are free to choose any classifier, such as Support Vector Machines (SVMs), Artificial Neural Networks (ANNs), Random Forests (RFs), etc.
Performance measure


ROC score (AUC)
The average ROC score over all 54 families should be given.
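The ROC score can be computed without plotting the curve: it equals the probability that a randomly chosen positive is scored above a randomly chosen negative. A minimal sketch using the rank-sum form, with ties given average ranks; the function names and the toy scores are assumptions for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) statistic.

    labels: 1 for positive test proteins, 0 for negative ones.
    scores: classifier outputs, higher meaning "more likely positive".
    """
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def mean_auc(per_family):
    """Average the AUC over families; per_family maps name -> (labels, scores)."""
    aucs = [roc_auc(labels, scores) for labels, scores in per_family.values()]
    return sum(aucs) / len(aucs)

# Two made-up families with made-up scores.
toy = {
    "famA": ([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]),
    "famB": ([1, 0, 1, 0], [0.9, 0.8, 0.3, 0.1]),
}
average = mean_auc(toy)
```

In the project, `per_family` would hold the test labels and classifier scores of each of the 54 families.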
Scoring function for the project and report

Novelty and completeness: new features, new machine learning models, etc. Write down what makes your method different from others in this field. Does your method work? (40%)
Intermediate results and source code (20%)
Results (based on average ROC score) (10%)
Report (30%)
Important information


This is individual work, not team work, so do it alone, but you are free to discuss it with others.
Due date: 30 April 2015 (1 month later). All the data should be stored in one ZIP or RAR file and sent to the TA via email or QQ. The title of the email and of your data file: your name + student ID. (If your data is too large, contact the TA directly.) The slides of your presentation should be attached too.
Other topics you can choose

DNA-binding protein identification
Dataset is available at
http://bioinformatics.hitsz.edu.cn/iDNAProt_dis/data.jsp
Fold recognition
Enhancer prediction
Problem description
DNA-binding proteins are very important components of both eukaryotic and prokaryotic proteomes. At least about 2% of prokaryotic and 3% of eukaryotic proteins are able to bind to DNA, and these proteins are important for various cellular processes.
Therefore, developing an efficient model for distinguishing DNA-binding proteins from non-DNA-binding proteins is an urgent research problem. Although many efforts have been made in this regard, further effort is needed to enhance the prediction power.
Dataset description
There are two datasets in this project, a benchmark dataset and an independent dataset, which are available on the course website:
http://bioinformatics.hitsz.edu.cn/iDNAProt_dis/data.jsp
For more information, see the following paper:
Task and evaluation
Task:
Identify DNA-binding proteins from non-DNA-binding proteins.
Evaluation scheme:
1. Use validation techniques to optimize the parameters of your methods (if any), and obtain the results on the benchmark dataset.
2. Train your classifiers on the benchmark dataset, and predict the proteins in the independent dataset.
3. Analyze the features, and find some interesting patterns.
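For step 1, one common validation technique is k-fold cross-validation on the benchmark dataset. The sketch below only produces the index splits; the choice of stratified 5-fold, the function name, and the toy labels are assumptions, and any other validation scheme would fit the task equally well.

```python
import random

def stratified_k_fold(labels, k=5, seed=0):
    """Yield (train_indices, validation_indices) pairs for k-fold validation.

    Each class is shuffled separately and dealt round-robin into folds,
    so every fold keeps roughly the original class ratio.
    """
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, idx in enumerate(members):
            folds[position % k].append(idx)
    for f in range(k):
        validation = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, validation

# Toy benchmark: 10 positives (1) followed by 10 negatives (0).
labels = [1] * 10 + [0] * 10
splits = list(stratified_k_fold(labels, k=5))
```

Parameters are tuned by training on each `train` part and scoring on the matching `validation` part; the independent dataset is touched only in step 2.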
TP refers to the number of positive samples that are classified correctly;
FP denotes the number of negative samples that are classified as positive samples;
TN denotes the number of negative samples that are classified correctly;
FN denotes the number of positive samples that are classified as negative samples.
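These four counts are the inputs to the usual threshold-based measures. A minimal sketch, assuming sensitivity, specificity, accuracy, and the Matthews correlation coefficient (MCC) are the measures of interest; that selection, the function names, and the toy predictions are assumptions rather than requirements stated in these slides.

```python
import math

def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for binary labels (1 = positive, 0 = negative)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

def binary_metrics(tp, fp, tn, fn):
    """Standard measures derived from the confusion counts."""
    total = tp + fp + tn + fn
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn) if tp + fn else 0.0,
        "specificity": tn / (tn + fp) if tn + fp else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }

# Made-up labels and predictions.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 1, 0, 1]
counts = confusion_counts(y_true, y_pred)
metrics = binary_metrics(*counts)
```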
Students from other majors

If you are not in the CS department, please select one computational task in the field of bioinformatics.
Write a review of the state-of-the-art predictors for this task.
Discuss their advantages and disadvantages.
Discuss the relationship between bioinformatics and your major.
Can you apply ideas from bioinformatics to your own project?
At least 4000 words.
Data Driven Machine Learning Approaches for Bioinformatics
Case study: protein remote homology detection
Outline

Overview
Feature extraction
  Sequence-based features
  Profile-based features
  Other features
Classifiers
Feature analysis
Data Driven Machine Learning Approaches for Bioinformatics
[Figure: workflow diagram. Protein function data is split into training data and test data. The training data is used to build a classifier that maps input (sequence features) to output (function category); the classifier is then applied to test data and new data for prediction.]
Training: build a classifier
Test: test the model
Key idea: learn from known data and generalize to unseen data
Several important components in this model

Feature extraction.
Given a protein, how can features be extracted based only on the primary sequence?
Brainstorming?
A case study: remote homology detection and protein-protein interaction

Features derived from the primary sequence only.

N-grams. Leslie et al. 2002 (possible subsequences of amino acids of a fixed length N);
SVM-npeptide. Ogul et al. 2007 (reduced amino acid alphabets)
Mismatch kernel and Pattern (TEIRESIAS algorithm). Leslie CS et al. 2004 and Dong et al. 2005.
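The N-gram idea can be sketched as counting every fixed-length word in a sequence. This is a simplified illustration of the general idea, not a reproduction of any of the cited methods; the function name and the example sequence are assumptions.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def kmer_spectrum(sequence, k=2, alphabet=AMINO_ACIDS):
    """Count every length-k word over `alphabet` in the sequence.

    Returns (vector, index): a fixed-order count vector of size
    len(alphabet) ** k and the word -> position lookup used to build it.
    """
    index = {"".join(w): i for i, w in enumerate(product(alphabet, repeat=k))}
    vector = [0] * len(index)
    seq = sequence.upper()
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if word in index:  # skip words with non-standard residues such as X
            vector[index[word]] += 1
    return vector, index

# Hypothetical example sequence.
vector, index = kmer_spectrum("IVEGQDAEVGLSPW", k=2)
```

The feature space grows as 20^k, which is one motivation for reduced alphabets: mapping residues to a handful of groups first and passing the smaller alphabet as `alphabet` keeps the vector short for larger k.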
Feature extraction


Distance-based approach. Lingner et al. 2006
Word correlation matrices. Lingner et al. 2008
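The intuition behind distance-based features is to count how often two residues occur a given number of positions apart. The sketch below uses single residues only and is a simplification for illustration, not the published method; the function name, the `max_distance` parameter, and the example sequence are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def residue_pair_distance_counts(sequence, max_distance=3):
    """Count ordered residue pairs (a, b) that occur d positions apart.

    Returns a dict keyed by (a, b, d) for d = 1..max_distance; a missing
    key means the pair never occurs at that distance.
    """
    seq = sequence.upper()
    counts = {}
    for d in range(1, max_distance + 1):
        for i in range(len(seq) - d):
            a, b = seq[i], seq[i + d]
            if a in AMINO_ACIDS and b in AMINO_ACIDS:
                counts[(a, b, d)] = counts.get((a, b, d), 0) + 1
    return counts

# Hypothetical example sequence.
counts = residue_pair_distance_counts("IVEGQDAEVG", max_distance=2)
```

To feed a classifier, the dict would be flattened into a fixed-order vector of 20 × 20 × `max_distance` entries.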
SVM-pairwise

Feature vector is a list of pairwise sequence similarity scores. Liao et al. 2002
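A minimal sketch of the pairwise idea: each protein is described by its similarity to every sequence in a reference set. The similarity here is a plain Smith-Waterman local alignment score with a toy match/mismatch/gap scheme, which is a stand-in chosen for brevity; a real experiment would use scores from a proper alignment tool with a substitution matrix. The function names, scoring values, and sequences are all assumptions.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Smith-Waterman local alignment score with a toy scoring scheme."""
    previous = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            substitution = match if a[i - 1] == b[j - 1] else mismatch
            current[j] = max(
                0,                             # start a new local alignment
                previous[j - 1] + substitution,  # align a[i-1] with b[j-1]
                previous[j] + gap,             # gap in b
                current[j - 1] + gap,          # gap in a
            )
            best = max(best, current[j])
        previous = current
    return best

def pairwise_feature_vector(sequence, reference_sequences):
    """One similarity score per reference sequence, in reference order."""
    return [local_alignment_score(sequence, ref) for ref in reference_sequences]

# Made-up reference set and query.
references = ["IVEGQDAEVG", "LSPW", "AAAA"]
vector = pairwise_feature_vector("EGQD", references)
```

The vector length equals the number of reference sequences, so training and test proteins must be scored against the same reference set in the same order.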
Profile-based features

Brainstorming: how can the profile features be used?
Profiles
Pos Res     A     C     D     E     F     G     H     I     K     L     M     N     P     Q     R     S     T     V     W     Y
1   I    0.06  0.01  0.02  0.03  0.03  0.02  0.01  0.20  0.03  0.13  0.02  0.02  0.02  0.02  0.02  0.03  0.04  0.25  0.00  0.02
2   V    0.06  0.01  0.01  0.02  0.02  0.02  0.01  0.25  0.03  0.10  0.02  0.01  0.01  0.01  0.01  0.02  0.05  0.30  0.00  0.02
3   E    0.05  0.00  0.04  0.12  0.01  0.50  0.02  0.01  0.03  0.02  0.00  0.06  0.02  0.02  0.02  0.04  0.02  0.02  0.00  0.01
4   G    0.08  0.01  0.04  0.04  0.02  0.49  0.01  0.02  0.03  0.03  0.01  0.04  0.02  0.02  0.03  0.06  0.03  0.03  0.01  0.01
5   Q    0.04  0.03  0.06  0.11  0.06  0.03  0.05  0.02  0.05  0.03  0.02  0.03  0.01  0.14  0.06  0.06  0.07  0.04  0.02  0.08
6   D    0.05  0.00  0.17  0.26  0.02  0.02  0.01  0.02  0.04  0.03  0.01  0.08  0.05  0.05  0.02  0.06  0.06  0.06  0.00  0.01
7   A    0.18  0.23  0.02  0.02  0.01  0.03  0.01  0.05  0.02  0.04  0.01  0.02  0.02  0.01  0.02  0.09  0.09  0.07  0.00  0.01
8   E    0.08  0.00  0.08  0.16  0.01  0.04  0.02  0.02  0.09  0.04  0.01  0.06  0.07  0.05  0.08  0.09  0.05  0.06  0.00  0.01
9   V    0.05  0.00  0.04  0.08  0.02  0.03  0.02  0.07  0.10  0.07  0.01  0.02  0.20  0.04  0.04  0.03  0.02  0.14  0.01  0.01
10  G    0.05  0.00  0.05  0.04  0.01  0.22  0.16  0.01  0.02  0.02  0.00  0.19  0.01  0.02  0.04  0.08  0.03  0.02  0.01  0.02
11  L    0.08  0.00  0.05  0.18  0.01  0.02  0.02  0.02  0.05  0.13  0.01  0.03  0.01  0.05  0.05  0.22  0.03  0.02  0.00  0.01
12  S    0.04  0.03  0.01  0.02  0.05  0.02  0.06  0.02  0.02  0.03  0.01  0.01  0.01  0.09  0.03  0.12  0.03  0.04  0.30  0.04
13  P    0.06  0.00  0.03  0.04  0.01  0.03  0.01  0.02  0.04  0.03  0.01  0.02  0.48  0.02  0.08  0.04  0.03  0.03  0.00  0.01
14  W    0.03  0.01  0.02  0.02  0.12  0.03  0.05  0.03  0.02  0.05  0.01  0.01  0.01  0.02  0.02  0.02  0.02  0.03  0.30  0.17
…   …
Binary profile. Dong et al. 2007
N-profile. Liu et al. 2008
Order profile. Liu et al. 2009
Top-n-grams. Liu et al. 2008
ACC. Dong et al. 2009
  AC
  ACC
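Two of these ideas can be sketched directly on a profile matrix (one row per sequence position, one column per amino acid in the order A, C, D, ..., Y). The sketch assumes that a Top-n-gram joins the n most frequent residues of a position and that AC is the lag covariance of a single column along the sequence; both are simplified readings for illustration, and the cited papers should be checked for the exact definitions. Function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def top_n_gram(profile, n=1):
    """For each position, join its n most frequent residues (highest first)."""
    words = []
    for row in profile:
        ranked = sorted(range(len(row)), key=lambda c: -row[c])
        words.append("".join(AMINO_ACIDS[c] for c in ranked[:n]))
    return words

def auto_covariance(profile, lag):
    """Per-column covariance between positions i and i + lag, averaged over i."""
    length = len(profile)
    if not 0 < lag < length:
        raise ValueError("lag must be between 1 and length - 1")
    n_cols = len(profile[0])
    means = [sum(row[c] for row in profile) / length for c in range(n_cols)]
    return [
        sum((profile[i][c] - means[c]) * (profile[i + lag][c] - means[c])
            for i in range(length - lag)) / (length - lag)
        for c in range(n_cols)
    ]

# First three rows of the example profile (positions 1-3: I, V, E).
profile = [
    [0.06, 0.01, 0.02, 0.03, 0.03, 0.02, 0.01, 0.20, 0.03, 0.13,
     0.02, 0.02, 0.02, 0.02, 0.02, 0.03, 0.04, 0.25, 0.00, 0.02],
    [0.06, 0.01, 0.01, 0.02, 0.02, 0.02, 0.01, 0.25, 0.03, 0.10,
     0.02, 0.01, 0.01, 0.01, 0.01, 0.02, 0.05, 0.30, 0.00, 0.02],
    [0.05, 0.00, 0.04, 0.12, 0.01, 0.50, 0.02, 0.01, 0.03, 0.02,
     0.00, 0.06, 0.02, 0.02, 0.02, 0.04, 0.02, 0.02, 0.00, 0.01],
]
top1 = top_n_gram(profile, n=1)
top2 = top_n_gram(profile, n=2)
ac_lag1 = auto_covariance(profile, lag=1)
```

The Top-n-gram words can be counted like ordinary sequence words, while AC gives a fixed-length vector (20 values per lag) for sequences of any length.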
Other features
(AAindex-based features)

Physicochemical Distance Transformation (PDT). Liu et al. 2012
LSA (latent semantic analysis). Dong et al. 2006

Classifiers
SVM
Kernel combination methodology

VBKC. Damoulas et al. 2008
Summary





To establish a really useful statistical predictor for a biological system:
(i) Benchmark dataset;
(ii) Feature extraction;
(iii) Machine learning algorithm;
(iv) Web server or stand-alone tools