Transcript gif anim

Bioinformatics Programming
EE, NCKU
Tien-Hao Chang (Darby Chang)
1
Final Project
2
Topic

Sequence alignment

Protein clustering

Classification

Other analysis techniques
– association rule
– frequent pattern
– network
3
Must be
A web server
4
Sequence alignment

First class
– a novel sequence alignment algorithm
– ClustalW
• http://www.ebi.ac.uk/Tools/clustalw2/index.html
• Thompson,J.D., Higgins,D.G. and Gibson,T.J. (1994) Clustal-W—
Improving the sensitivity of progressive multiple sequence alignment
through sequence weighting, position-specific gap penalties and weight
matrix choice. Nucleic Acids Res., 22, 4673–4680.

Second class
– an application using ClustalW
– E1DS
• http://e1ds.ee.ncku.edu.tw/
• Chien,T.Y., Chang,D.T.H., Chen,C.Y., Weng,Y.Z. and Hsu,C.M. (2008)
E1DS: catalytic site prediction based on 1D signatures of concurrent
conservation. Nucleic Acids Res., 36, W291–W296.
5
Protein clustering

First class
– CD-HIT
• http://weizhong-lab.ucsd.edu/cdhit_suite/cgi-bin/index.cgi
• Li,W., Jaroszewski,L. and Godzik,A. (2001) Clustering of highly
homologous sequences to reduce the size of large protein
databases. Bioinformatics, 17, 282–283.

Second class
– Protemot
• http://protemot.ee.ncku.edu.tw/
• Chang,D.T.H., Weng,Y.Z., Lin,J.H., Hwang,M.J. and Oyang,Y.J.
(2006) Protemot: prediction of protein binding sites with
automatically extracted geometrical templates. Nucleic Acids
Res., 34, W303–W309.
6
First class

There might be some state-of-the-art packages
– sequence alignment
• BLAST (1990), ClustalW, FASTA, HMMER (1998), HHpred/HHsearch
(2005), PSI-BLAST (1997), T-coffee, SSEARCH and so on
– overtaking them is very difficult, but there still some room,
especially for special purpose alignment
• Abascal,F., Zardoya,R., and Telford,M.J. (2010)
TranslatorX: multiple alignment of nucleotide sequences guided by
amino acid translations. Nucleic Acids Res. Advance Access published
on April 30, 2010.

Some possible direction
– add some constrains (special purpose), speed the algorithm
– combine multiple tools, ex: domain-conserved alignment
– instead of implementing from scratch
• manipulate the input to existing packages (preprocessing)
• start from the output of existing packages (postprocessing)
7
Second class


The programming part is less challenging, but is still
heavy and probably more niggling
You need a good/interesting theme
–
–
–
–
–

predicting DNA-binding protein
predicting protein-protein interaction
mapping any ID to a specific database
connecting predicted TFBS to DNA/RNA sequences
…
Implement a specific algorithm and web-lize it might
be okay
– http://nar.oxfordjournals.org/papbyrecent.dtl has many
update-to-date web servers
8
http://www.flickr.com/photos/meteorry/3452536272/
In either class, you need to discuss with me
9
http://www.sxc.hu/photo/544232
Final project schedule
10
Discuss with me
Before 5/12 (a soft deadline)
11
What is machine learning?
12
A very trivial machine learning tool
K-Nearest-Neighbors (KNN)
O
X
O
O
? X
O
X
O
O
X
X
X
X
O

The predicted class of the query sample
depends on the voting among its k nearest
neighbors
13
O
X
O
O
X
O X
O
X
O
O
X
X
X
O

k=3
14
O
X
O
O
X
X X
O
X
O
O
X
X
X
O

k=5
15
Although KNN is very trivial, it can

Example: in vitro fertilization





Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome
Given a set of known instances, predict
outcome for newly coming instances
So, KNN learnt something related to “the
definition of a good embryo”
16
Although KNN is very trivial, it can

Example: in vitro fertilization





Given: embryos described by 60 features
Problem: selection of embryos that will survive
Data: historical records of embryos and outcome
Given a set of known instances, predict
outcome for newly coming instances
So, KNN learnt something related to “the
definition of a good embryo”?
17
Can machines really learn?


Notice that here we call KNN a machine
Definitions of “learning” from dictionary:





To get knowledge of by study,
experience, or being taught
To become aware by information
or from observation
To commit to memory
To be informed of, ascertain;
to receive instruction
Difficult to measure
Trivial for computers
Operational definition:

Things learn when they change
their behavior in a way that makes
them perform better in the future
Does a slipper learn?
18
Shortly speaking, machine learning is
Knowledge/
Information
Training data
A set of known
instances
Testing data
A query
instance
Machine
E.g. KNN
Outcome
Class of the
query instance
19
Furthermore, learning is
When
training data
increases
Testing data
A query
instance
Knowledge/
Information
Training data
A set of known
instances
Machine
E.g. KNN
It delivers better
(e.g. higher
accuracy) outcome
Outcome
Class of the
query instance
20
Classifier
In
Out
two sets of samples, tr and te
accuracy of using tr to predict te
Requirement
- implement KNN with a parameter k
- invoke RVKDE
- complexity/teamwork report
- using Perl would be the best
Bonus
- invoke LIBSVM
- a script to decide the best k in a range
21
Deadline
2010/5/11 23:59
Zip your code, step-by-step README,
complexity analyses and anything worthy
extra credit. Email to
[email protected].
22
Materials for the exercise 9

Input sample (Iris)
–
–

Test your program on satimage
–
–

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.tr
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/satimage.scale.t
RVKDE
–
–
–

http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass/iris.scale
1 1:-0.555556 2:0.25 3:-0.864407 4:-0.916667
1 1:-0.666667 2:-0.166667 3:-0.864407 4:-0.916667
1 1:-0.777778 3:-0.898305 4:-0.916667
.
http://mbi.ee.ncku.edu.tw/wiki/doku.php?id=rvkde
$ wget http://mbi.ee.ncku.edu.tw/rvkde/res/rvkde-current-linux32.tgz
$ tar zxvf rvkde-current-linux32.tgz
$ rvkde-0.2.3-final/rvkde --classify --predict -v tr -V te -a 1 -b 1 --ks 10 --kt 10
rvkde has a built-in function of parameter tuning (see --cv)
LIBSVM
–
–
–
http://www.csie.ntu.edu.tw/~cjlin/libsvm/
see the manual
LIBSVM provides a script of parameter tuning (see grid.py)
23
Machine Learning and
Bioinformatics
24
Why these two fields?

From biologists’ view


There are abundant data to analyze
From computer guys’ view



The data are suitable (large and well-studied)
Biomedical problems are important
There are various computer science techniques
for various Bioinformatics applications
25
Circuit simulation
Computer graphics
Information retrieval
Network analysis
http://www.sophion.dk/sophion/Open-close2.jpg &
http://alford.bios.uic.edu/Images/586%20images/circuit%20model
http://www.dashboardinsight.com/CMS/e01d472c-862e4e13-8b23-591f8938889a/text_mining340x220.png
http://healthbolt.net/wp-content/uploads/2006/09/cell-animation.jpg
& http://www.osmosis.com.au/animate/images/bloodcells.jpg
http://upload.wikimedia.org/wikipedia/commons/thumb/
6/68/Social-network.svg/430px-Social-network.svg.png
26
Applications, concepts and
our approaches
27
Our online services



Secondary structure prediction
Catalytic site prediction
Protein-ligand docking
28
Secondary structure prediction

In biochemistry and
structural biology,
secondary structure
(SSE) is the general
three-dimensional form
of local segments of
biopolymers such as proteins
and nucleic acids (DNA/RNA)
http://upload.wikimedia.org/wikipedia/commons/thumb/
6/60/Myoglobin.png/542px-Myoglobin.png
29
Prote2S
http://prote2s.csie.ntu.edu.tw/
30
Prote2S
http://prote2s.csie.ntu.edu.tw/
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAE
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEH
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLT
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
31
Concept of Prote2S
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLG…
<f1, f2, … fd>
Knowledge/
Information
Training data
Residues with
known SSE
Testing data
A residue as a
vector
Machine
E.g. KNN
Outcome
SSE of the
query residue
32
Feature encoding



Most classifiers deal
with a vector space
Feature encoding
means to generate
the vector
representation of an
real world instance
An instance is
represented as
several important
attributes, or say,
features
>1A0OB
RQLALEAKGETPSAVTRLSVVA
KSEPQDEQSRSQSPRRIILS…
PSI-BLAST
PSSM
A
R N
R  2 6 1
Q 1 0 1
L 2 3 4
A
5 2 2
L 2 2 3


D
C

2 4
 1  4 Feature vector
4 2 
2
1
3 2

33
Similar applications to Prote2S
Disorder region
Conformational switch
Protein-protein interaction
Solvent accessibility
http://www.oup.com/uk/orc/bin/97801
99265114/resources/anim/figs/f2-9.gif
http://stke.sciencemag.org/content/jbc/vol280/
issue7/images/large/zbc0060586670006.jpeg
34
A family tree
Ian H. Witten and Eibe Frank, Data Mining: Practical
Machine Learning Tools and Techniques (Second Edition)
35
Family tree represented as a table
The “sister-of” relation
Ian H. Witten and Eibe Frank, Data Mining: Practical
Machine Learning Tools and Techniques (Second Edition)
36
Catalytic site prediction

The catalytic site is usually a
small pocket at the surface
of the enzyme that contains
residues responsible for the
substrate specificity and
catalytic residues which
often act as proton donors or
acceptors or are responsible
for binding a cofactor
http://fig.cox.miami.edu/~cmallery/
255/255enz/ES_complex.jpg
37
E1DS
http://e1ds.ee.ncku.edu.tw/
38
E1DS
http://e1ds.ee.ncku.edu.tw/
>Paste your sequence in FASTA format to replace the sample FASTA here
MIFSVDAVRADFPVLSREVNGLPLAYLDSAASAQKPSQVIDAEAEFYRHGYAAVHAGAHTLSAQATEKMENVRKRASLFI
NARSAEELVFVRGTTEGINLVANSWGNSNVRAGDNIIISQMEHHANIVPWQMLCARVGAELRVIPLNPDGTLQLETLPTL
FDAATRLLAITHVSNVLGTENPLAEMITLAHQHGAKVLVDGAQAVMHHPVDVQALDCDFYVFSGHKLYGPTGIGILYVKE
ALLQEMPPWEGGGSMIATVSLSEGTTWTKAPWRFEAGTPNTGGIIGLGAALEYVSALGLNNIAEYEQNLMHYALSQLESV
PDLTLYGPQARLGVIAFNLGAHHAYDVGSFLDNYGIAVRTGHHCAMPLMAYYNVPAMCRASLAMYNTHEEVDRLVTGLQR
IHRLLG
39
Concept of E1DS
http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg
40
Concept of E1DS
http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg
http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg
41
Concept of E1DS
http://www.biomedcentral.com/conten
t/figures/1471-2105-8-S5-S8-3-l.jpg
http://www.biomedcentral.com/content/figures/1471-2105-8-S5-S8-6-l.jpg
http://jkweb.berkeley.edu/external/pdb/2004/heme/fig1.jpg
42
Allowing large flexible gaps

>1RPX:A
SRVDKFSKSDIIVSPSILSANFSKLGEQVKAIEQAGCDWIHVDVMDGRFVPNITIGPLVVDSL
RPITDLPLDVHLMIVEPDQRVPDFIKAGADIVSVHCEQSSTIHLHRTINQIKSLGAKAGVVLN
PGTPLTAIEYVLDAVDLVLIMSVNPGFGGQSFIESQVKKISDLRKICAERGLNPWIEVDGGVG
PKNAYKVIEAGANALVAGSAVFGAPDYAEAIKGIKTSKRPE

PROSITE pattern


[LIVMA]-x-[LIVM]-M-[ST]-[VS]-x-P-x(3)-[GN]-Qx(0,1)-[FMK]-x(6)-[NKR]-[LIVMC]
Our pattern

H-x-D-x-M-D-x(94,144)-M-x-V-x-P-G-x(3)-Qx(22,32)-D-G-G
43
Applications using pattern mining
Transcription factor
binding site
Protein-protein interaction
http://www.cs.uiuc.edu/homes/sinhas/img/DAILYILLINI.jpg & http://upload.wikimedia.org/
wikipedia/en/thumb/8/8d/ChIP-on-chip_wet-lab.png/400px-ChIP-on-chip_wet-lab.png
http://stke.sciencemag.org/content/jbc/vol280/
issue7/images/large/zbc0060586670006.jpeg
44
Protein-ligand docking

The goal
of protein-ligand
docking is to predict
the position and
orientation of a ligand
(a small molecule)
when it is bound to a protein receptor
http://www-ucc.ch.cam.ac.uk/research/
images/docking-small.jpg
45
MEDock
http://medock.csie.ntu.edu.tw/
46
MEDock
http://medock.csie.ntu.edu.tw/
http://gemdock.life.nctu.edu.tw/dock/images/1jff-sum.gif
47
Concept of MEDock
http://www.mathworks.com/cms
images/op_main_wl_3250.jpg
48
Genetic algorithm
http://wellington.pm.org/archive/200505/oo-perl/images/mutation.jpg
Mutation
Crossover
http://www.mannosidosis.org/images/inheritance.jpg
Nature selection
http://myconstructionphotos.smugm
ug.com/photos/58510454-M-1.jpg
49
Applications using optimization
Protein folding
Gene network
Microarray analysis
http://cnx.org/content/m11461/latest/protein_folding.jpg
http://research.microsoft.com/users/manuelrg/microarray.gif
http://www.ehponline.org/members/2007/10358/fig2.jpg
50
Quick summary

What is machine learning


Three real Bioinformatics applications




A very simple classifier, KNN
Secondary structure prediction
Catalytic site prediction
Protein-ligand docking
The dependent techniques and related
problems
51
Conclusion


Biologists identify new problems
Informatists enhance algorithms
Bioinformatists transform biomedical
problems into computer science problems
52
Appendix
53
RAME (Rank-based Adaptive Mutation
Evolutionary) algorithm
Current generation
Generate reference Gaussian
distributions according to sample rank
Rank these samples
Generate the next generation according
to these reference Gaussian distributions 54
Diversity of sampling locations
RAME algorithm
LGA in AutoDock
55
Suppose that the distances from “?” to its
nearest “O” and “X” are equal
X
O
OOO
OO
O O
O O
O
O
O O
X
X
d
?
X
X
d
X
X
X
X
X
X
X
X
X
56
It is more likely to be a “X” because of the
density
X
O
OOO
OO
O O
O O
O
O
O O
X
X
X
X
X
X
X
X
X
X
X
X
X
X
57
It looks like an outlier when being
predicted as a “O”
X
O
OOO
OO
O O
O O
O
O
O O
X
X
O
X
X
X
X
X
X
X
X
X
X
X
58