Mika-FA-090904 - RNA Informatics @ UGA

Fragment Assembly method
Mika Takata
Outline

Fragment Assembly
• Basic theory
• Process
• Techniques
• David Baker’s group approaches
• Other top-ranked approaches at CASP7
• Discussion
Fragment theory
• Short fragments (5 to 9 residues) tend to adopt specific conformations
• These tendencies recur throughout protein structures
• Sequences that adopt a particular short structural motif are somewhat similar [Unger et al., 1989][Rooman et al., 1990]
• A protein can be “re-constructed” from fragments excised from other proteins in a library [Jones and Thirup, 1986; Claessens et al., 1989]
Appendix: Other basics
• Levinthal’s paradox (Levinthal 1968)
• Local structural bias
• Recurrent sequence pattern
• Pauling and Corey (1951); Rooman et al. 1990; Bystroff et al. 1996; Han et al. 1997; Bystroff and Baker 1998; Camproux et al. 1999; Gerard 1999; Li et al. 2008
• HMMSTR (Bystroff et al. 2000): a probabilistic version of a structural motif library
• Ramachandran basin technique: divides both the φ and ψ angle intervals into six ranges
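
As an illustration of discretizing backbone angles into six ranges, here is a minimal sketch. It assumes six equal-width 60° bins starting at -180°; the slide does not state the actual bin boundaries, and the function names `angle_bin` and `basin` are hypothetical.

```python
def angle_bin(angle_deg, n_bins=6):
    """Map an angle in degrees to a bin index in [0, n_bins).

    Assumption: equal-width bins over [-180, 180); the slide does not
    give the real interval boundaries.
    """
    width = 360.0 / n_bins
    wrapped = (angle_deg + 180.0) % 360.0  # shift [-180, 180) onto [0, 360)
    return int(wrapped // width)


def basin(phi_deg, psi_deg, n_bins=6):
    """Return the (phi bin, psi bin) pair for one residue."""
    return angle_bin(phi_deg, n_bins), angle_bin(psi_deg, n_bins)


# A typical helical residue (phi = -60, psi = -45) falls into one discrete state.
state = basin(-60.0, -45.0)
```

With six ranges per angle, each residue is described by one of 36 discrete (φ, ψ) states.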
FA: Fragment Assembly – basic theory
• Highly rated at CASP
• Developed by David Baker’s team
• Assembles consecutive short fragments
• Applied to proteins with low sequence identity (less than 30%)

Fragment extraction → assemble local structures → global optimization → whole structure
• Fragment extraction: choose several candidates of about 10 consecutive residues
• Sampling; lowest free energy
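
The fragment extraction step can be sketched as follows. This is a toy illustration only: it cuts the query into consecutive windows and ranks library fragments by plain sequence identity, which is an assumed stand-in for whatever similarity measure a real method uses. The names `windows`, `identity`, and `choose_candidates` are hypothetical.

```python
def windows(sequence, length=9):
    """All consecutive windows of `length` residues, with their start index."""
    return [(i, sequence[i:i + length])
            for i in range(len(sequence) - length + 1)]


def identity(a, b):
    """Fraction of identical residues between two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def choose_candidates(window, library, k=3):
    """Return the k library fragments most similar to the window.

    Assumption: similarity is plain sequence identity; `library` is a
    list of fragment sequences taken from known structures.
    """
    same_len = [frag for frag in library if len(frag) == len(window)]
    return sorted(same_len,
                  key=lambda frag: identity(window, frag),
                  reverse=True)[:k]


# Several candidates are kept per window; assembly then samples among them.
query = "ABCDEFGHIJ"
per_window = {start: choose_candidates(w, ["ABCDEFGHI", "BCDEFGHIK"], k=2)
              for start, w in windows(query, 9)}
```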
Building blocks library
• How many fragments do we need to cover all protein structures?
  • 78% of all hexamer structures in the test group were covered by a library of 81 hexamers [Unger et al., 1989].
• What fragment length should we use?
  • A fixed size
    • Most studies use 6 amino acids [Unger et al.]
    • Others use 5 to 8
    • A length of 9 performs better than other fragment lengths below 15 amino acids [Bystroff et al., 1996]
    • A combination of lengths 3, 9, and 12 [Baker et al., 1998]
  • Adjusted size
    • Library of “natural” building blocks
    • Lengths from 3~4 to 10~12 residues
Library of building blocks
• The polypeptide chain is represented by a sequence of rigid fragments, concatenated without any degrees of freedom [Kolodny et al., 2002]
• The quality of the total conformation depends on
  i. The length (f) of the fragments used
  ii. The size (s) of the library
• Complexity of a library = $s^{1/(f-1)}$
• Accuracy ranges from 2.9 Å RMSD (complexity 2.7, f = 7) to 0.76 Å RMSD (complexity 15, f = 5)
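
A small worked example of the complexity measure, reading the slide's formula as library size raised to the power 1/(f - 1). The exponent is taken from the slide at face value; the overlap convention in the original paper may differ, so treat it as an assumption. The function names are hypothetical.

```python
def library_complexity(size, fragment_length):
    """Average number of states per residue, as written on the slide:
    complexity = size ** (1 / (fragment_length - 1)).
    """
    return size ** (1.0 / (fragment_length - 1))


def library_size(complexity, fragment_length):
    """Inverse relation: library size implied by a given complexity."""
    return complexity ** (fragment_length - 1)


# Example: 81 fragments of length 5 give 81 ** (1/4) = 3 states per residue.
c = library_complexity(81, 5)
```

The measure makes libraries with different fragment lengths comparable: a longer fragment needs many more library entries to reach the same number of states per residue.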
Overlapping manner
Using the library
• Building blocks are concatenated in an overlapping manner
  • Superimposition: one block is fused onto the next
• The level of “superimposability”: how well the overlapping blocks match
  • Too low: the two fragments in the query do not belong together
  • Too high: the two fragments are connected rigidly, so the chain is not flexible enough to reconstruct the overall conformation
Residue to All-atom conformation
• Concatenation
• Non-local interactions
  I. Knowledge-based potential functions derived from the protein database
  II. Potential functions based on chemical intuition
• C-alpha conformation search
• Non-local interaction

Backbone evaluation
Global evaluation
Backbone structure
• Naïve approach [Unger et al., 1989]
• Fragment clustering and clustering algorithm [Bonneau et al., 2001]
• Approach based on graph algorithms
• Optimization algorithms such as Monte Carlo and genetic algorithms [Unger and Moult, 1993; Pedersen and Moult, 1997a, 1997b; Yadgari et al., 1998]
[Figure: simple energy function (energy: high) → full all-atom models → all-atom energy evaluation → best (energy: low)]
Baker’s method I (top of Free Modeling)
CASP7
Low-resolution fragment assembly (backbone structure) + full-chain refinement
1. Sampling
  • Query + its homologs (max 30 sequences)
  • 10^5 to 10^6 short fragments
  • Long-range beta-strand pairing based on secondary structure prediction
2. All-atom energy function
  • High computational power: ROSETTA@Home
  • Simple-topology targets of about 100 residues are treatable
  • Good secondary structure prediction -> high accuracy (<3 Å) for targets of about 100 residues
Baker’s method II: constructing the main-chain structure
• The nearest neighbors of a segment show the sequence-to-structure mapping around that segment [Han & Baker, 1995]
• Reliable even without knowledge of the true structure [Yi & Lander, 1993]
• 25 nearest neighbors are used [Baker et al., 1997] [eq. (1)]
[Figure: overlapping fragments i-1, i, and i+1]
Overlapping: all-atom refinement
• Conformations with two atoms within 2.5 Å of each other are rejected
• Metropolis criterion [eq. (2)]
Equation
The nearest neighbor
$$\mathrm{DISTANCE} = \sum_{i}^{f} \sum_{aa}^{20} \left| S(aa, i) - X(aa, i) \right| \qquad (1)$$
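
Eq. (1) can be sketched in code as below. The sketch assumes that S(aa, i) and X(aa, i) are amino-acid frequencies at position i in the sequence profiles of the query segment and of a candidate segment, each given as a list of per-position dictionaries. The names `profile_distance` and `nearest_neighbors` are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def profile_distance(s_profile, x_profile):
    """Eq. (1): sum over positions i and amino acids aa of |S(aa,i) - X(aa,i)|.

    Each profile is a list (one entry per position) of dicts mapping a
    one-letter amino-acid code to its frequency; missing keys count as 0.
    """
    if len(s_profile) != len(x_profile):
        raise ValueError("profiles must cover the same number of positions")
    total = 0.0
    for s_col, x_col in zip(s_profile, x_profile):
        for aa in AMINO_ACIDS:
            total += abs(s_col.get(aa, 0.0) - x_col.get(aa, 0.0))
    return total


def nearest_neighbors(query_profile, candidates, k=25):
    """Return the k (name, profile) candidates with the smallest distance."""
    return sorted(candidates,
                  key=lambda c: profile_distance(query_profile, c[1]))[:k]


# Tiny two-position example.
q = [{"A": 1.0}, {"G": 0.5, "S": 0.5}]
y = [{"A": 0.5, "V": 0.5}, {"G": 1.0}]
d = profile_distance(q, y)
```

The absolute value matters: each profile column sums to one, so without it the differences would cancel.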
All-atom refinement
$$P(\text{structure} \mid \text{sequence}) \propto e^{-(\text{radius of gyration})^2} \prod_{i<j} \frac{P(r_{ij} \mid aa_i, aa_j)}{P(r_{ij})} \qquad (2)$$
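
The two acceptance rules (reject conformations with atoms closer than 2.5 Å, otherwise apply the Metropolis criterion to the score of eq. (2)) might look like the sketch below. Several points are assumptions: the score is handled in log form, the `temperature` parameter is illustrative, the pair probability ratios are supplied by the caller, and the clash check tests every pair of input points, so the input should contain only atoms that are not expected to be bonded.

```python
import math
import random


def log_score(radius_of_gyration, pair_ratios):
    """Log of eq. (2) up to an additive constant.

    pair_ratios: values of P(r_ij | aa_i, aa_j) / P(r_ij) for all i < j
    (assumed to be precomputed by the caller).
    """
    return -radius_of_gyration ** 2 + sum(math.log(r) for r in pair_ratios)


def has_clash(coords, min_dist=2.5):
    """True if any two points are closer than min_dist (in Å)."""
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) < min_dist:
                return True
    return False


def metropolis_accept(old_log_score, new_log_score, temperature=1.0, rng=random):
    """Always accept an improvement; accept a worse move with
    probability exp(delta / temperature), where delta < 0."""
    delta = new_log_score - old_log_score
    if delta >= 0:
        return True
    return rng.random() < math.exp(delta / temperature)


def accept_move(coords, old_log_score, new_log_score, rng=random):
    """Combine both rules: clash rejection first, then Metropolis."""
    if has_clash(coords):
        return False
    return metropolis_accept(old_log_score, new_log_score, rng=rng)
```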
Baker’s method III – overlapping
Side-chain refinement
• Expected neighbor density around each residue
  • The number of Cβ atoms of other residues within 10 Å of the Cβ atom of the residue
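
The neighbor density can be counted directly from coordinates. A minimal sketch, assuming one representative atom per residue is passed in as an (x, y, z) tuple; the function name `neighbor_counts` is hypothetical.

```python
import math


def neighbor_counts(atom_coords, cutoff=10.0):
    """For each residue, count the other residues whose representative
    atom lies within `cutoff` Å of this residue's representative atom.
    """
    counts = []
    for i, a in enumerate(atom_coords):
        n = sum(1 for j, b in enumerate(atom_coords)
                if j != i and math.dist(a, b) <= cutoff)
        counts.append(n)
    return counts


# Two residues 5 Å apart see each other; the one 20 Å away sees nobody.
example = neighbor_counts([(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (20.0, 0.0, 0.0)])
```

Buried residues get high counts and exposed residues low counts, which is what lets the observed density be compared with the density expected for each residue type.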
Baker’s method IV: all-atom refinement
• CASP7
Improvement
• Main-chain accuracy does not change much, but the final conformation is greatly improved
Problem
• 500k CPU hours per domain
• 140k computers with a combined performance of 37 TFLOPS
• Long proteins are not tractable
• The all-atom energy landscape is rugged
Top of CASP7 (FM section)
CASP7
• 2 sections: Template Based Modeling (TBM) and Free Modeling (FM)
• Baker, Zhang, and Zhang-server dominated both sections

Group          Method
Baker          ROSETTA (FA)
Zhang          I-TASSER (FA, Replica Exchange, Lattice Model)
Zhang-server   I-TASSER (FA, Replica Exchange, Lattice Model)
SBC            Server results (Meta Selector)
POEM-REFINE    ROSETTA (FA), Full-atom Refinement
GeneSilico     ROSETTA (FA)
ROBETTA        ROSETTA (FA)
ROKKO          SimFold + FA
Jones-UCL      FRAGFOLD (FA)
SAM-T06        Frag finder, Undertaker (FA)
TASSER         FA, Replica Exchange, Lattice Model
CASP7
I-TASSER protocol (top of template-based modeling)
• Various lengths
• 1 to 2 days per sequence to submit a final prediction
• ~4 Å (TBA), ~11 Å (FA) RMSD [8]
Summary
• The fragment assembly method simplifies the protein folding problem
  • It does not build a new structure for a query from scratch; it selects the correct parts and fits them together into an accurate conformation
  • Local compactness is taken into account by using known data
• Baker’s high success
  • All-atom refinement using high computational power
• Problems
  • High computational cost
    • Distributed computing, e.g. Rosetta@home
    • Sampling methods
To improve

Fragment Assembly
• How to choose fragments
  • Where to cut and separate
  • What the optimal length is
• How to constrain
  • Competitive learning?
• Scoring function
  • cf. statistical potential energy, Bayesian scoring function
Reference
1. Ron Unger, The Building Block Approach to Protein Structure Prediction, The New Avenues in Bioinformatics, 2004, 177-188; http://www.springerlink.com/content/h63474928680757x/
2. Kim T. Simons, Charles Kooperberg, Enoch Huang, and David Baker, Assembly of Protein Tertiary Structures from Fragments with Similar Local Sequences using Simulated Annealing and Bayesian Scoring Functions, J. Mol. Biol. (1997) 268, 209-225
3. Shuai Cheng Li, Dongbo Bu, Jinbo Xu, and Ming Li, Fragment-HMM: A new approach to protein structure prediction, Protein Science (2008), 17: 1925-1934
4. Vladimir Yarov-Yarovoy, Jack Schonbrun, and David Baker, Multipass Membrane Protein Structure Prediction Using Rosetta, Proteins. 2006 March 1; 62(4): 1010-1025
5. Rhiju Das and David Baker, Prospects for de novo phasing with de novo protein models, Biological Crystallography, ISSN 0907-4449
6. Arthur M. Lesk, Loredana Lo Conte, and Tim J.P. Hubbard, Assessment of Novel Fold Targets in CASP4: Predictions of Three-Dimensional Structures, Secondary Structures, and Interresidue Contacts, Proteins: Structure, Function, and Genetics Suppl 5: 98-118 (2001)
7. David Baker, CASP 7; http://www.cs.nott.ac.uk/CFNJC/slides/Presentation_Pawel_JC_04-2008.pdf
8. Y. Zhang, I-TASSER; http://zhang.bioinformatics.ku.edu/I-TASSER/
Previous approach and experiments using Fragment Assembly
Main backbone
• Need to improve the main-chain structure (Cα conformation)
• Need to apply FA theory
• Need to use classification
  • SCOP: class, fold, superfamily, family, domain level
Cα conformation prediction
• Remote homology profiling
  • PSI-BLAST profile
• Classification
  • Fold, superfamily, family level
  • SCOP
• Fragment assembly
  • 5~11 residues
Previous Searching approach based on FA
1.
2. Use
3.
From previous experiment…
• Target
  • 1aa2 (108 residues)
• Fragment
  • 7-amino-acid fragments
• Classification
  • Family, superfamily level
  • Training data: the 10 hits with the lowest e-values
Global scoring function
• HCF: Hydrophobic compactness function
$$\mathrm{HCF} = \frac{\sum_{i} \left( (x_i - \bar{x})^2 + (y_i - \bar{y})^2 + (z_i - \bar{z})^2 \right)}{N}$$
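
Reading the HCF formula as the mean squared distance of N points from their centroid, a minimal sketch looks like this. Which residues enter the sum (presumably the hydrophobic ones, given the name) is an assumption left to the caller, and the function name `hcf` is hypothetical.

```python
def hcf(coords):
    """Mean squared distance of the given points from their centroid.

    coords: list of (x, y, z) tuples, one per residue included in the sum.
    """
    n = len(coords)
    if n == 0:
        raise ValueError("need at least one coordinate")
    cx = sum(p[0] for p in coords) / n
    cy = sum(p[1] for p in coords) / n
    cz = sum(p[2] for p in coords) / n
    total = sum((x - cx) ** 2 + (y - cy) ** 2 + (z - cz) ** 2
                for x, y, z in coords)
    return total / n


# Two points 2 Å apart: each is 1 Å from the centroid, so HCF = 1.0.
value = hcf([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)])
```

A lower value means the selected residues are packed more tightly around their common center.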
Result (1)
dRMS (Å), family-level classification
Lattice model: cubic lattice 14.09, FCC lattice 8.68
Low energy, top 10 (Best / mean / SD): 1.56 / 0.870 / 0.886
All FA sets (68) (Best / mean / max / SD): 0.20 / 8.49 / 20.96 / 5.98
$$\mathrm{dRMS} = \sqrt{\frac{2 \sum_{j=1}^{J-1} \sum_{i=j+1}^{J} \left( |X_{ai} - X_{aj}| - |X_{bi} - X_{bj}| \right)^2}{J(J-1)}}$$
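
The dRMS measure compares the internal distances of two structures a and b over all J(J-1)/2 residue pairs, so it needs no superposition. A minimal sketch, assuming both structures are given as equal-length lists of (x, y, z) tuples; the function name `drms` is hypothetical.

```python
import math


def drms(a, b):
    """Distance RMS between structures a and b with J points each:
    sqrt( 2 / (J (J - 1)) * sum over pairs j < i of
          (|Xa_i - Xa_j| - |Xb_i - Xb_j|)^2 ).
    """
    if len(a) != len(b):
        raise ValueError("structures must have the same number of points")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two points")
    total = 0.0
    for j in range(n - 1):
        for i in range(j + 1, n):
            diff = math.dist(a[i], a[j]) - math.dist(b[i], b[j])
            total += diff * diff
    return math.sqrt(2.0 * total / (n * (n - 1)))


# One pair whose separation differs by 1 Å gives dRMS = 1.0.
example = drms([(0, 0, 0), (2, 0, 0)], [(0, 0, 0), (1, 0, 0)])
```

Because only internal distances enter, translating or rotating either structure leaves the value unchanged.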

Result (2)
dRMS (Å), family + superfamily level
Low energy, top 10 (Best / mean / SD): 1.56 / 0.870 / 0.886
All FA sets (68) (Best / mean / max / SD): 0.20 / 5.83 / 21.0 / 5.98
Appendix (ii): Experiment, all-atom
• Purpose
  • All-atom complexity
• Data
  • 10 relatively small proteins
• Lattice model potential energy function
  • Scoring function based on chemical features
    • Hydrophilicity, hydrophobicity, electric charge, tendency of side chain and electric charge
• Accuracy measurement
  • RMSD
Face-Centered Cubic Lattice Model
• Nearest neighbors: 12 residues
• Closest to the real model -> accounts for the space among residues
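
The 12 nearest neighbors of a face-centered cubic lattice site can be enumerated explicitly. The sketch below uses the common integer convention in which the neighbor vectors are all permutations of (±1, ±1, 0); that coordinate convention and the function names are assumptions, not taken from the slides. The simple cubic lattice is included for comparison.

```python
from itertools import product


def fcc_neighbors():
    """The 12 nearest-neighbor vectors of an FCC lattice site:
    one coordinate is 0 and the other two are +1 or -1."""
    vecs = set()
    for zero_axis in range(3):
        for s1, s2 in product((1, -1), repeat=2):
            v = [s1, s2]
            v.insert(zero_axis, 0)  # place the zero on the chosen axis
            vecs.add(tuple(v))
    return sorted(vecs)


def cubic_neighbors():
    """The 6 nearest-neighbor vectors of a simple cubic lattice site."""
    out = []
    for axis in range(3):
        for sign in (1, -1):
            v = [0, 0, 0]
            v[axis] = sign
            out.append(tuple(v))
    return out


fcc = fcc_neighbors()
cubic = cubic_neighbors()
```

With 12 directions per step instead of 6, a chain on the FCC lattice can follow a real backbone more closely.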
Difficulties
• Accuracy of the model
• How to evaluate energy: the energy function
• How to search for the optimum
Appendix (ii): lattice to all-atom result
dRMS (Å)

Protein (PDB id)   Size   Cubic (main chain)   FCC (main chain)   Ref. (main chain)   All-atom   Ref. (all-atom)
1alg               24     9.44                 8.58               -                   10.33      -
1ku5               70     19.11                16.24              -                   20.41      -
1aa2               108    14.09                8.68               6.06                16.12      12.09
1beo               98     13.82                11.52              6.36                15.89      12.01
1ctf               68     12.58                9.40               5.45                14.19      9.20
1dkt-A             72     13.75                10.66              5.59                15.62      10.98
1fca               55     11.18                8.34               5.16                12.30      9.00
1fgp               70     12.37                8.20               5.98                14.02      11.16
1jer               110    15.64                10.99              7.53                16.90      13.79
1nkl               78     12.68                9.78               5.70                14.61      10.13
average            -      13.47                10.24              5.98                15.04      11.05
Discussion
• Classification should be applied to improve accuracy
  • Choice of fragment data
  • Accuracy of the energy function
Reference
• Yu Xia, Enoch S. Huang, Michael Levitt, and Ram Samudrala, Ab Initio Construction of Protein Tertiary Structures Using a Hierarchical Approach, Journal of Molecular Biology (2000), 300, 171-185.
• G. Raghunathan and R.L. Jernigan, Ideal architecture of residue packing and its observation in protein structures, Cambridge University, 1997, Protein Science
• Feng Jiao, Jinbo Xu, Libo Yu, Dale Schuurmans, Protein Fold Recognition Using the Gradient Boost Algorithm, University of Alberta, May 22 2006, WSPC
• http://www.ecosci.jp/amino/amino2.html
