Nonparametric Density Estimation in High Dimensions
Diego Rother, Guillermo Sapiro (University of Minnesota)
and Vijay Pande (Stanford University).
Interruptions, Questions, and Comments are Welcome!
Aside 1: Proteins
► Proteins are the “machines” that do things.
► Amino Acids (AAs) are the building blocks of proteins.
[Figure: an amino acid, with Head, Tail, and Side Chain labeled.]
► There are 20 types. They differ only in the Side Chains.
► The Side Chains “can do” things…
► Proteins are chains of Amino Acids (AAs).
► To do their job, proteins must fold.
► The sequence determines the shape.
► The shape determines the function.
Dataset: Protein Ensembles
► Protein: a thread in 3D space.
► The shape determines the function.
► Conformation = snapshot.
► It’s not rigid; it fluctuates around the native conformation.
► Ensemble = set of conformations of the same protein.
► Often only one representative is kept, and the variability is lost…
► Why don’t we estimate the pdf instead?
Representing Conformations…
► Use the Φ and Ψ torsion angles.
► Need d ≈ 2R variables to represent each conformation (R is the number of AAs):
$X_i = (x_1, x_2, \ldots, x_d)$, where each $x_j$ is a torsion angle.
… & Ensembles
N structures, d features each
Goal and difficulties
► Goal: estimate the probability density $p_X$ from the dataset $\mathcal{X} = \{X_1, \ldots, X_N\}$, with $X_i \in \mathbb{R}^d$.
► But: densities in high dimensions are very difficult to estimate; they require samples that grow exponentially in the dimension.
Estimating the density
► We can factorize the density and try to estimate the factors:
$$p(x_1, x_2, \ldots, x_d) = p(x_{i_1} \mid x_{i_2}, \ldots, x_{i_d}) \cdot p(x_{i_2} \mid x_{i_3}, \ldots, x_{i_d}) \cdots p(x_{i_d})$$
► Using the locality of the influences (only a small set of other variables affects a given variable):
$$p(x_1, x_2, \ldots, x_d) \approx p(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_n^1}) \cdot p(x_{i_2} \mid x_{i_1^2}, \ldots, x_{i_n^2}) \cdots p(x_{i_{d-n}}, \ldots, x_{i_d})$$
(here $x_{i_1^k}, \ldots, x_{i_n^k}$ are the n conditioning variables chosen for the k-th factor).
► Each factor is an independent “expert” (in Hinton’s terminology) that favors or vetoes a conformation according to its local knowledge.
► The order in which variables are “factored out” is arbitrary.
► Note the conditioning restrictions: each variable may only be conditioned on variables not yet factored out.
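As a concrete reading of this factorization, here is a minimal sketch (ours, not the authors’ code) of evaluating such a product-of-experts density: the log-density is the sum of the log of each conditional factor plus the log of the small joint tail. The `factors` container and the callables are hypothetical.

import numpy as np

# Minimal sketch: evaluate a factorized density as a product of "experts".
# `factors` is a hypothetical list of (target, cv_indices, cond_pdf) triples,
# where cond_pdf(x_target, x_cvs) returns p(x_target | x_cvs); `tail` is
# (indices, joint_pdf) for the final small joint factor.
def log_density(x, factors, tail):
    x = np.asarray(x, dtype=float)
    logp = 0.0
    for target, cvs, cond_pdf in factors:
        logp += np.log(cond_pdf(x[target], x[list(cvs)]))
    tail_idx, joint_pdf = tail
    logp += np.log(joint_pdf(x[list(tail_idx)]))
    return logp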
Estimating the density
$$p(x_1, x_2, \ldots, x_d) \approx p(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_n^1}) \cdot p(x_{i_2} \mid x_{i_1^2}, \ldots, x_{i_n^2}) \cdots p(x_{i_{d-n}}, \ldots, x_{i_d})$$
► The best model minimizes the total entropy:
$$\arg\min \Big[ H(x_{i_1} \mid x_{i_1^1}, \ldots, x_{i_n^1}) + H(x_{i_2} \mid x_{i_1^2}, \ldots, x_{i_n^2}) + \cdots + H(x_{i_{d-n}}, \ldots, x_{i_d}) \Big]$$
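Each H term must be estimated from the sample. A minimal sketch of one common resubstitution estimate (our assumption, not necessarily the authors’ exact estimator): $H(a \mid b) \approx -\frac{1}{N}\sum_i \log \hat{p}(a_i \mid b_i)$, with $\hat{p}$ the KDE-based conditional described below.

import numpy as np

# Resubstitution estimate of a conditional entropy term.
# samples_a: (N,) values of the conditioned variable;
# samples_b: (N, k) values of its conditioning variables;
# cond_pdf(a, b) -> estimated p(a | b) (hypothetical handle).
def conditional_entropy(samples_a, samples_b, cond_pdf):
    logs = [np.log(cond_pdf(a, b) + 1e-300)
            for a, b in zip(samples_a, samples_b)]
    return -np.mean(logs)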
► Questions to answer:
1. How many Conditioning Variables (CVs) should be included in each factor?
2. Which ones?
   a. In which order are the variables factored out?
   b. Which CVs go into each individual factor?
3. Once we have chosen the CVs, how do we estimate the factors?
3. How do we Estimate the Factors?
► Note that:
$$p(a \mid b, c, \ldots) = \frac{p(a, b, c, \ldots)}{p(b, c, \ldots)}$$
► To estimate terms of this form we use Kernel Density Estimation (KDE) (assuming continuity).
► KDE “spreads” the sample points using a kernel.
► How much “spread” is determined by ML (maximum likelihood).
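A minimal sketch of this KDE step, assuming isotropic Gaussian kernels with a single bandwidth h chosen by maximizing the leave-one-out likelihood. The function names are ours, and the column ordering of `samples_joint` (target first, then CVs) is an assumption.

import numpy as np

def kde(query, samples, h):
    """Gaussian-kernel density estimate at `query` (1-D, length d)."""
    d = samples.shape[1]
    sq = np.sum((samples - query) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h ** 2))) / ((2 * np.pi) ** (d / 2) * h ** d)

def loo_log_likelihood(samples, h):
    """Leave-one-out log-likelihood of bandwidth h (the 'ML spread')."""
    total = 0.0
    for i in range(samples.shape[0]):
        rest = np.delete(samples, i, axis=0)
        total += np.log(kde(samples[i], rest, h) + 1e-300)
    return total

def conditional_pdf(a, cvs, samples_joint, samples_cvs, h):
    """p(a | cvs) = p(a, cvs) / p(cvs); target is column 0 of samples_joint."""
    joint = kde(np.concatenate(([a], cvs)), samples_joint, h)
    marg = kde(np.asarray(cvs), samples_cvs, h)
    return joint / max(marg, 1e-300)

# Example: pick h on a toy 2-D sample by leave-one-out likelihood.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
best_h = max([0.1, 0.2, 0.5, 1.0], key=lambda h: loo_log_likelihood(X, h))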
Aside: Density Estimation
2.a. Order of factorization?
► The order of the factorization matters (since we are dropping variables):
$$\begin{aligned}
p(a, b, c, d, e) &= p(b \mid a, c, d, e) \cdot p(a, c, d, e) \\
&= p(b \mid a, c, d, e) \cdot p(d \mid a, c, e) \cdot p(a, c, e) \\
&= p(b \mid a, c, d, e) \cdot p(d \mid a, c, e) \cdot p(e \mid a, c) \cdot p(a, c)
\end{aligned}$$
► The most informative variables go last.
► A Genetic Algorithm is used to find the “best” order.
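A sketch of one standard way to run such a search (not the authors’ implementation): a genetic algorithm over permutations with order crossover and swap mutation. `order_cost` is a placeholder for the total-entropy objective above.

import random

def order_cost(perm):
    # Hypothetical stand-in for the summed conditional entropies of this
    # factorization order; replace with the real entropy estimate.
    return sum(i * v for i, v in enumerate(perm))

def crossover(p1, p2):
    """Order crossover (OX): keep a slice of p1, fill the rest from p2."""
    a, b = sorted(random.sample(range(len(p1)), 2))
    child = [None] * len(p1)
    child[a:b] = p1[a:b]
    fill = [v for v in p2 if v not in child]
    for i in range(len(child)):
        if child[i] is None:
            child[i] = fill.pop(0)
    return child

def mutate(perm, rate=0.1):
    perm = perm[:]
    if random.random() < rate:
        i, j = random.sample(range(len(perm)), 2)
        perm[i], perm[j] = perm[j], perm[i]
    return perm

def ga_best_order(d, pop_size=50, generations=100):
    pop = [random.sample(range(d), d) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=order_cost)
        elite = pop[: pop_size // 5]            # keep the best 20%
        children = [mutate(crossover(*random.sample(elite, 2)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return min(pop, key=order_cost)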
2.b. CVs for each factor?
► To find the best model, minimize the entropy:
$$H(b \mid c, e) + H(d \mid a, e) + H(e \mid a, c) + H(a, c)$$
► If the permutation is known, minimize each term independently: for each variable on the left, choose the n most informative variables to put on the right.
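One simple way to make this per-factor choice, sketched under our own assumptions: a greedy search that repeatedly adds the candidate CV giving the largest drop in estimated conditional entropy. The slides only say the n most informative variables are chosen, so the greedy strategy here is our stand-in.

def greedy_cvs(target, candidates, n, cond_entropy):
    """cond_entropy(target, cvs) -> estimated H(target | cvs).
    Returns the n conditioning variables chosen greedily."""
    chosen = []
    for _ in range(n):
        best = min((c for c in candidates if c not in chosen),
                   key=lambda c: cond_entropy(target, chosen + [c]))
        chosen.append(best)
    return chosen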
1. How many CV should be included?
► So, which model do we choose?
n = 0: $p(b) \cdot p(d) \cdot p(e) \cdot p(a) \cdot p(c)$?
n = 1: $p(b \mid a) \cdot p(d \mid e) \cdot p(e \mid c) \cdot p(a, c)$?
n = 2: $p(b \mid a, c) \cdot p(d \mid a, e) \cdot p(e, a, c)$?
n = 3: $p(b \mid a, c, d) \cdot p(d, a, c, e)$?
► Infinite # of samples: we can afford to include all dependencies.
► Finite # of samples: we must compromise.
► The compromise is made automatically by ML.
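A sketch of how ML could arbitrate automatically, assuming a held-out (or cross-validated) likelihood is compared across candidate values of n; `fit_model` and `log_likelihood` are hypothetical handles to the pipeline above.

def choose_n(train, heldout, candidate_ns, fit_model, log_likelihood):
    """Fit the factorized model for each n; keep the best held-out score."""
    scores = {}
    for n in candidate_ns:
        model = fit_model(train, n)            # factorization with n CVs
        scores[n] = log_likelihood(model, heldout)
    return max(scores, key=scores.get)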
Think of this as a Black Box that computes a pdf.
Now to the uses of this…
More CVs
► Compromise: more CVs capture the dependencies better, but deteriorate the density estimate of each factor.
Results: Synthetic Dataset
► Four examples:
1) $p(x_1, x_2, \ldots, x_6) = p(x_1) \cdot p(x_2) \cdot p(x_3) \cdot p(x_4) \cdot p(x_5) \cdot p(x_6)$
2) $p(x_1, x_2, \ldots, x_6) = p(x_4 \mid x_1) \cdot p(x_5 \mid x_2) \cdot p(x_6 \mid x_3) \cdot p(x_1) \cdot p(x_2) \cdot p(x_3)$
3) $p(x_1, x_2, \ldots, x_6) = p(x_5 \mid x_1, x_2) \cdot p(x_6 \mid x_3, x_4) \cdot p(x_1) \cdot p(x_2) \cdot p(x_3) \cdot p(x_4)$
4) $p(x_1, x_2, \ldots, x_6) = p(x_5 \mid x_1, x_2, x_3) \cdot p(x_6 \mid x_2, x_3, x_4) \cdot p(x_1) \cdot p(x_2) \cdot p(x_3) \cdot p(x_4)$
► Blocks show dependencies between variables.
► Graphs compare models with different #CVs; the band represents the 99% confidence interval (found using bootstrap).
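For concreteness, a sketch of drawing synthetic data with the structure of example 2 (the Gaussian choices are our illustration only), plus a bootstrap resample of the kind used for the confidence bands.

import numpy as np

rng = np.random.default_rng(1)
N = 1000
x123 = rng.normal(size=(N, 3))                 # p(x1) p(x2) p(x3)
x456 = x123 + 0.3 * rng.normal(size=(N, 3))    # p(x4|x1) p(x5|x2) p(x6|x3)
data = np.hstack([x123, x456])                 # N samples, 6 features

# Bootstrap resample (for the confidence bands): rows with replacement.
boot = data[rng.integers(0, N, size=N)]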
Results: Real Data
► Dataset 1: β-hairpin tryptophan zipper; 12 residues (22 torsion angles); 481 conformations.
► Dataset 2: villin protein; 36 residues (70 torsion angles); 1543 conformations.
► The optimal number of CVs depends on the sample size.
Data provided by Bojan Zagrovic (Pande’s group).
Applications: DS1
[Figure: dependency matrix for DS1; rows = conditioned variable (φ1–φ11, ψ1–ψ11), columns = conditioning variable (φ1–φ11, ψ1–ψ11).]
► Dependencies between variables. Example: a factor of the form $p(\phi_4 \mid \psi_4, \psi_5)$.
► Angles are often conditioned on adjacent angles of the same kind or corresponding angles of the other kind.
► There is more conditioning on ψ’s, which agrees with the roles of φ and ψ in the Ramachandran map.
► Similar results on DS2.
Applications: DS1 & DS2
► Method to find the most and least likely conformations.
► Some conformations are very unlikely.
► The experimental structures are among the middle and least likely ones.
► Few conformations are much more likely than the rest.
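The ranking itself is straightforward once the density is fitted; a minimal sketch with a hypothetical `model_pdf` handle to the factorized estimate:

import numpy as np

def rank_conformations(ensemble, model_pdf):
    """ensemble: (N, d) array. Returns indices from most to least likely."""
    scores = np.array([model_pdf(x) for x in ensemble])
    return np.argsort(-scores)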
Applications: Generate New Structures
► To generate new structures, follow the gradient of p.
► Probability is significantly improved.
► The most probable structures get closer to the native one.
► We can detect the structures closest to the experimental one by their probability.
► The process amounts to assembling popular parts.
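A sketch of “following the gradient of p”: plain gradient ascent on log p with a finite-difference gradient, since the factorized KDE is only numerically differentiable here. Step size and iteration count are illustrative assumptions, not the authors’ settings.

import numpy as np

def ascend(x0, log_p, step=0.01, iters=200, eps=1e-4):
    """Move a conformation x0 uphill on log_p by finite differences."""
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(iters):
        grad = np.zeros_like(x)
        for j in range(x.size):                # finite-difference gradient
            e = np.zeros_like(x)
            e[j] = eps
            grad[j] = (log_p(x + e) - log_p(x - e)) / (2 * eps)
        x += step * grad
    return x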
Conclusions…
► Method to estimate probability densities in high dimensions (under certain hypotheses).
► “Dimensionality reduction”: the sample size must grow (exponentially, though) with the number of dependencies between variables, not with the total number of variables.
► Can be used to:
► Create new conformations.
► Compare ensembles.
► So far: computationally expensive (will improve with the Dual-Tree algorithm).
Thank you.
Validation?
[Diagram: Model/Hypothesis → Method → Samples, compared against the (unknown) true pdf.]
► Simulated structures from Folding@Home (virtually unlimited).
► Experimental structures (very few).
► We have to resort to indirect clues.
Incidentally…
► Distance used to estimate the contribution q of point x (observation) on y (query point):
$$q(x, y) = e^{-d(x, y)^2 / K}$$
► The closer y is to x, the stronger the contribution.
► Can be inverted to obtain a distance from the contribution:
$$d(x, y)^2 = -K \log \frac{q(x, y)}{q(x, x)}$$
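A quick numeric check of this inversion (our example values): since $q(x, x) = 1$, the squared distance is recovered exactly.

import numpy as np

def contribution(d2, K):
    """q(x, y) = exp(-d(x, y)^2 / K)."""
    return np.exp(-d2 / K)

def dist2_from_contribution(q_xy, q_xx, K):
    """d(x, y)^2 = -K * log(q(x, y) / q(x, x))."""
    return -K * np.log(q_xy / q_xx)

assert np.isclose(dist2_from_contribution(contribution(2.0, 0.5), 1.0, 0.5), 2.0)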
[Diagram: Local Distance ↔ Local pdf; Global Distance ↔ Global pdf.]
Dataset 2: Cow muscle shapes
$X_i = (x_1, x_2, \ldots, x_d)$
(From Arias et al.)