A Study of Residue Correlation within Protein Sequences
A Study of Residue Correlation within
Protein Sequences and its Application
to Sequence Classification
Christopher Hemmerich
Advisor: Dr. Sun Kim
Outline
Background & Motivation
Data
Methods
Experiments
Conclusions
Acknowledgements
Bibliography
Background & Motivation
Prior work has studied correlation between
positions in protein families (Cline et al.,
2002; Martin et al., 2005)
These studies used multiple sequence
alignments to detect correlation and link it to
protein structure and residue co-evolution
Correlation across an MSA has been tied to
co-evolution and contact points in the protein
structure.
Less work has addressed correlation between
residues within a single sequence, and its
significance is less clear
Background & Motivation
“protein sequences can be regarded as
slightly edited random strings” (Weiss et al.,
2000)
Can we detect the increased correlation in
protein sequences vs random sequences?
Is there correlation between distant residues?
Is correlation characteristic of the protein
structure?
Can we measure correlation for hydropathy
or other residue non-specific interactions?
The Protein Families Database
We use the Pfam-A subset, consisting of
around 8000 curated families
Pfam-A contains families with a wide
variety of sequence length and number of
sequences
Pfam-A contains multiple sequence
alignments for families
Limit experiments to sequences containing
100 or more residues to reduce sampling
effects
Methods: Measuring Correlation
Let's look towards Information Theory
How can we predict the next residue?
Pick the most frequently observed residue
We feel more certain about our guess with the
second sequence as it seems less random
Methods: Measuring Correlation
We can quantify the uncertainty in a
sequence with Shannon Entropy
Entropy is maximal when P_i is uniform for
all i
Entropy is 0 when P_i = 1 for some i
The lower the Entropy, the better our
prediction should be
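The entropy measure described above can be sketched in a few lines. This is a minimal illustration, assuming the standard definition H = -sum_i P_i log2(P_i) with observed residue frequencies standing in for the probabilities P_i; the function name `shannon_entropy` is chosen here for illustration and does not come from the slides.

```python
import math
from collections import Counter


def shannon_entropy(sequence):
    """Return the Shannon entropy (in bits) of the residue frequencies in `sequence`."""
    counts = Counter(sequence)
    total = len(sequence)
    entropy = 0.0
    for count in counts.values():
        p = count / total  # observed frequency used in place of a probability
        entropy -= p * math.log2(p)
    return entropy


# A sequence made of one repeated residue has no uncertainty;
# four equally frequent residues give the maximum for a 4-letter alphabet.
print(shannon_entropy("AAAA"))
print(shannon_entropy("ACGT"))
```

Lower values mean the most frequent residue is a better guess for the next position.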
Methods: Measuring Correlation
Should we guess 'N'? Is there a correlation
between 'V' and 'K'? Between 'N' and 'N'?
We can measure the correlation with
Mutual Information for the sequence
Substitute frequencies for probabilities
Mutual Information Example
Sequence:
AANANK
MI( AANANK ) = MI( JJCJCL )
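The example can be reproduced with a short sketch. It assumes the usual definition MI = sum over pairs (a, b) of P(a, b) log2( P(a, b) / (P(a) P(b)) ), with frequencies substituted for probabilities as stated earlier. The slides do not show how the marginals P(a) and P(b) were estimated; here they are taken from the first and second members of the neighboring pairs, which is an assumption, and the function name is illustrative.

```python
import math
from collections import Counter


def neighbor_mutual_information(sequence):
    """Mutual information (bits) between each residue and its right-hand neighbor."""
    pairs = list(zip(sequence, sequence[1:]))
    if not pairs:
        return 0.0
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)   # marginal of the left residue
    second_counts = Counter(b for _, b in pairs)  # marginal of the right residue
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        p_a = first_counts[a] / n
        p_b = second_counts[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi


# Relabeling A->J, N->C, K->L leaves the score unchanged.
print(neighbor_mutual_information("AANANK"))
print(neighbor_mutual_information("JJCJCL"))
```

The two calls print the same value because the score depends only on the pattern of co-occurrence, not on which letters appear.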
Experiment: Measuring Correlation
Sample 100 sequences from Pfam
Shuffle each sequence 100 times
use shuffle command from HMMER package
preserves length and residue frequency of
sequence
randomly re-orders residues
Compare MI score for each sequence to the
MI scores of its shuffles
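The shuffle comparison can be sketched as follows. The experiment uses the shuffle command from the HMMER package; this sketch swaps in Python's `random.shuffle`, which likewise preserves length and residue frequency. Summarizing the comparison as a z-score is an assumption made here for illustration, and `score` is a hypothetical parameter standing for any sequence scoring function, such as a mutual information measure.

```python
import random
import statistics


def shuffled_copies(sequence, n_shuffles=100, seed=None):
    """Return `n_shuffles` random re-orderings of `sequence` with the same composition."""
    rng = random.Random(seed)
    residues = list(sequence)
    copies = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)  # in-place permutation; composition is unchanged
        copies.append("".join(residues))
    return copies


def z_score_against_shuffles(sequence, score, n_shuffles=100, seed=None):
    """Compare score(sequence) with the scores of its shuffled copies."""
    observed = score(sequence)
    null_scores = [score(s) for s in shuffled_copies(sequence, n_shuffles, seed)]
    mean = statistics.mean(null_scores)
    spread = statistics.pstdev(null_scores)
    if spread == 0:
        return 0.0  # every shuffle scored the same, so no comparison is possible
    return (observed - mean) / spread
```

A clearly positive result would indicate that the original ordering carries more correlation than its composition alone explains.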
Results: Correlation
Results: Normalized Correlation
Methods: Correlation Classification
Nearest Neighbor classification algorithm
plot N-dimensional vector in space
3 Training Classes
Test Vector
Methods: Correlation Classification
Measure the distance from the new point
to each existing point
Assign the family of
nearest training point
to the test vector
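The nearest neighbor step above can be sketched as below. The slides say only "distance"; Euclidean distance is assumed here, and the function and parameter names are illustrative.

```python
import math


def nearest_neighbor(training, test_vector):
    """Return the family of the training vector closest to `test_vector`.

    `training` is a list of (vector, family) pairs; all vectors share one length.
    """
    best_family = None
    best_distance = math.inf
    for vector, family in training:
        distance = math.dist(vector, test_vector)  # Euclidean distance (assumed)
        if distance < best_distance:
            best_distance = distance
            best_family = family
    return best_family


# Three toy training classes in two dimensions and one test vector.
training = [((0.0, 0.0), "family_1"), ((5.0, 5.0), "family_2"), ((0.0, 9.0), "family_3")]
print(nearest_neighbor(training, (4.0, 4.5)))
```

Unlike the BLAST approach, this always returns a family as long as the training set is not empty.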
Methods: NCBI BLAST Classification
Build BLAST database from training
sequences with formatdb
BLAST the test sequence against the database
with default parameters
Classify the test sequence according to the
highest scoring match (High-scoring
Segment Pair)
If no sequence match is found,
classification fails
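The decision rule above can be sketched as a small parser. This assumes the search was run with tabular output (for example `-m 8` in the legacy `blastall` program), where each line has 12 tab-separated columns, the second being the subject ID and the last the bit score. The mapping `family_of` from training sequence ID to family is hypothetical, and the slides do not say which score was used to rank matches.

```python
def classify_from_blast_tabular(tabular_text, family_of):
    """Return the family of the highest scoring hit, or None if there are no hits."""
    best_subject = None
    best_bit_score = float("-inf")
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        subject_id = fields[1]
        bit_score = float(fields[11])
        if bit_score > best_bit_score:
            best_bit_score = bit_score
            best_subject = subject_id
    if best_subject is None:
        return None  # no match found: classification fails
    return family_of[best_subject]
```

Returning `None` mirrors the failure case on the slide, so failed attempts can still be counted as classification attempts.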
Methods: Experimental Method
Randomly select 10 families from the Pfam
database
Evaluate classification techniques on each
possible combination of 3 families from the
10
The results of all sub-experiments are
summed
Accuracy is measured by:
(# of correct classifications) / (# of classification attempts)
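The experimental loop can be sketched as follows. Choosing 3 families from 10 gives 120 sub-experiments, and their counts are summed before the accuracy is computed. The callable `run_sub_experiment` is hypothetical: it stands for whatever classifier evaluation is being tested and is assumed to return a (correct, attempts) pair.

```python
from itertools import combinations


def pooled_accuracy(families, run_sub_experiment):
    """Sum results over every combination of 3 families and return overall accuracy."""
    correct = 0
    attempts = 0
    for trio in combinations(families, 3):
        sub_correct, sub_attempts = run_sub_experiment(trio)
        correct += sub_correct
        attempts += sub_attempts
    return correct / attempts if attempts else 0.0
```

Summing before dividing weights each classification attempt equally, so sub-experiments with more sequences contribute more to the final figure.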
Methods: Leave-one-out Validation
Comprehensive
Validation
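Leave-one-out validation can be sketched as below. The slide only names the technique, so the details are assumptions: each sample is held out in turn, the classifier is trained on the rest, and the fraction of correct predictions is reported. The `classify` parameter is a hypothetical callable taking (training, vector) and returning a predicted label.

```python
def leave_one_out_accuracy(samples, classify):
    """Hold out each (vector, label) sample once and classify it with the rest."""
    correct = 0
    for i, (vector, label) in enumerate(samples):
        training = samples[:i] + samples[i + 1:]  # every sample except the held-out one
        if classify(training, vector) == label:
            correct += 1
    return correct / len(samples)
```

Every sample is used for testing exactly once, which is why the approach suits families with few sequences.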
Results: Neighbor Correlation
Experiment: Long Range Correlation
Extend correlation measure beyond
neighboring residues
gap: number of residues between the
residues we are comparing
we are considering the pairing of all
residues within 20 positions of each other
MI Vector = [ MI(0), MI(1), … MI(19) ]
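The MI vector can be sketched by generalizing the neighbor measure to a gap. Following the definition above, gap is the number of residues between the two positions, so gap 0 pairs adjacent residues. Estimating the marginals from the pairs themselves is an assumption, and the function names are illustrative.

```python
import math
from collections import Counter


def gapped_mutual_information(sequence, gap):
    """Mutual information (bits) between residues separated by `gap` other residues."""
    step = gap + 1  # gap 0 means directly neighboring positions
    pairs = list(zip(sequence, sequence[step:]))
    if not pairs:
        return 0.0  # sequence too short for this gap
    n = len(pairs)
    pair_counts = Counter(pairs)
    first_counts = Counter(a for a, _ in pairs)
    second_counts = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in pair_counts.items():
        p_ab = count / n
        mi += p_ab * math.log2(p_ab / ((first_counts[a] / n) * (second_counts[b] / n)))
    return mi


def mi_vector(sequence, max_gap=20):
    """Return [MI(0), MI(1), ..., MI(max_gap - 1)] for one sequence."""
    return [gapped_mutual_information(sequence, gap) for gap in range(max_gap)]
```

Each sequence then becomes one 20-dimensional point that a nearest neighbor classifier can compare with others.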
Results: 20D-Correlation Vector
Experiment: Physical Properties
Not all intra-protein interactions are
residue specific
Cline et al. (2002) explore information
attributed to hydropathy, charge, disulfide
bonding, and burial
Hydropathy was found to contain half as much
information as the 20-element amino acid
alphabet, and its 2-element alphabet is
more resistant to finite-sample size effects
Hydropathy Alphabet
Hydrophobic: C,I,M,F,W,Y,V,L
Hydrophilic: R,N,D,E,Q,H,K,S,T,P,A,G
This partitioning is from Weiss et al. (2000)
Convert every residue in a sequence to
a ‘+’ or ‘-’
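The conversion can be sketched as below, using the partition listed above. The slides do not say which class maps to which symbol; '+' for hydrophobic and '-' for hydrophilic is an assumption here (mutual information is unaffected by the choice). Raising an error on any other letter is also a choice made for this sketch.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def to_hydropathy(sequence):
    """Rewrite a protein sequence over the 2-letter hydropathy alphabet."""
    symbols = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            symbols.append("+")  # assumed symbol for hydrophobic residues
        elif residue in HYDROPHILIC:
            symbols.append("-")  # assumed symbol for hydrophilic residues
        else:
            raise ValueError(f"unexpected residue: {residue!r}")
    return "".join(symbols)


print(to_hydropathy("MKVLAAGIC"))
```

The converted string can be scored with the same correlation measure as the original sequence.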
Results: Hydropathy Correlation
Experiment: Combined Vectors
Combine residue and hydropathy
correlation vectors
A single 40 dimensional vector per
sequence
Results: Combined Vectors
Conclusions
Correlation was strong enough for building
sequence classifiers without using
sequence
Significant Long Range Correlation between
protein sequence residues
Correlation exists in terms of residues and
physical properties
Future Work
More comprehensive study of long range
interactions
how much distance should we consider?
analyze gap distances individually and compare
look for combination of distances and methods to
most improve classification power
Explore other physical properties
Measure correlation of residue groups
Investigate normalization or correction
techniques to reduce sampling effects
Acknowledgements
Dr. Sun Kim
Dr. Mehmet Dalkilic
The Center for Genomics and
Bioinformatics
Computing resources
Support throughout this process
References
Aha D, Kibler D, Albert M. Machine Learning 6, 1991
Bateman A, Coin L, et al. Nucleic Acids Research 32, 2004
Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structure,
Function, and Genetics 49, 2002
Kohavi R. International Joint Conference on AI, 1995
Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22), 2005
Shannon C. The Bell System Technical Journal 27, 1948
Weiss O, Jiménez-Montaño M, Herzel H. J. Theor. Biol. 206, 2000