A Study of Residue Correlation within Protein Sequences


A Study of Residue Correlation within
Protein Sequences and its Application
to Sequence Classification
Christopher Hemmerich
Advisor: Dr. Sun Kim
Outline
Background & Motivation
Data
Methods
Experiments
Conclusions
Acknowledgements
Bibliography

Background & Motivation
Prior works have studied correlation between positions in protein families (Cline et al., 2002; Martin et al., 2005)
These studies used multiple sequence alignments (MSAs) to detect correlation and to link it to protein structure and residue co-evolution
Correlation across an MSA has been tied to co-evolution and contact points in the protein structure
Less work has addressed correlation between residues within a single sequence, where the significance is less clear
Background & Motivation
“protein sequences can be regarded as slightly edited random strings” (Weiss et al., 2000)
Can we detect increased correlation in protein sequences compared with random sequences?
Is there correlation between distant residues?
Is correlation characteristic of the protein structure?
Can we measure correlation for hydropathy or other interactions that are not residue-specific?
The Protein Families Database
We use the Pfam-A subset, consisting of around 8000 curated families
Pfam-A contains families with a wide variety of sequence lengths and numbers of sequences
Pfam-A contains multiple sequence alignments for families
Limit experiments to sequences containing 100 or more residues to reduce sampling effects

Methods: Measuring Correlation
Let's look towards Information Theory
How can we predict the next residue?
Pick the most frequently occurring residue
We feel more certain about our guess with the second sequence, as it seems less random
Methods: Measuring Correlation
We can quantify the uncertainty in a sequence with Shannon Entropy
Entropy is maximal when P_i is uniform for all i
Entropy is 0 when P_i = 1 for some i
The lower the Entropy, the better our prediction should be
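The entropy formula itself did not survive in this transcript. As a minimal sketch, assuming the standard Shannon definition H = -sum_i P_i log2 P_i with observed residue frequencies used as P_i, the two properties above can be checked directly. The function name `shannon_entropy` is illustrative, not taken from the slides.

```python
import math
from collections import Counter


def shannon_entropy(sequence: str) -> float:
    """Shannon entropy, in bits, of the residue distribution of one sequence."""
    counts = Counter(sequence)
    total = len(sequence)
    # Observed frequencies n/total stand in for the probabilities P_i.
    # p * log2(1/p) is the same quantity as -p * log2(p).
    return sum((n / total) * math.log2(total / n) for n in counts.values())


print(shannon_entropy("AAAAAAAA"))  # 0.0: one residue has P_i = 1
print(shannon_entropy("ACDEFGHI"))  # 3.0: eight residues, uniform P_i = 1/8
```

A sequence dominated by a few residues scores lower than one that uses many residues evenly, which matches the intuition that the next residue is then easier to guess.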

Methods: Measuring Correlation
Should we guess 'N'? Is there a correlation between 'V' and 'K'? Between 'N' and 'N'?
We can measure the correlation with Mutual Information for the sequence
Substitute observed frequencies for probabilities
Mutual Information Example
Sequence: AANANK
MI( AANANK ) = MI( JJCJCL )
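The worked counts for this example are not preserved in the transcript. The sketch below assumes the standard mutual information over neighboring residue pairs, MI = sum over (a, b) of P(a, b) log2[ P(a, b) / (P(a) P(b)) ], with the marginals taken from the first and second positions of the observed pairs. That choice of marginals is an assumption, as is the function name `adjacent_mi`.

```python
import math
from collections import Counter


def adjacent_mi(sequence: str) -> float:
    """Mutual information, in bits, between each residue and its right neighbor."""
    pairs = list(zip(sequence, sequence[1:]))
    n = len(pairs)
    joint = Counter(pairs)                    # counts of (left, right) pairs
    first = Counter(a for a, _ in pairs)      # marginal counts, left position
    second = Counter(b for _, b in pairs)     # marginal counts, right position
    # count*n / (first*second) equals P(a,b) / (P(a) * P(b)) with frequencies as probabilities.
    return sum(
        (count / n) * math.log2(count * n / (first[a] * second[b]))
        for (a, b), count in joint.items()
    )


# Relabeling A->J, N->C, K->L leaves the pair pattern, and so the score, unchanged.
print(adjacent_mi("AANANK"), adjacent_mi("JJCJCL"))
```

Under these assumptions the score depends only on the pattern of repeats in the sequence, not on which residues produce it, which is the point of the AANANK / JJCJCL comparison.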
Experiment: Measuring Correlation
Sample 100 sequences from Pfam
Shuffle each sequence 100 times
  Use the shuffle command from the HMMER package
  Preserves the length and residue frequency of the sequence
  Randomly re-orders residues
Compare the MI score for each sequence to the MI scores of its shuffles
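The comparison against shuffles can be sketched as follows. This is not the HMMER shuffle command: Python's `random.shuffle` is swapped in as a stand-in with the same two properties (length and residue frequencies are preserved, order is randomized). The MI definition over neighboring pairs and all function names are assumptions.

```python
import math
import random
from collections import Counter


def adjacent_mi(sequence):
    """Mutual information, in bits, between neighboring residues (assumed definition)."""
    pairs = list(zip(sequence, sequence[1:]))
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2(count * n / (first[a] * second[b]))
        for (a, b), count in joint.items()
    )


def shuffle_comparison(sequence, n_shuffles=100, seed=0):
    """Return the MI of the sequence and the MI of each of its shuffles."""
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    residues = list(sequence)
    shuffled_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(residues)          # re-orders in place; composition is unchanged
        shuffled_scores.append(adjacent_mi("".join(residues)))
    return adjacent_mi(sequence), shuffled_scores


# Toy input: a strictly periodic string, where each residue determines the next.
observed, shuffled = shuffle_comparison("ACDE" * 50)
fraction_below = sum(score < observed for score in shuffled) / len(shuffled)
print(observed, fraction_below)
```

Shuffled copies of a short sequence still show a small positive MI from finite sampling, which is why the comparison is made per sequence against its own shuffles rather than against zero.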
Results: Correlation
Results: Normalized Correlation
Methods: Correlation Classification
Nearest Neighbor classification algorithm
Plot each N-dimensional vector as a point in space
[Figure: 3 training classes and a test vector]
Measure the distance from the new point to each existing point
Assign the family of the nearest training point to the test vector
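The two steps above can be sketched as a 1-nearest-neighbor lookup. The slides do not name the distance measure, so Euclidean distance is an assumption here, and the family labels and vectors are hypothetical.

```python
import math


def nearest_neighbor(test_vector, training_set):
    """Return the family of the training vector closest to test_vector.

    training_set is a list of (vector, family) pairs.
    """
    best_family, best_distance = None, math.inf
    for vector, family in training_set:
        distance = math.dist(test_vector, vector)  # Euclidean distance (assumed)
        if distance < best_distance:
            best_family, best_distance = family, distance
    return best_family


# Hypothetical 2-dimensional vectors from three training families.
training = [
    ((0.0, 0.0), "family_1"),
    ((1.0, 1.0), "family_2"),
    ((5.0, 5.0), "family_3"),
]
print(nearest_neighbor((0.9, 1.2), training))  # family_2
```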
Methods: NCBI BLAST Classification
Build a BLAST database from the training sequences with formatdb
BLAST the test sequence against the database with default parameters
Classify the test sequence according to the highest scoring match (High-scoring Segment Pair)
If no sequence match is found, classification fails
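The last two rules can be sketched as a small post-processing step. This sketch assumes the BLAST hits were saved in the 12-column tab-separated tabular format (query, subject, ..., e-value, bit score) and that "highest scoring" means highest bit score. The slides state neither, and the function name and the `family_of` mapping are hypothetical.

```python
def classify_from_blast_hits(tabular_text, family_of):
    """Pick the family of the best-scoring hit, or None if there are no hits.

    tabular_text: BLAST hits for one query in 12-column tabular form (assumed).
    family_of: dict mapping training sequence IDs to family names (hypothetical).
    """
    best_subject, best_score = None, float("-inf")
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue                      # skip blank and comment lines
        fields = line.split("\t")
        subject_id = fields[1]            # column 2: matched training sequence
        bit_score = float(fields[11])     # column 12: bit score
        if bit_score > best_score:
            best_subject, best_score = subject_id, bit_score
    if best_subject is None:
        return None                       # no match found: classification fails
    return family_of[best_subject]


# Synthetic hit table for illustration only.
hits = (
    "q1\tseqA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t180\n"
    "q1\tseqB\t95.0\t100\t5\t0\t1\t100\t1\t100\t1e-60\t200\n"
)
print(classify_from_blast_hits(hits, {"seqA": "family_1", "seqB": "family_2"}))
```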

Methods: Experimental Method
Randomly select 10 families from the Pfam database
Evaluate classification techniques on each possible combination of 3 families from the 10
The results of all sub-experiments are summed
Accuracy is measured by: (# of correct classifications) / (# of classification attempts)
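Choosing 3 families from 10 gives C(10, 3) = 120 sub-experiments. A minimal sketch of summing their results into a single accuracy follows. The `evaluate` callback is a hypothetical placeholder for whichever classifier is being tested, and is assumed to return a (correct, attempts) pair for one 3-family combination.

```python
from itertools import combinations


def summed_accuracy(families, evaluate):
    """Sum correct classifications and attempts over every 3-family combination."""
    correct = attempts = 0
    for trio in combinations(families, 3):
        trio_correct, trio_attempts = evaluate(trio)
        correct += trio_correct
        attempts += trio_attempts
    return correct / attempts


families = [f"family_{i}" for i in range(10)]   # hypothetical family names
print(len(list(combinations(families, 3))))     # 120 sub-experiments

# Dummy evaluator for illustration: 2 of 3 attempts correct in every sub-experiment.
print(summed_accuracy(families, lambda trio: (2, 3)))
```

Summing counts before dividing weights each classification attempt equally, so sub-experiments with more test sequences contribute more to the final figure.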
Methods: Leave-one-out Validation

Comprehensive
Validation
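The validation diagram is not preserved in this transcript. As a sketch of the general leave-one-out idea only, each labelled vector is held out in turn, classified against all the others, and counted as correct if the predicted family matches. Nearest neighbor with Euclidean distance is assumed as the classifier, and the data are hypothetical.

```python
import math


def leave_one_out_accuracy(labelled_vectors):
    """Hold out each (vector, family) pair once and classify it with the rest."""
    correct = 0
    for i, (vector, family) in enumerate(labelled_vectors):
        others = labelled_vectors[:i] + labelled_vectors[i + 1:]
        # Nearest remaining point by Euclidean distance (assumed classifier).
        nearest = min(others, key=lambda item: math.dist(vector, item[0]))
        if nearest[1] == family:
            correct += 1
    return correct / len(labelled_vectors)


# Two well-separated hypothetical families.
data = [
    ((0.0, 0.0), "family_1"),
    ((0.0, 1.0), "family_1"),
    ((10.0, 10.0), "family_2"),
    ((10.0, 11.0), "family_2"),
]
print(leave_one_out_accuracy(data))  # 1.0
```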
Results: Neighbor Correlation
Experiment: Long Range Correlation
Extend the correlation measure beyond neighboring residues
gap: the number of residues between the residues we are comparing
We consider the pairing of all residues within 20 positions of each other
MI Vector = [ MI(0), MI(1), … MI(19) ]
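A sketch of building this vector, assuming that gap g pairs each residue with the one g + 1 positions later (so MI(0) is the neighboring-residue score) and that marginals come from the observed pairs. Both function names are illustrative.

```python
import math
from collections import Counter


def mutual_information(sequence, gap):
    """MI, in bits, between residues with `gap` residues between them."""
    step = gap + 1                                # gap 0 means direct neighbors
    pairs = list(zip(sequence, sequence[step:]))
    n = len(pairs)
    joint = Counter(pairs)
    first = Counter(a for a, _ in pairs)
    second = Counter(b for _, b in pairs)
    return sum(
        (count / n) * math.log2(count * n / (first[a] * second[b]))
        for (a, b), count in joint.items()
    )


def mi_vector(sequence, max_gap=20):
    """[MI(0), MI(1), ..., MI(max_gap - 1)] for one sequence."""
    return [mutual_information(sequence, gap) for gap in range(max_gap)]


vector = mi_vector("MKVLAAGIVNDE" * 10)   # made-up 120-residue toy sequence
print(len(vector))                         # 20
```

Each sequence, whatever its length, is reduced to a fixed-length vector, which is what allows it to be used as a point for nearest neighbor classification.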
Results: 20D-Correlation Vector
Experiment: Physical Properties
Not all intra-protein interactions are residue specific
Cline et al. (2002) explore information attributed to hydropathy, charge, disulfide bonding, and burial
Hydropathy was found to contain half as much information as the 20-element amino acid alphabet, and its 2-element alphabet is more resistant to finite-sample size effects

Hydropathy Alphabet
Hydrophobic: C, I, M, F, W, Y, V, L
Hydrophilic: R, N, D, E, Q, H, K, S, T, P, A, G
This partitioning is from Weiss et al. (2000)
Convert every residue in a sequence to a ‘+’ or ‘-’
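The conversion can be sketched with the partition listed above. The slides do not say which class maps to which symbol, so using '+' for hydrophobic and '-' for hydrophilic is an assumption, as is rejecting any character outside the 20 standard residues.

```python
HYDROPHOBIC = set("CIMFWYVL")
HYDROPHILIC = set("RNDEQHKSTPAG")


def to_hydropathy(sequence):
    """Rewrite a protein sequence in the 2-letter hydropathy alphabet."""
    symbols = []
    for residue in sequence.upper():
        if residue in HYDROPHOBIC:
            symbols.append("+")       # assumed symbol for hydrophobic residues
        elif residue in HYDROPHILIC:
            symbols.append("-")       # assumed symbol for hydrophilic residues
        else:
            raise ValueError(f"residue not in the 20-letter alphabet: {residue!r}")
    return "".join(symbols)


print(to_hydropathy("MKVLAAG"))  # +-++---
```

The converted string can then be scored with the same mutual information measure as the original sequence, only over two symbols instead of twenty.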

Results: Hydropathy Correlation
Experiment: Combined Vectors
Combine residue and hydropathy correlation vectors
A single 40-dimensional vector per sequence

Results: Combined Vectors
Conclusions
Correlation was strong enough for building sequence classifiers without using sequence
Significant long-range correlation exists between protein sequence residues
Correlation exists in terms of both residues and physical properties

Future Work
More comprehensive study of long range interactions
  How much distance should we consider?
  Analyze gap distances individually and compare
  Look for the combination of distances and methods that most improves classification power
Explore other physical properties
Measure correlation of residue groups
Investigate normalization or correction techniques to reduce sampling effects

Acknowledgements
Dr. Sun Kim
Dr. Mehmet Dalkilic
The Center for Genomics and Bioinformatics
  Computing resources
  Support throughout this process
References
Aha D, Kibler D, Albert M. Machine Learning 6, 1991
Bateman A, Coin L, et al. Nucleic Acids Research 32, 2004
Cline M, Karplus K, Lathrop R, et al. PROTEINS: Structure, Function, and Genetics 49, 2002
Kohavi R. International Joint Conference on AI, 1995
Martin L, Gloor G, Dunn S, Wahl L. Bioinformatics 21(22), 2005
Shannon CE. The Bell System Technical Journal 27, 1948
Weiss O, Jiménez-Montaño M, Herzel H. J. theor. Biol. 206, 2000