Transcript file1
Identifying functional residues of proteins from sequence info
Using MSA (multiple sequence alignment)
- search for remote homologs using HMMs or profiles
Remote homologs with no known structure
- Given a large, diverse superfamily
- protein may evolve different function or subtype
- different substrate specificity or activity
- proteins with similar fold but different function
Past methods used phylogenetic trees
- map unknown protein to one of the branches of the tree produced
- but the proteins may have diverged too long ago to be clearly identified
- co-evolution of multiple features
- possible convergent evolution of molecular function at aa level
Other methodologies:
Analysis/prediction of subtype from sequence alignments
- characterization of aa residues, looking for significant substitutions
- gathering sequences into subgroups, comparing each subgroup
Principal component analysis (Casari et al, 1995)
- looks for functional residues conserved in protein families
Evolutionary Trace (Lichtarge et al)
Phylogenetic Inference (Sjolander et al)
Goal: identify regions conferring sub-family specificity
- Secondary goal: predict subtypes of orphan sequences
Input to algorithm:
- multiple sequence alignment (MSA) of sequences in a protein family
- classification of subfamilies of sequences from above MSA
For the given subtypes (or subfamilies) provided:
- get the MSA subalignment for each subfamily
- build an HMM profile for each sub-family MSA
- Rationale: generate pseudocounts and account for statistical bias
For each subalignment profile
The profile value for amino acid x at position i in subfamily j is the probability of
finding amino acid x at position i in subfamily j. Summed over all amino acids at a given
position, these values equal 1.
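A minimal sketch of such a per-subfamily profile, under stated assumptions: the notes build an HMM profile, whereas this example swaps in plain add-one pseudocounts on column counts, which is a much simpler stand-in. The names `column_profile`, `subfamily_profile` and the toy sequences are hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def column_profile(column, pseudocount=1.0):
    """Return {aa: probability} for one alignment column; gap characters are skipped."""
    counts = Counter(c for c in column if c in AMINO_ACIDS)
    # Assumption: a flat pseudocount per amino acid, not HMM-derived pseudocounts.
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    return {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}

def subfamily_profile(sequences, pseudocount=1.0):
    """One probability distribution per alignment position for one subfamily."""
    return [column_profile(col, pseudocount) for col in zip(*sequences)]

# Toy subalignment for a single subfamily (three aligned sequences, three columns).
subfamily_a = ["ACD", "ACE", "AC-"]
profile = subfamily_profile(subfamily_a)
```

Because of the pseudocounts, every amino acid gets a non-zero probability at every position, and each column still sums to 1.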
Relative Entropy
- measure of “distance” between two probability distributions
- Relative entropy produces a value >= 0. (value of 0 for two identical distributions)
- for each position i in a subfamily s
For each position, compute an RE value for subfamily s vs s-bar (all other subfamilies)
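The relative entropy at one position can be sketched as below. This uses the standard definition, the sum over x of p(x) * log(p(x) / q(x)), with p the distribution in subfamily s and q the distribution in s-bar. The natural logarithm and the toy distributions are assumptions, as is the name `relative_entropy`.

```python
import math

def relative_entropy(p, q):
    """Sum over x of p[x] * log(p[x] / q[x]), in nats.

    Assumes q[x] > 0 wherever p[x] > 0 (pseudocounts normally guarantee this).
    """
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Toy distributions at one alignment position (three-letter alphabet for brevity).
p_s = {"A": 0.7, "C": 0.2, "D": 0.1}      # subfamily s
p_rest = {"A": 0.1, "C": 0.2, "D": 0.7}   # s-bar, all other subfamilies

same = relative_entropy(p_s, p_s)
different = relative_entropy(p_s, p_rest)
```

Identical distributions give 0, and the value grows as the two distributions diverge. Note that the measure is not symmetric in p and q.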
Cumulative Relative Entropy
- given a set of relative entropies for each subfamily for each position
- combine them to produce one CRE for each position i in the MSA across all subfamilies
Given this set of cumulative relative entropy measures
- one for each position in the MSA: take the Z-score
- standard statistical measure: the number of standard deviations above/below the mean
- tells you which residue positions vary strongly in aa distribution between families
- empirically, Z > 3 correlates with functional residue
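The CRE and Z-score steps can be sketched as follows. Two assumptions are made here: the per-subfamily relative entropies are combined by simple summation, and the population standard deviation is used. The function names and the toy numbers are hypothetical.

```python
import statistics

def cumulative_relative_entropy(re_by_subfamily):
    """Sum the per-subfamily relative entropies column by column (assumed rule)."""
    return [sum(column) for column in zip(*re_by_subfamily)]

def z_scores(values):
    """Number of standard deviations each value lies above or below the mean."""
    mean = statistics.mean(values)
    sd = statistics.pstdev(values)  # assumption: population standard deviation
    return [(v - mean) / sd for v in values]

# Rows are subfamilies, columns are alignment positions (toy numbers).
re_by_subfamily = [
    [0.1, 0.2, 2.0, 0.1],
    [0.1, 0.0, 2.5, 0.3],
]
cre = cumulative_relative_entropy(re_by_subfamily)
z = z_scores(cre)
top_position = z.index(max(z))
```

With only four positions no Z-score can reach 3, so the toy example just picks the highest-scoring position. On a real alignment with hundreds of columns the Z > 3 cutoff from the notes becomes meaningful.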
For position i, which amino acid is dominant in a given subfamily
- find probability of observing aa x at position i in subfamily s vs not-s
- Take the aa with probability >= 0.5
- We now have a small set of aa residues which differ strongly between subfamilies of a
protein family.
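A small sketch of picking the dominant amino acid per subfamily at one position. It assumes each subfamily profile is a list of `{aa: probability}` dictionaries and only looks at the probability within the subfamily itself. `dominant_residues` and the toy profiles are hypothetical.

```python
def dominant_residues(profiles, position, threshold=0.5):
    """Map subfamily name -> amino acid with probability >= threshold, else None."""
    result = {}
    for name, profile in profiles.items():
        # Most probable amino acid at this position in this subfamily.
        aa, prob = max(profile[position].items(), key=lambda item: item[1])
        result[name] = aa if prob >= threshold else None
    return result

# Toy profiles: one position, three subfamilies.
profiles = {
    "s1": [{"K": 0.8, "R": 0.2}],
    "s2": [{"K": 0.1, "E": 0.9}],
    "s3": [{"K": 0.4, "R": 0.3, "E": 0.3}],
}
dominant = dominant_residues(profiles, position=0)
```

Here `s1` is dominated by K and `s2` by E, while `s3` has no amino acid reaching the 0.5 threshold.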
Subfamily data
What exactly constitutes a family or subfamily?
- not always clear
- automated tree generation could not separate data into clear subfamilies
- use of PFAM alignments and SWISSPROT data
Subfamilies are not clearly defined in databases
- divided proteins from PFAM database into subfamilies based on SWISSPROT data
- keyword search limited to enzymatic activity string in SWISSPROT
- put into groups, then checked for obvious mistakes
- also eliminated divisions “easily discernable by sequence comparison”
- 62 groupings from 42 alignments remained
- randomly pick one grouping per alignment to leave 42 groupings over 42 alignments
Subfamilies
Four very large families to test their results on
- nucleotidyl cyclases
- eukaryotic protein kinases
- lactate/malate dehydrogenases
- trypsin-like serine proteases
Nucleotidyl cyclases
- membrane-attached or cytosolic, cyclize (GTP -> cGMP) or (ATP -> cAMP)
- found residues 1018, 938, which correlate with previous results
- also identified residues which have not been tested experimentally
Protein kinases
- phosphorylate serine/threonine or tyrosine residues
- compared to experimental results: some ser/thr vs tyr kinase differences not detected
- inconsistency (no conservation) within the subfamily
- residues which were common to both ser/thr and tyr kinases
Subfamilies (cont)
Lactate/Malate Dehydrogenases
- common to a very wide variety of organisms; highly divergent
- results mostly as expected, but a few residues identified outside of active site
Serine Proteases
- cut protein backbone, with differing specificity as to where (what aa precedes cut)
- specificity pocket determines where protease can bind
- identified 2 out of 3 of experimentally-determined pocket residues
- (third had a low z-score because of tolerance in one protein family)
- also identified a few residues outside of the active site
Prediction of Protein Subfamily
Sequence Similarity
- straight % similarity with other sequences (ignoring gaps)
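The sequence-similarity baseline might look like the sketch below. It assumes the query is already aligned to the labelled sequences and uses percent identity over ungapped column pairs as a stand-in for "% similarity". The function names and the toy data are hypothetical.

```python
def percent_identity(a, b, gap="-"):
    """Percent of identical residues over aligned columns where neither has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def nearest_subfamily(query, labelled):
    """labelled: list of (aligned_sequence, subfamily) pairs; return best label."""
    best = max(labelled, key=lambda item: percent_identity(query, item[0]))
    return best[1]

labelled = [("ACDE", "s1"), ("KLMN", "s2")]
prediction = nearest_subfamily("ACDF", labelled)
```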
BLAST
- database search, assign to nearest subfamily with best alignment
HMM method
- align sequence of sub-type to all HMMs of subfamilies and assign it to best alignment
- will attempt to do iterative optimization of match…
Profile method
- take original HMM, and probability profile
- Sub-profile method
- only use residues in above formula that have a positive Z-score
- to reduce noise, restrict to values that have above average positive relative entropy
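The scoring formula itself is not recorded in these notes, so the following is only an illustrative sketch: it scores a sequence by the sum of log-probabilities under each subfamily profile, and the sub-profile variant restricts that sum to positions with a positive Z-score. The log-likelihood score, the function names and the toy data are all assumptions.

```python
import math

def profile_score(sequence, profile, positions=None):
    """Sum of log-probabilities of the sequence's residues under one profile."""
    if positions is None:
        positions = range(len(sequence))
    return sum(math.log(profile[i][sequence[i]]) for i in positions)

def predict_subfamily(sequence, profiles, z=None):
    """Pick the best-scoring subfamily; with z given, use only positions where z > 0."""
    positions = None if z is None else [i for i, zi in enumerate(z) if zi > 0]
    return max(
        profiles,
        key=lambda name: profile_score(sequence, profiles[name], positions),
    )

# Toy profiles over a two-letter alphabet; only position 1 discriminates.
profiles = {
    "s1": [{"A": 0.5, "K": 0.5}, {"A": 0.9, "K": 0.1}, {"A": 0.5, "K": 0.5}],
    "s2": [{"A": 0.5, "K": 0.5}, {"A": 0.1, "K": 0.9}, {"A": 0.5, "K": 0.5}],
}
z = [-0.7, 1.4, -0.7]
full = predict_subfamily("AKA", profiles)
sub = predict_subfamily("AKA", profiles, z)
```

In this toy case both variants agree, since the uninformative positions contribute equally to every subfamily. On real data the restriction is meant to drop positions that mostly add noise.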
Casari, et al. (1995) A method to predict functional residues in proteins
Input: a multiple-sequence alignment
- each sequence is converted to a vector of size (20 * l) where l is length of the alignment
Generation of an N x (20*l) matrix
- one sequence produces a vector of dimensions 20*l
- N sequences to produce N vectors of dimension 20*l
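One way to picture the encoding is a binary indicator per (position, amino acid) pair. The sketch below assumes that scheme, with gaps encoded as all zeros. The exact encoding in the original paper may differ, and `encode_sequence` and `encode_alignment` are hypothetical names.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def encode_sequence(seq):
    """Turn an aligned sequence of length l into a flat 0/1 vector of length 20*l."""
    vector = []
    for residue in seq:
        slot = [0] * len(AMINO_ACIDS)
        index = AMINO_ACIDS.find(residue)
        if index >= 0:          # gaps and unknown characters stay all-zero
            slot[index] = 1
        vector.extend(slot)
    return vector

def encode_alignment(sequences):
    """N aligned sequences -> N x (20*l) matrix as a list of rows."""
    return [encode_sequence(s) for s in sequences]

matrix = encode_alignment(["AC", "A-"])
```

For this two-column alignment each row has 40 entries. The first row has two 1s (A at position 0, C at position 1) and the second has one.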
Use Principal Component Analysis
- get the covariance matrix, which tells you how factors are correlated with one another
- eliminate covariance by finding eigenvectors/eigenvalues of covariance matrix
- largest eigenvalues and corresponding eigenvectors give you principal components
- i.e. the largest factors determining distribution of your dataset
- they take the three largest (the largest of which represents consensus sequence)
- project their 20*l dimensional data onto those 3 dimensions
- this can be used to predict a protein subfamily for a given protein
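The PCA steps above can be sketched with NumPy (assumed to be available). This is textbook PCA with mean-centering. The remark that the first component represents the consensus sequence suggests the original analysis may not center the data, so treat the centering as an assumption. `one_hot`, `pca_project` and the toy alignment are hypothetical.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def one_hot(seq):
    """Aligned sequence of length l -> 0/1 vector of length 20*l (gaps all-zero)."""
    vec = np.zeros(len(seq) * 20)
    for pos, residue in enumerate(seq):
        idx = AMINO_ACIDS.find(residue)
        if idx >= 0:
            vec[pos * 20 + idx] = 1.0
    return vec

def pca_project(matrix, n_components=3):
    """Project rows onto the eigenvectors with the largest eigenvalues."""
    x = np.asarray(matrix, dtype=float)
    centered = x - x.mean(axis=0)                     # assumption: center the data
    cov = np.cov(centered, rowvar=False)              # (20*l) x (20*l) covariance
    eigenvalues, eigenvectors = np.linalg.eigh(cov)   # ascending eigenvalues
    order = np.argsort(eigenvalues)[::-1][:n_components]
    return centered @ eigenvectors[:, order], eigenvalues[order]

# Two toy "subfamilies": ACD* and KLM*, differing within each group at the last column.
alignment = ["ACDA", "ACDC", "KLMA", "KLMC"]
matrix = np.array([one_hot(s) for s in alignment])
projected, variances = pca_project(matrix)
```

In this toy alignment the first component separates the ACD* sequences from the KLM* sequences, which is the kind of clustering used to assign a protein to a subfamily. The sign of each component is arbitrary.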
General Weirdness
Construction of a “comparison matrix”
- take the matrix multiplied by its transpose (matrix x matrix transpose)
- solve for eigenvectors and eigenvalues as before
Columns of f represent amino acid values and positions
- becomes possible to examine individual amino acid residues and positions
- plotted on graph, shows residue correlation to type of protein subfamily
- does this actually work?