Protein Evolution and Sequence Analysis

Central Premise
Significant sequence similarity allows one to
assign function to unknown proteins based
on properties of known proteins, and is a direct
consequence of evolutionary relationships.
Speciation- Evolution of a new species that is genetically independent of
the ancestral species from which it arose; the genes/proteins of the two
lineages then diverge independently.
Homolog- A gene/protein related to a second gene/protein by descent from a
common ancestral gene, whether by speciation or by duplication.
Ortholog- Genes/proteins in different species that evolved from a common
ancestral gene by speciation and that typically retain the same function.
Paralog- Genes/proteins related by duplication of a common ancestral gene;
the copies often evolve new functions, even if these remain related to that
of the ancestor.
Convergent evolution- Evolution of similar features or properties in
genes/proteins of different genetic lineages.
Divergent and Convergent Evolution Among
the Serine Proteases
[Figure: structures of trypsin (PDB 3NKK), chymotrypsin (PDB 1ACB), and
subtilisin (PDB 1SBT), with an overlay panel]
Mechanisms Involved in Molecular
Evolution of Genes/Proteins
Mutation- Stochastic single point changes in the genetic material due to
errors in DNA replication during cell division, radiation exposure, chemical
or environmental stressors, or viruses and transposable elements. Slow but
roughly constant rate (molecular clock) of 10^-9 to 10^-8 mutations per base
per generation. Also includes splicing errors in eukaryotes that retain introns.
Recombination- Exchange of genes or portions of genes between different
chromosomes to create new combinations of elements.
Gene duplication- Duplication of a gene or portions of a gene, one of
which continues the original function and the other is free to evolve and
acquire new functions.
Retrotransposition- Incorporation of mRNA sequences back into DNA,
frequently inserting into new locations with different expression patterns.
The mechanisms by which new genes/proteins arise allow for the
possibility of using sequence analysis to infer functional and structural
relationships among different sequences.
Sequence alignments are methods for arranging DNA, RNA, or protein
sequences to identify regions of similarity or identity with the goal of
inferring structure, function, or both.
Sequence searches and alignments using DNA/RNA are usually not as
informative as searches and alignments using protein sequences.
However, DNA/RNA searches are intuitively easier to understand:
AGGCTTAGCAAA........TCAGGGCCTAATGCG
|||||||| |||        ||||||||||| |||
AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG
The above pairwise alignment could be scored by giving +1 for each
identical nucleotide, 0 for a mismatch, -4 for opening a gap, and -1 for
each extension of the gap. With 25 identities, 2 mismatches, and one
8-position gap (one opening plus 7 extensions), the score = 25 – (4 + 7) = 14.
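The scoring rule above can be checked with a few lines of Python. This is a minimal sketch: the function name `score_alignment` and its parameter names are illustrative, and it assumes that gap positions are written as `.` or `-` and that the first position of a gap pays the opening penalty while every further position pays the extension penalty.

```python
def score_alignment(top, bottom, match=1, mismatch=0, gap_open=-4, gap_extend=-1):
    """Score an existing pairwise alignment; '.' or '-' marks a gap position."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for a, b in zip(top, bottom):
        if a in ".-" or b in ".-":
            # First gap position pays the opening penalty, later ones the extension.
            score += gap_extend if in_gap else gap_open
            in_gap = True
        else:
            score += match if a == b else mismatch
            in_gap = False
    return score

top = "AGGCTTAGCAAA........TCAGGGCCTAATGCG"
bottom = "AGGCTTAGGAAACTTCCTAGTCAGGGCCTAAAGCG"
print(score_alignment(top, bottom))  # 25 identities - (4 + 7) = 14
```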
Protein sequence alignments are much more complicated but are
more informative because they involve 20 degrees of freedom (total
possible amino acids) rather than 4 (total possible bases).
ARDTGQEPSSFWNLILMY.........DSCVIVHKKMSLEIRVH
|      | | |     |         |||  | | ||   |||
AKKSAEQPTSYWDIVILYESTDKNDSGDSCTLVKKRMSIQLRVH
Unlike nucleotide sequence alignments, which are either identical or
not identical at a given position, protein sequence alignments include
“shades of grey” where one might acknowledge that a T is sort of
equivalent to an S. But how equivalent? What number would you
assign to an S-T mismatch? And what about gaps? Since alanine is
a common amino acid, couldn’t the A-A match be by chance? Since
Trp and Cys are uncommon, should those matches be given higher
scores?
Therefore, accurately aligning sequences and accurately finding
related sequences are approximately the same problem.
Multiple Sequence Alignments
Sequence comparisons fall into two categories: local alignments, in
which regions of larger sequences are compared to identify regions of
similarity such as domains, and global alignments, in which
sequences of similar length are compared to analyze overall similarity.
Various methods are available depending on the assumptions of the
algorithm and the types of sequences to be analyzed. All require a
scoring matrix for dealing with similarities, gaps, and insertions.
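The difference between the two categories can be sketched with the classic dynamic-programming formulations (Needleman-Wunsch for global, Smith-Waterman for local alignment). The example below is a simplified illustration, not the method of any particular tool: it returns only the best score, and it assumes a flat match/mismatch score and a linear gap penalty in place of a real substitution matrix. All names and default values are illustrative.

```python
def alignment_score(a, b, local=False, match=1, mismatch=-1, gap=-2):
    """Best alignment score by dynamic programming with a linear gap penalty."""
    rows, cols = len(a) + 1, len(b) + 1
    table = [[0] * cols for _ in range(rows)]
    if not local:
        # Global: gaps at the start of either sequence are charged.
        for i in range(1, rows):
            table[i][0] = i * gap
        for j in range(1, cols):
            table[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = table[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, table[i - 1][j] + gap, table[i][j - 1] + gap)
            if local:
                cell = max(cell, 0)      # local: an alignment may restart anywhere
                best = max(best, cell)
            table[i][j] = cell
    return best if local else table[-1][-1]

# A short "domain" embedded in a longer sequence.
print(alignment_score("XXXXACGTXXXX", "ACGT", local=True))   # finds the shared region
print(alignment_score("XXXXACGTXXXX", "ACGT", local=False))  # pays for the overhangs
```

The local score rewards the shared region alone, while the global score is dragged down by the unmatched flanks, which is why local methods suit domain searches in sequences of unequal length.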
Clustal is a commonly used global alignment algorithm for performing
multiple sequence alignments. The algorithm is executed in three stages:
(1) A pairwise sequence comparison is performed across all sequences;
(2) The pairwise information is used to create a guide tree; (3) The guide
tree is used to perform the final alignment progressively, starting from
the most similar sequences.
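The three stages can be illustrated with a toy sketch. It is not Clustal's actual implementation and rests on several assumptions: the pairwise distance is just the fraction of differing positions between equal-length sequences (Clustal aligns each pair first), the guide tree is built by simple average-linkage clustering, and the third stage only lists the order in which a progressive aligner would join groups. The names `pairwise_distance`, `guide_tree`, and `merge_order` are hypothetical.

```python
from itertools import combinations

def pairwise_distance(a, b):
    """Stage 1 (simplified): fraction of differing positions, equal-length sequences."""
    return sum(x != y for x, y in zip(a, b)) / len(a)

def guide_tree(seqs):
    """Stage 2 (simplified): average-linkage clustering; returns nested tuples of names."""
    dist = {frozenset(p): pairwise_distance(seqs[p[0]], seqs[p[1]])
            for p in combinations(seqs, 2)}
    clusters = {name: (name,) for name in seqs}  # tree label -> member names

    def average(pair):
        x, y = pair
        d = [dist[frozenset((m, n))] for m in clusters[x] for n in clusters[y]]
        return sum(d) / len(d)

    while len(clusters) > 1:
        x, y = min(combinations(clusters, 2), key=average)
        clusters[(x, y)] = clusters.pop(x) + clusters.pop(y)
    return next(iter(clusters))

def merge_order(tree):
    """Stage 3 (order only): joins a progressive aligner would perform, innermost first."""
    if isinstance(tree, str):
        return []
    left, right = tree
    return merge_order(left) + merge_order(right) + [tree]

seqs = {"A": "ACGTACGT", "B": "ACGTACGA", "C": "TTGTACCA", "D": "TTGTCCCA"}
tree = guide_tree(seqs)
print(tree)               # (('A', 'B'), ('C', 'D'))
print(merge_order(tree))  # closest pairs are joined before the two groups are merged
```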
PAM (Point Accepted Mutation) matrices
• Are derived from studying global alignments of well-characterized protein families.
• PAM1 = only 1% of residues have changed (i.e., short evolutionary distance).
• Raise PAM1 to the 250th power to get PAM250: 250 accepted changes per 100
residues (greater evolutionary distance; a site can change more than once),
or about 20% sequence identity.
• Therefore,
a PAM 30 would be used to analyze more closely related proteins,
a PAM 400 is used for finding and analyzing distantly related proteins.
• PAMx = (PAM1)^x
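The relation PAMx = (PAM1)^x is a matrix power of the PAM1 mutation probability matrix. The sketch below shows the idea on a made-up 3-letter alphabet; the probabilities in `pam1` are hypothetical and chosen only so that each residue has a 1% chance of changing per PAM unit. Real PAM matrices cover 20 amino acids, and the published scoring matrices are log-odds scores derived from these probabilities.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, x):
    """Multiply the matrix by itself x times, i.e. PAMx = (PAM1)^x."""
    n = len(m)
    result = [[float(i == j) for j in range(n)] for i in range(n)]  # identity
    for _ in range(x):
        result = mat_mul(result, m)
    return result

# Hypothetical 3-letter alphabet: row i gives the probability that residue i
# is replaced by residue j after one PAM unit (99% unchanged).
pam1 = [[0.990, 0.007, 0.003],
        [0.007, 0.990, 0.003],
        [0.005, 0.005, 0.990]]

pam250 = mat_pow(pam1, 250)
for row in pam250:
    print([round(p, 3) for p in row])  # the diagonal (identity) is much smaller than 0.99
```

Each row still sums to 1 after any number of steps, but the probability of seeing the original residue falls steadily, which is why high PAM numbers correspond to distantly related sequences.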
Blocks substitution matrices (BLOSUM)
Are derived from studying local alignments (blocks) of sequences from related proteins
that share no more than X% identity.
1) In other words, one might use the portions of aligned sequences from related
proteins that have no more than 62% identity (in the portions or blocks) to derive
the BLOSUM 62 scoring matrix.
2) One might use only the blocks that have no more than 80% identity to derive the
BLOSUM 80 matrix.
3) BLOSUM and PAM matrix numbers run in opposite directions:
a) The higher the number of the BLOSUM matrix (BLOSUM X), the more closely
related proteins you are looking for.
b) The higher the number of the PAM matrix (PAM X), the more distantly related
proteins you are looking for.
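Substitution matrix entries are log-odds scores: they compare how often a pair of residues is seen aligned in related proteins with how often it would be expected by chance from the background frequencies. This also answers the earlier question about rare residues, since a match between two uncommon residues is less likely to be accidental and so scores higher. The sketch below uses the half-bit scaling of BLOSUM; the frequencies are illustrative values, not taken from the source, and `half_bit_score` is a hypothetical helper name.

```python
from math import log2

def half_bit_score(pair_freq, freq_a, freq_b):
    """Log-odds score in half-bit units: 2 * log2(observed / expected by chance).

    pair_freq is the observed frequency of the ordered aligned pair (a, b);
    freq_a and freq_b are the background frequencies of the two residues.
    """
    return round(2 * log2(pair_freq / (freq_a * freq_b)))

# Illustrative (assumed) frequencies: Trp is rare, Ala is common.
trp_trp = half_bit_score(pair_freq=0.0065, freq_a=0.013, freq_b=0.013)
ala_ala = half_bit_score(pair_freq=0.0215, freq_a=0.074, freq_b=0.074)
print(trp_trp, ala_ala)  # the rare Trp-Trp match scores well above Ala-Ala
```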
Gap penalties – Intuitively one recognizes that there should be a penalty
for introducing (requiring) a gap during identification/alignment of a given
sequence. But if two sequences are related, the gaps may well be located
in loop regions which are more tolerant of mutational events and probably
have little impact on structure. Therefore, a new gap should be penalized,
but extending an existing gap should be penalized very little.
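This open-plus-extend scheme is usually called an affine gap penalty. A short sketch, reusing the -4/-1 values from the nucleotide example purely as an assumption, shows why it favors one long gap (a single insertion or deletion event in a loop) over many scattered ones. The function name `gap_cost` is illustrative.

```python
def gap_cost(length, gap_open=4, gap_extend=1):
    """Affine penalty: opening charge for the first position, extension for the rest."""
    if length <= 0:
        return 0
    return gap_open + gap_extend * (length - 1)

one_long_gap = gap_cost(8)          # 4 + 7 * 1 = 11
eight_short_gaps = 8 * gap_cost(1)  # 8 * 4 = 32
print(one_long_gap, eight_short_gaps)
```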
Filtering – many protein and nucleotide sequences contain simple repeats or
regions of low sequence complexity. These should be masked or excluded from
searches and alignments, since they produce spurious hits.
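One simple way to picture such a filter is a sliding window that masks stretches with low Shannon entropy. This is only a sketch of the idea and not the algorithm used by any specific search tool; the window size, the threshold, and the function names are assumptions chosen for the example.

```python
from collections import Counter
from math import log2

def shannon_entropy(window):
    """Entropy in bits of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * log2(c / total) for c in counts.values())

def mask_low_complexity(seq, window=12, threshold=2.2, mask_char="X"):
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(seq)
    for start in range(len(seq) - window + 1):
        # Entropy is always measured on the original, unmasked sequence.
        if shannon_entropy(seq[start:start + window]) < threshold:
            masked[start:start + window] = mask_char * window
    return "".join(masked)

seq = "ACDEFGHIKLMN" + "Q" * 12 + "ACDEFGHIKLMN"
print(mask_low_complexity(seq))  # the poly-Q run and its immediate edges are masked
```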
Significance of a “hit” during a search - More important than an arbitrary
score is an estimate of how many hits with at least that score would be
expected through pure chance (the lower the value, the more certain the
match). Hence the “Expectation value” or E-value. E-values can be as low
as 10^-70.
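For local alignment searches the E-value is commonly estimated with the Karlin-Altschul formula, E = K·m·n·e^(-λS), where S is the raw score, m the query length, and n the database length. The sketch below assumes that formula; the default values of `lam` and `k` are illustrative only, since the real parameters depend on the scoring matrix and gap penalties in use.

```python
from math import exp, log

def e_value(raw_score, query_len, db_len, lam=0.318, k=0.134):
    """Expected number of chance hits: E = K * m * n * exp(-lambda * S)."""
    return k * query_len * db_len * exp(-lam * raw_score)

def bit_score(raw_score, lam=0.318, k=0.134):
    """Normalized score S' = (lambda * S - ln K) / ln 2, so that E = m * n * 2**(-S')."""
    return (lam * raw_score - log(k)) / log(2)

# Hypothetical search: 300-residue query against a database of 1e9 residues.
for score in (50, 100, 200):
    print(score, round(bit_score(score), 1), e_value(score, 300, 1e9))
```

Because the score sits in the exponent, modest gains in score shrink the E-value by many orders of magnitude, while a larger database raises it proportionally.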
Useful Bioinformatics Sites
National Center for Biotechnology Information (NCBI)- Sites sponsored by the
National Institutes of Health, with a rich array of resources and databases.
[http://www.ncbi.nlm.nih.gov/pubmed]
ExPASy (Swiss Institute of Bioinformatics)- Large number of different
tools for sequence and function analysis. [http://www.expasy.org/tools/]
RCSB Protein Data Bank- Largest curated database of protein structures.
[http://www.rcsb.org/pdb/home/home.do]
BioGRID- Large data base of curated protein interaction datasets.
[http://thebiogrid.org/]
Osprey- Software and interactome analysis tools for visualizing interaction
data sets. [http://en.bio-soft.net/protein/Osprey.html]
Tree of Life website- Database information on phylogenetic relationships
among organisms with useful link outs. [http://tolweb.org/tree/]