BIO2093_DMS4_sequence_similarity

Phylogeny IV
BIO2093 – Sequence Similarity
Darren Soanes
Central Dogma
Open reading frame
Sequence similarity
• Protein sequence determines function.
• Proteins with similar sequences tend to have
similar functions.
• Sequence similarity may also suggest an
evolutionary relationship.
• The function of an unknown protein can be
inferred from the similarity of its sequence to
known proteins.
Protein sequence determines function
Protein databases
• Protein Information Resource (PIR) was the first protein
sequence database.
• Proteins organised into families based on degree of
sequence similarity.
• PIR-International Protein Sequence database.
• Swiss-Prot, manually annotated protein database, cross-referenced, with literature citations.
• TrEMBL - (Translated EMBL Nucleotide Sequence Data
Library), automated annotations for those proteins not in
Swiss-Prot.
• UniProt – combination of PIR+Swiss-Prot+TrEMBL.
• Most sequences in protein databases translated from
DNA sequences.
DNA sequence databases
• GenBank (1982), European Molecular
Biology Laboratory (EMBL) Data Library
(1980), DNA Databank of Japan (DDBJ)
(1984).
• GenBank, EMBL and DDBJ formed the
International Nucleotide Sequence Database
Collaboration (INSDC) – data exchanged daily.
Sequence alignment (1)
• Sequence alignment is a way of arranging the primary sequences
of DNA, RNA, or protein to identify regions of similarity. Aligned
sequences of nucleotide or amino acid residues are typically
represented as rows within a matrix. Gaps are inserted between the
residues so that residues with identical or similar characters are
aligned in successive columns.
Sequence alignment (2)
• Pairwise alignment – comparing two sequences.
Generally a query sequence is compared to
every sequence in a database to find the best
match.
* = identical amino acid
: = conserved substitution (same chemical property)
. = semi-conserved substitution (same shape)
Families of amino acids
Sequence alignment (3)
• Global alignment – attempts to match every
residue in two sequences, most useful when
sequences are similar and of roughly equal length.
• Local alignment - more useful for dissimilar
sequences that are suspected to contain regions
of similarity or similar sequence motifs within
their larger sequence context.
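The difference between global and local alignment can be sketched with a small dynamic-programming scorer. This is a minimal illustration, not the method of any particular tool: the scoring values (+1 match, -1 mismatch, -2 per gap) are assumptions chosen for the example, and the function name `align_score` is hypothetical. The global mode follows the Needleman-Wunsch scheme and the local mode follows the Smith-Waterman scheme.

```python
def align_score(a, b, match=1, mismatch=-1, gap=-2, local=False):
    """Best alignment score of sequences a and b by dynamic programming.

    local=False scores a global (end-to-end) alignment;
    local=True scores the best-matching pair of subsequences.
    Scoring values are illustrative assumptions, not BLAST defaults.
    """
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    if not local:
        # A global alignment must pay for gaps at the sequence ends.
        for i in range(1, rows):
            score[i][0] = i * gap
        for j in range(1, cols):
            score[0][j] = j * gap
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(diag, score[i - 1][j] + gap, score[i][j - 1] + gap)
            if local:
                # A local alignment may restart anywhere, so scores never drop below 0.
                cell = max(cell, 0)
                best = max(best, cell)
            score[i][j] = cell
    return best if local else score[-1][-1]


# A short motif compared with a longer sequence that contains it.
long_seq = "MIRIAINGFGRIGRLVLR"
motif = "INGFGRIGR"
print(align_score(long_seq, motif, local=True))  # local score
print(align_score(long_seq, motif))              # global score
```

With these assumed scores the local alignment rewards the shared motif, while the global alignment is penalised for the nine unmatched residues in the longer sequence.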
BLAST
• BLAST (Basic Local Alignment Search Tool): Local
alignment algorithm that has been designed for speed,
with a minimal sacrifice of sensitivity to distant sequence
relationships.
• Useful in large-scale database searches where most of
the candidate sequences will have no significant match
with the query sequence.
– Short exact matches (words) between the query
sequence and database sequences are found first
(3 amino acids in a protein search).
– Match extended in each direction (ungapped).
– Matches with a score over a certain threshold are
subjected to more sensitive gapped alignment
algorithm.
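The first two steps above (word matching, then ungapped extension) can be sketched as follows. This is a simplified illustration under stated assumptions: it scores +1 for a match and -1 for a mismatch instead of using a substitution matrix, it only seeds on exact words as described in the slide, and the function names and the `x_drop` cut-off value are hypothetical. The gapped alignment step is not shown.

```python
def find_seeds(query, subject, word_size=3):
    """Yield (query_pos, subject_pos) for every exact word shared by both sequences."""
    index = {}
    for i in range(len(query) - word_size + 1):
        index.setdefault(query[i:i + word_size], []).append(i)
    for j in range(len(subject) - word_size + 1):
        for i in index.get(subject[j:j + word_size], []):
            yield i, j


def extend_ungapped(query, subject, qpos, spos, word_size=3,
                    match=1, mismatch=-1, x_drop=2):
    """Extend a seed in both directions without gaps.

    Extension in each direction stops once the running score has
    fallen x_drop below the best score seen in that direction.
    Returns (score, matched query segment).
    """
    def walk(qi, si, step):
        best = running = 0
        best_len = length = 0
        while 0 <= qi < len(query) and 0 <= si < len(subject):
            running += match if query[qi] == subject[si] else mismatch
            length += 1
            if running > best:
                best, best_len = running, length
            elif best - running >= x_drop:
                break
            qi += step
            si += step
        return best, best_len

    right_score, right_len = walk(qpos + word_size, spos + word_size, 1)
    left_score, left_len = walk(qpos - 1, spos - 1, -1)
    score = word_size * match + left_score + right_score
    return score, query[qpos - left_len:qpos + word_size + right_len]


# First 18 residues of the query and subject from a yeast v Magnaporthe comparison.
query = "MIRIAINGFGRIGRLVLR"
subject = "MVKCGINGFGRIGRIVFR"
seeds = list(find_seeds(query, subject))
print(seeds[0], extend_ungapped(query, subject, *seeds[0]))
```

Indexing the query words once and then scanning each database sequence is what makes this approach fast: most database sequences share no word with the query and are discarded without any alignment being attempted.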
BLAST programs (1)
• blastn – nucleotide query v nucleotide database
• blastp – protein query v protein database
• blastx – translated nucleotide query v protein database
• tblastn – protein query v translated nucleotide database
• tblastx – translated nucleotide query v translated
nucleotide database
• Low complexity sequences can be filtered out,
reduces the likelihood of false positives in some
situations.
BLAST programs (2)
• Best to compare protein sequences between
species – they evolve more slowly than
nucleotide sequences.
• Many changes in nucleotide sequence don’t
change protein sequence (synonymous, or silent,
mutations, which are usually neutral).
• Use blastn when mapping mRNA or gene
sequences to genomic DNA from the same
organism, or comparing RNA sequences.
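A short sketch of why protein comparisons are more robust between species: two DNA sequences can differ at several positions and still encode the same peptide. The two gene fragments below are hypothetical sequences made up for this example; the codon table is the standard genetic code, built from the conventional T, C, A, G ordering.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order ("*" = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate a DNA coding sequence in frame 1, ignoring any trailing partial codon."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna) - 2, 3))


def count_differences(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


# Hypothetical coding sequences that use different codons for the same residues.
gene_a = "ATGATTCGTATTGCT"
gene_b = "ATGATCAGAATAGCC"

print(count_differences(gene_a, gene_b), "nucleotide differences")
print(translate(gene_a), translate(gene_b))
```

Here a third of the nucleotide positions differ, yet both fragments translate to the same five amino acids, so a protein-level comparison sees a perfect match where a nucleotide-level comparison does not.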
Genetic Code
BLASTP 2.2.13 [Nov-27-2005]

Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer,
Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997),
"Gapped BLAST and PSI-BLAST: a new generation of protein database search
programs", Nucleic Acids Res. 25:3389-3402.

Query= YJL052W Chr 10
         (332 letters)

Database: magnaporthe_grisea_2.3_proteins_nt.fas
           11,109 sequences; 5,221,248 total letters

Searching..................................................done

                                                                 Score    E
Sequences producing significant alignments:                      (bits) Value

MG01084.4 hypothetical protein similar to (AL670003) glyceraldeh...   449   e-127

>MG01084.4 hypothetical protein similar to (AL670003) glyceraldehyde
           3-phosphate dehydrogenase (ccg-7) [Neurospora crassa]
           33022 34253
          Length = 336

 Score = 449 bits (1156), Expect = e-127
 Identities = 215/331 (64%), Positives = 262/331 (79%)

Query: 1   MIRIAINGFGRIGRLVLRLALQRKDIEVVAVNDPFISNDYAAYMVKYDSTHGRYKGTVSH 60
           M++  INGFGRIGR+V R A++  D E+VAVNDPFI   YA YM++YDSTHGR+KGTV
Sbjct: 1   MVKCGINGFGRIGRIVFRNAIEHPDCEIVAVNDPFIEPKYAKYMLEYDSTHGRFKGTVEV 60

Query: 61  DDKHIIIDGVKIATYQERDPANLPWGSLKIDVAVDSTGVFKELDTAQKHIDAGAKKVVIT 120
               ++++G K+  Y ERDPAN+PW     +  V+STGVF   D A  H+  GAKKV+I+
Sbjct: 61  SGSDLVVNGKKVKFYTERDPANIPWSETGAEYVVESTGVFTTTDKASAHLKGGAKKVIIS 120

Query: 121 APSSSAPMFVVGVNHTKYTPDKKIVSNASCTTNCLAPLAKVINDAFGIEEGLMTTVHSMT 180
           APS+ APM+V+GVN   Y     ++SNASCTTNCLAPLAKVIND FGI EGLMTTVHS T
Sbjct: 121 APSADAPMYVMGVNEKSYDGSASVISNASCTTNCLAPLAKVINDKFGIVEGLMTTVHSYT 180

Query: 181 ATQKTVDGPSHKDWRGGRTASGNIIPSSTGAAKAVGKVLPELQGKLTGMAFRVPTVDVSV 240
           ATQKTVDGPS KDWRGGR A+ NIIPSSTGAAKAVGKV+P L GKLTGM+ RVPT +VSV
Sbjct: 181 ATQKTVDGPSAKDWRGGRGAAQNIIPSSTGAAKAVGKVIPALNGKLTGMSMRVPTANVSV 240

Query: 241 VDLTVKLEKEATYDQIKKAVKAAAEGPMKGVLGYTEDAVVSSDFLGDTHASIFDASAGIQ 300
           VDLT +LEK A+Y++IK A+K AA+GP+KG+L YTED VVSSD +G+  +SIFDA AGI
Sbjct: 241 VDLTCRLEKGASYEEIKAAIKEAADGPLKGILEYTEDDVVSSDMIGNNASSIFDAQAGIA 300

Query: 301 LSPKFVKLISWYDNEYGYSARVVDLIEYVAK 331
           L+ KFVKL+SWYDNE+GYS RV+DL+ Y++K
Sbjct: 301 LNDKFVKLVSWYDNEWGYSRRVIDLVTYISK 331
Output values
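• Score – value calculated from the matching and
similar amino acids in the alignment, less
penalties for mismatches and gaps.
• Expect (E-value) – number of alignments with a
score at least this high expected to occur by
chance in a database of this size. The closer to
zero, the more significant the match.
• Identities – number of identical amino
acids in the alignment.
• Positives – number of identical or similar
amino acids in the alignment.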
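As a rough check on these values, the E-value can be approximated from the bit score with the relation E = m × n × 2^(-S), where m is the query length, n is the total database length and S is the bit score. The sketch below uses the raw lengths from the BLASTP report (332 and 5,221,248 letters), which is an assumption: BLAST itself uses corrected "effective" lengths, so the result agrees with the reported value only to within about an order of magnitude. The function names are hypothetical.

```python
def expect_value(bit_score, query_len, db_len):
    """Approximate E-value: E = m * n * 2^(-S), using raw (not effective) lengths."""
    return query_len * db_len * 2.0 ** (-bit_score)


def count_matches(match_line):
    """Count identities and positives in the middle line of a BLAST alignment.

    Letters mark identical residues and '+' marks similar residues;
    positives include the identities.
    """
    identities = sum(ch.isalpha() for ch in match_line)
    positives = identities + match_line.count("+")
    return identities, positives


# Values taken from the report: 449 bits, 332-letter query, 5,221,248-letter database.
print(expect_value(449, 332, 5_221_248))

# Middle line of the final alignment block (query residues 301-331).
print(count_matches("L+ KFVKL+SWYDNE+GYS RV+DL+ Y++K"))

# Percentages for the whole alignment, truncated to whole numbers.
print(215 * 100 // 331, 262 * 100 // 331)
```

The percentages printed on the last line match the 64% and 79% shown in the report, which is consistent with the program truncating rather than rounding.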
Families of amino acids
Protein family
• A protein family is a group of evolutionarily related proteins.
• Members of a protein family have similar three-dimensional structures, functions and
sequences.
• Families can include proteins in different
organisms that descend from the same ancestral
gene and usually have the same function (orthologues).
• Can also include members of multigene families
derived from gene duplication and
rearrangements (paralogues).
Gene duplication
• Gene duplication due to unequal crossing
over during meiosis can create gene
families.
• Sequence and function of different
members of a gene family can diverge.
Gene duplication
Cytochrome P450s
• A group of enzymes involved in the
oxidative metabolism of a large number of
natural compounds, as well as drugs,
carcinogens and mutagens.
• Contain a haem group.
• Found in animals, plants, fungi and
bacteria.
Cytochrome P450
Functions of cytochrome P450
• Detoxification of drugs, carcinogens and
toxins.
• Biosynthesis of steroids, fatty acids and
bile acids.
• Biosynthesis of toxins.
• Bioconversion of polyaromatic
hydrocarbons.
• Alkane assimilation.
Two fungi
Magnaporthe oryzae –
rice blast fungus –
pathogen (invades
living plant)
Neurospora crassa –
red bread mould –
saprophyte (lives on
dead organic matter)
Number of cytochrome P450s
• M. oryzae – 122
• N. crassa – 37
• Cytochrome P450s are important for
pathogens.
• Needed to detoxify anti-fungal chemicals
produced by the plant and to synthesise toxins
that help M. oryzae invade the host plant.
Cytochrome P450s
• Cytochrome P450s classified into families
based on sequence homology.
• Amino acid sequence not well conserved
between cytochrome P450 families.
• 3D structures of members of different
cytochrome P450 families are similar.
Cytochrome P450 structure
Pfam
• Pfam is a protein family database based on
hidden Markov models (HMMs).
• http://pfam.xfam.org/
• An HMM is a statistical model that considers all
possible combinations of matches, mismatches
and gaps to generate an alignment of a set of
sequences.
• Used to represent protein families at Pfam.
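The idea of scoring a sequence against a family model can be sketched with a position-specific profile. This is a deliberate simplification, not Pfam's method: it keeps only the per-column residue probabilities (the "match" part of a profile HMM) and leaves out the insert and delete states that let a real HMM handle gaps. The small alignment, the pseudocount and the uniform background frequency are all assumptions made for the example.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, pseudocount=1.0):
    """Per-column residue probabilities from an ungapped alignment.

    A pseudocount is added so that residues never seen in a column
    still get a small, non-zero probability.
    """
    profile = []
    for column in zip(*alignment):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        profile.append({aa: (counts[aa] + pseudocount) / total for aa in ALPHABET})
    return profile


def log_odds(profile, seq, background=1 / len(ALPHABET)):
    """Log2 odds that seq was generated by the profile rather than at random."""
    return sum(math.log2(column[aa] / background)
               for column, aa in zip(profile, seq))


# Hypothetical aligned fragments standing in for members of one family.
family = ["NGFGRIGR", "NGFGRIGR", "NGYGRIGR", "NGFGRVGR"]
profile = build_profile(family)

print(log_odds(profile, "NGFGRIGR"))  # resembles the family
print(log_odds(profile, "AAAAAAAA"))  # does not resemble the family
```

A sequence that resembles the family scores above zero and an unrelated one scores below zero, which is the basis on which a new protein can be assigned to a family.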
Domains
• A segment of a polypeptide chain that can fold
into a three-dimensional structure irrespective of
the presence of other segments of the chain.
• Different domains in the same protein may have
specific functions.
• Example – myosin family, a family of ATP-dependent motor proteins involved in muscle
contraction and motility.
Myosin V (involved in actin-dependent
transport of vesicles)
Head domain (motor) – binds actin, nucleotide-binding
IQ – calmodulin-binding motif (calcium sensing)
Coiled-coil – dimerisation
Globular domain – binding of myosin to vesicles
Summary
• Protein sequence determines function.
• BLAST can be used to search for protein /
DNA sequences that are similar.
• Proteins can be grouped into families
based on sequence / phylogeny.