Transcript Lecture
Readings for this week
Gogarten et al. Horizontal gene transfer…..
Francke et al. Reconstructing metabolic networks…..
Sign up for meeting next week for proposal
feedback/progress checkup
Inferring protein function
By genomic context………….
Inferring protein function
By homology……
COGs—Clusters of Orthologous Groups
(Eukaryotic versions are KOGs)
Identified using all-against-all sequence comparisons on a
collection of complete genomes. Includes genes with
orthologous and paralogous relationships
COGs are grouped into large-scale functional categories
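The all-against-all idea can be illustrated with a minimal sketch of reciprocal best hits between two genomes. This is only the pairwise step; COG construction itself combines best hits across three or more genomes, and the protein names and scores below are made up for illustration.

```python
# Toy all-against-all similarity scores between proteins of genome A and genome B.
# Keys are (query, subject); the values are hypothetical alignment scores.
scores = {
    ("A1", "B1"): 250, ("A1", "B2"): 80,
    ("A2", "B1"): 90,  ("A2", "B2"): 300,
    ("A3", "B2"): 120,
    ("B1", "A1"): 245, ("B1", "A2"): 85,
    ("B2", "A2"): 310, ("B2", "A1"): 75, ("B2", "A3"): 118,
}

def best_hits(scores):
    """Return {query: best-scoring subject} from a (query, subject) -> score table."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > scores[(query, best[query])]:
            best[query] = subject
    return best

def reciprocal_best_hits(scores):
    """Return sorted pairs of proteins that are each other's best hit."""
    best = best_hits(scores)
    pairs = set()
    for query, subject in best.items():
        if best.get(subject) == query:
            pairs.add(tuple(sorted((query, subject))))
    return sorted(pairs)

rbh = reciprocal_best_hits(scores)
print(rbh)
```

Here A3 finds B2 as its best hit, but B2 prefers A2, so A3 is not paired. In a COG-style analysis such a protein could still join a cluster as a paralog.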
Looking at Parts of Proteins
Domains--Conserved structural entities with distinctive
secondary structure content and a hydrophobic core
Example: Protein kinase domain
Motifs--Patterns of amino acids that are conserved
across many proteins and confer a particular function on
the protein.
Example: Zinc finger CX2-4C....HX2-4H
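A motif like the zinc finger shorthand above can be searched for with a regular expression. The sketch below is an illustration only: the "...." spacer is not specified on the slide, so the code assumes any run of one or more residues, and the test peptide is made up.

```python
import re

# C, 2-4 residues, C, a spacer of unspecified length (assumed: 1 or more
# residues, matched lazily), H, 2-4 residues, H.
ZINC_FINGER = re.compile(r"C.{2,4}C.+?H.{2,4}H")

def find_motif(sequence):
    """Return (start, matched_text) for the first motif match, or None."""
    match = ZINC_FINGER.search(sequence)
    if match is None:
        return None
    return match.start(), match.group()

# Hypothetical peptide used only for illustration.
peptide = "MKCPECGKSFSQKSNLQKHQRTHTG"
hit = find_motif(peptide)
print(hit)
```

A pattern this loose will also match many unrelated sequences, which is one reason curated domain databases are preferred over hand-written patterns.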
How to identify domains?
PFAM—Protein Families Database
Based on Hidden Markov Models (HMMs):
statistical models of multiple sequence
alignments
Built from manually curated seed alignments
(PFAM-A)
Based on these alignments a Position Specific Scoring
Matrix (PSSM) is created
Position Specific Scoring Matrix (PSSM)
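As a rough illustration of how a PSSM can be derived from an alignment, the sketch below computes log-odds scores per column. It assumes an ungapped toy alignment, a uniform background frequency for all 20 amino acids, and a simple pseudocount; real tools use sequence weighting and more careful background models.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {amino_acid: log2-odds score} dict per alignment column."""
    background = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for column in zip(*alignment):
        scores = {}
        for aa in AMINO_ACIDS:
            freq = (column.count(aa) + pseudocount) / total
            scores[aa] = math.log2(freq / background)
        pssm.append(scores)
    return pssm

def score_sequence(pssm, sequence):
    """Sum the column scores for an ungapped sequence of the same length."""
    return sum(column[aa] for column, aa in zip(pssm, sequence))

# Hypothetical ungapped alignment of four related peptides.
alignment = ["FGELA", "FGEVA", "LGELS", "FGEIA"]
pssm = build_pssm(alignment)
print(round(score_sequence(pssm, "FGELA"), 2))
print(round(score_sequence(pssm, "WWWWW"), 2))
```

Conserved columns (G and E here) give large positive scores for the conserved residue and negative scores for everything else, while variable columns are more tolerant.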
PFAM—Protein Families Database
Searching a protein against PFAM results in an E value
with meaning similar to BLAST E values (the number of
sequences expected to score that well for that domain by
chance)
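Under the Poisson assumption used in BLAST statistics, an E value (expected number of chance hits) converts to the probability of seeing at least one such hit as P = 1 - exp(-E). The small sketch below shows that the two are nearly identical for small E and diverge as E grows; the function name is made up for this example.

```python
import math

def evalue_to_pvalue(evalue):
    """Probability of at least one chance hit, assuming Poisson-distributed hits."""
    # -expm1(-E) equals 1 - exp(-E) but stays accurate for very small E.
    return -math.expm1(-evalue)

for e in (1e-10, 0.01, 1.0, 10.0):
    print(e, evalue_to_pvalue(e))
```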
Other Protein Databases
SMART—uses HMMs, focus is signalling and regulatory
proteins (tend to be more divergent than enzymes)
TIGRFAMs– TIGR curated alignments used to generate
HMMs, one advantage is that names should be functionally
accurate for all proteins they represent
PRINTS—not HMM based, uses “fingerprints” of conserved
motifs
Ecumenical solution—InterPro—
collection of multiple databases under one umbrella
Still more kinds of BLAST
PSI-BLAST– Position Specific Iterated BLAST
Use to find members of a protein family or to build a custom position-specific
scoring matrix
The most sensitive BLAST program, making it useful for finding very distantly
related proteins or new members of a protein family
1st round: Standard BLASTP search, then a PSSM is built from all hits with E
values better than the inclusion threshold
2nd round: The PSSM is used to score alignments in this search. Additional
hits better than the inclusion threshold are incorporated into an updated PSSM
3rd+ rounds: As the second round. The search reaches convergence when no new
hits are found.
The PSSM can be saved for use in later searches
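The round-by-round logic above can be sketched as a toy loop. This is not how PSI-BLAST is implemented: the sketch assumes ungapped peptides of equal length, replaces the first-round BLASTP search with a PSSM built from the query alone, and uses a raw score cutoff in place of an E value inclusion threshold. The function names, peptides and threshold are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(seqs, pseudocount=1.0):
    """Log2-odds PSSM from ungapped, equal-length sequences (uniform background)."""
    background = 1.0 / len(AMINO_ACIDS)
    total = len(seqs) + pseudocount * len(AMINO_ACIDS)
    return [
        {aa: math.log2((column.count(aa) + pseudocount) / total / background)
         for aa in AMINO_ACIDS}
        for column in zip(*seqs)
    ]

def score(pssm, seq):
    return sum(column[aa] for column, aa in zip(pssm, seq))

def iterate_search(query, database, threshold, max_rounds=10):
    """Rebuild the PSSM each round until no new hits pass the threshold."""
    included = [query]
    history = []  # new hits found in each round
    for _ in range(max_rounds):
        pssm = build_pssm(included)
        new_hits = [s for s in database
                    if s not in included and score(pssm, s) >= threshold]
        history.append(new_hits)
        if not new_hits:  # convergence: nothing new was found
            break
        included.extend(new_hits)
    return included, history

# Hypothetical database: two close relatives, one distant relative, one unrelated.
database = ["FGEVAL", "LGELSL", "LGEVSI", "WWPPCC"]
included, history = iterate_search("FGELAL", database, threshold=3.0)
print(history)
```

The distant peptide LGEVSI scores below the cutoff against the query alone, but passes once the PSSM has absorbed the substitutions seen in the closer relatives. This is the source of the extra sensitivity, and also why one false inclusion can corrupt later rounds.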
Still more kinds of BLAST
PHI-BLAST– Pattern Hit Initiated BLAST
Find proteins similar to the query around a given pattern
Must enter both a query sequence containing the pattern AND a pattern to search
on
Example Pattern: (easy)
FGELA
(harder) [LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]
Matching peptide: FGELALMYNTPRAATIVA
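To check by hand that the peptide really matches the harder pattern, the PROSITE-style syntax can be translated into a regular expression. The sketch below handles only the elements used on this slide plus `{...}` exclusions; `prosite_to_regex` is a name invented for this example, not a BLAST function.

```python
import re

# One pattern element: x, a single residue, [allowed], or {forbidden},
# optionally followed by a repeat count such as (5) or (5,11).
TOKEN = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    compact = pattern.replace("-", "")
    parts = []
    pos = 0
    while pos < len(compact):
        match = TOKEN.match(compact, pos)
        if match is None:
            raise ValueError(f"unsupported pattern near: {compact[pos:]!r}")
        residue, low, high = match.groups()
        if residue == "x":
            regex = "."                          # any residue
        elif residue.startswith("{"):
            regex = "[^" + residue[1:-1] + "]"   # anything except these
        else:
            regex = residue                      # literal or [set]
        if low is not None:
            regex += "{" + low + ("," + high if high else "") + "}"
        parts.append(regex)
        pos = match.end()
    return "".join(parts)

hard = "[LIVMF]-G-E-x-[GAS]-[LIVM]-x(5,11)-R-[STAQ]-A-x-[LIVMA]-x-[STACV]"
peptide = "FGELALMYNTPRAATIVA"
hard_regex = prosite_to_regex(hard)
print(hard_regex)
print(bool(re.fullmatch(hard_regex, peptide)))
```

For this peptide the `x(5,11)` element spans the five residues MYNTP, the shortest spacer the pattern allows.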
Enzyme Nomenclature
EC Numbers: A hierarchical classification scheme for enzymes
Enzymes are named and classified according to the reactions they
catalyze
1. Oxidoreductases
2. Transferases
3. Hydrolases
4. Lyases
5. Isomerases
6. Ligases
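Because the scheme is hierarchical, the top-level class can be read straight from the first field of an EC number. The sketch below covers only the six classes listed above; `parse_ec` and `ec_class` are illustrative helper names.

```python
EC_CLASSES = {
    1: "Oxidoreductases",
    2: "Transferases",
    3: "Hydrolases",
    4: "Lyases",
    5: "Isomerases",
    6: "Ligases",
}

def parse_ec(ec_number):
    """Split 'EC 2.7.1.1' or '2.7.1.1' into its four fields (as strings)."""
    text = ec_number.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    fields = text.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected four fields in EC number: {ec_number!r}")
    return fields

def ec_class(ec_number):
    """Return the top-level class name for an EC number."""
    first = int(parse_ec(ec_number)[0])
    if first not in EC_CLASSES:
        raise ValueError(f"class {first} is not in the six-class table")
    return EC_CLASSES[first]

print(ec_class("EC 2.7.1.1"))   # hexokinase
print(ec_class("1.1.1.1"))      # alcohol dehydrogenase
```

Partially assigned numbers such as `4.2.1.-` still parse, since only the first field is converted to an integer.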
Putting it all together….
KEGG– Kyoto Encyclopedia of Genes and Genomes
Collection of manually drawn metabolic/cellular pathway
maps, based on the most up-to-date biochemical information
Metabolic maps are the strongest feature--they use EC-numbered
enzymes as key players, allowing the pathways of different
genomes to be mapped easily from their predetermined
EC content
Also has a growing collection of signalling/cellular process
maps
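Mapping a genome onto pathway maps by EC content amounts to set comparison. The sketch below assumes a hand-typed toy pathway table and a made-up genome annotation; nothing here is retrieved from KEGG, and `pathway_coverage` is a hypothetical helper.

```python
def pathway_coverage(genome_ecs, pathways):
    """For each pathway, report EC numbers present, missing, and fraction covered."""
    report = {}
    for name, required in pathways.items():
        present = required & genome_ecs
        missing = required - genome_ecs
        report[name] = {
            "present": sorted(present),
            "missing": sorted(missing),
            "fraction": len(present) / len(required),
        }
    return report

# Toy pathway definitions: each pathway is just a set of EC numbers.
pathways = {
    "toy_pathway_1": {"2.7.1.1", "5.3.1.9", "2.7.1.11", "4.1.2.13"},
    "toy_pathway_2": {"1.1.1.1", "4.1.1.1"},
}

# Hypothetical EC numbers assigned to one annotated genome.
genome_ecs = {"2.7.1.1", "5.3.1.9", "4.1.2.13", "1.1.1.1", "3.1.3.11"}

report = pathway_coverage(genome_ecs, pathways)
for name, entry in report.items():
    print(name, entry["fraction"], "missing:", entry["missing"])
```

A missing EC number may mean the step is truly absent, or only that the gene was not annotated, so gaps in a mapped pathway are candidates for the homology and context methods covered earlier.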