Prediction of protein function
from sequence analysis
Rita Casadio
BIOCOMPUTING GROUP
University of Bologna, Italy
The “omic” era
Genome Sequencing Projects:
Archaea: 74 species (in progress: 52)
Bacteria: 973 species (in progress: 2266)
Eukaryotic: complete: 23; draft assembly: 318; in progress: 359
http://www.ncbi.nlm.nih.gov/genomes/static/gpstat.html
Update: January 2010
The Data Bases of Biological Sequences and
Structures
GenBank: 108,431,692 sequences; 106,533,156,756 nucleotides (35.5 HGE!)
NR(*): 10,381,779 sequences; 3,542,056,219 residues
SwissProt: 514,212 sequences; 180,900,945 residues
PDB: 60,654 structures; membrane proteins <2%
(*) CDS translations+PDB+SwissProt+PIR+PRF
Update: January 2009
>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
SKQDVTEVEINENELKRLDEYANKDALNHYREIFKDLKSRGLYFILNMYH
WPLPLWLHDPIRVRRGDFTGPSGWLSTRTVYEFARFSAYIAWKFDDLVDE
YSTMNEPNVVGGLGYVGVKSGFPPGYLSFELSRRHMYNIIQAHARAYDGI
KSVSKKPVGIIYANSSFQPLTDKDMEAVEMAENDNRWWFFDAIIRGEITR
GNEKIVRDDLKGRLDWIGVNYYTRTVVKRTEKGYVSLGGYGHGCERNSVS
LAGLPTSDFGWEFFPEGLYDVLTKYWNRYHLYMYVTENGIADDADYQRPY
YLVSHVYQVHRAINSGADVRGYLHWSLADNYEWASGFSMRFGLLKVDYNT
KRLYWRPSALVYREIATNGAITDEIEHLNSVPPVKPLRH
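The BGAL_SULSO entry on this slide is written in FASTA layout: a header line starting with ">" followed by the residues wrapped over several lines. As a minimal sketch (the function name `parse_fasta` and the short demo record are illustrative assumptions, not part of the slides), such text could be read into identifier/sequence pairs like this:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a dict {header: sequence}.

    The header is everything after '>' on the header line; sequence
    lines are joined without whitespace. Hypothetical helper written
    for illustration only.
    """
    records = {}
    header = None
    chunks = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records[header] = "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records[header] = "".join(chunks)
    return records


# Demo on the header and the first two sequence lines shown on the slide.
demo = """>BGAL_SULSO BETA-GALACTOSIDASE Sulfolobus solfataricus.
MYSFPNSFRFGWSQAGFQSEMGTPGSEDPNTDWYKWVHDPENMAAGLVSG
DLPENGPGYWGNYKTFHDNAQKMGLKIARLNVEWSRIFPNPLPRPQNFDE
"""
records = parse_fasta(demo)
for header, sequence in records.items():
    print(header.split()[0], len(sequence))
```

Counting residues this way is how figures such as "sequences" and "residues" per database are accumulated in principle; the databases' own release statistics remain the authoritative source.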
From Genotype to Phenotype
Genes in DNA... (about 30,000 in the human genome)
>protein kinase
acctgttgatggcgacagggactgtatgctgatct
atgctgatgcatgcatgctgactactgatgtgggg
gctattgacttgatgtctatc....
Over 20 million single mutations are known in genes
…with different effects depending on variability
…code for proteins...
…proteins correspond to functions...
…when they are expressed
From 5000 to 10000 proteins per tissue
Proteins interact
…in metabolic pathways
STRING 8—a global view on proteins and their
functional interactions in 630 organisms-
Jensen et al., 2009, Nucleic Acids Research, Vol 37.
The Human Interactome in STRING
22,937 proteins and 1,482,533 interactions
http://string.embl.de
One problem of the “omic
era”:
Protein functional
annotation
The Protein Data Bank
http://www.rcsb.org/pdb/home/home.do
No. of proteins with known structure: 57529
SCOP: Structural Classification of Proteins
Domains are hierarchically classified:
- class
- fold: proteins with secondary structures in the same arrangement and with the same topological connections
- superfamily: structures and functional features suggest a common evolutionary origin
- family: proteins with identities ≥30%, or with identities <30% but with similar structures and functions
From the Protein Sequence to the Structure and Function space
Lesk A., 2004
[Figure: methods arranged along a Sequence Identity (%) axis running from 100% to 0%, with a mark at 30% and regions labelled "PDB" and "New Folds"]
•Sequence comparison
•Fold recognition
•Machine-learning aided alignment
•Threading
•Ab initio and de novo modelling
•Machine-learning prediction of structural features
From the Protein Sequence to the Structure space
What is protein function?
What is a function?
For enzymes: function can be defined on the basis of the catalysed molecular reaction.
e.g. aspartic aminotransferase (AST)
In biochemistry, a transaminase or aminotransferase is an enzyme that catalyzes a type of reaction between an amino acid and an α-keto acid.
Specifically, this reaction (transamination) removes the amino group from the amino acid, leaving behind an α-keto acid, and transfers it to the reactant α-keto acid, converting it into an amino acid. The enzymes are important in the production of various amino acids, and measuring the concentrations of various transaminases in the blood is important in diagnosing and tracking many diseases. Transaminases require the coenzyme pyridoxal phosphate, which is converted into pyridoxamine in the first phase of the reaction, when an amino acid is converted into a keto acid.
Enzyme-bound pyridoxamine in turn reacts with pyruvate, oxaloacetate, or alpha-ketoglutarate, giving alanine, aspartic acid, or glutamic acid, respectively.
The presence of elevated transaminases can be an indicator of liver damage.
Enzyme Commission (E.C.) classification
A hierarchical classification for enzymes
EC 2.6 Transferring nitrogenous groups
EC 2.6.1 Transaminases
EC 2.6.1.1 Aspartate transaminase
Other name(s): glutamic-oxaloacetic transaminase; glutamic-aspartic transaminase; transaminase A; AAT; AspT; 2-oxoglutarate-glutamate aminotransferase; aspartate α-ketoglutarate transaminase; aspartate aminotransferase; aspartate-2-oxoglutarate transaminase; aspartic acid aminotransferase; aspartic aminotransferase; aspartyl aminotransferase; AST; glutamate-oxalacetate aminotransferase; glutamate-oxalate transaminase; glutamic-aspartic aminotransferase; glutamic-oxalacetic transaminase; glutamic oxalic transaminase; GOT (enzyme); L-aspartate transaminase; L-aspartate-α-ketoglutarate transaminase; L-aspartate-2-ketoglutarate aminotransferase; L-aspartate-2-oxoglutarate aminotransferase; L-aspartate-2-oxoglutarate-transaminase; L-aspartic aminotransferase; oxaloacetate-aspartate aminotransferase; oxaloacetate transferase; aspartate:2-oxoglutarate aminotransferase; glutamate oxaloacetate transaminase
Systematic name: L-aspartate:2-oxoglutarate aminotransferase
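The hierarchy in an EC number can be made explicit by splitting it into its four levels. The following is a minimal sketch, assuming the usual four-field layout (class, subclass, sub-subclass, serial number); the names `ECNumber`, `parse_ec`, `lineage` and `LABELS` are illustrative and not part of the slides, and the labels are only those quoted above.

```python
from typing import List, NamedTuple


class ECNumber(NamedTuple):
    enzyme_class: int   # e.g. 2
    subclass: int       # e.g. 6
    sub_subclass: int   # e.g. 1
    serial: int         # e.g. 1


def parse_ec(ec: str) -> ECNumber:
    """Parse a complete EC number such as 'EC 2.6.1.1' (prefix optional)."""
    text = ec.strip()
    if text.upper().startswith("EC"):
        text = text[2:].strip()
    parts = text.split(".")
    if len(parts) != 4:
        raise ValueError(f"expected four dot-separated fields, got {ec!r}")
    return ECNumber(*(int(p) for p in parts))


def lineage(ec: str) -> List[str]:
    """Return every level of the hierarchy, from class down to the enzyme."""
    fields = [str(x) for x in parse_ec(ec)]
    return ["EC " + ".".join(fields[:i]) for i in range(1, 5)]


# Labels taken from the slide; the top-level class is not labelled there.
LABELS = {
    "EC 2.6": "Transferring nitrogenous groups",
    "EC 2.6.1": "Transaminases",
    "EC 2.6.1.1": "Aspartate transaminase",
}

for level in lineage("EC 2.6.1.1"):
    print(level, LABELS.get(level, "(no label on the slide)"))
```

Two enzymes sharing the first three fields belong to the same sub-subclass, which is one simple way to compare functions at different levels of resolution.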
Problems:
Isoforms
e.g. how to differentiate the function of the cytoplasmic aspartate aminotransferase from that of the mitochondrial isoform?
Non enzymatic proteins
GO function vocabulary:
http://www.geneontology.org/
The Ontologies
•Cellular component
•Biological process
•Molecular function
Gene Ontology classification:
The human cytoplasmic aspartate transaminase
GO:0004069
GO:0005829
GO:0006533
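A GO annotation is a set of term identifiers, each belonging to one of the three ontologies. The sketch below groups the three identifiers listed for the human cytoplasmic aspartate transaminase by ontology. The ontology assigned to each identifier and the term names in the comments are not stated on the slide; they are given here as an assumption to be checked against http://www.geneontology.org/. The names `ANNOTATION`, `is_valid_go_id` and `group_by_ontology` are illustrative.

```python
import re
from collections import defaultdict

GO_ID = re.compile(r"^GO:\d{7}$")  # 'GO:' followed by seven digits

# Assumed term-to-ontology assignment (verify against the GO website).
ANNOTATION = {
    "GO:0004069": "molecular_function",   # assumed: aspartate aminotransferase activity
    "GO:0005829": "cellular_component",   # assumed: cytosol
    "GO:0006533": "biological_process",   # assumed: aspartate catabolic process
}


def is_valid_go_id(term: str) -> bool:
    """Check only the identifier syntax, not whether the term exists."""
    return GO_ID.match(term) is not None


def group_by_ontology(annotation):
    """Return {ontology: sorted list of GO identifiers}."""
    grouped = defaultdict(list)
    for term, ontology in annotation.items():
        if not is_valid_go_id(term):
            raise ValueError(f"malformed GO identifier: {term!r}")
        grouped[ontology].append(term)
    return {ontology: sorted(terms) for ontology, terms in grouped.items()}


for ontology, terms in sorted(group_by_ontology(ANNOTATION).items()):
    print(ontology, terms)
```

Unlike an EC number, this representation also covers non-enzymatic proteins and can record the cellular location, which is relevant to the isoform problem raised earlier.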
One BIG problem of the
“omic era”:
Protein functional
annotation
Functional annotation in silico by homology search
ADH1_SULSO  ----------MRAVRLVEIGKP--LSLQEIGVPKPKGPQVLIKVEAAGVCHSDVHMRQGRFGNLRIVE
ADH_CLOBE   ----------MKGFAMLGINKLG---WIEKERPVAGSYDAIVRPLAVSPCTSDIHTVFEGA-------
ADH_THEBR   ----------MKGFAMLSIGKVG---WIEKEKPAPGPFDAIVRPLAVAPCTSDIHTVFEGA-------
ADH1_SOLTU  MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG-------
ADH2_LYCES  MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG-------
ADH1_ASPFL  ----MSIPEMQWAQVAEQKGGP--LIYKQIPVPKPGPDEILVKVRYSGVCHTDLHALKGDW-------
Sequence comparison is performed with alignment programs
Sequence identity  40 %
Similar structure and function
(??)
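Sequence identity between two aligned rows can be computed by counting matching residues. The sketch below uses one common convention, matches divided by the number of columns in which neither row has a gap; other denominators (alignment length, shorter sequence length) are also in use, so this is an assumption rather than the definition used on the slide. The rows are the aligned fragments shown above, with leading and trailing all-gap columns trimmed for the ADH_CLOBE/ADH_THEBR pair.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """Percent identity over columns where both rows carry a residue."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(row_a, row_b):
        if a == gap or b == gap:
            continue  # skip columns with a gap in either row
        compared += 1
        if a == b:
            matches += 1
    if compared == 0:
        return 0.0
    return 100.0 * matches / compared


# Aligned fragments copied from the alignment on this slide.
ADH_CLOBE = "MKGFAMLGINKLG---WIEKERPVAGSYDAIVRPLAVSPCTSDIHTVFEGA"
ADH_THEBR = "MKGFAMLSIGKVG---WIEKEKPAPGPFDAIVRPLAVAPCTSDIHTVFEGA"
ADH1_SOLTU = "MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG"
ADH2_LYCES = "MSTTVGQVIRCKAAVAWEAGKP--LVMEEVDVAPPQKMEVRLKILYTSLCHTDVYFWEAKG"

print(f"ADH_CLOBE vs ADH_THEBR: {percent_identity(ADH_CLOBE, ADH_THEBR):.1f}%")
print(f"ADH1_SOLTU vs ADH2_LYCES: {percent_identity(ADH1_SOLTU, ADH2_LYCES):.1f}%")
```

These values refer only to the short fragments displayed, not to the full-length proteins.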
Methods for similarity searches:
BLAST, Psi-BLAST (http://www.ncbi.nlm.nih.gov/BLAST/) sequence
Altschul et al., (1990) J Mol Biol 215:403-410
Altschul et al., (1997) Nucleic Acids Res. 25:3389-3402
Pfam (http://pfam.wustl.edu/hmmsearch.shtml) sequence/structure
Bateman et al., (2000) Nucleic Acids Research 28:263-266
Transfer by inheritance:
Function annotation transfer from sequence
through homology
http://www.uniprot.org/
PDB
The annotation process at UniProt
Open problems of “inheritance through homology”
•Not all UniProt files are GO annotated
•The optimal threshold value of sequence identity for function transfer is not known
•Proteins contain multiple domains
•Proteins can share common domains without necessarily sharing the same function
•In proteins, different combinations of shared domains lead to different biological roles
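The transfer step and its dependence on the identity threshold can be sketched as follows. This is a toy illustration under stated assumptions, not the UniProt procedure: the `Hit` record, the function `transfer_annotations`, the subject names and their identity values are all hypothetical, and the 40% default simply echoes the figure quoted on the alignment slide.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Set


@dataclass(frozen=True)
class Hit:
    """One homology-search hit (hypothetical record for illustration)."""
    subject_id: str
    percent_identity: float
    go_terms: FrozenSet[str]


def transfer_annotations(hits: Iterable[Hit], min_identity: float = 40.0) -> Set[str]:
    """Collect GO terms from every hit at or above the identity threshold.

    Hits without GO terms contribute nothing, which mirrors the problem
    that not all database entries are GO annotated.
    """
    transferred: Set[str] = set()
    for hit in hits:
        if hit.percent_identity >= min_identity:
            transferred |= hit.go_terms
    return transferred


# Hypothetical hits; identities and subject names are made up.
hits = [
    Hit("subject_A", 62.0, frozenset({"GO:0004069", "GO:0005829"})),
    Hit("subject_B", 35.0, frozenset({"GO:0006533"})),
    Hit("subject_C", 80.0, frozenset()),  # unannotated entry
]

print(sorted(transfer_annotations(hits, min_identity=40.0)))
print(sorted(transfer_annotations(hits, min_identity=30.0)))
```

Lowering the threshold from 40 to 30 changes the transferred set, which is why the unknown optimal threshold matters. The sketch also ignores domain architecture, so it would transfer terms between proteins that share only one domain.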