Protein World

SARA
12-12-2002 Amsterdam
Tim Hulsen
Genome sequencing
• Since 1995: sequencing of complete
‘genomes’ (DNA): A/C/G/T order
ACGTCATCGTAGCTAGCTAGTCGTACGTATG
TGCAGTAGCATCGATCGATCAGCATGCATAC
• At this moment more than 80 genomes
have been sequenced and published, of
all kinds of organisms:
– Animals
– Plants
– Fungi
– Bacteria
Genomes  Proteins
• ‘Transcription’ and ‘translation’ of specific regions of the
genome leads to proteins, consisting of twenty types of
‘amino acids’:
ATG ACG CTG AGC TGC GGA CGT TGA -> MTLSCGR (TGA = stop)
• Proteins are responsible for all kinds of life processes
• All the proteins that can be produced in an organism
together are called the ‘proteome’
• Sequence comparisons make it
possible to classify proteins
into families
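The codon-by-codon translation shown above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard genetic code and a DNA string that is already in the correct reading frame; the function name `translate` and the compact table construction are illustrative choices, not part of the original slides.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate an in-frame DNA string into one-letter amino acid codes.

    Translation stops at the first stop codon; an incomplete trailing
    codon is ignored.
    """
    dna = dna.replace(" ", "").upper()
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


print(translate("ATG ACG CTG AGC TGC GGA CGT TGA"))
```

The start codon ATG encodes methionine (M), so the translated peptide begins with M; the final codon TGA is a stop codon and contributes no amino acid.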
Protein families
• e.g. the GPCR family
• Sequence comparison helps in predicting the function of new
proteins
Determining protein functions
• Function of 40-50% of the new proteins is
unknown
• Understanding of protein functions and
relationships is important for:
– Study of fundamental biological processes
– Drug design
– Genetic engineering
Sequence comparison
• Smith-Waterman dynamic programming
algorithm (1981): calculates similarity/distance
between two sequences:
Query:   ---PLIT-LETRESV
Subject: NEQPKVTMLETRQTAD
(similar residues were marked in bold on the slide)
• Results in a SW-score that is a measure for how
similar the two sequences are to each other
• Disadvantage: the score depends on sequence length
• After the alignments, the proteins are ‘clustered’
(divided into families) according to their similarity
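The Smith-Waterman score can be sketched as follows. This is a simplified illustration, not the implementation used for Protein World: it assumes a flat match/mismatch score and a linear gap penalty (the values 2, -1 and -2 are arbitrary), whereas protein comparisons normally use a substitution matrix and affine gap penalties.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b.

    Simplified scoring: fixed match/mismatch values and a linear gap
    penalty (assumed values, chosen only for illustration).
    """
    prev = [0] * (len(b) + 1)  # previous row of the dynamic programming matrix
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            # A local alignment may restart anywhere, so cells never drop below 0.
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Identical sequences: the score grows with length, which illustrates
# the length dependence mentioned above.
print(smith_waterman_score("LETR", "LETR"))
print(smith_waterman_score("LETRLETR", "LETRLETR"))
```

With this scoring, two identical sequences of length n score 2n, so a long pair outscores a short pair even when both are perfect matches.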
Existing databases
• Domain-based clusterings: Prosite, Pfam,
ProDom, Prints, Domo, Blocks
• Protein-based clusterings: ProtoMap,
COGs, Systers, PIR, ClusTr
• Structural classifications: SCOP, CATH,
FSSP
Why should there be another database?
Another method
• Enhanced Smith-Waterman algorithm: Monte-Carlo
evaluation (Lipman et al., 1984)
• How big is the chance that two sequences are similar but
not related?
• One of the two sequences is randomized and the score is
recalculated (200 times). Randomization leads to
sequences with the same length and the same
composition, but a different order
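• Method leads to calculation of the Z-value:
$$Z(A,B) = \frac{S(A,B) - \mu}{\sigma}$$
where S(A,B) is the SW-score of the two sequences, and µ and σ are the
mean and standard deviation of the scores obtained with the randomized
sequences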
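The Monte-Carlo evaluation described above can be sketched as follows: shuffle one sequence, rescore, repeat, and express the real score in standard deviations above the mean of the shuffled scores. This is a hedged illustration only. The scoring function is a simplified Smith-Waterman (flat match/mismatch, linear gap penalty), and the names `sw_score`, `shuffled` and `z_value`, the fixed seed, and the test sequence are assumptions made for the example.

```python
import random
import statistics


def sw_score(a, b, match=2, mismatch=-1, gap=-2):
    """Simplified Smith-Waterman local alignment score (assumed scoring values)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


def shuffled(seq, rng):
    """Return seq with the same length and composition but a random order."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def z_value(a, b, n_random=200, seed=0):
    """Z(A,B) = (S(A,B) - mu) / sigma, with mu and sigma estimated from
    scores of A against n_random shuffled versions of B."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = sw_score(a, b)
    random_scores = [sw_score(a, shuffled(b, rng)) for _ in range(n_random)]
    mu = statistics.mean(random_scores)
    sigma = statistics.pstdev(random_scores)
    if sigma == 0:
        raise ValueError("randomized scores have zero variance")
    return (observed - mu) / sigma


# Hypothetical test sequence, compared against itself.
seq = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQAPILSRVGDGTQDNLSGAEKAVQVKVKALPDAQFEVV"
print(z_value(seq, seq))
```

Because every comparison needs 200 extra alignments, the cost per pair is roughly 200 times that of a plain Smith-Waterman score, which is why an all-against-all run is so expensive.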
Advantages
• The obtained Z-value is a far more reliable
measure of sequence similarity than the SW-score:
– The SW-score depends on sequence length, the
Z-value does not
– Amino acid bias does not affect the Z-value
• Independent of the database size
• Easier updating of the database, without a
total recalculation
Disadvantage
• LOTS of calculation time needed,
especially when all proteins in all
proteomes are compared to each other
(“all-against-all”)!
-> SARA
SARA calculation
• Proteomes of 82 organisms compared ‘all-against-all’
with the use of the Monte Carlo
algorithm: more than 400,000 proteins!
• 21,600 CPU days (~520,000 CPU hours)
• = 21,600 PCs running in parallel for 24
hours, or 1 PC running for ~60 years
• Using supercomputer TERAS (1024-CPU
SGI Origin 3800) at SARA: less than two
months!
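The figures above can be checked with a few lines of arithmetic. The sketch below assumes a 365.25-day year and, for the TERAS estimate, perfect scaling with all 1024 CPUs dedicated to the job; both are simplifying assumptions, so the last number is a lower bound rather than the actual run time.

```python
CPU_DAYS = 21_600   # total compute reported for the all-against-all run
TERAS_CPUS = 1024   # CPUs in the SGI Origin 3800

cpu_hours = CPU_DAYS * 24
single_pc_years = CPU_DAYS / 365.25        # one PC working alone
ideal_teras_days = CPU_DAYS / TERAS_CPUS   # assumes perfect scaling, full machine

print(f"{cpu_hours:,} CPU hours")                      # 518,400 CPU hours
print(f"{single_pc_years:.1f} years on one PC")        # 59.1 years on one PC
print(f"{ideal_teras_days:.1f} days on {TERAS_CPUS} CPUs")  # 21.1 days on 1024 CPUs
```

The ideal figure of about 21 days is consistent with the reported "less than two months".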
Parties involved
• Gene-IT (Paris, France)
• SARA (Amsterdam, the Netherlands)
• CMBI (Nijmegen, the Netherlands)
• Organon (Oss, the Netherlands)
• EBI (Hinxton, UK)
Supporting parties
• Financed by NCF, foundation in support
of supercomputing
• Under the auspices of BioASP, the new
Dutch knowledge and service center for
Bioinformatics
Results available through BioASP
• http://www.bioasp.nl
• Log in and click on the links ‘Research’ and ‘Protein
World’
• Organism selection screen [screenshot]
• Results screen [screenshot]
• Alignment screen [screenshot]
Conclusions
• Currently the most comprehensive and
most accurate data-set of protein
comparisons
• A start for a maintainable and unique
database of all proteins currently known
• A rich data-source for clustering, data mining and orthology determination
Orthology determination
• Orthologs: genes/proteins in different
species that derive from a common
ancestor
• Orthologs often have the same function
• Interesting! Information from other species
could help in annotating a protein
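As an illustration of how pairwise comparison data could feed orthology determination, the sketch below uses the reciprocal (bidirectional) best hit heuristic: two proteins from different species are called putative orthologs if each is the other's highest-scoring match. This is one common heuristic and not necessarily the method used with Protein World; the protein names and Z-values are invented for the example.

```python
def best_hits(scores):
    """Map each query protein to its highest-scoring subject protein.

    scores: dict mapping (query, subject) pairs to a similarity value.
    """
    best = {}
    for (query, subject), z in scores.items():
        if query not in best or z > best[query][1]:
            best[query] = (subject, z)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted pairs (a, b) where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical Z-values between proteins of two species.
human_vs_mouse = {
    ("humA", "mouseA"): 40, ("humA", "mouseB"): 12,
    ("humB", "mouseB"): 35, ("humC", "mouseB"): 20,
}
mouse_vs_human = {
    ("mouseA", "humA"): 41, ("mouseB", "humB"): 34, ("mouseB", "humC"): 19,
}

print(reciprocal_best_hits(human_vs_mouse, mouse_vs_human))
```

In this toy data, humC also matches mouseB, but mouseB's best hit is humB, so only the humA/mouseA and humB/mouseB pairs are reported.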
Thank you for your attention
Any questions?