MS Thesis Summary: Protein Structure Database for


M.S. Thesis Defense
Protein Structure Database for Structural Genomics Group
Jessica Lau
December 13, 2004
• Bioinformatics is
– Analysis of biological data: gene expression, DNA sequence, protein sequence.
– Data mining and management of biological information through database systems.
• At the Northeast Structural Genomics Consortium (NESG), database management systems play a large role in daily operation
– Data collection and mining of experimental results
– Track target progress: status milestones
– Exchange information with the rest of the world
• My thesis presents work in database management systems at the NESG.
– Part 1: ZebaView
– Part 2: Worm Structure Gallery
– Part 3: Prototype of NESG Structure Gallery
• ZebaView is the official target list of the Northeast Structural Genomics Consortium
• Displays a summary table of NESG targets.
– Status milestones
– Protein properties: DNA and protein sequences, molecular weight, isoelectric point
• New targets are curated and then uploaded to SPiNE.
• 11,284 targets from 88 organisms.
Family View
NESG Families
• Unfolded
• Membrane
• Core 50
• Nf-kB
Target Summary Statistics
Selected → Cloned → Expressed → Soluble → Purified → X-ray or NMR data collection → In PDB
• 4,418 targets cloned
• 141 structures
• 3.4% successful targets
[Chart: Soluble targets, Prokaryotic vs. Eukaryotic. Axes: Organism; Percentage of Soluble/Cloned. Organisms labeled: C. elegans (W), H. sapiens (H), D. melanogaster (F), S. cerevisiae (Y).]
[Chart: Success (In PDB / Cloned), Prokaryotic vs. Eukaryotic. Axes: Organism; Percentage In PDB/Cloned. Organisms labeled: C. elegans (W), H. sapiens (H), D. melanogaster (F), S. cerevisiae (Y).]
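The two charts report, per organism, the share of cloned targets that reach a later milestone. A minimal sketch of that calculation follows, assuming each target record stores only its furthest milestone and that milestones are reached strictly in the order listed. The record layout, function names, and sample targets are hypothetical and are not NESG data.

```python
from collections import defaultdict

# Ordered status milestones, as listed on the slide.
MILESTONES = [
    "Selected", "Cloned", "Expressed", "Soluble", "Purified",
    "X-ray or NMR data collection", "In PDB",
]
RANK = {name: i for i, name in enumerate(MILESTONES)}


def reached(target, milestone):
    """True if the target's furthest milestone is at or beyond `milestone`."""
    return RANK[target["status"]] >= RANK[milestone]


def percent_of_cloned(targets, milestone):
    """Per organism: 100 * (targets reaching `milestone`) / (targets cloned)."""
    cloned = defaultdict(int)
    hits = defaultdict(int)
    for t in targets:
        if reached(t, "Cloned"):
            cloned[t["organism"]] += 1
            if reached(t, milestone):
                hits[t["organism"]] += 1
    return {org: 100.0 * hits[org] / n for org, n in cloned.items()}


# Made-up records for illustration only (assumption, not NESG data).
toy_targets = [
    {"id": "T1", "organism": "C. elegans", "status": "In PDB"},
    {"id": "T2", "organism": "C. elegans", "status": "Soluble"},
    {"id": "T3", "organism": "C. elegans", "status": "Cloned"},
    {"id": "T4", "organism": "C. elegans", "status": "Expressed"},
    {"id": "T5", "organism": "H. sapiens", "status": "Selected"},
    {"id": "T6", "organism": "H. sapiens", "status": "Purified"},
]

soluble = percent_of_cloned(toy_targets, "Soluble")
in_pdb = percent_of_cloned(toy_targets, "In PDB")
```

Targets that were selected but never cloned (T5 here) are left out of the denominator, which matches the "/Cloned" wording on the chart axes.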
GO, Cellular Localization, and SignalP
• Search for targets that have
– any of the three GO ontologies defined
– no GO ontologies defined at all
• 116 NESG structures do not have Molecular Function defined
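The GO search described above can be sketched as a simple filter. This is an illustration only: the field names and sample records are hypothetical, and the three ontologies are taken to be the standard GO ones (molecular function, biological process, cellular component).

```python
# The three GO ontologies; the dictionary keys are hypothetical field names.
GO_ONTOLOGIES = ("molecular_function", "biological_process", "cellular_component")


def has_any_go(target):
    """True if at least one of the three ontologies has a term assigned."""
    return any(target.get(ontology) for ontology in GO_ONTOLOGIES)


def lacks(target, ontology):
    """True if the given ontology has no term assigned."""
    return not target.get(ontology)


# Made-up target records for illustration only.
toy_targets = [
    {"id": "A", "molecular_function": ["GO:0005515"],
     "biological_process": [], "cellular_component": []},
    {"id": "B", "molecular_function": [],
     "biological_process": ["GO:0006412"], "cellular_component": []},
    {"id": "C", "molecular_function": [],
     "biological_process": [], "cellular_component": []},
]

with_any_go = [t["id"] for t in toy_targets if has_any_go(t)]
with_no_go = [t["id"] for t in toy_targets if not has_any_go(t)]
without_function = [t["id"] for t in toy_targets if lacks(t, "molecular_function")]
```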
LOCTarget
• Secretory proteins require formation of disulfide bonds
• Oxidative folding is needed to reach the proper native fold
• 2,132 “Extracellular” NESG targets
Bovine ribonuclease A has four disulfide bonds to stabilize its 3-D structure.
Mahesh Narayan et al. (2000) Acc. Chem. Res., 33 (11), 805-812.
SignalP
• mRNAs are translated into proteins carrying a signal peptide for cellular localization
• The signal peptide is cleaved when the protein reaches its destination
Lodish et al., Molecular Cell Biology, 4th edition, Figure 7.1 (2000)
• SignalP predicts the cleavage site of the signal peptide
• Removal of the signal peptide gives the proper native fold
Part 2 – Worm Structure Gallery
Caenorhabditis elegans
– Widely studied model organism
• 2-3 week life span, small size (1.5 mm long), ease of laboratory cultivation, transparent body
• Small genome, yet complex organ systems similar to those of higher organisms: digestive, excretory, neuromuscular, reproductive
Donald Riddle et al., C. elegans II (1997)
Altun ZF and Hall DH, Atlas of C. elegans Anatomy, WormAtlas (2002-2004)
System Components
• 22,653 C. elegans proteins
• 42 experimentally determined
– 4 are from NESG
• 24 homology models
– 14 are from NESG
• 960 C. elegans proteins can potentially be modeled
• UniProt: Pfam domain, gene name, ORF name
• PDB coordinates
• Structure Validation Report
• Sequence similarities to proteins in PDB
Protein Structure Validation Software (PSVS)
• Suite of quality validation software
– PROCHECK
• Quality of experimental data
• Distribution of φ, ψ angles in Ramachandran plot
– MolProbity Clashscore
• Number of H atom clashes per 1,000 atoms
• Quality scores are expressed with respect to a set of scores from 129 high-resolution X-ray crystal structures
– < 500 residues, resolution <= 1.80 Å, R-factor <= 0.25 and R-free <= 0.28
Bhattacharya, A. et al., to be published
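A quality score expressed relative to a reference set can be sketched as a Z-score: the raw score's distance from the reference-set mean, in units of the reference-set standard deviation. The sketch below assumes the sample standard deviation and flips the sign for scores where lower is better, so that a more positive Z-score always means better quality. Both choices, the function name, and the reference values are assumptions, not the published PSVS procedure.

```python
from statistics import mean, stdev


def z_score(raw, reference_scores, higher_is_better=True):
    """Z-score of `raw` against a reference set of scores.

    Sign convention (assumption): positive means better than the reference mean.
    """
    mu = mean(reference_scores)
    sigma = stdev(reference_scores)  # sample standard deviation (assumption)
    z = (raw - mu) / sigma
    return z if higher_is_better else -z


# Made-up reference clashscores standing in for the 129-structure set.
toy_reference = [8.0, 10.0, 12.0]

# A clashscore of 16 is worse (higher) than the reference mean of 10.
clash_z = z_score(16.0, toy_reference, higher_is_better=False)
```

With this convention, a structure with more clashes than the reference set receives a negative Z-score.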
Homology Modeling Automatically (HOMA)
• Algorithm based on alignment between query and template sequences.
– Regions of conserved residues form a set of constraints for modeling
• Sequence identity of 40% or more
• Good quality template
Bad alignment → Bad model
Poor quality template → Poor quality model
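The 40% sequence identity requirement can be checked directly from a pairwise alignment. The sketch below assumes identity is counted over columns where neither sequence has a gap, which is one of several conventions in use. The function name and the example alignment are hypothetical, and this is not HOMA's own code.

```python
IDENTITY_THRESHOLD = 40.0  # percent, from the slide


def percent_identity(aligned_query, aligned_template):
    """Percent identical residues over gap-free alignment columns."""
    if len(aligned_query) != len(aligned_template):
        raise ValueError("aligned sequences must have equal length")
    aligned = 0
    identical = 0
    for q, t in zip(aligned_query, aligned_template):
        if q == "-" or t == "-":
            continue  # skip columns with a gap in either sequence
        aligned += 1
        if q == t:
            identical += 1
    if aligned == 0:
        return 0.0
    return 100.0 * identical / aligned


# Made-up alignment: 8 gap-free columns, 6 of them identical.
identity = percent_identity("MKT-AYIAK", "MKTLAYLGK")
usable_template = identity >= IDENTITY_THRESHOLD
```

Identity alone does not guarantee a good model; as the slide notes, the alignment and the template quality also have to be good.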
Quality scores of 3-D structures
[Scatter plot: Quality Z-scores, Homology Models vs. Experimentally Determined Structures. Axes: Procheck (all) z-score and MolProbity Clashscore z-score.]
Search
• Search for C. elegans proteins in the local database.
• Keyword: “Ubiquitin” in any field
Results:
72 C. elegans proteins
152 C. elegans proteins
2 experimentally determined structures
1 homology model
11 potential models
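A keyword search over any field can be sketched as a case-insensitive substring match across every value of a record. The record layout and entries below are made up for illustration; the gallery's actual schema and query logic are not shown in the source.

```python
def search_any_field(records, keyword):
    """Return records where any field contains `keyword`, ignoring case."""
    needle = keyword.lower()
    return [
        record for record in records
        if any(needle in str(value).lower() for value in record.values())
    ]


# Hypothetical records; names and identifiers are placeholders.
toy_records = [
    {"orf_name": "ORF0001", "gene_name": "gene-a",
     "pfam_domain": "Ubiquitin family", "structure": "Homology model"},
    {"orf_name": "ORF0002", "gene_name": "gene-b",
     "pfam_domain": "Protein kinase domain", "structure": "Potential model"},
    {"orf_name": "ORF0003", "gene_name": "ubiquitin-like-c",
     "pfam_domain": "", "structure": "Experimentally determined"},
]

hits = [r["orf_name"] for r in search_any_field(toy_records, "Ubiquitin")]
```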
System Architecture
• Java, Tomcat, MySQL, Perl.
Three-tier architecture
• Client: Web browser
• Application: JSP, logic components, data access components
• Data: MySQL
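The three-tier split can be illustrated with a small sketch in which each tier only talks to the one below it. Python is used here for brevity, and an in-memory SQLite database stands in for MySQL; the system itself is described as Java/JSP, so the class names, table layout, and sample rows are all hypothetical.

```python
import sqlite3
from html import escape


def open_database():
    """Data tier: in-memory SQLite as a stand-in for MySQL (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein ("
        "orf_name TEXT PRIMARY KEY, gene_name TEXT, structure_type TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?)",
        [
            ("ORF0001", "gene-a", "homology model"),
            ("ORF0002", "gene-b", "experimental"),
            ("ORF0003", "gene-c", "homology model"),
        ],
    )
    conn.commit()
    return conn


class ProteinDao:
    """Data access component: the only code that issues SQL."""

    def __init__(self, conn):
        self.conn = conn

    def find_by_structure_type(self, structure_type):
        cursor = self.conn.execute(
            "SELECT orf_name, gene_name FROM protein "
            "WHERE structure_type = ? ORDER BY orf_name",
            (structure_type,),
        )
        return cursor.fetchall()


def gallery_entries(dao, structure_type):
    """Logic component: turns database rows into view-ready entries."""
    rows = dao.find_by_structure_type(structure_type)
    return [{"orf": orf, "gene": gene} for orf, gene in rows]


def render_page(entries):
    """Presentation (the role JSP plays): produces HTML for the browser."""
    items = "".join(
        f"<li>{escape(e['orf'])} ({escape(e['gene'])})</li>" for e in entries
    )
    return f"<ul>{items}</ul>"


connection = open_database()
page = render_page(gallery_entries(ProteinDao(connection), "homology model"))
connection.close()
```

Keeping SQL inside the data access component is what lets the logic and presentation code be reused when the underlying tables change.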
Part 3 – NESG Structure Gallery
• Structure files submitted by automated pipeline
• ADIT integrated with SPiNE for uniform format
• PSVS and images automatically generated
• Structure information from PSVS directly into SPiNE
• Archives structure files.
• Structure files submitted by individual groups
• Structure information is entered into SPiNE manually
• Manually run PSVS and MolScript
• Downloads
– Structure Validation Report
– Structure related files
• Atomic coordinates
• NMR constraints
• NMR peak lists
• Chemical shifts
• Structure factors
• Annotation
– Functional annotation provided by other NESG members
– UniProt
– PDB coordinates file
• Reusing Java components from Worm Structure Gallery
– Enhance ZebaView performance to handle increased load and functionality
– Integrate annotation from other protein and structure databases.
– Make modules available for other Java-based applications within structural genomics.
– Develop a gallery for other organisms: yeast, fruit fly, human
– Continue specifications for the new NESG Structure Gallery
Advisor: Dr. Gaetano Montelione
Thanks to everyone at the
Protein NMR lab and NESG!
Aneerban Bhattacharya
John Everett
All the scientists who solved the structures!