Bio-Chemical databases - Metabolomics Fiehn Lab

Bio-Chemical databases
Guest Lecture
Graduate level course MCB221b - Mechanistic Enzymology
Tobias Kind – November 2007
• Database concepts - what is a “good” database (DB)
• How is data stored and queried and curated
• Enzyme DBs, Protein and peptide DBs, small molecule DBs
This document is hyperlinked (pictures and green text).
To use WWW links in this PPT switch to slide show mode.
1
Databases – very short primer (*)
Database interface – is what you see
Database queries – what you ask the database
Database objects – where the data is stored (index and tables)
Database types – relational databases, object oriented databases, flat file DBs
Database brands – Oracle, MySQL, Apache Derby, IBM DB2, PostgreSQL, MS SQL
Database query language – how a database can be programmed (SQL)
Database dump file – the whole database in a single (*.dmp) file
Database Ontology – the database vocabulary and the relationships it uses
Database Semantics – capture meaning by grammar or logical analysis
(*) you can study this for several years
and get a PhD in computer and database sciences.
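The terms in this primer can be tried out in a few lines. The following is a minimal sketch using SQLite from the Python standard library. The table name, index name and rows are made up for illustration, and SQLite writes its dump as plain SQL text rather than the (*.dmp) file named above.

```python
import sqlite3

# In-memory relational database; table and index names are hypothetical.
conn = sqlite3.connect(":memory:")

# Database objects: one table and one index.
conn.execute("CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT)")
conn.execute("CREATE INDEX idx_enzyme_name ON enzyme (name)")
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?)",
    [("1.1.1.1", "alcohol dehydrogenase"), ("2.7.1.1", "hexokinase")],
)
conn.commit()

# Database query: what you ask the database (written in SQL).
rows = conn.execute(
    "SELECT ec_number FROM enzyme WHERE name = ?", ("hexokinase",)
).fetchall()

# Database dump: the whole database (tables, rows, indexes) as one text.
dump_text = "\n".join(conn.iterdump())

print(rows)
print(dump_text)
```

The dump text can be replayed on another machine to rebuild the same database, which is why dump files are a common way to move or back up a whole DB.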
2
What is a good database?
As in normal life, it's important to distinguish between good and evil
Source: wikimedia.org
Good DB:
• allows multiple input queries
• exports in multiple output formats
• connects to other DBs
• is curated (checked for errors by humans or machines)
• is regularly updated (anywhere from daily to yearly)
• costs money (your money or taxpayers' money) or time
• allows bulk download (millions of data sets can be downloaded)
• has open interfaces (APIs) for query requests
Bad DB:
• allows only single requests (which have to be typed manually)
• is not a database but just a list or table
• has no link-out and no link-in
• allows no bulk download
• is not curated
•…
3
Exchange formats – SBML, XML, BioPax
XML format – general purpose data format (CML for storing chemical data)
<?xml version="1.0" ?>
<molecule id="m1">
<atomArray>
<atom id="a1" elementType="C"
x2="-3.0333333015441895" y2="2.9166667461395264" />
</atomArray>
<bondArray>
</bondArray>
</molecule>
Figure: Methane (structural formula)
BioPax format – used for representing pathway data (data exchange format)
SBML format – representing models of biochemical reaction networks
SDF format – general purpose chemical structure format (small molecules)
RDF format (MDL RDfile) – format for storing chemical reactions (small molecules)
PDB format – general purpose chemical structure format (proteins)
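Because CML is plain XML, the molecule snippet on this slide can be read with any XML parser. A minimal sketch using the Python standard library is shown here. It only handles the attributes that appear in the snippet (id, elementType, x2, y2) and is not a full CML reader.

```python
import xml.etree.ElementTree as ET

# The CML snippet from the slide (a single carbon atom, empty bond list).
cml = """<?xml version="1.0" ?>
<molecule id="m1">
  <atomArray>
    <atom id="a1" elementType="C"
          x2="-3.0333333015441895" y2="2.9166667461395264" />
  </atomArray>
  <bondArray>
  </bondArray>
</molecule>"""

root = ET.fromstring(cml)

# Collect (atom id, element, x, y) for every <atom> element.
atoms = [
    (a.get("id"), a.get("elementType"), float(a.get("x2")), float(a.get("y2")))
    for a in root.iter("atom")
]

# The snippet declares a bondArray but lists no bonds.
bonds = list(root.iter("bond"))

print(root.get("id"), atoms, len(bonds))
```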
4
SBML (Systems Biology Markup Language)
Source: Akira Funahashi – Cell Designer Tutorial
• List of supported SBML programs (more than 200) from sbml.org
• List of curated and published SBML models (around 200) from biomodels DB
5
APIs, Mashups, SQL
• Application programming interfaces (APIs) are important to connect and automate
data exchange between local programs and databases;
Example: NCBI SOAP or PubChem PUG (Power User Gateway) can be used to
download certain data via the web to another service or to a local program
• Mashups and integration services use new web technology (RDF, the Resource
Description Framework; Yahoo Pipes) to combine data sources and create new
knowledge or enhance usage
• SQL is used for querying and programming databases
Large Database Table
yr    subject    winner
1901  Chemistry  Jacobus H. van 't Hoff
1902  Chemistry  Emil Fischer
1903  Chemistry  Svante Arrhenius
1904  Chemistry  Sir William Ramsay
1905  Chemistry  Adolf von Baeyer
1906  Chemistry  Henri Moissan
1907  Chemistry  Eduard Buchner
1908  Chemistry  Ernest Rutherford
1909  Chemistry  Wilhelm Ostwald
1910  Chemistry  Otto Wallach
1913  …
SQL query
SELECT yr, subject, winner
FROM nobel
WHERE yr = 1909 AND
      subject = 'Chemistry'
Result
yr    subject    winner
1909  Chemistry  Wilhelm Ostwald
Visit the SQL Zoo
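The query on this slide can also be tried locally without a database server. The sketch below rebuilds the first ten rows of the nobel table in an in-memory SQLite database. Using SQLite is an assumption made here for convenience, and the SQL Zoo itself may run a different database engine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nobel (yr INTEGER, subject TEXT, winner TEXT)")

# Rows copied from the table on the slide (Chemistry, 1901 to 1910).
winners = [
    (1901, "Jacobus H. van 't Hoff"),
    (1902, "Emil Fischer"),
    (1903, "Svante Arrhenius"),
    (1904, "Sir William Ramsay"),
    (1905, "Adolf von Baeyer"),
    (1906, "Henri Moissan"),
    (1907, "Eduard Buchner"),
    (1908, "Ernest Rutherford"),
    (1909, "Wilhelm Ostwald"),
    (1910, "Otto Wallach"),
]
conn.executemany(
    "INSERT INTO nobel VALUES (?, ?, ?)",
    [(yr, "Chemistry", name) for yr, name in winners],
)

# Same query as on the slide, with the values passed as parameters.
result = conn.execute(
    "SELECT yr, subject, winner FROM nobel WHERE yr = ? AND subject = ?",
    (1909, "Chemistry"),
).fetchall()

print(result)
```

Note that SQLite compares text case-sensitively by default, so the subject literal has to match the capitalization stored in the table.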
6
Database front-ends (a good one)
Enhanced NCI Database Browser Release 2 (CACTVS DB)
• Small molecule DB with revolutionary web-front-end (2001)
• Multiple input and output (export) methods
• Allows matching of molecule lists against DB (as SMILES, CAS, NCI number)
• Links to other services
• Visualization modes (2D, 3D)
• 20 different molecular output formats (SDF, CML, SMILES)
• Export to other (computational) services
• 30 different query modes
7
Database visualization
• Visualize complex networks; uses plug-in-technology from different sources
• Map your own compound data (proteins, genes, molecules) onto networks
• Perform literature search with enzymes, genes, small molecules
Source: Cytoscape.org
Start Cytoscape via Java Web Start
8
Uber-portals (NCBI ENTREZ)
9
Database and tools integration
• Frameworks
• Portals
• Mashups
Figure: Gaggle
Source: http://gaggle.systemsbiology.org/docs/geese/
Source: WIKIMEDIA
10
Source: WIKIMEDIA
Gaggle
Integration of tools and database services
ListLink
The Gaggle: an open-source software system for integrating bioinformatics software and data sources.
Shannon PT, Reiss DJ, Bonneau R, Baliga NS.
BMC Bioinformatics. 2006 Mar 28;7:176.
Use Gaggle
11
Use or build your own local database
Example: LipidMaps DB with Instant-JChem
• Download the whole LipidMaps DB (10,000 lipids) as SDF file [LINK]
• Use Instant-JChem as data DB, molecule DB, reaction DB [LINK]
• Perform data and molecule queries on your laptop (PC, LINUX, MAC)
(…also works with KEGG/Biometa DB)
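A local database does not require a dedicated tool. As a rough sketch of the same idea, the data fields of an SDF file can be read with plain Python and loaded into SQLite. The two records, the field names ID and NAME, and the table name below are invented for illustration, so the field names in a real download should be checked first. The structure block of each record is skipped, which means no structure search is possible with this sketch.

```python
import sqlite3

# Two made-up SDF records: header lines, counts line, "M  END",
# then data fields, terminated by "$$$$".
SDF_TEXT = """\
demo-1
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
DEMO0001

> <NAME>
example lipid A

$$$$
demo-2
  hypothetical

  0  0  0  0  0  0  0  0  0  0999 V2000
M  END
> <ID>
DEMO0002

> <NAME>
example lipid B

$$$$
"""


def parse_sdf_fields(text):
    """Yield one dict of data fields per SDF record."""
    for record in text.split("$$$$"):
        fields, tag = {}, None
        for line in record.strip("\n").splitlines():
            if line.startswith(">") and "<" in line:
                # Data header such as "> <NAME>": remember the field name.
                tag = line[line.index("<") + 1 : line.rindex(">")]
                fields[tag] = []
            elif tag is not None:
                if line.strip() == "":
                    tag = None  # a blank line ends the current field
                else:
                    fields[tag].append(line)
        if fields:
            yield {key: "\n".join(value) for key, value in fields.items()}


records = list(parse_sdf_fields(SDF_TEXT))

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lipid (id TEXT PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO lipid VALUES (:ID, :NAME)", records)

hits = conn.execute(
    "SELECT id FROM lipid WHERE name LIKE ?", ("%lipid B",)
).fetchall()
print(records, hits)
```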
12
Welcome to the (database) jungle!
ChemBioGrid – collection of most chemistry databases
current number ~ 156
Pathguide.org – collection of pathway, enzyme, metabolite DBs
current number ~ 231
Chemistry related (big players):
PubChem, CAS (subscription), Beilstein (subscription), ChemSpider (fast growing)
Important for chemistry/metabolomics:
Spectral databases (NMR, mass spectral databases), compound property DBs
Pathway, Enzyme related:
KEGG, Brenda, Reactome, Expasy, MetaCyc
13
Pathguide.org
Pathguide is a meta-database: a comprehensive collection of pathway,
small molecule, enzyme, and protein interaction databases
14
Enzyme and kinetics related databases
KDBI - Kinetic Data of Bio-molecular Interactions database
http://bidd.nus.edu.sg/group/kdbi/
SABIO-RK - SABIO-Reaction Kinetics Database
http://sabio.villa-bosch.de/SABIORK/
BRENDA - Comprehensive Enzyme Information System
http://www.brenda.uni-koeln.de/
EMP - Enzymes and Metabolic Pathways Database
http://www.empproject.com/
ENZYME - Enzyme nomenclature database (EXPASY)
http://www.expasy.ch/enzyme/
IntEnz - Integrated relational Enzyme database
http://www.ebi.ac.uk/intenz/index.html
TECR - Thermodynamics of Enzyme-Catalyzed Reaction
http://xpdb.nist.gov/enzyme_thermodynamics/
REBASE - Restriction Enzyme Database
http://rebase.neb.com/
Precise - Predicted and Consensus Interaction Sites in Enzymes
http://precise.bu.edu/
Source: Pathguide; Own search
15
PubChem
• Most important small molecule DB
• There was no large open chemistry DB until 10 years ago (!)
• All records can be downloaded via FTP
• All other small molecule DBs link to PubChem
• PubChem Compounds (true chemicals)
• PubChem Substances (formulations, mixtures)
• Substructure search and multiple other options
Go to PubChem
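Besides FTP bulk download, single PubChem records can be requested from a program. The sketch below assumes the PUG REST web interface, which is newer than this lecture, and only builds and prints the request URL. The helper names property_url and fetch_properties are made up for illustration, and the actual network call is defined but not executed here.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def property_url(name, properties=("MolecularFormula", "TPSA")):
    """Build a PUG REST URL that looks up compound properties by name."""
    return (
        f"{PUG_REST}/compound/name/{quote(name)}"
        f"/property/{','.join(properties)}/JSON"
    )


def fetch_properties(name):
    """Send the request and return the list of property records.

    Needs network access; not called in this sketch.
    """
    with urlopen(property_url(name), timeout=30) as response:
        return json.load(response)["PropertyTable"]["Properties"]


url = property_url("adenosine triphosphate")
print(url)
```

Calling fetch_properties with a compound name should return one dictionary per matching compound, including its CID. For many compounds, the FTP download remains the better choice.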
16
CAS SciFinder
• 33 million molecules and 60 million peptides/proteins
• Largest reaction DB (14 million reactions) and literature DB
• A must for chemists and biochemists/biologists
• no bulk download, no good import/export, no link-outs
• only proprietary Windows interface (no plugins)
• no text mining (requires ANAVIST)
Download Scifinder
17
BRENDA - Comprehensive Enzyme Information System
18
Brenda 3D model output with Jmol
Example: Brenda connection to RCSB Protein Data Bank
Visit Brenda
19
KEGG – Pathway DB
KEGG ID: C00002 (ATP)
KEGG pathway map ID: map00195 (Photosynthesis)
KEGG reaction ID: R05668 (ATP + NAD reaction)
Visit KEGG
20
Reactome – curated pathway maps
Example: Skypainter, map your given KEGG IDs to pathways
Visit Reactome
21
Outlook for the database lesson
• Curation, Curation, Curation (costs money)
• Internalize the good DB and bad DB scheme and apply it when you enter a DB portal
• Learn some basic database programming (Ruby on Rails, Java, SQL);
using bioinformatics and chemoinformatics approaches is crucial for research
• Learn how to import, store and handle database search results on your local
computer (simple: parse important data with regular expressions)
• Don’t be overwhelmed by the database jungle; take some time to play around.
Finally, automation and clever use of DB tools will innovate your research
• The multiple unique identifier problem (KEGG ID, PubChem ID, CAS number)
and the biological naming problem still exist
• The systems biology and chemistry database worlds still differ in terms
of re-use. Most of the chemistry data published (including molecules) is not
machine readable, hence can’t be automatically harvested by software robots.
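As a small illustration of parsing search results with regular expressions, the sketch below pulls KEGG and CAS identifiers out of free text. The patterns are simplified assumptions based on the ID shapes shown in this lecture (C, R or map followed by five digits), and the sample sentence is made up. The CAS check-digit test is one example of curation by machine.

```python
import re

# Simplified patterns; real identifier schemes have more variants.
PATTERNS = {
    "KEGG compound": re.compile(r"\bC\d{5}\b"),
    "KEGG reaction": re.compile(r"\bR\d{5}\b"),
    "KEGG pathway map": re.compile(r"\bmap\d{5}\b"),
    "CAS number": re.compile(r"\b\d{2,7}-\d{2}-\d\b"),
}


def extract_ids(text):
    """Return the unique identifiers of each kind found in the text."""
    return {
        kind: sorted(set(pattern.findall(text)))
        for kind, pattern in PATTERNS.items()
    }


def cas_checksum_ok(cas):
    """Validate the final check digit of a CAS number."""
    digits = cas.replace("-", "")
    body, check = digits[:-1], int(digits[-1])
    # Weight the digits 1, 2, 3, ... starting from the right of the body.
    total = sum(int(d) * w for w, d in enumerate(reversed(body), start=1))
    return total % 10 == check


sample = "ATP is C00002 (CAS 56-65-5); see map00195 and reaction R05668."
found = extract_ids(sample)
print(found)
print(cas_checksum_ok("56-65-5"))
```

Extracted identifiers of different kinds still have to be mapped onto each other, which is the multiple identifier problem named above.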
22
Reading List databases
The Gaggle: An open-source software system for integrating bioinformatics software and data sources
Correcting ligands, metabolites, and pathways
Large-Scale Annotation of Small-Molecule Libraries Using Public Databases
23
Homework for homework discussion III (30 min)
1) Find three bad or evil databases in the biochemistry/chemistry world;
please give a reason for each in a short sentence.
2) Find the year in which most papers about "enzyme kinetics" were published using
SciFinder (use Explore, enter search term, then Analyze year)
3) Find the molecules which were analyzed most in
papers regarding "enzyme kinetics" and "crickets"
using SciFinder (use Explore, then Analyze CAS Number)
4) Find the price for 1 g ATP from Pfaltz & Bauer
(in SciFinder use locate substance, then use the Erlenmeyer icon for price info)
5) Go to Brenda and find out how many coronavirus types are in the DB
(use TaxExplorer and query)
6) Go to Brenda and find out how many enzymes are listed as resistant against
perchloric acid; report the publication title (go to Brenda, Advanced search)
7) Go to KEGG Ligand DB and find the KEGG numbers for D-Hexose and ATP
8) Go to KEGG Reaction Prediction (e-zyme): How many similar reactions occur between
D-Hexose and ATP? (Enter the above KEGG IDs, press view structures, press compute)
9) Go to PubChem: What is the PubChem compound ID (CID) and the topological polar
surface area for Tobias acid?
Source: MS Office
24
Pathways and enzymes
http://www.biocarta.com/pathfiles/h_etcPathway.asp#
SQL learning
http://sqlzoo.net/
Databases
http://www.google.com/search?hl=en&q=enzyme+kinetics+database&btnG=Google+Search
SQL biologists
I’m a biologist Jim, not a programmer
SQL biologists
SciView part 5: interview with Alexei Drummond
Thank you!
Thanks to all Wikimedia.org contributors for pictures!
Thanks to Dinesh Kumar (FiehnLab) for discussions.
25