Developing a Guiding System
of Biological Database.
CSE591 Fall 2003
(AP of AI CMB)
By Hiro Takahashi.
Motivation
• Too many Bio-Information databases are
available on the Internet (genes, proteins,
enzymes, nomenclatures, taxonomy etc. etc.).
• You have to know where you can find the
information you need.
• You have to know how to search each individual
database and how to read its results (and how
to perform a further search based on those results).
Databases are growing every day
Everyone has a difficult time keeping up with the flow of
new information. This is particularly true in biology now as
the pace of discovery accelerates. Databases have
become an essential tool for accumulating and archiving
raw data. They also play a major role in analyzing and
presenting information to researchers and the public in an
easily accessible form.
By Paul G. Young / NCBI
Goal
• To create a more powerful query
mechanism and a common interface to
get the best answer using existing
multiple databases.
Famous bioinformatics databases
• NCBI: National Center for Biotechnology Information.
- LITERATURE DATABASES
PubMed, PubMed Central, Bookshelf, OMIM and PROW.
- ENTREZ DATABASES
Protein sequence database, Nucleotide sequence
database, Genomes, Structure, Taxonomy, Population
study data sets, Books, ProbeSet, 3D Domains, UniSTS,
SNP, CDD, Journals and UniGene.
- NUCLEOTIDE DATABASES
GenBank, EST database, GSS database, HomoloGene,
HTG database, SNPs database, RefSeq and STS
database
NCBI cont.
- GENOME-SPECIFIC RESOURCES
Bacteria, Eukaryotic Organelles, Fruit fly, Human,
Malaria, Mouse, Nematode, Plant Genomes, Plasmids,
Rat, Retroviruses, Viroids, Yeast, Zebrafish
- TOOLS for DATA MINING
Entrez, LinkOut, Cubby, Citation Matcher – text term.
BLAST, BLink – Sequence Similarity.
Taxonomy Browser, TaxTable, ProtTable, TaxPlot – Taxonomy.
- TOOLS for Sequence Analysis
COGs, COGnitor, GEO, HomoloGene, CDD, LocusLink,
MGC, Clone Registry, Trace Archive, ORF Finder,
VecScreen, e-PCR
NCBI cont.
- Tools for 3D Structure Display and Similarity Searching
CD-Search, Cn3D, Domain Architecture Retrieval Tool,
VAST Search, Threading
- MAPS
Map Viewer, Arabidopsis Map, Fruit Fly Map,
GeneMap ’99, Human Map, Human-Mouse Homology
Maps, Malaria Map, Model Maker, Mosquito Map, Mouse
Map, Nematode Map, OMIM Gene Map, OMIM Morbid
Map, Rat Map, Zebrafish Map.
- COLLABORATIVE CANCER RESEARCH
- FTP Download
- Statistics
Recommended Tutorial of NCBI
http://bcs.whfreeman.com/mga2e/bioinformatics/ch01/bridging_page.htm
Famous Databases cont.
• EMBL-EBI
European Bioinformatics Institute
http://www.ebi.ac.uk
• GenomeNet
Bioinformatics Center / Institute for
Chemical Research Kyoto University
http://www.genome.ad.jp/
List of databases
• Database for Metabolic Pathways
• Database for Enzymes, Compounds and Reactions
• Database for Regulatory Pathways
• Database for Protein-Protein Interactions
• Database for Transcription Factors
• Database for Gene Expression Pattern
• Database for Nomenclature (General)
• Database for Nomenclature (Organism - specific)
• Database for Nomenclature (Protein – specific)
• Database for Taxonomy
• Database for Complete Genomes and Analysis
Database for Metabolic Pathways
KEGG Metabolic Pathways:
http://www.genome.ad.jp/kegg/metabolism.html
EMP - Enzymes and Metabolic Pathways:
http://emp.mcs.anl.gov/
WIT - Metabolic Reconstruction:
http://wit.mcs.anl.gov/WIT2/
UM-BBD - Microbial Biocatalysis/Biodegradation:
http://umbbd.ahc.umn.edu/
EcoCyc - E. coli Genes and Metabolism:
http://www.ecocyc.org/
Metalgen - Genes and Metabolism:
http://indigo.genetique.uvsq.fr/
Boehringer Mannheim - Biochemical Pathways:
http://www.expasy.org/cgi-bin/search-biochem-index
IUBMB-Nicholson Minimaps:
http://www.tcd.ie/Biochemistry/IUBMB-Nicholson/
Database for Enzymes, Compounds and Reactions
LIGAND - Biochemical Compounds and Reactions:
http://www.genome.ad.jp/ligand/
ENZYME – Enzymes:
http://www.expasy.ch/enzyme/
BRENDA - Comprehensive Enzyme Information System:
http://www.brenda.uni-koeln.de/
Worthington Enzyme Manual:
http://www.worthington-biochem.com/index/manual.html
Klotho - Biochemical Compounds:
http://www.biocheminfo.org/klotho/
ChemFinder - Searching Chemicals:
http://chemfinder.camsoft.com/
ChemIDplus at NLM:
http://chem.sis.nlm.nih.gov/chemidplus/
PROMISE - Prosthetic Groups and Metal Ions:
http://metallo.scripps.edu/PROMISE/
GlycoSuiteDB - Glycan Structure Database:
http://www.glycosuite.com/
CarbBank - Complex Carbohydrate Structure Database:
http://bssv01.lancs.ac.uk/gig/pages/gag/carbbank.htm
LIPIDBANK for Web – Lipids:
http://lipidbank.jp/
WebElements - Periodic Table:
http://www.webelements.com/
Database for Regulatory Pathways
KEGG Regulatory Pathways:
http://www.genome.ad.jp/kegg/regulation.html
SPAD - Signal Transduction:
http://www.grt.kyushu-u.ac.jp/spad/
Yeast Pathways in MIPS:
http://www.mips.biochem.mpg.de/proj/yeast/pathways/index.html
Interactive Fly - Drosophila Genes:
http://sdb.bio.purdue.edu/fly/aimain/1aahome.htm
GeNet - Gene Networks Database:
http://www.csa.ru/Inst/gorb_dep/inbios/genet/genet.htm
HOX-Pro - Homeobox Genes Database:
http://www.iephb.nw.ru/labs/lab38/spirov/hox_pro/hox-pro00.html
Wnt Signaling Pathway:
http://www.stanford.edu/~rnusse/wntwindow.html
TRANSPATH - Gene Regulatory Pathways:
http://transpath.gbf.de/
Database for Protein-Protein Interactions
BRITE Database for Biomolecular Relations:
http://www.genome.ad.jp/brite/
DIP - Database of Interacting Proteins:
http://dip.doe-mbi.ucla.edu/
BIND - Biomolecular Interaction Network Database:
http://www.binddb.org/
Database for Transcription Factors
TRANSFAC - Transcription Factor Database:
http://transfac.gbf.de/TRANSFAC/index.html
RegulonDB - E. coli Transcriptional Regulation:
http://www.cifn.unam.mx/Computational_Genomics/regulondb/
DBTBS - B. subtilis Transcription Factors:
http://elmo.ims.u-tokyo.ac.jp/dbtbs/
SCPD - S. cerevisiae Promoter Database:
http://cgsigma.cshl.org/jian/
DPInteract - DNA binding proteins:
http://arep.med.harvard.edu/dpinteract/
Database for Gene Expression Pattern
Axeldb - Xenopus laevis:
http://www.dkfz-heidelberg.de/abt0135/axeldb.htm
NEXTDB - Caenorhabditis elegans:
http://nematode.lab.nig.ac.jp/
MAGEST - Halocynthia roretzi:
http://www.genome.ad.jp/magest/
Database for Nomenclature (General)
IUBMB Nomenclature:
http://www.chem.qmul.ac.uk/iubmb/
IUPAC Nomenclature:
http://www.chem.qmul.ac.uk/iupac/
IUPHAR Receptor Nomenclature:
http://www.iuphar-db.org/iuphar-rd/
SWISS-PROT Documents:
http://www.expasy.ch/sprot/sp-docu.html
Gene Ontology:
http://www.geneontology.org/
Database for Nomenclature (Organism - specific)
Human (HUGO):
http://www.gene.ucl.ac.uk/nomenclature/
Mouse (MGD):
http://www.informatics.jax.org/mgihome/nomen/
Rat (RATMAP):
http://rgnc.gen.gu.se/RGNChem.html
D. melanogaster (FlyBase):
http://flybase.bio.indiana.edu/docs/nomenclature/lk/nomenclature.html
C. elegans:
http://elegans.swmed.edu/Genome/Nomencl2001w.htm
Plants (Mendel):
http://www.mendel.ac.uk/
S. cerevisiae (SGD):
http://genome-www.stanford.edu/Saccharomyces/registry.html
Database for Nomenclature (Protein – specific)
Alcohol dehydrogenase:
http://www.gene.ucl.ac.uk/nomenclature/genefamily/ADH-2.shtml
Protein kinases (PKC):
http://pkr.sdsc.edu/html/index.shtml
Phosphodiesterases:
http://depts.washington.edu/pde/Nomenclature.html
Glycosyl hydrolases (CAZy) / (ExPASy):
http://afmb.cnrs-mrs.fr/CAZY/
Aminoacyl-tRNA synthetases (AARSDB):
http://rose.man.poznan.pl/aars/index.html
Cytochrome P450:
http://drnelson.utmem.edu/CytochromeP450.html
Metallothionein / (ExPASy):
http://www.unizh.ch/~mtpage/classif.html
CD Molecules (PROW) / (ExPASy):
http://www.ncbi.nlm.nih.gov/PROW/
Immunoglobulins and T-cell receptors (IMGT):
http://imgt.cines.fr/
Cytokines (dbCFC):
http://cytokine.medic.kumamoto-u.ac.jp/
Database for Nomenclature (Protein – specific cont.)
Transport proteins:
http://www-biology.ucsd.edu/~msaier/transport/
G protein coupled receptors (GPCRDB):
http://www.gpcr.org/7tm/
Olfactory receptors (ORDB):
http://senselab.med.yale.edu/senselab/ORDB/
Ephrins and Eph receptors:
http://cbweb.med.harvard.edu/eph-nomenclature/
Nuclear receptors:
http://www.ens-lyon.fr/LBMC/laudet/nomenc.html
Nuclear receptors (NRR):
http://nrr.georgetown.edu/NRR/NRR.html
Mitochondrial proteins:
http://mips.gsf.de/proj/medgen/mitop/
Ribosomal proteins (ExPASy):
http://www.expasy.ch/cgi-bin/lists?ribosomp.txt
Homeobox proteins / (ExPASy):
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?cmd=Retrieve&db=PubMed&list_uids=8098522&dopt=Abstract
Database for Taxonomy
NCBI Taxonomy:
http://www.ncbi.nlm.nih.gov/Taxonomy/
Tree of Life:
http://tolweb.org/tree/phylogeny.html
UCMP Phylogeny Exhibit:
http://www.ucmp.berkeley.edu/exhibit/phylogeny.html
Ribosomal Database Project II:
http://rdp.cme.msu.edu/html/
Database for Complete Genomes and Analysis
NCBI Complete Microbial Genomes:
http://www.ncbi.nlm.nih.gov/PMGifs/Genomes/micr.html
TIGR Comprehensive Microbial Resource:
http://www.tigr.org/tigr-scripts/CMR2/CMRHomePage.spl
GIB - Genome Information Broker:
http://gib.genes.nig.ac.jp/
PIR Complete Genomes:
http://www-nbrf.georgetown.edu/pir/genome.html
PEDANT:
http://pedant.mips.biochem.mpg.de/
GOLD:
http://ergo.integratedgenomics.com/GOLD/
MBGD - Comparative Microbial Genome Database:
http://mbgd.genome.ad.jp/
COG - Clusters of Orthologous Groups:
http://www.ncbi.nlm.nih.gov/COG/
Example Queries that require searching
multiple databases
• Homology Search
• Prediction of protein secondary structure.
Homology Search Example
• A gene called BCL2 (related to the
apoptosis process).
• Using its DNA sequence, find similar
genes and determine their evolutionary and
functional relationships.
Process of Homology search
1. Find the amino acid sequence of BCL2.
- Go to EMBL-EBI/SwissProt and search for the sequence
of BCL2.
- Select the BCL2/Human gene from the results.
- Copy the sequence to the clipboard.
2. Use FASTA or BLAST to find similar
genes by sequence matching.
- Go to a site that provides FASTA/BLAST and paste
the sequence copied above.
- Perform the search and go through the results.
Review: FASTA and BLAST
• Both are sequence-matching algorithms.
FASTA: Focuses on a global matching view.
Less sensitive to gaps.
Can be used with very short sequences.
BLAST: Focuses more on a local matching view.
More sensitive to gaps.
Can’t be used with very short sequences.
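The global versus local distinction drawn above can be illustrated with textbook dynamic programming. The sketch below is only an illustration of the two scoring views: it is not how FASTA or BLAST work internally (both are heuristic search tools), and the function names and scoring values (match = 1, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch style score: both sequences aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cur.append(max(diag, prev[j] + gap, cur[j - 1] + gap))
        prev = cur
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Smith-Waterman style score: best-scoring pair of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below 0, so a poor region cannot
            # drag down a good local match that starts after it.
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best
```

Two sequences that share only a short common segment score well locally but poorly globally, which is why a local view suits database searches for related regions.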
Prediction of protein secondary
structure
• Pick two proteins (one water-soluble
protein and one membrane protein) and
compare predicted and actual secondary
structure.
Process of the query
1. Pick two proteins using PDB
(http://www.rcsb.org/pdb/index.html)
2. Use nnpredict
(http://www.cmpharm.ucsf.edu/~nomi/nnpredict.html)
to get prediction of water-soluble protein’s secondary
structure.
3. Use SOSUI (http://sosui.proteome.bio.tuat.ac.jp) to get
prediction of membrane protein’s secondary structure.
4. Use PDBsum
(http://www.biochem.ucl.ac.uk/bsm/pdbsum/) to get
actual secondary structures of each and compare the
result.
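The final comparison step can be sketched as a per-residue agreement count. This is a minimal sketch under two assumptions not stated in the slides: that the observed structure has already been written in the same one-letter alphabet as the prediction (H = helix, E = strand), and that positions where the predictor gave no call ("-") are skipped. The function name is hypothetical.

```python
def agreement(predicted, observed):
    """Fraction of predicted residues whose state matches the observed one.

    Positions where the predictor abstained ('-') are not counted.
    """
    if len(predicted) != len(observed):
        raise ValueError("strings must have the same length")
    scored = [(p, o) for p, o in zip(predicted, observed) if p != "-"]
    if not scored:
        return 0.0
    return sum(p == o for p, o in scored) / len(scored)


# Made-up six-residue example: three of the four called positions agree.
print(agreement("HHEE--", "HHEHCC"))
```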
Result of nnpredict
Tertiary structure class: none
Sequence:
MSQSNRELVVDFLSYKLSQKGYSWSQFSDVEENRTEAPEETEPERETPSAINGNPSWHLA
DSPAVNGATGHSSSLDAREVIPMAAVKQALREAGDEFELRYRRAFSDLTSQLHITPGTAY
QSFEQVVNELFRDGVNWGRIVAFFSFGGALCVESVDKEMQVLVSRIASWMATYLNDHLEP
WIQENGGWDTFVDLYGNNAAAESRKGQERFNRWFLTGMTVAGVVLLGSLFSRK
Secondary structure prediction (H = helix, E = strand, - = no prediction):
------HEEEHHHHHH------EE-----------------------------------------------------HH--HHHHHHHHHHH--HHHHHHHH---HHH--EEE-------HHHHHHHHHHH-----EEEEEE-----EEE----HHHHHHHHHHHHHHHHH-------H-------EEEEH-----HHHHH---HHHHHHHHH---HEEEEEE--H----
Result of PDBsum
One Existing Multiple Database Search Project
DBGET - Integrated database retrieval system
(A part of GenomeNet – from Kyoto University)
http://www.genome.ad.jp/dbget/
About DBGET
• DBGET is a simple database retrieval system for a diverse range
of molecular biology databases.
• Most of the existing molecular biology databases can be treated in
this simplified manner, or as so-called flat-file databases.
• Because each entry of a database is given a unique identifier, i.e.,
an entry name or an accession number, the molecular biology
databases in the world can be retrieved uniformly by specifying the
combination of the database name and the identifier.
dbname:identifier
• The KEGG gene catalogs are also considered as flat-file
databases where the combination of the organism name and the
gene name:
organism:gene
is used for identification.
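The dbname:identifier convention is easy to handle in code. The following is a small sketch of a parser for such references; the names EntryRef and parse_entry are hypothetical and not part of DBGET.

```python
from typing import NamedTuple


class EntryRef(NamedTuple):
    dbname: str       # database name, or organism name for KEGG genes
    identifier: str   # entry name, accession number, or gene name


def parse_entry(ref: str) -> EntryRef:
    """Split 'dbname:identifier' at the first colon."""
    dbname, sep, identifier = ref.partition(":")
    if not sep or not dbname or not identifier:
        raise ValueError(f"expected 'dbname:identifier', got {ref!r}")
    return EntryRef(dbname, identifier)


print(parse_entry("ec:5.3.1.1"))
```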
LinkDB
• LinkDB is a database of links, each of which is represented
as a binary relation in the form of:
dbname1:identifier1 --> dbname2:identifier2
• LinkDB contains all cross-reference links, called original links,
extracted from all the databases in DBGET. Furthermore,
LinkDB dynamically generates additional links by computation,
i.e., by combining multiple links and/or using links in reverse
directions. Thus, LinkDB is a deductive database and the links
in LinkDB are of the following three types:
original links represented by: -->
reverse links represented by: <--
indirect links consisting of multiple links
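The way reverse and indirect links can be derived from original links may be sketched as a graph search. This is an illustration of the idea only, not LinkDB's actual implementation; the function names are hypothetical and the entries dbA:1, dbB:2 and dbC:3 are made up.

```python
from collections import defaultdict, deque


def build_graph(original_links, use_reverse=True):
    """Store original links and, optionally, the same links reversed."""
    graph = defaultdict(set)
    for src, dst in original_links:
        graph[src].add(dst)
        if use_reverse:
            graph[dst].add(src)
    return graph


def find_path(graph, start, target_db):
    """Breadth-first search for the shortest chain of links from
    `start` to any entry of `target_db` (an indirect link)."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node != start and node.split(":", 1)[0] == target_db:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


links = [("dbA:1", "dbB:2"), ("dbC:3", "dbB:2")]
print(find_path(build_graph(links), "dbA:1", "dbC"))
```

In this toy data, dbA:1 reaches dbC:3 only by following one original link and then one reverse link, so the path disappears when reverse links are switched off.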
Structure of DBGET / LinkDB
DBGET access
• DBGET has three basic commands (or three
basic modes in the Web version), bfind, bget, and
blink.
bget : performs the retrieval of database entries
specified by the combination of dbname:identifier.
bfind : used for searching entries by keywords.
blink : LinkDB search, can be used to retrieve
related entries in a given database or all
databases in GenomeNet.
Databases support
Databases support cont.
How to get information using bfind/bget
Command Version
bfind [option] dbname expression
bget [option] dbname identifier [identifier1...]
bget [option] dbname1:identifier1 [dbname2:identifier2...]
URL version
Retrieve a single entry:
http://www.genome.ad.jp/dbget-bin/www_bget?dbname+identifier
http://www.genome.ad.jp/dbget-bin/www_bget?dbname:identifier
dbname = Database name or organism name
identifier = Entry name (accession number) or gene name
Retrieve multiple entries:
http://www.genome.ad.jp/dbget-bin/www_bget?dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_bget?dbname1:identifier1+dbname2:identifier2+...
The first form is applicable only to multiple entries from a single database.
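The two URL forms above can be assembled with a small helper. This sketch only builds the URL strings as written in the slides; it does no escaping and does not contact the server, and the function name bget_url is hypothetical.

```python
BASE = "http://www.genome.ad.jp/dbget-bin/"


def bget_url(*entries, dbname=None):
    """Build a www_bget URL.

    With dbname given, entries are bare identifiers from that one
    database (first form). Without it, every entry must already be
    written as 'dbname:identifier' (second form).
    """
    if not entries:
        raise ValueError("at least one entry is required")
    if dbname is not None:
        parts = [dbname, *entries]
    else:
        for entry in entries:
            if ":" not in entry:
                raise ValueError(f"expected 'dbname:identifier', got {entry!r}")
        parts = list(entries)
    return BASE + "www_bget?" + "+".join(parts)


print(bget_url("identifier1", "identifier2", dbname="dbname"))
print(bget_url("dbname1:identifier1", "dbname2:identifier2"))
```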
Cont…
Retrieve sequence entries in FASTA format:
http://www.genome.ad.jp/dbget-bin/www_bget?-f+dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_bget?-f+dbname1:identifier1+dbname2:identifier2+...
When the entry contains multiple sequences, specify as follows:
-f+-n+1 first sequence in FASTA format
-f+-n+2 second sequence in FASTA format
-f+-n+a amino acid sequence in FASTA format (GENES database only)
-f+-n+n nucleotide sequence in FASTA format (GENES database only)
Display title description of entries
http://www.genome.ad.jp/dbget-bin/www_btit?dbname+identifier1+identifier2+...
http://www.genome.ad.jp/dbget-bin/www_btit?dbname1:identifier1+dbname2:identifier2+...
Mark an object in the KEGG pathway
http://www.genome.ad.jp/dbget-bin/show_pathway?mapno+dbname:identifier
mapno = pathway entry accession number, such as map00010 and hsa00010
dbname:identifier = kegg identifier such as ec:5.3.1.1
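The FASTA, title and pathway URL patterns can be generated the same way. This is a sketch that only concatenates strings following the patterns listed above; the function names are hypothetical, the placement of the -n option directly after -f is read from the option list above, and whether the server still accepts these URLs is not checked here.

```python
BASE = "http://www.genome.ad.jp/dbget-bin/"


def fasta_url(entries, which=None):
    """FASTA retrieval; `which` selects a sequence ('1', '2', 'a', 'n')."""
    opts = ["-f"]
    if which is not None:
        opts += ["-n", str(which)]
    return BASE + "www_bget?" + "+".join(opts + list(entries))


def btit_url(entries):
    """Title descriptions of the given 'dbname:identifier' entries."""
    return BASE + "www_btit?" + "+".join(entries)


def pathway_url(mapno, entry):
    """Mark one object, e.g. 'ec:5.3.1.1', in pathway map `mapno`."""
    return BASE + "show_pathway?" + mapno + "+" + entry


print(pathway_url("map00010", "ec:5.3.1.1"))
```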
Databases are updated every second
Conclusion
• There are many bio-databases around the
world.
• DBGET provides a common interface to access
biological databases around the world.
• We can use this DBGET access method to
perform extensive searches online.
• Since it provides a common interface, we can
write an intelligent search procedure over
DBGET using our own knowledge base.
• Implement an easy/intelligent access method on
top of DBGET/LinkDB – I will call it “BioWizard.”
Reference
• Akiyama, Y., Goto, S., Uchiyama, I., and Kanehisa, M.; WebDBGET: an
integrated database retrieval system which provides hyper-links among related
entries. MIMBD'95: Second Meeting on the Interconnection of Molecular
Biology Databases (1995).
• Goto, S., Akiyama, Y., and Kanehisa, M.; LinkDB: a database of cross links
between molecular biology databases. MIMBD'95: Second Meeting on the
Interconnection of Molecular Biology Databases (1995).
• GenomeNet http://www.genome.ad.jp/
• NCBI http://www.ncbi.nih.gov/RefSeq/
• EBI http://www.ebi.ac.uk/embl/index.html