e-Bio System for Bio-Knowledge Discovery
APAN e-Science Workshop
e-Bio System for BioKnowledge Discovery
2003.8.27
Sangsoo Kim
National Genome Information Center
Korea Research Institute of Bioscience and Biotechnology
Bio-Databases & Servers
• Contents
– Bibliographic (Journal abstracts such as
Medline)
– Experimental data (Sequences or structures)
– Results from annotation and analyses
– Bioinformatic analysis tools
• Purpose
– Storing & managing raw data
– Querying for knowledge discovery
– Sharing information with others
– Serving others with online analysis
New Role of Databases
• New discoveries of biological knowledge are
published in scientific journals
• But journal space is limited and not suited to
publishing large amounts of high-throughput data
• The supplementary information is provided on an
accompanying website
• Readers can download the supplementary
information and analyze it from different perspectives
• Combining it with other information may yield
unexpected results
• Journal publishers require supplementary
information to be deposited in public archives
Example - Nucleotide
Sequence Repositories
• Nucleotide sequences discovered by sequencing
experiments are deposited in one of the
public archives, and the journal paper lists only the
accession numbers (without deposition, you
cannot publish a sequence discovery in journals)
• Public archives are
– DDBJ operated by CIB, NIG in Japan
– EMBL operated by EMBL-EBI in UK
– GenBank operated by NCBI, NIH in USA
• The contents of these archives are exchanged
daily and are freely accessible to everybody
• Now extended to archive DNA chip data as well
Growth of GenBank
A Nucleotide Sequence Repository
Human Genome Project
Entrez: Home Page
RTFM
Entrez: Display
GenBank as HTML
FASTA as HTML
Example – BLAST Servers
• Originally developed to compare my sequence to
those in the repository in order to check whether
mine is novel or not
• Extended to detect distantly related sequences,
serving as the major sequence annotation tool
• Servers accept various kinds of queries and return
alignment results over WWW
• The most widely used bioinformatic tool
• For the analysis of many sequences, it is better to use
a local installation
BLAST (Basic Local Alignment Search Tool)
http://www.ncbi.nlm.nih.gov/BLAST
RTFM
program   query      database
blastn    dna        dna
blastp    protein    protein
blastx    dna (6x)   protein
tblastn   protein    dna (6x)
tblastx   dna (6x)   dna (6x)
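The "dna (6x)" entries can be read as DNA translated in all six reading frames (three per strand) before a protein-level comparison. The following is a minimal sketch of that translation step only, assuming the standard genetic code and plain A/C/G/T input. The function names are illustrative and are not part of BLAST itself.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna: str) -> str:
    """Translate complete codons only; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X")
        for i in range(0, len(dna) - 2, 3)
    )


def six_frame_translate(dna: str) -> dict:
    """Translate both strands at offsets 0, 1, 2 (frames +1..+3, -1..-3)."""
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translate("ATGGCC").items():
        print(frame, protein)
```

For blastx only the query would be translated this way; for tblastn only the database; for tblastx both.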
BLASTN (Cont'd)
Descriptions
Alignments
Example – Derived Databases
• Swiss-Prot & PIR
– Proteins are predicted from deposited
nucleotide sequences, either mRNA or
genomic DNA
– Functions and features of the proteins are
annotated manually by experts
• Protein motifs
– Prosite, Pfam, BLOCKS, InterPro
– Keyword querying and motif detection in a user’s
sequence
• Gene Ontology
– Hierarchical organization of biological terms
– Cataloging associated gene products
ExPASy (http://www.expasy.ch)
Expert Protein Analysis System
NiceProt View
Gene Ontology
• Systematic classification of
biological terminology
– Molecular function
– Biological process
– Cellular component
• Controlled vocabulary
• Associated GENE list
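One way to picture a hierarchical controlled vocabulary with associated gene lists is the toy sketch below. The three-term hierarchy is deliberately simplified, and the gene names (geneA, geneB) are hypothetical placeholders rather than real annotations. The point is only that a query on a general term can also return genes annotated to its more specific descendants.

```python
# Toy hierarchy: child term -> list of parent terms (simplified, illustrative).
PARENTS = {
    "kinase activity": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}

# Hypothetical gene-to-term associations.
ANNOTATIONS = {
    "geneA": ["kinase activity"],
    "geneB": ["catalytic activity"],
}


def ancestors(term: str) -> set:
    """Return the term itself plus every term reachable via parent links."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def genes_for(term: str) -> set:
    """Genes annotated to the term or to any of its descendants."""
    return {
        gene
        for gene, terms in ANNOTATIONS.items()
        if any(term in ancestors(t) for t in terms)
    }


if __name__ == "__main__":
    print(sorted(genes_for("catalytic activity")))
    print(sorted(genes_for("kinase activity")))
```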
Data Mining
• Objective:
– Discovery of (biological) knowledge by
querying information in the databases and
comprehending it
• Problems:
– Too many databases
– Different protocols for access
– Lack of standards
– Poor quality or propagation of errors
• Solutions:
– Data warehousing or federated databases
Catalog of Bio-DBs
arranged by Data Domain
Database of Databases
• Data warehousing
– Collect all databases by mirroring
– Store in a unified format
– Entrez (NCBI) or SRS (EBI)
– Powerful but heavy maintenance load
• Federated databases
– Maintained by participating members
– Accessed by common protocols
– Bio-DAS or Web Services via SOAP/XML
– Next generation technology, but dependent on
both the cooperation by members and Internet
bandwidth
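The contrast between the two approaches can be sketched in a few lines. This is an illustrative model only: the in-memory sources stand in for remote member databases, and the class names, method names, and accession strings (ACC0001, ACC0002) are all hypothetical. A warehouse copies everything into one local store up front, while a federation leaves the data with its members and asks each one through a shared interface at query time.

```python
from typing import Dict, List, Optional, Protocol, Tuple


class SequenceSource(Protocol):
    """The common protocol every federation member is assumed to offer."""
    name: str

    def fetch(self, accession: str) -> Optional[str]: ...


class InMemorySource:
    """Stand-in for a remote member database (hypothetical)."""

    def __init__(self, name: str, records: Dict[str, str]) -> None:
        self.name = name
        self.records = dict(records)

    def fetch(self, accession: str) -> Optional[str]:
        return self.records.get(accession)


class Federation:
    """Queries members on demand; the data stays with each member."""

    def __init__(self, members: List[SequenceSource]) -> None:
        self.members = members

    def fetch(self, accession: str) -> Optional[Tuple[str, str]]:
        for member in self.members:
            record = member.fetch(accession)
            if record is not None:
                return member.name, record
        return None


def build_warehouse(members: List[InMemorySource]) -> Dict[str, str]:
    """Mirror every member's records into one unified local store."""
    warehouse: Dict[str, str] = {}
    for member in members:
        warehouse.update(member.records)
    return warehouse


if __name__ == "__main__":
    a = InMemorySource("memberA", {"ACC0001": "ACGT"})
    b = InMemorySource("memberB", {"ACC0002": "GGCC"})
    print(Federation([a, b]).fetch("ACC0002"))
    print(build_warehouse([a, b]))
```

In this model the warehouse must be rebuilt whenever a member changes (the maintenance load), while the federation depends on every member being reachable at query time.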
www.ngic.re.kr
www.ncbi.nih.gov/LocusLink
New Data Types
• Textual
– Nucleotide or amino acid sequences
– Associated feature annotation
– Bibliographical texts
• Numeric
– Gene expression profiles
– Results from statistical analysis
• Graphical
– Protein-protein interaction network
– Genetic network
– Biochemical reaction pathways
Building a Nation from a
Land of City States
Lincoln D. Stein
Cold Spring Harbor
Laboratory
Italy in the Middle Ages
Bioinformatics, ca. 2002
Bioinformatics
In the XXI Century
Making Easy Things Hard
Give me all human
sequences submitted to
GenBank/EMBL last week.
Lots of ways to do it
• Download weekly update of
GenBank/EMBL from FTP site
• Use official network-based interfaces
to data:
– NCBI toolkit
– EBI CORBA & XEMBL servers
• Use friendly web interfaces at NCBI,
EBI
Perl/Java/Python to the
Rescue
• One script to do the web fetch
• Another to parse the file format
• A third to move into private database
• A fourth to repeat this weekly
• Result:
– 6,719 scripts that do the same thing
– None of them work together
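As an illustration of the "parse the file format" script, here is a minimal sketch of a FASTA parser. FASTA is chosen only as a simple example format; the function name and the sample records are hypothetical, and real GenBank/EMBL flat files need considerably more handling.

```python
from typing import Iterator, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence lines are concatenated
    if header is not None:
        yield header, "".join(chunks)


if __name__ == "__main__":
    sample = ">seq1 demo\nACGT\nTTGA\n>seq2\nGGCC\n"
    for name, seq in parse_fasta(sample):
        print(name, len(seq))
```

Even a parser this small embeds choices (how to treat blank lines, what counts as the identifier) that differ from one script to the next, which is part of why independently written fetchers and parsers rarely fit together.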
What’s Wrong with This?
• My EMBL fetcher is poorly documented so
you write your own
• Your fetcher won’t work with my parser
• My parser won’t work with your fetcher
• We’ve now wasted 20 hours rather than 10
• Multiply this by 6,719
What Else Is Wrong?
• NCBI/EBI tweaks something
• 6,719 scripts fail at once
• 6,719 bioinformaticists tear their hair
• 21,261 biologists curse the
bioinformaticists
• 6,719 bioinformaticists curse their
own existence
Unifying Bioinformatics
Services
MIMBD: Meetings on the
Interconnection of Molecular
Biology Databases
Federated models: Gaea, Kleisli
Data warehouses: GUS, MODs,
Ensembl, UCSC
Ad hoc web services
Formal web services
Ad hoc services
BioXXX
Conf file
Your Script
Formal Web Services
[Diagram: SeqFetch (two instances), GO, BLAT, BLAST, and Microarray services]
Formal Web Services
[Diagram: SeqFetch (two instances), GO, BLAT, BLAST, and Microarray services, plus a Service Registry]
Formal Web Services
[Diagram: SeqFetch (two instances), GO, BLAT, BLAST, and Microarray services, a Service Registry, BioXXX, and Your Script]
Technical Infrastructure is
Here*
• Common vocabulary: GO
• Transport format: XML
• Data definition language: XSD
• Wire protocol: SOAP
• Service definition language: WSDL
• Service registry: UDDI
*(almost)
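To make the XML transport and SOAP wire protocol items concrete, the sketch below builds and reads back a SOAP 1.1 request using only the Python standard library. The service namespace (urn:example:seqfetch), the operation name fetchSequence, and the accession parameter are hypothetical; a real service would publish these in its WSDL.

```python
import xml.etree.ElementTree as ET

SOAP_NS = "http://schemas.xmlsoap.org/soap/envelope/"  # SOAP 1.1 envelope
SERVICE_NS = "urn:example:seqfetch"                    # hypothetical service


def build_request(accession: str) -> bytes:
    """Wrap a hypothetical fetchSequence call in a SOAP envelope."""
    ET.register_namespace("soap", SOAP_NS)
    envelope = ET.Element(f"{{{SOAP_NS}}}Envelope")
    body = ET.SubElement(envelope, f"{{{SOAP_NS}}}Body")
    call = ET.SubElement(body, f"{{{SERVICE_NS}}}fetchSequence")
    ET.SubElement(call, f"{{{SERVICE_NS}}}accession").text = accession
    return ET.tostring(envelope, encoding="utf-8")


def read_accession(message: bytes) -> str:
    """Extract the accession parameter from a request built above."""
    root = ET.fromstring(message)
    path = (
        f"{{{SOAP_NS}}}Body/"
        f"{{{SERVICE_NS}}}fetchSequence/"
        f"{{{SERVICE_NS}}}accession"
    )
    node = root.find(path)
    if node is None or node.text is None:
        raise ValueError("no accession element in message")
    return node.text


if __name__ == "__main__":
    request = build_request("AC003027")
    print(request.decode("utf-8"))
    print(read_accession(request))
```

Because both sides agree on the envelope structure and namespaces, any client that can produce this XML can call the service, regardless of the language it is written in.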
Distributed Annotation System
http://www.biodas.org
[Diagram: one Reference Server and three Annotation Servers; labels shown: AC003027, M10154, AC005122, WI1029, AFM820, AFM1126, WI443]
Thursday 10:30 AM
Canyon IV
Europe, ca 2000
Bioinformatics, ca 2010?
Collection and Sharing of
National Genome Information
[Diagram labels: KNIH, Human, Microbial, Industry, Proteome, Research Institutes, NGIC, Animal, Plant, Crop, Ag-Bio, Universities]
National Genome
Information Network
[Diagram labels: KNIH, Human, Data Grid, Proteome, Microbial, NGIC, Plant, Animal, Crop, Ag-Bio, KISTI, ETRI, Application Grid]