Resources of biomolecular data - Center for Biological Sequence

Center for Biologisk Sekvensanalyse
“Resources of Biomolecular Data: Sequences, Structures and Functionality”
PhD course #27803
Nikolaj Blom
Center for Biological Sequence Analysis
BioCentrum-DTU
Technical University of Denmark
[email protected]
Outline
Magnitudes and Scales
Resources: Data Sources & Tools
• Primary DNA sources
• Sequence Repositories
• Structure Repositories
• Functional Categorization
• Integration of Databases
• The Human Genome
• Genome Browsers
• Prediction Tools
• Evaluation of Prediction Servers
Starting points
• Link collections
Learning Objectives
The student should be able to:
• Describe differences between sequence
repositories and curated databases
• Describe the challenges of maintaining
genome-wide biological databases
• List two entry points for getting an
overview of ”my gene of interest”
• Describe how prediction servers may be
evaluated
Resources: Sources & Tools
There are A LOT OF
biomolecular
databases/sources
A LOT OF overlap of
information/redundancy
A LOT OF TOOLS
Personal
picks/preferences
• User-friendliness
• Update intervals
• Curation efforts / error
correction
• Linkage to other DBs
Faster than Moore’s law...
Human Genome
Published
Public consortium (International Human Genome Sequencing Consortium): Nature, 15 Feb 2001
Celera: Science, 16 Feb 2001
Magnitudes and Scales
Human genome
3,200,000,000 bp
• Single base pair → full genome: 9 orders of magnitude
Genome = Football field:
~3 billion leaves of grass
Single base A T G C (or
SNP) = 1 leaf of grass
Genome browsing
• Zooming from whole
stadium to single leaf
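The scale comparison above can be checked with a few lines of arithmetic. This is only a minimal sketch: the genome length is the figure quoted on the slide, while the tenfold zoom factor per step and the function names are assumptions chosen for illustration.

```python
import math

GENOME_BP = 3_200_000_000  # human genome length quoted on the slide


def orders_of_magnitude(n: int) -> float:
    """Return log10(n): how many factors of ten separate 1 from n."""
    return math.log10(n)


def zoom_steps(total_bp: int, factor: int = 10) -> int:
    """Count zoom-in steps (each shrinking the view by `factor`) needed to
    go from a window covering the whole genome to a single base."""
    steps = 0
    window = total_bp
    while window > 1:
        window = math.ceil(window / factor)
        steps += 1
    return steps


print(f"{orders_of_magnitude(GENOME_BP):.1f} orders of magnitude")
print(zoom_steps(GENOME_BP), "tenfold zoom steps from genome to single base")
```

In a real genome browser the zoom factor is user-selectable, so the step count is only indicative.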
How we got the sequence
Sanger chain termination method
Primary DNA sources
Trace files repositories
Single read: 500-1000 bp (~golf ball size / jigsaw puzzle piece)
Variable quality
• WashU-Merck Human EST Project / Trace files
• ”Base-calling” non-trivial
G, C or nothing?
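One common way to express the variable quality of a read is a per-base quality score. The sketch below assumes the Phred convention, Q = -10 log10(P_error), which the slides do not name explicitly. The read, the quality values and the cut-off of 20 are made up for illustration.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Convert a Phred quality score Q to the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Inverse conversion: error probability to Phred quality score."""
    return -10 * math.log10(p)


def mask_low_quality(bases: str, quals: list[int], min_q: int = 20) -> str:
    """Replace base calls scoring below `min_q` with 'N'
    (the "G, C or nothing?" situation)."""
    if len(bases) != len(quals):
        raise ValueError("need exactly one quality score per base")
    return "".join(b if q >= min_q else "N" for b, q in zip(bases, quals))


# Hypothetical read with invented quality scores
read = "ATGCGC"
quals = [40, 35, 12, 30, 8, 25]
print(mask_low_quality(read, quals))
print(phred_to_error_prob(20))
```

Masking is only one possible policy; trimming read ends or weighting bases by quality during assembly are alternatives.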
Assembly is Non-trivial!
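To give a feel for why assembly is hard, here is a toy sketch of its core step: merging two reads by their longest exact suffix-prefix overlap. It works only under strong assumptions (error-free reads, same strand, no repeats) and is not how production assemblers operate. The reads and function names are invented.

```python
from typing import Optional


def longest_overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` equal to a prefix of `right`.
    Returns 0 if no overlap of at least `min_len` (>= 1) bases exists."""
    max_len = min(len(left), len(right))
    for k in range(max_len, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return k
    return 0


def merge_reads(left: str, right: str, min_len: int = 3) -> Optional[str]:
    """Merge two reads into one contig, or return None if they do not overlap."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]


# Invented reads sharing the 4-base overlap "GTAC"
read_a = "ATGCGTAC"
read_b = "GTACCTGA"
print(merge_reads(read_a, read_b))
```

With sequencing errors, reverse-complement reads and repeated regions, the choice of which overlap to trust is no longer obvious, which is where the real difficulty lies.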
Sequence repositories - GenBank et al.
GenBank / EMBL / DDBJ
• Highly redundant (many versions of same gene)
• Cross-updated daily
• Version history is recorded
• Previous sequence records can be retrieved
• Contigs/HTGS (100-200 kb) at different stages of finishing
• Draft → Finished
• Includes genomic DNA, cDNA, ESTs, translated peptides
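Because version history is recorded, a sequence identifier is usually written as accession.version. The following is a minimal sketch of splitting such identifiers and keeping the newest version per accession. The identifiers are placeholders, not real records, and the helper names are hypothetical.

```python
from typing import Iterable, NamedTuple


class VersionedAccession(NamedTuple):
    accession: str
    version: int


def parse_accession(identifier: str) -> VersionedAccession:
    """Split an 'ACCESSION.VERSION' string into its two parts."""
    accession, sep, version = identifier.strip().rpartition(".")
    if not sep or not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return VersionedAccession(accession, int(version))


def latest_versions(identifiers: Iterable[str]) -> dict[str, int]:
    """Keep only the highest version number seen for each accession."""
    latest: dict[str, int] = {}
    for ident in identifiers:
        acc, ver = parse_accession(ident)
        latest[acc] = max(ver, latest.get(acc, 0))
    return latest


# Placeholder identifiers for illustration only
ids = ["AB000001.1", "AB000001.3", "XY123456.2"]
print(latest_versions(ids))
```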
Non-redundant and Curated databases
Non-redundant
• Manual or automatic curation
• DNA
• RefSeq (NCBI; semi-automated)
• Ensembl gene index (automated)
• Protein
• RefSeq (NCBI; semi-automated)
• TrEMBL (EMBL; automated)
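The difference between a redundant repository and a non-redundant set can be illustrated by collapsing records that carry the same sequence. This sketch assumes exact, case-insensitive sequence identity as the redundancy criterion, which is far simpler than the curation actually performed for RefSeq or Ensembl. Record names and sequences are invented.

```python
from collections import defaultdict


def collapse_identical(records: dict[str, str]) -> dict[str, list[str]]:
    """Group record IDs by normalized sequence (uppercase, whitespace removed).
    Each key of the result stands for one non-redundant entry."""
    groups: defaultdict[str, list[str]] = defaultdict(list)
    for rec_id, seq in records.items():
        key = "".join(seq.split()).upper()
        groups[key].append(rec_id)
    return dict(groups)


# Invented records: rec1 and rec2 are the same sequence submitted twice
records = {
    "rec1": "ATGGCCAAG",
    "rec2": "atggccaag",
    "rec3": "ATGGCCAAGTAA",
}
nonredundant = collapse_identical(records)
print(len(records), "records ->", len(nonredundant), "unique sequences")
```

Note that rec3 extends rec1 but is kept separate here; deciding whether such near-duplicates represent the same gene is exactly where manual or semi-automated curation comes in.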
Curated database: UniProt/Swiss-Prot
SIB - Swiss Institute of
Bioinformatics
Protein Knowledgebase /
Sequence Database
• Highly curated
• Experimental evidence
evaluated (e.g. modifications)
• All 80,000 entries checked by
Amos Bairoch himself ;-)
ExPASy - Expert Protein
Analysis System
• Proteomics tools: links + local
servers
Structure databases / Protein Data
Bank (PDB)
X-ray, NMR biomolecular structures
Protein Data Bank (PDB)
http://www.rcsb.org/pdb/
Functional Categorization
Gene Ontology
(GO)
• Hierarchical
• Controlled
vocabulary
Functional Categorization
Gene Ontology (GO)
http://www.geneontology.org/
• Molecular Function - the tasks performed by
individual gene products; examples are
transcription factor and DNA helicase
• Biological Process - broad biological goals, such
as mitosis or purine metabolism, that are
accomplished by ordered assemblies of
molecular functions
• Cellular Component - subcellular structures,
locations, and macromolecular complexes;
examples include nucleus, telomere, and origin
recognition complex
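The hierarchical, controlled vocabulary can be pictured as a graph in which each term points to its more general parents. The sketch below uses a toy child-to-parents map to show how an annotation to a specific term implies annotation to all of its ancestors. The term names and relationships are simplified illustrations, not an extract of the real Gene Ontology, and the gene name is hypothetical.

```python
# Toy child -> parents map (a term may have more than one parent).
# Simplified for illustration; not copied from the real Gene Ontology.
IS_A = {
    "DNA helicase activity": ["helicase activity", "activity on DNA"],
    "helicase activity": ["catalytic activity"],
    "activity on DNA": ["catalytic activity"],
    "catalytic activity": ["molecular function"],
    "molecular function": [],
}


def ancestors(term: str, is_a: dict[str, list[str]]) -> set[str]:
    """Return every term reachable by following parent links upward."""
    seen: set[str] = set()
    stack = list(is_a.get(term, []))
    while stack:
        parent = stack.pop()
        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))
    return seen


def propagate(annotations: dict[str, set[str]],
              is_a: dict[str, list[str]]) -> dict[str, set[str]]:
    """Expand each gene's direct annotations with all ancestor terms."""
    expanded = {}
    for gene, terms in annotations.items():
        full = set(terms)
        for term in terms:
            full |= ancestors(term, is_a)
        expanded[gene] = full
    return expanded


# Hypothetical gene annotated with one specific term
direct = {"geneX": {"DNA helicase activity"}}
print(sorted(propagate(direct, IS_A)["geneX"]))
```

This upward propagation is why a query for a broad term can also return genes annotated only with its more specific descendants.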
Integration of databases - Webs of websites
Links, links, links...
SRS = Sequence Retrieval System
• Powerful, complex query language
• http://srs.ebi.ac.uk/
BioDAS: Distributed Annotation System
For ’my gene’, how do I:
Get an overview of the sequence
information known? (GeneCards+OMIM)
Examine the ’Genome Neighbourhood’?
(Genome Browsers)
Predict protein post-translational
modifications (PTMs)? (Prediction servers)
• (Evaluate the value of predicted features)
GeneCards
http://nciarray.nci.nih.gov/cards/
GeneCards-II
GeneCards-III
GeneCards-IV
GeneCards-V
Genetic/Medical Information
OMIM, Online Mendelian Inheritance in
Man (NCBI)
• The OMIM database is a catalog of human genes
and genetic disorders
• >16,000 entries (April, 2006)
• Examples: cystic fibrosis, prions, amyloid
precursor protein
• Condensed, highly curated descriptions of
genetics/disease/animal models/references
OMIM-I
(http://www3.ncbi.nlm.nih.gov/Omim/)
OMIM-II
OMIM-III
For ’my gene’, how do I:
Get an overview of the sequence
information known? (GeneCards+OMIM)
Examine the ’Genome Neighbourhood’?
(Genome Browsers)
Predict protein post-translational
modifications (PTMs)? (Prediction servers)
• (Evaluate the value of predicted features)
Genome Browsing
Three public browsers
• Open access
• Use same genome build/assembly
• NCBI (U.S.)
• UCSC (Santa Cruz, U.S.)
• Ensembl (EBI, EU)
(One private browser: Celera)
• (Restricted, commercial; closed 2005)
Celera Discovery System & Database
Genome Browsers
- Portals to the Genomic World
UCSC – Univ. California – Santa Cruz (U.S.)
• http://genome.ucsc.edu/
NCBI – National Center for Biotechnology
Information (U.S.)
• http://www.ncbi.nlm.nih.gov/Genomes/index.html
EnsEmbl – European Molecular Biology
Laboratory (E.U.)
• http://www.ensembl.org/
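Genome browsers are typically queried by position. The sketch below parses a region string of the form chr:start-end, as commonly accepted in browser search boxes, and widens it around its midpoint to mimic zooming out. The coordinates are arbitrary placeholders and the function names are hypothetical; no browser API is called.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 1-based, inclusive
    end: int    # 1-based, inclusive


_REGION_RE = re.compile(r"^(?P<chrom>[^:\s]+):(?P<start>[\d,]+)-(?P<end>[\d,]+)$")


def parse_region(text: str) -> Region:
    """Parse 'chr:start-end'; thousands separators (commas) are allowed."""
    m = _REGION_RE.match(text.strip())
    if m is None:
        raise ValueError(f"cannot parse region: {text!r}")
    start = int(m["start"].replace(",", ""))
    end = int(m["end"].replace(",", ""))
    if start > end:
        raise ValueError("start must not exceed end")
    return Region(m["chrom"], start, end)


def zoom_out(region: Region, factor: int = 10) -> Region:
    """Widen the window around its midpoint by `factor`, clamped at position 1.
    The chromosome length is not known here, so the end is not clamped."""
    width = region.end - region.start + 1
    mid = (region.start + region.end) // 2
    half = (width * factor) // 2
    return Region(region.chrom, max(1, mid - half), mid + half)


# Placeholder coordinates for illustration only
view = parse_region("chr1:1,000,000-1,000,999")
print(view)
print(zoom_out(view))
```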
UCSC – Genome Browser
UCSC – Genome Browser II
NCBI
EnsEmbl – Genome Browser
For ’my gene’, how do I:
Get an overview of the sequence
information known? (GeneCards)
Examine the ’Genome Neighbourhood’?
(Genome Browsers)
Predict protein post-translational
modifications (PTMs) or Gene Structure?
(Prediction servers)
• ...and evaluate the reliability of prediction
methods
CBS Services/Toolbox
http://www.cbs.dtu.dk/services/
NetPhos – a prediction server
http://www.cbs.dtu.dk/services/NetPhos/
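NetPhos predicts phosphorylation sites, which occur on serine, threonine and tyrosine residues. As a rough illustration of the input such a predictor works on, the sketch below only enumerates candidate S/T/Y positions together with their sequence context. It does not score anything and is not the NetPhos method. The window size, the peptide and the function name are arbitrary assumptions.

```python
from typing import Iterator


def candidate_sites(protein: str, flank: int = 4) -> Iterator[tuple[int, str, str]]:
    """Yield (1-based position, residue, context window) for each S, T or Y.
    The window holds up to `flank` residues on either side, truncated at the
    sequence ends."""
    seq = protein.upper()
    for i, aa in enumerate(seq):
        if aa in "STY":
            lo = max(0, i - flank)
            hi = min(len(seq), i + flank + 1)
            yield i + 1, aa, seq[lo:hi]


# Invented peptide for illustration
peptide = "MKRSPTAAYLE"
for pos, aa, window in candidate_sites(peptide):
    print(pos, aa, window)
```

A real predictor would assign each candidate a score and a threshold would decide which ones are reported, which is why the evaluation questions on the following slides matter.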
Evaluating Prediction Servers
Performance on independent/cross-validated data presented?
Published in peer-reviewed journal?
Cited by others?
• Science Citation Index
Linked to from credible web sites?
• Google Page-rank
• ”link:URL” search
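When performance figures are presented, they usually derive from counts of true and false positives and negatives on a test set. The sketch below computes sensitivity, specificity and the Matthews correlation coefficient from such counts. The choice of these three measures is an assumption about what a paper might report, and the counts are invented.

```python
import math


def evaluate(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Compute common performance measures from confusion-matrix counts.
    Measures with an undefined denominator are reported as 0.0."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sensitivity": sensitivity, "specificity": specificity, "mcc": mcc}


# Invented counts for a hypothetical predictor on an independent test set
result = evaluate(tp=40, fp=10, tn=90, fn=20)
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

The numbers are only meaningful if the test data were not used for training, which is the point of asking for independent or cross-validated results.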
2can Bioinformatics Education
At EBI (European Bioinformatics Institute)
http://www.ebi.ac.uk/2can/index.html
Tutorials, resource links, etc.
EnsEMBL Bioinformatics Education
Starting Points
General Bioinformatics
• NCBI, National Center for Biotechnology
Information, U.S.
• EBI, European Bioinformatics Institute
Prediction Tools
• CBS, DK
• Expasy (Protein analysis), Switzerland
Dynamic Resources
Pros
• Includes most recent developments
• Updated regularly
• User interface improves (usually)
Cons
• Difficult to keep pace
• Tutorials and lectures hard to recycle ;-(
• Difficult to use at irregular intervals
Genome Browsers
- Portals to the Genomic World
Three main entry points:
• NCBI, UCSC, EnsEmbl
• Essentially contain same information
• High degree of linking to secondary databases
• Advisable to become familiar with only one genome
browser
• Learn to navigate and make queries
GeneCards and OMIM
• well suited for getting a quick overview of a
gene of interest
Prediction Servers
Evaluate scientific ’soundness’
• Look for indications of quality (citations, etc.)
Remember that prediction servers
provide...well, predictions!
Learning Objectives
The student should be able to:
• Describe differences between sequence
repositories and curated databases
• Describe the challenges of maintaining
genome-wide biological databases
• List two entry points for getting an
overview of ”my gene of interest”
• Describe how prediction servers may be
evaluated
Immediate Feedback
Title: ”Resources of Biomolecular Data:
Sequences, Structures and Functionality”
Did the lecture live up to your
expectations?
Did you expect to learn about
resources that were not covered
during this lecture?
NB! You can also provide input at the general
course evaluation
25,000?
The End