Bioinformatics information resources and networks
Bioinformatics
Information Resources And Networks
Bioinformatics and Systems Biology Group
www.sbi.informatik.uni-rostock.de
Ulf Schmitz, Bioinformatics Information Resources and Networks
Outline
• Bioinformatics Information Resources And Networks
– EMBnet – European Molecular Biology Network
• DBs and Tools
– NCBI – National Center For Biotechnology Information
• DBs and Tools
– Nucleic Acid Sequence Databases
– Protein Information Resources
– Metabolic Databases
– Mapping Databases
– Databases concerning Mutations
– Literature Databases
EMBnet – European Molecular Biology Network
• Founded in 1988
• A network that links European laboratories that use biocomputing and bioinformatics in molecular biology research
• A science-based group of collaborating nodes throughout Europe, plus nodes outside Europe
• Provides information, services and training to its users
• Works to increase the availability and accessibility of data resources and computing tools
• Aims to increase knowledge and proficiency in bioinformatics through education and training
EMBnet - Nodes
EMBnet (41 nodes)
National Nodes (18)
• governmental
Specialist Nodes (9)
• academic, industrial research centers
Associate Nodes (11)
• Biocomputing centers from non European countries
EMBnet - Nodes
National Nodes
Vienna Biocenter - Austria
BEN - Belgium
CSC - Finland
INFOBIOGEN - France
DKFZ - Germany
HEN - Hungary
INCBI - Ireland
INN - Israel
IEN-AdR - Italy
CMBI - Netherlands
Bio - Norway
IBB - Poland
PEN - Portugal
GeneBee - Russia
CNB-CSIC - Spain
BMC - Sweden
SIB - Switzerland
SEQNET - UK
• Appointed by the
governments
• Provide on-line services,
user support and training
EMBnet - Nodes
Specialist Nodes
• Academic, industrial or research centers in specific areas of bioinformatics
• Largely responsible for maintenance of biological databases and software
MIPS (Munich Information Center for Protein Sequences)
ICGEB
Pharmacia
F. Hoffmann – La Roche
EBI
HGMP-RC
Sanger
UCL
Hinxton Hall (Cambridge UK)
Important key specialist node and home of the EMBL, SWISS-PROT and TrEMBL databases
EMBnet - Nodes
Associate Nodes
IBBM - Argentina
ANGIS - Australia
CBI - China
CIGB - Cuba
CDFD - India
SANBI – South Africa
EMBnet - Brazil
CBR - Canada
EMBnet - Chile
EMBnet - Colombia
CIFN - Mexico
• Centers from non European countries
EMBnet’s Mission
• Assist in biotechnological and bioinformatics
related research
• Provide training and education
• Exploit network infrastructures
• Investigate and develop new technologies
• Bridge between commercial and academic sectors
Who are EMBnet’s Users?
• > 40,000 registered users from all over the
world as well as a larger number of Internet
users
• All scientists working in Life Sciences, from
undergraduate students to top level
scientists, in academia as well as industry,
can get support from EMBnet
EMBnet – SRS
Sequence Retrieval System - SRS
[Diagram: EMBnet with its National, Specialist and Associate Nodes]
• The result of a research project within EMBnet to interrogate all the resources it has gathered together
• SRS is a network browser for DBs in molecular biology
• SRS allows any flat-file DB to be indexed to any other
• Supports queries across a range of different DB types via a single interface
• Independent of underlying data structures or query languages
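The idea of indexing one flat-file DB against another can be illustrated with a small sketch. This is not how SRS itself is implemented; the record layout (two-letter line codes ID, DE and DR, entries ending in "//"), the database name NUCDB and all entry contents are assumptions invented for the illustration.

```python
from collections import defaultdict

# Two toy flat-file "databases"; entries are separated by "//" lines.
# Line codes, database names and entry contents are invented for this sketch.
PROTEIN_DB = """\
ID   PROT_A
DE   Example kinase
DR   NUCDB; N0001
//
ID   PROT_B
DE   Example transporter
DR   NUCDB; N0002
//
"""

NUCLEOTIDE_DB = """\
ID   N0001
DE   mRNA for example kinase
//
ID   N0002
DE   mRNA for example transporter
//
"""


def parse_flat_file(text):
    """Return {entry_id: {line_code: [values]}} for a flat-file database."""
    entries = {}
    current = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):
            if current:
                entries[current["ID"][0]] = dict(current)
            current = defaultdict(list)
        elif line.strip():
            code, value = line[:2], line[5:].strip()
            current[code].append(value)
    return entries


def build_link_index(entries, target_db):
    """Index the cross-references (DR lines) that point into target_db."""
    index = {}
    for entry_id, fields in entries.items():
        for ref in fields.get("DR", []):
            db_name, accession = [part.strip() for part in ref.split(";")]
            if db_name == target_db:
                index[entry_id] = accession
    return index


proteins = parse_flat_file(PROTEIN_DB)
nucleotides = parse_flat_file(NUCLEOTIDE_DB)
links = build_link_index(proteins, "NUCDB")


def linked_description(protein_id):
    """Follow the link from a protein entry to its nucleotide entry."""
    return nucleotides[links[protein_id]]["DE"][0]
```

Once the link index exists, a query against the protein file can be answered with data from the nucleotide file through one function call, which is the "single interface" idea in miniature.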
EMBnet - EMBOSS
• The European Molecular Biology Open Software Suite
• EMBOSS is a free Open Source software analysis package specially
developed for the needs of the molecular biology (e.g. EMBnet) user
community.
• The software automatically copes with data in a variety of formats and
even allows transparent retrieval of sequence data from the web.
• Also, as extensive libraries are provided with the package, it is a
platform to allow other scientists to develop and release software in true
open source spirit.
• EMBOSS also integrates a range of currently available packages and
tools for sequence analysis into a seamless whole.
What can EMBOSS do for you?
• Within EMBOSS you will find hundreds of programs (applications) covering areas such as:
– Sequence alignment,
– Rapid database searching with sequence patterns,
– Protein motif identification, including domain analysis,
– Nucleotide sequence pattern analysis, for example to identify CpG islands or repeats,
– Codon usage analysis for small genomes,
– Rapid identification of sequence patterns in large scale sequence sets,
– Presentation tools for publication,
• and much more. Check: http://emboss.sourceforge.net/
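To give a feel for what nucleotide pattern analysis such as CpG island detection involves, here is a minimal Python sketch. It is not EMBOSS code. The thresholds (GC fraction above 0.5, observed/expected CpG ratio above 0.6, 200-base windows) are commonly quoted defaults and are used here as assumptions; the function names are hypothetical.

```python
def cpg_stats(seq):
    """Return (GC fraction, observed/expected CpG ratio) for a DNA sequence."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    cpg = seq.count("CG")                    # observed CpG dinucleotides
    gc_fraction = (c + g) / n if n else 0.0
    expected = c * g / n if n else 0.0       # expected CpG count given base composition
    ratio = cpg / expected if expected else 0.0
    return gc_fraction, ratio


def cpg_windows(seq, window=200, step=50, min_gc=0.5, min_ratio=0.6):
    """Yield (start, end) for every window that passes both thresholds."""
    for start in range(0, max(len(seq) - window, 0) + 1, step):
        chunk = seq[start:start + window]
        gc, ratio = cpg_stats(chunk)
        if gc > min_gc and ratio > min_ratio:
            yield start, start + len(chunk)


# A synthetic sequence: a CpG-rich stretch followed by an AT-rich stretch.
toy = "CG" * 100 + "AT" * 100
islands = list(cpg_windows(toy, window=200, step=100))
```

A real tool would also merge overlapping windows and report the island boundaries; the sketch only flags candidate windows.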
Jemboss
NCBI – National Center For Biotechnology Information
• Leading American information provider
• Established in 1988 as a division of the National Library of Medicine (NLM)
– Located on the campus of the National Institutes of Health (NIH – Bethesda, Maryland)
Mission:
• Development of new information
technologies to aid our understanding
of the molecular and genetic
processes that underlie health and
disease
• Creation of systems for storing and
analysing biological information
• Development of advanced methods of
computer-based information
processing
• Facilitation of user access to DBs and
software
• Co-ordination of efforts to gather
biotechnology information worldwide
NCBI
• Since 1992: maintenance of GenBank and collaboration with the international nucleotide DBs EMBL and DDBJ (Japan)
• Provides Entrez, which facilitates access to biological DBs (similar to SRS, which is provided by EMBnet)
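Entrez can also be queried from a program. The sketch below only builds a search URL for NCBI's E-utilities web interface (the `esearch.fcgi` endpoint with its `db`, `term` and `retmax` parameters); the E-utilities are not mentioned on the slide and are brought in here as an assumption about how one would reach Entrez today. The function name and the example query are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumption: not part of the original slides).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(db, term, retmax=20):
    """Build an ESearch URL that would list matching record IDs in one Entrez DB."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


# Example: up to five nucleotide records for a gene name restricted to human.
url = build_esearch_url(
    "nucleotide", "BRCA1[Gene Name] AND Homo sapiens[Organism]", retmax=5
)
```

Switching the `db` argument (for example to `pubmed` or `protein`) sends the same kind of query to a different database, which mirrors the single-interface idea behind both Entrez and SRS.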
NCBI - Responsibilities
• administers research on biomedical problems at the molecular level using mathematical and computational methods
• maintains collaborations with several NIH (National Institutes of Health) institutes, academia, industry, and other governmental agencies
• promotes scientific communication by sponsoring meetings, workshops, and lecture series
• supports training on basic and applied research in computational biology for postdoctoral fellows through the NIH Intramural Research Program
• engages members of the international scientific community in informatics research and training through the Scientific Visitors Program
• develops, distributes, supports, and coordinates access to a variety of databases and software for the scientific and medical communities
• develops and promotes standards for databases, data deposition and exchange, and biological nomenclature
Nucleic Acid Sequence Databases
• The principal nucleic acid sequence databases are GenBank, EMBL and DDBJ. Each collects a portion of the total sequence data reported world-wide, and they exchange new and updated entries on a daily basis
Nucleic acid sequence Databases
EMBL (Europe)
GenBank (USA)
DDBJ (Japan)
ENSEMBL (project between EMBL - EBI and the Sanger Institute)
dbEST (division of GenBank)
GSDB (division of GenBank)
EMBL
Nucleic Acid Sequence Databases EMBL
The EMBL Database (as of yesterday morning) contains 115,478,836,243 nucleotides in 63,713,453 entries.
Entry Type                      Entries       Nucleotides
Standard                        52,092,157    55,843,115,059
Constructed (CON)               339,875       n/a
Third Party Annotation (TPA)    4,737         331,788,217
Whole Genome Shotgun (WGS)      11,275,863    58,772,358,766
source: http://www3.ebi.ac.uk/Services/DBStats/
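A quick calculation puts these totals in perspective. The sketch below uses only the figures quoted on the slide; the variable and function names are made up for the example. It works out to roughly 1,800 nucleotides per entry overall, with WGS data accounting for about half of all nucleotides.

```python
# (entries, nucleotides) per entry type, copied from the slide.
# Constructed (CON) entries are left out because no nucleotide count is given.
embl = {
    "Standard": (52_092_157, 55_843_115_059),
    "Third Party Annotation (TPA)": (4_737, 331_788_217),
    "Whole Genome Shotgun (WGS)": (11_275_863, 58_772_358_766),
}
total_entries, total_nucleotides = 63_713_453, 115_478_836_243


def mean_length(entries, nucleotides):
    """Average number of nucleotides per entry."""
    return nucleotides / entries


overall = mean_length(total_entries, total_nucleotides)
per_type = {name: mean_length(*counts) for name, counts in embl.items()}
wgs_share = embl["Whole Genome Shotgun (WGS)"][1] / total_nucleotides

for name, value in per_type.items():
    print(f"{name}: {value:,.0f} nucleotides per entry")
print(f"Overall: {overall:,.0f} nucleotides per entry, WGS share {wgs_share:.1%}")
```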
Nucleic Acid Sequence Databases EMBL
Number of entries (current 63,713,453)
Total nucleotides (current 115,478,836,243)
Nucleic Acid Sequence Databases EMBL
By nucleotide count
Zea mays (corn)
Gallus gallus (chicken)
Other
Homo sapiens (human)
environmental sequence
Danio rerio (zebrafish)
Bos taurus (cattle)
Canis familiaris (dog)
Rattus norvegicus (rat)
Pan troglodytes (chimpanzee)
Mus musculus (mouse)
Nucleic Acid Sequence Databases – GenBank
• GenBank, which is produced at NCBI, is split into smaller, discrete divisions.
• This facilitates fast, specific searches by restricting queries to particular database subsets.
• During 1992-1997, the level of EST and STS data within GenBank grew 10-fold.
• Even so, the overall sequence information contributed by such partial data was still less than that of the higher quality sequences in the other major divisions.
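Why divisions speed up searching can be shown with a toy example. In the sketch below the division codes (PRI for primate, ROD for rodent, EST, STS) follow GenBank's naming, but the records, accession strings and the `search` function are invented for the illustration and do not reflect how NCBI stores or queries its data.

```python
# Toy records; accessions and lengths are invented for this sketch.
RECORDS = [
    {"accession": "TOY00001", "division": "PRI", "length": 4200},
    {"accession": "TOY00002", "division": "EST", "length": 410},
    {"accession": "TOY00003", "division": "ROD", "length": 3100},
    {"accession": "TOY00004", "division": "EST", "length": 380},
    {"accession": "TOY00005", "division": "STS", "length": 250},
]


def search(records, min_length=0, divisions=None):
    """Return accessions of matching records, optionally limited to some divisions."""
    wanted = set(divisions) if divisions else None
    return [
        r["accession"]
        for r in records
        if (wanted is None or r["division"] in wanted) and r["length"] >= min_length
    ]


everything = search(RECORDS, min_length=300)                   # scans all divisions
est_only = search(RECORDS, min_length=300, divisions=["EST"])  # scans one subset
```

Restricting the query to one division means only that subset has to be examined, which matters once the partial-sequence divisions grow much faster than the rest.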
Specialised Genomic Resources
• In addition to the comprehensive DNA sequence DBs,
there is a variety of more specialised genomic resources.
• These so-called boutique DBs bring focus to species-specific genomics and to particular sequencing techniques.
Specialised Genomic Resources
SGD – Saccharomyces Genome Database
UniGene - gene-oriented clusters from GenBank
TIGR - Databases of The Institute for Genomic Research
ACeDB – A C.elegans DataBase
Specialised Genomic Databases
• SGD
http://genome-www.stanford.edu/Saccharomyces (baker's yeast)
• AceDB
http://www.acedb.org (C. elegans)
• FlyBase
http://flybase.bio.indiana.edu (fruit fly)
• MGD
http://www.informatics.jax.org (Mouse)
Protein Information Resources
Levels of protein sequence and structural organisation:
primary
secondary
tertiary
The primary structure of a protein is its amino acid sequence
The secondary structure of a protein corresponds to regions of local
regularity (e.g., α-helices and β-strands).
The tertiary structure of a protein arises from the packing of its
secondary structure elements, which may form discrete
domains within a fold.
Protein Information Resources
Levels of protein sequence and structural organisation:
Level        Unit              Example                         Stored in
primary      sequence          AVILDRYFH                       primary database
secondary    motif             [AS]-[IL]2-X[DE]-R-[FYW]2-H     secondary database
tertiary     domain, module                                    structure database
Primary Protein Databases
• The primary structure of a protein is its amino acid sequence
• these are stored in primary databases as linear alphabets that
denote the constituent residues
Protein sequence Databases
SWISS-PROT - Protein knowledgebase
TrEMBL - Computer-annotated supplement to Swiss-Prot
PIR – Protein Information Resource
MIPS – Munich Information Centre for Protein Sequences
NRL-3D - produced by PIR
Protein Sequence Databases
• Swiss-Prot contains 197,228 sequence entries, comprising 71,501,181 amino acids abstracted from 135,257 references
• Total number of species represented in Swiss-Prot: 9,520
• The average sequence length in Swiss-Prot is 362 amino acids.
• Swiss-Prot is the most highly annotated protein sequence DB
Table of the most represented species
No.   Frequ.   Species
1     13049    Homo sapiens (Human)
2     10132    Mus musculus (Mouse)
3     5189     Saccharomyces cerevisiae (Baker's yeast)
4     4847     Escherichia coli
5     4669     Rattus norvegicus (Rat)
6     3665     Arabidopsis thaliana (Mouse-ear cress)
7     2863     Schizosaccharomyces pombe (Fission yeast)
8     2814     Bacillus subtilis
9     2750     Caenorhabditis elegans
10    2286     Drosophila melanogaster (Fruit fly)
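The quoted average length follows directly from the two totals, and the species table shows how concentrated the database is. The sketch below uses only the numbers on the slide; the variable names are invented. The ten listed species together account for roughly a quarter of all entries.

```python
entries = 197_228
amino_acids = 71_501_181

# Average sequence length: about 362.5, which the slide quotes as 362.
average_length = amino_acids / entries

# Entry counts for the ten most represented species, in table order.
top_ten = [13_049, 10_132, 5_189, 4_847, 4_669, 3_665, 2_863, 2_814, 2_750, 2_286]
human_share = top_ten[0] / entries
top_ten_share = sum(top_ten) / entries

print(f"average length: {average_length:.1f} amino acids")
print(f"human entries: {human_share:.1%}, top ten species: {top_ten_share:.1%}")
```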
Composite Protein Sequence Databases
• Composite databases amalgamate a variety of
different primary databases
• They render sequence searching much more
efficient, because they obviate the need to
interrogate multiple resources
• Different composite databases use different
primary sources and different redundancy criteria
in their amalgamation procedures
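The amalgamation step can be sketched in a few lines of Python. The redundancy criterion used here (drop a sequence if an identical one was already taken from a higher-priority source) is only one possible choice and is an assumption of the sketch, as are the toy database names, entry IDs and sequences.

```python
def build_composite(sources):
    """Merge primary databases given as (db_name, {entry_id: sequence}) in priority order.

    A sequence is kept only the first time it is seen, so the order of
    `sources` decides which database an entry is credited to.
    """
    seen = set()
    composite = []
    for db_name, entries in sources:
        for entry_id, sequence in entries.items():
            key = sequence.upper()
            if key in seen:          # identical sequence already taken from an earlier source
                continue
            seen.add(key)
            composite.append((db_name, entry_id, sequence))
    return composite


# Toy primary databases (all contents invented).
db_a = {"A1": "MKTAYIAK", "A2": "MSLNQQW"}
db_b = {"B1": "mktayiak", "B2": "MPEPTIDE"}   # B1 duplicates A1 apart from letter case

merged = build_composite([("DB_A", db_a), ("DB_B", db_b)])
merged_reversed = build_composite([("DB_B", db_b), ("DB_A", db_a)])
```

Swapping the source order changes which copy of the duplicated sequence survives, which is one reason two composite databases built from the same sources can still differ.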
Composite Protein Sequence Databases
Primary sources of each composite database:
NRDB (Non-Redundant DataBase): PDB, SWISS-PROT, PIR, GenPept, SWISS-PROTupdate, GenPeptupdate
OWL: SWISS-PROT, PIR, GenBank, NRL-3D
MIPSX: PIR1-4, MIPSOwn, MIPSTrn, MIPSH, PIRMOD, NRL-3D, SWISS-PROT, EMTrans, GBTrans, Kabat, PseqIP
SP+TrEMBL (SwissProt + TrEMBL): SWISS-PROT, TrEMBL
Secondary databases
• Secondary databases contain pattern data, i.e., diagnostic signatures
for protein families. These signatures encode the most highly
conserved features of multiply aligned sequences, which are often
crucial to the structure or function of the protein
• The secondary structure of a protein corresponds to regions of local regularity (e.g., α-helices and β-strands), which in sequence alignments are often apparent as well-conserved motifs
• Patterns are stored as regular expressions, fingerprints, blocks, profiles, etc.
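A regular-expression signature can be tried out directly in Python. The sketch below translates a pattern written in PROSITE-style syntax (elements joined by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for forbidden ones, "(n)" or "(n,m)" for repeats) into a Python regular expression. The converter is a simplified illustration rather than a complete implementation of the PROSITE format, and the example pattern is hypothetical, made up to match the toy sequence AVILDRYFH.

```python
import re


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):        # pattern anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):          # pattern anchored at the C-terminus
            suffix, element = "$", element[:-1]
        # Split the element into its core and an optional (n) or (n,m) repeat.
        match = re.fullmatch(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?", element)
        core, low, high = match.groups()
        if core.lower() == "x":            # any residue
            core = "."
        elif core.startswith("{"):         # any residue except those listed
            core = "[^" + core[1:-1] + "]"
        # "[...]" sets and single residue letters are already valid regex.
        repeat = ""
        if low:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Hypothetical signature for an invented protein family.
signature = "[AS]-x-[IL](2)-[DE]-R-[FYW](2)-H"
regex = prosite_to_regex(signature)
hit = re.search(regex, "AVILDRYFH")
miss = re.search(regex, "AVIVDRYFH")
```

Because a regular expression either matches or does not, a single unexpected substitution in a conserved position makes the sequence invisible to the pattern. Fingerprints, blocks and profiles are more tolerant of such variation.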
Secondary databases
Secondary DB    Primary source    Stored information
PROSITE         SWISS-PROT        Regular expressions (patterns)
Profiles        SWISS-PROT        Weighted matrices (profiles)
PRINTS          OWL               Aligned motifs (fingerprints)
BLOCKS          PROSITE/PRINTS    Aligned motifs (blocks)
IDENTIFY        BLOCKS/PRINTS     Fuzzy regular expressions (patterns)
Secondary databases
• TRANSFAC
http://transfac.gbf.de
• EPD
http://www.epd.isb-sib.ch
• InterPro
http://www.ebi.ac.uk/interpro/
• PROSITE
http://www.expasy.ch/prosite
• BLOCKS
http://blocks.fhcrc.org
• PRINTS
ftp://ftp.seqnet.dl.ac.uk/pub/database/prints
• PFAM
http://www.sanger.ac.uk/Software/Pfam/index.shtml
• ProDom
http://www.toulouse.inra.fr/prodom.html
• GeneCards
http://bioinformatics.weizmann.ac.il/cards
• ENSEMBL
http://www.ensembl.org
• EcoCyc
http://ecocyc.panbio.com/ecocyc/ecocyc.html
Secondary databases
• There is some overlap in content between the secondary
databases
• PDBsum alone has 35,291 entries
• Pattern DB growth is slow because the addition of detailed
family annotation is very time consuming.
• PROSITE and PRINTS are the only comprehensively,
manually annotated secondary DBs
• To address the annotation bottleneck, the secondary database curators have together created a unified database of protein families known as InterPro
Structure Classification DBs
• Contain 3D structures available from
crystallographic and spectroscopic studies
Structure Classification Databases
PDBsum – Protein Data Bank
CATH – Class, Architecture, Topology, Homology
SCOP – Structural Classification of Proteins
Structure Classification DBs
• PDB
http://www.rcsb.org
• SCOP
http://scop.mrc-lmb.cam.ac.uk/scop
• CATH
http://www.biochem.ucl.ac.uk/bsm/cath
• DSSP
http://www.sander.ebi.ac.uk/dssp
• FSSP
http://www.ebi.ac.uk/dali/fssp
• HSSP
http://www.sander.ebi.ac.uk/hssp
Metabolic Databases
• A number of metabolic databases are available electronically, some with features for querying and visualizing metabolic pathways and regulatory networks.
• KEGG (Kyoto Encyclopedia of Genes and Genomes)
http://www.genome.ad.jp/kegg
• ENZYME (Enzyme nomenclature database)
http://www.expasy.ch/enzyme
• BRENDA (Enzyme Information System)
http://www.brenda.uni-koeln.de
• EMP (Enzymes and Metabolic Pathways database)
http://www.empproject.com
Mapping Databases
• OMIM
http://www3.ncbi.nlm.nih.gov/omim
• GDB
http://www.gdb.org
• RHDB
http://corba.ebi.ac.uk/RHdb
Databases concerning Mutations
• dbSNP
http://www.ncbi.nlm.nih.gov/SNP
• HGBASE
http://hgbase.cgr.ki.se
• The SNP Consortium (TSC)
http://snp.cshl.org
• HAEMA
http://europium.csc.mrc.ac.uk/usr/WWW/WebPages/database.dir/quiz.dir/intrquiz.htm
Literature Databases
• PubMed
http://www.ncbi.nlm.nih.gov/entrez/query
• The Lancet
http://www.thelancet.com
• Bioinformatics Online
http://www.bioinformatics.oupjournals.org
• Nature
http://www.nature.com
• Science
http://www.sciencemag.org
Outlook – coming lecture
• Introduction to sequence alignment
• Pairwise sequence alignment
– The Dot Matrix
– Dynamic Programming
– Scoring Matrices
• Local alignment
• Alignment tools
– BLAST
– FASTA
– ALIGN
The End
Thanks for your attention!