Alphy 99 - Claude Bernard University Lyon 1
Sequence databases and retrieval systems
Guy Perrière
[ replaced by Manolo Gouy ]
Pôle Bio-Informatique Lyonnais
Laboratoire de Biométrie et Biologie Évolutive
UMR CNRS n° 5558
Université Claude Bernard – Lyon 1
In the beginning
First paper compilation in 1965 (Atlas of
Protein Sequence and Structure).
Development of real databanks in the early 1980s:
Fast access.
Make possible analyses that require a lot of
data:
– Codon usage.
– Molecular phylogeny.
General databanks
Nucleotide sequences:
EMBL/GenBank/DDBJ.
Protein sequences:
Simple translations of coding regions:
– GenPept (from GenBank).
– TrEMBL (from EMBL).
Systems containing additional data:
– SWISS-PROT.
– PIR.
EMBL
Created in 1980 at the European Molecular
Biology Laboratory in Heidelberg.
Maintained since 1994 at the European
Bioinformatics Institute (EBI) near Cambridge.
Web server:
http://www.ebi.ac.uk/embl
GenBank
Set up in 1979 at the Los Alamos National
Laboratory in New Mexico, US.
Maintained since 1992 at the National Center for Biotechnology Information (NCBI) in
Bethesda.
Web server:
http://www.ncbi.nlm.nih.gov/Genbank/index.html
DDBJ
Active since 1984 at the National Institute of
Genetics (NIG) in Mishima, Japan.
Web server:
http://www.ddbj.nig.ac.jp
EMBL / GenBank / DDBJ
The International Nucleotide Sequence Database
Collaboration: EMBL / GenBank / DDBJ.
New sequences are exchanged daily between the
three centers, so the three banks have identical content.
Data are mainly provided by direct submissions from
the authors over the Internet:
Web forms.
Email.
Data growth
[Figure: log (number of residues) plotted against release date, from 03/83 to 03/03, for GenBank, EMBL, PIR and SWISS-PROT; the vertical axis runs from 5 to 11.]
GenBank/EMBL size (April 2003)
31 × 10^9 nucleotides.
24 × 10^6 sequences.
1.8 million genes (proteins and RNA).
313,000 bibliographic references.
100 gigabytes on disk.
Growth of 63 % in 12 months.
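As a rough illustration of what a 63 % increase in 12 months implies, the sketch below converts it into a doubling time. It assumes that growth is exponential at a constant rate, which is an assumption made here and not something the figures above establish. The function name is illustrative.

```python
import math

def doubling_time_months(growth: float, period_months: float = 12.0) -> float:
    """Doubling time for a quantity that grows by `growth` (0.63 means 63 %)
    every `period_months`, ASSUMING constant exponential growth."""
    return period_months * math.log(2) / math.log(1.0 + growth)

months = doubling_time_months(0.63)
print(f"Doubling time: about {months:.0f} months")
```

Under that assumption the databank doubles in size in well under two years.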
Taxonomic sampling (April 2003)
There are 135,560 species for which at least one sequence is available.
Nine species (0.007 %) correspond to 62 % of the total.
77,900 species are represented by only one sequence!
The nine most represented species in GenBank/EMBL:
Homo sapiens               27.3 %
Mus musculus               20.1 %
Zea mays                    3.0 %
Rattus norvegicus           2.9 %
Brassica oleracea           2.3 %
Arabidopsis thaliana        2.0 %
Danio rerio                 2.0 %
Drosophila melanogaster     1.4 %
Oryza sativa                0.9 %
Distribution format
The banks are distributed as a set of text
files called divisions (292 for EMBL).
A division contains sequences related to:
A taxon (e.g., bacteria, invertebrates, mammals).
A class of sequences (EST, HTG, GSS).
Within a division, each sequence is called an
entry.
Entry structure
Information is introduced in structured
fields.
The format differs in form between
EMBL and GenBank/DDBJ,
but not in substance.
ID, AC, SV and DT fields
Contain identifiers and the creation and the last
modification dates for the entries.
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
SV   V00101.1
XX
DT   13-JUL-1983 (Rel. 03, Created)
DT   12-NOV-1996 (Rel. 49, Last updated, Version 11)
DE, KW, OS and OC fields
Definition, Keywords, Taxonomy.
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
OC   Bacteria; Firmicutes; Bacillus/Clostridium group;
OC   Bacillus/Staphylococcus group; Bacillus.
The NCBI maintains a unified taxonomy, largely based on
sequence information.
RN, RX, RA and RT fields
contain bibliographic information.
RN   [1]
RP   1-2680
RX   MEDLINE; 83143299.
RA   Yang M., Galizzi A., Henner D.J.;
RT   "Nucleotide sequence of the amylase gene from
RT   Bacillus subtilis";
RL   Nucleic Acids Res. 11:237-249(1983).
…
FT field
contains the descriptions of functional regions.
     key             location and qualifiers
FT   promoter        369..374
FT                   /note="put. promoter sequence P2 [3] (amyR1)"
FT   RBS             414..419
FT                   /note="rRNA-binding site rbs-1 [3]"
FT   CDS             498..2480
FT                   /gene="amyE"
FT                   /db_xref="SWISS-PROT:P00691"
FT                   /product="alpha-amylase precursor"
FT                   /EC_number="3.2.1.1"
FT                   /protein_id="CAA23437.1"
FT                   /translation="MFAKRFKTSLLPLFAGFLLLFHLVLAGPAA
FT                   ASAETANKSNELTAPSIKSGTILHAWNWSFNTLKHNMKDIHDAG
...
Intron/exon structure
[Figure: a sequence and its subsequences.]
FT   CDS             join(242..610,3397..3542,5100..5351)
FT                   /codon_start=1
FT                   /db_xref="SWISS-PROT:P01308"
FT                   /note="precursor"
FT                   /gene="INS"
FT                   /product="insulin"
...
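The join(...) location of the CDS feature lists the segments that are concatenated to give the coding sequence. A minimal sketch of how such a location string might be read is given below. It handles only plain `a..b` ranges and `join(...)` of such ranges; `complement(...)`, partial markers (`<`, `>`) and references to other entries are deliberately left out. The function names are illustrative and are not part of any databank software.

```python
import re

def parse_location(location: str) -> list[tuple[int, int]]:
    """Return (start, end) pairs, 1-based and inclusive.

    Sketch only: supports 'a..b' and 'join(a..b,c..d,...)'.
    """
    text = location.strip()
    if text.startswith("join(") and text.endswith(")"):
        text = text[len("join("):-1]
    segments = []
    for part in text.split(","):
        match = re.fullmatch(r"\s*(\d+)\.\.(\d+)\s*", part)
        if match is None:
            raise ValueError(f"unsupported location segment: {part!r}")
        segments.append((int(match.group(1)), int(match.group(2))))
    return segments

def spliced_length(segments: list[tuple[int, int]]) -> int:
    """Total number of positions covered once the segments are joined."""
    return sum(end - start + 1 for start, end in segments)

exons = parse_location("join(242..610,3397..3542,5100..5351)")
print(exons)
print(spliced_length(exons))
```

The same function accepts a location without join, such as the `498..2480` of the amyE CDS shown earlier, and returns a single segment.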
SQ field
Contains the sequence itself.
SQ   Sequence 2680 BP; 825 A; 520 C; 642 G; 693 T; 0 other;
     gctcatgccg agaatagaca ccaaagaaga actgtaaaaa cgggtgaagc agcagcgaat        60
     agaatcaatt gcttgcgcct ttgcggtagt ggtgcttacg atgtacgaca gggggattcc       120
     ccatacattc ttcgcttggc tgaaaatgat tcttcttttt atcgtctgcg gcggcgttct       180
     gtttctgctt cggtatgtga ttgtgaagct ggcttacaga agagcggtaa aagaagaaat       240
     (...)
     gatggtttct tttttgttca taaatcagac aaaacttttc tcttgcaaaa gtttgtgaag      2580
     tgttgcacaa tataaatgtg aaatacttca caaacaaaaa gacatcaaag agaaacatac      2640
     cctgcaagga tgctgatatt gtctgcattt gcgccggagc                            2680
//
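Because every line of an entry starts with a two-letter code (ID, AC, DE, KW, OS and so on), an entry can be split into fields with very little code. The sketch below is one way to do it, assuming the usual flat-file layout in which the code occupies the first two columns and the content starts at column 6. The sample text is abridged from the BSAMYL entry shown in these slides; `parse_entry` is an illustrative name, not an existing tool.

```python
from collections import defaultdict

SAMPLE_ENTRY = """\
ID   BSAMYL     standard; DNA; PRO; 2680 BP.
XX
AC   V00101; J01547
XX
DE   Bacillus subtilis amylase gene.
XX
KW   amyE gene; amylase; amylase-alpha;
KW   regulatory region; signal peptide.
XX
OS   Bacillus subtilis
//
"""

def parse_entry(text: str) -> dict[str, list[str]]:
    """Group the lines of one EMBL-style entry by their two-letter code."""
    fields = defaultdict(list)
    for line in text.splitlines():
        code, content = line[:2], line[5:].strip()
        if code == "//":                      # end-of-entry marker
            break
        if code == "XX" or not code.strip():  # spacer line or sequence line
            continue
        fields[code].append(content)
    return dict(fields)

entry = parse_entry(SAMPLE_ENTRY)
# Multi-line fields are joined before splitting on the ';' separator.
accessions = [a.strip() for a in " ".join(entry["AC"]).split(";") if a.strip()]
keywords = [k.strip(" .") for k in " ".join(entry["KW"]).split(";") if k.strip(" .")]
print(accessions)
print(keywords)
```

Sequence lines, which begin with blanks, are skipped here; a fuller reader would collect them after the SQ line.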
Errors in databanks
There are a lot of errors in the nucleotide
sequence databanks:
In annotations:
– Inaccuracies, omissions, and even mistakes.
– Inconsistencies between entries.
In the sequences themselves:
– Sequencing errors.
– Cloning vectors inserted.
Redundancy
Another major problem is redundancy.
A lot of entries are partially or entirely duplicated:
20% of vertebrate sequences in GenBank.
Duplicated entries are often different in their sequence.
[Figure: partial and complete sequence duplications.]
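The two kinds of duplication, complete and partial, can be illustrated with a small sketch that flags entries whose sequence is identical to, or wholly contained in, another entry. It uses exact string matching only, so it would miss the common case noted above where duplicated entries differ slightly in sequence; detecting those would need an alignment step. The entry names and sequences are invented for the example.

```python
def find_duplicates(entries: dict[str, str]) -> list[tuple[str, str, str]]:
    """Return (shorter, longer, kind) for exact complete or partial duplicates.

    kind is 'complete' when both sequences are identical and 'partial'
    when the shorter one occurs inside the longer one.
    Compares all pairs, so it is only suitable for small sets.
    """
    names = sorted(entries, key=lambda n: (len(entries[n]), n))
    found = []
    for i, short in enumerate(names):
        for long_ in names[i + 1:]:
            a, b = entries[short].lower(), entries[long_].lower()
            if a == b:
                found.append((short, long_, "complete"))
            elif a in b:
                found.append((short, long_, "partial"))
    return found

toy = {
    "E1": "gctcatgccgagaatagaca",
    "E2": "gctcatgccgagaatagaca",   # same sequence as E1
    "E3": "catgccgaga",             # fragment of E1 and E2
    "E4": "ttttttgttcataaatcaga",   # unrelated
}
duplicates = find_duplicates(toy)
for record in duplicates:
    print(record)
```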
Protein sequence databases
Translation of Coding DNA Sequences (CDS)
from EMBL/GenBank/DDBJ.
Consultation of publications or patents.
Very small number of direct protein sequence
submissions by authors.
In SWISS-PROT and PIR: additional annotations.
SWISS-PROT
Created by Amos Bairoch in 1986 at the
Department of Medical Biochemistry in
Geneva.
Maintained by the Swiss Institute of
Bioinformatics (SIB) and funded by
GeneBio, and, very recently, by NIH.
Web server:
http://www.expasy.ch/sprot/sprot-top.html
SWISS-PROT characteristics
Almost no redundancy.
Cross-references with 60 other databanks.
High-quality annotations:
Systematic control by a team of annotators.
Help from a set of > 200 volunteer experts.
Embedded in ExPASy, a WWW proteomics
server (http://www.expasy.org).
Annotations
Protein function.
Post-translational modifications.
Structural or functional domains.
Secondary and quaternary structures.
Similarities with other proteins.
Conflicts between positions for CDS.
Disease-related mutations
Associated databanks
TrEMBL, built using only annotated CDS
from the EMBL data library.
ENZYME, for the international enzyme
nomenclature.
PROSITE, for biologically significant sites,
patterns and profiles.
SWISS-2DPAGE, for two-dimensional
polyacrylamide gel electrophoresis maps.
PIR
PIR (The Protein Information Resource) was
created by Margaret Dayhoff in 1965.
Aims:
To provide exhaustive and non-redundant
protein sequence data.
To give a classification using taxonomic and
similarity data:
entries grouped in super-families, families
and subfamilies.
Data maintenance
Three organizations collect and organize the
data entered in PIR:
The National Biomedical Research Foundation
(NBRF) in the United States.
The Martinsried Institute for Protein Sequences
(MIPS) in Germany.
The Japan International Protein Sequence
Information Database (JIPID) in Japan.
Results
Coverage is no better than what is obtained
with SWISS-PROT+TrEMBL.
Still contains redundancy.
Less comprehensive annotation.
Low number of cross-references.
PIR has recently joined forces with EBI and SIB
to establish UniProt (United Protein
Databases), a central resource for protein
sequence and function.
Specialized databanks
A lot of specialized databanks have been
developed, which are devoted to:
Complete genomes.
Families of homologous genes.
Non-sequence data.
These systems are under the responsibility
of curators:
Data quality and homogeneity control.
Complete genomes
There is a large number of databanks
devoted to specific organisms.
These banks are associated with sequencing or
mapping projects.
For some model organisms there are often
several competing systems.
Examples
Organism: available databanks
Bacillus subtilis: NRSub (Non-Redundant B. subtilis); SubtiList.
Escherichia coli: Colibri; EcoGene (E. coli Gene Database); ECDC (E. coli Database Collection).
Various prokaryotes: CMR (Comprehensive Microbial Resource); EMGLib (Enhanced Microbial Genomes Library); Micado (Microbial Advanced Database Organization).
Saccharomyces cerevisiae: MYGD (MIPS Yeast Genome Database); SGD (Saccharomyces Genome Database); YPD (Yeast Proteome Database).
Drosophila melanogaster: FlyBase.
Plasmodium falciparum: PlasmoDB (P. falciparum Database).
Caenorhabditis elegans: WormBase; WormPD (Worm Protein Database).
Arabidopsis thaliana: TAIR (The Arabidopsis Information Resource).
Gene family databanks
Built with automated procedures:
Similarity search between sets of proteins
(BLASTP, FASTP, Smith-Waterman).
Clustering into homologous families using
similarity criteria.
Include various data:
Protein (and sometimes nucleotide) sequences.
Multiple sequence alignments and trees.
Taxonomy.
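The second step of the automated procedure, clustering into families from pairwise similarities, can be sketched as follows. The example uses single-linkage clustering with a union-find structure: two proteins end up in the same family as soon as a chain of pairs above the threshold connects them. This is only one possible similarity criterion, chosen here for brevity; the slides do not say which rule each databank applies. Protein names and similarity values are invented.

```python
def cluster_families(pairs, threshold):
    """Single-linkage clustering of proteins.

    pairs: iterable of (protein_a, protein_b, similarity_percent).
    Returns a sorted list of families, each a sorted list of names.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, similarity in pairs:
        root_a, root_b = find(a), find(b)  # also registers both proteins
        if similarity >= threshold and root_a != root_b:
            parent[root_b] = root_a

    families = {}
    for protein in list(parent):
        families.setdefault(find(protein), set()).add(protein)
    return sorted(sorted(members) for members in families.values())

toy_pairs = [
    ("P1", "P2", 72.0),
    ("P2", "P3", 55.0),
    ("P3", "P4", 31.0),
    ("P4", "P5", 90.0),
]
print(cluster_families(toy_pairs, threshold=50))
print(cluster_families(toy_pairs, threshold=80))
```

Raising the threshold splits families into smaller groups, which is the idea behind hierarchical levels such as families and subfamilies.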
ProtFam
Developed at MIPS.
Built with PIR sequences.
Includes four levels of classification:
Superfamilies (based on function and similarity
criteria).
Families (50% similarity).
Subfamilies (80% similarity).
Entries (≥95% similarity).
ProtFam characteristics
Allows visualization of alignments and
dendrograms for the families.
Integrates Pfam domains.
Allows users to classify their own protein
sequences.
Web server:
http://mips.gsf.de
ProtoMap
Initially developed at the Hebrew University
of Jerusalem; now hosted at Cornell
University.
Built with SWISS-PROT & TrEMBL
sequences.
Combines 3 sequence similarity measures
(BLASTP, FASTA and Smith-Waterman).
ProtoMap characteristics
Alignments and trees are visualized with
Java applets.
Users can submit sequences and classify
them.
Web server:
http://protomap.cornell.edu/index.html
Specialized systems
HOVERGEN (Homologous Vertebrate
Genes Database):
Based on GenBank CDS.
HOBACGEN (Homologous Bacterial Genes
Database) for prokaryotes and yeast:
Based on SWISS-PROT/TrEMBL.
HOBACGEN-CG for completely sequenced
genomes:
Based on SWISS-PROT/TrEMBL.
Other specialized systems
COG (Clusters of Orthologous Groups), also
for complete genomes:
Based on GenBank CDS.
NuReBase (Nuclear Receptors Database) for
mammalian nuclear receptors:
Based on EMBL CDS.
RTKdb (Tyrosine Kinase Receptors):
Based on EMBL CDS.
Are COGs real orthologs?
[Figure: tree of glutamate synthase large subunit sequences (including GLTB_ECOLI, GLTB_BACSU, GLTB_SYNY3, GLTS_SYNY3 and GLT1_YEAST) with numeric values on the branches; the legend marks reciprocal best BLAST hits for Escherichia coli, Bacillus subtilis, Pseudomonas aeruginosa, Vibrio cholerae and Synechocystis sp.]
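The legend of this figure refers to reciprocal best BLAST hits. A minimal sketch of that criterion is given below: gene a in genome A and gene b in genome B are paired when b is the best-scoring hit of a and a is the best-scoring hit of b. The score dictionaries stand in for parsed BLAST results; gene names and scores are invented. As the slide title suggests, a pair found this way is not guaranteed to be a true ortholog pair.

```python
def best_hits(scores):
    """scores: {(query, subject): score}. Returns {query: best subject}."""
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) that are each other's best hit, sorted by name."""
    forward, backward = best_hits(a_vs_b), best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)

# Hypothetical scores: a1 and a2 are paralogs that both hit b1 best.
a_vs_b = {("a1", "b1"): 900, ("a1", "b2"): 300,
          ("a2", "b1"): 500, ("a2", "b2"): 200}
b_vs_a = {("b1", "a1"): 880, ("b1", "a2"): 480,
          ("b2", "a1"): 310, ("b2", "a2"): 190}

rbh = reciprocal_best_hits(a_vs_b, b_vs_a)
print(rbh)
```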
Beyond protein families
ProtFam, HOVERGEN, HOBACGEN and COGs
gather protein sequences that are homologous over
their whole length.
Patterns, profiles, domains, …
are covered in Terry Attwood’s lecture.
Non-sequence data
Data: available systems
Gene expression: GXD (Mouse Gene Expression Database); The Stanford Microarray Database.
Mapping: GDB (Genome Data Base); EMG (Encyclopedia of Mouse Genome); MGD (Mouse Genome Database); INE (Integrated Rice Genome Explorer).
Protein quantification: SWISS-2DPAGE; PDD (Protein Disease Database); Sub2D (B. subtilis 2D Protein Index).
3D structures: PDB (Protein Data Bank); MMDB (Molecular Modelling Data Base); NRL_3D (Non-Redundant Library of 3D Structures); SCOP (Structural Classification of Proteins).
Polymorphism: ALFRED (Allele Frequency Database).
Molecular interactions: DIP (Database of Interacting Proteins); BIND (Biomolecular Interaction Network Database).
Sequence Data retrieval
Made mainly through Internet access:
With client software (e.g., Entrez, HobacFetch).
By remote connections to servers providing online access to the banks (INFOBIOGEN).
Using World-Wide Web servers and browsers
Advantages and limitations
Users do not have to cope with the usual
database problems:
Storing large amounts of data.
Daily updates.
Software upgrades.
Simplicity of use.
Net access is sometimes very slow at peak
hours:
Consider using servers other than NCBI.
The ACNUC retrieval system
Direct access to functional regions described in
feature tables (CDS, tRNA, rRNA).
Selection of entries using various criteria:
Sequence names and accession numbers.
Bibliographic criteria.
Keywords.
Taxonomy.
Organelle.
Developed at Lyon University
ACNUC: access options
Graphical interface distributed along with
the databases themselves.
http://pbil.univ-lyon1.fr/databases/acnuc.html
Web access at Pôle Bio-Informatique
Lyonnais (PBIL):
http://pbil.univ-lyon1.fr/search/query.html
ACNUC characteristics
Can query any bank in PIR, SWISS-PROT, EMBL, or GenBank format.
Keywords and species browsing.
Complex queries.
Links with sequence analysis programs on
the Web server (alignment, codon usage).
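A complex query combines simple selection criteria (species, keywords and so on) with logical operators. The sketch below shows the idea with Python sets over a tiny in-memory list. It is not ACNUC's query language or interface; the `Entry` class, the `select` helper and the TOY entries are hypothetical, and only BSAMYL and its keywords come from the example entry shown earlier.

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entry:
    name: str
    species: str
    keywords: frozenset = field(default_factory=frozenset)

def select(entries, predicate):
    """Names of the entries that satisfy one selection criterion."""
    return {entry.name for entry in entries if predicate(entry)}

ENTRIES = [
    Entry("BSAMYL", "Bacillus subtilis", frozenset({"amylase", "signal peptide"})),
    Entry("TOY0001", "Bacillus subtilis", frozenset({"protease"})),  # invented
    Entry("TOY0002", "Escherichia coli", frozenset({"amylase"})),    # invented
]

by_species = select(ENTRIES, lambda e: e.species == "Bacillus subtilis")
by_keyword = select(ENTRIES, lambda e: "amylase" in e.keywords)

both = by_species & by_keyword          # species AND keyword
either = by_species | by_keyword        # species OR keyword
species_only = by_species - by_keyword  # species AND NOT keyword
print(sorted(both), sorted(either), sorted(species_only))
```

Keeping each criterion as a named set makes it easy to reuse intermediate results when a query is refined step by step.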
The Query form
[Screenshot: building queries to the sequence databases.]
Retrieving sequences
[Screenshot: the received sequence data can be saved locally.]
Browsing the species trees
[Screenshot.]
HOVERGEN: families of homologous vertebrate genes
[Screenshot: access to family members; download of the tree or alignment.]
SRS
Public version developed at EMBL by
Etzold and Argos (1993).
Presently available on the different Web
servers belonging to EMBnet:
EBI (England).
INFOBIOGEN (France).
DKFZ (Germany).
…
Characteristics
Database index built with the use of ODD
(Object Design and Definition).
More than 250 databanks have been indexed
and are accessible through 35 SRS servers.
Allows queries to operate simultaneously on
different banks.
Databanks interconnection
[Figure: network of links between databanks: Blocks, MIMMAP, REBASE, PDBFINDER, ALI, OMIM, PROSITE, ProDom, SWISSNEW, ENZYME, DSSP, PROSITEDOC, SWISSDOM, HSSP, FSSP, GenBank, PDB, MOLPROBE, SWISS-PROT, NRL_3D, YPDREF, ECDC, EPD, EMBL, EMNEW, PMD, YPD, TFSITE, TrEMBLNEW, ProtFam, FlyGene, TrEMBL, PIR, TFACTOR.]
Entrez
Developed by Schuler et al. (1996) at NCBI.
Queries several US-maintained databases:
GenBank, GenPept, NR, MMDB, MEDLINE.
Access through client software (Unix, Mac or
Windows) or Web server:
http://www.ncbi.nlm.nih.gov
Characteristics
Introduces the concept of neighbours between sequences, references and structures.
Sequence neighbours are established using similarity criteria.
No access to multiple alignments.
[Figure: Entrez links between Refs. (PubMed), Phylogeny (Taxman), Nucl. Seq. (GenBank), Complete Genomes, Structures (MMDB) and Prot. Seq. (GenPept).]
NAR 2003 database issue
http://nar.oupjournals.org/content/vol31/issue1/