
Bioinformatics
Biological Databases
Prof. Mirko Zimic
In 1944 IBM and Harvard University unveiled the
Mark I, the first computer to meet the modern
definition. It measured 15 meters long and 2.40 m
high, and weighed 10 tons. It used electromechanical
relays.
Collecting Sequence Data
• Genome (DNA-level): Genomic sequencing
– Complete picture of genome
– Generates physical map
– Includes regulatory and other silent regions
• Transcriptome (RNA-level): Expression-library sequencing
– Expressed genes only
– Splicing / variant forms
– Can correlate with levels of expression
• Proteome (protein-level): Protein sequencing
– Insight into biological function
– Gives information on protein-protein interactions
– Post-translational modifications detected
DNA Sequencing
DNA Sequencing (Cont’d)
Fragment Assembly
Genomic DNA
Random shearing
Sequence overlapping fragments
Sequences assembled
CCAGATTACGAAATCC . . . GGCTTATACCGGCAT
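The assembly step on this slide (shear, sequence overlapping fragments, merge) can be sketched with a toy greedy merger. This is only an illustration, not the method used by any real assembler: the function names, the minimum-overlap threshold, and the three reads are hypothetical, and the target is just the first 16 bases shown on the slide.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(fragments, min_len=3):
    """Repeatedly merge the pair of fragments with the largest overlap."""
    frags = list(dict.fromkeys(fragments))  # drop exact duplicates, keep order
    while len(frags) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no overlaps left; the remaining contigs stay separate
        merged = frags[best_i] + frags[best_j][best_n:]
        frags = [f for k, f in enumerate(frags) if k not in (best_i, best_j)]
        frags.append(merged)
    return frags


# Hypothetical overlapping reads covering the first 16 bases on the slide.
reads = ["CGAAATCC", "CCAGATTAC", "ATTACGAAA"]
contigs = greedy_assemble(reads)
print(contigs)
```

Real data adds sequencing errors, repeats, and both strands, which is why this greedy approach is only a teaching device.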
Sequencing from Expression Libraries
Exon 1 Exon 2 Exon 3 Exon 4 Exon 5
Gene
Introns
Transcription / splicing / processing
AAA…AAA
mRNA
Reverse transcriptase
AAA…AAA
TTT…TTT
Sequence
Transcriptome
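As a rough sketch of the path on this slide (splicing, poly-A tail, reverse transcription), the snippet below joins exons, appends a tail, and derives the first-strand cDNA as the reverse complement of the mRNA. The sequence, exon coordinates, and tail length are made up for illustration.

```python
# Complement RNA bases into DNA bases (A->T, C->G, G->C, U->A).
RNA_TO_DNA_COMPLEMENT = str.maketrans("ACGU", "TGCA")


def splice(pre_mrna, exons):
    """Join exon intervals given as half-open (start, end) index pairs."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def first_strand_cdna(mrna):
    """First-strand cDNA (written 5'->3') for an mRNA written 5'->3'."""
    return mrna.upper().translate(RNA_TO_DNA_COMPLEMENT)[::-1]


# Hypothetical pre-mRNA: exon, intron, exon.
pre_mrna = "AUGGCC" + "GUAAGU" + "AAAUAA"
mature = splice(pre_mrna, [(0, 6), (12, 18)]) + "A" * 6  # add a short poly-A tail
cdna = first_strand_cdna(mature)
print(mature)
print(cdna)
```

The cDNA begins with a run of T, matching the TTT…TTT strand paired against the AAA…AAA tail in the figure.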
Protein Sequencing
Digital Storage of Sequence Data
• Bit: A binary digit represented in a digital circuit; only two states
are recognized, 0 and 1 (usually about 0 V and +5 V, respectively).
• Byte: A grouping of 8 bits into a larger unit. Bits are usually
numbered 0-7 (not 1-8!).
• ASCII: Acronym for American Standard Code for Information
Interchange. Representation of alphanumeric and some special
characters as integers stored in 1 byte. Standard ASCII defines the
128 codes {0 ... 127} (7 bits); an 8-bit unsigned byte spans {0 ... 255}
(the set {2^0-1 ... 2^8-1}), and the upper half is used by extended
character sets. The ASCII character set also includes nonprinting
control characters such as carriage return (CR) or line feed (LF).
Minimum storage requirement for human genome data represented as
ASCII characters: 3 × 10^9 bytes (3000 Mbytes) or about 5 CD-ROMs,
exclusive of annotations or other data.
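The storage estimate above can be checked with a few lines of arithmetic. This is a sketch under two assumptions that are not on the slide: a CD-ROM capacity of 650 × 10^6 bytes, and a 2-bit-per-base packing shown only for comparison.

```python
import math

GENOME_BASES = 3 * 10**9       # genome size used on the slide
CD_ROM_BYTES = 650 * 10**6     # assumed capacity of one CD-ROM


def storage_bytes(n_bases, bits_per_base):
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return math.ceil(n_bases * bits_per_base / 8)


ascii_bytes = storage_bytes(GENOME_BASES, 8)   # one ASCII character per base
packed_bytes = storage_bytes(GENOME_BASES, 2)  # A, C, G, T need only 2 bits
cd_roms = math.ceil(ascii_bytes / CD_ROM_BYTES)

print(ascii_bytes, packed_bytes, cd_roms)
print([ord(base) for base in "ACGT"])  # ASCII codes of the four bases
```

Packing four bases per byte cuts the requirement to a quarter, at the cost of losing ambiguity codes such as N.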
Number Systems
Dec  Bin    Octal  Hex    Dec  Bin    Octal  Hex
0    0      0      0      10   1010   12     A
1    1      1      1      11   1011   13     B
2    10     2      2      12   1100   14     C
3    11     3      3      13   1101   15     D
4    100    4      4      14   1110   16     E
5    101    5      5      15   1111   17     F
6    110    6      6      16   10000  20     10
7    111    7      7      17   10001  21     11
8    1000   10     8      18   10010  22     12
9    1001   11     9      19   10011  23     13
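The table can be regenerated, and the conversions reversed, with Python's built-in formatting. The helper name `to_bases` is just an illustrative choice.

```python
def to_bases(n):
    """Return the binary, octal, and hexadecimal spellings of a non-negative integer."""
    return format(n, "b"), format(n, "o"), format(n, "X")


# Print decimal 0-19 in all four number systems.
for n in range(20):
    binary, octal, hexadecimal = to_bases(n)
    print(f"{n:<4} {binary:<6} {octal:<6} {hexadecimal}")

# Converting back: int() accepts the base as its second argument.
print(int("10011", 2), int("23", 8), int("13", 16))
```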
The ASCII Table
Extended ASCII Characters
Nucleic-acid Base Codes
Symbol  Meaning             Symbol  Meaning
A       A                   S       G or C
G       G                   W       A or T
C       C                   H       A, C, or T (~G)
T       T                   B       C, G, or T (~A)
R       A or G              V       A, C, or G (~T)
Y       C or T              D       A, G, or T (~C)
M       A or C              N       A, C, G, or T
K       G or T
Adapted from Mount, Bioinformatics, Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, NY (2001)
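One common use of these ambiguity codes is pattern matching. The sketch below stores the table as a dictionary and expands a pattern into a regular expression; the function names and the example patterns are illustrative only.

```python
import re

# Ambiguity codes from the table above, each mapped to the bases it allows.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Expand each symbol into a literal base or a character class."""
    parts = []
    for symbol in pattern.upper():
        bases = IUPAC[symbol]
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def matches(pattern, sequence):
    """True if the whole sequence is consistent with the ambiguity pattern."""
    return re.fullmatch(iupac_to_regex(pattern), sequence.upper()) is not None


print(iupac_to_regex("GAYN"))
print(matches("TATAWR", "TATAAG"), matches("TATAWR", "TATACG"))
```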
Amino-acid Codes
1-letter  3-letter  Amino Acid       1-letter  3-letter  Amino Acid
Code      Code                       Code      Code
A         Ala       alanine          N         Asn       asparagine
C         Cys       cysteine         P         Pro       proline
D         Asp       aspartic acid    Q         Gln       glutamine
E         Glu       glutamic acid    R         Arg       arginine
F         Phe       phenylalanine    S         Ser       serine
G         Gly       glycine          T         Thr       threonine
H         His       histidine        V         Val       valine
I         Ile       isoleucine       W         Trp       tryptophan
K         Lys       lysine           X         Xxx       undetermined
L         Leu       leucine          Y         Tyr       tyrosine
M         Met       methionine       Z         Glx       Glu or Gln
Adapted from Mount, Bioinformatics, Cold Spring Harbor Laboratory
Press, Cold Spring Harbor, NY (2001)
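The same table is convenient as a lookup when converting between the two notations. A minimal sketch, with a hypothetical helper name and a made-up peptide:

```python
# One-letter to three-letter codes, as listed in the table above.
ONE_TO_THREE = {
    "A": "Ala", "C": "Cys", "D": "Asp", "E": "Glu", "F": "Phe",
    "G": "Gly", "H": "His", "I": "Ile", "K": "Lys", "L": "Leu",
    "M": "Met", "N": "Asn", "P": "Pro", "Q": "Gln", "R": "Arg",
    "S": "Ser", "T": "Thr", "V": "Val", "W": "Trp", "X": "Xxx",
    "Y": "Tyr", "Z": "Glx",
}


def to_three_letter(peptide):
    """Spell a one-letter peptide string in three-letter codes."""
    return "-".join(ONE_TO_THREE[residue] for residue in peptide.upper())


print(to_three_letter("MAW"))
```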
The exponential growth of molecular
sequence databases & cpu power —
Year  BasePairs     Sequences
1982  680338        606
1983  2274029       2427
1984  3368765       4175
1985  5204420       5700
1986  9615371       9978
1987  15514776      14584
1988  23800000      20579
1989  34762585      28791
1990  49179285      39533
1991  71947426      55627
1992  101008486     78608
1993  157152442     143492
1994  217102462     215273
1995  384939485     555694
1996  651972984     1021211
1997  1160300687    1765847
1998  2008761784    2837897
1999  3841163011    4864570
2000  11101066288   10106023
2001  14396883064   13602262
Doubling time: approximately one year
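The doubling time can be estimated directly from the table by assuming steady exponential growth between two rows. Using only the 1982 and 2001 base-pair counts (a deliberately crude two-point fit), the average comes out at roughly 1.3 years, in line with the order of magnitude quoted on the slide.

```python
import math


def doubling_time(year0, count0, year1, count1):
    """Average doubling time in years, assuming exponential growth between two points."""
    return (year1 - year0) * math.log(2) / math.log(count1 / count0)


# First and last rows of the base-pair column.
dt = doubling_time(1982, 680338, 2001, 14396883064)
print(f"average doubling time: {dt:.2f} years")
```

Individual years deviate from the average; for example, the table shows almost a threefold jump between 1999 and 2000.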
What are sequence databases?
These databases are an organized way to store the tremendous
amount of sequence information accumulating worldwide. Most
have their own specific format.
North America: the National Center for Biotechnology Information (NCBI),
a division of the National Library of Medicine (NLM), at the National
Institutes of Health (NIH), has GenBank & GenPept.
Europe: the European Molecular Biology Laboratory (EMBL), the
European Bioinformatics Institute (EBI), and the Swiss Institute of
Bioinformatics’ (SIB) Expert Protein Analysis System (ExPASy), all help
maintain the EMBL Nucleotide Sequence Database, and the SWISS-PROT &
TrEMBL amino acid sequence databases.
Asia: The National Institute of Genetics (NIG) supports the Center for
Information Biology’s (CIB) DNA Data Bank of Japan (DDBJ).
More organization stuff —
Nucleic acid sequence databases (and TrEMBL) are split into
subdivisions based on taxonomy (historical rankings — the Fungi
warning!). PIR is split into subdivisions based on level of
annotation. TrEMBL sequences are merged into SWISS-PROT
as they receive increased levels of annotation.
• Nucleic Acid DB’s
– GenBank/EMBL/DDBJ
• all Taxonomic categories
• “Tags”
– EST’s
– GSS’s
• Amino Acid DB’s
– SWISS-PROT
• TrEMBL
– PIR
• PIR1
• PIR2
• PIR3
• PIR4
• NRL_3D
– GenPept
• TrEMBL contains the translations of all coding
sequences (CDS) present in the EMBL Nucleotide
Sequence Database that are not yet
integrated into SWISS-PROT.
• PIR (Protein Information Resource) produces the
Protein Sequence Database (PSD) of
functionally annotated protein sequences, which
grew out of the Atlas of Protein Sequence and
Structure (1965-1978), edited by Margaret Dayhoff.
TrEMBL (protein translated from EMBL)
EMBL (DNA)
SWISS-PROT (sequenced proteins, curated)
PIR
GenBank
DDBJ
PROSITE
What about other types of biological databases?
• Three dimensional structure databases:
• the Protein Data Bank and Rutgers Nucleic Acid Database.
• These databases contain all of the 3D atomic coordinate data
necessary to define the tertiary shape of a particular biological
molecule. The data is usually experimentally derived, either
by X-ray crystallography or with NMR, but sometimes it is a
hypothetical model. In all cases the source of the structure and
its resolution are clearly indicated.
• Secondary structure boundaries, sequence data, and reference
information are often associated with the coordinate data, but it
is the 3D data that really matters, not the annotation.
Other types of Biological DB’s —
• Still more; these can be considered ‘non-molecular’:
• Genomic linkage mapping databases for most large genome projects (w/ pointers to
sequences): H. sapiens, Mus, Drosophila, C. elegans, Saccharomyces,
Arabidopsis, E. coli, . . . .
• Reference Databases (also w/ pointers to sequences): e.g.
– OMIM: Online Mendelian Inheritance in Man
– PubMed/MedLine: over 11 million citations from more than 4 thousand bio/medical
scientific journals.
• Phylogenetic Tree Databases: e.g. the Tree of Life.
• Metabolic Pathway Databases: e.g. WIT (What Is There) and Japan’s GenomeNet
KEGG (the Kyoto Encyclopedia of Genes and Genomes).
• Population studies data: which strains, where, etc.
And then databases that many biocomputing people don’t even usually
consider:
• e.g. GIS/GPS/remote sensing data, medical records, census counts, mortality and
birth rates . . . .
Large Databases
• Once upon a time, GenBank sent out
sequence updates on CD-ROM disks a few
times per year.
• Now GenBank is over 40 Gigabytes
(11 billion bases)
• Most biocomputing sites update their copy of
GenBank every day over the internet.
• Scientists access GenBank directly over the
Web
Finding Genes in GenBank
• These billions of G, A, T, and C letters would
be almost useless without descriptions of what
genes they contain, the organisms they come
from, etc.
• All of this information is contained in the
"annotation" part of each sequence record.
Entrez is a Tool for Finding Sequences
• GenBank is managed by the NCBI (National Center
for Biotechnology Information) which is a part of
the US National Library of Medicine.
• NCBI has created a Web-based tool called Entrez
for finding sequences in GenBank.
http://www.ncbi.nlm.nih.gov
• Each sequence in GenBank has a unique “accession
number”.
• Entrez can also search for keywords such as gene
names, protein names, and the names of organisms
or biological functions.
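Keyword and accession lookups like those described above can also be scripted. The sketch below only builds request URLs for NCBI's E-utilities, the programmatic interface to Entrez; this interface is not mentioned on the slides, so treat its use here as an assumption, and check NCBI's current documentation and usage policy before sending requests. The search term and the accession U49845 are illustrative, and nothing is fetched.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities (assumed; verify against NCBI documentation).
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="nucleotide", retmax=20):
    """URL for a keyword search that returns matching record identifiers."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS}/esearch.fcgi?{query}"


def efetch_url(accession, db="nucleotide", rettype="gb", retmode="text"):
    """URL for retrieving one record, by accession number, in GenBank flat-file format."""
    query = urlencode({"db": db, "id": accession, "rettype": rettype, "retmode": retmode})
    return f"{EUTILS}/efetch.fcgi?{query}"


print(esearch_url("BRCA1[Gene Name] AND Homo sapiens[Organism]"))
print(efetch_url("U49845"))
```

Building the URL separately from fetching it makes it easy to inspect the query first and to add rate limiting before any real request is made.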
Entrez is Internally Cross-linked
• DNA and protein sequences are linked to
other similar sequences
• Medline citations are linked to other
citations that contain similar keywords
• 3-D structures are linked to similar
structures
Databases contain more than just DNA &
protein sequences
Human Genome Project
The genome sequence is almost complete!
– approximately 3.5 billion (3.5 × 10^9) base pairs.
Raw Genome Data
Gene finding
Data Quality Issues
Bioinformatics Databases
• Usually organised in flat files
• Huge collection of data
• Include alphanumeric and pictorial data
• Latest databases have gene/protein expression data
(images)
Demand
• High quality curated data
• Interconnectivity between data sets
• Fast and accurate data retrieval tools
– queries using fuzzy logic
• Excellent data mining tools
– For sequence and structural patterns
Errors in DNA sequence and Data Annotation
• Current technology should reduce error rates to as
low as 1 base in 10000 as every base is sequenced
between 6-10 times and at least one reading per
strand.
• Therefore, in a prokaryote, where a typical gene spans
roughly 1000 bases, a rate of 1 isolated wrong base in 10000
would result in one amino acid error in ~10-15 proteins.
• In the human genome, gene-dense regions contain about
1 gene per 10000 bases, with the average estimated at 1
gene per 30000 bases.
• Therefore, the corresponding error rate would be
roughly one amino acid substitution in 100 proteins.
• But large-scale errors in sequence assembly can also
occur. Missing a nucleotide can cause a frameshift
error.
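The difference between an isolated wrong base and a missing base is easy to see by translating a short sequence. The sketch below uses the standard genetic code in its compact form; the 24-base sequence is invented for illustration, and translation runs over every complete codon rather than stopping at the first stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with T, C, A, G at each position; '*' is stop.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate every complete codon, reading from the first base."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


original = "ATGGCCATTGTAATGGGCCGCTGA"                    # hypothetical coding sequence
substituted = original[:4] + "T" + original[5:]          # one wrong base
missing_base = original[:4] + original[5:]               # one base lost in assembly

print(translate(original))
print(translate(substituted))
print(translate(missing_base))
```

The substitution changes a single residue (MAIVMGR* becomes MVIVMGR*), while the deletion shifts the reading frame and alters everything downstream (MAL*WAA), including a premature stop.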
DNA data …
• The DNA databases (EMBL/ GenBank/ DDBJ)
carry out quality checks on every sequence
submitted.
• No general quality control algorithm is yet in
widespread use.
• Some annotations are hypothetical because they
are inferences derived from the sequences.
– E.g., identification of coding regions. These inferences
have error rates of their own.