Genome Databases and Open Access Resources

Genome Databases and Open Access Bibliographic Resources
Antonio Basílio de Miranda
Laboratório de Genômica Funcional e Bioinformática
Instituto Oswaldo Cruz
Fundação Oswaldo Cruz
Rio de Janeiro - Brazil
• Outline
• General introduction and overview of complete
genome sequences
• Genomes databases and where to find them
• Comparative Genomics Databases
• Other Omics resources
• Bibliographic/Open access resources
Why use databases?
• In the genomic era we have billions of data points that need to be stored, curated and made accessible for analysis and knowledge discovery.
• Databases are essential resources for both experimental and computational biologists.
• We have crossed the terabyte threshold of genomic data.
And what is a database system?
From the Oxford Dictionary:
• Database: an organized body of related information.
• Database system, Database Management System (DBMS): a software system that facilitates the creation, maintenance and use of an electronic database.
Common database models:
Hierarchical
Network
Relational
Object-relational
Object
Other models:
Associative
Concept-oriented
Entity-Attribute-Value
Multi-dimensional
Semantic data model
Semi-structured
Star schema
XML database
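As an illustration of the relational model listed above, here is a minimal sketch using Python's built-in sqlite3 module. The table name, the column names and the helper functions are assumptions made for this example; the rows reuse figures from the genome timeline in these slides (the yeast gene count is the approximate value given there).

```python
import sqlite3

# Rows taken from the timeline in these slides:
# (organism, year, genome size in kb, gene count; the yeast count is approximate).
GENOMES = [
    ("Haemophilus influenzae", 1995, 1830, 1713),
    ("Methanococcus jannaschii", 1996, 1664, 1773),
    ("Saccharomyces cerevisiae", 1997, 12057, 6000),
]


def build_database(rows):
    """Create an in-memory relational database holding one 'genome' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE genome ("
        " organism TEXT PRIMARY KEY,"
        " year INTEGER,"
        " size_kb INTEGER,"
        " genes INTEGER)"
    )
    conn.executemany("INSERT INTO genome VALUES (?, ?, ?, ?)", rows)
    return conn


def gene_density(conn):
    """Return (organism, genes per kb) rows, highest density first."""
    query = (
        "SELECT organism, ROUND(1.0 * genes / size_kb, 2) AS density "
        "FROM genome ORDER BY density DESC"
    )
    return conn.execute(query).fetchall()


conn = build_database(GENOMES)
for organism, density in gene_density(conn):
    print(organism, density)
```

The point of the sketch is that the data are described once (the table definition) and can then be queried declaratively, which is what a DBMS adds over a flat file.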
What is stored:
Nucleotide sequences
Protein sequences
Genomes
Patterns
Structures
Etc.
Some problems:
Different data formats and technologies
Different types of data
Size
Redundancy
“Hereditary” mistakes
Inconsistent annotations
Different formats – C. trachomatis pyruvate kinase
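To make the format problem concrete, the sketch below parses one widely used flat-file format, FASTA. FASTA is chosen here only as an example and is not named on the slide. The records are hypothetical placeholders, not the real C. trachomatis pyruvate kinase entry.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Hypothetical records: identifiers and residues are placeholders only.
EXAMPLE = """>example_1 placeholder protein
MKTAYIAK
QRQISFVK
>example_2 another placeholder
MSHHWGYG
"""

for header, sequence in parse_fasta(EXAMPLE):
    print(header.split()[0], len(sequence))
```

Every other format needs its own parser of this kind, which is why moving data between databases is rarely trivial.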
Completely sequenced genomes – a timeline
1977 first viral genome (5386 base pairs; encoding 11
genes). Sanger et al. sequence bacteriophage φX174.
1981 Human mitochondrial genome. 16,500 base pairs
(encodes 13 proteins, 2 rRNA, 22 tRNA).
1986 Chloroplast genome. 156,000 base pairs (most are
120 kb to 200 kb).
1995 first genome of a free-living organism, the bacterium
Haemophilus influenzae, by TIGR, 1830 Kb, 1713 genes.
1996 first archaeal genome: Methanococcus
jannaschii DSM 2661, by TIGR, 1664 Kb, 1773 genes.
1997 first eukaryotic genome: Saccharomyces cerevisiae
S288C; international collaboration; 16 chromosomes; 12,057
Kb, ~6000 genes.
1998 first multicellular organism: the nematode Caenorhabditis
elegans; 97 Mb; ~19,000 genes.
1999: first human chromosome: Chromosome
22 (49 Mb, 673 genes).
2000 Fruitfly Drosophila melanogaster (137 Mb; ~13,000
genes).
2000 first plant genome: Arabidopsis thaliana (115,428
Kb; 22670 genes)
2001 draft sequence of the human genome (3300 Mb;
~28000 genes)
2002 Plasmodium falciparum (22.9 Mb; 5334 genes)
2002 mouse genome (2700 Mb; ~28000 genes)
2004 draft genome of the fish Tetraodon nigroviridis (x Mb;
~28000 genes)
2005 Dog (~2400 Mb, 33651 genes) and chicken genomes
(18031 genes)
2007 James Watson’s genome is sequenced.
2007 Craig Venter publishes the results of his own
sequenced genome.
October 2013 Deadline for the X Prize Foundation
challenge to sequence 100 human genomes for less than
$10,000 each.
www.genomesonline.org
3825 projects
• 827 published (06-29-08)
• 1842 bacteria
• 90 archaea
• 936 eukaryotes
• 130 metagenomes
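A quick arithmetic check of the GOLD figures above, as a minimal sketch: the five listed counts add up to the stated total of 3825 projects. Reading the organism-group counts as projects that were not yet published is an inference from that sum, not something the slide states.

```python
# Counts copied from the slide (GOLD, 06-29-08).
counts = {
    "published": 827,
    "bacteria": 1842,
    "archaea": 90,
    "eukaryotes": 936,
    "metagenomes": 130,
}

total = sum(counts.values())
print("total projects:", total)

# Share of each category in the total, in percent (one decimal place).
for name, n in counts.items():
    print(f"{name}: {100 * n / total:.1f}%")
```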
Genome sequencing projects
There are several web-based resources that
document the progress of completely
sequenced genomes and their reference
publications, including:
GOLD - Genomes Online Database
http://www.genomesonline.org/gold.cgi
How big are genomes?
Viral genomes: 1 kb to 360 kb (Canarypox virus)
Note: Mimivirus: 1.2 Mb (http://www.giantvirus.org/top.html )
(Top 100 largest viral genome sequences)
Bacterial genomes: 0.5 Mb to 13 Mb;
Eukaryotic genomes: 8 Mb to 670 Gb;
Database of Genome sizes:
http://www.cbs.dtu.dk/databases/DOGS/index.php
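The ranges above mix kb, Mb and Gb, so they are easier to compare once converted to base pairs. The following sketch does that conversion and computes how many times larger the upper bound of each range is than the lower bound. The dictionary layout and function names are assumptions for illustration; the numbers are the ones quoted on this slide.

```python
UNIT_TO_BP = {"bp": 1, "kb": 10**3, "Mb": 10**6, "Gb": 10**9}

# (low, high) bounds of each range as (value, unit), copied from the slide.
RANGES = {
    "viral": ((1, "kb"), (360, "kb")),
    "bacterial": ((0.5, "Mb"), (13, "Mb")),
    "eukaryotic": ((8, "Mb"), (670, "Gb")),
}


def to_bp(value, unit):
    """Convert a genome size given in bp, kb, Mb or Gb to base pairs."""
    return round(value * UNIT_TO_BP[unit])


def fold_range(low, high):
    """Return how many times larger the upper bound is than the lower bound."""
    return to_bp(*high) / to_bp(*low)


for group, (low, high) in RANGES.items():
    print(group, to_bp(*low), to_bp(*high), fold_range(low, high))
```

With these figures, eukaryotic genome sizes span a range of 83,750-fold, against 360-fold for viruses and 26-fold for bacteria.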
[Figure: bar chart "Genome size and database increase"; y-axis: Size (0 to 3500); bars: E. coli, Yeast, Worm, Fly, Fugu, Human]
BIOLOGICAL DATABASE CATEGORIES
• Databases of nucleic acid sequences (RNA, DNA)
• Databases of protein sequences
• Databases of protein motifs and protein domains
• Databases of structures
• Databases of genomes
• Databases of genes
• Databases of expression profiles
• Databases of SNPs and mutations
• Databases of metabolic pathways and protein associations
• Databases of taxonomy
•…
Can we find a list of ‘clean’ databases?
• The NAR database issue
• The 2008 update includes 1078 databases, 110 more
than the previous one.
• 98 new databases
• updates of 84 existing databases
• 25 obsolete databases removed!
• The complete database list and summaries are available
online on the Nucleic Acids Research web site
http://nar.oxfordjournals.org/
• NAR database category list
• Nucleotide Sequence Databases
• RNA sequence databases
• Protein sequence databases
• Structure Databases
• Genomics Databases (non-vertebrate)
• Metabolic and Signalling Pathways
• Human and other Vertebrate Genomes
• Human Genes and Diseases
• Microarray Data and other Gene Expression Databases
• Proteomics Resources
• Other Molecular Biology Databases
• Organelle databases
• Plant databases
• Immunological databases
• Genomics Databases (non-vertebrate)
– MGD - Mouse Genome Database
– TIGR Gene Indices
– Genome annotation terms, ontologies and
nomenclature
– Taxonomy and identification
– General genomics databases
– Viral genome databases
– Prokaryotic genome databases
– Unicellular eukaryotes genome databases
– Fungal genome databases
– Invertebrate genome databases
• Three types of genome databases:
• Databases which collect data of all sequenced genomes
(Entrez_Genomes; EBI_genomes)
• Databases which collect data of a category of organisms
with sequenced genomes (Microbial Genomes at TIGR)
• Databases specific for one organism with sequenced
genomes (Flybase, MGD, Ensembl)
• What kind of information do you find there?
• Genome databases contain genomic information
collected from many sources.
– Genome assembly
– Gene predictions
– Known genes, mRNA, ESTs, proteins
– Genetic maps, markers and polymorphisms
– Gene expression and phenotypes
– Annotations
– Interspecies homologues
Resources for genomes
There are two main resources for genomes:
EBI
European Bioinformatics Institute
http://www.ebi.ac.uk/genomes/
NCBI
National Center for Biotechnology Information
http://www.ncbi.nlm.nih.gov/Genomes/
But there are many other resources from sequencing institutions:
Sanger
The Wellcome Trust Sanger Institute
http://www.sanger.ac.uk/
TIGR
The Institute for Genomic Research
http://cmr.tigr.org/tigr-scripts/CMR/shared/Genomes.cgi
Genolevures
http://cbi.labri.fr/Genolevures/index.php
Databases by phylogenetic groups
Eukaryotic genomes:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi
Bacteria, fungi genomes:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=11:Fungi&taxgroup=11:Fungi|12:
Insects:
http://www.ncbi.nlm.nih.gov/genomes/leuks.cgi?p3=12:Insects&taxgroup=11:|12:Insects
Plant genomes:
http://www.ncbi.nlm.nih.gov/genomes/PLANTS/PlantList.html
...
The Entrez System
[Figure: the Entrez system, with Entrez at the centre linked to OMIM, PubMed, PubMed Central, 3D Domains, Journals, Structure, Books, CDD/CDART, Protein, Taxonomy, Genome, GEO/GDS, UniSTS, UniGene, Nucleotide, SNP and PopSet]
[Figure labels: WGS, Other, GenBank, RefSeq, Contig, BAC, RefSeq Transcript, Mouse assembly, UniGene Transcript, Maps and Options]
Some common features of genomic databases:
• Possibility to download all the sequences of the genome or part of them (chromosomes, clones, genes, CDS, ...)
• Most of them have a corresponding protein resource (the set of proteins obtained by translating all CDS – conceptual translation)
• Example: Entrez Genome at NCBI, with GenPept as the protein resource
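Conceptual translation, mentioned above, can be sketched in a few lines. This example assumes the standard genetic code (organelles and some organisms use other tables) and a CDS that is already in frame. The function name and the input sequences are made up for illustration.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code: amino acids for the 64 codons in TCAG order
# (TTT, TTC, TTA, TTG, TCT, ...); '*' marks a stop codon.
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate_cds(cds):
    """Translate an in-frame coding sequence up to the first stop codon.

    Codons containing unexpected characters are translated as 'X';
    an incomplete trailing codon is ignored.
    """
    cds = cds.upper()
    protein = []
    for i in range(0, len(cds) - len(cds) % 3, 3):
        aa = CODON_TABLE.get(cds[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


print(translate_cds("ATGGCCTAA"))  # prints "MA"
```

Applying such a function to every annotated CDS of a genome yields the kind of protein set that the slide describes as the corresponding protein resource.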
Comparative genomics
Analyses of the genetic material of different species help in the
understanding of the similarities and differences between
genomes, their evolution and the evolution of their genes.
• Intra-genomic comparisons help in understanding the degree of duplication (genome regions; genes) and gene organization.
• Inter-genomic comparisons help in understanding the degree of similarity between genomes and the degree of conservation between genes.
• Understanding gene and genome evolution
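One simple measure behind "degree of similarity" and "degree of conservation" is percent identity between two aligned sequences. The sketch below assumes the sequences have already been aligned (gaps written as '-') and counts only columns without gaps. Other definitions use a different denominator, so this is one convention among several, and the sequences are made-up examples.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent of identical residues over the gap-free columns of an alignment."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue  # skip columns where either sequence has a gap
        compared += 1
        if a == b:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


# Made-up aligned fragments: 6 gap-free columns, 5 of them identical.
print(round(percent_identity("ATGC-TA", "ATGAGTA"), 1))
```

The tools listed on the following slides compute far richer comparisons (whole-genome alignments, conserved regions, orthologous groups); this only shows the basic quantity being compared.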
Internet resources for whole-genome comparative analysis and associated tools
UCSC Genome Bioinformatics - http://genome.ucsc.edu/
Ensembl - http://www.ensembl.org/
MapViewer - http://www.ncbi.nlm.nih.gov/mapview/
VISTA Genome Browser - http://pipeline.lbl.gov/
K-BROWSER - http://hanuman.math.berkeley.edu/cgi-bin/kbrowser2
Comparative Regulatory Genomics - http://corg.molgen.mpg.de/
GALA - http://www.bx.psu.edu/
EnsMart - http://www.ensembl.org/EnsMart/
ETOPE - http://www.bx.psu.edu/
PipMaker and MultiPipMaker - http://www.bx.psu.edu/
VISTA server - http://www-gsd.lbl.gov/vista/
MAVID server - http://baboon.math.berkeley.edu/mavid/
zPicture server - http://zpicture.dcode.org/
rVISTA server - http://rvista.dcode.org/
COGs - http://www.ncbi.nlm.nih.gov/COG/
NCBI
Homo sapiens Genome:
Statistics -- Build 36 version 2
Protein coding genes: 21,541
General considerations:
• Organism-specific databases can be more up-to-date than general databases.
• Genome databases are not a one-stop shop for all information; other databases like UniProt are still needed!
Bibliographic Databases
and Open Access resources
• PubMed - http://www.pubmed.org/
• Access to more than 12 million papers since 1950 (3790 journals).
• Simple and advanced literature search with keywords, author name, MeSH terms, journals, single citation, ...
• Some papers are free from the journal website or through the publishers.
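The fielded searches listed above (keywords, author, MeSH terms, journal) can also be composed programmatically. The sketch below only builds a query string and a request URL for NCBI's E-utilities esearch service, which is not mentioned in these slides and is brought in here as an assumption. The helper names and the example search terms are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# NCBI E-utilities search endpoint (not part of the original slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_pubmed_query(keywords=None, author=None, mesh=None, journal=None):
    """Combine optional search fields into one PubMed query string."""
    parts = []
    if keywords:
        parts.append(keywords)
    if author:
        parts.append(f"{author}[Author]")
    if mesh:
        parts.append(f'"{mesh}"[MeSH Terms]')
    if journal:
        parts.append(f'"{journal}"[Journal]')
    return " AND ".join(parts)


def esearch_url(term, retmax=20):
    """Return an esearch URL for PubMed; fetching it is left to the caller."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return ESEARCH + "?" + urlencode(params)


# Hypothetical example search.
query = build_pubmed_query(keywords="genome database", mesh="Genomics")
print(query)
print(esearch_url(query, retmax=5))
```

The same query string can be pasted into the PubMed search box, so the sketch doubles as a way to document a search strategy.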
• Free access journals
• Authors pay to allow readers to get the
papers free
• The BMC initiative
• The PLoS initiative
• Other initiatives: some journals are giving
immediate free online access and others
after a few (1-12) months from publication
• The HINARI initiative
• The Health InterNetwork Access to Research Initiative
(HINARI) provides free or very low cost online access to the
major journals in biomedical and related social sciences to
local, not-for-profit institutions in developing countries.
• HINARI was launched in January 2002, with some 1500
journals from 6 major publishers. 22 additional publishers
joined in May 2002, bringing the total number of journals to
over 2000.
• Today more than 70 publishers are offering their content in
HINARI and others will soon be joining the programme.
http://citeseer.ist.psu.edu/
Laboratório de Genômica Funcional e Bioinformática
Instituto Oswaldo Cruz
Wim Maurits Degrave – Senior Researcher
Antonio Basílio de Miranda – Associate Researcher
Nicolas Carels – Visiting Researcher
Fábio Faria da Mota – Visiting Researcher
Thomas Dan Otto – Visiting Researcher
Marcos Catanho – PhD Student (BCM – IOC)
Ana Carolina Guimarães – PhD Student (BCM – IOC)
Flávio Engelke – MSc Student (PCM - UERJ)
Monete Rajão – MSc Student (BCS – IOC)
Erica Ramos Cardoso - PIBITI Scholarship Holder