Databases in bioinformatics

Introduction to Bioinformatics databases: Nucleic Acid Databases
Dinesh Gupta, ICGEB
7/16/2015 7:46 PM
Biological databases: why?
• The need to store and communicate large datasets has grown.
• To make biological data available to scientists.
• To make biological data available in computer-readable form.
Different classifications of databases
• Type of data
– nucleotide sequences
– protein sequences
– protein sequence patterns or motifs
– macromolecular 3D structure
– gene expression data
– metabolic pathways
Different classifications of databases (continued)
• Primary or derived databases
– Primary databases: experimental results are deposited directly into the database
– Secondary (derived) databases: results of analysis of primary databases
– Aggregates of many databases
• Links to other data items
• Combination of data
• Consolidation of data
Different classifications of databases (continued)
• Technical design
– Flat files
– Relational database (SQL)
– Exchange/publication technologies (FTP, HTML, CORBA, XML, ...)
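To make the difference between the first two designs concrete, here is a minimal Python sketch. It assumes a made-up, heavily simplified flat-file record (the layout only loosely imitates a GenBank entry; the accession, organism and sequence are invented), parses it into fields, and loads it into an in-memory SQLite table. The names FLAT_RECORD, parse_flat_record and load_into_sqlite are hypothetical and chosen for illustration only.

```python
import sqlite3

# A made-up, simplified flat-file record (GenBank-like layout, not real data).
FLAT_RECORD = """\
LOCUS       DEMO0001
DEFINITION  Hypothetical example sequence.
ORGANISM    Examplea fictiva
ORIGIN
        1 atggcgtacg ttagc
//
"""


def parse_flat_record(text):
    """Turn one simplified flat-file record into a dict of field -> value."""
    fields = {}
    seq_parts = []
    in_sequence = False
    for line in text.splitlines():
        if line.startswith("//"):      # end-of-record marker
            break
        if in_sequence:
            # Keep only the letters; drop position numbers and spaces.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
        elif line.startswith("ORIGIN"):
            in_sequence = True         # sequence lines follow
        else:
            key, _, value = line.partition(" ")
            fields[key.lower()] = value.strip()
    fields["sequence"] = "".join(seq_parts)
    return fields


def load_into_sqlite(record):
    """Store the parsed record in a one-table relational database."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE entry (locus TEXT PRIMARY KEY, definition TEXT, "
        "organism TEXT, sequence TEXT)"
    )
    conn.execute(
        "INSERT INTO entry VALUES (:locus, :definition, :organism, :sequence)",
        record,
    )
    return conn


record = parse_flat_record(FLAT_RECORD)
conn = load_into_sqlite(record)
# Once the data is relational, questions become SQL queries.
row = conn.execute(
    "SELECT locus, length(sequence) FROM entry WHERE organism = ?",
    ("Examplea fictiva",),
).fetchone()
print(row)  # ('DEMO0001', 15)
```

The flat file is easy to read and exchange as text, while the relational form makes structured queries straightforward; real database entries carry many more fields than this sketch handles.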
Different classifications of databases (continued)
• Availability
– Publicly available, no restrictions
– Available, but with copyright
– Accessible, but not downloadable
– Academic, but not freely available
– Proprietary, commercial; possibly free for academics
Where do I find a database of interest?
http://www3.oup.co.uk/nar/database/c/
Nucleotide sequence databases
• EMBL, GenBank, and DDBJ are the three primary nucleotide sequence databases
• EMBL: www.ebi.ac.uk/embl/
• GenBank: www.ncbi.nlm.nih.gov/Genbank/
• DDBJ: www.ddbj.nig.ac.jp
GenBank
• An annotated collection of all publicly available nucleotide and protein sequences
• Set up in 1979 at the LANL (Los Alamos).
• Maintained since 1992 by NCBI (Bethesda).
• http://www.ncbi.nlm.nih.gov
EMBL Nucleotide Sequence Database
• An annotated collection of all publicly available nucleotide and protein sequences
• Created in 1980 at the European Molecular Biology Laboratory in Heidelberg.
• Maintained since 1994 by the EBI (Cambridge).
• http://www.ebi.ac.uk/embl.html
http://www3.ebi.ac.uk/Services/DBStats/
DDBJ: DNA Data Bank of Japan
• An annotated collection of all publicly available nucleotide and protein sequences
• Started in 1984 at the National Institute of Genetics (NIG) in Mishima.
• Still maintained at this institute by a team led by Takashi Gojobori.
• http://www.ddbj.nig.ac.jp
Other NCBI nucleic acid databases
• EST database: A collection of expressed sequence tags, or short, single-pass sequence reads from mRNA (cDNA).
• GSS database: A database of genome survey sequences, or short, single-pass genomic sequences.
• HomoloGene: A gene homology tool that compares nucleotide sequences between pairs of organisms in order to identify putative orthologs.
• HTG database: A collection of high-throughput genome sequences from large-scale genome sequencing centers, including unfinished and finished sequences.
• SNP database: A central repository for both single-base nucleotide substitutions and short deletion and insertion polymorphisms.
• RefSeq: A database of non-redundant reference sequence standards, including genomic DNA contigs, mRNAs, and proteins for known genes. Multiple collaborations, both within NCBI and with external groups, support data-gathering efforts.
• STS database: A database of sequence tagged sites, or short sequences that are operationally unique in the genome.
• UniSTS: A unified, non-redundant view of sequence tagged sites (STSs).
• UniGene: A collection of ESTs and full-length mRNA sequences organized into clusters, each representing a unique known or putative human gene annotated with mapping and expression information and cross-references to other sources.
Sequence submission
• Data come mainly from direct submissions by the authors.
• Submissions are made through the Internet:
– Web forms.
– Email.
• Sequences are shared/exchanged between the 3 centers on a daily basis:
– The sequence content of the banks is identical.
Derived databases
• CUTG: Codon usage tabulated from GenBank
http://www.kazusa.or.jp/codon/
• Genetic Codes: Deviations from the standard genetic code in various organisms and organelles
http://www.ncbi.nlm.nih.gov/Taxonomy/Utils/wprintgc.cgi?mode=c
• TIGR Gene Indices: Organism-specific databases of EST and gene sequences
http://www.tigr.org/tdb/tgi.shtml
• UniGene: Unified clusters of ESTs and full-length mRNA sequences
http://www.ncbi.nlm.nih.gov/UniGene/
• ASAP: Alternative spliced isoforms
http://www.bioinformatics.ucla.edu/ASAP
• Intronerator: Introns and alternative splicing in C. elegans and C. briggsae
http://www.cse.ucsc.edu/~kent/intronerator/
Nucleic acid structure databases
• NDB: Nucleic acid-containing structures
http://ndbserver.rutgers.edu/
• NTDB: Thermodynamic data for nucleic acids
http://ntdb.chem.cuhk.edu.hk/
• RNABase: RNA-containing structures from PDB and NDB
http://www.rnabase.org/
• SCOR: Structural classification of RNA: RNA motifs by structure, function and tertiary interactions
http://scor.lbl.gov/
Database searching tips
• Look for links to Help or Examples
• Try Boolean searches
• Be careful with UK/US spelling differences
– leukaemia vs leukemia
– haemoglobin vs hemoglobin
– colour vs color
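The Boolean-search and spelling tips can be combined in a small helper. The Python sketch below is illustrative only: the SPELLING_VARIANTS table and the function names are hypothetical, and the AND/OR syntax it emits is a common convention that should be checked against the Help page of the database being searched.

```python
# Hypothetical, deliberately tiny table of UK/US spelling pairs
# (taken from the examples on the slide).
SPELLING_VARIANTS = {
    "leukaemia": "leukemia",
    "haemoglobin": "hemoglobin",
    "colour": "color",
}


def variants(term):
    """Return both spellings of a term if it is in the table, else the term."""
    term = term.lower()
    for uk, us in SPELLING_VARIANTS.items():
        if term in (uk, us):
            return [uk, us]
    return [term]


def boolean_query(terms):
    """OR together spelling variants of each term, then AND the terms."""
    groups = []
    for term in terms:
        forms = variants(term)
        if len(forms) > 1:
            groups.append("(" + " OR ".join(forms) + ")")
        else:
            groups.append(forms[0])
    return " AND ".join(groups)


query = boolean_query(["leukemia", "human"])
print(query)  # (leukaemia OR leukemia) AND human
```

Searching with both spellings joined by OR avoids missing records annotated under the other convention.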
Exercises
• Study the statistics of the three primary nucleic acid databases: do they match?
• Look for a gene of interest in the three primary nucleic acid databases and compare the information given in each of them.
• Read the NAR database paper and browse the NAR database index site: search for different nucleic acid databases using different search terms.
• Self study:
– http://www3.oup.co.uk/nar/database/c/
– Download NAR database paper (NARDB2004) from:
ftp://cbag.sc.mahidol.ac.th/pub/Course_Materials/dinesh