Review of Biological Database Utilization
1
Biological Databases
We will discuss:
• Usefulness to the bioinformaticist
• Database types
• Search methods and tools
http://www.sequenceanalysis.com/
2
Importance of the Public Databases
• The data provide the basis for sequence-based biology
– Open access is key
• Supported by Human Genome Project, International
Nucleotide Sequence Database Collaboration and others
• The amount of biological data is enormous
– Biologists are dependent on computers for storing,
organizing, searching, manipulating, and retrieving
the data/information
3
Why Search Biological Databases?
• Generate new sequence
– Is it already in bank?
– Homologous sequences?
• Find out about the gene
– Annotation
– Literature
4
Why Search Biological Databases?
• Similar non-coding sequences
– Repetitive elements
– Regulatory regions
• Homologous proteins; protein families
• Identify and verify PCR priming sites
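As a small illustration of the last point, the sketch below checks whether a forward primer and a reverse primer both have exact priming sites on a template, and reports the amplicon they would span. The template, primers, and function names are invented for the example. A real check against a database would use a similarity search that tolerates mismatches.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (A, C, G, T only)."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq.upper()))


def find_amplicon(template: str, forward: str, reverse: str):
    """Locate exact primer sites; return (start, end, length) or None.

    The forward primer must match the template strand as written. The
    reverse primer anneals to the opposite strand, so we search for its
    reverse complement downstream of the forward site.
    """
    template = template.upper()
    start = template.find(forward.upper())
    if start == -1:
        return None
    rc = reverse_complement(reverse)
    site = template.find(rc, start + len(forward))
    if site == -1:
        return None
    end = site + len(rc)
    return start, end, end - start


# Hypothetical template and primers, chosen only to exercise the function.
template = "GGATCCATGGCTAGCTAGGCTTACGATCGATCGGAATTC"
forward = "ATGGCTAGC"
reverse = "GAATTCCGATC"
print(find_amplicon(template, forward, reverse))
```

Positions are 0-based with an exclusive end, following Python slicing, so `template[start:end]` is the predicted product.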
5
Biological Databases
Types of Databases
• Generalized databases (DNA, proteins and
carbohydrates, 3D-structures)
• Specialized databases (EST, STS, SNP, RNA,
genomes, protein families, pathways, microarray
data ...)
6
Generalized Databases
• 2 Main Classes
– DNA (nucleotide). The large databases are:
• GenBank at NCBI (US),
• EMBL at EBI (Europe - UK),
• DDBJ (Japan).
– Protein
• SWISS-PROT/TrEMBL (high level of annotation),
• PIR (Protein Information Resource).
7
Specialized Databases
• ESTs (Expressed Sequence Tags)
• STSs (Sequence-Tagged Sites)
• SNPs (Single Nucleotide Polymorphisms)
• Organismal genomic databases: human
(GDB), mouse (MGD), yeast (SGD), fly
• HTGS (High-Throughput Genomic
Sequences)
• RNA
– tRNAs, rRNAs, small RNAs & others
8
Specialized Databases
• Protein families
– PROSITE, PRINTS, BLOCKS
• Pathways: metabolic, regulatory etc.
– EMP, PathDB, KEGG
• Microarray data: expression data
– 4 major: GeneX, ArrayExpress, Stanford,
Gene Expression Omnibus (GEO)
To find specialized databases:
http://www.agr.kuleuven.ac.be/vakken/i287/bioinformatica.htm#
9
Types of Database
• Primary: archival
– experimental data with some annotation
(interpretation)
• Secondary: curated
10
What is annotation?
• Extraction, definition and interpretation of
features on the genome sequence
• Derived by integrating computational tools
and biological knowledge
– for example, known and predicted genes
• Some databases are referred to as
“annotated databases”
– means that they contain sequence, comments,
literature references, notes on experiments…
11
Curated Databases
• Records are added only after they have been
through a curation process
– checked for accuracy, additional information
(annotation)
– scientific judgments are made as data are cleaned
up and merged
• Examples of curated databases:
– SWISS-PROT, OMIM, RefSeq, LocusLink
12
SWISS-PROT
http://www.expasy.ch/sprot/
• SWISS-PROT is a curated protein sequence database that strives
to provide a high level of annotation (such as the description of the
function of a protein, its domain structure, post-translational
modifications, variants, etc.), a minimal level of redundancy, and a
high level of integration with other databases.
13
Organismal Databases
These databases often serve a specific research community
• Human
• Mouse
• Drosophila
• C. elegans
• Yeast
• Livestock
• Arabidopsis
• Maize
• Plasmodium
• Other
http://tolweb.org/tree/home.pages/linksdb.html#organismal
14
Multi-Organism Resources
www.ncbi.nlm.nih.gov
www.tigr.org
www.expasy.org
15
Biological Databases
Types of Database Search
• Text-based database search (SRS, Entrez)
• Sequence-based database search (sequence
similarity search) (BLAST, FASTA...)
• Motif-based database search (ScanProsite,
eMOTIF)
• Structure-based database search (structure
similarity search) (VAST, DALI...)
16
Database Search Tools
Text-based: querying the annotation
• SRS6 at http://srs6.ebi.ac.uk/srs6bin/cgibin/wgetz?-page+top
• ENTREZ at http://www.ncbi.nlm.nih.gov/Entrez/
• DBGET/LinkDB at http://www.genome.ad.jp/dbgetbin/www_bfind?linkdb
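To make "querying the annotation" concrete, here is a minimal sketch of a text-based search over a few in-memory records. It matches a term against descriptive fields only and never looks at the sequence. The record layout, field names, and accessions are all invented for illustration, and this is not how SRS, Entrez, or DBGET are implemented.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A toy annotated entry; all field names here are illustrative."""
    accession: str
    description: str
    organism: str
    keywords: list = field(default_factory=list)
    sequence: str = ""


def text_search(records, term):
    """Return accessions whose annotation (not sequence) mentions term."""
    term = term.lower()
    hits = []
    for rec in records:
        annotation = " ".join([rec.description, rec.organism, *rec.keywords])
        if term in annotation.lower():
            hits.append(rec.accession)
    return hits


# Made-up records; the accessions do not refer to real database entries.
records = [
    Record("TOY0001", "hexokinase", "Saccharomyces cerevisiae",
           ["glycolysis"], "MVHLGPKK"),
    Record("TOY0002", "alcohol dehydrogenase", "Drosophila melanogaster",
           ["oxidoreductase"], "MSFTLTNK"),
    Record("TOY0003", "actin", "Saccharomyces cerevisiae",
           ["cytoskeleton"], "MDSEVAAL"),
]
print(text_search(records, "saccharomyces"))
```

Searching for a sequence fragment such as "MVHL" returns nothing here, which is the practical difference between a text-based and a sequence-based search.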
17
Sequence-based Searches
Considerations:
• Should I compare DNA or protein sequences?
• More random matches with DNA
http://www.people.virginia.edu/~rjh9u/codetabl.html
• Protein “matches” may be more relevant
• DNA databases are larger
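The point about random matches can be checked with simple arithmetic. The sketch below assumes a deliberately crude model: letters are independent and equally likely, and two unrelated sequences are compared column by column without gaps. Under that assumption a 4-letter alphabet matches by chance far more often per column than a 20-letter one. Real search tools use scoring matrices and proper statistics, so treat this only as intuition.

```python
from math import comb


def expected_identities(alphabet_size: int, columns: int) -> float:
    """Expected number of identical columns between two random sequences."""
    return columns / alphabet_size


def prob_at_least(alphabet_size: int, columns: int, min_identities: int) -> float:
    """Binomial probability of seeing at least min_identities by chance."""
    p = 1.0 / alphabet_size
    return sum(
        comb(columns, k) * p**k * (1 - p) ** (columns - k)
        for k in range(min_identities, columns + 1)
    )


# Compare DNA (4 letters) with protein (20 letters) over 50 aligned columns.
for name, size in (("DNA", 4), ("protein", 20)):
    mean = expected_identities(size, 50)
    tail = prob_at_least(size, 50, 20)  # 40% identity or better by chance
    print(f"{name}: expected identities = {mean:.1f}, P(>=40% identity) = {tail:.3g}")
```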
18
Sequence-based Searches
• Sensitivity vs. Selectivity
• Sensitivity: the ability to find true positive
matches, even at the cost of admitting some false positives
• Selectivity: the ability to reject false positives
• Trade-off when choosing algorithm
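The trade-off can be put into numbers. In this sketch sensitivity is computed as TP / (TP + FN), and selectivity is taken to mean the fraction of reported hits that are true, TP / (TP + FP). That reading of selectivity is an assumption, since the slide does not give a formula. The hit counts are invented to show a permissive and a strict search setting.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of all true matches that the search actually reported."""
    return true_pos / (true_pos + false_neg)


def selectivity(true_pos: int, false_pos: int) -> float:
    """Fraction of reported matches that are true (assumed definition)."""
    return true_pos / (true_pos + false_pos)


# Hypothetical outcomes of the same query under two settings,
# with 100 true homologs present in the database.
settings = {
    "permissive": {"tp": 90, "fn": 10, "fp": 60},
    "strict": {"tp": 60, "fn": 40, "fp": 5},
}
for name, c in settings.items():
    print(
        f"{name}: sensitivity = {sensitivity(c['tp'], c['fn']):.2f}, "
        f"selectivity = {selectivity(c['tp'], c['fp']):.2f}"
    )
```

The permissive setting finds more of the true matches but buries them among false ones. The strict setting reports cleaner hits and misses more.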
19
Database Search Tools
Sequence-Based
• FASTA (FASTA at EBI, UK)
• BLAST (Basic local alignment search tool at NCBI,
USA)
• MPsrch (Smith-Waterman algorithm-based search at
EBI, UK)
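For readers unfamiliar with the Smith-Waterman algorithm mentioned above, here is a minimal sketch that computes the best local alignment score between two sequences by dynamic programming. The scoring values (match, mismatch, linear gap penalty) are arbitrary assumptions, and this is not MPsrch's implementation, which is heavily optimized and uses substitution matrices.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between sequences a and b."""
    cols = len(b) + 1
    prev = [0] * cols  # scores for the previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * cols
        for j in range(1, cols):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A cell never drops below zero: that is what makes the
            # alignment local rather than global.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("TTACGTTT", "GGACGTGG"))
```

Only two rows of the matrix are kept, which is enough for the score. Recovering the alignment itself would need the full matrix and a traceback.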
20
More Sequence-based Tools
• BLAST Microbial Genomes at
http://www.ncbi.nlm.nih.gov/Microb_blast/unfinishedgenome.html
(Search finished and unfinished genomic sequences
at NCBI)
• Genome and proteome FASTA (at EBI, UK)
at http://www2.ebi.ac.uk/fasta3/genomes.html
21
More Sequence-based Tools
• Protein search in genomes at
http://searchlauncher.bcm.tmc.edu/seqsearch/protein-search-genomes.html
(BLAST and FASTA species-specific protein sequence
searches at Baylor College of Medicine, USA)
• SectionSearch (FASTA or TFASTA search against
predefined sections of sequence databanks at IUBIO Indiana,
USA)
• NRL-3D at
http://pir.georgetown.edu/pirwww/dbinfo/nrl3d.html
(Sequence-structure database search at Johns Hopkins
University, USA)
22
Tools to Search Special Databases for
Sequences with Similar Motifs or Patterns
ProfileScan
• uses pfscan to find similarities between a
query sequence and a profile library
• PROSITE is one such database
– an ExPASy database (Expert Protein Analysis System,
http://www.expasy.ch/)
• similarities are based on fingerprints or
common patterns
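Pattern-based searching can be sketched in a few lines. The code below translates a simple PROSITE-style pattern into a Python regular expression and scans a protein sequence with it. It covers only a subset of the pattern syntax (residues, `x`, `[...]` sets, `{...}` exclusions, repeat counts, and `<` / `>` anchors). It does not attempt what pfscan does, which is scoring against profiles. The test sequence is invented. The pattern used is the well-known N-glycosylation site pattern, N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a simple PROSITE-style pattern into a regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        anchor_start = element.startswith("<")
        anchor_end = element.endswith(">")
        element = element.strip("<>")
        # Split off an optional repeat count such as (3) or (2,4).
        repeat = ""
        if element.endswith(")"):
            element, count = element[:-1].split("(")
            repeat = "{" + count + "}"
        if element == "x":
            core = "."  # any residue
        elif element.startswith("{"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element  # a single residue or a [..] set, valid as-is
        parts.append(("^" if anchor_start else "") + core + repeat
                     + ("$" if anchor_end else ""))
    return "".join(parts)


def scan(sequence: str, pattern: str):
    """Return (1-based position, matched text) for non-overlapping matches."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


print(prosite_to_regex("N-{P}-[ST]-{P}."))
print(scan("MKNGSAANPTQ", "N-{P}-[ST]-{P}."))  # made-up sequence
```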
23
BLOCKS Database
• a block is a motif or region of similar structure
• no gaps are introduced
• a block refers to the alignment, not the individual
sequences
• BLOCKS database is derived from PROSITE
• searches can be done at Fred Hutchinson Cancer
Center in Seattle
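Because a block is an ungapped alignment, every segment in it has the same width and each column can be summarized directly. The sketch below counts residues per column and derives a simple consensus. The segments are invented, and this is only an illustration of the idea of a block, not the method used to build or search the BLOCKS database.

```python
from collections import Counter


def column_counts(block):
    """Count residues in each column of an ungapped block of segments."""
    width = len(block[0])
    if any(len(segment) != width for segment in block):
        raise ValueError("segments in a block must all have the same width")
    return [Counter(segment[i] for segment in block) for i in range(width)]


def consensus(block):
    """Return the most frequent residue in each column as a string."""
    return "".join(counts.most_common(1)[0][0] for counts in column_counts(block))


# Hypothetical aligned segments from four sequences, one row per sequence.
block = ["GDSGG", "GDSGG", "GDAGG", "GNSGG"]
print(consensus(block))
print(column_counts(block)[1])
```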
24
3 Major Portals into the Genome Data
• UCSC Genome Browser at the University of California,
Santa Cruz
• Ensembl at the European Bioinformatics Institute
(EBI)
– http://www.ensembl.org
• Entrez at NCBI
– http://www.ncbi.nlm.nih.gov/Entrez/
25
Entrez Databases
• PubMed: The biomedical literature
– PUBMED database contains Medline abstracts as well as links to
full text articles on sites maintained by journal publishers
• Nucleotide sequence database (GenBank)
• Protein sequence database
• Structure: three-dimensional macromolecular
structures
• Genome: complete genome assemblies
• PopSet: population study data sets
26
Entrez Databases
• OMIM: Online Mendelian Inheritance in Man
• Taxonomy: organisms in GenBank
• Books: online books
• ProbeSet: Gene Expression Omnibus (GEO)
• 3D Domains: domains from Entrez Structure
27
Entrez sequence searching
• can find sequences for a given gene or
protein
• can download copy of sequence
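Downloading a sequence can also be scripted. The slides describe the Entrez web interface only. The sketch below assumes NCBI's E-utilities `efetch` endpoint as the programmatic route and just builds the request URL, without contacting the server. The accession is an example value to be replaced with your own.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities efetch service (an assumption of this sketch;
# the slides themselves only mention the Entrez web pages).
EUTILS_EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accession: str, db: str = "nucleotide") -> str:
    """Build a URL that requests one record in FASTA format as plain text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EUTILS_EFETCH + "?" + urlencode(params)


# Example accession only; substitute the record you are interested in.
print(efetch_fasta_url("NM_000518"))
```

The resulting URL can be opened in a browser or passed to any HTTP client. For protein records, pass `db="protein"`.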
28
NCBI BLAST
NCBI offers several “flavors” of BLAST
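The main flavors differ in the type of query and the type of database they compare. As a quick reference, the sketch below encodes the five standard programs (blastn, blastp, blastx, tblastn, tblastx) as a lookup table. The function name and type labels are made up for the example.

```python
# (query type, database type) -> BLAST programs that handle that pairing.
# blastx translates the query; tblastn translates the database;
# tblastx translates both and compares them at the protein level.
BLAST_PROGRAMS = {
    ("nucleotide", "nucleotide"): ["blastn", "tblastx"],
    ("protein", "protein"): ["blastp"],
    ("nucleotide", "protein"): ["blastx"],
    ("protein", "nucleotide"): ["tblastn"],
}


def choose_blast(query_type: str, database_type: str):
    """Return the BLAST programs suited to a query/database combination."""
    try:
        return BLAST_PROGRAMS[(query_type, database_type)]
    except KeyError:
        raise ValueError("types must be 'nucleotide' or 'protein'") from None


print(choose_blast("nucleotide", "protein"))
print(choose_blast("protein", "nucleotide"))
```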
29
The Take Home Lessons
• Search often, search with multiple parameters
• Use specialized DBs where possible, use protein
sequence if appropriate
• There are many tools available.
• You must know what tools are relevant.
• You must know how to use available tools.
• Look for sites that have multiple resources
• Google is your best friend.
31