Transcript Databases
Databases
WHY DO I HAVE TO LISTEN TO THIS?!
Database – what the heck is that?
A database is a collection of information that is organized so
that it can easily be accessed, managed, and updated.
Various types – from simple to complex ones
Flat-file, relational
Records retrieved using a query language
Are you using one??
Phone directory
Archive of bills
Birth registers
Problems with data – why do you need a DB?
Nowadays obtaining data is no problem
Having data is not in itself a reason to have a database
Problems with data that require DB:
Size
Ease of updating
Accuracy
Security
Redundancy
Importance
DBs - dissection
Four layers: Information system, Query system, Storage system, Data
Data: GenBank flat file, PDB file, interaction record, title of a book, book
Storage system: Oracle, MySQL, PostgreSQL, PC binary files, Unix text files, bookshelves
Query system: a list you look at, a catalogue, indexed files, SQL, grep
Information system: Google, Entrez, SRS, DBget
DBs – what are they made of?
Tables (entities)
• basic elements of information to track, e.g., gene, organism,
sequence, citation...
Columns (fields)
• attributes of tables, e.g. for citation table, title, journal, volume,
author...
Rows (records)
• actual data
• whereas fields describe what data is stored, the rows of a table
are where the actual data is stored
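The table / column / row vocabulary can be sketched with Python's built-in sqlite3 module. This is only an illustration: the table name citation and the extra columns year and pages are assumptions added for the example, and the single row is the Stein (2003) paper from the reading list in these slides.

```python
import sqlite3

# In-memory database, nothing is written to disk.
conn = sqlite3.connect(":memory:")

# The table (entity) is "citation"; its columns (fields) describe
# WHAT is stored. Column names beyond title/journal/volume/author
# are assumptions made for this sketch.
conn.execute(
    "CREATE TABLE citation ("
    "author TEXT, year INTEGER, title TEXT, "
    "journal TEXT, volume INTEGER, pages TEXT)"
)

# A row (record) is where the actual data lives.
conn.execute(
    "INSERT INTO citation VALUES (?, ?, ?, ?, ?, ?)",
    ("Stein, L.D.", 2003, "Integrating biological databases",
     "Nat Rev Genet", 4, "337-345"),
)

cursor = conn.execute("SELECT * FROM citation")
columns = [description[0] for description in cursor.description]
rows = cursor.fetchall()

print(columns)  # the fields
print(rows)     # the records
```

Adding a second citation means adding another row; the columns stay the same.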
Flat-File DBs
All of the data is stored in one large table
Txt file, excel…
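A flat-file database can be as simple as one comma-separated text file that is scanned line by line. The sketch below is an assumption about how the cartoon invoice data used in these slides might look when everything is kept in a single table; the helper name find_rows is hypothetical.

```python
import csv
import io

# One large table: customer details are repeated on every invoice line.
# The layout (which columns sit in the file) is assumed for illustration.
FLAT_FILE = """invoice_id,customer,address,product,price,quantity
1,Elmer,Looney Tunes Dr.,buckshot,2.00,2
2,Wiley,Southwest desert,Acme snow machine,5.00,1
3,Elmer,Looney Tunes Dr.,shotgun,25.00,1
4,Bugs,Rabbit Hole,carrots,0.50,20
"""


def find_rows(text, column, value):
    """Scan every record and keep those whose column equals value."""
    reader = csv.DictReader(io.StringIO(text))
    return [row for row in reader if row[column] == value]


elmer_rows = find_rows(FLAT_FILE, "customer", "Elmer")
addresses = {row["address"] for row in elmer_rows}

print(len(elmer_rows), "invoices for Elmer")
print("address stored", len(elmer_rows), "times:", addresses)
```

The same address is stored once per invoice, which is the redundancy and updating problem listed earlier: changing Elmer's address means editing every one of his lines.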
Relational DBs
contains multiple tables and defines the relationships between them
invoice
invoice_id  customer  product            price   quantity  total
1           Elmer     buckshot           $2.00   2         $4.00
2           Wiley     Acme snow machine  $5.00   1         $5.00
3           Elmer     shotgun            $25.00  1         $25.00
4           Bugs      carrots            $0.50   20        $10.00
customer_table
name   address           notes
Elmer  Looney Tunes Dr.  likes hunting and opera
Wiley  Southwest desert  big mail order customer
Bugs   Rabbit Hole       likes to cross dress
product_table
product            price   notes
carrots            $0.50
shotgun            $25.00  oddly flexible
buckshot           $2.00
Acme snow machine  $5.00   high defect rate
Relational DBs
Relationships can be built between tables and fields
Relational DBs – even more technical...
Get the info using Structured Query Language (SQL):
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
AND invoice.customer = customer_table.name
Result:
Wiley, Southwest desert
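To try the query, the example tables can be loaded into an in-memory SQLite database from Python. This is a minimal sketch, not the setup used in the course: the column types are assumptions, and the product string is written exactly as it appears in the invoice table because SQLite compares text case-sensitively by default.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Recreate two of the example tables; column types are assumed.
conn.executescript("""
CREATE TABLE customer_table (name TEXT PRIMARY KEY, address TEXT, notes TEXT);
CREATE TABLE invoice (
    invoice_id INTEGER PRIMARY KEY,
    customer   TEXT,
    product    TEXT,
    price      REAL,
    quantity   INTEGER,
    total      REAL
);
INSERT INTO customer_table VALUES
    ('Elmer', 'Looney Tunes Dr.', 'likes hunting and opera'),
    ('Wiley', 'Southwest desert', 'big mail order customer'),
    ('Bugs',  'Rabbit Hole',      'likes to cross dress');
INSERT INTO invoice VALUES
    (1, 'Elmer', 'buckshot',           2.00,  2,  4.00),
    (2, 'Wiley', 'Acme snow machine',  5.00,  1,  5.00),
    (3, 'Elmer', 'shotgun',           25.00,  1, 25.00),
    (4, 'Bugs',  'carrots',            0.50, 20, 10.00);
""")

# The join condition links invoice.customer to customer_table.name.
query = """
SELECT customer_table.name, customer_table.address
FROM customer_table, invoice
WHERE invoice.product = 'Acme snow machine'
  AND invoice.customer = customer_table.name
"""

result = conn.execute(query).fetchall()
print(result)
```

The address is stored only once, in customer_table, and is looked up through the relationship between the two tables.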
Biological DBs
A lot of them..
• Vary in size, quality, coverage, level of interest
• Is it any good?
  • comprehensiveness
  • accuracy
  • is up-to-date
  • good interface
  • batch search/download
  • API (web services, DAS, etc.)
DBs by data types
Sequence databases
Sequence analysis
Functional genomics
Literature databases
Structural databases
Metabolic pathway databases
Specialised databases
Confused??
http://www.oxfordjournals.org/nar/database/a/
http://www.expasy.org/links.html
DBs by scope
Comprehensive
Contain data from many organisms and many different types of
sequences
Nucleotide
GenBank (National Center for Biotechnology Information)
EMBL (European Molecular Biology Laboratory)
DDBJ (DNA Data Bank of Japan)
GenBank, EMBL & DDBJ comprise the International Nucleotide
Sequence Database Collaboration
Protein, such as Swiss-Prot
Protein Structure, such as PDB: Protein Data Bank
Genomes and Maps, such as Entrez Genomes
DBs by scope
Specialized
– Contain data from individual organisms, specific
categories/functions of sequences, or data generated by
specific sequencing technologies.
– Example: Flybase, Wormbase, etc.
DBs by level of curation
Primary databases – Archival data
Repository of information
Redundant; might have many sequence records for the same
gene, each from a different lab
Submitters maintain editorial control over their records: what
goes in is what comes out
No controlled vocabulary
Variation in annotation of biological features
GenBank/EMBL/DDBJ
UniProt
PDB
Medline (PubMed)
DBs by level of curation
Secondary (derivative) databases – Curated
data
Non-redundant; one record for each gene, or each splice
variant
Each record is intended to present an encapsulation of the
current understanding of a gene or protein, similar to a review
article
Records contain value-added information that has been
added by experts
RefSeq
Taxon
UniProt
OMIM
Literature DBs
PubMed www.ncbi.nlm.nih.gov/pubmed
Focuses on biomedicine
Integrated with other NCBI DBs and services
Uses NCBI search syntax (PubMed help)
Google Scholar scholar.google.com
Standard Google syntax
Subject areas
Free pdfs
To do:
Stein, L.D. 2003. Integrating biological databases. Nat Rev Genet 4: 337-345.
DBs - how much is in there?
Growth of GenBank and WGS
GenBank
www.ncbi.nlm.nih.gov/Genbank/
GenBank
database of nucleotide sequences from >160,000 organisms
started in 1981 (263 entries; 436,710 residues)
Release 175 - 12/09 (112,910,950 entries; 110,118,557,163 base pairs)
Release 189 - 04/12 (151,824,421 entries; 139,266,481,398 base pairs)
Release 201 - 04/14 (171,744,486 entries; 159,813,411,760 base pairs)
Release 207 - 04/15 (182,188,746 entries; 189,739,230,107 base pairs)
divided into 18 divisions
Organism specific (primate, rodent, invertebrate, bacterial, viral… 11 divisions)
Technology specific
EST - EST sequences (expressed sequence tags)
PAT - patent sequences
STS - STS sequences (sequence tagged sites)
GSS - GSS sequences (genome survey sequences)
HTG - HTG sequences (high-throughput genomic sequences)
HTC - unfinished high-throughput cDNA sequencing
ENV - environmental sampling sequences
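The four release figures quoted above are enough for a quick back-of-the-envelope look at GenBank growth. The sketch below only does arithmetic on those numbers; the function name growth_factor is made up for the example.

```python
# (release, date, entries, base pairs) as quoted on the GenBank slide.
releases = [
    ("175", "12/09", 112_910_950, 110_118_557_163),
    ("189", "04/12", 151_824_421, 139_266_481_398),
    ("201", "04/14", 171_744_486, 159_813_411_760),
    ("207", "04/15", 182_188_746, 189_739_230_107),
]


def growth_factor(first, last):
    """How many times larger the last value is than the first."""
    return last / first


entries_growth = growth_factor(releases[0][2], releases[-1][2])
bases_growth = growth_factor(releases[0][3], releases[-1][3])

# Average record length in base pairs for each release.
mean_lengths = [bases / entries for _, _, entries, bases in releases]

print(f"entries grew {entries_growth:.2f}x between 12/09 and 04/15")
print(f"base pairs grew {bases_growth:.2f}x over the same period")
for (release, date, _, _), length in zip(releases, mean_lengths):
    print(f"release {release} ({date}): about {length:.0f} bp per entry")
```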
GenBank file
GenBank file - header
GenBank file - features
GenBank file - sequence
//
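The three parts of a GenBank flat file named above (header, features, sequence) and the closing // line can be separated with a few lines of Python. The record below is a made-up, heavily shortened example for illustration only, not a real GenBank entry, and split_record is a hypothetical helper.

```python
# Hypothetical, shortened record: header lines, a FEATURES table,
# the sequence after ORIGIN, and the record terminator "//".
RECORD = """\
LOCUS       EXAMPLE1      12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical record for illustration only.
ACCESSION   EXAMPLE1
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="synthetic construct"
ORIGIN
        1 atgcatgcat gc
//
"""


def split_record(text):
    """Return (header lines, feature lines, sequence string)."""
    header, features, sequence = [], [], []
    section = header
    for line in text.splitlines():
        if line.startswith("//"):        # end of record
            break
        if line.startswith("FEATURES"):  # feature table starts here
            section = features
        elif line.startswith("ORIGIN"):  # sequence starts on the next line
            section = sequence
            continue
        section.append(line)
    # Drop position numbers and spaces, keep only the letters.
    bases = "".join(ch for line in sequence for ch in line if ch.isalpha())
    return header, features, bases


header, features, bases = split_record(RECORD)
print(len(header), "header lines")
print(len(features), "feature lines")
print(bases)
```

For real work a dedicated parser is the safer choice, for example Biopython's Bio.SeqIO.parse with the "genbank" format.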
GenBank - interface
NCBI/EBI/GenomeNet Formats
NCBI DBs
GenBank: The Nucleotide Sequence Database
PubMed: The Bibliographic Database
Macromolecular Structure Databases
The Taxonomy Project
The Single Nucleotide Polymorphism Database (dbSNP) of Nucleotide
Sequence Variation
The Gene Expression Omnibus (GEO): A Gene Expression and Hybridization
Repository
Online Mendelian Inheritance in Man (OMIM): A Directory of Human Genes and
Genetic Disorders
The NCBI BookShelf: Searchable Biomedical Books
PubMed Central (PMC): An Archive for Literature from Life Sciences Journals
The SKY/CGH Database for Spectral Karyotyping and Comparative Genomic
Hybridization Data
The Major Histocompatibility Complex Database, dbMHC
NCBI - Entrez
http://www.ncbi.nlm.nih.gov/gquery/
General Protein DBs
UniProt (http://www.uniprot.org), formed in 2002 from:
SWISS-PROT: manually curated; high-quality annotations, less data
GenPept/TrEMBL: translated coding sequences from GenBank/EMBL; few annotations, more up to date
PIR: phylogenetic-based annotations
Institutions: European Bioinformatics Institute (EBI), Swiss Institute of Bioinformatics (SIB), Protein Information Resource (PIR)
Other protein DBs
Structural DBs (PDB)
PDB (Protein Databank)
MMDB (Molecular Modeling database)
Protein domains DB (Pfam)
Pfam
SMART (a Simple Modular Architecture Research Tool)
CDD (Conserved Domain Database)
Protein motif DBs
Scan Prosite
PRINTS
Other DBs
Ribosomal RNA DBs
RDP (Michigan State University, USA)
rRNA database (University of Antwerp, Belgium)
Silva
Genome DBs
Colibase (E. coli and related species)
Flybase (Drosophila)
Dictybase (Dictyostelium discoideum)
Metabolic pathways DBs…
Nutrigenomics related DBs
Gene oriented
Gene expression:
GEO - Gene Expression Omnibus (NCBI)
ArrayExpress (EBI)
CGED (Cancer Gene Expression Database)
Variation databases:
dbSNP (NCBI)
HapMap http://hapmap.ncbi.nlm.nih.gov/abouthapmap.html
HGVbase (Human Genome Variation)
OMIM
Online Mendelian Inheritance in Man
database that catalogues all the known diseases with
a genetic component
relationship between phenotype and genotype
~ 20 000 entries
Clinical and Mutation Databases
HGMD
Human Gene Mutation Database
• Database of sequences and phenotypes of disease-causing mutations
• http://www.hgmd.cf.ac.uk/ac/index.php
General Disease DBs
SwissVar http://swissvar.expasy.ch
KEGG Disease http://www.genome.jp/kegg/disease/
Disease-specific mutation databases
Nutrigenomics related DBs
Nutrigenomics database
microarray data related to nutrition
http://foodfunction.dc.affrc.go.jp/en/
NuGO
http://www.nugo.org
dbNP – Nutritional Phenotype database
Biological information in db:
genetics
transcriptomics
proteomics
biomarkers
metabolomics
functional assays
food intake and food
composition
Nutrition db – myplate.gov
Nutrition db - USDA
http://ndb.nal.usda.gov/
Nutrition databases
http://nutritiondata.self.com/
Literature DBs
ISI Web of knowledge
portal.isiknowledge.com
WOS
ISI – Citation report
ISI – to do
Each group takes one department and checks the
publications of its full professors (www.pbf.hr)
Count all publications and citations for your
department
What is the most cited publication in your
department?
What is the highest h-index in your department?
Normalize the data...
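Once the citation counts have been collected from the Citation report, the tallies asked for above take only a few lines of Python. The names and numbers below are invented placeholders, not real data; the h-index is computed in the usual way (the largest h such that h papers have at least h citations each), and citations per publication is just one possible way to normalize.

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, count in enumerate(ranked, start=1):
        if count >= rank:
            h = rank
        else:
            break
    return h


# Placeholder data: citations per paper for each (fictional) professor.
department = {
    "Professor A": [25, 8, 5, 3, 3, 1],
    "Professor B": [10, 9, 2],
}

all_counts = [c for papers in department.values() for c in papers]
total_publications = len(all_counts)
total_citations = sum(all_counts)
most_cited = max(all_counts)
h_by_author = {name: h_index(papers) for name, papers in department.items()}
highest_h = max(h_by_author.values())

# One simple normalization: average citations per publication.
citations_per_publication = total_citations / total_publications

print(total_publications, total_citations, most_cited)
print(h_by_author, highest_h)
print(round(citations_per_publication, 2))
```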
PubMed Overview
PubMed is a Web-based retrieval system developed by
the National Center for Biotechnology Information
(NCBI) at the National Library of Medicine (NLM)
NLM has been indexing the biomedical literature since
1879
PubMed is a database of bibliographic information
drawn primarily from the life sciences literature
PubMed contains links to full-text articles at
participating publishers' Web sites as well as links to
other third party sites
PubMed provides access and links to the integrated
molecular biology and chemistry databases maintained
by NCBI
What’s in PubMed?
Over 23 million records representing articles in the
biomedical literature
Most PubMed records are MEDLINE citations
MEDLINE®, the National Library of Medicine’s
premier bibliographic database containing citations
and author abstracts from more than 5,500
biomedical journals
The scope of MEDLINE includes diverse topics such
as microbiology, delivery of health care, nutrition,
pharmacology and environmental health
PubMed - author search
Full names are not available for all authors – it is smarter to use only initials
PubMed – author search results
Search results options
Article view
Subject search (simple)
To search by subject, be as specific as possible
Do not use punctuation, tags or operators
Search for articles on the use of aspirin for heart
attack prevention. Which query to use?
a) "aspirin for heart attack prevention"
b) aspirin heart attack prevention
c) aspirin AND heart AND attack AND prevention
Advanced PubMed search using MeSH
MeSH (Medical Subject Headings) is the NLM controlled
vocabulary which gives uniformity and consistency to the
indexing and cataloging of biomedical literature
Similar to keywords on other systems
Arranged in a hierarchical manner
Even more about MeSH
MeSH Vocabulary includes four types of terms:
Headings —represent concepts found in the biomedical
literature
Body Weight
Kidney
Radioactive Waste
Subheadings — attached to MeSH headings to describe a
specific aspect of a concept
Therapy
Diagnosis
Metabolism
Supplementary Concept Record
Publication Types
PubMed Search using MeSH – graphic example
Results
MeSH example
We will look for papers on drug therapy of
adults with nutrition disorders
1. Go to PubMed advanced search
2. In the builder change All Fields to MeSH Terms and type
nutrition disorders (choose from the dropdown menu)
3. In the next field type “adults” and click on Show index
list; select “adults”
4. Change All Fields to MeSH Subheading and select
“drug therapy” from the index list
5. Click on the Search button
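The search built in the steps above corresponds to a single query string with PubMed field tags, which can also be sent to NCBI's E-utilities ESearch service instead of the web form. The sketch below only assembles the query and the request URL and does not contact NCBI; the helper name build_term is hypothetical, and the heading "adult" is an assumption about which index entry gets selected in step 3.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_term(heading_a, heading_b, subheading):
    """Combine two MeSH headings and one subheading with AND."""
    return (
        f'"{heading_a}"[MeSH Terms] AND '
        f'"{heading_b}"[MeSH Terms] AND '
        f'"{subheading}"[MeSH Subheading]'
    )


# "adult" is assumed here; use whatever the index list actually offers.
term = build_term("nutrition disorders", "adult", "drug therapy")

params = {"db": "pubmed", "term": term, "retmax": 20, "retmode": "json"}
url = ESEARCH + "?" + urlencode(params)

print(term)
print(url)
```

The printed term can be pasted into the PubMed search box to compare with the result of the step-by-step builder.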
Tasks
Search for papers looking at vitamin B
supplementation and its effects on Alzheimer’s
disease
Find all reviews published from 2010 dealing with
drug therapies used for Alzheimer’s disease. Export
all abstracts to a file.
Need the full text article?
If you are not looking for a specific article, filter your
results using the “Free full text” option
Try searching PubMed Central (PMC), a free
archive of biomedical and life sciences journal
literature
Find the paper of interest in PubMed and search Google
Scholar to see if a free PDF is available
Using MeSH
Go to MeSH homepage - http://www.ncbi.nlm.nih.gov/mesh
Search for the MeSH term for chewing
What is it called?
What subheadings does it have?
In how many papers is chewing a major topic?
MeSH – combining queries
Search for terms obesity and outbreak
Merge them into one query
MeSH – using subheadings
Search for papers dealing with genetics of obesity
Tasks
Find out whether there is a genetic basis for vitamin C
deficiency in humans.
Find all nutrition disorders indexed in MeSH. To
which group of diseases do they belong?
Find all reviews dealing with prevention and control
of nutrition disorders in children.
OMIM
Online Mendelian Inheritance in Man
OMIM is a comprehensive, authoritative, and timely
compendium of human genes and genetic
phenotypes
OMIM contains information on all known Mendelian
disorders and over 12,000 genes
OMIM focuses on the relationship between
phenotype and genotype
OMIM
Obesity http://omim.org/entry/601665
Phenylketonuria http://omim.org/entry/261600
Description
Clinical features
Biochemical features
Inheritance
Clinical management
Population genetics
Animal models