Finding What you
Need in Biological
Databases
Cédric Notredame
Cédric Notredame (02/04/2016)
Databases:
Where is my Needle ?
Our Scope
Give you means to answer simple questions
Databases are UNFRIENDLY INFORMATION DESKS
Give you an idea of what is possible
WHAT can you ask ?
HOW can you ask it ?
Outline
- An Overall view
- Asking a biological question to a database
- Turning a question into a query
- Bibliographic Databases: Medline, OMIM
- Gene Databases: GenBank, LocusLink, ENSEMBL
- Protein Databases: SwissProt, InterPro, Prodom
- SRS
Database:
What is a Database ?
DataBase Entries
AGCTGTCGAGGGATAGGACA
TATACATAAATTAATATAAT
1 entry = 1 Sequence (SEQ)
1 entry = 1 File = Sequence + Doc (SEQ + DOC) = Flat File
Database = Collection of Flat Files
DataBase Entries: Flat Files
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001;DEA=oct-nov-dec 2000;
//
Accession number: 3
First Name: Marie-Claude
Last Name: Blatter Garin
Course: EMBnet=sept 2000; sept 2001; DEA=oct-nov-dec 2000;
http://www.expasy.org/people/Marie-Claude.Blatter-Garin.html
//
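As a minimal sketch of how such a flat file can be read, the Python snippet below splits the text on the `//` terminator and turns each `Field: value` line into a dictionary entry. The function name `parse_flat_file` and the `URL` key used for bare links are assumptions made for illustration; real databases define their own field codes.

```python
def parse_flat_file(text):
    """Split a flat file into entries; each entry ends with a '//' line."""
    entries, current = [], {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "//":
            # End of the current entry
            entries.append(current)
            current = {}
        elif line.startswith("http"):
            # Bare link without a field name (hypothetical 'URL' key)
            current["URL"] = line
        else:
            # 'Field: value' line, split on the first colon only
            key, _, value = line.partition(":")
            current[key.strip()] = value.strip()
    if current:
        # Keep a last entry that lacks its '//' terminator
        entries.append(current)
    return entries


SAMPLE = """\
Accession number: 1
First Name: Amos
Last Name: Bairoch
Course: DEA=oct-nov-dec 2002
http://www.expasy.org/people/amos.html
//
Accession number: 2
First Name: Laurent
Last Name: Falquet
Course: EMBnet=sept 2000, sept 2001;DEA=oct-nov-dec 2000;
//
"""

entries = parse_flat_file(SAMPLE)
for entry in entries:
    print(entry["Accession number"], entry["First Name"], entry["Last Name"])
```

Each entry stays self-contained (sequence plus documentation in one block), which is what makes flat files easy to exchange but slow to search without an index.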
DataBase: Relational Databases
Relational database (« table file »):
Teacher table:
Teacher    Accession number   Education
Amos       1                  Biochemistry
Laurent    2                  Biochemistry
M-Claude   3                  Biochemistry
Course table:
Course   Date                   Involved teachers
DEA      Oct-nov-dec 2000       1,3
EMBnet   Sept 2000, Sept 2001   2,3
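The same two tables can be sketched with Python's built-in `sqlite3` module. This is only an illustration: the table and column names are assumptions, and the comma-separated "Involved teachers" column is replaced by a hypothetical link table `course_teacher`, which is the usual relational way to store such a list.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.executescript("""
CREATE TABLE teacher (accession INTEGER PRIMARY KEY, name TEXT, education TEXT);
CREATE TABLE course (course_id INTEGER PRIMARY KEY, name TEXT, date TEXT);
CREATE TABLE course_teacher (course_id INTEGER, accession INTEGER);
""")
conn.executemany("INSERT INTO teacher VALUES (?, ?, ?)", [
    (1, "Amos", "Biochemistry"),
    (2, "Laurent", "Biochemistry"),
    (3, "M-Claude", "Biochemistry"),
])
conn.executemany("INSERT INTO course VALUES (?, ?, ?)", [
    (1, "DEA", "Oct-nov-dec 2000"),
    (2, "EMBnet", "Sept 2000, Sept 2001"),
])
# One row per (course, teacher) pair instead of the "1,3" and "2,3" lists
conn.executemany("INSERT INTO course_teacher VALUES (?, ?)",
                 [(1, 1), (1, 3), (2, 2), (2, 3)])


def teachers_of(course_name):
    """Return the names of the teachers involved in a course."""
    rows = conn.execute(
        """SELECT t.name FROM teacher t
           JOIN course_teacher ct ON ct.accession = t.accession
           JOIN course c ON c.course_id = ct.course_id
           WHERE c.name = ? ORDER BY t.accession""",
        (course_name,),
    )
    return [row[0] for row in rows]


print(teachers_of("DEA"))
print(teachers_of("EMBnet"))
```

The accession number plays the role of the key that links one table to the other, so each piece of information is stored only once.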
To Summarize: What’s a database ?
Collection of Data that is:
•Structured
•Searchable (index) -> table of contents
•Updated periodically (release) -> new edition
•Cross-referenced (hyperlinks) -> links with other db
Collection of tools (software) necessary for:
Searching – Updating – Releasing
Data storage management: flat files, relational databases…
Database:
What’s on the Menu?
A large amount of information
More than 1000 different databases
Generally accessible through the web
EBI: http://www.ebi.ac.uk/
NCBI: http://www.ncbi.nlm.nih.gov
Google: http://www.google.com
Variable size: <100Kb to >10Gb
DNA: > 10 Gb
Protein: 1 Gb
3D structure: 5 Gb
Other: smaller
Update frequency: daily to annually
A Non Exhaustive List
AATDB, AceDb, ACUTS, ADB, AFDB, AGIS, AMSdb, ARR, AsDb, BBDB, BCGD, Beanref, BIOMDB, BioImage,
BioMagResBank, BLOCKS, BovGBASE, BOVMAP, BSORF, BTKbase, CANSITE, CarbBank, CARBHYD, CATH, CAZY,
CCDC, CD40Lbase, CGAP, ChickGBASE, Colibri, COPE, CottonDB, CSNDB, CUTG, CyanoBase, dbCFC, dbEST,
dbSTS, DDBJ, DGP, DictyDb, Dicty_cDB, DIP, DOGS, DOMO, DPD, DPInteract, ECDC, ECGC, ECO2DBASE,
EcoCyc, EcoGene, EMBL, EMD db, ENZYME, EPD, EpoDB, ESTHER, FlyBase, FlyView, GCRDB, GDB, GENATLAS,
GenBank, GeneCards, Genline, GenLink, GENOTK, GenProtEC, GIFTS, GPCRDB, GRAP, GRBase, gRNAsdb, GRR,
GSDB, HAEMB, HAMSTERS, HEART-2DPAGE, HEXAdb, HGMD, HIDB, HIDC, HIVdb, HotMolecBase, HOVERGEN, HPDB,
HSC-2DPAGE, ICN, ICTVDB, IL2RGbase, IMGT, Kabat, KDNA, KEGG, Klotho, LGIC, MAD, MaizeDb, MDB,
Medline, Mendel, MEROPS, MGDB, MGI, MHCPEP, Micado, MitoDat, MITOMAP, MJDB, MmtDB, Mol-R-Us, MPDB,
MRR, MutBase, MycDB, NDB, NRSub, O-lycBase, OMIA, OMIM, OPD, ORDB, OWL, PAHdb, PatBase, PDB, PDD,
Pfam, PhosphoBase, PigBASE, PIR, PKR, PMD, PPDB, PRESAGE, PRINTS, ProDom, Prolysis, PROSITE,
PROTOMAP, RatMAP, RDP, REBASE, RGP, SBASE, SCOP, SeqAnalRef, SGD, SGP, SheepMap, Soybase, SPAD,
SRNA db, SRPDB, STACK, StyGene, Sub2D, SubtiList, SWISS-2DPAGE, SWISS-3DIMAGE, SWISS-MODEL
Repository, SWISS-PROT, TelDB, TGN, tmRDB, TOPS, TRANSFAC, TRR, UniGene, URNADB, V BASE, VDRR,
VectorDB, WDCM, WIT, WormPep, YEPD, YPD, YPM, etc. !!!!
There Exists A Specialized Database on
Almost anything you can think of
A database of databases
What’s on the Menu:
The Art of Eating Well
Always Use Fresh Data:
The Latest Update of your DataBase
Make Sure The DataBase is Maintained:
Many Databases are poorly maintained
Treat DataBases like Publications:
Some Journals are Better than Others
Bio-Google:
How Can I Search a
Database ?
Searching Databases
There are 2 ways to search databases
Text based queries (DOC): Medline, Entrez
Search For « Smith AND dUTPase »
Similarity Searches (SEQ): BLAST
AGCTGTCGAGGGATAGGACA
TATACATAAATTAATATAAT
Searching Databases
Each database is a little kingdom…
Has its own query system
Has its own information structure
The main databases are well documented
and this documentation is available online
Most databases can be searched using SRS
or Entrez
Databases: Asking the right Question
Databases ARE NOT
meant for browsing
When you search a Database you must
have an idea of what your Needle-in-a-haystack looks like
Databases: Asking the right Question
Browsing a database is like Using your
phone book in place of a dating agency…
Databases: Asking the right Question
Finding Data: Database Search
Finding Questions:
Data Mining
The Kind Of Questions We Can Ask:
SEQUENCE Based
InterPro
Any Known Domain in my Protein ???
SwissProt
Any Protein like mine ???
These ARE Predictions
The Kind Of Questions We Can Ask:
TEXT Based
Medline
Who Worked on my Protein ???
SwissProt
Function of My Protein ???
PDB
Structure of My Protein ???
These are NOT Predictions
Just like When You Google up
Specific Queries give Precise Answers
Medline:
Who worked on my
Protein ?
Medline (PubMed)
What is in Medline ?
MEDLINE covers the fields of medicine, nursing,
dentistry, veterinary medicine, the health care system,
and the preclinical sciences
More than 4,000 biomedical journals and more than 10
million citations from 1966 until now
Contains links to biological db and to some journals
•Many papers not dealing with human are not in Medline
•Before 1970, keeps only the first 10 authors !
Using Medline: Asking a question
During the last Lab Meeting, I heard the word dUTPase.
What can it be ? What has been published on this ?
Using Medline: Asking a question
By Default, Medline Assumes you mean:
Abergel AND dUTPase
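The implicit AND can also be written out explicitly when a query is built by a program. The sketch below only assembles a query string and a search URL; it does not contact any server. It assumes the NCBI E-utilities `esearch` endpoint (not mentioned in these slides), and the helper names `and_query` and `esearch_url` are made up for illustration.

```python
from urllib.parse import urlencode, urlparse, parse_qs

# Assumed endpoint: NCBI E-utilities ESearch
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def and_query(*terms):
    """Join terms with an explicit AND; quote multi-word terms."""
    quoted = [f'"{t}"' if " " in t else t for t in terms]
    return " AND ".join(quoted)


def esearch_url(term, db="pubmed", retmax=20):
    """Build (but do not send) an ESearch request for the given term."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


query = and_query("Abergel", "dUTPase")
url = esearch_url(query)
print(query)
print(url)
```

Typing `Abergel dUTPase` in the search box and building `Abergel AND dUTPase` this way are meant to be equivalent; writing the operator out simply makes the intent visible.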
Using Medline: Asking a question
I have found the reference I wanted.
Now I want to save it so that I can use it
later, for instance to import it into EndNote,
my reference manager
Save Your Data in the Proper DataBase format
Using Medline: Storing your results
Retrieving EXACTLY
the Information that you need
[AB]
[AD]
Restricted fields
Using Medline: Storing your results
AB
AD
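In the MEDLINE display format each line starts with a short field tag such as AB (abstract) or AD (affiliation/address), which is what makes a saved record easy to re-import. The sketch below parses one such record; the function name `parse_medline` is an assumption and the sample record is entirely made up for illustration.

```python
def parse_medline(text):
    """Parse one MEDLINE-format record into {tag: [values]}."""
    record, tag = {}, None
    for line in text.splitlines():
        if not line.strip():
            continue
        if line[:4].strip() and line[4:6] == "- ":
            # New field: tag padded to 4 characters, then '- ', then the value
            tag = line[:4].strip()
            record.setdefault(tag, []).append(line[6:].strip())
        elif tag is not None:
            # Continuation line: indented text belonging to the previous field
            record[tag][-1] += " " + line.strip()
    return record


# Hypothetical record, not a real citation
SAMPLE = (
    "PMID- 0000000\n"
    "TI  - A made-up title about dUTPase.\n"
    "AB  - First line of a made-up abstract\n"
    "      continued on a second line.\n"
    "AD  - Some Institute, Some City.\n"
    "AU  - Author A\n"
    "AU  - Author B\n"
)

rec = parse_medline(SAMPLE)
print(rec["AB"][0])
print(rec["AD"][0])
```

Repeated tags such as AU are kept as a list, so no author is lost when a record has several.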
Using Medline: Looking for a Review
I Want to Find the LATEST REVIEW on
the dUTPase.
Use The Limit Option of Medline
Using Medline: Looking For a Review
1-Limits
Title OR Abstract
Article type
Language
Using Medline: A Few Tips
•Quoted queries (e.g. «down syndrome» ) behave as a single
word, and are great to improve the relevance of your search
•Adding initials to names (e.g. “Abergel C” ) (if you can) also
reduces your output
•Write down the PubMed Identifier (the number in the PMID
field) of that interesting paper you just found. It could be very
useful in your subsequent search for related items such as
associated gene and protein sequences
Using Medline: A Few Tips
•Spelling mistakes, wrong field restrictions or wrong Limits settings
can occur. When a search returns nothing, these may be the problem.
•Use abstracts to enlarge your vocabulary and look for
synonyms: some papers on dUTPase might use dUTP
pyrophosphatase instead!
•The “related papers” button (on the extreme right of the
PubMed output). Try it from time to time, to enlarge a search
that is not giving you enough references
Using Medline: A Few Tips
•Storing your PDFs,
•Memory is cheap, access is sometimes strange…
•Storing your favourite PDF is a good idea
•Which name on your disk?
•THE MEDLINE ID NUMBER !!!
•With a reference manager like EndNote
GenBank:
What is the Sequence
of my Gene ?
GenBank: an Overview
EMBL, GenBank and DDBJ contain the same data.
They are synchronized every day.
GenBank
EMBL
DDBJ
GenBank: an Overview
GenBank contains EVERY piece of DNA that
has been sequenced and made publicly
available.
It contains GOOD and BAD data
There is a Historical Aspect in the GenBank
data:
-Complex Genes are spread in many
entries:
GenBank Entries Are Complex
because Genes are complex
Prokaryotic Example
Gene: Promoter, RBS, ATG, STOP
Gene -> mRNA -> ORF -> Protein
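The ATG ... STOP layout of a prokaryotic ORF can be illustrated with a few lines of Python. This is a deliberately naive sketch: it scans the forward strand only, takes the first ATG that is followed in frame by a stop codon, and ignores promoters and ribosome binding sites. The function name `first_orf` and the example sequence are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}  # the three standard stop codons


def first_orf(dna):
    """Return the first ATG...STOP open reading frame (stop included), or None."""
    dna = dna.upper()
    start = dna.find("ATG")
    while start != -1:
        # Walk codon by codon, in frame with this ATG
        for i in range(start + 3, len(dna) - 2, 3):
            if dna[i:i + 3] in STOPS:
                return dna[start:i + 3]
        # No in-frame stop found: try the next ATG
        start = dna.find("ATG", start + 1)
    return None


# Hypothetical sequence: leader + ATG + three codons + stop + trailer
example = "CCGGAGGT" + "ATG" + "AAA" + "GGG" + "TTT" + "TAA" + "CC"
print(first_orf(example))
```

Real gene finders also check the reverse strand, minimum ORF length and upstream signals; the point here is only that an ORF is defined by its start and stop codons being in the same reading frame.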
GenBank Entries Are Complex
because Genes are complex
Gene: Promoter, exon, exon, exon, exon, exon, exon
mRNA (form1) -> Protein (form1)
mRNA (form2) -> Protein (form2)
Using GenBank: Asking a question
What is the Sequence of the E. coli dUTPase ?
Using GenBank: Asking a question
The Naive Way
Escherichia coli dUTPase
This search reports EVERY GenBank
entry that contains these two words.
Most Bacterial Genome Entries
(annotated by similarity) Contain
these two words
Using GenBank: Asking a question
The Right Way
Escherichia coli[organism] dUTPase[definition]
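The difference between the naive and the field-restricted query is easy to see when both are built by the same helper. The sketch below only assembles the query strings; the function name `entrez_term` is hypothetical, and the field names `organism` and `definition` are simply the ones shown on the slide.

```python
def entrez_term(*pairs, operator=None):
    """Build a query from (text, field) pairs; field=None means unrestricted."""
    parts = [f"{text}[{field}]" if field else text for text, field in pairs]
    separator = f" {operator} " if operator else " "
    return separator.join(parts)


# The Naive Way: both words may match anywhere in the entry
naive = entrez_term(("Escherichia coli", None), ("dUTPase", None))

# The Right Way: each word is restricted to the field where it belongs
right = entrez_term(("Escherichia coli", "organism"), ("dUTPase", "definition"))

print(naive)
print(right)
```

Restricting `dUTPase` to the definition line is what removes the whole-genome entries, where the word appears only in the annotation of one gene among thousands.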
Using GenBank: And There Is Plenty
More where It comes from…
GenBank Is Redundant:
If a Gene is published more than once,
Each publication gets its own entry
This can mean MANY ENTRIES if you have
SNPs or ESTs
Header
Contains all the practical
Information
Features
Contains Experimental
Information and Predictions
Extra Gene
This is common in GenBank
entries
Using GenBank: Asking a question
What is the Sequence of the E. coli dUTPase ?
What is the Sequence of the Human dUTPase ?
Using GenBank: Finding the Human
dUTPase
1-Request Limits
2-Check box here to exclude ESTs
Using GenBank: Finding the Human
dUTPase
The Gene does NOT
appear in a single entry
Using GenBank: Reconstructing your gene
Some Good News…
-This Information is complicated because it is RAW
Information
-It is necessary to keep UNINTERPRETED Experimental
Information available
-There are SIMPLER alternatives to using this RAW
Information:
-Gene Centric Databases
-Protein Databases
RefSeq/LocusLink:
What Is There To
know about This Gene?
Using LocusLink
Using LocusLink: Asking a question
What Can I find about the DUT Gene ?
Enter Gene name
Select LocusLink
Using LocusLink: Asking a question about
a Gene
OMIM:
Is There A disease
Associated to This
Gene?
OMIM: Finding Out About The Phenotype
of a Gene
OMIM™: Online Mendelian Inheritance in
Man
A catalog of human genes and genetic
disorders
Contains a summary of literature,
pictures, and reference information. It
also contains numerous links to articles and
sequence information.
NCBI-GENOME:
What is the Context
of my Gene In Its
Genome?
NCBI-GENOME
NCBI-GENOME: The Virus Section
NCBI-GENOME: The Bacteria
Section
ENSEMBL:
Where is my Gene in
the Human Genome
(who are its neighbors)
?
Using ENSEMBL
My Gene:
A Summary
Gathering Everything you need on a gene
GenBank: What is the Sequence ?
LocusLink: What about this Gene?
ENSEMBL: What is the Context?
MEDLINE: Are There Papers?
OMIM: Are There Illnesses?
SwissProt:
What Do We Know
About My Protein ?
The Protein Databases
GenBank: A Big
Bag of DNA
PREDICTION
+
EXPERIMENT
Generic Non Redundant
Protein Databases
NR
trEMBL
Specialized Protein
Databases
SwissProt
PIR
What Is SwissProt ?
Fully-annotated (manually), non-redundant, cross-referenced, documented protein sequence database.
~100’000 sequences from more than 6’800 different
species; 70’000 references (publications); 550’000 cross-references (databases); ~200 Mb of annotations.
Collaboration between the SIB (CH) and EMBL/EBI (UK)
Using SwissProt: Asking a question
We hear the word EPO quite often these days, but what
exactly is known about it ?
Using SwissProt: Asking a question
A Simple SwissProt
Text Query
EPO HUMAN
Using SwissProt: Reading an Entry
Structure Information
The Protein Databases
GenBank: A Big
Bag of DNA
PREDICTION
+
EXPERIMENT
Generic Non Redundant
Protein Databases
NR
trEMBL
Specialized Protein
Databases
SwissProt
PIR
UniProt
SwissProt
How Good is Good ?
PDB:
What is the Structure
of my Protein ?
PDB: The Protein Database
Managed by Research Collaboratory for Structural
Bioinformatics (RCSB) (USA).
Contains macromolecular structure data on proteins,
nucleic acids, protein-nucleic acid complexes, and
viruses.
Currently there are ~16’000 structures for
about 4’000 different molecules, but far fewer protein
families (highly redundant) !
Using PDB: Asking a question
Does tolB have a known Structure? And If the answer
is Yes, How can I look at it ?
Using PDB: Asking a question
Query: TolB
Using PDB: Viewing a Structure
View Structure
Using PDB: Downloading Data
Coordinates
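Once the coordinates are downloaded, they can be read with very little code. The sketch below assumes the legacy fixed-column PDB text format, in which each ATOM line stores x, y and z in columns 31-38, 39-46 and 47-54. The function names are hypothetical and the two sample lines use made-up coordinates, not values from a real TolB entry.

```python
def parse_atom(line):
    """Extract a few fields from one fixed-width ATOM/HETATM line."""
    return {
        "serial": int(line[6:11]),        # columns 7-11
        "name": line[12:16].strip(),      # columns 13-16
        "res_name": line[17:20].strip(),  # columns 18-20
        "chain": line[21],                # column 22
        "res_seq": int(line[22:26]),      # columns 23-26
        "x": float(line[30:38]),          # columns 31-38
        "y": float(line[38:46]),          # columns 39-46
        "z": float(line[46:54]),          # columns 47-54
    }


def read_atoms(text):
    """Parse every ATOM or HETATM line of a PDB-format text."""
    return [parse_atom(line) for line in text.splitlines()
            if line.startswith(("ATOM", "HETATM"))]


# Made-up coordinates for illustration only
SAMPLE = (
    "HEADER    HYPOTHETICAL EXAMPLE\n"
    "ATOM      1  N   MET A   1      27.340  24.430   2.614\n"
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842\n"
    "END\n"
)

atoms = read_atoms(SAMPLE)
for atom in atoms:
    print(atom["name"], atom["res_name"], atom["x"], atom["y"], atom["z"])
```

Because the format is column-based rather than space-separated, slicing by position is safer than calling `split()`, which breaks when two numbers touch.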
InterPro:
Are There Domains In
my Protein ?
InterPro: The Idea of Domains
InterPro: A Federation of Databases
Using InterPro: Asking a question
Which Domains does the oncogene FosB contain?
Using CDsearch: Asking a question
Using Domains: Some Statistics
• 10 most common protein domains for H. sapiens
Immunoglobulin and major histocompatibility complex domain
Zinc finger, C2H2 type
Eukaryotic protein kinase
Rhodopsin-like GPCR superfamily
Pleckstrin homology (PH) domain
RING finger
Src homology 3 (SH3) domain
RNA-binding region RNP-1 (RNA recognition motif)
EF-hand family
Homeobox domain
My Protein:
A Summary
Gathering Everything you need on a
Protein
trEMBL: What is the Sequence ?
SwissProt: What about the Function?
InterPro: Which Domains?
MEDLINE: Are There Papers?
PDB: Which Structure?
SRS:
Can I search Many
Databases
Simultaneously ?
Using SRS
A Few Databases in
Bulk
A Few Addresses
A few Databases