The Johns Hopkins University - American University of Beirut
Retrieving Information:
Using Entrez
Lecture 2.2
1
Retrieving information: how it works:
• Servers have the records you want
• You need to understand the data they have,
and how it is organized
• There are often many ways to get to an
answer.
• The route to an answer is not always obvious, so
consider alternatives and watch for traps.
• Use some query language – each system has
its own.
• Retrieve data in a specified format.
• Save it in a way that will be useful to you.
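The query-then-retrieve steps above can be sketched in code. The slides describe the Entrez web interface. This sketch assumes NCBI's E-utilities (the programmatic interface to Entrez, not covered in the slides) and only builds request URLs without contacting the server. The helper names `esearch_url` and `efetch_url` are hypothetical.

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities (assumption: the slides use the web interface).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL: ask which record IDs in `db` match `term`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def efetch_url(db, ids, rettype, retmode="text"):
    """Build an EFetch URL: retrieve the given records in a specified format."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS_BASE}/efetch.fcgi?{urlencode(params)}"


# Step 1: query. Step 2: retrieve in a chosen format (FASTA here).
search = esearch_url("pubmed", "yeast")
fetch = efetch_url("nuccore", ["L12345"], rettype="fasta")
print(search)
print(fetch)
```

Saving the response body to a file named after the accession is one way to keep the result useful for later steps.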
What you may be looking for:
• You did a BLAST search and need more information
about some of the proteins it found
similarities to.
• You heard about a disease gene that was
recently discovered, and you want to know
more about it.
• You want to build a dataset for local BLAST
searches.
• A colleague wants you to do an alignment of
all sequences from a given protein family.
What you are looking for:
• PubMed paper from author X
• Sequence from gene X in organism Y
• All information about organelle W in
model organism Y
• All information about disease X in
human
• Orthologs of that disease gene in other
model organisms
Central Dogma: NCBI version
DNA → RNA → protein → Write a paper about it
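Entrez: Pathway to Discovery (1993)
[Figure: three databases and the links between them]
• MEDLINE abstracts: linked to each other by term frequency statistics
• Nucleotide sequences: linked to each other by nucleotide sequence similarity
• Protein sequences: linked to each other by amino acid sequence similarity
• MEDLINE abstracts ↔ nucleotide sequences, and MEDLINE abstracts ↔ protein sequences: literature citations in sequence databases
• Nucleotide sequences ↔ protein sequences: coding region features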
Type in your last name and find a paper
from one of your teammates
Related Articles
Hard link DNA to protein
L12345
From Fig. 1 of: Jim Ostell, “The Entrez Search and Retrieval System”, Chapter 14, The NCBI Handbook, 2003.
Ctrl-F
Getting started in Entrez
“ouellette bf” [au] AND yeast
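A query like the one above combines a field-tagged phrase with a free-text word. The following minimal sketch composes such a string. The helper names `tagged` and `all_of` are hypothetical and are not part of any NCBI library.

```python
def tagged(value, field):
    """Quote a value and attach an Entrez field tag, e.g. "hieter p"[au]."""
    return f'"{value}"[{field}]'


def all_of(*terms):
    """Join terms with an explicit Boolean AND."""
    return " AND ".join(terms)


# Author field tag [au] plus a free-text word, as on the slide.
query = all_of(tagged("ouellette bf", "au"), "yeast")
print(query)  # "ouellette bf"[au] AND yeast
```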
MeSH: Medical Subject Heading
A query
• Word <free text>: too many hits
– Add more words (the Boolean ‘AND’ is the
default)
– Limit the query to a specified field
– Limit the query in time
– Combine earlier queries with Boolean operators
• #1 AND #2
• #3 NOT #5
• #7 OR #8
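Combining numbered queries behaves like set algebra on the record IDs each query returned. The sketch below illustrates this with Python sets. The history contents and the `combine` helper are hypothetical, made up for illustration only.

```python
# Hypothetical search history: query number -> set of matching record IDs.
history = {
    1: {101, 102, 103, 104},
    2: {103, 104, 105},
    3: {101, 106},
}


def combine(left, op, right):
    """Combine two earlier queries the way '#left OP #right' would."""
    a, b = history[left], history[right]
    if op == "AND":
        return a & b  # records found by both queries
    if op == "OR":
        return a | b  # records found by either query
    if op == "NOT":
        return a - b  # records found by the first but not the second
    raise ValueError(f"unknown operator: {op}")


print(sorted(combine(1, "AND", 2)))  # [103, 104]
print(sorted(combine(1, "NOT", 3)))  # [102, 103, 104]
print(sorted(combine(2, "OR", 3)))   # [101, 103, 104, 105, 106]
```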
hieter p [au]
Limit in time: 1993-01-01 to 1993-12-31
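The same date limit can be expressed programmatically. This sketch assumes NCBI's E-utilities ESearch endpoint, whose `datetype`, `mindate`, and `maxdate` parameters take dates as YYYY/MM/DD. The helper names are hypothetical, and the code only builds the URL without sending a request.

```python
from datetime import date
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def to_entrez_date(iso_date):
    """Convert an ISO date (1993-01-01) to the YYYY/MM/DD form ESearch expects."""
    return date.fromisoformat(iso_date).strftime("%Y/%m/%d")


def dated_search_url(term, start, end):
    """Build a PubMed ESearch URL limited to a publication-date range."""
    params = {
        "db": "pubmed",
        "term": term,
        "datetype": "pdat",  # limit on publication date
        "mindate": to_entrez_date(start),
        "maxdate": to_entrez_date(end),
    }
    return f"{ESEARCH}?{urlencode(params)}"


url = dated_search_url("hieter p[au]", "1993-01-01", "1993-12-31")
print(url)
```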
No abstract
With abstract
Full Text on-line
Full Text in PubMed Central
boguski m [au]: 99 results
boguski ms [au]: 80 results
#24 NOT #23: 19 results
Other types of links in Entrez
• The next slides explore other kinds of
things linked into Entrez records.
“hieter p” [au] cdc16p
“Books”
(2)
Link to Genome
View of Chromosome I
RefSeq
• RefSeq is NCBI's curated collection of
“reference sequences” for all ‘worked’
genomes.
• Historically, these were referred to as
“GenBank-Gold”.
• RefSeq records are genomic, mRNA, or protein
sequences.
• Not all sequences are in RefSeq.
• All RefSeq sequences are assembled or taken
from records in GenBank.
Some features of RefSeq:
• non-redundancy
• explicitly linked nucleotide and protein
sequences
• updates to reflect current knowledge of
sequence data and biology
• data validation and format consistency
• distinct accession series
• ongoing curation by NCBI staff and
collaborators, with review status indicated on
each record
Accession number space
• GenBank (number of letters + number of digits):
– 1+5 (L12345, U00001)
– 2+6 (AF000001, AC000003)
– 4+2+6 (WGS: 4-letter project code + 2-digit version + 6 digits)
• All have accession.version
• Protein:
– 1+5 (SwissProt/UniProt)
– 3+5 (GenPept)
• All have accession.version
• RefSeq:
– N*_12345
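The accession schemes above can be recognized by their shape. This minimal sketch uses regular expressions for that. The patterns are assumptions based only on the formats listed on this slide and are not an official NCBI validator. The 1+5 shape is shared by GenBank nucleotide and some SwissProt accessions, so it is reported as ambiguous.

```python
import re

# (label, pattern) pairs, checked in order. Patterns follow the slide's schemes.
PATTERNS = [
    ("RefSeq", re.compile(r"^[A-Z]{2}_[A-Z]{0,4}\d+$")),
    ("GenBank WGS (4+2+6)", re.compile(r"^[A-Z]{4}\d{8,}$")),
    ("1+5 (GenBank nucleotide or SwissProt)", re.compile(r"^[A-Z]\d{5}$")),
    ("GenBank nucleotide (2+6)", re.compile(r"^[A-Z]{2}\d{6}$")),
    ("GenPept protein (3+5)", re.compile(r"^[A-Z]{3}\d{5}$")),
]


def classify(accession):
    """Return a label for an accession, ignoring any '.version' suffix."""
    base = accession.split(".")[0].upper()
    for label, pattern in PATTERNS:
        if pattern.match(base):
            return label
    return "unknown"


for acc in ["L12345", "AF000001.1", "AAA12345", "NM_123456", "ABCD01000001"]:
    print(acc, "->", classify(acc))
```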
RefSeq Accession Number Space
NC_123456 (Genomic): Complete genomic molecules including genomes, chromosomes, organelles, plasmids.
NG_123456 (Genomic): Incomplete genomic region; supplied to support the NCBI Genome Annotation pipeline.
NM_123456 (mRNA)
NR_123456 (RNA): Non-coding transcripts including structural RNAs, transcribed pseudogenes, and others.
NP_123456 (Protein)
NP_12345678 (Protein): Planned expansion of accession series.
Automated Assemblies
NT_123456 (Genomic): Intermediate genomic assemblies of BAC sequence data.
NW_123456 (Genomic): Intermediate genomic assemblies of Whole Genome Shotgun sequence data.
Model RefSeq records
XM_123456 (mRNA): Model mRNA provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XR_123456 (RNA): Model non-coding transcripts provided by the Genome Annotation process; sequence corresponds to the genomic contig.
XP_123456 (Protein): Model proteins provided by the Genome Annotation process; sequence corresponds to the genomic contig.
WGS special case
NZ_ABCD12345678 (Genomic): A collection of whole genome shotgun sequence data for a project. Accessions are not tracked between releases. The first four characters following the underscore (e.g. 'ABCD') identify a genome project.
ZP_12345678 (Protein): Proteins annotated on NZ_ accessions (often via computational methods).
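The RefSeq prefixes from the preceding slides lend themselves to a simple lookup table. This sketch maps each prefix to its molecule type and to the group it appeared under in the slides. The table contents come from the slides. The group labels and the `describe` helper are hypothetical names chosen for illustration.

```python
# Prefix -> (molecule type, group), transcribed from the slides above.
REFSEQ_PREFIXES = {
    "NC": ("Genomic", "RefSeq"),
    "NG": ("Genomic", "RefSeq"),
    "NM": ("mRNA", "RefSeq"),
    "NR": ("RNA", "RefSeq"),
    "NP": ("Protein", "RefSeq"),
    "NT": ("Genomic", "automated assembly"),
    "NW": ("Genomic", "automated assembly"),
    "XM": ("mRNA", "model"),
    "XR": ("RNA", "model"),
    "XP": ("Protein", "model"),
    "NZ": ("Genomic", "WGS"),
    "ZP": ("Protein", "WGS"),
}


def describe(accession):
    """Return (molecule type, group) for a RefSeq accession, or None."""
    prefix, underscore, _ = accession.upper().partition("_")
    if not underscore:
        return None  # no underscore, so not a RefSeq-style accession
    return REFSEQ_PREFIXES.get(prefix)


print(describe("NM_123456"))        # ('mRNA', 'RefSeq')
print(describe("XP_123456"))        # ('Protein', 'model')
print(describe("NZ_ABCD12345678"))  # ('Genomic', 'WGS')
print(describe("L12345"))           # None
```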
Download all the data
Entrez and RefSeq
Locus Link
Things to watch out for: