Practical 1. Discussion


Practical 1
Discussion
Features of major databases
(PubMed and NCBI Protein Db)
Anatomy of PubMed Db
Epub ahead of print and journal impact factor
How to get the impact factor of any journal:
1) Direct source – Web of Science database
2) Indirect source, e.g. blogs, sites etc. (do a Google search)
Adapted from: http://admin-apps.isiknowledge.com/JCR/JCR?RQ=LIST_SUMMARY_JOURNAL
Anatomy of a PubMed record
Demo on downloading articles
Anatomy of a Protein Db
Accession numbers and GenInfo Identifiers
Format: gi|numeric identifier|source|alphanumeric identifier
Human P53 RefSeq mRNA record as an example:
gi|120407067|ref|NM_000546.3
GI (GenInfo Identifier): 120407067
Source: ref (RefSeq database)
Accession: NM_000546
Version: NM_000546.3
Other popular sources:
dbj – DDBJ (DNA Data Bank of Japan database)
emb – The European Molecular Biology Laboratory (EMBL) database
prf – Protein Research Foundation database
sp – SwissProt
gb – GenBank
pir – Protein Information Resource
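The identifier layout above can be split programmatically. The following is a minimal sketch, not an NCBI tool: the names `SeqId`, `SOURCES` and `parse_seq_id` are hypothetical, and the code assumes the identifier always carries a `.version` suffix as in the P53 example.

```python
from typing import NamedTuple

# Source tags as listed on the slide; this lookup table is illustrative only.
SOURCES = {
    "ref": "RefSeq",
    "dbj": "DDBJ",
    "emb": "EMBL",
    "prf": "Protein Research Foundation",
    "sp": "SwissProt",
    "gb": "GenBank",
    "pir": "Protein Information Resource",
}


class SeqId(NamedTuple):
    gi: int
    source: str
    accession: str
    version: int


def parse_seq_id(text: str) -> SeqId:
    """Split 'gi|<GI>|<source>|<accession>.<version>' into its parts."""
    fields = text.strip().lstrip(">").split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {text!r}")
    # Everything before the first dot is the accession, the rest is the version.
    accession, _, version = fields[3].partition(".")
    return SeqId(int(fields[1]), fields[2], accession, int(version))


example = parse_seq_id("gi|120407067|ref|NM_000546.3")
print(example)
print(SOURCES[example.source])
```

Keeping the accession and version as separate fields makes it easy to compare two identifiers that point at different revisions of the same record.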
Why do we need accession number and GI for one record?
1) What is the difference between accession and GI?
2) Why do we need these two when both seem to be accession numbers?
Why do we need accession
number and GI for one record?
Sequence update history for NM_000546:
Sequence      ACCESSION   VERSION       GI
Sequence_v1   NM_000546   NM_000546.1   4507636
Sequence_v2   NM_000546   NM_000546.2   8400737
Sequence_v3   NM_000546   NM_000546.3   120407067
Q1) Which revision will NCBI show if you were to search by
the accession only without the version number?
Accession numbers
- The unique identifier for a sequence record.
- An accession number applies to the complete record.
- Accession numbers do not change, even if information in the record
is changed at the author's request.
- Sometimes, however, an original accession number might become
secondary to a newer accession number, if the authors make a new
submission that combines previous sequences, or if for some
reason a new submission supersedes an earlier record.
GenInfo Identifiers
- GenInfo Identifier: sequence identification number
- If a sequence changes in any way, a new GI number will be assigned
- A separate GI number is also assigned to each protein translation
within a nucleotide sequence record
- A new GI is assigned if the protein translation changes in any way
- GI sequence identifiers run parallel to the new accession.version
system of sequence identifiers
Version
- A nucleotide sequence identification number that represents a single,
specific sequence in the GenBank database.
- If there is any change to the sequence data (even a single base), the
version number will be increased, e.g., U12345.1 → U12345.2, but
the accession portion will remain stable.
- The accession.version system of sequence identifiers runs parallel to
the GI number system, i.e., when any change is made to a sequence,
it receives a new GI number AND an increase to its version number.
- A Sequence Revision History tool
(http://www.ncbi.nlm.nih.gov/entrez/sutils/girevhist.cgi)
is available to track the various GI numbers, version numbers, and
update dates for sequences that appeared in a specific GenBank record
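The parallel bookkeeping described above (stable accession, increasing version, new GI on every sequence change) can be modelled in a few lines. This is a sketch under stated assumptions: `SequenceRecord` and its methods are hypothetical names, not part of any NCBI software, and the GI numbers are the ones shown for NM_000546 earlier in these slides.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SequenceRecord:
    accession: str
    # (version, gi) pairs, oldest first
    history: List[Tuple[int, int]] = field(default_factory=list)

    def update_sequence(self, new_gi: int) -> str:
        """Record a sequence change: the version goes up by one and a new GI
        is attached, while the accession itself stays the same."""
        version = len(self.history) + 1
        self.history.append((version, new_gi))
        return f"{self.accession}.{version}"

    def latest(self) -> Tuple[int, int]:
        """Most recent (version, gi) pair."""
        return self.history[-1]

    def gi_for(self, version: int) -> int:
        """GI that was assigned to a given version."""
        return dict(self.history)[version]


record = SequenceRecord("NM_000546")
for gi in (4507636, 8400737, 120407067):
    print(record.update_sequence(gi), gi)
```

Each call to `update_sequence` mirrors one "Sequence update" step: the accession portion never changes, so older revisions remain addressable through their version or GI.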
Anatomy of a Protein Db record
Fasta Sequence
Fasta Format
• Text-based format for representing nucleic acid sequences or peptide sequences (single-letter codes).
• Easy to manipulate, and easy for programs to parse.
Each record is a description line/row followed by sequence data line(s):
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
Fasta Format (cont.)
• Begins with a single-line description, followed by lines of sequence data.
• Description line
– Distinguished from the sequence data by a greater-than (">") symbol.
– The word following the ">" symbol in the same row is the identifier of the sequence.
– There should be no space between the ">" and the first letter of the identifier.
– Keep the identifier short and clear; some old programs accept identifiers of at most 10 characters. For example: >gi|5524211|Human or >HumanP53
• Sequence line(s)
– Ensure that the sequence data starts in the row following the description row (be careful of the word wrap feature).
– The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.
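The rules above translate directly into a small reader. The following is a minimal sketch, not a replacement for an established library: `parse_fasta` is a hypothetical helper, and the two sequences are shortened fragments of the slide example so the output stays readable.

```python
from typing import Dict, Iterable


def parse_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Return {identifier: sequence} from FASTA-formatted lines."""
    records: Dict[str, str] = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # The identifier is the first word after ">".
            words = line[1:].split()
            if not words:
                raise ValueError("description line has no identifier")
            current = words[0]
            if current in records:
                raise ValueError(f"duplicate identifier: {current}")
            records[current] = ""
        elif current is None:
            raise ValueError("sequence data before the first description line")
        else:
            # Sequence data may span several lines; join them.
            records[current] += line
    return records


text = """>SEQUENCE_1
MTEITAAMVKEL
RESTGAGMMDCK
>SEQUENCE_2
SATVSEINSETD
"""
records = parse_fasta(text.splitlines())
for identifier, sequence in records.items():
    print(identifier, len(sequence), sequence)
```

Note how a new ">" line both closes the previous sequence and opens the next one, which is why a stray word-wrapped description line would silently corrupt a record.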
Amino acids & Nucleotides
IUPAC One Letter Amino Acid Code
Code  Amino acid                   Memory aid
A     Alanine
B     Asx
C     Cysteine
D     Aspartic Acid                Aspar(D)ic Acid
E     Glutamic Acid
F     Phenylalanine                (F)enylalanine
G     Glycine
H     Histidine
I     Isoleucine
J     Xle
K     Lysine
L     Leucine
M     Methionine
N     Asparagine                   Asparagi(N)e
O     Pyrrolysine, 22nd (Pyl)      Pyrr(O)lysine
P     Proline
Q     Glutamine                    (Q)lutamine
R     Arginine                     (R)ginine
S     Serine
T     Threonine
U     Selenocysteine, 21st (Sec)
V     Valine
W     Tryptophan                   T(W)ptophan
X     Xaa
Y     Tyrosine                     T(Y)rosine
Z     Glx
Note
Amino acid                          Three letter code   Single letter code
Asparagine or aspartic acid         Asx                 B
Glutamine or glutamic acid          Glx                 Z
Leucine or isoleucine               Xle                 J
Unspecified or unknown amino acid   Xaa                 X
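The ambiguity codes in the note can be handled with a small lookup. This is an illustrative sketch; `AMBIGUOUS` and `residue_matches` are hypothetical names, and X is treated as compatible with any residue because the note defines it as unspecified or unknown.

```python
# Ambiguous one-letter codes from the note, mapped to the residues they cover.
AMBIGUOUS = {
    "B": {"D", "N"},  # aspartic acid or asparagine
    "Z": {"E", "Q"},  # glutamic acid or glutamine
    "J": {"I", "L"},  # isoleucine or leucine
}


def residue_matches(code: str, residue: str) -> bool:
    """True if a concrete residue is compatible with a possibly ambiguous code."""
    code, residue = code.upper(), residue.upper()
    if code == "X":
        return True  # unspecified or unknown: matches anything
    # Unambiguous codes only match themselves.
    return residue in AMBIGUOUS.get(code, {code})


print(residue_matches("B", "D"))
print(residue_matches("B", "E"))
print(residue_matches("X", "W"))
```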
IUPAC Nucleotide Code
Standard IUPAC Nucleotide code is used to describe ambiguous sites in
a given DNA sequence motif, where a single character may represent
more than one nucleotide. The code is shown in the table below.
http://www.yeastract.com/help/help_iupac.php
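One common use of the nucleotide ambiguity codes is searching a sequence for a degenerate motif. The sketch below expands each IUPAC symbol into a regular-expression character class. `IUPAC_DNA` follows the standard IUPAC assignments; `motif_to_regex`, the motif `TATAWR` and the test sequence are made up for illustration and do not come from the slides.

```python
import re

# Standard IUPAC nucleotide codes and the bases each one stands for.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT",
    "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}


def motif_to_regex(motif: str) -> re.Pattern:
    """Compile an IUPAC motif into a regex over A, C, G and T."""
    parts = []
    for symbol in motif.upper():
        bases = IUPAC_DNA[symbol]  # KeyError for a non-IUPAC symbol
        parts.append(bases if len(bases) == 1 else f"[{bases}]")
    return re.compile("".join(parts))


pattern = motif_to_regex("TATAWR")
print(pattern.pattern)  # TATA[AT][AG]
hits = [m.start() for m in pattern.finditer("GGTATAAAGCTATATGC")]
print(hits)
```

Note that `finditer` reports non-overlapping matches only; overlapping occurrences would need a lookahead-based pattern.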
Advice
• We highly recommend that you memorize the amino acid codes and their structures.
• Memorizing the codes and in particular the structures will be very useful for this module and other modules, especially for research purposes.
• It is not compulsory that you memorize these for this module.
Features of major databases
(Gene Db)
Anatomy of Gene Db
Anatomy of a Gene Db record
A section of Gene Db record: Reference Sequences
mRNA accession number
Protein accession number
Nucleic Acid Databases
Entrez nucleotide database (nt)
• GenBank
• DDBJ
• EMBL
• RefSeq_genomic
Amino Acid Databases
1) Sequence repositories
• GenPept (redundant; translation of GenBank; minimal annotation)
• Entrez Protein (redundant or NR)
– translated DDBJ/EMBL/GenBank (i.e. GenPept)
– Swiss-Prot, PIR, RefSeq_protein and PDB
• RefSeq (non-redundant; reference sequences; minimal manual curation; limited species)
2) Universal curated databases
• PIR-PSD (non-redundant; focus on protein family classification)
• Swiss-Prot (non-redundant; manually annotated)
• TrEMBL (non-redundant; extensively computer-annotated)
3) Next generation of protein sequence databases
• UniProtKB (Swiss-Prot, TrEMBL and PIR-PSD integrated; less redundant than UniProt NREF)
• UniParc (like Entrez Protein but more comprehensive)
• UniProt NREF (like RefSeq but more comprehensive and rich with annotation)
Read more: http://www.ebi.ac.uk/panda/pdf/apweiler_bairoch_2004.pdf
The RefSeq Project
• Designed to reduce duplication by selecting one representative sequence for each locus, except when there are naturally occurring paralogs and splice variants.
• Info from:
– Predictions from genomic sequence
– Analysis of GenBank records
– Collaborating databases
• Goal: a “comprehensive, integrated, non-redundant set of sequences, including genomic DNA, transcript (RNA), and protein products, for major research organisms.”
http://www.ncbi.nlm.nih.gov/RefSeq/index.html
GenBank versus RefSeq
http://www.ncbi.nlm.nih.gov/books/NBK21105/#ch1.Appendix_GenBank_RefSeq_TPA_and_UniP
Choice of databases for genomic/proteomic data
Genome architecture
[Figure: genome architecture showing enhancer, promoter and gene regions; gene segments labelled E, I and U]
Databases to house genomic/proteomic data
Database         Contents
Nucleotide       All of the above in multiple records
RefSeq_genome    Reference ones only
Protein          All real/reliably predicted proteins in multiple records
RefSeq_Protein   Reference proteins only
Gene             Gene record with all related information included (mRNA, protein, promoter, enhancer)
Database searching can help answer questions like
• What is the sequence of human IL-10?
• What is the gene coding for human IL-10?
• Is the function of human IL-10 known? What is it?
• Are there any variants of human IL-10?
• Who sequenced this gene?
• What are the differences between IL-10 in human and in other species?
• Which species are known to have IL-10?
• Is the structure of IL-10 known?
• What are the structural and functional domains of IL-10?
• Are there any motifs in the sequence that explain its properties?
• What is the upstream region of IL-10 containing transcriptional regulation sites?
IL10 = X?
Take home messages for databases
• Bioinformatics = databases + tools
• General databases versus specialized databases
• Databases come and go (especially the small ones)
• Database redundancy – many databases for the same topic (use the most comprehensive; if not, use all for comprehensiveness)
• Database accuracy – published ones are more reliable; nevertheless, they are still prone to errors; it is always good to spend some time assessing the reliability of your data of interest by cross-referencing to literature or other databases
• Fortunately, most databases are cross-referenced
• Unfortunately, there is no common standard format; you need to spend some time familiarizing yourself with each; this becomes easy after some practice
• Finding databases relevant to you
– NAR Database catalogue
– PubMed
– Google
• 2 main methods for searching databases (each with its own pros and cons)
– 1. Keyword search (covered today)
– 2. Sequence search (day 2)
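A keyword search can also be issued programmatically. The sketch below only builds the request URL and does not contact any server. It assumes NCBI's E-utilities `esearch.fcgi` endpoint and the Entrez field tags `[Gene Name]` and `[Organism]`, none of which appear in these slides; check the E-utilities documentation before relying on them. `keyword_search_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Assumed endpoint: NCBI E-utilities ESearch (not mentioned in the slides).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def keyword_search_url(db: str, term: str, retmax: int = 20) -> str:
    """Build an ESearch URL for a keyword query against one Entrez database."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_ESEARCH}?{query}"


# Example query for the human IL-10 question; the field tags are assumptions.
url = keyword_search_url("protein", "IL10[Gene Name] AND Homo sapiens[Organism]")
print(url)
```

Building the URL separately from fetching it keeps the query easy to inspect and to paste into a browser while learning how the search fields behave.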