Biological Data Resources

Computer Storage of Sequences
(Chapter 2 of Bioinformatics: Sequence and Genome Analysis, by David W. Mount)
CSE730: Seminar on “Information Retrieval of Biomedical Text and Data”
Outline
Storing DNA and protein sequences in computer files or databases.
Related information is placed in the database along with the sequence, in a number of sequence data formats.
Online public-access databases for sequence retrieval.
Nucleotide Sequence
Nomenclature Committee of the International Union of Biochemistry
Code   Nucleic Acid(s)
A      Adenine
C      Cytosine
G      Guanine
T      Thymine
U      Uracil
M      A or C (amino)
R      A or G (purine)
W      A or T (weak)
S      C or G (strong)
Y      C or T (pyrimidine)
K      G or T (keto)
V      A or C or G
H      A or C or T
D      A or G or T
B      C or G or T
N      A or G or C or T (any)
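The ambiguity codes listed above can be used to search a sequence for a degenerate pattern. The following is a minimal sketch, not taken from the slides: the function names `iupac_to_regex` and `find_sites` are hypothetical, and treating U as equivalent to T is an assumption made here for convenience.

```python
import re

# IUPAC nucleotide codes from the list above, mapped to the bases they allow.
# Assumption: U (uracil) is treated like T so RNA-style patterns also match DNA.
IUPAC_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT", "N": "ACGT",
}


def iupac_to_regex(pattern):
    """Translate an IUPAC pattern into a regular expression over A/C/G/T."""
    parts = []
    for code in pattern.upper():
        bases = IUPAC_DNA[code]  # raises KeyError for a non-IUPAC symbol
        parts.append(bases if len(bases) == 1 else "[" + bases + "]")
    return "".join(parts)


def find_sites(pattern, sequence):
    """Return 0-based start positions of all (possibly overlapping) matches."""
    # The lookahead makes each match zero-width, so overlapping sites are found.
    regex = re.compile("(?=" + iupac_to_regex(pattern) + ")")
    return [match.start() for match in regex.finditer(sequence.upper())]
```

For example, `find_sites("RY", "ACGT")` looks for a purine followed by a pyrimidine at every position of the sequence.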
Protein Sequence
Code   Amino acid
A      Alanine
B      Aspartic acid or asparagine
C      Cysteine
D      Aspartic acid
E      Glutamic acid
F      Phenylalanine
G      Glycine
H      Histidine
I      Isoleucine
K      Lysine
L      Leucine
M      Methionine
N      Asparagine
P      Proline
Q      Glutamine
R      Arginine
S      Serine
T      Threonine
V      Valine
W      Tryptophan
X      Unknown
Y      Tyrosine
Z      Glutamic acid or glutamine
Adapted from IUPAC-IUB (1969, 1972, 1983)
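A program reading protein sequences usually needs to check that every symbol is a valid one-letter code. The sketch below is one possible way to do that, not part of the original slides: `check_protein` is a hypothetical helper, and it treats B, Z and X as ambiguity codes (B and Z each stand for two possible residues, X for an unknown one).

```python
# The 20 standard one-letter amino acid codes.
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")
# Codes that do not identify a single residue.
AMBIGUITY_CODES = set("BZX")


def check_protein(sequence):
    """Return (ambiguous, invalid) lists of (1-based position, symbol) pairs."""
    ambiguous, invalid = [], []
    for position, residue in enumerate(sequence.upper(), start=1):
        if residue in STANDARD_RESIDUES:
            continue
        if residue in AMBIGUITY_CODES:
            ambiguous.append((position, residue))
        else:
            invalid.append((position, residue))
    return ambiguous, invalid
```

A caller could reject sequences with any invalid symbols and merely warn about ambiguous ones.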
Sequence Formats
A sequence is stored as ASCII text (i.e., a string of A, G, C, T, ...) along with annotations.
Different sequence formats are recognized by different sequence analysis programs.
A sequence format includes accessory information (gene names, source organism, investigator name, references) and the actual sequence.
Sequence Formats (continued)
• FASTA
• GenBank Flat File format
• PIR/CODATA format
• EMBL sequence entry format
• Intelligenetics sequence entry format
• GCG (Genetics Computer Group) sequence entry format
• ASN.1
• XML
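FASTA is the simplest of these formats: a header line starting with ">" followed by one or more lines of sequence. As an illustration only, a minimal reader might look like the sketch below; `parse_fasta` is a hypothetical name and the sample identifiers in the usage note are made up.

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

Calling `parse_fasta(">seq1 toy example\nACGT\nACGT\n")` yields a single record whose sequence is the two lines joined together.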
Databases
NCBI
GenBank at the National Center for
Biotechnology Information (NCBI), National
Library of Medicine, Bethesda, MD
NBRF
Protein Information Resource (PIR) database
at the National Biomedical Research
Foundation in Washington, DC
Databases (continued)
SwissProt
The SwissProt protein sequence database at
ISREC, Swiss Institute for Experimental Cancer
Research.
EMBL
European Molecular Biology Laboratory
(EMBL) Outstation at Hinxton, England
DDBJ
DNA DataBank of Japan (DDBJ) at Mishima,
Japan
Databases on Internet
NCBI       http://www.ncbi.nlm.nih.gov
PIR        http://www-nbrf.georgetown.edu/pirwww
SwissProt  http://www.expasy.ch/cgi-bin/sprot-search-de
EMBL       http://www.ebi.ac.uk/embl/index.html
DDBJ       http://www.ddbj.nig.ac.jp/
NCBI
• National resource for molecular biology information.
• Maintains comprehensive databases for a variety of biotechnology-related information.
• Develops and manages access to a range of databases and software for the scientific and medical communities.
NCBI : Integrated Databases
Literature Databases
PubMed
PubMed Central
OMIM
PROW
BookShelf
NCBI : Integrated Databases
(continued)
Nucleotide Databases
GenBank
EST Database
GSS Database
SNPs Database
RefSeq
STS Database
NCBI : Integrated Databases
(continued)
Entrez Databases
PubMed
Protein Sequence Database
Nucleotide Sequence Database
Taxonomy
OMIM
GenBank
GenBank is the NIH genetic sequence
database.
Annotated collection of all publicly
available DNA sequences.
GenBank is a part of an international
collaboration of sequence databases
along with EMBL and DDBJ.
GenBank DNA Sequence Format
A DNA sequence entry in GenBank is organized into
distinct fields, as follows:
Locus: locus name, sequence length, division, date
Definition: description of entry
Accession: unique accession number
Version: version of sequence
Keywords: keywords for cross referencing
GenBank DNA Sequence Format
(continued)
Source: source organism of DNA
Organism: description of organism
References: authors, title, journal, Medline, etc
Features: information about sequence
Base count: number of bases in sequence
Origin: the sequence data begin on the line following ORIGIN.
GenBank sample
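To show how the fields listed above appear in a flat file, here is a rough sketch of a reader. It relies on one layout fact: top-level keywords start in the first column and continuation lines are indented. The record in `TOY_RECORD` is entirely made up for illustration (it is not a real GenBank entry), `parse_genbank` is a hypothetical helper, and indented sub-keywords such as ORGANISM are simply appended to their parent field rather than parsed separately.

```python
# A made-up record for illustration only; not a real GenBank entry.
TOY_RECORD = """
LOCUS       TOY00001                  12 bp    DNA     linear   SYN 01-JAN-2000
DEFINITION  Hypothetical toy record for illustration only.
ACCESSION   TOY00001
VERSION     TOY00001.1
KEYWORDS    .
SOURCE      synthetic construct
  ORGANISM  synthetic construct
ORIGIN
        1 acgtacgtac gt
//
"""


def parse_genbank(text):
    """Split one flat-file record into {keyword: text} plus a SEQUENCE entry."""
    fields, sequence = {}, []
    current, in_origin = None, False
    for line in text.splitlines():
        if line.startswith("//"):
            break  # end-of-record marker
        if in_origin:
            # Sequence lines carry position numbers and spaces; keep letters only.
            sequence.append("".join(ch for ch in line if ch.isalpha()))
        elif not line.strip():
            continue
        elif not line[0].isspace():
            # Top-level keyword starting in the first column.
            current, _, value = line.partition(" ")
            if current == "ORIGIN":
                in_origin = True
            else:
                fields[current] = value.strip()
        elif current is not None:
            # Indented continuation or sub-keyword line: append to the parent.
            fields[current] += " " + line.strip()
    fields["SEQUENCE"] = "".join(sequence)
    return fields
```

With the toy record, `parse_genbank(TOY_RECORD)["ACCESSION"]` returns the accession string and the `SEQUENCE` entry holds the bases with numbering and spaces removed.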
NCBI : Tools
Tools for Data Retrieval and submission
• Text Term Searching
• Sequence Similarity Searching
• Taxonomy Searching
• Sequence Submission
NCBI : ENTREZ
Entrez is a search and retrieval system
that integrates information from databases
at NCBI.
These databases include nucleotide
sequences, protein sequences,
macromolecular structures, whole
genomes, MEDLINE, PubMed, etc.
Entrez
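Entrez can also be queried from a program. The sketch below assumes NCBI's E-utilities web interface, which is not mentioned in these slides, and only builds request URLs without contacting the server. The helper names `esearch_url` and `efetch_url` are hypothetical, and the database name and search term in the usage note are example values.

```python
from urllib.parse import urlencode

# Assumption: base address of the NCBI E-utilities service.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build a URL that searches an Entrez database for a text term."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return EUTILS_BASE + "/esearch.fcgi?" + query


def efetch_url(db, ids, rettype="fasta", retmode="text"):
    """Build a URL that retrieves the records with the given identifiers."""
    query = urlencode({
        "db": db,
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": retmode,
    })
    return EUTILS_BASE + "/efetch.fcgi?" + query
```

For instance, `esearch_url("nucleotide", "BRCA1[Gene Name]", retmax=5)` produces a search URL; the identifiers returned by that search could then be passed to `efetch_url` to retrieve the sequences in FASTA format.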
NCBI : BLAST
BLAST: Basic Local Alignment Search Tool
• A set of similarity search programs designed to explore the available sequence databases.
• Uses a heuristic algorithm that can detect relationships among sequences that share only isolated regions of similarity.
Q-BLAST: a queuing system for BLAST that allows users to retrieve results at their convenience and to format those results.
NCBI : BLAST (continued)
Access to BLAST service
Web-BLAST
Standalone BLAST
Network BLAST
BLAST URL API
NCBI : BLAST (continued)
BLAST Programs
• blastp: compares an amino acid query sequence against a protein sequence database
• blastn: compares a nucleotide query sequence against a nucleotide sequence database
• blastx: compares a nucleotide query sequence against a protein sequence database
• tblastn: compares a protein query sequence against a nucleotide sequence database
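The choice among the four programs above depends only on the type of the query and the type of the database, so it can be written as a lookup table. This is an illustrative sketch; `choose_blast_program` is a hypothetical helper, and it covers only the programs listed on this slide.

```python
# (query type, database type) -> BLAST program, as listed above.
BLAST_PROGRAMS = {
    ("protein", "protein"): "blastp",
    ("nucleotide", "nucleotide"): "blastn",
    ("nucleotide", "protein"): "blastx",
    ("protein", "nucleotide"): "tblastn",
}


def choose_blast_program(query_type, database_type):
    """Return the BLAST program name for a query/database type pair."""
    key = (query_type.lower(), database_type.lower())
    if key not in BLAST_PROGRAMS:
        raise ValueError("types must be 'protein' or 'nucleotide'")
    return BLAST_PROGRAMS[key]
```

For example, a nucleotide query searched against a protein database maps to `blastx`.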
BLAST
NBRF: PIR
Protein Information Resource
3 Major Databases:
PSD (Protein Sequence Database)
iProClass
PIR-NREF
(Nonredundant REFerence protein database)
PIR: PSD
• The PIR, in collaboration with MIPS and JIPID, produces and distributes the PIR-International Protein Sequence Database (PSD).
• Comprehensive and expertly annotated protein sequence database.
• The primary sources of PSD data are sequences from GenBank/EMBL/DDBJ translations, published literature, and direct submission to PIR-International.
PIR: PSD (continued)
The PIR-PSD data are available in XML,
NBRF, and PIR/CODATA formats.
The sequence file is available in FASTA
format.
Also available from the PIR UNIX FTP server.
Address:
ftp://ftp.pir.georgetown.edu/pir_databases/psd/
CODATA format
CODATA format carries approximately the
same information as a GenBank or EMBL
sequence file, but is formatted slightly
differently and uses different field names.
Also called PIR format; used by NBRF.
CODATA Sample
PIR: iProClass
The iProClass database provides
comprehensive descriptions of all proteins
and serves as a framework for data
integration in a distributed networking
environment.
Very user-friendly description.
PIR: NREF
(Non-redundant REFerence protein
database)
• Comprehensive: contains all sequences from PIR-PSD, SwissProt, TrEMBL, RefSeq, and GenPept; updated biweekly.
• Non-redundant: clustered by sequence identity and taxonomy at the species level.
• Source attribution: contains protein IDs and names from associated databases (with hypertext links), in addition to protein sequence, taxonomy, and bibliography.
The current version (July 2002) consists of more than 809,000 non-redundant PIR-PSD, SwissProt, and TrEMBL proteins organized into more than 36,200 PIR superfamilies and 145,340 families, with links to over 50 molecular biology databases.
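The idea of a non-redundant database with source attribution can be illustrated with a small sketch. The slides do not describe how PIR-NREF actually builds its clusters, so the code below is only an assumption-laden toy: it merges entries whose sequences are identical within the same species and keeps every source identifier. `merge_identical` and the sample identifiers in the usage note are hypothetical.

```python
def merge_identical(entries):
    """Group (source_id, species, sequence) entries by species and sequence.

    Returns a dict mapping (species, sequence) to the list of source IDs
    that reported that exact sequence for that species.
    """
    merged = {}
    for source_id, species, sequence in entries:
        key = (species, sequence.upper())
        # Keep all source IDs so the merged entry can link back to each database.
        merged.setdefault(key, []).append(source_id)
    return merged
```

Given the entries `("PIR:X1", "Homo sapiens", "MKT")` and `("SP:Y1", "Homo sapiens", "mkt")`, both identifiers end up under the same key, while the same sequence from a different species stays separate.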
Swiss-Prot
Swiss-Prot is a protein knowledgebase
established in 1986.
Maintained collaboratively by the
Department of Medical Biochemistry of the
University of Geneva (now the Swiss
Institute of Bioinformatics) and the EMBL
Data Library.
Swiss-Prot Sequence Entry Example
Sequence Format Conversion
READSEQ:
Sequence Format Conversion program.
http://bimas.dcrt.nih.gov/molbio/readseq/
Can convert to/from:
• ASN.1
• FASTA
• CODATA
• GCG
• EMBL format
• GenBank format and many other formats
References
• http://www.ncbi.nlm.nih.gov
• http://www-nbrf.georgetown.edu/pirwww
• http://www.expasy.ch/cgi-bin/sprot-search-de
• http://www.ebi.ac.uk/embl/index.html
• http://www.ddbj.nig.ac.jp/
Thank You
Presented by:
Hemal Patel &
Jeetal Shah