Biology 4900
Biocomputing
Chapter 2
Molecular Databases and Data
Analysis
Literature Databases
• Online databases available at CSU
– Galileo
– JSTOR
• Online databases at other sites
– PubMed. If you find a useful article, you can check
PubMed Central to see if it is available online for free.
• Where to get articles
– PubMed Central
– GIL
– Interlibrary loan
Sources of Molecular Data
[Figure: flow from DNA to RNA to protein to phenotype. DNA is held in genomic DNA databases; RNA is represented by cDNA, ESTs*, and UniGene; protein is held in protein sequence databases.]
*Expressed Sequence Tags
Molecular Databases
• Primary Database
– Archival - sequences submitted directly from experimental
sequencing results
• Very little interpretation
• Anyone can submit; accuracy not checked
• Examples
– Nucleic Acid: EMBL, DDBJ, GenBank
– Protein: Swiss-Prot, PIR, PDB
• Secondary Databases
– Curated – sequences are validated/checked and may be
annotated
• RefSeq (nucleic acids and proteins, but limited to certain
organisms)
• TrEMBL, GenPept, UniProt
Nucleic Acid Databases
• Contain:
– Nucleic acid sequences
• Chain termination method (Sanger sequencing)
– Used for sequences 100-1000 bp
• Whole Genome Shotgun (WGS) Sequencing
– Used for sequences >1000 bp
– DNA chopped into little chunks
– Sequenced using chain termination method (reads)
– Numerous, overlapping reads are collected and assembled
into sequence (computational methods)
– Annotations for each sequence
• Putative identification of open reading frames (ORFs = parts of
gene that encode protein) in sequence
• Putative intron(excised)/exon(retained) locations
• Authors, dates, publication, etc.
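The shotgun step described above (overlapping reads assembled computationally into one sequence) can be illustrated with a toy sketch. This is a minimal greedy overlap merge, not the algorithm used by any real assembler; the function names, the `min_len` threshold, and the three short reads are all made up for illustration, and real reads contain errors and repeats that this sketch ignores.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of reads with the largest overlap."""
    reads = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(reads) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(reads):
            for j, b in enumerate(reads):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # no remaining overlaps: leave the contigs separate
        merged = reads[best_i] + reads[best_j][best_n:]
        reads = [r for k, r in enumerate(reads) if k not in (best_i, best_j)]
        reads.append(merged)
    return reads


# Hypothetical reads that overlap by four bases each.
toy_reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(toy_reads)
print(contigs)
```

If the reads do not overlap by at least `min_len` bases, the function returns several contigs rather than one sequence.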
International Nucleotide Sequence Database Collaboration
(Public nucleotide and protein sequence databases)
• GenBank
– Location: National Institutes of Health, National Center for Biotechnology Information
• European Molecular Biology Laboratory (EMBL)
– Location: European Bioinformatics Institute (EBI)
• DNA Database of Japan (DDBJ)
– Location: National Institute of Genetics, Mishima
• GenBank, EMBL, and DDBJ share information with one another daily
GenBank
• As of April 2011, there were approximately 126,551,501,141 bases in 135,440,924
sequence records in the traditional GenBank divisions.
• Read the following paper: http://www.ncbi.nlm.nih.gov/pubmed/21071399
• Home Page: http://www.ncbi.nlm.nih.gov/genbank/
Most sequenced organisms
Organism                          Bases
Homo sapiens                      14.9 billion
Mus musculus                      8.9 billion
Rattus norvegicus                 6.5 billion
Bos taurus                        5.4 billion
Zea mays                          5.0 billion
Sus scrofa                        4.8 billion
Danio rerio                       3.1 billion
Strongylocentrotus purpuratus     1.4 billion
Oryza sativa (japonica)           1.2 billion
Nicotiana tabacum                 1.2 billion
GenBank Home Page
NCBI Resources
• PubMed
• BLAST
• OMIM
• Taxonomy Browser
• Structure
NCBI key features: PubMed
• National Library of Medicine's search service
• 21 million citations from MEDLINE & others (as of 2011)
• Links to other online journals
• http://www.ncbi.nlm.nih.gov/pubmed
• Starting point for most research
Literature Searches through PubMed
Use the pull-down menu to access related resources
such as Medical Subject Headings (MeSH)
A “how to” pull-down menu links to tutorials
Use “Advanced search” to limit by author, year,
language, etc.
PubMed search strategies
Try the tutorial
Use boolean queries (capitalize AND, OR, NOT)
lipocalin AND disease
Try using limits (see Advanced search)
There are links to find Entrez entries and external resources
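The same Boolean queries can also be sent to PubMed programmatically. The sketch below only builds a search URL for NCBI's E-utilities `esearch` service and does not contact the network; the helper name `pubmed_search_url` and the `retmax` default are assumptions made for illustration, and the `[Language]` field tag is shown as one example of a limit.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address of the E-utilities search service.
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(term, retmax=20):
    """Return an esearch URL for a PubMed query (hypothetical helper)."""
    params = {"db": "pubmed", "term": term, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


# Boolean operators are capitalized, as on the PubMed web page.
plain_url = pubmed_search_url("lipocalin AND disease")
limited_url = pubmed_search_url("lipocalin AND disease AND english[Language]")

print(plain_url)
print(limited_url)

# Decoding the query string recovers the original search term.
print(parse_qs(urlsplit(limited_url).query)["term"][0])
```

Opening either URL in a browser should return the matching PubMed identifiers as XML.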
[Venn diagrams of search terms 1 and 2]
1 AND 2: lipocalin AND disease (504 results)
1 OR 2: lipocalin OR disease (2,500,000 results)
1 NOT 2: lipocalin NOT disease (2,370 results)
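AND, OR, and NOT behave like set intersection, union, and difference. The sketch below shows this with Python sets; the record numbers are invented stand-ins for PubMed citations, and only the last two lines use the counts quoted above.

```python
# Made-up record IDs standing in for citations that match each term.
lipocalin = {101, 102, 103, 104}
disease = {103, 104, 105, 106, 107}

both = lipocalin & disease            # 1 AND 2: records matching both terms
either = lipocalin | disease          # 1 OR 2: records matching at least one term
lipocalin_only = lipocalin - disease  # 1 NOT 2: lipocalin records without "disease"

print(sorted(both), sorted(either), sorted(lipocalin_only))

# AND and NOT split the first term's records into two disjoint groups,
# so the counts quoted above imply the total for "lipocalin" alone.
lipocalin_total = 504 + 2370
print(lipocalin_total)
```

The same reasoning explains why the OR search is so large: it includes every record that mentions disease, whether or not it mentions lipocalin.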
Save Searches, Save Results, Get Papers
PubMed Author Search
Google Scholar Search
• http://scholar.google.com/
• Includes references that may not be found in PubMed
NCBI key features
A search from NCBI main page will search:
• the scientific literature;
• DNA and protein sequence databases;
• 3D protein structure data;
• population study data sets;
• assemblies of complete genomes
• String search
Search by author, date, keyword, publication, etc.
Classroom exercise:
Author searches
Paper searches
Protein searches
NCBI key features: BLAST
BLAST is…
• Basic Local Alignment Search Tool
• NCBI's sequence similarity search tool
• supports analysis of DNA and protein databases
3CLN
NCBI key features: OMIM
• Online Mendelian Inheritance in Man
• Catalog of human genes and genetic disorders
NCBI key features: Taxonomy Browser
• Browser for the major divisions of living organisms
(archaea, bacteria, eukaryota, viruses)
• Taxonomy information such as genetic codes
• Molecular data on extinct organisms
• Useful to find a protein or gene from a species
NCBI key features: Structure
• Molecular Modeling Database (MMDB)
• biopolymer structures obtained from
the Protein Data Bank (PDB)
• Cn3D (a 3D-structure viewer)
• vector alignment search tool (VAST)
Cn3D
• A 3D-structure viewer
• Must be downloaded (ftp://ftp.ncbi.nlm.nih.gov/cn3d/Cn3D-4.3.msi)
• Use it to align structures identified as similar by VAST
Example: Researching beta globin
• Beta globin is a protein, so information about it can be found in three types of
databases: DNA, RNA*, and protein
• Databases shown: GenBank (dbGSS, dbHTGS, dbSTS, dbEST), Entrez Gene, UniGene,
Gene Expression Omnibus, Entrez Protein, UniProt, PDB, SCOP, CATH
*Because RNA is unstable, it is reverse-transcribed into
complementary DNA (cDNA)
Necessary (yet annoying) Definitions
• Sequence Tagged Site (STS): Small DNA fragments
with both DNA sequence data and mapping data
(genes assigned to chromosomes)
• Expressed Sequence Tags (EST): Partial DNA
sequence of a complementary (cDNA) clone
– Typically these are randomly-selected cDNA clones
sequenced on a single strand (300-800 bp)
– Useful for identifying novel genes
– Higher rate of error
http://genome.wellcome.ac.uk/doc_WTD020755.html
UniGene
• The Unique Gene (UniGene) project creates gene-oriented clusters by
partitioning ESTs into non-redundant sets
– http://www.ncbi.nlm.nih.gov/unigene
– Ultimately there should be only 1 cluster per gene
– Usually more than 1 due to errors
– Types of errors
• 2 or more clusters may represent different parts of the same gene
• Sequence errors
• Cloning artifacts (DNA transcribed during creation of cDNA that
doesn’t correspond to authentic transcript)
[Figure: ESTs grouped into a UniGene cluster (http://www.ncbi.nlm.nih.gov/unigene)]
GenBank Flatfile
• A format for organizing genomic sequence data. Includes the following:
• Sequence and annotations
• Header
– Locus name or accession number: unique to sequence description
– Size: number of nucleotide bases or amino acid residues
– Molecule: DNA, RNA, strandedness (ds, ss), and type of RNA or DNA
– GenBank division code: 18 divisions (PRI = primate, PLN = plant, BAC = bacterial, etc.)
– Date of last modification
• Definition Line: brief description of sequence (e.g. source organism, protein/gene
name, function)
• Accession: unique identifier for a record
– May be more than one accession
• Version
– Record modification (accession.1; accession.2)
– GI: is specific to version; may be more than one
• Keywords
• Source: organism or clone description
• Reference: publications that discuss data reported
• Authors and Journal publication info
• PubMed identifier: link to sequence record (abstract)
• Features: vary (chromosomal info., coding info, protein id, % of each nucleotide)
• Sequence Data
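The header fields listed above can be pulled out of a flatfile's first line with a few lines of code. This is a rough sketch that assumes a nucleotide record whose LOCUS line has exactly eight whitespace-separated tokens; the sample line is invented (it is not a real GenBank record), the topology field ("linear") is not mentioned in the list above, and splitting on whitespace is a simplification of the real fixed-column layout. Both function names are hypothetical.

```python
def parse_locus_line(line):
    """Split a nucleotide LOCUS line into the header fields (simplified)."""
    tokens = line.split()
    if len(tokens) != 8 or tokens[0] != "LOCUS" or tokens[3] != "bp":
        raise ValueError("unexpected LOCUS line layout: " + line)
    return {
        "locus_name": tokens[1],
        "size": int(tokens[2]),   # number of bases
        "molecule": tokens[4],    # e.g. DNA, mRNA
        "topology": tokens[5],    # assumption: linear or circular
        "division": tokens[6],    # e.g. PRI, PLN, BAC
        "modified": tokens[7],    # date of last modification
    }


def split_version(identifier):
    """Separate 'accession.version' into its two parts."""
    accession, _, version = identifier.partition(".")
    return accession, (int(version) if version else None)


# Invented example line, for illustration only.
sample = "LOCUS       EXAMPLE01     120 bp    DNA     linear   PRI 01-JAN-2011"
header = parse_locus_line(sample)
print(header["locus_name"], header["size"], header["division"])
print(split_version("N91759.1"))
```

For real work, an established parser such as the one in Biopython is a safer choice than hand-written splitting.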
What is an accession number?
An accession number is a label that is used to identify a
sequence. It is a (unique) string of letters and/or numbers
that corresponds to a molecular sequence.
Examples (all for retinol-binding protein, RBP4):
DNA
– X02775: GenBank genomic DNA sequence
– NT_030059: Genomic contig (overlapping DNA fragments)
– rs7079946: dbSNP (single nucleotide polymorphism)
RNA
– N91759.1: An expressed sequence tag (1 of 170)
– NM_006744: RefSeq DNA sequence (from a transcript)
Protein
– NP_007635: RefSeq protein
– AAC02945: GenBank protein
– Q28369: SwissProt protein
– 1KT7: Protein Data Bank structure record
NCBI’s important RefSeq project: best
representative sequences
RefSeq (accessible via the main page of NCBI)
provides an expertly curated accession number that
corresponds to the most stable, agreed-upon “reference”
version of a sequence.
RefSeq identifiers include the following formats:
Complete genome        NC_######
Complete chromosome    NC_######
Genomic contig         NT_######
mRNA (DNA format)      NM_######   e.g. NM_006744
Protein                NP_######   e.g. NP_006735
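Because the RefSeq prefix encodes the molecule type, a short lookup can classify an identifier. The sketch below covers only the four prefixes listed above; the function name `refseq_type` is hypothetical, and RefSeq uses other prefixes that are not handled here.

```python
# Prefixes taken from the list above; other RefSeq prefixes are not covered.
REFSEQ_PREFIXES = {
    "NC": "complete genome or chromosome",
    "NT": "genomic contig",
    "NM": "mRNA (DNA format)",
    "NP": "protein",
}


def refseq_type(accession):
    """Return the molecule type for a RefSeq-style accession, else None."""
    prefix, underscore, _ = accession.partition("_")
    if not underscore:
        return None  # no underscore, so not in the RefSeq format shown above
    return REFSEQ_PREFIXES.get(prefix.upper())


for acc in ["NM_006744", "NP_006735", "NT_030059", "X02775"]:
    print(acc, "->", refseq_type(acc))
```

Identifiers without an underscore, such as the GenBank accession X02775, return `None`.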
UniGene Name Search: Oncomodulin
All results are listed; the page allows filtering
UniGene Name Search: Select Human Oncomodulin
• 4 Expressed Sequence Tags from 1 complementary DNA library
• Identifies chromosome and map position on chromosome
• Compares cluster transcripts with refseq proteins
UniGene Name Search: Select Human Oncomodulin
Click on the link for a menu of other links:
• Conserved domains
• Gene summary
• Protein sequence
Clicking on the Protein sequence link then takes you to the predicted protein
sequence file (NP_006179.2)
UniGene Name Search: Select Human Oncomodulin
Once here, you can:
1. Open FASTA file
2. Run BLAST
3. Identify and view conserved domains
4. See related proteins
Access to sequences: Gene at NCBI
Gene is a great starting point: it collects
key information on each gene/protein from
major databases. It covers all major organisms.
Example: RefSeq provides a curated, optimal
accession number for each DNA (NM_000518 for beta
globin DNA corresponding to mRNA) or protein
(NP_000509)
Because they are curated, these reference sequences should be more reliable
Gene Name Search: Oncomodulin
Returns list of gene entries for oncomodulin for different organisms
Click on a highlighted link to see details
Gene Name Search: Select Human Oncomodulin
Summary of all gene information, including mapping (when available). Note that this
sequence has been validated as a RefSeq.
Scrolling down, you can find link to protein data through UniProt.
Gene Name Search: Link to Oncomodulin Protein
Protein Name Search: Oncomodulin
Notice that I filtered this search so that results show only human oncomodulin
You can change the display (as shown)…
FASTA format:
versatile, compact with one header line
followed by a string of nucleotides or amino acids
in the single letter code
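A FASTA file is simple enough to read with a few lines of code. In the sketch below, the header line is recognized by its leading ">" character and the sequence lines that follow are joined together; the function name `parse_fasta` and the two short records are made up for illustration.

```python
def parse_fasta(text):
    """Return a dict mapping each header (without '>') to its sequence."""
    records = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            header = line[1:]
            records[header] = []
        elif header is not None:
            records[header].append(line)  # sequence may span several lines
    return {name: "".join(parts) for name, parts in records.items()}


# Invented records, for illustration only.
example = """>example_protein hypothetical record
MKWVTFISLL
FLFSSAYS
>example_dna hypothetical record
ATGGCGTACC
"""

sequences = parse_fasta(example)
for name, seq in sequences.items():
    print(name, len(seq), seq)
```

The parser keeps the whole header line as the record name; splitting it into an identifier and a description is left out for brevity.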
Comparison of Gene to other resources
Gene: collects key information on each
gene/protein from major databases. It covers all
major organisms.
UniGene: Database with information on where in
a body, when in development, and how
abundantly a transcript is expressed
HomoloGene: Gathers information on sets of
related proteins based on common genetic
ancestry.
HomoloGene Name Search: Oncomodulin
Provides a list of homologous (related) genes
HomoloGene Name Search: Oncomodulin
Shows conserved domains of protein sequences. If you click on the graphic, it
takes you to a summary of domain/family information.
ExPASy to access protein and DNA sequences
• ExPASy (Expert Protein Analysis System) sequence
retrieval system
• Visit http://www.expasy.ch/
• Similar to Entrez for NCBI
Example: Search for calmodulin
UniProt: a centralized protein database (uniprot.org)
This is separate from NCBI, but the two are interlinked.
UniProt: Calmodulin
• Search Results for bovine calmodulin (P62157)
Protein Secondary Structure: PDBSum (EMBL-EBI)
• http://www.ebi.ac.uk/pdbsum/
• Either enter a PDB file or load a new or existing sequence
ExPASy: vast proteomics resources (www.expasy.ch)
Genome Browsers
Genomic DNA is organized in chromosomes. Genome
browsers display ideograms (pictures) of chromosomes,
with user-selected “annotation tracks” that display many
kinds of information.
The two most essential human genome browsers are at
Ensembl and UCSC. We will focus on UCSC (but the two
are equally important). The browser at NCBI is not
commonly used.
Ensembl genome browser (www.ensembl.org)
Click human, then enter beta globin.
Ensembl output for beta globin includes views of chromosome 11 (top), the region
(middle), and a detailed view (bottom).
There are various horizontal annotation tracks.
The UCSC Genome Browser
• This browser’s focus is on humans and other eukaryotes
• you can select which tracks to display (and how much
information for each track)
• tracks are based on data generated by the UCSC team
and by the broad research community
• you can create “custom tracks” of your own data! Just
format a spreadsheet properly and upload it
• The Table Browser is equally important as the more visual
Genome Browser, and you can move between the two
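A custom track is a plain text table with one feature per row. The sketch below writes a minimal track in BED format, one of the formats the UCSC browser accepts: a `track` line followed by tab-separated chromosome, start, end, and name columns. The function name `make_custom_track`, the track name, and the coordinates are all invented for illustration and do not describe real genes.

```python
def make_custom_track(name, description, features):
    """Build custom-track text from (chrom, start, end, label) tuples."""
    lines = ['track name="{}" description="{}"'.format(name, description)]
    for chrom, start, end, label in features:
        if not 0 <= start < end:
            raise ValueError("start must be >= 0 and less than end")
        # BED columns are tab-separated; start positions are 0-based.
        lines.append("{}\t{}\t{}\t{}".format(chrom, start, end, label))
    return "\n".join(lines) + "\n"


# Made-up regions on chromosome 11, for illustration only.
my_features = [
    ("chr11", 1000, 2000, "region_A"),
    ("chr11", 5000, 5600, "region_B"),
]

track_text = make_custom_track("myTrack", "Example regions", my_features)
print(track_text)
```

The resulting text can be pasted into the browser's custom track page or saved to a file and uploaded.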
[1] Visit http://genome.ucsc.edu/, click Genome Browser
[2] Choose organisms, enter query (beta globin), hit submit
[4] On the UCSC Genome Browser:
--choose which tracks to display
Protein Databases
• What do they contain?
– Amino acid sequences
• Primary sequence
– Direct submissions - protein sequencing
– SWISS-PROT, PIR
• Secondary sequence
– Translations - putative proteins obtained by processing (e.g.
splicing out introns) and translating a nucleic acid sequence
– GenPept, TrEMBL
• Structure
– Protein Data Bank
– Annotations
• Function, domains, etc.
SWISS-PROT
• Created by Amos Bairoch in 1986 at the Department of
Medical Biochemistry in Geneva
• Maintained by the Swiss Institute of Bioinformatics (SIB) and
funded by GeneBio
• Few redundancies
• Direct submission (from sequencing, not translation)
PIR
• PIR (The Protein Information Resource) was created by M.O.
Dayhoff in 1965
• Maintained by many
• In 2004, joined with other databases (Swiss-Prot and TrEMBL)
to become part of the UniProt consortium
Protein Data Bank
• Archive of 3-D structural data of
biological macromolecules
• Based on experimental data
• Managed by the Research
Collaboratory for Structural
Bioinformatics (RCSB)
– Rutgers & UCSD
• As of January 11, 2012, contained
78,477 structures
• ~ 5000 membrane proteins
http://www.rcsb.org/pdb/statistics/contentGrowthChart.do?content=total&seqid=100
PDB: Source of protein sequence and structure data