bioninf1 - NDSU Computer Science


Introduction to Biology
Saleet Jafri
GMU
EUKARYA
BACTERIA
ARCHAEA
Cholesterol
Water seeking
head group
Presumed common progenitor of all extant organisms
water
fatty chains
Presumed common progenitor of archaebacteria
and eukaryotes
Cell wall (outer membrane)
Cell wall (inner membrane)
Ribosome
DNA
RNA
septum mesosome
DNA Nucleoid
Periplasmic space and cell wall
0.5 µm
|_____|
inner (plasma) membrane
Cell wall
Periplasmic space
Outer membrane
nucleoid
outer
Membrane inner membrane
Eukaryotic cell (organelles)
Nuclear membrane
Plasma (cell) membrane
Golgi vesicles
Mitochondrion
Peroxisome
lysosome
1 µm
Nucleus
|____|
Secretory vesicle
Rough endoplasmic reticulum
Cell membrane
Nucleus
Cytoplasm
Endoplasmic Reticulum (ER) – rough and smooth – A membranous organelle system in the cytoplasm.
The outer surface may be ribosome-studded (rough) or not (smooth).
Golgi apparatus – receives newly formed proteins from the ER, modifies them, and directs them to their final destination.
Mitochondria – respiratory centers, have their own circular DNA, of bacterial origin.
Chromosomes – chromatin, histones, centromeres and arms (2 pair in Eukaryotes).
Lysosomes – contain acid hydrolases – nucleases, proteases, glycosidases, lipases, phosphatases,
sulfatases, phospholipases.
Peroxisomes – use oxygen to remove hydrogen from substrates forming H2O2, abundant in kidney and
liver detoxification.
Cytoskeleton – an internal array of microtubules, microfilaments, and intermediate filaments that
gives a eukaryotic cell its shape and its ability to move.
Eukaryote
Membrane
Exterior
oligosaccharide
glycoprotein
glycolipid
leaflets
Phospholipid
bilayer
Hydrophobic core
Fatty acyl
tails
Integral protein
phospholipids
Peripheral proteins
Hydrophilic polar head
Interior
2 kinds of Nucleic Acids (RNA = ribonucleic acid and DNA = deoxyribonucleic acid)
Nucleic acid structure:
purines: adenine=A, guanine=G
pyrimidines: uracil=U, thymine=T, cytosine=C
A always pairs with T and C always pairs with G (each pair is called a base pair in double helix DNA)
DNA may consist of millions of base pairs
A short sequence (<100) is called an oligonucleotide
RNA: different sugar (ribose instead of 2’-deoxyribose)
Uracil (U) instead of thymine (U binds with A)
RNA is usually single-stranded and does not form a long, regular double helix like DNA
Protein = functional and structural units of the cell
Central Dogma: DNA → RNA → protein (flow of information is unidirectional)
Gene or DNA transcription
RNA molecules synthesized by RNA polymerase.
RNA polymerase binds very tightly to the promoter
region on DNA.
Promoter region contains start site.
Transcription ends at termination signal site.
Primary transcript: direct coding of DNA → RNA.
RNA splicing: introns removed to make mRNA.
mRNA has codon sequence that codes for a protein.
Uracil replaces thymine
Splicing and alternative splicing happens.
Translation
Transfer RNA (tRNA) makes connection between
specific codons in mRNA and amino acids.
As tRNA binds to the next codon in mRNA, its
amino acid is bound to the last amino acid in the
protein chain.
When a STOP codon is encountered, the ribosome
releases the mRNA and synthesis ends.
.------Chromosomal DNA (gene)--------------------------.
Promoter ||| exon1 ||| intron1 ||| exon2 ||| intron2 ||| exon3 |||
transcription
Nuclear RNA (primary transcript)
| exon1 | intron1 | exon2 | intron2 | exon3 |
splicing
mRNA
| exon1 | exon2 | exon3 |
tRNA links an amino acid to the codon on the mRNA
via the anti-codon.
rRNA = RNA found in ribosomes
Ribosomes = large and small subunit, made of
protein and rRNA
Initiator tRNA always carries methionine
Initiation factors = proteins catalyzing the start of translation
Endoplasmic reticulum
Post-transcriptional modification
Central Dogma
DNA (deoxyribonucleic acid) → RNA (ribonucleic acid) → Protein
Information stored in DNA is transferred
residue-by-residue to RNA, which in turn
transfers the information residue-by-residue
to protein.
Proposed by Francis Crick in 1958 to
describe the flow of information in a cell.
The Central Dogma was proposed by
Crick to help scientists think about
molecular biology.
It has undergone numerous revisions in
the past 45 years.
In eukaryotes, 1 mRNA = 1 protein. (In bacteria, 1 mRNA can
be polycistronic, or code for several proteins.)
DNA in eukaryotes forms a stable, compacted complex with
histones. (In bacteria, DNA is not in a permanently condensed state.)
Eukaryotic DNA contains large regions of repetitive DNA.
(In bacteria, DNA rarely contains any "extra" DNA.)
Much of eukaryotic DNA does not code for proteins (~98% is
non-coding in humans; in bacteria, often less than 5% of the genome).
Sometimes, eukaryotes can use controlled gene rearrangement for
increasing the number of specific genes. (In bacteria, this happens rarely.)
Eukaryotic genes are split into exons and introns.
(In bacteria, genes are almost never split.)
In eukaryotes, mRNA is synthesized in the nucleus, then processed
and exported to the cytoplasm. (In bacteria, transcription and
translation can take place simultaneously off the same piece of DNA.)
The concept of a gene is historically defined on the basis of genetic inheritance of a phenotype (Mendelian inheritance).
The DNA of an organism encodes its genetic information. It is a double-stranded helix with a deoxyribose sugar-phosphate backbone carrying four bases:
Adenine (A), Cytosine (C), Guanine (G) and Thymine (T).
[Note that only 4 values need be encoded (A, C, G, T), which can be done using 2 bits. But to allow ambiguous letter combinations
(like N, meaning any of the 4 nucleotides), one usually resorts to a 4-bit alphabet.]
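The 2-bit versus 4-bit remark above can be made concrete with a small sketch. The bit assignments below (A=1, C=2, G=4, T=8 for the 4-bit code) are an illustrative assumption, not the layout of any particular file format, and only three ambiguity letters are shown.

```python
# 2-bit codes: enough for the four unambiguous bases.
TWO_BIT = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

# 4-bit codes (assumed layout): one bit per base, so an ambiguity
# letter is the bitwise OR of the bases it can stand for.
FOUR_BIT = {"A": 1, "C": 2, "G": 4, "T": 8}
FOUR_BIT.update({
    "R": 1 | 4,          # purine: A or G
    "Y": 2 | 8,          # pyrimidine: C or T
    "N": 1 | 2 | 4 | 8,  # any of the 4 nucleotides
})


def pack_2bit(seq):
    """Pack an ACGT-only sequence into one integer, 2 bits per base."""
    value = 0
    for base in seq.upper():
        value = (value << 2) | TWO_BIT[base]
    return value


def unpack_2bit(value, length):
    """Inverse of pack_2bit; the length is needed because leading A's are zero bits."""
    bases = "ACGT"
    out = []
    for _ in range(length):
        out.append(bases[value & 0b11])
        value >>= 2
    return "".join(reversed(out))


def compatible(symbol_a, symbol_b):
    """True if two (possibly ambiguous) symbols share at least one base."""
    return (FOUR_BIT[symbol_a] & FOUR_BIT[symbol_b]) != 0


packed = pack_2bit("ACGT")
restored = unpack_2bit(packed, 4)
```

With the 4-bit code, matching an ambiguous letter against a base is a single AND operation, which is the practical reason for spending the extra two bits.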
DNA: terminology
[Figure: nucleotide structure. Labels: base: thymine (pyrimidine); base: adenine (purine); sugar: 2’-deoxyribose (no 2’-hydroxyl), carbons 1’ to 5’; 5’ linkage and 3’ linkage (chain runs 5’ to 3’); monophosphate.]
nucleoside = base + sugar
nucleotides = base + sugar + phosphate(s) (nucleoside mono-, di-, and triphosphates)
DNA: structure
DNA is double stranded
DNA strands are antiparallel
G-C pairs have 3 hydrogen bonds
A-T pairs have 2 hydrogen bonds
One strand is the complement of the
other
Major and minor grooves present
different surfaces
Cellular DNA is almost exclusively B-DNA
B-DNA has ~10.5 bp/turn of the helix
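As a rough illustration of the numbers on this slide, the sketch below counts the hydrogen bonds in a duplex (3 per G-C pair, 2 per A-T pair) and estimates the number of helical turns at ~10.5 bp/turn. The function name is hypothetical and the input is assumed to contain only A, C, G and T.

```python
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}  # hydrogen bonds per base pair
BP_PER_TURN = 10.5                           # approximate value for B-DNA


def duplex_stats(strand):
    """Return (hydrogen bonds, G+C fraction, helical turns) for one strand of a duplex."""
    strand = strand.upper()
    hydrogen_bonds = sum(H_BONDS[base] for base in strand)
    gc_fraction = sum(base in "GC" for base in strand) / len(strand)
    helix_turns = len(strand) / BP_PER_TURN
    return hydrogen_bonds, gc_fraction, helix_turns


bonds, gc, turns = duplex_stats("ATGAGTAACGCG")
```

Only one strand is needed as input because the other strand is its complement, so each base determines its pair.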
RNA: terminology
Base        Nucleoside (RNA)      Deoxynucleoside (DNA)
Adenine     Adenosine             Deoxyadenosine
Guanine     Guanosine             Deoxyguanosine
Cytosine    Cytidine              Deoxycytidine
Uracil      Uridine               (not usually found)
Thymine     (not usually found)   (Deoxy)thymidine
nucleoside = base + sugar
RNA can be single or double
stranded
G-C pairs have 3 hydrogen bonds
A-U pairs have 2 hydrogen bonds
Single-stranded, double-stranded,
and loop RNA present
different surfaces
carboxyl group
Protein
amino group
20 amino acids
Peptide bond
Protein structure
α-helix
antiparallel β-sheet
The Central Dogma
(gene)
ATGAGTAACGCG
TACTCATTGCGC
Replication
duplication of DNA using DNA as the template
ATGAGTAACGCG
TACTCATTGCGC
DNA
+
(nontemplate, sense) ATGAGTAACGCG
(template, antisense) TACTCATTGCGC
Transcription
synthesis of RNA using DNA as the template
(mRNA) AUGAGUAACGCG
codon
RNA
tRNA
ribosomes
Translation
(protein) MetSerAsnAla
synthesis of proteins using RNA as the template
Protein
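The replication, transcription and translation steps on this slide can be replayed on the same 12-base example. This is a minimal sketch using the standard genetic code; the function names are illustrative, and the complement is returned position by position (so it reads 3’ to 5’, as drawn on the slide).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}


def complement(dna):
    """Base-by-base complement (not reversed)."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))


def transcribe(nontemplate_strand):
    """The mRNA has the sequence of the nontemplate strand, with U for T."""
    return nontemplate_strand.replace("T", "U")


def translate(mrna):
    """Read codons from position 1 until a stop codon or the end of the sequence."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


gene = "ATGAGTAACGCG"
template = complement(gene)
mrna = transcribe(gene)
protein = translate(mrna)
protein_three_letter = "".join(THREE_LETTER[aa] for aa in protein)
```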
The Central Dogma
DNA
Replication; Repair and recombination
1. DNA pol and
2. DNA pol and
Transcription
1. RNA pol I - ribosomal RNA (rRNA)
2. RNA pol II - messenger RNA (mRNA)
3. RNA pol III - 5S rRNA, snRNA, tRNA
RNA
RNA processing
1. mRNA splicing
2. rRNA and tRNA processing
3. capping and polyadenylation
Translation
Protein
Post-translational modification
1. phosphorylation
2. methylation
3. ubiquitination
Compartmentalization of processes
(transport is important)
replication
Splicing out introns?
Regulation occurs at each step of a process
1. Initiation (starting)
-what is the signal that initiates the process?
-what are the factors involved in initiation (cis- and trans-acting)?
2. Elongation (continuation)
-how is the process maintained with high fidelity once initiated?
-what are the factors involved in elongation (cis- and trans-acting)?
3. Termination (ending)
-what is the signal that stops the process?
-what are the factors involved in termination (cis- and trans-acting)?
Other general regulatory considerations
1. How is the rate of a process regulated?
2. How are the steps regulated in a cell-, tissue-, or gene-specific manner?
3. Stability of biomolecules
4. Cellular localization of biomolecules
Exceptions to the Central Dogma (with related Nobel Prizes)
[Diagram labels: DNA, RNA, Protein]
• Retroviruses use reverse transcriptase to replicate their genome (David Baltimore and Howard Temin)
• mRNA introns (splicing) (Philip Sharp and Richard Roberts)
• RNA editing (deamination of cytosine to yield uracil in mRNA)
• RNA interference (RNAi), a mechanism of post-transcriptional gene silencing utilizing double-stranded RNA
• RNAs (ribozymes) can catalyze an enzymatic reaction (Thomas Cech and Sidney Altman)
• Epigenetic marks, such as patterns of DNA methylation, can be inherited and provide information other than the DNA sequence
• RNA viruses
• Prions are heritable proteins responsible for neurological infectious diseases (e.g. scrapie and mad cow) (Stanley Prusiner)
The Flow of Biotechnology
Information
Gene
> DNA sequence
AATTCATGAAAATCGTATACTGGTCTGGTACCGGCAACAC
TGAGAAAATGGCAGAGCTCATCGCTAAAGGTATCATCGAA
TCTGGTAAAGACGTCAACACCATCAACGTGTCTGACGTTA
ACATCGATGAACTGCTGAACGAAGATATCCTGATCCTGGG
TTGCTCTGCCATGGGCGATGAAGTTCTCGAGGAAAGCGAA
TTTGAACCGTTCATCGAAGAGATCTCTACCAAAATCTCTG
GTAAGAAGGTTGCGCTGTTCGGTTCTTACGGTTGGGGCGA
CGGTAAGTGGATGCGTGACTTCGAAGAACGTATGAACGGC
TACGGTTGCGTTGTTGTTGAGACCCCGCTGATCGTTCAGA
ACGAGCCGGACGAAGCTGAGCAGGACTGCATCGAATTTGG
TAAGAAGATCGCGAACATCTAGTAGA
Function
> Protein sequence
MKIVYWSGTGNTEKMAELIAKGIIESGKDVNTINVSDVNI
DELLNEDILILGCSAMGDEVLEESEFEPFIEEISTKISGK
KVALFGSYGWGDGKWMRDFEERMNGYGCVVVETPLIVQNE
PDEAEQDCIEFGKKIANI
Upstream (5’)
promoter
TAC
Gene region
Downstream (3’)
DNA
Transcription (gene is encoded on minus strand
.. And the reverse complement is read into mRNA)
ATG
5´ UTR
Prokaryotes
mRNA
3´ UTR
CoDing Sequence (CDS)
ATG
Translation: tRNA reads off each codon (3 bases at a time)
starting at start codon until it reaches a STOP codon.
(intronless protein
coding genes)
protein
Why does nature bother with mRNA? Why would the cell want to have an intermediate between
DNA and the proteins it encodes?
•Gene information can be amplified by having many copies of an RNA made from one copy of DNA.
•Regulation of gene expression can be effected by having specific controls at each element of the
pathway between DNA and proteins. The more elements there are in the pathway, the more
opportunities there are to control it in different circumstances.
•In Eukaryotes, DNA can then stay pristine and protected, away from caustic chemistry of cytoplasm.
Prokaryote (operon structure): sometimes genes that are part of the same operational pathway are grouped together under
a single promoter and transcribed as one polycistronic mRNA, from which the separate proteins (3 in this example) are translated.
upstream promoter
downstream
Gene 1
Gene 2
Gene 3
Genetic Code: How does an mRNA specify an amino acid seq?
It would be impossible for each amino acid to be specified by one
nucleotide, because there are only 4 nucleotides and 20 amino acids.
2 nucs could specify 16; 3 ~ 64.
Each amino acid is specified by up to 6 different combos of 3
nucleotides, called codons, each coding for one amino acid.
1st codon is START, and usually coincides with Methionine (M,
which has codon code ‘ATG’).
Last codon is STOP, and does NOT code for an amino acid. It is
sometimes represented by ‘*’.
CoDing region (CDS) starts at START codon and ends at STOP.
Different organisms have different frequencies of codon usage.
A handful of species vary from this codon association and use different
codons for different amino acids.
How do tRNAs recognize which codon should receive which amino acid?
tRNA has an anticodon on its mRNA binding end, complementary to the
codon on the mRNA. Each tRNA only binds the appropriate amino acid
for its anticodon.
Bacterial Gene Structure of signals
Bacterial genomes have simple gene structure.
- Transcription factor binding site.
- Promoters
-35 sequence (T82T84G78A65C54A45) 15-20 bases
-10 sequence (T80A95T45A60A50T96) 5-9 bases
- Start of transcription: initiation start: Purine90
(sometimes it’s the “A” in CAT)
- Translation binding site (Shine-Dalgarno)
10 bp upstream of AUG (AGGAGG)
- One or more Open Reading Frames
- start-codon (unless sequence is partial)
- until next in-frame stop codon on that strand ..
Separated by intercistronic sequences.
- Termination
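The counting argument in the Genetic Code slide (4, 16 or 64 combinations, and up to 6 codons per amino acid) can be checked directly against the standard genetic code. This is a small sketch; the variable names are illustrative.

```python
from collections import Counter
from itertools import product

# Number of distinct "words" of length n over a 4-letter alphabet.
combinations = {n: 4 ** n for n in (1, 2, 3)}  # {1: 4, 2: 16, 3: 64}

# Standard genetic code, codons enumerated in T, C, A, G order; '*' marks STOP.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# How many codons specify each amino acid (the degeneracy of the code).
degeneracy = Counter(CODON_TABLE.values())
amino_acids = sorted(aa for aa in degeneracy if aa != "*")
most_degenerate = sorted(aa for aa in amino_acids if degeneracy[aa] == 6)
```

One or two nucleotides give too few combinations for 20 amino acids, while three give 64, which leaves room for the redundancy and for the STOP codons.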
RNA
RNA has the same primary structure as DNA. It
consists of a sugar-phosphate backbone,
with bases attached to the 1' carbon of
the sugar. The DNA/RNA differences are:
RNA has a hydroxyl group on the 2' carbon of the
sugar (thus the difference between
deoxyribonucleic acid and ribonucleic acid).
Instead of the base thymine, RNA
uses another base called uracil.
Because of the extra hydroxyl group on the sugar,
RNA cannot adopt the B-form double helix of
DNA. RNA usually exists as a single-stranded
molecule. However, regions of double helix
can form where there is some base pair
complementation (U and A, G and C),
resulting in hairpin loops. The RNA
molecule with its hairpin loops is said to
have a secondary structure.
Because the RNA molecule is not restricted to a
rigid double helix, it can form many
different stable three-dimensional tertiary
structures.
tRNA (transfer RNA) is a small RNA that has a
very specific secondary and tertiary structure such that
it can bind an amino acid at one end, and mRNA at the
other end. It acts as an adaptor to carry the amino acid
elements of a protein to the appropriate place as coded
for by the mRNA.
3-D Tertiary
structure
tRNA Secondary
structure
Bacterial Gene Prediction
Most of the consensus sequences are known from E. coli studies, so for each bacterium the exact distribution of
consensus sequences will differ.
Most modern gene prediction programs need to be “trained”, i.e. they find their own consensus and
assembly rules given a few example genes.
A few programs find their own rules from a completely unannotated bacterial genome by trying to find
conserved patterns. This is feasible because ORFs restrict the search space of possible gene candidates.
E.g. selfid program([email protected])
OPEN READING FRAME: On a given piece of DNA, there can
be 6 possible frames. The ORF can be either on + or minus strand
and on any of 3 possible frames
Frame 1: 1st base of start codon can either start at base 1,4,7,10,...
Frame 2: 1st base of start codon can either start at base 2,5,8,11,...
Frame 3: 1st base of start codon can either start at base 3,6,9,12,...
(frame –1,-2,-3 are on minus strand)
Some progs have other conventions for naming frames (0..5, 1-6..)
Gene finding in eukaryotic cDNA uses ORF finding +blastx as well.
http://www.ncbi.nlm.nih.gov/gorf/gorf.html
try with gi=41 ( or your own piece of DNA)
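The six-frame idea can be sketched as a naive ORF scan: for each strand and each of the 3 offsets, read codons from an ATG to the next in-frame stop. This is only an illustration, not how the NCBI ORF finder works. The function name, the `min_codons` parameter and the +1..+3 / -1..-3 frame labels are choices made for this example, and coordinates on the minus strand are counted along the reverse complement.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def find_orfs(dna, min_codons=2):
    """Return (frame, start, end, sequence) for each ATG...stop ORF.

    start and end are 1-based positions on the strand being read;
    the stop codon is included in the reported sequence.
    """
    dna = dna.upper()
    orfs = []
    for sign, strand in ((1, dna), (-1, reverse_complement(dna))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i                      # first ATG opens the ORF
                elif start is not None and codon in STOP_CODONS:
                    if (i + 3 - start) // 3 >= min_codons:
                        orfs.append((sign * (offset + 1), start + 1, i + 3,
                                     strand[start:i + 3]))
                    start = None                   # look for the next ORF
    return orfs


plus_hits = find_orfs("CATGAAATAAG")   # ORF in frame 2 of the plus strand
minus_hits = find_orfs("TTATTTCAT")    # same ORF, encoded on the minus strand
```

An ORF that runs off the end of the sequence without a stop codon is not reported here, which is one reason real tools treat partial sequences specially.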
Eukaryotic Central Dogma
In eukaryotes (cells where the DNA is sequestered in a separate nucleus),
the DNA does not contain a contiguous copy of the coding gene; rather, exons must be spliced together.
(Many eukaryotic genes contain no introns, particularly in ‘lower’ organisms.)
mRNA – (messenger RNA) Contains the assembled copy of the gene. The mRNA acts as a
messenger to carry the information stored in the DNA in the nucleus to the cytoplasm,
where the ribosomes can make it into protein.
Eukaryotic Nuclear Gene Structure
Gene prediction for Pol II transcribed genes.
• Upstream Enhancer elements.
• Upstream Promoter elements.
• GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
• TATA promoter (-30 nt) (70%, 15 nt
consensus; Bucher et al. (1990))
• 14-20 nt spacer DNA
• CAP site (8 bp)
• Transcription Initiation.
• Transcript region, interrupted by introns.
• Translation Initiation (Kozak signal, 12 bp
consensus) 6 bp prior to initiation codon.
• polyA signal (AATAAA 99%, other)
Introns
• The transcript region is interrupted by introns. Each
intron:
• starts with a donor site consensus
(G100T100A62A68G84T63..)
• has a branch site near the 3’ end of the intron
(one not very conserved consensus:
UACUAAC)
• ends with an acceptor site consensus
(12Py..NC65A100G100)
[Figure: intron with branch site UACUAAC and 3’ AG]
Exons
• The exons of the transcript region are
composed of:
• 5’ UTR (mean length of 769 bp, with a
specific base composition that
depends on local G+C content of
genome)
• AUG (or other start codon)
• Remainder of coding region
• Stop Codon
• 3’ UTR (mean length of 457, with a
specific base composition that
depends on local G+C content of
genome)
Structure of the Eukaryotic Genome
~6-12% of human DNA encodes
proteins(higher fraction in
nematode)
~10% of human DNA codes for
UTR
~90% of human DNA is noncoding.
Non-Coding Eukaryotic DNA
• Untranslated regions (UTRs)
• introns (there can be genes within
introns of another gene!)
• intergenic regions
- repetitive elements
- pseudogenes (dead
genes that may (or may not)
have been retroposed back into the
genome as a single-exon “gene”)
Pseudogenes
Pseudogenes:
DNA sequence that might code for a
gene, but that is unable to result in a protein.
This deficiency might be in transcription (lack of
promoter, for example) or in translation or both.
Processed pseudogenes:
Gene retroposed back into the genome
after being processed by the splicing apparatus.
Thus it is fully spliced and has a polyA tail.
The insertion process flanks the mRNA sequence with
short direct repeats.
Thus no promoters, unless it is accidentally
retroposed downstream of a promoter
sequence.
Do not confuse with single-exon genes.
Repeats
Each repeat family has many subfamilies.
- ALU: ~300 nt long; 600,000 elements in the human
genome. Can cause false homology with mRNA.
Many have an AluI restriction site.
- Retroposons (can get copied back into the
genome)
- Telltale sign: direct or inverted repeats flank
the repeated element. That repeat was the
priming site for the RNA that was inserted.
LINEs (Long INterspersed Elements)
L1: 1-7 kb long, 50000 copies
Have two ORFs!!!!! Will cause problems
for gene prediction programs.
SINEs (Short INterspersed Elements)
Low-Complexity Elements
• When analyzing sequences, one often relies on the
fact that two stretches are similar to infer that they
are homologous (and therefore related). But
sequences with repeated patterns will match
without there being any phylogenetic relation!
• Sequences like ATATATACTTATATA, which are
mostly two letters, are called low-complexity.
• Triplet repeats (particularly CAG) have a tendency
to make the replication machinery stutter, so they
are amplified.
• The low-complexity sequence can also be hidden
at the translated protein level.
Masking
• To avoid finding spurious matches in alignment programs, you
should always mask repeats and low-complexity regions in the query sequence.
• Before predicting genes it is a good idea to mask out repeats (at
least those containing ORFs).
• Before running blastn against a genomic record, you must mask
out the repeats.
•Most used Programs:
CENSOR:
Repeat Masker:
http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
More Non-Protein genes
rRNA - ribosomal RNA
is one of the structural components of the ribosome. It has sequence
complementarity to regions of the mRNA so that the ribosome knows where to
bind to an mRNA it needs to make protein from.
snRNA - small nuclear RNA
is involved in the machinery that processes RNA's as they travel between the
nucleus and the cytoplasm.
hnRNA – heterogeneous nuclear RNA
is the pool of primary transcripts (pre-mRNA) found in the nucleus before processing.
Protein Processing & localization.
The protein as read off from the mRNA may not be in the final
form that will be used in the cell. Some proteins contain:
• Signal peptide (located at the N-terminus (beginning)): this signal
peptide is used to guide the protein towards its
final cellular localization. The signal peptide is cleaved off at
the cleavage site once the protein has reached (or is near) its final
destination.
• Various post-translational modifications (e.g. phosphorylation)
The final protein is called the “mature peptide”.
Convention for nucleotides in database
Because the mRNA is actually read off the minus strand
of the DNA, the nucleotide sequences are always quoted
on the minus strand.
In bioinformatics the sequence format does NOT make a
difference between Uracil and Thymine. There is no
symbol for Uracil; it is always represented by a ‘T’.
Even genomic sequence follows that convention. A gene
on the ‘plus’ strand is quoted so that it is in the same
strand as its product mRNA.
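A small sketch of what this convention means in practice: RNA is written with T instead of U, and a gene on the opposite strand of a genomic record is reverse-complemented so that it reads like its mRNA. The function names, the 1-based inclusive coordinates and the made-up 16-base genomic fragment are assumptions of this example.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def to_database_alphabet(seq):
    """Write a sequence the way databases store it: upper case, T instead of U."""
    return seq.upper().replace("U", "T")


def mrna_like(genomic, start, end, strand):
    """Return the region start..end (1-based, inclusive) in mRNA orientation."""
    region = genomic[start - 1:end].upper()
    if strand == "-":
        region = region.translate(_COMPLEMENT)[::-1]  # reverse complement
    return region


stored = to_database_alphabet("AUGAGUAACGCG")

# Hypothetical genomic fragment: the gene occupies bases 3..14 on the other strand.
genomic = "GGCGCGTTACTCATAA"
quoted = mrna_like(genomic, 3, 14, "-")
```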
Protein Engineering
Change DNA Sequence → Change RNA Sequence → Change Amino Acid Sequence
mRNA Reading Direction Corresponds to
Protein Chemical Directionality
[Figure: mRNA 5’ → 3’ corresponds to protein NH2-terminus → COOH-terminus]
Backbone Torsion Angles
Determine Secondary Structure
Protein Tertiary Structure Tied to Function
Biomolecular Energetics
Electrostatic Interactions (figure labels: COO-, +H3N)
Hydrophobic/van der Waals Interactions (figure labels: CH3, H3C)
Hydrogen Bonding Interactions (figure labels: OH, N)
Biology Information on the Internet
• Introduction to Databases
• Searching the Internet for Biology
Information.
– General Search methods
– Biology Web sites
• Introduction to Genbank file format.
• Introduction to Entrez and Pubmed
• Ref: Chapters 1,2,5,6 of “Bioinformatics”
• Databases:
– A collection of Records.
– Each record has many fields.
– Each field contains specific information.
– Each field has a data type.
» E.g. money, currency, text field, integer,
date, address (text field), citation (text field)
– Each record has a primary key: a UNIQUE
identifier that unambiguously defines this
record.
Spread-sheet: flat-file version of a database.
gi        Accession   version  date      Genbank Division  taxid  organism      Number of Chromosomes
6226959   NM_000014   3        06/01/00  PRI               9606   Homo sapiens  22 diploid + X+Y
6226762   NM_000014   2        10/12/99  PRI               9606   Homo sapiens  22 diploid + X+Y
4557224   NM_000014   1        02/04/99  PRI               9606   Homo sapiens  22 diploid + X+Y
41        X63129      1        06/06/96  MAM               9913   Bos taurus    29+X+Y
gi        Accession   version  date        Genbank Division  taxid  organism      Number of Chromosomes
6226959   NM_000014   3        01/06/2000  PRI               9606   Homo sapiens  22 diploid + X+Y
6226762   NM_000014   2        12/10/1999  PRI               9606   Homo sapiens  22 diploid + X+Y
4557224   NM_000014   1        04/02/1999  PRI               9606   Homo sapiens  22 diploid + X+Y
41        X63129      1        06/06/1996  MAM               9913   Bos taurus    29+X+Y
Gi = Genbank Identifier: Unique Key : Primary Key
GI Changes with each update of the sequence
record.
Accession Number: Secondary key: Points to same locus and sequence
despite sequence updates.
Accession + Version Number equivalent to Gi
Relational Database (Normalizing a database for repeated subelements of a database.. Splitting it into smaller databases, relating
the sub-databases to the first one using the primary key.)
gi        Accession   version  date        Genbank Division  taxid
6226959   NM_000014   3        01/06/2000  PRI               9606
6226762   NM_000014   2        12/10/1999  PRI               9606
4557224   NM_000014   1        04/02/1999  PRI               9606
41        X63129      1        06/06/1996  MAM               9913

taxid  organism      Number of Chromosomes
9606   Homo sapiens  22 diploid + X+Y
9913   Bos taurus    29+X+Y
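The normalization above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table and column names (`sequence`, `organism`, `chromosomes`, ...) are made up for the example and are not NCBI's schema. The organism data is stored once per taxid, and a join on taxid rebuilds the original flat table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organism (
    taxid       INTEGER PRIMARY KEY,
    name        TEXT,
    chromosomes TEXT
);
CREATE TABLE sequence (
    gi        INTEGER PRIMARY KEY,
    accession TEXT,
    version   INTEGER,
    date      TEXT,
    division  TEXT,
    taxid     INTEGER REFERENCES organism(taxid)
);
""")

conn.executemany("INSERT INTO organism VALUES (?, ?, ?)", [
    (9606, "Homo sapiens", "22 diploid + X+Y"),
    (9913, "Bos taurus", "29+X+Y"),
])
conn.executemany("INSERT INTO sequence VALUES (?, ?, ?, ?, ?, ?)", [
    (6226959, "NM_000014", 3, "01/06/2000", "PRI", 9606),
    (6226762, "NM_000014", 2, "12/10/1999", "PRI", 9606),
    (4557224, "NM_000014", 1, "04/02/1999", "PRI", 9606),
    (41, "X63129", 1, "06/06/1996", "MAM", 9913),
])

# Relate the two tables through the shared key to recover the flat view.
rows = conn.execute("""
    SELECT s.gi, s.accession, s.version, o.name, o.chromosomes
    FROM sequence AS s
    JOIN organism AS o ON o.taxid = s.taxid
    ORDER BY s.gi
""").fetchall()
```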
Types of Relational databases.
• The Internet can be thought of as one
enormous relational database.
– The “links”/URLs are the primary keys.
• SQL (Structured Query Language)
– Sybase; Oracle; Access (database systems)
• Sybase used at NCBI.
– SRS (one type of database querying system in
use in Biology)
Indexed searches.
• To allow easy searching of a database, make
an index.
• An index is a list of primary keys
corresponding to a key in a given field (or to
a collection of fields)
Genbank division   gi
PRI                6226959;6226762;4557224;…
MAM                41;…
Accession          gi
NM_000014          6226959;6226762;4557224;
X63129             41;
Indexed searches.
• Boolean Query: Merging and Intersecting lists:
– AND (in both lists) (e.g. human AND genome)
– +human +genome
– human && genome
– OR (in either lists) (e.g. human OR genome)
– human || genome
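The index and the Boolean merging described above can be sketched with Python dictionaries and sets: an index maps each value of a field to the set of primary keys (gi) that carry it, AND is a set intersection and OR is a set union. The record layout and function name are illustrative only.

```python
from collections import defaultdict

records = [
    {"gi": 6226959, "accession": "NM_000014", "division": "PRI"},
    {"gi": 6226762, "accession": "NM_000014", "division": "PRI"},
    {"gi": 4557224, "accession": "NM_000014", "division": "PRI"},
    {"gi": 41, "accession": "X63129", "division": "MAM"},
]


def build_index(records, field):
    """Map each value of `field` to the set of primary keys having that value."""
    index = defaultdict(set)
    for record in records:
        index[record[field]].add(record["gi"])
    return index


by_division = build_index(records, "division")
by_accession = build_index(records, "accession")

# AND: keys present in both lists.
pri_and_nm = by_division["PRI"] & by_accession["NM_000014"]
# OR: keys present in either list.
mam_or_nm = by_division["MAM"] | by_accession["NM_000014"]
```

The point of the index is that a query never has to scan the records themselves, only merge short lists of keys.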
Search strategies
• Search engines use complex strategies that go
beyond Boolean queries.
– Phrases matching:
• human genome -> “human genome”
– togetherness: documents with human close to genome
are scored higher.
– Term expansion & synonyms:
• human -> homo sapiens
– neighbours:
– human genome -> genome projects, chromosomes, genetics
– Frequency of links (www.google.com)
• To avoid these term mapping, enclose your queries in quotes:
“human” AND “genome”
• To require that ALL the terms in your query be important,
precede them with a “+” . This also prevents term
mapping.
• To force the order of the words to be important, group
sentences within strings. “biology of mammals”.
Indexed searches.
Example
• find the advanced query page at
http://www.altavista.com
• type human (and hit the Search button)
• Type genome:
• type human AND genome
• type “human genome” (finds the least matches)
• type human OR genome (finds the most matches)
• Search Engines:
– Web Spiders: Collection of All web pages, but
since Web pages change all the time and new
ones appear, they must constantly roam the web
and re-index.. Or depend on people submitting
their own pages.
• www.google.com (BEST!)
• www.infoseek.com
• www.lycos.com
• www.exite.com
• www.webcrawler.com
• www.looksmart.com (country specific)
• Search Engines:
• www.google.com (BEST!)
• Google ranks pages according to how many pages with those
terms refer to the pages you are asking for. Not only must one
document contain ALL the search terms, but other documents
which refer to this one must also contain all the terms.
• Great when you know what you are looking for! You can also
use “” to require immediate proximity and order of terms.
• E.g. type
» Web server for the blast program.
But google only indexes about 40% of the web.. So you may
have to use other web spiders.
(disclaimer.. I don’t own stock in that company.. But I’d like to)
• Search Engines:
– Curated Collections: Not comprehensive:
Contains list of best sites for commonly
requested topics, but is missing important sites
for more specialized topics (like biology)
• www.yahoo.com (Has travel maps too!)
– Answer-based curated collections: Easy to
use english-like queries. First looks at list of
predefined answers, then refines answers based
on user interaction. Also answer new questions.
• www.askjeeves.com
• www.magellan.com
• www.altavista.com (has translation TOOLS)
• www.hotbot.com
• Search Engines:
– Meta-Search Engines: Polls several search
engines, and returns the consensus of all results.
Is likely to miss sites, but the sites it returns are
very relevant to the query.
– Other operating mode is to return the sum of all
the results. It then becomes very sensitive to a
very detailed query.
• www.metacrawler.com
• www.savvysearch.com
• www.1blink.com (fast)
• www.metafind.com
• www.dogpile.com
• Virtual Libraries: Curated collections of
links for Biologists.(by Biologists)
– Pedro’s BioMolecular Research Tools:(1996)
• http://www.public.iastate.edu/~pedro/
– Virtual Library: Bio Sciences
• http://vlib.org/Biosciences.html
– Publications and abstract search.
• http://www.ncbi.nlm.nih.gov/
– Expasy server
• http://www.expasy.ch
– EBI Biocatalog (software & databases list)
• http://www.ebi.ac.uk/biocat/
Biological Databases
• Nucleotide databases:
– Genbank: International Collaboration
• NCBI(USA), EMBL(Europe), DDBJ (Japan and Asia)
• A “bank”: no curation. Submission to these databases is
required for publication in a journal.
– Organism specific databases (Exercise: Find URLs
using search engines)
• FlyBase
• ChickGBASE
• pigbase
• wormpep
• YPD (Yeast Protein Database)
• SGD (Saccharomyces Genome Database)
• Protein Databases:
– NCBI:
– Swiss Prot:(Free for academic use, otherwise
commercial. Licensing restrictions on discoveries made
using the DB. 1998 version free of any licensing)
• http://www.expasy.ch(latest pay version)
• NCBI has the latest free version.
• Translated Proteins from Genbank Submissions
– EMBL
• TrEMBL is a computer-annotated supplement of SWISS-PROT
that contains all the translations of EMBL nucleotide sequence
entries not yet integrated in SWISS-PROT
– PIR
• Structure databases:
– PDB: Protein structure database.
• Http://www.rscb.org/pdb/
– MMDB: NCBI’s version of PDB with entrez
links.
• Http://www.ncbi.nlm.nih.gov
• Genome Mapping Information:
– http://www.il-st-acad-sci.org/health/genebase.html
– NCBI(Human)
– Genome Centers:
• Stanford, Washington University, Stanford
– Research Centers and Universities
• Literature databases:
– NCBI: Pubmed: All biomedical literature.
• Www.ncbi.nlm.nih.gov
• Abstracts and links to publisher sites for
– full text retrieval/ordering
– journal browsing.
– Publisher web sites.
– Biomednet: Commercial site for literature
search.
• Pathways Database:
– KEGG: Kyoto Encyclopedia of Genes and
Genomes: www.genome.ad.jp/kegg/kegg/html
• Database Identifiers: Primary keys
– GI (changes with each sequence update for
NCBI only)
• Annotation may change without the gi changing!
– Accession (stable)
– version (changes with each sequence update)
– “Version” also refers to Accession.version
– Secondary accession: Records may have been
merged in the past, so the records which were
not chosen as the primary were made
secondary.
Primary Databases
• A primary Database is a repository of data
derived from experiments or from research
knowledge.
– Genbank (Nucleotide repository)
– Protein DB, Swissprot
– PDB (MMDB) are primary databases.
– Pubmed (literature)
– Genome Mapping databases.
– Kegg Database (pathways)
Secondary Databases
• A secondary database contains information
derived from other sources.
– Refseq (Curated collection of Genbank at
NCBI)
– Unigene (Clustering of ESTs at NCBI)
• Organism-specific databases are often a mix
between primary and secondary.
Genbank Records
• A Bank: No attempt at reconciliation.
• Submit a sequence → Get an Accession Number!
– Cannot modify sequences without submitter’s consent.
– No attempt at reconciliation.(not a unique collection per
LOCUS/gene)
– Entries of various sequence quality and different
sources==> Separate in various divisions based on
• High Quality sequences in taxon specific divisions.
• Low Quality sequences in Usage specific databases.
• A Collaboration between NCBI, EMBL and
DDBJ. They contain (nearly) the same
information, only the data format differs.
EMBL does not differentiate between the different types of RNA
records, while NCBI (and DDBJ) do. In Entrez EMBL records are
patched up to add that information.
Refseq and LocusLink
• Attempt to produce 1 mRNA, 1 protein, and
1 genomic gene for each frequently
occurring allele of a protein-expressing gene.
• www.ncbi.nlm.nih.gov/LocusLink
• Special non-genbank Accession numbers
– NM_nnnnnn mRNA refseq
– NP_nnnnnn protein refseq
– NC_nnnnnn refseq genomic contig
– NT_nnnnnn temporary genomic contig
– NX_nnnnnn predicted gene
Genbank divisions
Sequences in genbank are split into various categories based
on
1) The quality and type of sequences
2) The high quality nucleotide sequences are divided into
organism-dependent divisions.
• Genbank Entry type: (and query to restrict to that
field)
– mRNA (1/10000 errors)
• biomol_mRNA [PROP]
– cDNA (EST, 95-99% accuracy, single pass )
• gbdiv_EST [PROP]
– genomic ( biomol_genomic [PROP])
• in HTGS division: >99% accuracy;
– gbdiv_HTG [PROP]
• GSS(low-quality genome survey sequences)
– gbdiv_GSS [PROP]
• rest of Genbank; 1/10000 accuracy.
– Human gbdiv_PRI [PROP]
– mouse gbdiv_ROD [PROP]
– bovine gbdiv_MAM [PROP]
– STS(EST or cDNA used in mapping)
• gbdiv_STS [PROP]
FASTA Format
MOST important
data format!!!
>identifier descriptive text
nucleotide or amino-acid
sequence on multiple lines if needed.
Example:
>gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
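Reading this format takes only a few lines of code. A minimal Python sketch, assuming the layout described above (a ">" line holding the identifier and descriptive text, followed by one or more sequence lines); the function names are made up for this example and the trailing "…." of the slide is left out of the test sequence.

```python
def _finish(header, chunks):
    """Split the defline at the first space and join the sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                records.append(_finish(header, chunks))
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        records.append(_finish(header, chunks))
    return records


if __name__ == "__main__":
    example = (
        ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin\n"
        "GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC\n"
        "CATCACGCGGGGCCTTCTGCTGCTGGC\n"
    )
    for identifier, description, sequence in parse_fasta(example):
        print(identifier, "|", description, "|", len(sequence), "residues")
```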
Modified FASTA Format
1) A few tools follow the convention that
lower-case sequence is masked (RepeatMasker,
some versions of BLAST, MegaBLAST,
BLASTZ).
2) A few analysis tools (like CLUSTAL)
want a simplified identifier on the defline,
so that they have a short string to label the
alignment.
>X63129.1
GACCAGCCCTGACCTAGGACAGTGAATCGATAATGGCACTCTC
CATCACGCGGGGCCTTCTGCTGCTGGC ….
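Both conventions can be handled with small helpers. A minimal Python sketch, assuming the gi|number|database|accession|locus defline layout of the example above; simplify_defline and masked_intervals are hypothetical names, and the mixed-case test sequence is invented.

```python
import re


def simplify_defline(defline):
    """Reduce a defline to '>' plus a short identifier.

    Assumption: deflines of the form gi|41|emb|X63129.1|BTA1AT carry the
    accession.version in the fourth field; anything else keeps its first word.
    """
    header = defline.lstrip(">").strip()
    if not header:
        return ">"
    identifier = header.split(None, 1)[0]
    fields = identifier.split("|")
    if len(fields) >= 4 and fields[0] == "gi":
        return ">" + fields[3]
    return ">" + identifier


def masked_intervals(sequence):
    """Return 1-based inclusive (start, end) pairs of lower-case runs."""
    return [(m.start() + 1, m.end()) for m in re.finditer(r"[a-z]+", sequence)]


if __name__ == "__main__":
    print(simplify_defline(
        ">gi|41|emb|X63129.1|BTA1AT B.taurus mRNA for alpha-1-anti-trypsin"))
    print(masked_intervals("ACGTacgtnnACGTaa"))  # invented sequence
```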
• WIM now will talk about GCG …
Feature table
(NCBI;EMBL/DDBJ)
• http://www.ncbi.nlm.nih.gov/collab/FT/index.html
Genbank Data format
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
DEFINITION  B.taurus mRNA for alpha-1-antitrypsin.
ACCESSION   X63129
NID         g41
VERSION     X63129.1  GI:41
KEYWORDS    alpha-1 antitrypsin; serine protease inhibitor; serpin.
SOURCE      Bos taurus.
  ORGANISM  Bos taurus
            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
            Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.
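The header of such a record can be read line by line. A minimal Python sketch, assuming the usual flat-file layout in which the keyword occupies the first 12 columns and continuation lines leave those columns blank; parse_header is a made-up name and the sample lines are retyped from the record above.

```python
def parse_header(text):
    """Map each header keyword to its value, joining continuation lines."""
    fields = {}
    current = None
    for line in text.splitlines():
        if not line.strip():
            continue
        keyword = line[:12].strip()   # columns 1-12 hold the keyword
        value = line[12:].strip()     # the rest of the line is the value
        if keyword:
            current = keyword
            fields[current] = value
        elif current is not None:
            # Blank keyword columns: continuation of the previous field.
            fields[current] = (fields[current] + " " + value).strip()
    return fields


SAMPLE = "\n".join([
    "DEFINITION  B.taurus mRNA for alpha-1-antitrypsin.",
    "ACCESSION   X63129",
    "VERSION     X63129.1  GI:41",
    "SOURCE      Bos taurus.",
    "  ORGANISM  Bos taurus",
    "            Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;",
    "            Artiodactyla; Ruminantia; Pecora; Bovoidea; Bovidae; Bovinae; Bos.",
])

if __name__ == "__main__":
    for key, value in parse_header(SAMPLE).items():
        print(key, "=", value)
```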
Genbank References
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
REFERENCE   1  (bases 1 to 1380)
  AUTHORS   Sinha,D.
  TITLE     Direct Submission
  JOURNAL   Submitted (22-OCT-1991) D. Sinha, Dept of Biochemistry,
            Temple University, 3400 North Broad Street, Philadelphia, PA 19140, USA
REFERENCE   2  (bases 1 to 1380)
  AUTHORS   Sinha,D., Bakhshi,M.R. and Kirby,E.P.
  TITLE     Complete cDNA sequence of bovine alpha 1-antitrypsin
  JOURNAL   Biochim. Biophys. Acta 1130 (2), 209-212 (1992)
  MEDLINE   92223096
FEATURES             Location/Qualifiers
Genbank Source Qualifier
LOCUS       BTA1AT       1380 bp    mRNA            MAM       30-APR-1992
...
FEATURES             Location/Qualifiers
     source          1..1380
                     /organism="Bos taurus"
                     /db_xref="taxon:9913"
                     /tissue_type="liver"
                     /cell_type="hepatocyte"
                     /clone_lib="lambda gt11"
                     /clone="2f-Ic"
     mRNA            <1..>1380
     sig_peptide     33..104
...
Genbank mRNA+CDS features
     mRNA            <1..>1380
     sig_peptide     33..104
     CDS             33..1283
                     /codon_start=1
                     /product="alpha-1-antitrypsin"
                     /protein_id="CAA44840.1"
                     /db_xref="PID:g42"
                     /db_xref="GI:42"
                     /db_xref="SWISS-PROT:P34955"
                     /translation="MALSITRGLLLLAALCCLAPISLAGVLQGHAVQETDDTSHQEAAC
                     HKIAPNLANFAFSIYHHLAHQSNTSNIFFSPVSIASAFAMLSLGAKGNTHTEILKG
                     LGFNLTELAEAEIHKGFQHLLHTLNQPNHQLQLTTGNGLFINESAKLVDTFLEDV
                     KNLYHSEAFSINFRDAEEAKKKINDYVEKGSHGKIVELVKVLDPNTVFALVNYIS
                     FKGKWEKPFEMKHTTERDFHVDEQTTVKVPMMNRLGMFDLHYCDKLASWVL
                     LLDYVGNVTACFILPDLGKLQQLEDKLNNELLAKFLEKKYASSANLHLPKLSISE
                     TYDLKSVLGDVGITEVFSDRADLSGITKEQPLKVSKALHKAALTIDEKGTEAVG
                     STFLEAIPMSLPPDVEFNRPFLCILYDRNTKSPLFVGKVVNPTQA"
     mat_peptide     105..1280
                     /product="alpha-1-antitrypsin"
     polyA_signal    1343..1348
     polyA_site      1368
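The feature locations in this record (33..1283, <1..>1380, 1368) follow a simple pattern. A minimal Python sketch of reading them, assuming only single positions and simple start..end ranges with optional < and > partial markers; compound locations such as join(...) or complement(...) are not handled, and parse_location is a made-up name.

```python
import re

# Optional "<", start, then optionally ".." with optional ">" and the end.
_LOCATION = re.compile(r"^(<?)(\d+)(?:\.\.(>?)(\d+))?$")


def parse_location(location):
    """Parse a simple feature location into a dictionary."""
    match = _LOCATION.match(location.strip())
    if match is None:
        raise ValueError("unsupported location: %r" % location)
    start = int(match.group(2))
    end = int(match.group(4)) if match.group(4) else start
    return {
        "start": start,
        "end": end,
        "partial_5": match.group(1) == "<",  # feature extends past the 5' end
        "partial_3": match.group(3) == ">",  # feature extends past the 3' end
        "length": end - start + 1,           # coordinates are 1-based, inclusive
    }


if __name__ == "__main__":
    cds = parse_location("33..1283")
    sig = parse_location("33..104")
    mat = parse_location("105..1280")
    # The CDS is the signal peptide plus the mature peptide plus a stop codon.
    print(cds["length"], sig["length"] + mat["length"] + 3)
```

For this record the CDS is 1251 bases long, a multiple of three, and equals the 72-base signal peptide plus the 1176-base mature peptide plus one 3-base stop codon.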
Genbank Sequence format
...
BASE COUNT      357 a    413 c    322 g    288 t
ORIGIN
        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc
       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac
      121 acgctgtcca agagacagat gatacatccc accaggaagc agcgtgccac aagattgccc
      181 ccaacctggc caactttgcc ttcagcatat accaccattt ggctcatcag tccaacacca
      241 gcaacatctt cttctccccc gtgagcatcg cttcagcctt tgcgatgctc tccctgggag
      301 ccaagggcaa cactcacact gagatcctga agggcctggg tttcaacctc actgagctcg
      361 cagaggctga gatccacaaa ggctttcagc atcttctcca caccctgaac cagccaaacc
...
     1321 gtccccccac tccctccatg gcattaaagg atgactgacc tagccccgaa aaaaaaaaaa
//
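The sequence block is easy to turn back into a plain string. A minimal Python sketch, assuming each ORIGIN line starts with a base position followed by blocks of ten bases (six blocks per line in the flat file) and that the record ends with "//"; parse_origin and base_count are made-up names, and only the first two lines of the record are used as sample input.

```python
from collections import Counter


def parse_origin(lines):
    """Concatenate the bases of ORIGIN lines, dropping the position numbers."""
    bases = []
    for line in lines:
        line = line.strip()
        if not line or line in ("//", "...") or line.upper().startswith("ORIGIN"):
            continue
        tokens = line.split()
        if tokens[0].isdigit():
            tokens = tokens[1:]  # leading number is the position of the first base
        bases.extend(tokens)
    return "".join(bases)


def base_count(sequence):
    """Count each base, as in the BASE COUNT line."""
    return Counter(sequence.lower())


SAMPLE = [
    "ORIGIN",
    "        1 gaccagccct gacctaggac agtgaatcga taatggcact ctccatcacg cggggccttc",
    "       61 tgctgctggc agccctgtgc tgcctggccc ccatctccct ggctggagtt ctccaaggac",
    "//",
]

if __name__ == "__main__":
    seq = parse_origin(SAMPLE)
    print(len(seq), seq[32:35])  # bases 33-35 are the start of the CDS
    print(dict(base_count(seq)))
```

As a sanity check on the full record, the four numbers on the BASE COUNT line (357 + 413 + 322 + 288) add up to the 1380 bp given on the LOCUS line.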
EMBL DATA FORMAT
• EMBL: http://www.ebi.ac.uk/Databases/
• http://www.ebi.ac.uk/cgi-bin/emblfetch
• Use Accession X63129
DDBJ DATA FORMAT
• DDBJ: http://www.ddbj.nig.ac.jp/
• http://ftp2.ddbj.nig.ac.jp:8000/getstarte.html
• Use Accession X63129
• Flat file format same as NCBI/Genbank
format.
Entrez
• Index-based search system. Each field in
the database is searchable individually or in
aggregate.
– (e.g. CDS [FKEY])
– default is aggregate [ALL FIELDS] *
• All primary databases are interlinked as one
big relational database.
– (e.g. Pubmed links in Genbank records)
• Phrase matching.
– Human genome -> “human genome”
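Field tags and phrase quoting can be combined programmatically. A minimal Python sketch that builds a search URL without sending it; it assumes NCBI's E-utilities esearch endpoint with its db and term parameters, which is not mentioned on these slides, and the helper names entrez_term and esearch_url as well as the database name in the usage lines are choices made for this example.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities esearch endpoint (not part of the slides).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def entrez_term(text, field=None):
    """Quote multi-word text as a phrase and append an optional [FIELD] tag."""
    text = text.strip()
    already_quoted = text.startswith('"') and text.endswith('"')
    if " " in text and not already_quoted:
        text = '"%s"' % text  # Human genome -> "Human genome"
    if field:
        text = "%s[%s]" % (text, field)
    return text


def esearch_url(db, term):
    """Build (but do not fetch) a search URL for one Entrez database."""
    return ESEARCH + "?" + urlencode({"db": db, "term": term})


if __name__ == "__main__":
    term = entrez_term("human genome") + " AND " + entrez_term("CDS", "FKEY")
    print(term)
    print(esearch_url("nucleotide", term))
```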
Entrez
• Available neighbours (related documents or
related sequences)
• In Pubmed searches: Term mapping to
neighbouring documents and neighbouring terms.
• Term mapping to chemical names.
– In pubmed: term [All Fields] is term mapped to
chemical names + MeSH terms + Text Fields.
– ... unless “term” is within double quotes.
Entrez
• http://www.ncbi.nlm.nih.gov/Entrez/
• Tutorials:
• http://www.ncbi.nlm.nih.gov/Class/MLACourse/Genetics/index.html
• http://www.ncbi.nlm.nih.gov/Literature/pubmed_search.html
• http://www.ncbi.nlm.nih.gov/Database.tut1.html
SWISSPROT
http://www.expasy.ch/sprot/sprot_details.html
1. Core data: protein sequence data; the citation information and the
taxonomic data
2. Annotation
• Function(s) of the protein
• Domains and sites. For example calcium-binding regions, ATP-binding sites, zinc fingers, homeobox, kringle, etc.
• Post-translational modification(s). For example carbohydrates,
phosphorylation, acetylation, GPI-anchor, etc.
• Secondary structure
• Quaternary structure. For example homodimer, heterotrimer, etc.
• Similarities to other proteins
• Disease(s) associated with deficiency(ies) in the protein
• Sequence conflicts, variants, etc.
SWISSPROT
http://www.expasy.ch/cgi-bin/get-random-entry.pl?S
REBASE (Restriction enzymes dataBASE)
Restriction enzymes have a recognition sequence (a pattern); the actual
cutting site lies within that pattern or a few bases away
from it.
http://rebase.neb.com/rebase/rebase.html
I prefer the Bairoch format (SWISSPROT format)
http://rebase.neb.com/rebase/rebase.f19.html
ID enzyme name
ET enzyme type
OS microorganism name
PT prototype
RS recognition sequence, cut site
MS methylation site (type)
CR commercial sources for the restriction enzyme
CM commercial sources for the methylase
RN [count]
RA authors
RL journal, volume, pages, year, etc.
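A record in this two-letter-code layout can be read into a dictionary. A minimal Python sketch, assuming each line starts with one of the codes listed above followed by its value; parse_record is a made-up name, and the sample record is entirely invented to show the layout (the enzyme "ExaI" and its values are placeholders, not REBASE data).

```python
# Two-letter line codes and their meaning, as listed on the slide.
FIELD_NAMES = {
    "ID": "enzyme name",
    "ET": "enzyme type",
    "OS": "microorganism name",
    "PT": "prototype",
    "RS": "recognition sequence, cut site",
    "MS": "methylation site (type)",
    "CR": "commercial sources for the restriction enzyme",
    "CM": "commercial sources for the methylase",
    "RN": "reference count",
    "RA": "authors",
    "RL": "journal, volume, pages, year, etc.",
}


def parse_record(text):
    """Collect the values of each known line code, in order of appearance."""
    record = {}
    for line in text.splitlines():
        if len(line) < 2:
            continue
        code, value = line[:2], line[2:].strip()
        if code not in FIELD_NAMES:
            continue  # ignore lines that do not start with a known code
        record.setdefault(code, []).append(value)
    return record


# Invented placeholder record, for layout only.
SAMPLE = "\n".join([
    "ID   ExaI",
    "OS   Example organism",
    "RS   GGATCC, 1;",
    "RN   [1]",
    "RA   Author,A., Author,B.",
    "RL   Example J. 1: 1-2 (2000).",
])

if __name__ == "__main__":
    for code, values in parse_record(SAMPLE).items():
        print(code, "(" + FIELD_NAMES[code] + "):", "; ".join(values))
```

Keeping a list per code lets repeated lines, such as several references, survive parsing.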