As it is applied to the Bacillus
What we are going to talk about
• Why we are doing all this DNA sequencing
• What genes look like and where they are
• How we can compare sequences between species
• How genes move between species
Bioinformatics is based on the fact that DNA
sequencing is cheap, and becoming easier and
cheaper very quickly.
– the Human Genome Project cost roughly $3
billion and took 13 years (1990-2003).
– Sequencing James Watson’s genome in 2007
cost $2 million and took 2 months
– Today, you could get your genome sequenced
for about $100,000 and it would take a month.
– The Archon X prize: you win $10 million if
you can sequence 100 human genomes in 10
days, at a cost of $10,000 per genome.
– It is realistic to envision $100 per genome
within 10 years: everyone’s genome could be
sequenced if they wanted or needed it.
Why it’s useful
• All of the information needed to build an
organism is contained in its DNA. If we could
understand it, we would know how life works.
– Preventing and curing diseases like cancer (which is
caused by mutations in DNA) and inherited diseases.
– Curing infectious diseases (everything from AIDS and
malaria to the common cold). If we understand how a
microorganism works, we can figure out how to block
– Understanding genetic and evolutionary relationships
– Understanding genetic relationships between humans.
Projects exist to understand human genetic diversity.
Also, sequencing the Neanderthal genome.
• Ancient DNA: currently it is thought that under ideal
conditions (continuously kept frozen), there is a limit of
about 1 million years for DNA survival. So, Jurassic Park
will probably remain fiction.
From DNA to Gene
But extracting that information is difficult: it is hard to convert a string of
As, Cs, Gs, and Ts into knowledge of how the organism works.
Most of the work is on the computer, with key confirming experiments
done in the “wet lab”.
The sequence below contains a gene critical for life: the gene that initiates
replication of the DNA. Can you spot it?
We are now going to spend some time on what genes look like and how
we can find them.
DNA is just a long string of 4 letters
(nucleotides, or bases): Adenine,
Guanine, Cytosine, and Thymine.
– Which we will just refer to as A, C,
G, and T
– and we are skipping lots of details
Each DNA molecule has 2 strands,
with the bases paired in the center
– A on one strand always pairs with T
on the other strand
– G pairs with C.
– the strands run in opposite directions
Since the two DNA strands are
complementary, there is no need to
write down both strands
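Because the pairing rules fix the second strand completely, it can always be computed from the first. A minimal sketch (the function name is made up for illustration):

```python
# Watson-Crick pairing: A pairs with T, G pairs with C.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq: str) -> str:
    """Return the opposite strand, written in its own reading direction."""
    # Complement every base, then reverse, because the strands run in
    # opposite directions.
    return seq.upper().translate(COMPLEMENT)[::-1]

print(reverse_complement("ATGCCATC"))  # GATGGCAT
```

Applying the function twice gives back the original sequence, which is why writing one strand is enough.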
Chromosomes and Genes
each chromosome is a long piece of DNA
– B. megaterium genome is a circle (like most bacteria) of
about 5 million bases.
– Human chromosomes are 100-200 million bases long.
We have 46 chromosomes (2 sets of 23, one set from each parent).
genes are just regions on that DNA. It is not obvious
where genes are if you look at a DNA sequence.
– there is a lot of DNA that is not part of genes: in humans
only 2% at most of the DNA is part of any gene.
– Bacteria use more of their DNA: 80% of the B. meg
chromosome is genes.
B. meg has about 1 gene per 1000 base pairs (bp) of
DNA. About 5000 genes
Humans have about 25,000 genes.
– We are far more complicated than bacteria: regulation of
the genes is very complicated in humans
– We use the same gene in different ways in different
Genes and Proteins
• Most genes code for proteins: each gene contains
the information necessary to make one protein.
• Proteins are the most important type of
– Structure: collagen in skin, keratin in hair, crystallin
– Enzymes: all metabolic transformations, building
up, rearranging, and breaking down of organic
compounds, are done by enzymes, which are
– Transport: oxygen in the blood is carried by
hemoglobin, everything that goes in or out of a cell
(except water and a few gasses) is carried by
– Also: nutrition (egg yolk), hormones, defense,
The Genetic Code
Proteins are long chains of amino acids.
There are 20 different amino acids coded in DNA
There are only 4 DNA bases, so you need 3 DNA
bases to code for the 20 amino acids (2 bases would
give only 4 x 4 = 16 combinations, which is too few)
4 x 4 x 4 = 64 possible 3 base combinations (codons)
Each codon codes for one amino acid
Most amino acids have more than one possible codon
Genes start at a start codon and end at a stop codon.
3 codons are stop codons: all genes end at a stop codon.
Start codons are a bit trickier, since they are used in
the middle of genes as well as at the beginning
in eukaryotes, ATG is always the start codon, making
Methionine (Met) the first amino acid in all proteins
(but in many proteins it is immediately removed).
In prokaryotes, ATG, GTG, or TTG can be used as a
start codon. B. meg prefers ATG, but about 30% of
the genes start with GTG or TTG.
In bioinformatics, we generally
ignore the fact that RNA uses the
base uracil (U) in place of T.
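The codon arithmetic on this slide can be checked in a few lines. This sketch builds the standard genetic code from the compact 64-letter string used in NCBI's translation table 1 (bases ordered T, C, A, G; `*` marks a stop). The variable names are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# 1 base gives 4 possibilities, 2 bases give 16, 3 bases give 64.
for n in (1, 2, 3):
    print(n, "base(s):", len(BASES) ** n, "combinations")

codons_per_aa = Counter(CODON_TABLE.values())
stops = sorted(c for c, aa in CODON_TABLE.items() if aa == "*")
single_codon = sorted(aa for aa, n in codons_per_aa.items() if n == 1)

print(len(CODON_TABLE), "codons,", len(codons_per_aa) - 1, "amino acids")
print("stop codons:", stops)                      # ['TAA', 'TAG', 'TGA']
print("only one codon each:", single_codon)       # ['M', 'W']
```

Only Met (M) and Trp (W) have a single codon; every other amino acid has at least two.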
How do you get a protein from a gene?
A two-step process (called the Central Dogma of
Molecular Biology):
– First, the gene has to be copied (transcribed) into an
RNA molecule.
• The RNA copy (messenger RNA) is exactly like
the gene itself, except RNA replaces T with U.
• Most gene regulation: whether the gene is “on” or
“off” happens here
– Second, the RNA is translated into protein by
ribosomes, which are complex RNA/protein hybrid
• With the help of transfer RNA molecules, which have one
end that matches the 3 base codon and the other end that
is attached to the proper amino acid.
• The ribosome starts at the start codon and moves down
the messenger RNA, adding one amino acid at a time to
the growing chain. When the ribosome reaches a stop
codon, it falls off, releasing the new protein.
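The two steps can be imitated on the computer. This is only a sketch: the toy gene is invented for illustration, and the lookup table is the standard genetic code, which may differ slightly from the code used by some organisms.

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def transcribe(gene: str) -> str:
    """Step 1: the messenger RNA is the gene with T replaced by U."""
    return gene.upper().replace("T", "U")

def translate(mrna: str) -> str:
    """Step 2: read codons from the first base until a stop codon."""
    protein = []
    for i in range(0, len(mrna) - 2, 3):
        codon = mrna[i:i + 3].replace("U", "T")  # table is written in DNA letters
        amino_acid = CODON_TABLE[codon]
        if amino_acid == "*":                     # stop codon: release the protein
            break
        protein.append(amino_acid)
    return "".join(protein)

toy_gene = "ATGCCATGCTAA"  # made-up example: Met-Pro-Cys-stop
mrna = transcribe(toy_gene)
print(mrna)             # AUGCCAUGCUAA
print(translate(mrna))  # MPC
```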
• Here we get a bit subtle.
• Since codons consist of 3 bases, there are
3 “reading frames” possible on an RNA
(or DNA), depending on whether you
start reading from the first base, the
second base, or the third base.
– The different reading frames give entirely different proteins.
– Consider ATGCCATC, and refer to the
genetic code. (X is junk)
• Reading frame 1 divides this into ATG-CCA-TC, which translates to Met-Pro-X
• Reading frame 2 divides this into A-TGC-CAT-C, which translates to X-Cys-His-X
• Reading frame 3 divides this into AT-GCC-ATC, which translates to X-Ala-Ile
• Each gene uses a single reading frame, so
once the ribosome gets started, it just has
to count off groups of 3 bases to produce
the proper protein.
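The ATGCCATC example can be reproduced by slicing the sequence at offsets 0, 1, and 2. A small sketch using the standard genetic code (one-letter amino acid names; leftover "junk" bases are simply dropped):

```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

def codons_in_frame(seq: str, frame: int) -> list[str]:
    """Split seq into complete codons for reading frame 1, 2, or 3."""
    start = frame - 1
    return [seq[i:i + 3] for i in range(start, len(seq) - 2, 3)]

seq = "ATGCCATC"
for frame in (1, 2, 3):
    codons = codons_in_frame(seq, frame)
    protein = "".join(CODON_TABLE[c] for c in codons)
    print(frame, codons, protein)
# 1 ['ATG', 'CCA'] MP
# 2 ['TGC', 'CAT'] CH
# 3 ['GCC', 'ATC'] AI
```

M, P, C, H, A, and I are the one-letter codes for Met, Pro, Cys, His, Ala, and Ile.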
Open Reading Frames
Ribosomes are very obedient to stop codons: when a
stop codon is reached, the protein is finished. Thus, all
genes end at the first stop codon in their reading frame.
Since 3 out of the 64 codons are stop codons, random
DNA has stop codons very frequently.
However, genes do something necessary for survival, so
natural selection keeps stop codons out of the middle of genes.
That is, if a mutation arises that creates a stop codon in the
middle of a gene, the organism dies and leaves no offspring.
Open reading frames (ORFs) are regions with no stop
codons. All genes reside in long open reading frames
Note that stop codons in other reading frames have no
effect on the gene.
The start codon must occur “upstream” in the same
reading frame as the stop codon. It is usually near the
beginning of the ORF, but not necessarily the first
possible start codon.
Determining the exact start codon is not easy or obvious.
But, the first start codon in an open reading frame is always
a reasonable guess
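With 3 stop codons out of 64, random DNA with equal base frequencies hits a stop about once every 64/3 ≈ 21 codons, so long stop-free stretches stand out. The sketch below lists stop-free stretches in the three forward reading frames. The function name, the coordinate convention (0-based, end exclusive), and the toy sequence are assumptions made for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

def open_reading_frames(seq: str, min_codons: int = 1) -> list[tuple[int, int, int]]:
    """Return (frame, start, end) for each stop-free stretch that ends at a stop codon.

    start is the first base of the stretch, end is the first base of the stop codon.
    Stretches that run off the end of the sequence without a stop are ignored.
    """
    orfs = []
    for frame in range(3):
        orf_start = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - orf_start) // 3 >= min_codons:
                    orfs.append((frame + 1, orf_start, i))
                orf_start = i + 3  # next stretch begins after the stop
    return orfs

print(round(64 / 3, 1))  # 21.3 codons between stops, on average, in random DNA

toy = "ATGAAATAGCCCTGA"  # made-up sequence
print(open_reading_frames(toy))                # [(1, 0, 6), (1, 9, 12)]
print(open_reading_frames(toy, min_codons=2))  # [(1, 0, 6)]
```

Note that the stop codon TGA in frame 2 of the toy sequence has no effect on the frame 1 stretches.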
This is a map of the stop
codons in all 3 reading
frames in a stretch of DNA.
The long ORF in reading frame
1 is highlighted in black.
Genes can occur on either DNA strand.
– If they are on the reverse strand, the DNA sequence needs to be reversed and complemented.
In bacteria, most of the DNA is part of a gene. Most long open reading
frames (say 100 bp or longer) that don’t overlap other long ORFs contain genes.
Most genes do not overlap each other.
– Sometimes there are very short overlaps (50 bp or less), especially if the two genes
are functionally related.
In bacteria, genes that affect the same biochemical pathway or function are
sometimes adjacent to each other on the same DNA strand (not necessarily
the same reading frame), allowing them to be co-regulated
– This group of genes is called an “operon”
– Operons are characteristic of bacteria (and other prokaryotes); they are rare in eukaryotes.
First job is to find long ORFs, examining the longest ORFs first and putting
together a set with minimal overlaps.
– It is also necessary to identify potential start codons, with the furthest upstream
start codon as the easiest choice.
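One simple way to carry out this first job is a greedy pass: sort the ORFs by length, keep each one unless it overlaps an already-kept ORF by too much, then take the furthest upstream start codon in each. This is a sketch of that idea, not necessarily how any particular gene finder works. The 50 bp overlap allowance comes from the short overlaps mentioned earlier; the function names and coordinates are invented.

```python
STARTS = ("ATG", "GTG", "TTG")  # bacterial start codons

def furthest_upstream_start(seq: str, orf_start: int, orf_end: int):
    """Return the position of the first start codon in the ORF's frame, or None."""
    for i in range(orf_start, orf_end, 3):
        if seq[i:i + 3] in STARTS:
            return i
    return None

def pick_orfs(orfs, max_overlap=50):
    """Greedy choice: longest ORFs first, rejecting large overlaps.

    orfs is a list of (start, end) pairs, 0-based, end exclusive.
    """
    kept = []
    for start, end in sorted(orfs, key=lambda o: o[1] - o[0], reverse=True):
        # Overlap with a kept ORF is negative when the two do not touch.
        if all(min(end, e) - max(start, s) <= max_overlap for s, e in kept):
            kept.append((start, end))
    return sorted(kept)

candidates = [(0, 900), (850, 1500), (400, 1000), (1490, 2000)]  # made-up coordinates
print(pick_orfs(candidates))  # [(0, 900), (850, 1500), (1490, 2000)]
print(furthest_upstream_start("CCCGTGAAAATGAAATAA", 0, 15))  # 3 (the GTG)
```

In the second call the GTG at position 3 is chosen over the ATG further downstream, which is the "easiest choice" rule; it is not guaranteed to be the true start.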
Then, how do we know that the ORF contains a real gene? The most
definitive way is to match it with a gene known from other species
– conservation of a sequence between species strongly suggests that the sequence has
a function that is being conserved by natural selection
We compare protein sequences, not DNA, because protein is more conserved
in evolution than DNA
– The organism’s survival depends on the protein being functional, which means
having the proper amino acids sequence
– Since the genetic code is degenerate, many different DNA sequences will give the same protein sequence
– The protein 3-dimensional structure is even more conserved, because it is more
closely related to enzyme activity than the amino acid sequence is.
• However, we don’t have good ways of determining 3-D structure from a DNA sequence
• So, we compare our ORF sequence to a database of known protein
sequences from many species.
– BLAST is the standard sequence alignment tool (BLAST = Basic Local
Alignment Search Tool)
• BLAST is based on the concept that if you compare the same (that is,
homologous) protein from many different species, you can see that some
amino acids readily substitute for each other and others almost never do.
– This is captured in a substitution matrix, which gives a score for each pair of
amino acids aligned at the same position in the proteins being compared.
• BLAST itself is a bit of software that can be run on almost any
computer, but the database needed for a good cross-species comparison
is quite large
– the database is called “nr” for “non-redundant”, and it contains at least 20
Gb of sequence data
• We are going to use the BLAST service at UniProt, a European
consortium that contains a comprehensive collection of protein sequences
– Nearly all derived from DNA sequences: direct sequencing of proteins is
• Terminology: your sequence, which you paste into the box on the web
site, is the query sequence. Sequences in the database that match yours
are called subject sequences.
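The substitution-matrix idea can be shown with a toy scoring function. The scores below are invented for illustration and are not the values of any real matrix such as BLOSUM62; real BLAST also handles gaps and searches for the best local region, which this sketch does not.

```python
# Hypothetical scores: pairs of amino acids that readily substitute for each other.
TOY_SIMILAR_PAIRS = {
    frozenset("LI"): 2,  # leucine / isoleucine
    frozenset("KR"): 2,  # lysine / arginine
    frozenset("DE"): 2,  # aspartate / glutamate
}
IDENTICAL_SCORE = 4   # assumed score for an exact match
MISMATCH_SCORE = -2   # assumed score for any other pair

def pair_score(a: str, b: str) -> int:
    if a == b:
        return IDENTICAL_SCORE
    return TOY_SIMILAR_PAIRS.get(frozenset((a, b)), MISMATCH_SCORE)

def alignment_score(query: str, subject: str) -> int:
    """Sum the pair scores of two equal-length, ungapped sequences."""
    if len(query) != len(subject):
        raise ValueError("sequences must be the same length")
    return sum(pair_score(q, s) for q, s in zip(query, subject))

# M=M (4), K/R (2), L/I (2), D=D (4), E/W (-2)
print(alignment_score("MKLDE", "MRIDW"))  # 10
```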
A Sequence to BLAST
• This is a more-or-less
randomly chosen gene
from B. meg.
– It is 174 amino acids long
• It is written in “fasta”
format: the first line starts
with > and is immediately
followed by an identifier
(ORF00135), and then
• After that the sequence is
written without spaces or
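Fasta format is simple enough to read with a few lines of code. In this sketch the identifier ORF00135 is the one from the slide, but the amino acid sequence is a made-up placeholder, not the real B. meg protein.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Return {identifier: sequence} for fasta-formatted text."""
    records: dict[str, list[str]] = {}
    identifier = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # The identifier is the first word right after the > sign.
            identifier = line[1:].split()[0]
            records[identifier] = []
        elif identifier is not None:
            records[identifier].append(line)
    # Sequence lines are joined back into one unbroken string.
    return {name: "".join(parts) for name, parts in records.items()}

example = """>ORF00135 placeholder description
MKKLLIAG
TTVLS
"""
print(parse_fasta(example))  # {'ORF00135': 'MKKLLIAGTTVLS'}
```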
Results are arranged with the best ones on top
The most important score is the Expect value, or E-value, which can be defined as the
number of hits any random sequence (with the same length as yours) would be expected to have in the
database by chance.
– E-values for good hits are usually written something like: 3e-42, which is the same
as 3 x 10^-42, a very small number
– Bad hits are very common, and they have e-values in a more familiar form: for
example, 0.004 or 1.2
– A really good e-value is less than 1e-180, which underflows the computer’s
processing capabilities, so it is written as 0.0
– E-values are affected by the length of the query sequence as well as the size of the
database, so even perfect matches with short sequences give poor e-values
In this case we see many hits with good e-values, and the top e-values all are quite
Before we can conclude that our protein is a homologue of the proteins BLAST matches
it with, we would like them to have roughly the same length and have a high percentage
of identical amino acids.
– the lengths of the query and subject sequences should be within 20% of each other
– There should be at least 30% identical amino acids
– In this case we can be quite sure we have a good match
BLAST also returns a fourth value, the bit score, which we are going to ignore.
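The two rules of thumb can be written as a quick check. This is a sketch under one stated assumption: "within 20% of each other" is read here as a length difference of at most 20% of the longer sequence. The subject length and percent identity in the example are invented; only the query length (174) comes from the slides.

```python
def looks_homologous(query_len: int, subject_len: int, percent_identity: float) -> bool:
    """Apply the length and identity rules of thumb to one BLAST hit."""
    longer = max(query_len, subject_len)
    similar_length = abs(query_len - subject_len) <= 0.20 * longer
    enough_identity = percent_identity >= 30.0
    return similar_length and enough_identity

# E-values written like 3e-42 are ordinary numbers and can be compared directly.
print(float("3e-42") < float("0.004"))  # True: the first is the better hit

print(looks_homologous(174, 180, 65.0))  # True
print(looks_homologous(174, 300, 65.0))  # False: lengths too different
print(looks_homologous(174, 180, 22.0))  # False: too few identical amino acids
```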
Genes are mostly named for the function of their protein.
– at some point, some related genes had their function determined through lab work: by
examining the effects of mutations in the gene, by isolating and studying the protein
produced by the gene, etc.
– Enzymes (end in –ase), transport across the cell membrane, genetic information
processing (DNA->RNA->protein), structural proteins, sporulation and germination, and
Many genes (maybe 1/4 of them in a typical genome) have no known function, although they
are found in several different species: conserved hypothetical genes
Every new genome has some genes that are unique: no matching BLAST hits in the database.
– Are they real genes? Sometimes there is evidence in the form of messenger RNA, but
usually we don’t know
– call them hypothetical genes
“putative” means that we think we know the gene’s function but we aren’t sure. Putative
should be followed by the function name.
More Gene Names
• One question of interest: do the names of the top BLAST hits
agree with each other? They should, but there are always
annotation errors, and our knowledge of gene function
increases over time.
– With some sloppiness due to different naming conventions practiced
by different scientists
• Here we have a classic case of mis-naming. Why is the top
hit ribosomal protein S2, with no other hit having this name?
– Ribosomal proteins are highly conserved in evolution
– Some checking on my part showed that no homology exists between
this gene and the ribosomal protein S2 found in any other Bacillus
• The other names are similar, although not identical.
– What is “PAP2”? A quick Google search shows that it stands for
“phosphatidic acid phosphatase”, which fits the other names well.
– There is probably some uncertainty about its exact function, given the
variety of names and the “family protein” designation in several of
Horizontal and Vertical Gene Transfer
We are accustomed to thinking of genes being passed
from parent to offspring, always staying within the
species, with very occasional splitting of one species
– This is called vertical gene transfer.
But, we know that some genes are transferred across
species lines, not by the standard genetic mechanisms.
– This is called horizontal gene transfer
– It is rare in humans and other higher organisms
– In bacteria 10% or more of genes have been transferred
B meg genes that come from vertical descent have
other Bacillus species (or another closely related
species) as the closest BLAST hit
Horizontally transferred genes can come from almost
anywhere: other bacteria, Archaea, eukaryotes: plants,
– The general mechanisms are well known, including
conjugation (direct transfer of DNA between two
bacteria), transduction (transfer of DNA using a virus as
a carrier), and transformation (the bacteria pick up DNA
molecules from their environment).
• “Kings Play Chess On
Fine Ground Sand”
Bacteria is the domain
Firmicutes is the phylum
Bacilli is the class
Bacillales is the order
Bacillaceae is the family
Bacillus is the genus.
• Most of the top hits are from various Bacillus species: there is little doubt
that this gene is the result of normal, vertical gene flow.
• What about “Anoxybacillus flavithermus”?
– Click on the accession number to get more information, including its
– Taxonomic lineage = Bacteria > Firmicutes > Bacillales > Bacillaceae >
– Same family as B meg.
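Comparing lineages rank by rank shows how closely related a hit's organism is to B. meg. In this sketch the B. meg lineage is the one listed above. For Anoxybacillus flavithermus, the class (Bacilli) and genus (Anoxybacillus) are filled in as assumptions, since the lineage shown on the slide is cut short. The Escherichia coli lineage is an outside example added for contrast.

```python
RANKS = ("domain", "phylum", "class", "order", "family", "genus")

def lineage(*names: str) -> dict[str, str]:
    """Pair each taxon name with its rank, from domain down to genus."""
    return dict(zip(RANKS, names))

B_MEG = lineage("Bacteria", "Firmicutes", "Bacilli", "Bacillales", "Bacillaceae", "Bacillus")
A_FLAVITHERMUS = lineage("Bacteria", "Firmicutes", "Bacilli", "Bacillales",
                         "Bacillaceae", "Anoxybacillus")
E_COLI = lineage("Bacteria", "Proteobacteria", "Gammaproteobacteria",
                 "Enterobacterales", "Enterobacteriaceae", "Escherichia")

def lowest_shared_rank(a: dict[str, str], b: dict[str, str]):
    """Return the most specific rank at which two lineages still agree."""
    shared = None
    for rank in RANKS:
        if a.get(rank) != b.get(rank):
            break
        shared = rank
    return shared

print(lowest_shared_rank(B_MEG, A_FLAVITHERMUS))  # family
print(lowest_shared_rank(B_MEG, E_COLI))          # domain
```

A top hit that shares only the domain with B. meg would be a candidate for horizontal transfer; one that shares the family or genus fits vertical descent.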
• You can see the aligned sequences by clicking on the
“Local alignment” diagrams
– Query sequence on top, subject below
– Identical amino acids are in the middle of the alignment, and
similar ones have a + sign.
– Gaps: regions where one sequence has amino acids not found
in the other sequence, are indicated with ---.
• This protein is very typical in that the best matches are
in the middle of the protein, with fewer identical amino
acids near the ends.
– Also, the match doesn’t quite make it to the very beginning of
the proteins, although they are almost identical in length.
– The active site of most enzymes is in the middle
– The ends of proteins are often not well conserved
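The middle line of an alignment display can be rebuilt from the two aligned sequences. This is a sketch: the groups of "similar" amino acids are a rough assumption made for illustration (BLAST decides similarity from its substitution matrix), and the two short sequences are invented.

```python
# Hypothetical groups of amino acids treated as similar to each other.
SIMILAR_GROUPS = [set("ILVM"), set("KRH"), set("DE"), set("ST"), set("FYW"), set("NQ")]

def midline(query: str, subject: str) -> str:
    """Identical residues are repeated, similar ones get +, everything else a space."""
    out = []
    for q, s in zip(query, subject):
        if "-" in (q, s):            # gap in either sequence
            out.append(" ")
        elif q == s:
            out.append(q)
        elif any(q in g and s in g for g in SIMILAR_GROUPS):
            out.append("+")
        else:
            out.append(" ")
    return "".join(out)

def percent_identity(query: str, subject: str) -> float:
    """Identical positions as a percentage of the alignment length."""
    matches = sum(q == s and q != "-" for q, s in zip(query, subject))
    return 100.0 * matches / len(query)

query   = "MKLDE-GS"
subject = "MRIDWAGT"
print(query)
print(midline(query, subject))   # M++D  G+
print(subject)
print(percent_identity(query, subject))  # 37.5
```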
Local Alignment Result
Click on Graphical Overview (just under
the BLAST box on the left) to get an
overview of all the aligned sequences
– The extent of the matching region is shown
with the colored boxes, with non-matching
regions drawn as a line.
– Color indicates percent of identical amino
You can see that mostly our query and the
various subjects (matches) line up along
almost all of their lengths.
– This is a good way to check whether our
start site is reasonable.
A few odd ones lower down.
– Genes, and pieces of genes, can move to
new locations in the genome, fuse with
other genes, break apart, etc. Always
subject to natural selection: if the altered
gene doesn’t work, the organism will die
and we won’t see it.
– And of course, sequencing and annotation errors occur.
The Basic Points
DNA can be read in 3 different reading frames, a
consequence of the genetic code (3 bases = 1 amino acid).
Genes are found in long open reading frames, areas
where there are no stop codons.
BLAST is the tool we use to compare sequences
Gene sequences are conserved between species by natural selection.
BLAST scores (e-values) describe the number of hits a random sequence
would be expected to have in the database by chance
DNA sequences outside of genes are much less conserved
Most genes are transferred vertically, from parent to
offspring, but a significant number are transferred
horizontally, from unrelated species.
• Within-species BLAST--are there duplicate genes? Do their names
match? What is most closely related species? Present in both strains?
• Are nearby genes related by subsystem?