Unit 2.5: Genome Annotation, Gene Prediction, and DNA motifs
Objectives:
-learn what is meant by “genome annotation” and why this is
important
-understand why some genomic features are relatively easy to
annotate, and others are not
-learn the major ways of representing transcription factor binding
sites and the advantages and disadvantages of each
Assigned reading:
Stein, L. 2001. Genome annotation: from sequence to biology. Nat Rev Genet 2: 493-503.
D'Haeseleer, P. 2006. What are DNA sequence motifs? Nat Biotechnol 24: 423-425.
Genome Sequencing
                                   6/25/04   7/25/05   7/20/07   6/29/08
Genome projects (total)              1128      1496      2680      3825
Complete genomes                      199       274       579       827
  (of which eukaryotes)                28        36        49        94
Prokaryotic genomes in progress       508       728      1285      1932
Eukaryotic genomes in progress        421       494       721       936
Example genome sizes (small to large):
archaebacterium Nanoarchaeum equitans        500 kb
Bacillus anthracis (anthrax)                 5,228 kb
S. cerevisiae (yeast)                        12,069 kb
Arabidopsis thaliana                         115,428 kb
Drosophila melanogaster (fruit fly)          137,000 kb
Anopheles gambiae (malaria mosquito)         278,000 kb
Oryza sativa (rice)                          420,000 kb
Mus musculus (mouse)                         2,493,000 kb
Homo sapiens (human)                         2,900,000 kb
http://www.genomesonline.org/
so what?
Genome sequencing helps in:
• identifying new genes (“gene discovery”)
• looking at chromosome organization and structure
• finding gene regulatory sequences
• comparative genomics
These in turn lead to advances in:
•medicine
•agriculture
•biotechnology
•understanding evolution and other basic science questions
Because of the vast amounts of data that are
generated, we need new approaches
•high throughput assays
•robotics
•high speed computing
•statistics
•bioinformatics
Understanding the genome
We know the sequence—but can we understand it?
Anna Pavlovna's drawing room was gradually filling. The highest Petersburg
society was assembled there: people differing widely in age and character but
alike in the social circle to which they belonged. Prince Vasili's daughter, the
beautiful Helene, came to take her father to the ambassador's entertainment; she
wore a ball dress and her badge as maid of honor. The youthful little Princess
Bolkonskaya, known as la femme la plus seduisante de Petersbourg, was also
there. She had been married during the previous winter, and being pregnant did
not go to any large gatherings, but only to small receptions. Prince Vasili's son,
Hippolyte, had come with Mortemart, whom he introduced. The Abbe Morio and
many others had also come.
To each new arrival Anna Pavlovna said, "You have not yet seen my aunt," or "You
do not know my aunt?" and very gravely conducted him or her to a little old lady,
wearing large bows of ribbon in her cap, who had come sailing in from another
room as soon as the guests began to arrive; and slowly turning her eyes from the
visitor to her aunt, Anna Pavlovna mentioned each one's name and then left them.
--Tolstoy, War and Peace
Understanding the genome
We don’t know the language:
Гостиная Анны Павловны начала понемногу наполняться. Приехала высшая знать Петербурга,
люди самые разнородные по возрастам и характерам, но одинаковые по обществу, в каком все
жили; приехала дочь князя Василия, красавица Элен, заехавшая за отцом, чтобы с ним вместе
ехать на праздник посланника. Она была в шифре и бальном платье. Приехала и известная, как
la femme la plus séduisante de Pétersbourg 1, молодая, маленькая княгиня Болконская, прошлую
зиму вышедшая замуж и теперь не выезжавшая в большой свет по причине своей
беременности, но ездившая еще на небольшие вечера. Приехал князь Ипполит, сын князя
Василия, с Мортемаром, которого он представил; приехал и аббат Морио и многие другие.
— Вы не видали еще, — или: — вы не знакомы с ma tante? 2 — говорила Анна Павловна
приезжавшим гостям и весьма серьезно подводила их к маленькой старушке в высоких бантах,
выплывшей из другой комнаты, как скоро стали приезжать гости, называла их по имени,
медленно переводя глаза с гостя на ma tante, и потом отходила.
Все гости совершали обряд приветствования никому не известной, никому не интересной и не
нужной тетушки. Анна Павловна с грустным, торжественным участием следила за их
приветствиями, молчаливо одобряя их. Ma tante каждому говорила в одних и тех же
выражениях о его здоровье, о своем здоровье и о здоровье ее величества, которое нынче было,
слава Богу, лучше. Все подходившие, из приличия не выказывая поспешности, с чувством
облегчения исполненной тяжелой обязанности отходили от старушки, чтоб уж весь вечер ни
--Tolstoy, War and Peace
Understanding the genome
Even if we did, we don’t know the grammar, or punctuation:
annapavlovnasdrawingroomwasgraduallyfillingthehighestpetersburgsocietywasassembledt
herepeopledifferingwidelyinageandcharacterbutalikeinthesocialcircletowhichtheybelonged
princevasilisdaughterthebeautifulhelenecametotakeherfathertotheambassadorsentertainmen
tsheworeaballdressandherbadgeasmaidofhonortheyouthfullittleprincessbolkonskayaknown
aslafemmelaplusseduisantedepetersbourgwasalsothereshehadbeenmarriedduringthepreviou
swinterandbeingpregnantdidnotgotoanylargegatheringsbutonlytosmallreceptionsprincevasil
issonhippolytehadcomewithmortemartwhomheintroducedtheabbemorioandmanyothershada
lsocometoeachnewarrivalannapavlovnasaidyouhavenotyetseenmyauntoryoudonotknowmya
untandverygravelyconductedhimorhertoalittleoldladywearinglargebowsofribboninhercapw
hohadcomesailinginfromanotherroomassoonastheguestsbegantoarriveandslowlyturninghere
yesfromthevisitortoherauntannapavlovnamentionedeachonesnameandthenleftthemeachvisit
orperformedtheceremonyofgreetingthisoldauntwhomnotoneofthemknewnotoneofthemwant
edtoknowandnotoneofthemcaredaboutannapavlovnaobservedthesegreetingswithmournfula
ndsolemninterestandsilentapprovaltheauntspoketoeachoftheminthesamewordsabouttheirhea
lthandherownandthehealthofhermajestywhothankgodwasbettertodayandeachvisitorthoughp
olitenesspreventedhisshowingimpatiencelefttheoldwomanwithasenseofreliefathavingperfor
medavexatiousdutyanddidnotreturntoherthewholeeveningtheyoungprincessbolkonskayahad
broughtsomeworkinagold
--Tolstoy, War and Peace
In order to make use of the genome sequence, we need
to understand all of its components. Assigning
identities and functions to sequences within the
genome is called genome annotation.
“With the complete human genome sequence now in hand,
we face the enormous challenge of interpreting it and
learning how to use that information to understand the
biology of human health and disease. The ENCyclopedia Of
DNA Elements (ENCODE) Project is predicated on the
belief that a comprehensive catalog of the structural and
functional components encoded in the human genome
sequence will be critical for understanding human biology
well enough to address those fundamental aims of
biomedical research. Such a complete catalog, or "parts
list," would include protein-coding genes, non-protein-coding genes,
transcriptional regulatory elements, and sequences that mediate
chromosome structure and dynamics; undoubtedly, additional,
yet-to-be-defined types of functional sequences will also need
to be included.”
What’s in a genome?
Genes (i.e., protein coding)
But. . . only <2% of the human genome encodes proteins
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk” ?
It’s still uncertain/controversial how much of the genome is
composed of any of these classes
The answers will come from experimentation and
bioinformatics.
Current human genome annotations can be viewed using
the UCSC genome browser, as we saw in Unit 2-4.
The ENCyclopedia Of DNA Elements (ENCODE)
Project aims to identify all functional elements in
the human genome sequence.
•pilot phase focused on 30 Mb (~ 1%) of the genome
•international consortium of computational and
laboratory-based scientists working to develop and apply
high-throughput approaches for detecting all sequence
elements that confer biological function
•now in its second phase, extending study to entire
human genome
[Figure: Functional genomic elements being identified by the ENCODE pilot phase.
The ENCODE Project Consortium, Science 306: 636-640 (2004). Published by AAAS.]
Protein-coding genes, non-protein-coding genes
•easier to find than other functional elements
•why?
•genes are transcribed—which means that we can
identify them by looking at RNA
•traditionally this has been done by cDNA or
EST sequencing, more recently by microarray,
SAGE, MPSS, etc.
Protein-coding genes, non-protein-coding genes
•we can also find genes ab initio using computational
methods
•this is most suited to protein-coding genes
•why?
•protein-coding genes have recognizable features
•open reading frames (ORFs)
•codon bias
•known transcriptional and translational start and stop motifs
(promoters, 3’ poly-A sites)
•splice consensus sequences at intron-exon boundaries
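As a rough illustration of the first feature in the list, the sketch below scans the forward strand for open reading frames. It is a minimal sketch only: the function name find_orfs and the min_codons cutoff are made up for this example, and it ignores the reverse strand, introns, codon bias, and everything else a real gene finder uses.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_codons=3):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    min_codons is an arbitrary illustrative cutoff (codons before the stop).
    """
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the first ATG seen since the last stop codon
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None
    return orfs

print(find_orfs("CCATGAAACCCGGGTAACC"))
```

In a compact bacterial genome, long ORFs of this kind are already strong gene candidates; in genomes with long introns the same scan finds only fragments, which is one reason the problem is harder there.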
ab initio gene discovery
•Protein-coding genes have recognizable features
•We can design software to scan the genome and identify these
features
•Some of these programs work quite well, especially in bacteria
and simpler eukaryotes with smaller and more compact genomes
•It’s a lot harder for the higher eukaryotes where there are a lot of
long introns, genes can be found within introns of other genes, etc.
•We tend to do OK finding protein coding regions, but miss a
lot of non-coding 5’ exons and the like
ab initio gene discovery—validating
predictions and refining gene models
•Standard types of evidence for validation of predictions include:
•match to previously annotated cDNA
•match to EST from same organism
•similarity of nucleotide or conceptually translated protein
sequence to sequences in GenBank
(translation works better—why?)
•predicted protein matches a Pfam domain
•association with recognized promoter sequences, e.g., TATA box,
CpG island
•known phenotype from mutation of the locus
Finding non–protein-coding genes
•e.g., tRNA, rRNA, snoRNA, miRNA, various other ncRNAs
•Harder to find than protein-coding genes
•Why?
•often not poly-A tailed—don’t end up in cDNA libraries
•no ORF
•sequence divergence is constrained at the nucleotide level, not the
protein level, so homology is harder to detect
•So, how do we find these?
Finding non–protein-coding genes
•secondary structure
•homology, especially alignment of related species
•experimentally
•isolation through non-polyA dependent cloning methods
•microarrays
ab initio gene discovery—approaches
Most gene-discovery programs make use of some form of
machine learning algorithm. A machine learning algorithm
requires a training set of input data that the computer uses to
“learn” how to find a pattern.
Two common machine learning approaches used in gene
discovery (and many other bioinformatics applications) are
artificial neural networks (ANNs) and hidden Markov models
(HMMs).
ab initio gene discovery—HMMs
An example state diagram for an HMM for gene discovery is this
simplified version of one used by GENSCAN:
[State diagram with states: begin gene region, 5’ UTR, start translation,
initial exon, donor splice site, intron, acceptor splice site, exon,
final exon, single exon, stop translation, 3’ UTR, end gene region.
States emit nucleotides A, T, G, C.]
Each arrow has an associated transition probability, and each state (box)
has emission probabilities for the nucleotides (dotted arrow).
These are learned from examples of known gene models and
provide the probability that a stretch of sequence is a gene.
adapted from Gibson and Muse, A Primer of Genome Science
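To make transition and emission probabilities concrete, here is a toy two-state HMM decoded with the Viterbi algorithm. This is a sketch under loud assumptions: it has only "exon" and "intron" states rather than the full state diagram, and every probability is invented for illustration, not learned from known gene models and not taken from any real gene finder.

```python
import math

STATES = ("exon", "intron")
# All probabilities below are invented for illustration, not trained values.
START = {"exon": 0.5, "intron": 0.5}
TRANS = {
    "exon":   {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}
EMIT = {
    "exon":   {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}

def viterbi(seq):
    """Return the most probable state path for a non-empty sequence."""
    # score[s] = best log-probability of any path that ends in state s
    score = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    back = []  # one dict per later position: best predecessor of each state
    for base in seq[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + math.log(TRANS[p][s]))
            new_score[s] = (score[prev] + math.log(TRANS[prev][s])
                            + math.log(EMIT[s][base]))
            pointers[s] = prev
        score = new_score
        back.append(pointers)
    # Trace back from the best final state
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]

print(viterbi("GCGCGCGCATATATAT"))
```

With these made-up numbers the GC-rich half is labeled exon and the AT-rich half intron. A real gene-finding HMM has many more states and uses parameters trained on known gene models.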
What about other genomic features?
Other than protein coding genes, what is there?
• genes for noncoding RNAs (rRNA, tRNA, miRNAs, etc.)
• structural sequences (scaffold attachment regions)
• regulatory sequences
• non-functional “junk” ?
We can begin to annotate regulatory sequences such as
transcription factor binding sites and cis-regulatory modules.
Remember from Unit 2-2:
Control of Gene Expression—Transcription Factors
Transcription factors (TFs) are proteins that bind to the DNA
and help to control gene expression. We call the sequences to
which they bind transcription factor binding sites (TFBSs),
which are a type of cis-regulatory sequence.
Transcription factors bind to specific DNA
sequences
Isalan et al. Biochemistry 37:12026
Usually, binding sites are first determined empirically.
Most transcription factors can bind to a range of similar
sequences. We can represent these in either of two ways, as a
consensus sequence, or as a position weight matrix (PWM).
Once we know the binding site, we can search the genome to
find all of the (predicted) binding sites.
Control of Gene Expression—Transcription Factors
Most transcription factors can bind to a range of similar
sequences. We call this a binding “motif.”
Wasserman, W. W. and A. Sandelin (2004). Nat Rev Genet 5(4): 276-287.
We can represent these motifs either as a consensus
sequence or as a frequency (or weight) matrix.
Binding site (motif) representations
7 characterized binding sites for a certain transcription factor:
TCCGGAAGC
TCCGGATGC
TCCGGATCT
CATGGATGC
CCAGGAAGT
GGTGGATGC
ACCGGATGC
consensus sequence:
[T/C]C[C/T]GGAAGC
Frequency matrix (shown on the slide with its graphical depiction, a sequence logo):
      1  2  3  4  5  6  7  8  9
A     1  1  1  0  0  7  2  0  0
T     3  0  2  0  0  0  5  0  2
G     1  1  0  7  7  0  0  6  0
C     2  5  4  0  0  0  0  1  5
Binding site (motif) representations
A consensus sequence is a one-line description of the TFBS,
based on a column-by-column alignment of the individual
known binding sites. The usual rule is:
A single base is shown if it occurs in more than half
the sites and at least twice as often as the second
most frequent base. Otherwise, a double degenerate
symbol (e.g., G/C= S) is used if two bases occur
in more than 75% of the sites, or a triple degenerate
symbol when one base does not occur at all.
A frequency matrix shows the actual frequencies of each base
in each column. This can be easily converted to a position
weight matrix (PWM), which is a normalized version of the
frequency matrix that is therefore not dependent on the
number of sites in the alignment.
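The conversion from aligned sites to a frequency matrix and then to a PWM can be sketched as follows, using the 7 binding sites listed earlier. The log2-odds form, the pseudocount of 0.25, and the uniform background of 0.25 are assumptions chosen for this example; they are one common convention, not something specified on the slides.

```python
import math

SITES = [
    "TCCGGAAGC", "TCCGGATGC", "TCCGGATCT", "CATGGATGC",
    "CCAGGAAGT", "GGTGGATGC", "ACCGGATGC",
]
BASES = "ATGC"  # row order used in the frequency matrix

def frequency_matrix(sites):
    """Count each base in each column of the aligned sites."""
    length = len(sites[0])
    return {b: [sum(s[i] == b for s in sites) for i in range(length)]
            for b in BASES}

def position_weight_matrix(counts, pseudocount=0.25, background=0.25):
    """Convert counts to log2-odds weights (assumed convention)."""
    n_sites = sum(counts[b][0] for b in BASES)
    total = n_sites + 4 * pseudocount  # pseudocount avoids log(0)
    return {
        b: [math.log2((c + pseudocount) / total / background) for c in counts[b]]
        for b in BASES
    }

counts = frequency_matrix(SITES)
pwm = position_weight_matrix(counts)
for b in BASES:
    print(b, counts[b], [round(w, 2) for w in pwm[b]])
```

Because each column is divided by the number of sites, the weights do not change if the same proportions are observed in a larger alignment.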
Finding binding sites in the genome
[T/C]C[C/T]GGATGC
Consensus sequences make searching easy—it’s a simple text search
that can even be done using a word processor, or very simply
programmed in a computer language such as Perl:
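# SEQUENCE is assumed to be an already opened filehandle of sequence lines
while (my $line = <SEQUENCE>) {
    # [TC] matches T or C; a "|" inside brackets would match a literal "|"
    if ($line =~ /[TC]C[TC]GGATGC/) {
        print "match: $line";    # placeholder for "do something"
    }
}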
All positions in the motif are treated the same.
Identifying transcription factor
binding sites
But PWMs are generally more useful:
•they allow us to assign more importance to more
invariant positions
•they are related to the binding energy of the DNA-protein
interaction
•we can compare PWMs and we can score PWMs
Scores are based on the probability of a given nucleotide
being in a given position.
Identifying transcription factor binding sites
      1  2  3  4  5  6  7  8  9
A     1  1  1  0  0  7  2  0  0
T     3  0  2  0  0  0  5  0  2
G     1  1  0  7  7  0  0  6  0
C     2  5  4  0  0  0  0  1  5
consensus: [T/C]C[C/T]GGATGC
Candidate sequences:
TCCGGAAGC
TCCGGAACT
TCCGGAAAA
Example 1:
TCCGGAAGC scores higher than TCCGGAACT, which scores
higher than TCCGGAAAA, because GC > CT > AA in the last
two positions. Note that the latter two sequences would
score the same if using only the consensus representation.
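The ranking in Example 1 can be checked with a small scoring sketch. The counts are the frequency matrix from the slide; the log2-odds scoring, the pseudocount, and the uniform background are assumptions made for this illustration, so the absolute scores mean little and only the ordering matters.

```python
import math

# Counts from the frequency matrix (rows A, T, G, C; columns 1-9)
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}
N_SITES = 7
PSEUDOCOUNT = 0.25  # assumed; avoids log(0) for bases never seen in a column
BACKGROUND = 0.25   # assumed uniform base composition

def pwm_score(seq):
    """Sum of per-position log2-odds for a sequence of motif length."""
    total = N_SITES + 4 * PSEUDOCOUNT
    return sum(
        math.log2((COUNTS[base][i] + PSEUDOCOUNT) / total / BACKGROUND)
        for i, base in enumerate(seq)
    )

for seq in ("TCCGGAAGC", "TCCGGAACT", "TCCGGAAAA"):
    print(seq, round(pwm_score(seq), 2))
```

The three sequences differ only in the last two positions, so the ordering comes entirely from the counts for G/C/A in column 8 and C/T/A in column 9.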
Identifying transcription factor binding sites
      1  2  3  4  5  6  7  8  9
A     1  1  1  0  0  7  2  0  0
T     3  0  2  0  0  0  5  0  2
G     1  1  0  7  7  0  0  6  0
C     2  5  4  0  0  0  0  1  5
consensus: [T/C]C[C/T]GGATGC
Example 2:
TCGGGAAGC and TCCAGATCT both have a single
mismatch compared to the consensus. But the first is a
much better binding site when scored using a PWM due to
the strong conservation of the G in position 4 versus the
weak requirement for the C or T in position 3.
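The difference between "strong conservation" at position 4 and a "weak requirement" at position 3 can be put in numbers as information content per column, the quantity that sets letter height in a sequence logo. The sketch below uses the raw counts with no small-sample correction, which is a simplification chosen for this example.

```python
import math

# Counts from the frequency matrix (rows A, T, G, C; columns 1-9)
COUNTS = {
    "A": [1, 1, 1, 0, 0, 7, 2, 0, 0],
    "T": [3, 0, 2, 0, 0, 0, 5, 0, 2],
    "G": [1, 1, 0, 7, 7, 0, 0, 6, 0],
    "C": [2, 5, 4, 0, 0, 0, 0, 1, 5],
}

def information_content(counts):
    """Bits per column: 2 minus the Shannon entropy of the base frequencies."""
    bits = []
    for i in range(len(counts["A"])):
        column = [counts[b][i] for b in "ATGC"]
        total = sum(column)
        entropy = -sum((c / total) * math.log2(c / total)
                       for c in column if c > 0)
        bits.append(2.0 - entropy)
    return bits

for pos, value in enumerate(information_content(COUNTS), start=1):
    print(pos, round(value, 2))
```

An invariant column such as position 4 reaches the 2-bit maximum, while position 3 carries well under 1 bit, which is why a mismatch at position 4 costs far more than one at position 3.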
Issues with finding binding sites in the genome
But it’s important to use caution: just because a sequence in the
genome is a reasonable match to a known TFBS, this doesn’t
necessarily mean that the TF is binding there in vivo. By crude
calculation:
The probability of finding a given 7 bp motif at any one position is 4^-7 = 1/16,384,
i.e., expect only about 1 motif every 16 kb.
So in the human genome (~3 × 10^9 bp), this sequence should be present
over 183,000 times! (>7x per gene!) Even in a 10 Mb
genome, the sequence would occur over 600 times.
And this calculation does not even take into account motif
degeneracy!
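The crude calculation can be reproduced in a few lines. This sketch assumes independent, equally frequent bases, one strand only, and a human genome size of 3 × 10^9 bp (the figure that gives the 183,000 estimate); the function name expected_hits is made up for this example.

```python
def expected_hits(motif_length, genome_bp):
    """Expected matches to one fixed motif on one strand, assuming
    independent, equally frequent bases."""
    return genome_bp / 4 ** motif_length

print(round(expected_hits(7, 3_000_000_000)))  # human genome, about 183,105
print(round(expected_hits(7, 10_000_000)))     # 10 Mb genome, about 610
```

Allowing degenerate positions, or counting both strands, only increases these numbers.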
So we need to consider additional factors in deciding which
predicted binding sites are important, such as how regulatory
regions are organized.
Empirical methods, such as ChIP-chip (see Unit 2-3) are a good
alternative for looking at in vivo binding; bioinformatics
methods can be combined with this to determine the
transcription factor binding motifs.
Genome Annotation—Transcription Factors
Because of the difficulty in accurately predicting bona
fide, functional TFBSs, most current genome
annotation focuses on empirically determined sites.
Several databases curate these data, e.g. the Open
Regulatory Annotation database (ORegAnno) and the
Regulatory Element Database for Drosophila (REDfly).
Tracks displaying these data can be found in the UCSC
Genome Browser. These databases also curate cis-regulatory module sequences, which at present can
only reliably be determined by empirical methods.
Genome Annotation—much work remains
Despite good progress in identifying both protein
coding and non-protein coding genes, much work
remains to be done before even the best-studied
genomes are fully annotated. For the higher eukaryotes,
only a tiny percentage of features such as TFBSs,
CRMs, and other non-gene features have so far been
identified.