Transcript Genomes 3/e

Chapter 5: Understanding a Genome Sequence
Copyright © Garland Science 2007
Understanding Genome Sequence
• Most important step of genomics
• Genome annotation techniques
5-1. Locate genes
5-2. Function annotation
5-3. An example: yeast genome
(Will cover microbial genomes later)
Understanding Genome Sequence
5-1. Locate genes
by in silico analysis
or experimental techniques
Figure 5.1 Genomes 3 (© Garland Science 2007)
Open reading frame (ORF) scanning
finds start & stop codons
spanning >100 codons
(in random DNA w/ 50% GC, 3 of 64 codons
are stops, i.e. about 1 stop per 21 codons in a
given frame, so chance ORFs longer than 50
codons are rare; genes average 317 codons
in E. coli & 450 in human)
The coding regions of genes
are ORFs; either strand can
be the coding strand; therefore
6 possible reading frames for a
given dsDNA sequence.
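The six-frame scan described above can be sketched in a few lines of Python. This is a minimal illustration, not a production gene finder: it assumes an ORF runs from an ATG to the first in-frame stop codon, the function names are made up for this example, and hits on the "-" strand are reported in coordinates of the reverse-complemented sequence.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def orfs_in_frame(seq, frame, min_codons):
    """Yield (start, end) of ATG...stop ORFs in one reading frame.

    Coordinates are 0-based, end-exclusive, and include the stop codon.
    min_codons counts codons from the ATG up to, but not including, the stop.
    """
    start = None
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i
        elif start is not None and codon in STOP_CODONS:
            if (i - start) // 3 >= min_codons:
                yield start, i + 3
            start = None


def six_frame_orfs(seq, min_codons=100):
    """Scan all six frames; return (strand, frame, start, end) tuples."""
    seq = seq.upper()
    hits = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for frame in range(3):
            for start, end in orfs_in_frame(s, frame, min_codons):
                hits.append((strand, frame, start, end))
    return hits
```

Lowering `min_codons` makes the scan more sensitive to short genes but also reports more chance ORFs.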
Figure 5.2 Genomes 3 (© Garland Science 2007)
ORF scanning is effective (if not completely accurate)
for bacterial genomes.
E.g. E. coli lactose operon below. Red is the real gene,
yellow is the predicted ORF.
Figure 5.3 Genomes 3 (© Garland Science 2007)
ORF scanning is not optimal for large
eukaryotic genomes, because:
Unlike bacteria (11% intergenic in E. coli),
eukaryotic genes are widely spaced by noncoding regions (62% in human)
Unlike bacteria, eukaryotic genes are not
continuous (split by introns) & sometimes
overlap
ORF scans are complicated by introns.
Line 2 is the real amino acid sequence. The intron is
excised by splicing of the pre-mRNA.
Line 3 is the amino acid sequence predicted w/o
considering the intron; it is shorter than the real one.
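A toy example makes the problem concrete. The sequences below are invented for illustration (they are not the gene shown in Figure 5.4): the intron starts with GT, ends with AG and carries an in-frame stop codon, so a naive scan of the genomic DNA truncates the protein.

```python
from itertools import product

# Standard genetic code, built in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(seq):
    """Translate from position 0 up to the first stop codon (stop excluded)."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE[seq[i:i + 3]]
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def splice(seq, introns):
    """Remove introns given as (start, end) 0-based, end-exclusive intervals."""
    out, pos = [], 0
    for start, end in sorted(introns):
        out.append(seq[pos:start])
        pos = end
    out.append(seq[pos:])
    return "".join(out)


# Hypothetical two-exon gene; the intron occupies positions 9..20.
exon1, intron, exon2 = "ATGGCTGAA", "GTGAGTTAACAG", "AAATGGCATGGTTAA"
genomic = exon1 + intron + exon2

naive = translate(genomic)                    # reads straight through the intron
real = translate(splice(genomic, [(9, 21)]))  # intron removed before translation
```

Here `naive` stops inside the intron, while `real` gives the full-length product, which is the situation lines 2 and 3 of the figure contrast.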
Figure 5.4 Genomes 3 (© Garland Science 2007)
ORF scanning can be improved for
eukaryotic genomes by considering:
Codon bias: Not all codons are used equally (not
fully understood why but helpful for ORF search)
Exon-intron boundaries: introns usually begin w/ GT
(upstream boundary) & end w/ AG (downstream
boundary); these are consensus sequences, not
always the case
Upstream regulatory sequences: distinctive
sequence features (but can be variable) to identify
where genes begin
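Two of these signals are easy to compute. The sketch below tabulates codon usage for a known coding sequence and checks a candidate intron for the GT...AG consensus. Both helper names are invented for this example, and a real gene finder would combine such signals statistically rather than apply them as hard rules.

```python
from collections import Counter


def codon_usage(cds):
    """Return the relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(codons)
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def has_consensus_boundaries(intron):
    """True if a candidate intron starts with GT and ends with AG."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")
```

A candidate ORF whose codon frequencies resemble those of known genes from the same organism is more likely to be real; a candidate intron failing the GT...AG test is less likely, though not impossible, since the consensus has exceptions.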
Upstream consensus where
eukaryotic genes usually start
Figure 5.5 Genomes 3 (© Garland Science 2007)
5-1. Locate genes
by in silico
analysis
Search for functional RNA
genes (rRNA & tRNA), which
are not encoded by ORFs
But they have distinctive
features, e.g. stem-loop
structures formed by
intramolecular base pairing
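A crude way to look for potential stem-loops is to search for a short segment followed, after a small loop, by its own reverse complement. The sketch below does only that: it assumes perfect Watson-Crick stems (no G-U pairs, no mismatches), and the parameter defaults are arbitrary choices for illustration. Real tRNA and rRNA finders use far richer models.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement_rna(seq):
    """Return the reverse complement of an RNA string (A, C, G, U only)."""
    return seq.translate(RNA_COMPLEMENT)[::-1]


def find_stem_loops(rna, stem_len=5, min_loop=3, max_loop=8):
    """Return (left_arm_start, loop_length) for every perfect stem-loop found."""
    rna = rna.upper().replace("T", "U")
    hits = []
    for i in range(len(rna) - 2 * stem_len - min_loop + 1):
        left = rna[i:i + stem_len]
        target = reverse_complement_rna(left)  # sequence the right arm must have
        for loop in range(min_loop, max_loop + 1):
            j = i + stem_len + loop
            if rna[j:j + stem_len] == target:
                hits.append((i, loop))
    return hits
```

For example, `find_stem_loops("AAGGCACUUCGGUGCCAA")` reports one hairpin whose 5-nt stem starts at position 2 and encloses a 4-nt loop.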
Figure 5.6a Genomes 3 (© Garland Science 2007)
Homology search & comparative genomics
help locate genes.
Evolutionarily related genes share homologous
regions in coding sequences.
Figure 5.8 Genomes 3 (© Garland Science 2007)
Homology search & comparative genomics
help locate genes (Cont.)
Locate a gene by comparing closely related
genomes (e.g. species within the same genus).
Figure 5.9 Genomes 3 (© Garland Science 2007)
Computer-assisted genome annotation.
All-in-one: Scan ORFs + exon-intron boundaries,
upstream regulatory sequences, homology test,
cDNA search, etc. (below: a 15-kb segment of the
human genome annotated by Genotator)
Figure 5.10 Genomes 3 (© Garland Science 2007)
5-1. Locate genes
by experimental
techniques
Detection of RNA
transcribed from genes:
northern hybridization
detects whether a DNA
fragment contains
transcribed sequences
Note: 1 gene can give
≥2 transcripts w/
different lengths; some
genes are not expressed
under certain conditions
Figure 5.11 Genomes 3 (© Garland Science 2007)
5-1. Locate genes
by experimental techniques
Northern blotting gives no info on gene position.
Therefore, cDNA sequencing is needed, which can
map genes (find exon-intron boundaries) in
DNA fragments
cDNA = mRNA copy = leader + gene + trailer
1. Construct a cDNA library (containing all
expressed genes)
2. Hybridize the target DNA fragment with the
cDNA library
3. Repeat the hybridization multiple times (to
enrich poorly expressed genes; called "cDNA
capture")
Accurate cDNA
sequencing depends
on reverse
transcription of a
complete mRNA
Truncated cDNAs are
common (incomplete
synthesis leaves out the
5’ end of the mRNA)
Precise mapping of the 5’
end of transcripts by
Rapid Amplification
of cDNA Ends
(RACE)
5-1. Locate genes
by RACE
Purpose: amplify a
short/partial cDNA
molecule that covers
the complete 5’ end
Prerequisite: a basic
knowledge of the
gene sequence
1. An internal primer
anneals close to the
5’ end of the mRNA
2. Reverse transcriptase
(RT) synthesizes cDNA
Figure 5.13 part 1 of 2 Genomes 3 (© Garland Science 2007)
5-1. Locate genes
by RACE (Cont.)
3. Add poly(A) tail to the 3’ end of the cDNA
4. Anneal anchor
primer
5. Continue as a
regular PCR
6. Sequence the PCR
amplicon
End product: a
fragment w/5’-end
of the mRNA
3’ end can be analyzed
in a similar way
Figure 5.13 part 2 of 2 Genomes 3 (© Garland Science 2007)
5-1. Locate genes
by heteroduplex
analysis
Purpose: locate
start/end of a gene
based on mRNA
Prerequisite: an M13
library clone
spanning the gene
end is available
S1 nuclease trims the
single-stranded regions
left unpaired in the
DNA-mRNA hybrid
Figure 5.14 Genomes 3 (© Garland Science 2007)
Locate exon
boundary by
exon trapping
Purpose: to find exon
boundaries by using
an exon-trap vector
Followed by PCR & DNA
sequencing analysis
Figure 5.15 Genomes 3 (© Garland Science 2007)
5-2. Determine the gene functions
Genome is sequenced, then putative genes
(start + end) are identified, but the work has
just started. How do these genes function?
An example: E. coli K-12 has 4288 genes, but only
1853 genes (43%) had been identified in the
past >100 years of research; yeast (30%);
human (largely unknown) by 2006.
Therefore, the most important step is to study
the functions of genes, referred to as functional
genomics
5-2. Determine the gene functions
5-2-1. Computer in silico analysis (mainly
by homology search)
5-2-2. Experimental analysis (by gene
inactivation or over-expression)
5-2-1. Homology search
To what extent is an unknown gene similar to a
known gene from a different organism?
Assumption: homologous genes share a
common evolutionary ancestor. Two
categories: orthologous & paralogous
Ancestor predates
speciation
Figure 5.16 Genomes 3 (© Garland Science 2007)
e.g. myoglobin &
β–globin duplicated
550 Myr ago
5-2-1. Homology search (Cont.)
Translate & align amino acid sequences (not simply
DNA sequences) & give an identity score
Plus, consider the relatedness of the aligned amino
acids (e.g. give leucine & isoleucine a higher
score than cysteine & tyrosine)
BLAST & PSI-BLAST (e.g. below: 76% DNA
identity vs. 28% amino acid identity)
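The two scoring ideas above (plain identity, and extra credit for chemically related residues) can be sketched as follows. This is only an illustration for pre-aligned, gap-free sequences of equal length: the similarity groups and the score values are assumptions made for this example, not the substitution matrices that BLAST actually uses.

```python
# Illustrative groups of chemically similar amino acids (an assumption,
# not a published substitution matrix).
SIMILAR_GROUPS = [set("ILVM"), set("FYW"), set("KRH"),
                  set("DE"), set("ST"), set("NQ")]


def percent_identity(a, b):
    """Percentage of aligned positions that are identical."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)


def are_similar(x, y):
    """True if both residues fall in the same illustrative group."""
    return any(x in group and y in group for group in SIMILAR_GROUPS)


def similarity_score(a, b, identical=2, similar=1, different=-1):
    """Sum per-position scores: identical > similar > different."""
    if len(a) != len(b):
        raise ValueError("sequences must be of equal length")
    score = 0
    for x, y in zip(a, b):
        if x == y:
            score += identical
        elif are_similar(x, y):
            score += similar
        else:
            score += different
    return score
```

With these settings a leucine/isoleucine pair scores 1 while a cysteine/tyrosine pair scores -1, mirroring the example on the slide.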
Figure 5.18 Genomes 3 (© Garland Science 2007)
5-2-1. Homology
search (Cont.)
Identifying a functional
domain is an
alternative
Genes diverge (low
overall similarity) but
retain a conserved
functional domain
An example (left): tudor
domain is conserved
between fruit fly &
human (RNA
metabolism)
Figure 5.19 Genomes 3 (© Garland Science 2007)
5-2-1. Homology search (Cont.)
It is surprising how
genetically close we
are to the bugs that
ferment beer.
5-2-1. Homology search (Cont.)
Homology search helps find functionally
conserved genes across genera & study
human disease (e.g. many metabolic genes are
conserved in yeast & human)
Table 5.1 Genomes 3 (© Garland Science 2007)
5-2-2. Experimental
analysis of gene
functions
Most genes cannot be
assigned a function by
in silico comparison
Need to reverse the
process (from
genotype to
phenotype)
Inactivate a gene by
homologous
recombination & find
out the altered
phenotype
Figure 5.20 Genomes 3 (© Garland Science 2007)
Inactivate yeast
gene by
homologous
recombination of a
deletion cassette
(antibiotic
resistance marker +
two regions
homologous to the
target gene)
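As a rough string-level model of this cassette, the sketch below takes the two homology arms from the start and end of the target gene and places a marker between them, then shows the locus after a successful double crossover. Everything here is hypothetical: the function names, the arm length and the marker placeholder are chosen only for illustration, and no biochemistry is modeled.

```python
def build_deletion_cassette(gene, marker, arm_len):
    """Return left arm + marker + right arm, with arms copied from the gene ends."""
    if len(gene) < 2 * arm_len:
        raise ValueError("gene is too short for two non-overlapping arms")
    return gene[:arm_len] + marker + gene[-arm_len:]


def recombine(chromosome, gene, cassette, arm_len):
    """Model a double crossover: the gene copy is replaced by the cassette."""
    pos = chromosome.find(gene)
    if pos == -1:
        raise ValueError("target gene not found in chromosome")
    arms_match = (cassette[:arm_len] == gene[:arm_len]
                  and cassette[-arm_len:] == gene[-arm_len:])
    if not arms_match:
        raise ValueError("cassette arms are not homologous to the gene ends")
    return chromosome[:pos] + cassette + chromosome[pos + len(gene):]
```

After recombination only the two short gene ends remain, flanking the marker, so the gene is inactivated and the cells can be selected by the marker phenotype.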
Figure 5.21 Genomes 3 (© Garland Science 2007)
Inactivate mouse gene
by homologous
recombination of a
deletion cassette in
embryonic stem cells &
screen for non-chimeric
knockout mice
5-2-2. Experimental
analysis of gene
functions (Cont.)
Inactivate gene by
transposon tagging
Most genomes contain
transposons. Most
quiescent, a few active.
(Left) genetically
engineered yeast
transposon, responsive
to an external stimulus
(e.g. galactose)
Figure 5.22 Genomes 3 (© Garland Science 2007)
5-2-2. Experimental
analysis of gene
functions (Cont.)
Transposon tagging is
random & hard to target
to specific genes
Alternative method:
RNA interference
(RNAi)
Occurs naturally in
regulation of gene
expression; degrades the
mRNA instead of
inactivating the gene
by insertion
Figure 5.23 Genomes 3 (© Garland Science 2007)
RNA interference was
initially found in bacteria-eating
worms (C. elegans)
Presence of dsRNA in cell
prevents protein
synthesis & leads to cell
death; but 21-22 bp
siRNA can circumvent this
Useful for 8K of 35K
human genes, but the
challenge is moving from
in vitro to in vivo
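To illustrate what a 21-nt siRNA targets, the sketch below slides a window along an mRNA and reports each target site together with the antisense strand that would pair with it. This only enumerates candidates; it is an assumption-laden toy, since practical siRNA design applies additional selection rules that are not covered here.

```python
RNA_COMPLEMENT = str.maketrans("ACGU", "UGCA")


def sirna_candidates(mrna, length=21):
    """Return (position, target_site, antisense_strand) for every window."""
    mrna = mrna.upper().replace("T", "U")
    candidates = []
    for i in range(len(mrna) - length + 1):
        target = mrna[i:i + length]
        # The antisense strand is the reverse complement of the target site.
        antisense = target.translate(RNA_COMPLEMENT)[::-1]
        candidates.append((i, target, antisense))
    return candidates
```

An mRNA of n nucleotides gives n - 20 candidate 21-nt sites, so even a short transcript offers many possible targets to choose from.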
5-2-2. Experimental
analysis of gene
functions (Cont.)
Gene overexpression
Instead of making a
gene disappear, what
if a gene (or its
product) is present in
excess in a cell?
Multiplies to 40-200 copies/cell
Figure 5.24 Genomes 3 (© Garland Science 2007)
e.g. high
density
bones found
Sometimes, the phenotypic
effect of gene
inactivation or overexpression is difficult to
discern
A list of phenotypes
needs to be examined
for the target organism
e.g. mutation of the
largest gene in yeast
seemed to have no apparent
effect, but the mutant was
later found to be
intolerant of low pH
Table 5.2 part 1 of 2 Genomes 3 (© Garland Science 2007)
5-2-2. Experimental
analysis of gene
functions (Cont.)
Inactivation of only 10% of
the 19K genes in C. elegans
is found to cause
phenotypic changes
Lack of discernible
phenotypic changes
poses a challenge for
identifying gene functions
Table 5.2 part 2 of 2 Genomes 3 (© Garland Science 2007)
5-2-2. Experimental
analysis of gene
functions (Cont.)
Site-directed
mutagenesis
Useful for replacing a
target gene w/ a partially
modified gene
Two-step homologous
recombination; Loss of
marker gene phenotype
In this case, mutated gene
still can be expressed.
Figure 5.25 Genomes 3 (© Garland Science 2007)
5-2-2. Experimental
analysis of gene
functions (Cont.)
Gene expression is
sometimes restricted
to a particular organ &
a developmental
stage.
Reporter genes &
immunocytochemistry can
help to locate where &
when genes are
expressed.
Figure 5.27 Genomes 3 (© Garland Science 2007)
5-3. Annotation
of yeast genome
sequence
No homolog & no
function assigned
Genome
sequencing
completed in
1996.
6274 ORFs
identified w/ 100-codon cut-off.
Homolog
found but no
function
assigned
Figure 5.28 Genomes 3 (© Garland Science 2007)
30% are true
genes (previously
identified)
5-3. Annotation of
yeast genome
sequence (Cont.)
100,000 ORFs
identified if 15-codon cut-off.
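The effect of the cut-off can be explored on random sequence, where every ORF is a false positive. The sketch below is a simplified model (forward strand only, uniform base composition, hypothetical helper names): `chance_open(n)` gives the probability that n consecutive random codons contain no stop, and `count_orfs` counts ATG...stop ORFs at a chosen cut-off.

```python
import random

STOPS = {"TAA", "TAG", "TGA"}


def chance_open(n_codons):
    """Probability that n random codons (equal base frequencies) hold no stop."""
    return (61 / 64) ** n_codons


def count_orfs(seq, min_codons):
    """Count ATG...stop ORFs of at least min_codons in the 3 forward frames."""
    count = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if (i - start) // 3 >= min_codons:
                    count += 1
                start = None
    return count


def random_dna(n, seed=0):
    """Return a reproducible random DNA string of length n."""
    rng = random.Random(seed)
    return "".join(rng.choice("ACGT") for _ in range(n))
```

Comparing `count_orfs(random_dna(30000), 15)` with `count_orfs(random_dna(30000), 100)` shows why a short cut-off floods the annotation with chance ORFs, and why extra evidence is needed to pick out genuine short genes.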
Above: A short ORF encoding a
small protein (38 amino acids):
predicted eukaryotic homolog of
prokaryotic ribosomal protein
L36
Finding short genes is a
huge task & can
use: comparative
genomics, evidence
of transcription, or
transposon tagging,
etc.
5-3. Annotation of
yeast genome
sequence (Cont.)
In-frame fusion of a
lacZ gene lacking its
own start codon to
the candidate ORF.
If the lacZ gene is
expressed (detected
by X-gal test), a
functional gene is
present.
Figure 5.29 Genomes 3 (© Garland Science 2007)
5-3. Annotation of
yeast genome
sequence (Cont.)
Barcode deletion
strategy can be used
for high-throughput
screening of mutant
library.
Barcode sequence (20
bp) is different for
each deletion made by
homologous recombination.
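Because every deletion strain carries its own 20-bp barcode, a pooled mutant library can be scored by counting barcodes. The sketch below assumes, purely for illustration, that each sequencing read begins with the barcode and that a lookup table from barcode to strain is available; the function and strain names are hypothetical.

```python
from collections import Counter


def count_barcodes(reads, barcode_to_strain, barcode_len=20):
    """Count reads per deletion strain; reads with unknown barcodes are skipped."""
    counts = Counter()
    for read in reads:
        strain = barcode_to_strain.get(read[:barcode_len])
        if strain is not None:
            counts[strain] += 1
    return counts
```

Comparing the counts before and after growth under a test condition indicates which deletion strains dropped out of the pool, and therefore which genes are needed under that condition.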
Figure 5.30 Genomes 3 (© Garland Science 2007)
Chapter 5 Summary
A variety of methods are used for identification of
genes in a genome sequence, including computer-based
analysis (e.g. ORF scanning or homology
searching) & experimental techniques (cDNA
sequencing or transcript mapping).
Gene functions can be annotated by computer
analysis (e.g. homology searching) & experimental
techniques as well (e.g. gene inactivation by
transposon, RNA interference, gene overexpression,
site-directed homologous
recombination, reporter genes, etc).