Genomics and Gene Recognition


CIS 667 April 27, 2004
Genomics and Gene
Recognition
• How do we recognize the genes given the
raw sequence data?
• Two different cases:
 Prokaryotes: relatively easy
 Eukaryotes: relatively difficult
 Much “junk DNA” to search through
• Signals determine the beginnings and
ends of genes
 Need to find the signals
Prokaryotic Genomes
• Genomic information of prokaryotes
dedicated mainly to basic tasks
 Make and replicate DNA
 Make new proteins
 Obtain and store energy
• Over 60 prokaryotic genomes have been
completely sequenced since the mid-1990s
Prokaryotic Genomes
• Recall - prokaryotes have a single circular
chromosome
• Also - no cell nucleus, therefore no splicing out
of introns
• Therefore, prokaryotic gene structure is quite
simple
Figure labels: promoter region, operator sequence, transcriptional start site, translational start site (AUG), open reading frame, translational stop site, transcriptional stop site
Promoter Elements
• Gene expression begins with transcription
 RNA copy of a gene made by an RNA
polymerase
 Prokaryotic RNA polymerases are assemblies
of several different proteins
 β′ protein binds to DNA template
 β protein links nucleotides
 α protein holds subunits together
 σ protein recognizes specific nucleotide sequences
of promoters
Promoter Elements
• β′, β and α are often very similar from one
bacterial species to another
• σ can vary (less well conserved)
 Several variants are often found in a cell
 The ability to use several different σ factors
allows a cell to turn on or off expression of
whole sets of genes
 For example, σ32 turns on expression of
genes associated with heat shock, while σ54 does
the same for nitrogen stress; genes that always
need to be expressed are transcribed by
polymerases with σ70
Promoter Elements
• Each σ factor recognizes a particular
sequence of nucleotides upstream from
the gene
 σ70 looks for -35 sequence TTGACA and -10
sequence TATAAT
 Other σ factors look for other -35 and -10
sequences
 The match need not always be exact
 The better the match, the more likely transcription
will be initiated
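The idea that a better match makes initiation more likely can be sketched as a simple consensus score. This is only an illustration: the function names are hypothetical, and the 16-18 bp spacer allowed between the two hexamers is an assumption that the slides do not state.

```python
SIGMA70_MINUS35 = "TTGACA"  # -35 consensus from the slide
SIGMA70_MINUS10 = "TATAAT"  # -10 consensus from the slide


def match_score(site: str, consensus: str) -> int:
    """Count the positions where a candidate site agrees with the consensus."""
    return sum(a == b for a, b in zip(site.upper(), consensus))


def best_sigma70_site(dna: str, spacer_range=range(16, 19)):
    """Return (score, pos35, pos10) for the best-scoring -35/-10 pair.

    spacer_range is the assumed number of bases between the two hexamers.
    A perfect match to both hexamers scores 12.
    """
    dna = dna.upper()
    best = (-1, None, None)
    for i in range(len(dna) - 5):
        score35 = match_score(dna[i:i + 6], SIGMA70_MINUS35)
        for gap in spacer_range:
            j = i + 6 + gap
            if j + 6 > len(dna):
                break
            total = score35 + match_score(dna[j:j + 6], SIGMA70_MINUS10)
            if total > best[0]:
                best = (total, i, j)
    return best
```

A real promoter finder would weight positions unequally (for example with a position weight matrix); the plain count here only mirrors the "better match" wording of the slide.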
Promoter Elements
• Protein products from some genes are always
used in tandem with those from some other
genes
 These related genes may share a single promoter in
prokaryotic genomes and be arranged in an operon
 When one gene is transcribed, so are all of the others
- one polycistronic RNA molecule is produced
 The lactose operon contains three genes involved in
metabolism of the sugar lactose in bacterial cells
Operon
Operon
• The protein encoded by the regulatory
gene (pLacI) can bind to lactose or to the
operator sequence of the operon
 When lactose is abundant, pLacI is less likely to
bind to the operator sequence
 When it does bind the operator, it blocks transcription, thus acting as
a negative regulator
 Even without negative regulation, operon expression
is low because the promoter is a poor match to the
consensus sequence for the σ factor
• A positive regulator (CRP) promotes expression
Operon
Lac Operon
Open Reading Frames
• Recall - 3 of the 64 codons are stop codons
(UAA, UAG, UGA) - they cause translation to
stop
• Most prokaryotic proteins are longer than 60
amino acids
 Since on average we expect to find a stop codon
about once in every 21 codons (probability 3/64 per codon), the presence of a run
of 30 or more codons with no stop codons (an Open
Reading Frame - ORF) is good evidence that we are
looking at the coding sequence of a prokaryotic gene
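The arithmetic behind the "once in every 21 codons" figure can be checked with a short calculation. The sketch assumes, as a simplification not made explicit in the slides, that all 64 codons are equally likely and independent of one another; the function names are hypothetical.

```python
P_STOP = 3 / 64  # three of the 64 codons are stop codons


def expected_codons_per_stop() -> float:
    """Average spacing between stop codons in random sequence (about 21.3)."""
    return 1 / P_STOP


def prob_stop_free_run(n_codons: int) -> float:
    """Chance that n consecutive random codons contain no stop codon."""
    return (1 - P_STOP) ** n_codons
```

Under this uniform model a 30-codon stop-free run still occurs by chance roughly 24% of the time, while a 60-codon run occurs roughly 6% of the time, so longer thresholds are more selective.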
Open Reading Frames
• AUG is a start codon
 Defines where translation begins
 If no likely promoter sequences are found
upstream of a start codon at the start of an
ORF before the end of the preceding ORF,
assume the two genes are part of an operon
whose promoter sequence is further upstream
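A minimal ORF scan along these lines might look as follows. It is a sketch under stated assumptions: it works on the DNA coding strand (so AUG appears as ATG), it checks only the three forward reading frames, and the name `find_orfs` and the default threshold of 60 codons are illustrative choices.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA forms of UAA, UAG, UGA


def find_orfs(dna: str, min_codons: int = 60):
    """Return (start, end, frame) for each ATG...stop run of sufficient length.

    start is the index of the A in ATG, end is the index just past the stop
    codon, and the length test counts codons before the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first start codon opens the ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None  # look for the next ORF in this frame
    return orfs
```

Scanning the reverse complement the same way would cover the other three frames.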
Termination Sequence
• Most prokaryotic operons contain specific
signals for the termination of transcription called
intrinsic terminators
 Must have a sequence of nucleotides that includes an
inverted repeat followed by
 A run of roughly six uracils
 The inverted repeat allows the RNA to form a loop
structure that greatly slows down RNA synthesis
 Together with the chemical properties of uracil, this is enough
to end transcription
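The terminator signal described above (an inverted repeat followed by a run of uracils) can be searched for with a small pattern scan. This is a rough sketch: it works on the DNA strand, so the uracil run appears as T, and the stem length of 6, the loop of 3-8 bases, and the requirement of exactly matching stems are assumptions chosen for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_terminators(dna: str, stem: int = 6, loop_range=range(3, 9), min_t: int = 6):
    """Return (start, end) spans of stem, loop, inverted stem, then a T run."""
    dna = dna.upper()
    hits = []
    for i in range(len(dna) - stem + 1):
        left = dna[i:i + stem]
        for loop in loop_range:
            j = i + stem + loop                    # start of the second stem arm
            right = dna[j:j + stem]
            tail = dna[j + stem:j + stem + min_t]  # bases right after the hairpin
            if len(right) == stem and right == revcomp(left) and tail == "T" * min_t:
                hits.append((i, j + stem + min_t))
    return hits
```

Real terminators tolerate mismatches in the stem and an imperfect U run, so a practical tool would score candidates rather than demand exact matches.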
Termination Sequence
GC Content in Prokaryotic
Genomes
• For every G within a double-stranded DNA
genome there must be a C - likewise an A for
every T
 Only constraint on fraction of nucleotides that are G/C
as opposed to A/T is that the two must add to 100%
 Can use genomic GC content to identify bacterial
species (ranges from 25% to 75%)
 Can also use GC content to identify genes that have
been obtained from other bacteria by horizontal gene
transfer
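Both uses of GC content can be sketched in a few lines: one function for the genome-wide fraction, and a sliding window that flags regions whose GC content differs from the genome average, as might happen after horizontal gene transfer. The window size, step, and tolerance are arbitrary assumptions, as are the function names.

```python
def gc_content(dna: str) -> float:
    """Fraction of A/C/G/T bases that are G or C (0.0 for an empty sequence)."""
    dna = dna.upper()
    gc = dna.count("G") + dna.count("C")
    at = dna.count("A") + dna.count("T")
    return gc / (gc + at) if gc + at else 0.0


def atypical_windows(dna: str, window: int = 1000, step: int = 500,
                     tolerance: float = 0.10):
    """Return (start, gc) for windows whose GC differs from the genome-wide value."""
    genome_gc = gc_content(dna)
    flagged = []
    for start in range(0, len(dna) - window + 1, step):
        gc = gc_content(dna[start:start + window])
        if abs(gc - genome_gc) > tolerance:
            flagged.append((start, gc))
    return flagged
```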
Prokaryotic Gene Densities
• Gene density within prokaryotic genomes
is very high
 Between 85% and 88% of the nucleotides are
typically associated with coding regions of
genes
 Just as large portions of chromosomes can be
acquired, they can also be deleted
 Portions left are those which code for essential
genes
Gene Recognition in
Prokaryotes
• Long ORFs (60 or more codons)
• Matches to simple promoter sequences
• Recognizable transcriptional termination
signal (inverted repeats followed by a run of
uracils)
• Comparison with nucleotide (or amino
acid) sequences of known protein coding
regions from other organisms
Eukaryotic Genomes
• Much more complex
 Internal membrane-bound compartments
allow a wide variety of chemical environments
in each cell
 Multicellular organisms
 Each cell type has distinct gene expression
 Size of genome may be larger
 Allows for “junk DNA”
• Gene expression more complex and
flexible than in prokaryotes
Eukaryotic Gene Structure
Promoter Elements
• Each different cell type requires different
gene expression
 Therefore eukaryotes have elaborate
mechanisms for starting transcription
 Prokaryotes have a single RNA polymerase; eukaryotes have three
 RNA polymerase I - Ribosomal RNAs
 RNA polymerase II - Protein-coding genes
 RNA polymerase III - tRNAs, other small RNAs
Promoter Elements
• Most RNA polymerase II promoters
contain a set of sequences known
as a basal promoter where an
initiation complex is assembled and
transcription begins
• Also have several upstream
promoter elements (typically at least
5) to which other proteins bind
 Without the proteins binding upstream,
initiation complex assembly is difficult
Promoter Elements
• RNA polymerase II does not
directly recognize the basal
sequences of promoters
 Basal transcription factors
including a TATA-binding
protein (TBP) and at least 12
TBP-associated factors bind to
the promoter in a specific
order, facilitating binding of
RNA polymerase
 TATA-box 5’-TATAWAW-3’ (W is A
or T) at -25 relative to
transcriptional start site
 Initiator sequence 5’-YYCARR-3’
(Y is C or T and R is G or A) at
transcriptional start site
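The two degenerate motifs above translate directly into regular expressions. The sketch below pairs each initiator match with any TATA box that starts about 25 bp upstream; measuring the -25 offset from the start of the TATA match and allowing 5 bp of slack are assumptions, and the function names are hypothetical.

```python
import re

# Degenerate nucleotide codes used on the slide
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "W": "[AT]", "Y": "[CT]", "R": "[AG]"}


def iupac_to_regex(motif: str) -> str:
    """Expand a degenerate motif such as TATAWAW into a regex pattern."""
    return "".join(IUPAC[base] for base in motif)


TATA_BOX = re.compile(iupac_to_regex("TATAWAW"))
INITIATOR = re.compile(iupac_to_regex("YYCARR"))


def candidate_basal_promoters(dna: str, tata_offset: int = 25, slack: int = 5):
    """Return (tata_pos, inr_pos) pairs with the TATA box about tata_offset bp upstream."""
    dna = dna.upper()
    tata_hits = [m.start() for m in TATA_BOX.finditer(dna)]
    pairs = []
    for m in INITIATOR.finditer(dna):
        inr = m.start()
        for tata in tata_hits:
            if abs((inr - tata) - tata_offset) <= slack:
                pairs.append((tata, inr))
    return pairs
```

Note that `finditer` reports non-overlapping matches only, which is acceptable for a sketch but would miss overlapping candidate sites.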
Transcription
Regulatory Protein Binding Sites
• Transcription initiation in eukaryotes relies
heavily on positive regulation
 Constitutive factors work on many genes and
don’t respond to external signals
 Regulatory factors act on a limited number of
genes and respond to external signals
 Response factors (e.g. heat shock)
 Cell-specific factors (e.g. pituitary cells only)
 Developmental factors (e.g. early embryo
organization)
Open Reading Frames
• Before translation, a heterogeneous nuclear RNA
(hnRNA) is transformed into mRNA by
being
 Capped
 5’ end chemically altered
 Spliced
 Various splicings can occur
 Polyadenylated
 Long stretch of A’s added at 3’ end
Introns and Exons
• The introns are spliced out of the hnRNA
 Protein-coding genes conform to the GU-AG
rule
 These are the nucleotides at the 5’ and 3’ end of
the intron
 Other nucleotides are examined as well
• Most of these are inside the intron
• These signals constrain introns to be at least 60 bp long, but there is no upper limit
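The GU-AG rule and the 60 bp minimum give a simple first filter for candidate introns. The sketch below enumerates every GU...AG span of sufficient length in a pre-mRNA; it ignores the internal signals mentioned above, so it over-predicts heavily, and `candidate_introns` is a hypothetical name.

```python
def candidate_introns(pre_mrna: str, min_len: int = 60, max_len=None):
    """Return (start, end) spans that begin with GU and end with AG.

    DNA input is accepted and converted to RNA. end is the index just past
    the AG, so end - start is the intron length.
    """
    rna = pre_mrna.upper().replace("T", "U")
    donors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "GU"]
    acceptors = [i for i in range(len(rna) - 1) if rna[i:i + 2] == "AG"]
    spans = []
    for d in donors:
        for a in acceptors:
            length = a + 2 - d
            if length >= min_len and (max_len is None or length <= max_len):
                spans.append((d, a + 2))
    return spans
```

The optional `max_len` is there only to bound the search in practice; the rule itself sets no upper limit.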
Alternative Splicing
• About 20% of human genes give rise to
more than one type of mRNA sequence
due to alternative splicing
• Splice junctions can be masked, causing
an exon to be spliced out
• The following slide shows how alternative
splicing based on different splicing factors
(proteins) can stop a useful protein from
being produced
Alternative Splicing
GC Content
• Overall GC content between different
genomes does not vary as much in
eukaryotes as in prokaryotes
 However variations in GC content within a
genome can help us to recognize genes
 Of all of the pairs of nucleotides, statistically,
CG is found only at 20% of its expected value
 No other pair is under or over represented
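The "expected value" for a dinucleotide can be made concrete as an observed/expected ratio. The sketch assumes the usual independence model, in which the expected CG count is (number of C × number of G) / sequence length; the function name is hypothetical.

```python
def cpg_obs_exp(dna: str) -> float:
    """Observed CG count divided by the count expected from C and G frequencies."""
    dna = dna.upper()
    c, g = dna.count("C"), dna.count("G")
    if c == 0 or g == 0:
        return 0.0
    observed = dna.count("CG")   # CG cannot overlap itself, so count() is exact
    expected = c * g / len(dna)  # independence model
    return observed / expected
```

A ratio near 0.2 would correspond to the genome-wide depletion described above, while values near 1 computed over a 1-2 kbp window would be consistent with a CpG island.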
GC Content
• The expected levels of CG are found,
however, in stretches of 1-2 kbp at the
5’ ends of many human genes
 These are called CpG islands and are
associated with low levels of methylation
 Methylation makes it easy for CG to mutate to TG or
CA, so CG persists where methylation is low
 High levels of methylation imply low levels of
acetylation of histones (proteins which, when
acetylated, make transcription of DNA possible)
Isochores
• Vertebrates and plants display a level of
organization called isochores that is
intermediate between that of genes and
chromosomes
 The GC content of an isochore is relatively uniform
throughout
 There are five classes of isochores depending on the
level of GC content
 Those with high GC content also have high gene density
 The types of genes found in different classes differ as well
Codon Usage Bias
• Another hint for gene hunting can be
derived from the fact that every organism
prefers some of the equivalent (synonymous) triplet codons over others when
coding for proteins
• Real exons generally reflect the bias while
randomly chosen strings of triplets do not
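One way to turn codon bias into a gene-hunting score is to learn codon frequencies from known genes and then ask how much better those frequencies explain a candidate than a uniform 1/64 model. This is only a sketch: the log-odds scoring, the pseudo-frequency for unseen codons, and the function names are assumptions rather than the method of any particular tool.

```python
import math
from collections import Counter


def codon_frequencies(coding_seqs):
    """Relative frequency of each codon across a set of in-frame coding sequences."""
    counts = Counter()
    for seq in coding_seqs:
        seq = seq.upper()
        counts.update(seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    total = sum(counts.values())
    return {codon: n / total for codon, n in counts.items()}


def codon_bias_score(candidate: str, freqs, pseudo: float = 1e-3) -> float:
    """Mean log2 odds per codon of the learned frequencies versus uniform 1/64.

    Positive scores mean the candidate uses the organism's preferred codons;
    pseudo is an assumed frequency for codons never seen in training.
    """
    candidate = candidate.upper()
    codons = [candidate[i:i + 3] for i in range(0, len(candidate) - 2, 3)]
    if not codons:
        return 0.0
    return sum(math.log2(freqs.get(c, pseudo) * 64) for c in codons) / len(codons)
```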
Gene Recognition
• In summary, useful DNA sequence
features for gene hunting include
 Known promoter elements (e.g. TATA boxes)
 CpG islands
 Splicing signals associated with introns
 ORFs with characteristic codon utilization
 Similarity to the sequences of ESTs or genes
from other organisms.
Gene Expression
• Expression varies greatly, however
• Tools for determining gene expression
levels include cDNAs and ESTs
 Complementary DNAs are synthesized from
mRNAs and can be used to provide
expressed sequence tags useful for contig
assembly or gene recognition
cDNA
Microarrays
• Gene expression patterns can be studied
using microarrays
 Small silica (glass) chips covered with
thousands of short sequences of nucleotides
of known sequence
 The microarray can then be used to compare
the expression of all of the genes in the
genome simultaneously
 A gene is represented by a set of 16 probes
Microarrays
• The probes representing genes are arranged in
a grid on the chip
• Fluorescently labeled cDNA from the
tissue/organism we want to test is washed over
the chip
• If a gene is expressed, its cDNA will bind to that gene’s
probes
• We can detect this through pattern recognition
Microarrays
Make cDNA from cells before treatment with a drug
Make cDNA from cells after treatment with a drug
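The before/after comparison can be sketched as a per-gene log2 fold change. The code assumes each gene's signal is the mean of its probe intensities and that both measurements are already on a comparable scale; real microarray analysis also needs background correction and normalization, which are omitted here. The gene names and function names are hypothetical.

```python
import math


def probe_set_mean(intensities) -> float:
    """Average intensity over the probes that represent one gene."""
    return sum(intensities) / len(intensities)


def log2_fold_changes(before, after, floor: float = 1.0):
    """Map gene -> log2(after / before) for genes present in both samples.

    before and after map gene names to lists of probe intensities. floor keeps
    very low signals from producing extreme ratios.
    """
    changes = {}
    for gene in before.keys() & after.keys():
        b = max(probe_set_mean(before[gene]), floor)
        a = max(probe_set_mean(after[gene]), floor)
        changes[gene] = math.log2(a / b)
    return changes
```

A positive value indicates higher expression after treatment and a negative value indicates lower expression.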
Microarrays
Transposition
• Transposons result from insertion of a
duplicate sequence from another part of
the genome, aided by a transposase
enzyme
 If inserted in “junk DNA”, not harmful
 More common are retrotransposons, which arise
from retroviruses (encapsulated RNA and
reverse transcriptase, which use a host to
duplicate) like HIV
Retrovirus Replication
Virus Replication