Transcript Lecture 4

Gene Structure and Function
Genetic Code
All genomes, from viruses to humans, are built on linear sequences of
nucleotides and share a (nearly) universal code.
An mRNA specifies an amino acid sequence through the genetic code.
There are 20 amino acids, so a single nucleotide (4 possibilities) could specify only 4 of them.
Two-nucleotide combinations could specify only 16 amino acids.
Three nucleotides (64 possibilities), called a codon, are enough to specify every
amino acid.
Each group of 3 nucleotides codes for one amino acid.
•The first codon is the start codon, and usually codes for the amino
acid methionine (M, which has the codon ‘ATG’).
•The last codon is the stop codon and does NOT code for an amino acid. It is
sometimes represented by ‘*’ to indicate the ‘STOP’ codon.
•A coding region (abbreviation CDS) starts at the START codon and ends at
the STOP codon.
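The reading of a CDS from START to STOP can be sketched in a few lines of Python. This is a minimal illustration, not a full translation tool: it assumes the standard genetic code, an input that is already in frame, and only the letters A, C, G and T. The function name `translate` and the example sequence (built from the 30 bp DNA string used later in these slides, with a TAA stop appended) are assumptions made for the example.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with amino acids listed in TCAG order
# for the first, second and third codon position.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(cds):
    """Translate an in-frame DNA coding sequence, stopping after the first stop codon.

    Codons containing letters other than A, C, G, T raise KeyError.
    """
    protein = []
    for i in range(0, len(cds) - 2, 3):
        aa = CODON_TABLE[cds[i:i + 3].upper()]
        protein.append(aa)
        if aa == "*":  # '*' marks the stop codon
            break
    return "".join(protein)

print(translate("ATGTCGATAATTGGACTATTTGGATAA"))
```

The result starts with M (the ATG start codon) and ends with ‘*’ (the stop codon), matching the CDS definition above.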
Codon table
Each amino acid might have up
to six codons that specify it.
A handful of species and organelles
(e.g. mitochondria) deviate from the
standard codon assignments and use
some codons differently.
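The "up to six codons" statement can be checked by counting how many codons map to each amino acid. The sketch below assumes the standard genetic code only (not the variant codes); the variable names are chosen for the example.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code in TCAG order; '*' marks stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
codon_table = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Number of codons that specify each amino acid (or stop).
codons_per_aa = Counter(codon_table.values())

max_codons = max(n for aa, n in codons_per_aa.items() if aa != "*")
six_fold = sorted(aa for aa, n in codons_per_aa.items() if n == 6)
single = sorted(aa for aa, n in codons_per_aa.items() if n == 1)

print(max_codons, six_fold, single)
```

In the standard code, leucine, arginine and serine are the amino acids with six codons, while methionine and tryptophan have only one.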
RNA
RNA consists of a sugar-phosphate backbone, with bases attached to the 1'
carbon of each sugar.
The differences between DNA and RNA are that:
• RNA has a hydroxyl group on the 2' carbon of the sugar.
• Where DNA uses thymine (T), RNA uses uracil (U).
• Because of the extra hydroxyl group on the sugar, RNA cannot adopt the
B-form double helix of DNA. RNA usually exists as a single-stranded molecule.
However, short regions of double helix can form where there is base-pair
complementarity (U with A, G with C), resulting in hairpin loops. An
RNA molecule with its hairpin loops is said to have a secondary
structure.
• An RNA molecule can form many different stable three-dimensional tertiary
structures, because it is not restricted to a rigid double helix.
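As a rough illustration of how base-pair complementarity gives a hairpin, the sketch below tests whether the two ends of an RNA string can pair with each other in antiparallel orientation. It is a simplification: only U-A and G-C pairs are allowed (no G-U wobble pairs), the stem length is given by the caller, and no energy calculation is done. The function name `forms_hairpin` is made up for the example.

```python
# Watson-Crick pairs in RNA (G-U wobble pairs are deliberately ignored here).
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}

def forms_hairpin(rna, stem_len):
    """True if the first and last stem_len bases can pair antiparallel."""
    if stem_len < 1 or 2 * stem_len >= len(rna):
        return False  # no stem requested, or no bases left over for the loop
    five_prime = rna[:stem_len]
    three_prime = rna[-stem_len:]
    # The first base of the 5' arm pairs with the last base of the 3' arm, and so on.
    return all((x, y) in PAIRS for x, y in zip(five_prime, reversed(three_prime)))

print(forms_hairpin("GCGCUUUUGCGC", 4))  # stem GCGC/GCGC, loop UUUU
print(forms_hairpin("GCGCUUUUAAAA", 4))
```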
Open Reading Frames (ORF)
On a given piece of DNA, there are 6 possible reading frames. An ORF can be
on either the + or the - strand, in any of 3 possible frames:
Frame 1: the 1st base of the start codon can be at base 1, 4, 7, 10, ...
Frame 2: the 1st base of the start codon can be at base 2, 5, 8, 11, ...
Frame 3: the 1st base of the start codon can be at base 3, 6, 9, 12, ...
(frames -1, -2, -3 are on the minus strand)
An open reading frame starts with ATG in most species, and ends with a
stop codon (TAA, TAG or TGA).
A program called SIXFRAME performs this six-frame analysis; you can visit the site directly:
http://searchlauncher.bcm.tmc.edu/seq-util/Options/sixframe.html
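The six-frame idea can be sketched as a small ORF scanner. This is not the SIXFRAME program, only a minimal illustration under simple assumptions: an ORF is the first ATG in a frame up to the next in-frame stop codon, the input contains only A, C, G and T, and the minus strand is handled by scanning the reverse complement. All function names are chosen for the example.

```python
STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the minus strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]

def orfs_in_frame(seq, offset):
    """Yield (start, end) of ATG...stop ORFs with codons at offset, offset+3, ..."""
    start = None
    for i in range(offset, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if start is None and codon == "ATG":
            start = i                 # first ATG opens the ORF
        elif start is not None and codon in STOPS:
            yield start, i + 3        # the stop codon is included in the ORF
            start = None

def six_frame_orfs(seq):
    """Return (frame, orf_sequence) pairs for frames 1, 2, 3, -1, -2, -3."""
    seq = seq.upper()
    found = []
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frame = offset + 1 if strand == "+" else -(offset + 1)
            for start, end in orfs_in_frame(s, offset):
                found.append((frame, s[start:end]))
    return found

print(six_frame_orfs("CCATGAAATAGGG"))
```

Here the only complete ORF, ATGAAATAG, starts at base 3 and is therefore reported in frame 3.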
Eukaryotic Nuclear Gene Structure
Gene prediction for Pol II transcribed genes.
• Upstream enhancer elements.
• Upstream promoter elements.
• GC box (-90 nt) (20 bp), CAAT box (-75 nt) (22 bp)
• TATA promoter (-30 nt) (70%, 15 nt consensus; Bucher et al. (1990))
• 14-20 nt spacer DNA
• CAP site (8 bp)
• Transcription initiation.
• Transcript region, interrupted by introns.
• Translation initiation (Kozak signal, 12 bp consensus), 6 bp prior to the initiation codon.
• polyA signal (AATAAA 99%, other)
Introns
Transcript region, interrupted by introns.
Each intron:
starts with a donor site consensus
(G100 T100 A62 A68 G84 T63 ..; the numbers give the percentage of introns with that base at that position),
has a branch site near the 3’ end of the intron
(one, not very well conserved, consensus is UACUAAC),
ends with an acceptor site consensus
((Py)12 .. N C65 A100 G100).
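The three intron signals can be turned into simple checks. The sketch below assumes DNA letters (so the branch site UACUAAC is written TACTAAC), tests only the invariant GT and AG dinucleotides rather than the full weighted consensus, and uses 0-based, half-open intron coordinates. The function names and the toy gene are invented for the example.

```python
def has_canonical_splice_sites(intron):
    """True if the intron starts with GT (donor) and ends with AG (acceptor)."""
    intron = intron.upper()
    return len(intron) >= 4 and intron.startswith("GT") and intron.endswith("AG")

def branch_site_position(intron, motif="TACTAAC"):
    """0-based index of the last exact match to the branch site motif, or -1."""
    return intron.upper().rfind(motif)

def splice(gene, introns):
    """Remove introns given as sorted (start, end) pairs, 0-based, end exclusive."""
    exons, previous_end = [], 0
    for start, end in introns:
        if not has_canonical_splice_sites(gene[start:end]):
            raise ValueError(f"intron {start}-{end} lacks GT...AG")
        exons.append(gene[previous_end:start])
        previous_end = end
    exons.append(gene[previous_end:])
    return "".join(exons)

# Toy gene: two exons around one intron with a branch site and a pyrimidine tract.
exon1, intron, exon2 = "ATGGCC", "GTAAGTCCTACTAACTTTTTTTTTTTCAG", "TTTTAA"
gene = exon1 + intron + exon2
print(splice(gene, [(len(exon1), len(exon1) + len(intron))]))
```

Real splice site prediction scores the whole consensus (for example with position weight matrices), since GT and AG alone occur far too often by chance.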
Exons
The exons of the transcript region are composed of:
5’ UTR (mean length of 769 bp, with a specific base composition that depends on the local G+C content of the genome)
AUG (or other start codon)
Remainder of coding region
Stop codon
3’ UTR (mean length of 457 bp, with a specific base composition that depends on the local G+C content of the genome)
Non-Coding Eukaryotic DNA
Untranslated regions (UTRs)
introns (there can be genes within the introns of another gene!)
intergenic regions:
- repetitive elements
- pseudogenes (dead genes that may (or may not) have been retroposed back
into the genome as a single-exon “gene”)
Repeats
Each repeat family has many subfamilies.
ALU: ~300 nt long; 600,000 elements in the
human genome. Can cause false homology
with mRNA. Many have an AluI restriction
site.
Retroposons (can get copied back into the
genome):
LINEs (Long INterspersed Elements)
L1: 1-7 kb long, 50,000 copies
SINEs (Short INterspersed Elements)
Low-Complexity Elements




When analyzing sequences, one often relies on the fact that
two stretches are similar to infer that they are homologous
(and therefore related). But sequences with repeated
patterns will match without there being any phylogenetic
relation!
Sequences like ATATATACTTATATA, which consist mostly of two
letters, are called low-complexity.
Triplet repeats (particularly CAG) have a tendency to make
the replication machinery stutter, so they are amplified.
Low-complexity sequence can also be hidden at the
translated protein level.
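One simple way to flag a "mostly two letters" sequence is to measure what fraction of it is made up of its two most common letters. This is only a sketch of the idea, not the method used by any particular masking program; the function names and the 0.9 threshold are assumptions chosen for illustration.

```python
from collections import Counter

def top_two_fraction(seq):
    """Fraction of the sequence made up of its two most frequent letters."""
    if not seq:
        return 0.0
    counts = Counter(seq.upper())
    top_two = sum(n for _, n in counts.most_common(2))
    return top_two / len(seq)

def is_low_complexity(seq, threshold=0.9):
    """Flag sequences dominated by two letters (the threshold is arbitrary)."""
    return top_two_fraction(seq) >= threshold

for s in ("ATATATACTTATATA", "ACGTACGTACGTACG"):
    print(s, round(top_two_fraction(s), 2), is_low_complexity(s))
```

The same function works on protein strings, which is one way to look for low complexity hidden at the translated level.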
Structure of the Eukaryotic Genome
~6-12% of human DNA encodes proteins.
~10% of human DNA codes for UTR
~90% of human DNA is non-coding.
Masking
To avoid finding spurious matches in alignment programs,
you should always mask out repeats and low-complexity regions in the query sequence.
Before predicting genes, it is a good idea to mask out repeats
(at least those containing ORFs).
Before running blastn against a genomic record, you must
mask out the repeats.
Most used programs:
GenScan: http://genes.mit.edu/GENSCAN.html
RepeatMasker: http://ftp.genome.washington.edu/cgi-bin/RepeatMasker
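What masking does to a sequence can be shown with two small helpers. They do not find repeats themselves (that is the job of a program such as RepeatMasker); they only replace already known repeat positions with N so that alignment programs ignore them. The coordinates (0-based, end exclusive), the lower-case convention for marked repeats and the function names are all assumptions made for this sketch.

```python
def mask_intervals(seq, intervals, mask_char="N"):
    """Replace every base inside the given (start, end) intervals with mask_char."""
    masked = list(seq)
    for start, end in intervals:
        for i in range(max(start, 0), min(end, len(seq))):
            masked[i] = mask_char
    return "".join(masked)

def hard_mask_lowercase(seq, mask_char="N"):
    """Turn a soft-masked sequence (repeats in lower case) into a hard-masked one."""
    return "".join(mask_char if base.islower() else base for base in seq)

print(mask_intervals("ACGTACGTAC", [(2, 5)]))
print(hard_mask_lowercase("ACgtaCG"))
```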
Chromosomal structure
• Located in the nucleus
• Each chromosome consists of a single molecule of
DNA and its associated proteins
• The DNA and protein complex found in eukaryotic
chromosomes is called chromatin
(1/3 DNA and 2/3 protein)
• Complex interactions between proteins and nucleic
acids in the chromosomes regulate gene and
chromosomal function
Ideogram
•Diagrammatic representation
of a karyotype
•Individual chromosomes are
recognized by
-arm lengths
p, short
q, long
-centromere position
metacentric
sub-metacentric
acrocentric
telocentric
-staining (banding) patterns
From Miller & Therman (2001) Human
Chromosomes, Springer
Chromosome banding



Q (quinacrine) and G (Giemsa) banding
preferentially stain AT-rich regions
R (reverse) banding preferentially stains GC-rich regions
C-banding (denaturation and staining)
preferentially stains constitutive
heterochromatin, found in the centromere
regions and distal Yq
June 26, 2000 at the White House
Initial Analysis of the Human Genome
http://www.sanger.ac.uk/HGP/draft2000/gfx/fig2.gif
Genome Mapping
STS – sequence-tagged sites (short segments of unique
DNA on every chromosome, each defined by a pair of PCR
primers that amplify only one segment of the genome)
BAC – Bacterial artificial chromosome, 100-400kb
YAC – Yeast artificial chromosome, 150kb-1.5Mb
Contig – assembled contiguous overlapping segments of
DNA from BACs and YACs
ESTs – Expressed Sequence Tags
UniGene Database – a database for ESTs
Shotgun Sequencing
Concepts in Biochemistry, 2nd Ed., R. Boyer
• Segments are short ~2kb
• Problem with repeated segments or genes
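The core step of assembling shotgun reads is finding where the end of one read overlaps the start of another. The sketch below shows only that step for two error-free reads; real assemblers must cope with sequencing errors, both strands and many thousands of reads. The function names, the minimum overlap of 3 and the two example reads are assumptions chosen for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that equals a prefix of b (0 if < min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge(a, b, min_len=3):
    """Join two reads on their overlap, or return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None

print(overlap("ATTGCCATG", "CATGTCGA"))
print(merge("ATTGCCATG", "CATGTCGA"))
```

If the overlapping stretch also occurs elsewhere in the genome, as with repeats or duplicated genes, the same test succeeds for reads that are not actually adjacent, which is the problem noted above.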
History of the Human Genome Project
1956 Physical map. 24 types and total set of 46 chromosomes
1977 Sanger publishes dideoxy sequencing method
1980 Botstein proposes human genetic map using RFLPs
1987 US DOE publishes report discussing HGP
1988 HUGO is established
1990 Official start of HGP with $3 billion and a 15-year horizon.
1991 Genome Database (GDB) is established
1992 Genethon publishes map based on microsatellites.
1995 Lander et al. detailed map based on sequence tagged sites.
1998 Comprehensive map based on gene markers.
1999 Sanger Centre publishes chromosome 22
2001 Draft Genome published: Celera & Public
2003 Completion (almost) of Human Genome
Strachan and Read, HMG3 p213
The Human Genome I
[Figure: the human genome at successive magnifications (magnification labels in the figure: *5.000, *20, *103).
Whole genome: chromosomes 1-22, X, Y and mitochondria (.016), each labelled with its size; 3.2*10^9 bp in total.
Region: β-globin (chromosome 11), 6*10^4 bp; myoglobin and α-globin are also labelled.
Gene: Exon 1, Exon 2, Exon 3 with 5’ flanking and 3’ flanking regions, 3*10^3 bp.
DNA: ATTGCCATGTCGATAATTGGACTATTTGGA (30 bp).
Protein: the corresponding chain of amino acids (aa).]
http://www.sanger.ac.uk/HGP/ & R.Harding & HMG (2004) p 245
The Human Genome II
Gene families
Clustered
α-globins (7), growth hormone (5), Class I HLA heavy chain (20), ...
Dispersed
Pyruvate dehydrogenase (2), Aldolase (5), PAX (>12),..
Clustered and Dispersed
HOX (38 – 4), Histones (61 – 2), Olfactory receptors (>900 – 25),…
Strachan and Read (2004) Chapter 9 + Lander et al.(2001),
http://www.sanger.ac.uk/HGP/
Human Genes and Gene Structures I
Presently estimated gene number: 24,000 (reference: )
Average gene size: 27 kb
Largest gene: Dystrophin, 2.4 Mb, 0.6% coding, 16 hours to transcribe.
Shortest gene: tRNA-Tyr, 100% coding
Largest exon: ApoB exon 26, 7.6 kb
Smallest exon: <10 bp
Average exon number: 9
Largest exon number: Titin, 363
Smallest exon number: 1
Largest intron: WWOX intron 8, 800 kb
Smallest intron: tens of bp
Largest polypeptide: Titin, 38,138 amino acids
Smallest polypeptide: tens of amino acids (small hormones)
Intronless genes: mitochondrial genes, many RNA genes, interferons,
histones, ...
Jobling, Hurles & Tyler-Smith (2004) HEG p 29 + HMG chapt. 9
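The dystrophin figures above (2.4 Mb, 0.6% coding, 16 hours to transcribe) imply a transcription rate and a coding length that are easy to work out. The short calculation below uses only those three numbers; the function names are chosen for the example.

```python
def transcription_rate_nt_per_s(gene_length_bp, hours):
    """Average number of nucleotides transcribed per second."""
    return gene_length_bp / (hours * 3600)

def coding_length_bp(gene_length_bp, coding_fraction):
    """Number of base pairs in the gene that are coding."""
    return gene_length_bp * coding_fraction

rate = transcription_rate_nt_per_s(2.4e6, 16)   # dystrophin: 2.4 Mb in 16 hours
coding = coding_length_bp(2.4e6, 0.006)         # 0.6% of the gene is coding
print(f"{rate:.1f} nt/s, about {coding:.0f} coding bp")
```

That is roughly 42 nucleotides per second, and only about 14 kb of the 2.4 Mb gene is coding sequence.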
How do we differ? – Let me count the ways
• Single nucleotide polymorphisms
  • 1 every few hundred bp, mutation rate* ≈ 10^-9
    TGCATTGCGTAGGC
    TGCATTCCGTAGGC
• Short indels (= insertion/deletion)
  • 1 every few kb, mutation rate very variable
    TGCATT---TAGGC
    TGCATTCCGTAGGC
• Microsatellite (STR) repeat number
  • 1 every few kb, mutation rate ≤ 10^-3
    TGCTCATCATCATCAGC
    TGCTCATCA------GC
• Minisatellites
  • 1 every few kb, mutation rate ≤ 10^-1
• Repeated genes
  • rRNA, histones
• Large inversions, deletions
  • Rare, e.g. Y chromosome
(size labels in the figure: ≤100bp, 1-5kb)
*per generation
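The aligned sequence pairs on this slide can be compared column by column to tell a substitution (SNP) from an indel. The sketch below assumes the two sequences are already aligned to the same length, with ‘-’ marking a gap; the function name `compare_aligned` is made up for the example.

```python
def compare_aligned(a, b, gap="-"):
    """Return (substitutions, gap_columns) for two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have the same length")
    substitutions = sum(1 for x, y in zip(a, b) if x != y and gap not in (x, y))
    gap_columns = sum(1 for x, y in zip(a, b) if gap in (x, y))
    return substitutions, gap_columns

pairs = [
    ("TGCATTGCGTAGGC", "TGCATTCCGTAGGC"),        # SNP
    ("TGCATT---TAGGC", "TGCATTCCGTAGGC"),        # short indel
    ("TGCTCATCATCATCAGC", "TGCTCATCA------GC"),  # microsatellite repeat number
]
for a, b in pairs:
    print(compare_aligned(a, b))
```

The microsatellite pair differs by 6 gap columns, i.e. two copies of the 3 bp repeat unit, while the SNP pair differs at a single column.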
Gene Number
Walter Gilbert [1980s] 100k
Antequera & Bird [1993] 70-80k
John Quackenbush et al. (TIGR) [2000] 120k
Ewing & Green [2000] 30k
Tetraodon analysis [2001] 35k
Human Genome Project (public) [2001] ~ 31k
Human Genome Project (Celera) [2001] 24-40k
Mouse Genome Project (public) [2002] 25k -30k
Lee Rowen [2003] 25,947
Gene finding
• Rules
  • ATG
  • TAA, TGA, TAG
  • GT…..AG
• Compositional features
  • Exon lengths
  • Intron lengths
  • Codon bias
• General genomic properties
• Homology
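Of the compositional features, codon bias is the easiest to compute: count how often each codon is used in a known coding sequence and compare candidate ORFs against those frequencies. The sketch below covers only the counting step, for one in-frame sequence; the function name and the short example sequence are invented for illustration.

```python
from collections import Counter

def codon_usage(cds):
    """Relative frequency of each codon in an in-frame coding sequence."""
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - 2, 3)]
    counts = Counter(codons)
    total = len(codons)
    return {codon: n / total for codon, n in counts.items()}

usage = codon_usage("ATGCTGCTGCTATAA")  # ATG CTG CTG CTA TAA
for codon, freq in sorted(usage.items()):
    print(codon, round(freq, 2))
```

In this toy sequence leucine is encoded twice by CTG and once by CTA; a gene finder would use such preferences, estimated from many known genes, to score candidate coding regions.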