BB30055: Genes and genomes

Transcript BB30055: Genes and genomes

BB30055: Genes and genomes
Genomes - Dr. MV Hejmadi (bssmvh)
BB30055: Genomes - MVH
3 broad areas
(A)Genomes, transcriptomes, proteomes
(B) Applications of the human genome
project
(C) Genome evolution
Why sequence the genome?
3 main reasons
• description of sequence of every gene valuable.
Includes regulatory regions which help in
understanding not only the molecular activities of the
cell but also ways in which they are controlled.
• identify & characterise important inheritable disease
genes or bacterial genes (for industrial use)
• Role of intergenic sequences e.g. satellites, intronic
regions etc
History of Human Genome Project (HGP)
1953 – DNA structure (Watson & Crick)
1972 – Recombinant DNA (Paul Berg)
1977 – DNA sequencing (Maxam, Gilbert and Sanger)
1985 – PCR technology (Kary Mullis)
1986 – automated sequencing (Leroy Hood & Lloyd Smith
1988 – IHGSC established (NIH, DOE) Watson leads
1990 – IHGSC scaled up, BLAST published (Lipman+Myers)
1992 – Watson quits, Venter sets up TIGR
1993 – F Collins heads IHGSC, Sanger Centre (Sulston)
1995 – cDNA microarray
1998 – Celera genomics (J Craig Venter)
2001 – Working draft of human genome sequence published
2003 – Finished sequence announced
Human Genome Project (HGP)
Goal: Obtain the entire DNA sequence of human genome
Players:
(A) International Human Genome Sequence Consortium
(IHGSC)
- public funding, free access to all, started earlier
- used mapping overlapping clones method
(B) Celera Genomics
– private funding, pay to view
- started in 1998
- used whole genome shotgun strategy
Whose genome is it anyway?
(A) International Human Genome Sequence Consortium
(IHGSC)
- composite from several different people generated
from 10-20 primary samples taken from numerous
anonymous donors across racial and ethnic groups
(B) Celera Genomics
– 5 different donors (one of whom was J Craig
Venter himself !!!)
Strategies for sequencing the human genome
sequencing larger genomes
Mapping phase
Sequencing phase
Result….
~30 - 40,000 protein-coding genes estimated
based on known genes and predictions
definite genes
possible genes
IHGSC
24,500
5000
Celera
26,383
12,000
Organisation of human genome
Nuclear genome (3.2 Gbp)
24 types of chromosomes
Y- 51Mb and chr1 -279Mbp
Mitochondrial genome
General organisation of human genome
Coding & regulatory - (mRNA)
Non polypeptide–coding: RNA
Gene organisation
Rare bicistronic transcription units
E.g. UBA52 transcription generates ubiquitin and
a ribosomal protein S27a
Pseudogenes ()
non functional copies of exonic
sequences of an active gene.
Thought to arise by genomic
insertion of a cDNA as a
result of retroposition
Contributes to overall
repetitive elements (<1%)
processed pseudogenes -
Pseudogenes in globin gene cluster
Gene fragments or truncated genes
Gene fragments: small
segments of a gene
(e.g. single exon from
a multiexon gene)
Truncated genes:
Short components
of functional genes
(e.g. 5’ or 3’ end)
Thought to arise due to unequal crossover or exchange
Repetitive elements
Main classes based on origin
 Tandem repeats
 Interspersed repeats
 Segmental duplications
1) Tandem repeats
Blocks of tandem repeats at
 subtelomeres
 pericentromeres
 Short arms of acrocentric
chromosomes
 Ribosomal gene clusters
Tandem / clustered repeats
Broadly divided into 4 types based on size
class
Size of
repeat
Repeat
block
Major
chromosomal
location
Satellite
5-171 bp
> 100kb
centromeric
heterochromatin
minisatellite
9-64 bp
0.1–20kb
Telomeres
microsatellites
1-13 bp
< 150 bp
Dispersed
HMG3 by Strachan and Read pp 265-268
Satellites
Large arrays of repeats
Some examples
Satellite 1,2 & 3
a (Alphoid DNA)
- found in all
chromosomes
b satellite
HMG3 by Strachan and Read pp 265-268
Minisatellites
Moderate sized arrays of repeats
Some examples
Hypervariable minisatellite DNA
- core of GGGCAGGAXG
- found in telomeric regions
- used in original DNA
fingerprinting technique by Alec
Jeffreys
HMG3 by Strachan and Read pp 265-268
Microsatellites
VNTRs - Variable Number of Tandem Repeats,
SSR - Simple Sequence Repeats
1-13 bp repeats e.g. (A)n ; (AC)n
2% of genome (dinucleotides - 0.5%)
Used as genetic markers (especially for disease
mapping)
Individual genotype
HMG3 by Strachan and Read pp 265-268
Microsatellite genotyping
design PCR primers unique to one locus in the genome
.a single pair of PCR primers will produce different sized products for
each of the different length microsatellites
2) Interspersed repeats
A.k.a. Transposon-derived repeats
45% of genome
Arise mainly as a result of
transposition either through
a DNA or a RNA intermediate
Interspersed repeats (transposon-derived)
major types
class
size
Copy
%
number genome*
LINE L1 (Kpn family)
L2
~6.4kb
0.5x106
0.3 x 106
16.9
3.2
SINE
Alu
~0.3kb
1.1x106
10.6
LTR
e.g.HERV
~1.3kb
0.3x106
8.3
mariner
~0.25kb
1-2x104
2.8
DNA
transposon
family
* Updated from HGP publications
HMG3 by Strachan & Read pp268-272
LINEs (long interspersed elements)
Most ancient of eukaryotic genomes
 Autonomous transposition (reverse trancriptase)
 ~6-8kb long
 Internal polymerase II promoter and 2 ORFs
 3 related LINE families in humans
– LINE-1, LINE-2, LINE-3.
 Believed to be responsible for retrotransposition of
SINEs and creation of processed pseudogenes
LINEs (long interspersed elements)
Most ancient of eukaryotic genomes
Autonomous transposition (reverse trancriptase)
~6-8kb long
Internal polymerase II promoter and 2 ORFs
3 related LINE families in humans
– LINE-1, LINE-2, LINE-3.
Believed to be responsible for retrotransposition
of SINEs and creation of processed pseudogenes
Nature (2001) pp879-880
HMG3 by Strachan & Read pp268-272
SINEs (short interspersed elements)
Non-autonomous (successful freeloaders! ‘borrow’
RT from other sources such as LINEs)
~100-300bp long
Internal polymerase III promoter
No proteins
Share 3’ ends with LINEs
3 related SINE families in humans
– active Alu, inactive MIR and Ther2/MIR3.
LINES and SINEs have preferred insertion sites
• In this example,
yellow represents
the distribution of
mys (a type of LINE)
over a mouse
genome where
chromosomes are
orange. There are
more mys inserted
in the sex (X)
chromosomes.
Try the link below to do an online experiment
which shows how an Alu insertion
polymorphism has been used as a tool to
reconstruct the human lineage
http://www.geneticorigins.org/geneticorigins/
pv92/intro.html
Long Terminal Repeats (LTR)
Repeats on the same orientation on both sides of element e.g.
ATATATNNNNNNNATATAT
• contain sequences that serve as transcription promoters
• as well as terminators.
• These sequences allow the element to code for an mRNA
molecule that is processed and polyadenylated.
• At least two genes coded within the element to supply
essential
• activities for the retrotransposition mechanism.
• The RNA contains a specific primer binding site (PBS) for
initiating reverse transcription.
• A hallmark of almost all mobile elements is that they form
small direct repeats formed at the site of integration.
Long Terminal Repeats (LTR)
Autonomous or non-autonomous
Autonomous retroposons encode gag, pol
genes which encode the protease, reverse
transcriptase, RNAseH and integrase
Nature (2001) pp879-880
HMG3 by Strachan & Read pp268-272
DNA transposons (lateral transfer?)
DNA transposons
Inverted repeats on both sides of element
e.g. ATGCNNNNNNNNNNNCGTA
Nature (2001) pp879-880
From GenesVII by Levin
3) Segmental duplications
 Closely related sequence blocks at
different genomic loci
 Transfer of 1-200kb blocks of genomic
sequence
 Segmental duplications can occur on
homologous chromosomes
(intrachromosomal) or non homologous
chromosomes (interchromosomal)
 Not always tandemly arranged
 Relatively recent
Segmental duplications
Interchromosomal
segments duplicated
among non-homologous
chromosomes
Intrachromosomal
duplications occur
within a chromosome / arm
Nature Reviews Genetics 2, 791-800 (2001);
Segmental
duplications
in chromosome 22
Segmental
duplications
Segmental duplications - chromosome 7.
Nature Reviews Genetics 2, 791-800 (2001)
Major insights from the HGP
1) Gene size, content and distribution
2) Proteome content
3) SNP identification
4) Distribution of GC content
5) CpG islands
6) Recombination rates
7) Repeat content
Nature (2001) 15th Feb Vol 409 special issue; pgs 814 & 875-914.
1) Gene size
Gene content….
More genes: Twice as many as drosophila / C.elegans
Uneven gene distribution: Gene-rich and gene-poor
regions
More paralogs: some gene families have extended the
number of paralogs e.g. olfactory gene family has
1000 genes
More alternative transcripts: Increased RNA splice
variants produced thereby expanding the primary
proteins by 5 fold (e.g. neurexin genes)
Gene distribution
Genes generally dispersed (~1 gene per 100kb)
Class III complex at HLA 6p21.3
Overlapping genes (transcribed from 2 DNA strands) - Rare
Genes- within genes E.g. NF1 gene
HMG3 Fig 9.8
Uneven gene distribution
Gene-rich
E.g. MHC on chromosome 6 has 60 genes
with a GC content of 54%
Gene-poor regions
82 gene deserts identified
? Large or unidentified genes
What is the functional significance of these
variations?
2) Proteome content
proteome more complex than invertebrates
Protein Domains (sections with identifiable
shape/function)
Domain arrangements in humans
largest total number of domains is 130
largest number of domain types per protein is 9
Mostly identical arrangement of domains
A
A
B
B
B
C
C
C
C
C
Protein X
Proteome more complex than invertebrates……
no huge difference in domain number in humans
BUT, frequency of domain sharing very high in human
proteins (structural proteins and proteins involved in
signal transduction and immune function)
However, only 3 cases where a combination of 3 domain
types shared by human & yeast proteins.
e.g carbomyl-phosphate synthase (involved in the first 3
steps of de novo pyrimidine biosynthesis) has 7 domain
types, which occurs once in human and yeast but twice in
drosophila
3) SNPs (single nucleotide polymorphisms)
Sites that result from point mutations
in individual base pairs
 biallelic
 ~60,000 SNPs lie within exons and
untranslated regions (85% of exons lie
within 5kb of a SNP)
 May or may not affect the ORF
 Most SNPs may be regulatory

More than 1.4million SNPs identified
One every 1.9kb length on average
Densities vary over regions and chromosomes
e.g. HLA region has a high SNP density, reflecting
maintenance of diverse haplotypes over many MYears
Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928
How does one distinguish sequence errors
from polymorphisms?
sequence errors
Each piece of genome sequenced at least 10 times
to reduce error rate (0.01%)
Polymorphisms
Sequence variation between individuals is 0.1%
To be defined as a polymorphism, the altered
sequence must be present in a significant
population
Rate of polymorphisms in diploid human genome is about 1 in
500 bp
Nature (2001) 15th Feb Vol 409 special issue; pgs 821-823 & 928
SNPs and disease
3) SNPs……and risk of disease
N(291)S
3) SNPs……and risk of disease
late-onset Alzheimer's disease (LOAD)
Apolipoprotein e4 haplotype is a genetic risk factor
3 major alleles (APO E2, E3, and E4)
APO E2: Cys112 / Cys158
APO E3: Cys112 / Arg158
APO E4: Arg112 / Arg158
3) SNPs……and pharmacogenomics
4) Distribution of GC content
Genome wide average of 41%
Huge regional variations exist
E.g.distal 48Mb of chromosome 1p-47%
but chromosome 13 has only 36%
Confirms cytogenetic staining with G-bands
(Giemsa)
dark G-bands – low GC content (37%)
light G-bands – high GC content (45%)
Nature (2001) 15th Feb Vol 409 special issue; pg 876-877
5) CpG islands
CpG
Methyl CpG
methylated at C
TpG
Deamination
CpG islands show no methylation
Significance of CpG islands
1) Non-methylated CpG islands associated
with the 5’ ends of genes
2) Aberrant methylation of CpG islands is
one mechanism of inactivating tumor
suppressor genes (TSGs) in neoplasia
http://www.sanger.ac.uk/HGP/cgi.shtml
CpG islands
Greatly under-represented in human
genome
• ~28,890 in number
• Variable density
e.g. Y – 2.9/Mb but
16,17 & 22 have 19-22/Mb
Average is 10.5/Mb
Nature (2001) 15th Feb Vol 409 special issue; pg 877-888
6) Recombination rates
2 main observations
• Recombination rate increases with
decreasing arm length
• Recombination rate suppressed near
the centromeres and increases
towards the distal 20-35Mb
7) Repeat content
a) Age distribution
b) Comparison with other genomes
c) Variation in distribution of repeats
d) Distribution by GC content
e) Y chromosome
Nature (2001) 409: pp 881-891
Repeat content…….
a) Age distribution
 Most interspersed repeats predate eutherian
radiation (confirms the slow rate of clearance
of nonfunctional sequence from vertebrate
genomes)
 LINEs and SINEs have extremely long lives
 2 major peaks of transposon activity
 No DNA transposition in the past 50MYr
 LTR retroposons teetering on the brink of
extinction
a) Age distribution
overall decline in interspersed repeat activity in
hominid lineage in the past 35-40MYr
compared to mouse genome, which shows a
younger and more dynamic genome
b) Comparison with other genomes




Higher density of
transposable elements
in euchromatic portion
of genome
Higher abundance of
ancient transposons
60% of IR made up of
LINE1 and Alu repeats
whereas DNA
transposons represent
only 6%
(a few human genes
appear likely to have
resulted from
horizontal transfer
from bacteria!!)
c) Variation in distribution of repeats
Some regions show either
High repeat density
e.g. chromosome Xp11 – a 525kb region shows
89% repeat density
Low repeat density
e.g. HOX homeobox gene cluster (<2% repeats)
(indicative of regulatory elements which have low
tolerance for insertions)
d) Distribution by GC content
High GC – gene rich ; High AT – gene poor
LINEs abundant in AT-rich regions
SINEs lower in AT-rich regions
Alu repeats in particular retained in actively transcribed GC rich
regions E.g. chromosme 19 has 5% Alus compared to Y chromosome
e) The Y chromosome !
Unusually young genome (high tolerance
to gaining insertions)
Mutation rate is 2.1X higher in male
germline
Possibly due to cell division rates or
different repair mechanisms
• Working draft published – Feb 2001
• Finished sequence – April 2003
• Annotation of genes going on
(refer: International Human Genome Sequencing
Consortium. Finishing the euchromatic
sequence of the human genome. Nature 21
October 2004 (doi: 10.1038/nature03001)
Other genomes sequenced
1997
4,200 genes
1998
19,099 genes
2002
38,000 genes
2002
Mus musculus
36,000 genes
Sept 2003
Canis
18,473
human orthologs
31Aug 2005
Pan troglodytes
28% identical
Human orthologs
Science (26 Sep 2003)Vol301(5641)pp1854-1855
References
Text:
1) HMG 3 by Strachan and Read
Chapter 9 pp 265-268
References
Nature (2001) 409: pp 879-891
Batzer MA, Deininger PL Alu repeats and human
genomic diversity Nature Rev Genet 3 (5): 370379 May 2002

BB30055: Genes and genomes

Transcript BB30055: Genes and genomes

Directory