Analyzing Copy Number Variation in the Human Genome
Jeff Bailey
S5-432
Continuum of Genomic Variation
Forms of genetic variation (scale: single nucleotide to cytogenetics)
•Single base-pair changes: point mutations (1 per 800 bp)
•Small insertions/deletions: frameshift, microsatellite, minisatellite
•Mobile elements: retroelement insertions (300 bp - 10 kb)
•Large-scale genomic copy number variation (>10 kb): large-scale deletions, segmental duplications
•Local rearrangements
•Chromosomal variation: translocation, inversion, fusion
[Figure brackets: Structural Variants (SV); Copy Number Variation]
METHOD 1: Copy Number Variation: Array Comparative Genomic Hybridization
Two genomic surveys of normal individuals identified 76 and 255 CNV regions by array CGH (Sebat et al. Science 2004; Iafrate et al. Nat Genet 2004)
[Figure: array CGH schematic marking regions of gain and loss; labels ">green", ">red", "(blue line)". Modified from Feuk et al. Nat Rev Genet 2006]
30% of CNVs overlap duplicated regions (variant SD = CNV) (Sebat et al. Science 2004)
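As a rough illustration of how array CGH intensities become gain and loss calls, the sketch below computes a log2(test/reference) ratio per probe and labels it. The function names, the 0.3 threshold, and the intensities are assumptions chosen for illustration; they are not taken from the cited studies.

```python
import math

def log2_ratio(test_intensity, ref_intensity):
    """Log2 ratio of test-sample to reference-sample signal for one probe."""
    return math.log2(test_intensity / ref_intensity)

def call_probe(ratio, threshold=0.3):
    """Label a probe; the threshold is a hypothetical cutoff, not a published one."""
    if ratio >= threshold:
        return "gain"
    if ratio <= -threshold:
        return "loss"
    return "neutral"

# Made-up (test, reference) intensities for three probes.
probes = [(1000, 1000), (1500, 1000), (500, 1000)]
calls = [call_probe(log2_ratio(t, r)) for t, r in probes]
print(calls)
```

In practice calls are made on runs of adjacent probes rather than on single probes, which this sketch leaves out.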
Segmental Duplications (SD)
5.4% of the genome (>90% identity and >1 kb)
Properties:
•Clustered
•Complex regions
Example (chr22): 99.1% identical over 180 kb (VCFS/DiGeorge syndrome, 1 in 3000 births)
Bailey and Eichler (2006) Nat Rev Genet
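SDs predispose to copy number variation
Non-allelic Homologous Recombination (Lupski, 1999)
[Figure: two misaligned homologs, each Cen - D - I - D' - Tel, recombine between the duplicated segments D and D']
GAMETES
Duplication: Cen - D - I - D'-D - I - D' - Tel
Deletion: Cen - D-D' - Tel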
Change in Dosage Sensitive Genes → phenotype or disease
Dynamic Regions – predisposed to further rearrangements
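The two gamete types produced by non-allelic homologous recombination can be sketched by treating a chromosome as a list of labeled segments. This is a toy model, assuming a single crossover between D' on one homolog and D on the other; the function name and list representation are illustrative only.

```python
def nahr_products(chromosome, repeat_a="D", repeat_b="D'"):
    """Return (duplication, deletion) products of one NAHR crossover.

    chromosome is a list of segment labels in centromere-to-telomere order,
    with repeat_a upstream of repeat_b.
    """
    i = chromosome.index(repeat_a)
    j = chromosome.index(repeat_b)
    if i >= j:
        raise ValueError("repeat_a must lie upstream of repeat_b")
    # Hybrid repeats formed at the crossover point.
    hybrid_dup = f"{repeat_b}-{repeat_a}"
    hybrid_del = f"{repeat_a}-{repeat_b}"
    # Duplication: everything up to D', then continue from just after D.
    duplication = chromosome[:j] + [hybrid_dup] + chromosome[i + 1:]
    # Deletion: everything before D, then continue from just after D'.
    deletion = chromosome[:i] + [hybrid_del] + chromosome[j + 1:]
    return duplication, deletion

dup, dele = nahr_products(["Cen", "D", "I", "D'", "Tel"])
print(" ".join(dup))
print(" ".join(dele))
```

The interval I appears twice in the duplication product and not at all in the deletion product, which is the change in dosage the slide refers to.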
Complex disease associations
1) Recurrent germline rearrangements causing congenital disease
2) Rare CNVs causing disease in a small proportion of affected individuals in a Mendelian fashion
3) Common CNVs that are responsible for a proportion of complex genetic risk in many individuals
CNV: Disease Association
CCL3L1: Decreased copies cause HIV/AIDS susceptibility (Gonzalez et al. 2005). Increased copies increase risk of rheumatoid arthritis (McKinney et al. 2008)
FCGR3B: Decreased copies increase risk for lupus nephritis (Aitman et al. 2006)
APP: Duplication leading to [text missing in source] (Rovelet-Lecrux et al. 2006)
UGT2B17: Deletion associated with 2-fold increased risk of osteoporosis (Yang et al. 2008)
Synuclein: Triplication causes Parkinson disease (Singleton et al. 2003)
DEFB4: More than 5 copies of beta-defensins associated with 1.7-fold increased risk of psoriasis (Hollox et al. 2008). Fewer than 4 copies associated with 3-fold increased risk for Crohn disease (Fellermann et al. 2006)
LCE3B & LCE3C: Multigene deletion of late cornified envelope genes is associated with psoriasis (de Cid et al. 2009)
Method 2: End-Sequence Pair (ESP) Analysis
~1.1 million fosmid end-sequence pairs derived from a single donor (sequenced by MIT to help close gaps in the reference genome)
Fosmid insert size tightly distributed around mean (40 kb)
•Mapped span < 32 kb: putative insertion within fosmid
•Mapped span > 48 kb: putative deletion within fosmid
Compare fosmid optimal placements to detect deviations from expected.
[Figure: fosmid end pairs placed on the reference genome, classed as concordant, insertion, deletion, or inversion]
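The classification of end-sequence pairs can be sketched as below, using the 32 kb and 48 kb cutoffs from the slide. The function name and the boolean orientation flag are assumptions for illustration; the published pipeline (Tuzun et al. 2005) involves more steps, such as choosing optimal placements and requiring multiple supporting clones.

```python
def classify_fosmid_pair(span_bp, orientation_ok=True,
                         min_span=32_000, max_span=48_000):
    """Classify one fosmid end pair by its mapped span on the reference.

    span_bp: distance between the two end reads on the reference genome.
    orientation_ok: False when the reads do not face each other as expected.
    """
    if not orientation_ok:
        return "inversion"
    if span_bp < min_span:
        # Ends map too close together: the donor insert carries extra sequence.
        return "insertion"
    if span_bp > max_span:
        # Ends map too far apart: the donor lacks sequence present in the reference.
        return "deletion"
    return "concordant"

for span, ok in [(40_000, True), (25_000, True), (60_000, True), (41_000, False)]:
    print(span, ok, classify_fosmid_pair(span, ok))
```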
Dataset: 1,122,408 fosmid pairs preprocessed (15.5X genome coverage)
639,204 fosmid pairs BEST pairs (8.8X genome coverage)
Results:
Tuzun*, Bailey*, Sharp* et al. Nat. Genet 2005
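The coverage figures quoted for the fosmid dataset are consistent with clone (physical) coverage computed as pairs × mean insert size / genome size. The sketch below assumes the 40 kb mean insert from the earlier slide and a genome size of about 2.9 Gb; the genome size is an assumption, not a number given in the slides.

```python
def clone_coverage(n_pairs, mean_insert_bp=40_000, genome_bp=2.9e9):
    """Physical coverage: total bases spanned by inserts / genome size.

    genome_bp is an assumed value (about 2.9 Gb), not taken from the source.
    """
    return n_pairs * mean_insert_bp / genome_bp

all_pairs = clone_coverage(1_122_408)
best_pairs = clone_coverage(639_204)
print(round(all_pairs, 1), round(best_pairs, 1))
```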
Fosmid SV Project
Fosmid End Sequencing 8 HapMap Individuals
1695 structural variants
525 novel insertion sequences
(Kidd et al. 2008 Nature 453:56)
Mechanisms:
•NAHR: non-allelic homologous recombination
•NHEJ: non-homologous end joining (repair of double-strand breaks)
•VNTR: strand slippage
•Retrotransposition: insertion of L1, SVA or Alu element
Method 3: Whole Genome Sequencing
Genome Resequencing Studies
SNPs: 3.2 M bases
Non-SNP: 9.1 M bases (22% of events, 74% of variant bases)
(Levy et al. PLoS Biol 2007:e266)
Read Depth, Mismapping Pairs
Future: Perfect Whole Genome Assembly
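The slides name read depth as a signal without describing a procedure. One generic way to turn depth into copy number is shown below as a minimal sketch: scale each window's depth by the median depth and multiply by the expected ploidy. The window depths, function name, and the use of the median as a diploid baseline are all assumptions for illustration, and real callers also correct for GC content and mappability.

```python
from statistics import median

def copy_number_from_depth(window_depths, ploidy=2):
    """Estimate integer copy number per window from read depth.

    Assumes most windows are at normal ploidy, so the median depth
    is a reasonable baseline.
    """
    baseline = median(window_depths)
    return [round(ploidy * depth / baseline) for depth in window_depths]

# Made-up depths: two windows at about half depth, one at about 1.5x depth.
depths = [30, 31, 29, 15, 14, 30, 46, 30]
print(copy_number_from_depth(depths))
```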
Summary of Human Genome Copy Number Variation (12/2006)
Summary of recent analyses of structural variation in the human genome (12/06).

Reference         Analysis               # Individuals  # Events  Av. bp   Median (bp)  Total Mbp
Mills, 2006       Align trace data       36             415434    20       2            8.36
Hinds, 2006       Oligo arrayCGH         [missing]      1000      1379     947          0.14
McCarroll, 2006   HapMap SNP genotyping  269            538       16874    6887         9.08
Conrad, 2006      HapMap SNP genotyping  180*           609       34996    17217        21.31
Tuzun, 2005       Paired End-sequence    1              269       55706    25230        14.98
Redon, 2006       Affymetrix 500K data   269            980       165996   63140        162.68
Iafrate, 2004     BAC Array-CGH          55**           246       146189   150395       35.96
Sharp, 2006       BAC Array-CGH          47             124       170019   164704       21.08
Wong, 2006        BAC Array-CGH          105            1365***   185504   175314       253.21
Sebat, 2004       ROMA-CGH               20             72        350670   199800       25.25
Redon, 2006       BAC Array-CGH          269            913       349880   227889       319.44
All Vars          NA                     NA             323573    1901     2            615.10
All Vars > 1 kb   NA                     NA             4131      148578   93356        613.77

* - effectively independent individuals equal to number of trios
** - 39 healthy controls, 16 with karyotype abnormalities
*** - accounting for only those sites that showed in 2 or more individuals
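The Total Mbp column in the summary table appears to equal the number of events times the average event size. The sketch below checks that relationship for a few rows copied from the table; the function name and the 0.05 Mbp tolerance (to absorb rounding of the averages) are assumptions.

```python
def total_mbp(events, avg_bp):
    """Total variant sequence in Mbp implied by event count and mean size."""
    return events * avg_bp / 1e6

# (reference, events, average bp, reported total Mbp) as listed in the table.
rows = [
    ("McCarroll, 2006", 538, 16874, 9.08),
    ("Conrad, 2006", 609, 34996, 21.31),
    ("Tuzun, 2005", 269, 55706, 14.98),
    ("Redon, 2006 (500K)", 980, 165996, 162.68),
    ("Sebat, 2004", 72, 350670, 25.25),
]

for name, events, avg_bp, reported in rows:
    computed = total_mbp(events, avg_bp)
    consistent = abs(computed - reported) < 0.05
    print(f"{name}: computed {computed:.2f} Mbp, reported {reported}, consistent={consistent}")
```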
20% of the human genome is CNV?
3000+ genes with exons in these CNV regions?
(Currently 30% of genome and 9473 genes)
How many genes are truly CNV?
Lack of Breakpoint Precision?
BACs: 150-250 kb clones, of which only a part of the sequence may be CNV
False positives?
Multiple studies: increase the proportion of false positives, since true positives tend to overlap
[Figure: calls from Study #1, #2 and #3 around a CNV gene, with true positives (TP) and false positives (FP) marked]
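The effect of merging studies can be made concrete with a small worked example. It assumes an idealized case in which every study recovers the same true positives while each contributes its own distinct false positives; the counts (80 true, 20 false per study) and the function name are made up for illustration.

```python
def false_positive_fraction(n_studies, true_per_study=80, false_per_study=20):
    """Fraction of false positives in the union of calls from n studies.

    Idealized assumption: true positives overlap completely across studies,
    false positives never overlap.
    """
    true_union = true_per_study               # same sites found again and again
    false_union = false_per_study * n_studies  # new errors added by each study
    return false_union / (true_union + false_union)

for n in (1, 2, 3):
    print(n, round(false_positive_fraction(n), 2))
```

Under these assumptions the false-positive share of the merged set rises from 0.2 with one study to about 0.43 with three, even though no single study became less accurate.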
Design of Custom oligonucleotide aCGH
•Equal number of probes per exon (exon size 3 bp to 10 kb).
•Limitation: NimbleGen algorithm creates equally spaced probes across a region.
1) Select genomic regions to target for probe design
2) Merge overlapping regions
3) Select oligonucleotide probe sequences (average 12/exon) and place on microarray
Bailey et al. Cytogenet Genome Res 2008
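The "merge overlapping regions" step of the design can be sketched as a standard interval merge. The coordinates and function name are hypothetical, and the sketch treats regions as (start, end) pairs on a single chromosome.

```python
def merge_regions(regions):
    """Merge overlapping (start, end) regions on one chromosome."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1]:
            # Overlaps the previous region: extend it if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]

# Hypothetical target regions, two of which overlap.
targets = [(400, 500), (100, 200), (150, 300)]
print(merge_regions(targets))
```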
Detection Method
[Figure: exon structure (Exons 1-5) with probe regions; hybridization shown as log2 probe intensity]
Mean intensity difference per exon: Exon 1 -0.2 SD, Exon 2 +1.1 SD, Exon 3 +1.4 SD, Exon 4 +0.6 SD, Exon 5 +1.2 SD
Step #1: Seed
Step #2: Extension
Result: 4-exon partial-gene CNV
Bailey et al. Cytogenet Genome Res 2008
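The seed and extension steps can be sketched as follows, using the per-exon values from the slide. The two thresholds (1.0 SD to seed, 0.5 SD to extend) are assumptions chosen so that the example reproduces the 4-exon call; the slides do not state the cutoffs used, and this sketch handles gains only.

```python
def seed_and_extend(exon_scores, seed_threshold=1.0, extend_threshold=0.5):
    """Call gained exon runs from per-exon mean intensity differences (in SD).

    Both thresholds are hypothetical. Returns a list of
    (first_exon, last_exon) pairs, numbered from 1.
    """
    calls = []
    used = set()
    n = len(exon_scores)
    for i, score in enumerate(exon_scores):
        if i in used or score < seed_threshold:
            continue
        left = right = i
        # Step 2: extend the seed across neighbors above the lower threshold.
        while left > 0 and exon_scores[left - 1] >= extend_threshold:
            left -= 1
        while right < n - 1 and exon_scores[right + 1] >= extend_threshold:
            right += 1
        used.update(range(left, right + 1))
        calls.append((left + 1, right + 1))
    return calls

scores = [-0.2, 1.1, 1.4, 0.6, 1.2]  # Exons 1-5 from the slide
print(seed_and_extend(scores))
```

With these assumed cutoffs, exon 4 (+0.6 SD) would not seed a call on its own but is absorbed during extension, giving a single call spanning exons 2 to 5.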
CNV in RHD
[Figure: Chr 1, 25,350-25,390 kb. Tracks: gene model, exons, probe regions, segmental duplications. Samples: GM12878, GM18517, GM18507, GM18956, GM19129, GM12156, GM18502, GM19240, GM18555]
Bailey et al. Cytogenet Genome Res 2008
Detecting >500 bp and >5% freq
8,599 CNV regions: 3.7% of genome (112.7 Mb)
2 genomes: 1,098 CNVs 0.78% (24 Mb)
Conrad, et al. 2009 Nature
Causal CNVs
Conrad, et al. 2009 Nature
Infectious Disease Genetics
[Figure: Human Genome, Pathogen Genome, Vector Genome, Environment]
Complex interplay that results in infectious disease phenotype
Potential host defense responses and pathogen virulence are encoded in their respective genomes.
SD and CNV represent key mechanisms for adaptation and diversification of responses for both host and pathogen.
The study of SD and CNV is necessary to fully understand the genetics and biology of infectious disease pathogenesis.
Human CNV typing and association studies
Comprehensive CNV Typing Chip (1st generation)
Collaboration with the Eichler Lab
Preferentially targeting gene CNVs
(5,000 CNVs → 1000 genic regions → 30% host defense)
Agilent and NimbleGen oligoarray platforms
Defining copy number responsive probes
Defining copy specific probes to remove cross-hybridization
Case-control studies to examine infectious disease and immune phenotypes for association with CNVs
Human Malaria
Malaria: 2-3 million deaths per year
“strongest known force for evolutionary selection in the recent history of the human genome” (Kwiatkowski 2005 Am J Hum Genet)
HbS, HbC, HbE, thalassemia, ABO, Duffy null, SE Asian ovalocytosis, IL-4, CR1, HLA-DRB ...
Hypothesis: Strong selection will have impacted CNVs
Testing case-control samples for CNV associations with resistance to infection and cerebral malaria.