Lecture 10 Analyzing the DNA by array and deep sequencing (1)
Analyzing DNA using Microarray and
Next Generation Sequencing (1)
Background
SNP Array
Basic design
Applications: CNV, LOH, GWAS
Deep sequencing
Alignment and Assembly
Applications: structural changes, GWAS
The chromosome
SNP
Variations in DNA sequence.
Single Nucleotide Polymorphism
(SNP) --- a single letter change in the
DNA.
Common SNPs occur every few
hundred bases. Each form is called an
“allele”. Almost all SNPs have only two
alleles. Allele frequencies are often
different between ethnic groups.
http://upload.wikimedia.org/wikipedia/commons/thumb/2/2e/Dna-SNP.svg/180px-DnaSNP.svg.png
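A small sketch of what "allele frequency" means in practice: count the two alleles carried by each diploid individual and divide by the total. The genotypes and the two groups below are made-up illustration data, not values from the lecture.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotype strings such as 'AG'."""
    counts = Counter()
    for genotype in genotypes:
        counts.update(genotype)  # each genotype contributes two alleles
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

# Hypothetical genotypes at one SNP in two hypothetical groups
group1 = ["AA", "AG", "AG", "GG", "AA"]
group2 = ["GG", "GG", "AG", "GG", "GG"]

print(allele_frequencies(group1))
print(allele_frequencies(group2))
```

The same SNP can have quite different frequencies in the two groups, which is the point made on the slide.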
Correlations between SNPs
Why measure the SNP alleles?
DNA changes in two ways
during evolution:
Point mutation: gives rise to SNPs.
Recombination: this happens in large
segments, so alleles of
adjacent SNPs are highly
dependent.
Haplotype: A group of alleles
linked closely enough to be
inherited mostly as a unit.
http://www.evolutionpages.com/images/crossing_over.gif
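The dependence between alleles of adjacent SNPs is commonly summarized by the linkage disequilibrium statistic r². The sketch below computes it from phased two-SNP haplotypes. The function name, the input layout and the toy haplotypes are assumptions made for illustration.

```python
def ld_r2(haplotypes, a, b):
    """r^2 between two biallelic SNPs.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples, one per chromosome.
    a, b: the allele tracked at SNP 1 and at SNP 2.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == a for h in haplotypes) / n
    p_b = sum(h[1] == b for h in haplotypes) / n
    p_ab = sum(h[0] == a and h[1] == b for h in haplotypes) / n
    d = p_ab - p_a * p_b                      # deviation from independence
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    return d * d / denom if denom > 0 else float("nan")

# Toy data: two SNPs always inherited together vs. two independent SNPs
linked = [("A", "C")] * 5 + [("G", "T")] * 5
unlinked = [("A", "C"), ("A", "T"), ("G", "C"), ("G", "T")]

print(ld_r2(linked, "A", "C"))    # 1.0
print(ld_r2(unlinked, "A", "C"))  # 0.0
```

An r² near 1 means that typing one SNP is almost as informative as typing both, which is why a marker SNP can stand in for a nearby causal variant.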
Why SNP?
http://www.hapmap.org/originhaplotype.html.en
Figure 1: This diagram shows two
ancestral chromosomes being
scrambled through recombination over
many generations to yield different
descendant chromosomes. If a genetic
variant marked by the A on the
ancestral chromosome increases the
risk of a particular disease, the two
individuals in the current generation
who inherit that part of the ancestral
chromosome will be at increased risk.
Adjacent to the variant marked by
the A are many SNPs that can be
used to identify the location of the
variant.
Why SNP?
Nature Genetics 26, 151 - 157 (2000)
SNPs
Figure 1. Schematic model of trait
aetiology. The phenotype under study,
Ph, is influenced by diverse genetic,
environmental and cultural factors (with
interactions indicated in simplified form).
Genetic factors may include many loci of
small or large effect, GPi, and polygenic
background. Marker genotypes, Gx, are
near to (and hopefully correlated with)
genetic factor, Gp, that affects the
phenotype. Genetic epidemiology tries to
correlate Gx with Ph to localize Gp. Above
the diagram, the horizontal lines represent
different copies of a chromosome; vertical
hash marks show marker loci in and around
the gene, Gp, affecting the trait. The red Pi
are the chromosomal locations of
aetiologically relevant variants, relative to
Ph.
The gene deciding phenotype
SNP array
The SNP array
Affymetrix.com
SNP array
The SNP array
40 probes per SNP (20 for forward
strand and 20 for reverse strand.)
PM/MM strategy.
Data summary (generating
AA/AB/BB calls) omitted here.
Affymetrix.com
SNP array
Association analysis
Genotype calls
Linkage analysis
SNP array
Loss of Heterozygosity
Signal strength
Copy number aberration
CNA --- Background
Copy Number Aberration (CNA):
A form of chromosomal aberration
Deviation from the regular 2 copies for some
segments of the chromosomes
One of the key characteristics of cancer
CNA in cancer:
Reduce the copy number of tumor-suppressor genes
Increase the copy number of oncogenes
Possibly related to metastasis
CNA --- the statistician’s task
High density arrays allow us to identify “focused CNA”:
copy number change in small DNA segments.
With the high per-probeset noise, how to achieve high
sensitivity AND specificity?
CNA --- maximizing sensitivity/specificity
Two approaches that complement each other:
Reducing noise at the single probeset level:
Based on dose-response (Huang et al., 2006)
Based on sequence properties (Nannya et al., 2005)
Segmentation methods.
Smoothing; Hidden Markov Model-based methods;
Circular Binary Segmentation … …
HMM data segmentation
Fridlyand et al. Journal of Multivariate Analysis, June 2004, V. 90, pp. 132-153
Amplified
Normal
Deleted
Forward-backward
fragment assembling
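To make the three-state picture (Deleted / Normal / Amplified) concrete, here is a minimal HMM segmentation sketch. It uses Viterbi decoding rather than the forward-backward procedure named on the slide, and it is not the method of Fridlyand et al. The state means, noise level and transition probability are all assumed values chosen for illustration.

```python
import math
from itertools import groupby

STATES = ["Deleted", "Normal", "Amplified"]
# Assumed log2 ratio means: 1 copy, 2 copies, 3 copies relative to 2
MEANS = {"Deleted": -1.0, "Normal": 0.0, "Amplified": 0.58}
SIGMA = 0.2   # assumed per-probeset noise (standard deviation)
STAY = 0.98   # assumed probability of staying in the same state

def log_emission(x, state):
    """Log density of a Gaussian centred on the state's mean."""
    z = (x - MEANS[state]) / SIGMA
    return -0.5 * z * z - math.log(SIGMA * math.sqrt(2 * math.pi))

def log_transition(prev, cur):
    switch = (1 - STAY) / (len(STATES) - 1)
    return math.log(STAY if prev == cur else switch)

def viterbi(log2_ratios):
    """Most likely state path for log2 ratios ordered along the chromosome."""
    if not log2_ratios:
        return []
    score = {s: math.log(1 / len(STATES)) + log_emission(log2_ratios[0], s)
             for s in STATES}
    back = []
    for x in log2_ratios[1:]:
        new_score, pointers = {}, {}
        for cur in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_transition(p, cur))
            pointers[cur] = best_prev
            new_score[cur] = (score[best_prev] + log_transition(best_prev, cur)
                              + log_emission(x, cur))
        back.append(pointers)
        score = new_score
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(back):   # trace the best path backwards
        state = pointers[state]
        path.append(state)
    return path[::-1]

def segments(path):
    """Collapse a state path into (state, first_index, last_index) segments."""
    out, start = [], 0
    for state, group in groupby(path):
        length = len(list(group))
        out.append((state, start, start + length - 1))
        start += length
    return out

# Hypothetical noisy log2 ratios with a gained stretch in the middle
data = [0.1, -0.1, 0.05, 0.0, 0.6, 0.5, 0.7, 0.55, 0.65, 0.0, -0.05, 0.1]
print(segments(viterbi(data)))
```

The high STAY value is what suppresses single-probeset noise: a state change is accepted only when several consecutive probesets support it.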
Some examples:
Top: model cell line, 3 copy segment in chromosome 9
Bottom: Cancer sample
LOH
Loss of Heterozygosity
(LOH)
Happens in segments
of DNA.
Keith W. Brown and Karim T.A. Malik, 2001,
Expert Reviews in Molecular Medicine
LOH
On SNP array, LOH will yield identical calls (AA or BB,
rather than AB) for a number of consecutive SNPs.
Discov Med. 2011 Jul;12(62):25-32.
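A simple way to look for the pattern described above is to scan the genotype calls in chromosomal order and report long runs without any AB call. The sketch below does only that. The minimum run length is an assumed parameter, and a reported run is only a candidate, because stretches of homozygous calls also occur without LOH.

```python
def homozygous_runs(calls, min_length=5):
    """Return (start, end) index pairs of runs of at least min_length
    consecutive homozygous calls (AA or BB) in position-ordered calls."""
    runs, start = [], None
    for i, call in enumerate(calls):
        if call in ("AA", "BB"):
            if start is None:
                start = i            # a new homozygous run begins here
        else:
            if start is not None and i - start >= min_length:
                runs.append((start, i - 1))
            start = None             # an AB call ends the run
    if start is not None and len(calls) - start >= min_length:
        runs.append((start, len(calls) - 1))
    return runs

# Hypothetical calls for consecutive SNPs on one chromosome
calls = ["AB", "AA", "AB", "BB", "AA", "AA", "BB", "AA", "BB", "AB", "AA"]
print(homozygous_runs(calls))  # [(3, 8)]
```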
GWAS
http://www.mpg.de/10680/Modern_psychiatry
© Pasieka, Science Photo Library
GWAS
Genome-wide association study identifies
variants in the ABO locus associated with
susceptibility to pancreatic cancer
Nature Genetics 41, 986 - 990 (2009)
DNA sequencing
Background
Alignment and Assembly
When a reference genome is available --- Alignment
Can rely on an existing reference genome as a blueprint.
Align the short reads onto the reference genome.
A few-fold coverage is needed to cover most regions.
Sequence a whole new genome? --- Assembly
Overlaps are required to construct the genome.
Because the reads are short, ~30-fold coverage is needed.
At 3 Gb of data per run, a new genome of roughly human size (~3 Gb)
needs about 30 runs.
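The run count follows directly from genome size × fold coverage ÷ data per run. A short check of the arithmetic, assuming a human-sized genome of about 3 Gb (the function name is illustrative):

```python
def runs_needed(genome_size_bp, fold_coverage, bases_per_run):
    """Number of sequencing runs needed to reach the target average coverage."""
    total_bases = genome_size_bp * fold_coverage
    return -(-total_bases // bases_per_run)  # ceiling division on integers

# ~3 Gb genome, 30-fold coverage, 3 Gb of data per run
print(runs_needed(3_000_000_000, 30, 3_000_000_000))  # 30
```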
Alignment and Assembly
Hash table-based alignment. Similar to BLAST in principle.
(1) Find potential locations:
(2) Local alignment.
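The two steps can be sketched as follows. Step (1) looks up the read's k-mers in a hash table of reference k-mers to get candidate positions. For step (2) this sketch substitutes a plain ungapped mismatch count for a real local alignment, so it cannot handle insertions or deletions. The function names, k, the mismatch limit and the toy sequences are assumptions for illustration.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash table from every k-mer in the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def align_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) of the best ungapped placement, or None."""
    # (1) Find potential locations from exact k-mer hits
    candidates = set()
    for offset in range(len(read) - k + 1):
        for hit in index.get(read[offset:offset + k], ()):
            start = hit - offset
            if 0 <= start <= len(reference) - len(read):
                candidates.add(start)
    # (2) Verify each candidate (here: count mismatches, no gaps)
    best = None
    for start in sorted(candidates):
        window = reference[start:start + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
            best = (start, mismatches)
    return best

reference = "ACGTTGCATGCCGATAGCTTAGGCTA"   # toy reference
index = build_index(reference, k=4)
read = "TGCCGTTAGC"                        # reference[8:18] with one base changed
print(align_read(read, reference, index, k=4))  # (8, 1)
```

The index is built once per reference and reused for every read, which is what makes the lookup step cheap.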
Alignment and Assembly
From read to graph:
Alignment and Assembly
de Bruijn graph assembly
Red:
read error.
Alignment and Assembly
de Bruijn graph assembly
Alignment and Assembly
de Bruijn graph assembly
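A minimal sketch of the read-to-graph step: every k-mer in a read becomes an edge between its (k-1)-mer prefix and its (k-1)-mer suffix, and a contig is spelled out by following edges while the path is unambiguous. The function names, k and the toy reads are assumptions. Real assemblers also use k-mer counts, reverse complements and error correction, none of which is shown here.

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """Map each (k-1)-mer to the (k-1)-mers that follow it in some read k-mer."""
    graph = defaultdict(list)
    seen = set()
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            if kmer not in seen:          # add each distinct k-mer as one edge
                seen.add(kmer)
                graph[kmer[:-1]].append(kmer[1:])
    return graph

def walk(graph, start):
    """Follow edges from start while each node has exactly one successor."""
    contig, node = start, start
    visited = {start}
    while len(graph.get(node, [])) == 1:
        node = graph[node][0]
        if node in visited:               # stop on a cycle
            break
        visited.add(node)
        contig += node[-1]
    return contig

# Toy overlapping reads from the sequence ATGGCGTGCA
reads = ["ATGGCG", "GGCGTG", "CGTGCA"]
print(walk(de_bruijn(reads, k=4), "ATG"))                # ATGGCGTGCA

# A read with a single wrong base adds a branch, so the walk stops early
reads_with_error = reads + ["GGCTTG"]
print(walk(de_bruijn(reads_with_error, k=4), "ATG"))     # ATGGC
```

The second call shows why read errors matter: the erroneous read creates an extra branch at node GGC, and the assembler has to decide which branch to trust.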
Whole genome/exome/transcriptome sequencing
Genomics
Whole genome sequencing detects all variants (SNP alleles,
rare variants, mutations)
Could be associated with disease:
Rare variants (burden testing by collapsing by gene)
De novo mutations (need family tree)
Rare Mendelian disorders
Structural variants in cancer
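Burden testing by collapsing can be sketched as: for each gene, mark every individual who carries at least one rare variant in that gene, then compare the number of carriers between cases and controls. The sketch below uses a one-sided Fisher exact test for the comparison, which is one possible choice and not necessarily the one intended in the lecture. All names and data are hypothetical.

```python
from math import comb

def carriers_by_gene(rare_calls, variant_to_gene):
    """Collapse variants by gene.

    rare_calls: {sample: set of rare variant IDs carried by that sample}
    variant_to_gene: {variant ID: gene name}
    Returns {gene: set of samples carrying at least one rare variant in it}.
    """
    carriers = {}
    for sample, variants in rare_calls.items():
        for variant in variants:
            carriers.setdefault(variant_to_gene[variant], set()).add(sample)
    return carriers

def fisher_one_sided(case_carriers, n_cases, control_carriers, n_controls):
    """P(at least this many carriers among cases) under the hypergeometric null."""
    total_carriers = case_carriers + control_carriers
    denom = comb(n_cases + n_controls, total_carriers)
    upper = min(total_carriers, n_cases)
    numer = sum(comb(n_cases, x) * comb(n_controls, total_carriers - x)
                for x in range(case_carriers, upper + 1))
    return numer / denom

# Hypothetical toy data: 6 cases, 6 controls, 4 rare variants in 2 genes
variant_to_gene = {"v1": "GENE1", "v2": "GENE1", "v3": "GENE1", "v4": "GENE2"}
cases = {"c1": {"v1"}, "c2": {"v2"}, "c3": {"v3"}, "c4": {"v1", "v4"},
         "c5": set(), "c6": set()}
controls = {"k1": set(), "k2": {"v4"}, "k3": set(), "k4": set(),
            "k5": set(), "k6": set()}

case_hits = carriers_by_gene(cases, variant_to_gene)
control_hits = carriers_by_gene(controls, variant_to_gene)
for gene in sorted(set(variant_to_gene.values())):
    a = len(case_hits.get(gene, ()))
    c = len(control_hits.get(gene, ()))
    print(gene, a, c, fisher_one_sided(a, len(cases), c, len(controls)))
```

No single variant in GENE1 is seen in more than two cases, but collapsed together they mark four of six cases and no controls, which is the signal a per-variant test would miss.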
Structural changes
Identification of translocations from discordant paired-end reads.
Cancer Genetics 206 (2014) 432-440
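A rough sketch of how discordant pairs point to translocations: a pair whose two mates map to different chromosomes is a candidate, and a breakpoint is only believed when several independent pairs support the same chromosome pair. The record layout, the insert-size limits and the support threshold are assumptions. Read orientation and mapping quality, which real callers also use, are left out.

```python
from collections import Counter
from typing import NamedTuple

class ReadPair(NamedTuple):
    chrom1: str
    pos1: int
    chrom2: str
    pos2: int

def classify_pair(pair, min_insert=200, max_insert=600):
    """Label a mapped read pair by how its mates sit relative to each other."""
    if pair.chrom1 != pair.chrom2:
        return "interchromosomal"     # candidate translocation
    span = abs(pair.pos2 - pair.pos1)
    if span > max_insert:
        return "too_far"              # candidate deletion or other rearrangement
    if span < min_insert:
        return "too_close"            # candidate insertion
    return "concordant"

def translocation_candidates(pairs, min_support=3):
    """Chromosome pairs supported by at least min_support interchromosomal pairs."""
    support = Counter(
        tuple(sorted((p.chrom1, p.chrom2)))
        for p in pairs
        if classify_pair(p) == "interchromosomal"
    )
    return {chroms: n for chroms, n in support.items() if n >= min_support}

# Hypothetical mapped pairs
pairs = [ReadPair("chr9", 1000, "chr22", 5000)] * 3 + [
    ReadPair("chr1", 100, "chr1", 450),
    ReadPair("chr1", 100, "chr1", 5000),
    ReadPair("chr2", 10, "chr5", 20),
]
print(translocation_candidates(pairs))
```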
Structural changes
CNV by depth of coverage
Cancer Genetics 206 (2014) 432-440
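Depth-of-coverage CNV detection rests on the idea that the number of reads falling in a window is roughly proportional to the copy number there. A minimal sketch: count read starts per window, scale the test sample to the same total as a reference sample, and take the log2 ratio per window. The window size, the pseudocount and the use of a separate reference sample are assumptions. Corrections for GC content and mappability are omitted.

```python
import math

def window_counts(read_starts, region_length, window):
    """Count reads per fixed-size window (0-based read start positions)."""
    n_windows = -(-region_length // window)   # ceiling division
    counts = [0] * n_windows
    for pos in read_starts:
        counts[pos // window] += 1
    return counts

def log2_depth_ratios(test_counts, ref_counts, pseudocount=0.5):
    """Per-window log2(test / reference) after scaling to equal total depth."""
    scale = sum(ref_counts) / sum(test_counts)
    return [math.log2((t * scale + pseudocount) / (r + pseudocount))
            for t, r in zip(test_counts, ref_counts)]

print(window_counts([0, 5, 10, 99, 100, 250], region_length=300, window=100))  # [4, 1, 1]

# Hypothetical counts: the second window has doubled depth in the test sample
ratios = log2_depth_ratios([100, 200, 100, 100], [100, 100, 100, 100])
print([round(r, 2) for r in ratios])
```

The per-window ratios are noisy in real data, so they are usually passed to a segmentation step before calling gains and losses.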
Structural changes
Cancer Genetics 206 (2014) 432-440
Genotype calling
http://www.geneious.com/features/sequence-analysis-annotation-prediction
Medical Genomics
Example: Extreme-case sequencing to find rare variants
associated with a disease.
Nature Reviews Genetics 11, 415
GWAS