Single nucleotide polymorphisms and applications
Usman Roshan
BNFO 601
SNPs
• DNA sequence variations that occur when a
single nucleotide is altered.
• Must be present in at least 1% of the
population to be a SNP.
• Occur every 100 to 300 bases along the 3
billion-base human genome.
• Many have no effect on cell function but some
could affect disease risk and drug response.
Toy example
SNPs on the chromosome
[Figure: diagram with the labels SNP, Chromosome, Gene]
Bi-allelic SNPs
• Most SNPs have one of two nucleotides
at a given position
• For example:
– A/G denotes the varying nucleotide as
either A or G. We call each of these an
allele
– Most SNPs have two alleles (bi-allelic)
SNP genotype
• We inherit two copies of each chromosome
(one from each parent)
• For a given SNP the genotype defines the
type of alleles we carry
• Example: for the SNP A/G one’s genotype may be
– AA if both copies of the chromosome have A
– GG if both copies of the chromosome have G
– AG or GA if one copy has A and the other has G
– The first two cases are called homozygous and the latter two are heterozygous
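The genotype cases above can be enumerated mechanically. The following is a minimal sketch in Python; the function names `genotypes` and `zygosity` are hypothetical and chosen only for illustration.

```python
from itertools import product

def genotypes(alleles):
    """Return every ordered pair of alleles, one allele per chromosome copy."""
    return ["".join(pair) for pair in product(alleles, repeat=2)]

def zygosity(genotype):
    """Label a two-letter genotype as homozygous or heterozygous."""
    return "homozygous" if genotype[0] == genotype[1] else "heterozygous"

snp_alleles = "AG"  # the A/G SNP used in the example above
table = {g: zygosity(g) for g in genotypes(snp_alleles)}
for genotype, label in table.items():
    print(genotype, label)
```

AG and GA are listed separately only because the two chromosome copies are enumerated in order; both describe the same heterozygous genotype.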
SNP genotyping
Real SNPs
• SNP consortium: snp.cshl.org
• SNPedia: www.snpedia.com
Application of SNPs:
association with disease
• Experimental design to detect cancer-associated SNPs:
– Pick random humans with and without cancer (say breast cancer)
– Perform SNP genotyping
– Look for associated SNPs
– Also called a genome-wide association study (GWAS)
Case-control example
• Study of 100 people:
– Case: 50 subjects with
cancer
– Control: 50 subjects without
cancer
• Count number of alleles and form a contingency table

           #Allele1   #Allele2
Case          10         90
Control        2         98
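The slides do not say which test is applied to the contingency table. One common choice, assumed here, is Pearson's chi-square statistic for a 2x2 table. The sketch below uses the counts from the table above (each subject contributes two alleles, so 50 subjects give 100 alleles per row); the helper name `chi_square_2x2` is hypothetical.

```python
def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table given as [[a, b], [c, d]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if allele and disease status were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: case, control. Columns: #Allele1, #Allele2 (counts from the slide).
counts = [[10, 90], [2, 98]]
statistic = chi_square_2x2(counts)
print(round(statistic, 3))
```

For this table the statistic is about 5.67, which exceeds 3.841, the 5% critical value of the chi-square distribution with one degree of freedom. Under this assumed test the allele would be flagged as associated with case status.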
Effect of population structure
on genome-wide association
studies
• Suppose our sample is drawn from a population of two groups, I and II
• Assume that most members of group I carry the first allele and most members of group II carry the second allele
• Further assume that most case subjects belong to group I and most control subjects belong to group II
• This leads to a false association between the first allele (the majority allele of group I) and the disease
Effect of population structure
on genome-wide association
studies
• We can correct this effect if case and
control are equally sampled from all
sub-populations
• To do this we need to know the
population structure
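The effect described on the two slides above can be illustrated with a small deterministic calculation. All numbers below are hypothetical: within each group the first allele is equally common among cases and controls, so there is no true association, and only the sampling design changes.

```python
# Hypothetical rates: copies of the first allele per 100 chromosomes in each group.
# The rate is identical for cases and controls, so the allele has no real effect.
ALLELE1_PER_100 = {"I": 90, "II": 10}

def allele1_count(chromosomes_per_group):
    """Expected copies of the first allele, given chromosomes sampled per group."""
    return sum(n * ALLELE1_PER_100[g] // 100 for g, n in chromosomes_per_group.items())

# Unbalanced design: cases mostly from group I, controls mostly from group II.
unbalanced = {"case": {"I": 80, "II": 20}, "control": {"I": 20, "II": 80}}
# Balanced design: cases and controls sampled equally from both groups.
balanced = {"case": {"I": 50, "II": 50}, "control": {"I": 50, "II": 50}}

results = {}
for name, design in (("unbalanced", unbalanced), ("balanced", balanced)):
    results[name] = (allele1_count(design["case"]), allele1_count(design["control"]))
    print(name, "case:", results[name][0], "control:", results[name][1])
```

In the unbalanced design the first allele looks far more frequent among cases purely because of group membership. With equal sampling from both groups the case and control counts match, as expected when there is no real association.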
Population structure prediction
• Treated as an unsupervised learning
problem (i.e. clustering)
Clustering
• Suppose we want to cluster n vectors in $\mathbb{R}^d$ into two groups. Define $C_1$ and $C_2$ as the two groups.
• Our objective is to find $C_1$ and $C_2$ that minimize
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
where $m_i$ is the mean of class $C_i$
K-means algorithm for two clusters
Input: $x_i \in \mathbb{R}^d$, $i = 1, \dots, n$
Algorithm:
1. Initialize: assign $x_i$ to $C_1$ or $C_2$ with equal probability and compute means:
$$m_1 = \frac{1}{|C_1|} \sum_{x_i \in C_1} x_i, \qquad m_2 = \frac{1}{|C_2|} \sum_{x_i \in C_2} x_i$$
2. Recompute clusters: assign $x_i$ to $C_1$ if $\|x_i - m_1\| < \|x_i - m_2\|$, otherwise assign to $C_2$
3. Recompute means $m_1$ and $m_2$
4. Compute objective
$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2$$
5. Compute objective of new clustering. If the difference is smaller than a fixed threshold then stop, otherwise go to step 2.
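The steps above can be sketched in plain Python. This is an illustrative sketch, not a reference implementation: the names `two_means`, `tol` and `seed` are hypothetical, and two details are assumptions the slides do not cover. The random initialization is redrawn until both clusters are non-empty, and a cluster that becomes empty keeps its previous mean.

```python
import random

def squared_distance(x, m):
    """Squared Euclidean distance ||x - m||^2."""
    return sum((a - b) ** 2 for a, b in zip(x, m))

def mean(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def objective(points, labels, means):
    """Sum of squared distances from each point to the mean of its cluster."""
    return sum(squared_distance(x, means[c]) for x, c in zip(points, labels))

def two_means(points, tol=1e-9, seed=0):
    """Two-cluster k-means; needs at least two points. Labels are 0 and 1."""
    rng = random.Random(seed)
    # Step 1: assign each point to a cluster with equal probability,
    # redrawing until both clusters are non-empty (assumption).
    labels = [rng.randrange(2) for _ in points]
    while len(set(labels)) < 2:
        labels = [rng.randrange(2) for _ in points]
    means = [mean([x for x, c in zip(points, labels) if c == k]) for k in range(2)]
    history = [objective(points, labels, means)]
    while True:
        # Step 2: assign each point to the cluster with the closer mean.
        labels = [0 if squared_distance(x, means[0]) < squared_distance(x, means[1]) else 1
                  for x in points]
        # Step 3: recompute means; an emptied cluster keeps its old mean (assumption).
        for k in range(2):
            members = [x for x, c in zip(points, labels) if c == k]
            if members:
                means[k] = mean(members)
        # Steps 4-5: compute the new objective and stop once the improvement is tiny.
        history.append(objective(points, labels, means))
        if history[-2] - history[-1] < tol:
            return labels, means, history

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels, means, history = two_means(points)
print(labels)
print(history)
```

Comparing squared distances gives the same assignment as comparing the norms in step 2, since both are non-negative. The `history` list records the objective after each iteration, which makes it easy to check that the value never goes up.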
K-means
• Is it guaranteed to find the clustering
which optimizes the objective?
• It is only guaranteed to find a local optimum
• We can prove that the objective decreases or stays the same with each subsequent iteration
Proof sketch of convergence
of k-means
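$$\sum_{i=1}^{2} \sum_{x_j \in C_i} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i\|^2 \;\ge\; \sum_{i=1}^{2} \sum_{x_j \in C_i^*} \|x_j - m_i^*\|^2$$
where $C_i^*$ are the clusters after reassignment and $m_i^*$ are their recomputed means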
Justification of first inequality: by
assigning xj to the closest mean the
objective decreases or stays the
same
Justification of second inequality:
for a given cluster its mean
minimizes squared error loss
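The second justification can be spelled out with a short derivation. It is a standard identity rather than something shown on the slides; the symbols $C$, $m$ and $c$ are introduced here for illustration, with $C$ any non-empty cluster, $m$ its mean and $c$ any other point in $\mathbb{R}^d$.

```latex
\begin{align*}
\sum_{x_j \in C} \|x_j - c\|^2
  &= \sum_{x_j \in C} \|(x_j - m) + (m - c)\|^2 \\
  &= \sum_{x_j \in C} \|x_j - m\|^2
     + 2\,(m - c)^{\top} \sum_{x_j \in C} (x_j - m)
     + |C|\,\|m - c\|^2 \\
  &= \sum_{x_j \in C} \|x_j - m\|^2 + |C|\,\|m - c\|^2
  \;\ge\; \sum_{x_j \in C} \|x_j - m\|^2 .
\end{align*}
```

The cross term vanishes because $\sum_{x_j \in C} (x_j - m) = 0$ by the definition of the mean. Taking $C$ to be a cluster after reassignment, $m$ its recomputed mean and $c$ the previous mean shows that recomputing the means cannot increase the objective.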