L17 preview - Computer Science and Engineering

Download Report

Transcript L17 preview - Computer Science and Engineering

CSE182-L17
Clustering
Population Genetics: Basics
Unsupervised Clustering
• Given a set of points (in ndimensions), and k, compute the
k “best clusters”.
• In k-means, clustering is done
by choosing k centers (means).
• Each point is assigned to the
closest center.
• The notion of “best” is defined
by distances to the center.
• Question: How can we compute
the k best centers?
Cluster
s
Distance
•
Given a data point v and a set of
points X,
define the distance from v to X
d(v, X)
as the (Euclidean) distance from v
to the closest point from X.
•
Given a set of n data points
V={v1…vn} and a set of k points X,
define the Squared Error
Distortion
d(V,X) = ∑d(vi, X)2 / n
1<i<n
v
K-Means Clustering Problem: Formulation
• Input: A set, V, consisting of n points and a
parameter k
• Output: A set X consisting of k points (cluster
centers) that minimizes the squared error
distortion d(V,X) over all possible choices of X
This problem is NP-complete in general.
1-Means Clustering Problem: an Easy Case
•
•
Input: A set, V, consisting of n points.
Output: A single point X that minimizes d(V,X) over all possible
choices of X.
This problem is easy.
However, it becomes very difficult for more than one center.
An efficient heuristic method for k-Means clustering is the Lloyd
algorithm
K-means: Lloyd’s algorithm
• Choose k centers at random:
– X’ = {x1,x2,x3,…xk}
• Repeat
– X=X’
– Assign each v  V to the closest cluster j
• d(v,xj) = d(v,X)  Cj= Cj  {v}
– Recompute X’
• x’j  (∑
v  Cj
• until (X’ = X)
v) /|Cj|
expression in condition 2
5
4
x1
3
x2
2
1
x3
0
0
1
2
3
4
expression in condition 1
5
expression in condition 2
5
4
x1
3
x2
2
1
x3
0
0
1
2
3
4
expression in condition 1
5
expression in condition 2
5
4
x1
3
2
x3
x2
1
0
0
1
2
3
4
expression in condition 1
5
expression in condition 2
5
4
x1
3
2
x2
x3
1
0
0
1
2
3
4
expression in condition 1
5
Conservative K-Means Algorithm
•
Lloyd algorithm is fast but in each iteration it moves many data
points, not necessarily causing better convergence.
•
A more conservative method would be to move one point at a
time only if it improves the overall clustering cost
•
The smaller the clustering cost of a partition of data points
is the better that clustering is
•
Different methods can be used to measure this clustering
cost (for example in the last algorithm the squared error
distortion was used)
Microarray summary
• Microarrays (like MS) are a technology for probing
the dynamic state of the cell.
• We answered questions like the following:
– Which genes are coordinately regulated (They have
similar expression patterns in different conditions)?
– How can we reduce the dimensionality of the system?
– Using gene expression values from a sample, can you
predict if the sample is normal (state A) or diseased
(state B)
• The techniques employed for
classification/clustering etc. are general and can
be employed in a number of contexts.
Microarray non-summary
• We did not cover:
– How are the gene expression values measured
(the technology)? (CSE183)
– How do you control variability across different
experiments (normalization)? (CSE183)
– What controls the expression of a gene (gene
regulation), or a set of genes? (CSE 181)
Population Genetics
• The sequence of an individual does not say
anything about the diversity of a population.
• Small individual genetic differences can have a
profound impact on “phenotypes”
– Response to drugs
– Susceptibility to diseases
• Soon, we will have sequences of many individuals
from the same species. Studying the differences
will be a major challenge.
Population Structure
Africa
Eurasia
East Asia
Oceania
• 377 locations (loci) were sampled in 1000 people from 52
populations.
• 6 genetic clusters were obtained, which corresponded to 5
geographic regions (Rosenberg et al. Science 2003)
America
Population Genetics
• What is it about our genetic makeup that makes us
measurably different?
• These genetic differences are correlated with
phenotypic differences
• With cost reduction in sequencing and genotyping
technologies, we will know the sequence for entire
populations of individuals.
• Here, we will study the basics of this
polymorphism data, and tools that are being
developed to analyze it.
What causes variation in a
population?
• Mutations (may lead to SNPs)
• Recombinations
• Other genetic events (Ex: microsatellite
repeats)
• Deletions, inversions
Single Nucleotide Polymorphisms
Infinite Sites Assumption:
Each site mutates at most
once
00000101011
10001101001
01000101010
01000000011
00011110000
00101100110
Short Tandem Repeats
GCTAGATCATCATCATCATTGCTAG
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATTGCTAGTTA
GCTAGATCATCATCATCATCATTGC
4
3
5
3
3
5
STR can be used as a DNA
fingerprint
• Consider a collection of
regions with variable length
repeats.
• Variable length repeats will
lead to variable length DNA
• Vector of lengths is a fingerprint
4
3
5
3
3
5
2
3
1
2
1
3
positions
Recombination
00000000
11111111
00011111
What if there were no
recombinations?
• Life would be simpler
• Each sequence would have a single parent
• The relationship is expressed as a tree.
The Infinite Sites Assumption
00000000
3
00100000
5
8
00100001
•
•
•
00101000
The different sites are linked. A 1 in position 8 implies 0 in
position 5, and vice versa.
Some phenotypes could be linked to the polymorphisms
Some of the linkage is “destroyed” by recombination
Infinite sites assumption and Perfect Phylogeny
• Each site is mutated
at most once in the
history.
• All descendants must
carry the mutated
value, and all others
must carry the
ancestral value
i
1 in position i
0 in position i
Perfect Phylogeny
• Assume an evolutionary model in which no
recombination takes place, only mutation.
• The evolutionary history is explained by a tree in
which every mutation is on an edge of the tree. All
the species in one sub-tree contain a 0, and all
species in the other contain a 1. Such a tree is
called a perfect phylogeny.
• How can one reconstruct such a tree?
The 4-gamete condition
• A column i partitions the
set of species into two sets
i0, and i1
• A column is homogeneous
w.r.t a set of species, if it
has the same value for all
species. Otherwise, it is
heterogenous.
• EX: i is heterogenous w.r.t
{A,D,E}
A
i0 B
C
D
i1 E
F
i
0
0
0
1
1
1
4 Gamete Condition
• 4 Gamete Condition
– There exists a perfect phylogeny if and only if for all
pair of columns (i,j), either j is not heterogenous w.r.t i0,
or i1.
– Equivalent to
– There exists a perfect phylogeny if and only if for all
pairs of columns (i,j), the following 4 rows do not exist
(0,0), (0,1), (1,0), (1,1)
4-gamete condition: proof
• Depending on which
edge the mutation j
occurs, either i0, or i1
should be homogenous.
• (only if) Every perfect
phylogeny satisfies the
4-gamete condition
• (if) If the 4-gamete
condition is satisfied,
does a prefect
phylogeny exist?
i
i0
i1
An algorithm for constructing a
perfect phylogeny
• We will consider the case where 0 is the ancestral
state, and 1 is the mutated state. This will be
fixed later.
• In any tree, each node (except the root) has a
single parent.
– It is sufficient to construct a parent for every node.
• In each step, we add a column and refine some of
the nodes containing multiple children.
• Stop if all columns have been considered.
Inclusion Property
• For any pair of columns i,j
– i < j if and only if i1  j1
• Note that if i<j then the
edge containing i is an
ancestor of the edge
containing i
i
j
Example
1 2 3 4 5
A 1 1 0 0 0
B 0 0 1 0 0
C 1 1 0 1 0
D 0 0 1 0 1
E 1 0 0 0 0
r
A
B
Initially, there is a single clade r, and
each node has r as its parent
C D
E
Sort columns
• Sort columns according to the
inclusion property (note that
the columns are already sorted
here).
• This can be achieved by
considering the columns as
binary representations of
numbers (most significant bit
in row 1) and sorting in
decreasing order
A
B
C
D
E
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
5
0
0
0
1
0
Add first column
A
B
C
D
E
• In adding column i
– Check each edge and
decide which side you
belong.
– Finally add a node if
you can resolve a clade
1 2 3 4 5
1 1 0 0 0
0 0 1 0 0
1 1 0 1 0
0 0 1 0 1
1 0 0 0 0
r
u
A
C
E
B
D
Adding other columns
A
B
C
D
E
• Add other
columns on
edges using the
ordering
property
1
1
0
1
0
1
2
1
0
1
0
0
3
0
1
0
1
0
4
0
0
1
0
0
r
1
E
3
2
B
5
4
D
C
A
5
0
0
0
1
0
Unrooted case
• Switch the values in each column, so that 0
is the majority element.
• Apply the algorithm for the rooted case
Handling recombination
• A tree is not sufficient as a sequence may have 2
parents
• Recombination leads to loss of correlation between
columns
Linkage (Dis)-equilibrium (LD)
• Consider sites A &B
• Case 1: No recombination
– Pr[A,B=0,1] = 0.25
• Linkage disequilibrium
• Case 2:Extensive
recombination
– Pr[A,B=(0,1)=0.125
• Linkage equilibrium
A
0
0
0
0
1
1
1
1
B
1
1
0
0
0
0
0
0