slides - Washington University in St. Louis
Download
Report
Transcript slides - Washington University in St. Louis
Take a Walk and Cluster Genes:
A TSP-based Approach to Optimal
Rearrangement Clustering
Sharlee Climer and
Weixiong Zhang
This research was supported in part by NDSEG and Olin Fellowships
and by NSF grants IIS-0196057 and ITR/EIA-0113618.
Overview
Introduction
Example
Results
Conclusion
Sharlee Climer
Washington University in St. Louis
2
Introduction
Rearrangement clustering
Rearrange rows of a matrix
Minimize the sum of the differences between
adjacent rows
min S d(i, i+1)
Rows correspond to objects
Columns correspond to features
Sharlee Climer
Washington University in St. Louis
3
Introduction
Applications
Information retrieval
Manufacturing
Software engineering
Sharlee Climer
Washington University in St. Louis
4
Example
Sharlee Climer
Washington University in St. Louis
5
Example
Bond Energy Algorithm (BEA)
Introduced in 1972 (McCormick, Schweitzer,
White)
Approximate solution
Still widely used
Sharlee Climer
Washington University in St. Louis
6
Example
Sharlee Climer
Washington University in St. Louis
7
Example
Optimal solution
Lenstra (1974) observed equivalence to the
Traveling Salesman Problem (TSP)
Given n cities and the distance between each pair
Find shortest cycle visiting every city
NP-hard problem
Sharlee Climer
Washington University in St. Louis
8
Example
Transform into a TSP
Each object corresponds to a city
Distance between two cities equal to difference
between the corresponding objects
Dummy city added to problem
Costs from dummy city to all other cities equal a
constant
Location of dummy city indicates position to cut
cycle into a path
Sharlee Climer
Washington University in St. Louis
9
Example
TSP solvers extremely slow even for small
problems in the 70’s
Massive research efforts to solve TSP over
last three decades
Current solvers
Concorde (Applegate, Bixby, Chvatal, Cook,
2001)
Sharlee Climer
Solved a 15,112 city TSP
Washington University in St. Louis
10
Example
Sharlee Climer
Washington University in St. Louis
11
Example
BEA and TSP offer approximate and optimal
solutions
We have observed a flaw in the objective function
when the objects form natural clusters
The objective minimizes the sum of every pair of
adjacent rows
Inter-cluster distances tend to be significantly larger
than intra-cluster distances
Summation dominated by inter-cluster distances
Sharlee Climer
Washington University in St. Louis
12
Example
TSPCluster addresses this flaw
Add k dummy cities
TSP solver ignores inter-cluster distances
k clusters are specified by the output
Minimizes sum of intra-cluster distances
Use sufficiently small constant for distances
to/from dummy cities
Dummy cities never adjacent to each other
Sharlee Climer
Washington University in St. Louis
13
Example
Sharlee Climer
Washington University in St. Louis
14
Results
Arabidopsis
499 genes
25 conditions
Comparison with BEA
Used BEA similarity measure
BEA score: 447,070
TSPCluster score: 452,109 (k = 1)
Sharlee Climer
Washington University in St. Louis
15
Results
BEA
Sharlee Climer
TSPCluster
Washington University in St. Louis
16
Results
Compared with Cluster (Eisen et al., 1998)
and k-ary (Bar-Joseph et al., 2003)
Used Pearson correlation coefficient
Cluster: 398
k-ary: 427
TSPCluster: 436 (k = 1)
Sharlee Climer
Washington University in St. Louis
17
Results
Sharlee Climer
Cluster
k-ary TSPCluster
Washington University in St. Louis
18
Results
TSPCluster with k equal to 2 to 50
How many clusters?
Average inter-cluster distances
BEA local peaks:
Pearson correlation coefficient local peaks:
6, 13, 19, 26, 29, 35, 40, 47
3, 9, 12, 21, 26, 40
Computation time varied
Less than half minute to ~3 minutes
Sharlee Climer
Washington University in St. Louis
19
Results
Sharlee Climer
k = 26
k = 40
Washington University in St. Louis
20
Conclusion
Most problems have errors in their data
Error introduced by approximation algorithms
can’t be expected to “undo” this error
Computers are cheap
Computers and solvers are sophisticated
Don’t have to always resort on approximate
solutions even for NP-hard problems
Sharlee Climer
Washington University in St. Louis
21
Conclusion
Rearrangement clustering provides a linear
ordering
Linear ordering inherent to many applications
Information retrieval
Manufacturing
Software engineering
Sharlee Climer
Washington University in St. Louis
22
Conclusion
Gene data arranged in linear order to
examine data
Linear ordering not necessarily essential to
gene clustering problems
Current work
Optimally solve subproblems in clustering
algorithms
Sharlee Climer
Washington University in St. Louis
23
Questions?
Sharlee Climer
Washington University in St. Louis
24