slides - Washington University in St. Louis

Download Report

Transcript slides - Washington University in St. Louis

Take a Walk and Cluster Genes:
A TSP-based Approach to Optimal
Rearrangement Clustering
Sharlee Climer and
Weixiong Zhang
This research was supported in part by NDSEG and Olin Fellowships
and by NSF grants IIS-0196057 and ITR/EIA-0113618.
Overview




Introduction
Example
Results
Conclusion
Sharlee Climer
Washington University in St. Louis
2
Introduction

Rearrangement clustering





Rearrange rows of a matrix
Minimize the sum of the differences between
adjacent rows
min S d(i, i+1)
Rows correspond to objects
Columns correspond to features
Sharlee Climer
Washington University in St. Louis
3
Introduction

Applications



Information retrieval
Manufacturing
Software engineering
Sharlee Climer
Washington University in St. Louis
4
Example
Sharlee Climer
Washington University in St. Louis
5
Example

Bond Energy Algorithm (BEA)



Introduced in 1972 (McCormick, Schweitzer,
White)
Approximate solution
Still widely used
Sharlee Climer
Washington University in St. Louis
6
Example
Sharlee Climer
Washington University in St. Louis
7
Example


Optimal solution
Lenstra (1974) observed equivalence to the
Traveling Salesman Problem (TSP)



Given n cities and the distance between each pair
Find shortest cycle visiting every city
NP-hard problem
Sharlee Climer
Washington University in St. Louis
8
Example

Transform into a TSP



Each object corresponds to a city
Distance between two cities equal to difference
between the corresponding objects
Dummy city added to problem


Costs from dummy city to all other cities equal a
constant
Location of dummy city indicates position to cut
cycle into a path
Sharlee Climer
Washington University in St. Louis
9
Example



TSP solvers extremely slow even for small
problems in the 70’s
Massive research efforts to solve TSP over
last three decades
Current solvers

Concorde (Applegate, Bixby, Chvatal, Cook,
2001)

Sharlee Climer
Solved a 15,112 city TSP
Washington University in St. Louis
10
Example
Sharlee Climer
Washington University in St. Louis
11
Example





BEA and TSP offer approximate and optimal
solutions
We have observed a flaw in the objective function
when the objects form natural clusters
The objective minimizes the sum of every pair of
adjacent rows
Inter-cluster distances tend to be significantly larger
than intra-cluster distances
Summation dominated by inter-cluster distances
Sharlee Climer
Washington University in St. Louis
12
Example


TSPCluster addresses this flaw
Add k dummy cities


TSP solver ignores inter-cluster distances


k clusters are specified by the output
Minimizes sum of intra-cluster distances
Use sufficiently small constant for distances
to/from dummy cities

Dummy cities never adjacent to each other
Sharlee Climer
Washington University in St. Louis
13
Example
Sharlee Climer
Washington University in St. Louis
14
Results

Arabidopsis



499 genes
25 conditions
Comparison with BEA



Used BEA similarity measure
BEA score: 447,070
TSPCluster score: 452,109 (k = 1)
Sharlee Climer
Washington University in St. Louis
15
Results
BEA
Sharlee Climer
TSPCluster
Washington University in St. Louis
16
Results





Compared with Cluster (Eisen et al., 1998)
and k-ary (Bar-Joseph et al., 2003)
Used Pearson correlation coefficient
Cluster: 398
k-ary: 427
TSPCluster: 436 (k = 1)
Sharlee Climer
Washington University in St. Louis
17
Results
Sharlee Climer
Cluster
k-ary TSPCluster
Washington University in St. Louis
18
Results




TSPCluster with k equal to 2 to 50
How many clusters?
Average inter-cluster distances
BEA local peaks:


Pearson correlation coefficient local peaks:


6, 13, 19, 26, 29, 35, 40, 47
3, 9, 12, 21, 26, 40
Computation time varied

Less than half minute to ~3 minutes
Sharlee Climer
Washington University in St. Louis
19
Results
Sharlee Climer
k = 26
k = 40
Washington University in St. Louis
20
Conclusion





Most problems have errors in their data
Error introduced by approximation algorithms
can’t be expected to “undo” this error
Computers are cheap
Computers and solvers are sophisticated
Don’t have to always resort on approximate
solutions even for NP-hard problems
Sharlee Climer
Washington University in St. Louis
21
Conclusion


Rearrangement clustering provides a linear
ordering
Linear ordering inherent to many applications



Information retrieval
Manufacturing
Software engineering
Sharlee Climer
Washington University in St. Louis
22
Conclusion



Gene data arranged in linear order to
examine data
Linear ordering not necessarily essential to
gene clustering problems
Current work

Optimally solve subproblems in clustering
algorithms
Sharlee Climer
Washington University in St. Louis
23
Questions?
Sharlee Climer
Washington University in St. Louis
24