Othogonal Range Clustering - uni
Download
Report
Transcript Othogonal Range Clustering - uni
Ebrahim Ehsanfar
Technical University of Dortmund
A database in a bank records transactions
Amount, Date, Accounts Holders…
Query: find all the transactions (Pi) such that
• The amount is between $1000 and $2000
• It happened between 10:40 AM and 11:20AM
Range: R= [10:40,11:20]x[1000,2000]
Goal: Finding P∩ R
Geometric interpretation
Reported Points={
1
}
3
2
4
P
n
R
Range: R= [10:40,11:20]x[1000,2000]
Goal: Finding P∩ R
Definitions: d(x,C) mincC d(x, c)
d 2 (x,C) mincC d 2 (x, c)
cost(P,C) xP d (x,C)
2
(the sum of the squared distances)
Problem:
Given an input set P R d with P n
and k N, find a set C R d with C k that
minimizescost(P ,C)
Choose k initial centers C = c1,…,ck
While it gets better results
1.
2.
•
•
•
For i = 1,...,N
- find closest center ck of C to each point pi
- assign instance pi to cluster Ck
For k = 1,...,K
set ck to be the center of mass of all points in Ci
The algorithm terminates in a local minimum
Lloyds algorithm is very sensitive to the initial
values of the centers
We can give better initial assignments to
Lloyds algorithm (seeding algorithms)
A well known seeding procedure, K-means++
[Arthur et al. 2007]
In each iteration, the next center is chosen
randomly from the input points
The distribution over the points is not
uniform
Each point picked with probability
proportional to the minimum squared
distance from it to the picked centers
Gives an O(log(k)) approximation guarantee
Goal: Compute a set of k centers, such that
the sum of the squared distances between
each point and its nearest center is minimized
A common approach is to represent points as
a collection of canonical subsets
Recall ,to the range
reporting problem
Any set can be formed as the disjoint union of
canonical subsets
Many ways to select canonical subsets
Choice affects the space and time
complexities
Goal: Build a data structure to strike a
balance between:
# of total canonical subsets (Space)
# of canonical subsets needed to answer a query
(Time)
Canonical subsets through the use of a
partition tree:
rooted (binary)tree
Leaves: points of P
Any node u: associated with a subset of P
Canonical
▪ points stored in the leaves of the subtree rooted
subtree ofat u
{59,62,70,80}
The search paths to a and b (e.g. [25,90] )
Query time: O((log n)+k)
Space complexity: O(n)
(binary tree with ‘n’ leaves)
Assume d > 1
We want to perform range searching in
Build T with respect to the x1-coordinate
For each canonical set, build a (d-1) dimensional
data structure using coordinates
For answering d-dimensional query
Find the canonical nodes of T associated with
Make a d-1 dimensional Query on each canonical
subtree recursively, using
Total: O(n log n)
Suppose (d-1)-dim. range tree has size of
Then size of d-dim. range tree is
d
Time : O(log n + k)
Space : O(n log
(d -1)
n)
Key Idea:
Replace many points by one weighted
representative
A small weighted set, such that for any set of k
cluster centers the clustering cost of the coreset is
an approximation for the clustering cost of the
original set P with small relative error
Notation [Har-Peled, Mazumdar 2004] :
Let k N and 1. A weightedmultisetS R d
with positiveweight function w: s R is called (k, ) - coreset
of P iff for each C R d of size C k we have
(1 - )cost(P,C) costw (S, C) (1 )cost(P,C)
[Agarwal et. al. 2001], introducing coreset
[Har-Peled and Kushal 2005], size of the coreset
is independent of n
[Feldman, Monemizadeh, Sohler 2007]: weak
coreset
[Chen 2007]: Size of coresets is linear in d (High
dim. Data)
[Ackermann et al. 2010] Coreset by Adaptive
sampling
[Feldman, Schmidt, Sohler 2013] Coreset size is
independent of n and d
Observation: (Merge [Har-Peled, Mazumdar 2004] )
If
Then
S1 is a (k , ) coreset for P1
S2 is a (k , ) coreset for P2
S=S1∩S2 is a
(k , ) coreset for
P=P1∩P2
In any level i, you can get (almost) 2^i (n/(2^i))
Points in form of the canonical subsets
Replace any canonical subsets with the associated coreset
The final coreset range tree is ready for k- means clustering
the points within any orthogonal range queries
Dataset: Hip2 (Hipparcos 2-High precision
parallax collecting satellite)
Scientific mission of the European Space
Agency (ESA)
launched in 1989 and operated between 1989
and 1993
Get accurate measurements of the positions
of objects on the sky
The Hipparcos Catalogue (1997):
A high-precision catalogue of 117,995 stars
The lower precision data (more than a million)
Dimension: More than 40
Our experimented dataset:
500,000 of the more accurate points of lower
precision catalogue
Only the coordination of the stars (in Radian from 0 2*pi)
Linear size data structures
Simplex range clustering
Other machine learning methods (e.g.
regression)
Database index data structures
SELECT * FROM Book WHERE 50$<price<100$
k-Cluster * FROM Book WHERE 50$<price<100$