SAXually Explicit Images:
Finding Unusual Shapes
Li Wei
Eamonn Keogh
Xiaopeng Xi
Computer Science & Engineering Dept.
University of California – Riverside
Appears as a Google tech talk; search for “Keogh SAXually”
http://video.google.com/videoplay?docid=6642985254445857159
Time Series Data Mining Group
Outline
• Shape Representations and Distance Measures
• Shape Discords (i.e. unusual shapes)
• Algorithm
– Shape Discords Discovery Framework
– Approximating the Optimal Ordering
• Empirical Evaluation
• Conclusion
Shape Datasets
Fruit fly wings
Skulls
Sea animals
Leaves
Petroglyphs
Butterflies
Nematodes
Lizards
Arrowheads
Shape Representations
[Figure: a shape and its 1D time series representation; horizontal axis from 0 to 1200]
We can convert shapes into a 1D signal. By doing this we remove information about scale and offset.
But we must deal with rotation in our algorithms …
There are three ways to be rotation invariant:
Landmarking, Rotation Invariant Features, Brute Force Rotation Alignment…
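The conversion described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the shape is given as an ordered list of boundary points sampled evenly along the contour, measures the distance from the centroid of those points to each point (which removes offset), and z-normalizes the result (which removes scale). The function name `contour_to_series` is hypothetical.

```python
import math

def contour_to_series(points):
    """Distance from the centroid to each boundary point, z-normalized.

    points: ordered list of (x, y) boundary points (assumed evenly sampled).
    """
    n = len(points)
    # Centroid of the boundary points; measuring from it removes translation.
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    dists = [math.hypot(x - cx, y - cy) for x, y in points]
    # z-normalization removes the overall scale of the shape.
    mean = sum(dists) / n
    std = math.sqrt(sum((d - mean) ** 2 for d in dists) / n)
    if std == 0:
        return [0.0] * n  # a perfect circle carries no shape information
    return [(d - mean) / std for d in dists]
```

Rotating the shape, or starting the trace at a different boundary point, only shifts this series circularly, which is why rotation still has to be handled separately.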
Landmarking*: Find the one “True” Rotation
• Domain Specific Landmarking
Find some fixed point in your domain, e.g. the nose on a face, the stem of a leaf, the tail of a fish …
• Generic Landmarking
Find the major axis of the shape and use that as the canonical alignment
• Problem
It does not work in many cases.
[Figure: three primate skulls labeled A, B and C (Owl Monkey, species unknown; Owl Monkey, Northern Gray-Necked; Orangutan), shown under Generic Landmark Alignment and under Best Rotation Alignment]
* Xie, J. and Heng, P. Shape Modeling Using Automatic Landmarking. MICCAI 2005.
Rotation Invariant Features*
• Possible features include:
Ratio of perimeter to area, fractal measures, elongatedness, circularity, min/max/mean curvature, entropy, perimeter of convex hull and histograms
• Problem
When rotation information is thrown away, some useful information is inevitably thrown away with it.
[Figure: skulls labeled Orangutan, Orangutan (juvenile), Borneo Orangutan, Red Howler Monkey and Mantled Howler Monkey, compared under two representations: a histogram of the distances between two randomly chosen points on the perimeter of the shape, and the 1D centroid representation]
* Cardone, A., Gupta, S. K., and Karnik, M. A Survey of Shape Similarity Assessment Algorithms for Product Design and Manufacturing Applications. ASME Journal, 2003
Brute Force Rotation Alignment
• Idea
Achieve true rotation invariance by exhaustive brute force search over all possible rotations
• Rotation Matrix
Given a time series C of length n, its possible rotations constitute a rotation matrix $\mathbf{C}$ of size n by n
$$\mathbf{C} = \begin{bmatrix} c_1 & c_2 & \cdots & c_{n-1} & c_n \\ c_2 & \cdots & c_{n-1} & c_n & c_1 \\ \vdots & & & & \vdots \\ c_n & c_1 & c_2 & \cdots & c_{n-1} \end{bmatrix}$$
[Figure: panels labeled C and C1 to C13]
• Rotation Invariant Euclidean Distance (RED)
$$RED(Q, C) = \min_{1 \le j \le n} ED(Q, C_j)$$
where $C_j$ is the j-th row of $\mathbf{C}$
• Problem
High computational cost
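The rotation matrix and RED can be sketched directly from their definitions. This is an illustrative sketch rather than the authors' implementation; the function names `rotations`, `euclidean` and `red` are hypothetical, and time series are assumed to be plain Python lists of equal length.

```python
import math

def euclidean(a, b):
    """Euclidean distance (ED) between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def rotations(c):
    """Rows of the rotation matrix: every circular left shift of c."""
    return [c[j:] + c[:j] for j in range(len(c))]

def red(q, c):
    """Rotation invariant Euclidean distance: the minimum ED over all n rotations of c."""
    return min(euclidean(q, row) for row in rotations(c))
```

Each call to `red` evaluates n Euclidean distances of length n, so one comparison costs O(n²), which is the high computational cost noted on the slide.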
We have forcefully shown this is the right representation; see our VLDB 2006 paper.
Shape Discord
• The shape that is least similar to all other shapes in a dataset, i.e. the one with the largest distance to its nearest match
[Figure labels: SQUID Dataset (subset); Specimen 20773; 1st Discord (Castroville Cornertang); 2nd Discord (Martindale point)]
Brute Force Shape Discord Discovery
For each shape in the dataset
Find the distance to its nearest neighbor
Check whether it is a better candidate as the discord
Algorithm [dist, index] = BruteForce_Search(S)
best_so_far_dist = 0
best_so_far_index = NaN
For p = 1 to |S|
    nearest_neighbor_dist = infinity
    For q = 1 to |S|
        If p != q
            If Dist(Cp, Cq) < nearest_neighbor_dist
                nearest_neighbor_dist = Dist(Cp, Cq)
            End
        End
    End
    If nearest_neighbor_dist > best_so_far_dist
        best_so_far_dist = nearest_neighbor_dist
        best_so_far_index = p
    End
End
Return [best_so_far_dist, best_so_far_index]
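A runnable rendering of the brute force search might look like the sketch below. It follows the slide's pseudocode, with a few choices that are assumptions of this sketch: indices are 0-based, `None` stands in for NaN, the distance function is passed in as `dist`, and datasets with fewer than two shapes are rejected.

```python
import math

def brute_force_discord(shapes, dist):
    """Return (distance, index) of the shape farthest from its nearest neighbor."""
    if len(shapes) < 2:
        raise ValueError("need at least two shapes")
    best_so_far_dist = 0.0
    best_so_far_index = None
    for p in range(len(shapes)):
        nearest_neighbor_dist = math.inf
        for q in range(len(shapes)):
            if p != q:
                d = dist(shapes[p], shapes[q])
                if d < nearest_neighbor_dist:
                    nearest_neighbor_dist = d
        # p becomes the discord candidate if its nearest neighbor is farther away.
        if nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
    return best_so_far_dist, best_so_far_index
```

For a dataset of m shapes this makes m(m − 1) distance calls, each of which may itself be an expensive rotation invariant comparison.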
Observations from Brute Force Algorithm
For each shape in the dataset (row), find the distance to its nearest neighbor (column), and check whether it is a better candidate for discord.
[Figure: pairwise distance matrix for six shapes with each row's nearest-neighbor distance (nn_dist), shown in three panels: Brute Force, Early Abandon and Magic]
        1      2      3      4      5      6    nn_dist
 1      -    19.1    5.9   29.3   19.5   18.4     5.9
 2    19.1     -    10.1   29.0    2.4    3.0     2.4
 3     5.9   10.1     -    28.1    4.1    8.4     4.1
 4    29.3   29.0   28.1     -    26.7   28.8    26.7
 5    19.5    2.4    4.1   26.7     -     3.4     2.4
 6    18.4    3.0    8.4   28.8    3.4     -      3.0
• Brute Force: every distance in the matrix is computed
• Early Abandon: bsf_dist = 5.9 after the first row; later rows are abandoned as soon as a smaller distance is seen (2.4 < 5.9, 4.1 < 5.9)
• Magic: bsf_dist = 26.7; other rows are abandoned as soon as a smaller distance is seen (2.4 < 26.7, 3.0 < 26.7)
Order Matters!
Heuristic Shape Discord Discovery
Algorithm [dist, index] = Heuristic_Search(S, Outer, Inner)
best_so_far_dist = 0
best_so_far_index = NaN
For each index p given by heuristic Outer
    nearest_neighbor_dist = infinity
    For each index q given by heuristic Inner
        If p != q
            If Dist(Cp, Cq) < best_so_far_dist
                nearest_neighbor_dist = Dist(Cp, Cq)
                break
            End
            If Dist(Cp, Cq) < nearest_neighbor_dist
                nearest_neighbor_dist = Dist(Cp, Cq)
            End
        End
    End
    If nearest_neighbor_dist > best_so_far_dist
        best_so_far_dist = nearest_neighbor_dist
        best_so_far_index = p
    End
End
Return [best_so_far_dist, best_so_far_index]
Consider discord candidates in Outer order
Visit other shapes in Inner order
Apply early abandoning
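The heuristic search can be sketched in runnable form as below. This is an illustration under stated assumptions, not the authors' code: `outer` is a list of indices, `inner` is a function that returns the visiting order for a given candidate, indices are 0-based, and the function also returns the number of distance calls so orderings can be compared. The explicit `abandoned` flag is an addition of this sketch; it ensures that a candidate dropped by early abandoning is never recorded as the discord.

```python
import math

def heuristic_discord(shapes, dist, outer, inner):
    """Return (distance, index, distance_calls) for the top discord."""
    best_so_far_dist = 0.0
    best_so_far_index = None
    calls = 0
    for p in outer:
        nearest_neighbor_dist = math.inf
        abandoned = False
        for q in inner(p):
            if p == q:
                continue
            d = dist(shapes[p], shapes[q])
            calls += 1
            if d < best_so_far_dist:
                # p has a neighbor closer than the current best discord's
                # nearest neighbor, so p cannot be the discord.
                abandoned = True
                break
            if d < nearest_neighbor_dist:
                nearest_neighbor_dist = d
        if not abandoned and nearest_neighbor_dist > best_so_far_dist:
            best_so_far_dist = nearest_neighbor_dist
            best_so_far_index = p
    return best_so_far_dist, best_so_far_index, calls
```

As a toy usage example, take the four "shapes" 0.0, 1.0, 1.5 and 10.0 with absolute difference as the distance. Visiting them in natural order finds the discord (index 3) with 10 distance calls. Visiting index 3 first in the outer loop, and nearest shapes first in the inner loop, finds it with 6. Brute force needs 12.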
Observations from Heuristic Algorithm
Observation 1
• We do not need a perfect outer ordering.
• Among the first few shapes being
examined, there is at least one that has a
large distance to its nearest neighbor.
Observation 2
• We do not need a perfect inner ordering.
• Among the first few shapes being
examined, there is at least one that has a
distance to the candidate that is less than
the current value of the best_so_far_dist
variable.
We want the early-abandoning test, Dist(Cp, Cq) < best_so_far_dist, to be
true as often as possible!
Approximating the Optimal Ordering
• Step 1: symbolize the time series
• Step 2: use locality-sensitive hashing to estimate similarity
between shapes
• Step 3: generate heuristics for outer and inner loops
• Keep in mind:
– Outer heuristic (invoked only once) can take at most O(m) to calculate.
– Inner heuristic (invoked m times) can take at most O(1) to calculate.
SAX: Symbolic Aggregate approXimation
[Figure: a time series discretized into the SAX word baabccbc]
• Lower bounds Euclidean distance
• Achieves dimensionality reduction
• There are now well over 100 SAX papers, see
www.cs.ucr.edu/~eamonn/SAX.htm
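A minimal sketch of the symbolization step is shown below. It assumes the commonly described SAX recipe: z-normalize the series, average it over equal-width segments (piecewise aggregate approximation), then map each average to a letter using breakpoints that cut the standard normal distribution into equiprobable regions. The function name `sax`, the default four-letter alphabet and the use of the population standard deviation are choices of this sketch. The lower-bounding distance is not shown.

```python
from statistics import NormalDist, mean, pstdev

def sax(series, word_length, alphabet="abcd"):
    """Convert a numeric series into a SAX word of the given length."""
    n = len(series)
    if word_length > n:
        raise ValueError("word_length must not exceed the series length")
    mu, sigma = mean(series), pstdev(series)
    z = [(x - mu) / sigma if sigma else 0.0 for x in series]
    # Piecewise aggregate approximation: mean of each segment.
    paa = [mean(z[i * n // word_length:(i + 1) * n // word_length])
           for i in range(word_length)]
    # Breakpoints split the standard normal into len(alphabet) equal-probability bins.
    a = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(j / a) for j in range(1, a)]
    # A value's symbol index is the number of breakpoints below it.
    return "".join(alphabet[sum(v > b for b in breakpoints)] for v in paa)
```

For example, `sax([0, 0, 10, 10, 0, 0, 10, 10], 4)` gives `adad`, and the same series rotated by two points gives `dada`, so a rotation of the shape shows up as a shift of the word.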
Locality-sensitive Hash Function*
• Consider a string s of length w over an alphabet Σ and k indices i1, … , ik chosen uniformly at random from the set {1, … , w}; the locality-sensitive hash function f is defined as
$$f(s) = \langle s[i_1], s[i_2], \ldots, s[i_k] \rangle$$
• For example, f maps adad to aa and dada to dd
• Property
– Strings similar to each other are more likely to be hashed to the same value.
* Indyk, P., Motwani, R., Raghavan, P., and Vempala, S. Locality-Preserving Hashing in Multidimensional Spaces. STOC 1997.
Because of rotations, similar shapes may not be
hashed to the same value.
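The hash function f and the rotation problem can be illustrated with a short sketch. The names `random_indices` and `lsh` are hypothetical, indices are 0-based, and the indices are drawn without repetition, which is an assumption since the slide does not say whether repeats are allowed. The fixed indices (0, 2) used in the example are also an assumption, chosen because they reproduce the hash values shown on the slide.

```python
import random

def random_indices(w, k, seed=None):
    """Pick k distinct positions uniformly at random from a word of length w."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(w), k))

def lsh(s, indices):
    """Project string s onto the chosen positions."""
    return "".join(s[i] for i in indices)

# adad and dada are rotations of the same word, yet they hash differently.
print(lsh("adad", [0, 2]))  # aa
print(lsh("dada", [0, 2]))  # dd
```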
[Figure: two similar shapes, A and B, shown as images, time series representations (horizontal axis from 0 to 1200) and SAX words; A) adad, B) daca]
Rotation Invariant Locality-sensitive Hash Function
• Consider a string s of length w over an alphabet Σ and k indices i1, … , ik chosen uniformly at random from the set {1, … , w}; the rotation invariant locality-sensitive hash function f ’ is defined as
$$f'(s) = \{ \langle p[i_1], p[i_2], \ldots, p[i_k] \rangle \mid p \in LSHIFTS(s) \}$$
where LSHIFTS(s) is the set of all possible left shifts of string s
Images   SAX Words   Shifts                    LSH Values
A)       adad        adad, dada                aa, dd
B)       daca        daca, acad, cada, adac    dc, aa, cd
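The rotation invariant hash can be sketched by hashing every left shift of the word and collecting the results in a set. The function names are hypothetical, indices are 0-based, and the indices (0, 2) are an assumption chosen because they reproduce the LSH values in the slide's example.

```python
def lshifts(s):
    """All left shifts of s, including the unshifted string."""
    return [s[j:] + s[:j] for j in range(len(s))]

def rotation_invariant_lsh(s, indices):
    """Set of hash values obtained from every left shift of s."""
    return {"".join(p[i] for i in indices) for p in lshifts(s)}

print(sorted(rotation_invariant_lsh("adad", [0, 2])))  # ['aa', 'dd']
print(sorted(rotation_invariant_lsh("daca", [0, 2])))  # ['aa', 'cd', 'dc']
```

Both words now share the value `aa`, so the two similar shapes collide even though their unshifted words hash differently. Any rotation of a word yields the same set of values.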
Generating Heuristics
[Figure: each image (1 … m) is converted to a time series and then to a SAX word stored in an array (e.g. daca, acbd, caab, adad); the shifts of each word are hashed into buckets keyed by LSH value (aa, ab, ac, …, dd); the number of times two shapes share a bucket is recorded in an m by m collision matrix]
• Outer order: examine shapes in ascending order of the largest number of collisions each shape has with any other shape.
• Inner order: when candidate shape i is considered in the outer loop, the inner loop examines the other shapes in descending order of the number of collisions they have with shape i.
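The two orderings can be sketched from a collision matrix as below. This is a simplified illustration, not the authors' implementation: it uses a single set of hash indices, counts one collision for each bucket two shapes share, and sorts for clarity, so it does not attempt to meet the O(m) and O(1) budgets stated earlier. The function name `collision_orders` and its return shape are hypothetical.

```python
from collections import defaultdict

def collision_orders(words, indices):
    """Return (outer, inner, collisions) built from SAX words and hash indices."""
    m = len(words)
    # Hash every left shift of every word into buckets keyed by LSH value.
    buckets = defaultdict(set)
    for i, word in enumerate(words):
        for j in range(len(word)):
            p = word[j:] + word[:j]
            buckets["".join(p[t] for t in indices)].add(i)
    # Two shapes collide once for every bucket they share.
    collisions = [[0] * m for _ in range(m)]
    for members in buckets.values():
        for a in members:
            for b in members:
                if a != b:
                    collisions[a][b] += 1
    # Outer: shapes whose best match collides least come first (likely discords).
    outer = sorted(range(m), key=lambda i: max(collisions[i]))

    def inner(i):
        # Inner: shapes colliding most with i come first (likely near neighbors).
        return sorted((q for q in range(m) if q != i),
                      key=lambda q: -collisions[i][q])

    return outer, inner, collisions
```

With the words `adad`, `dada`, `daca` and `bbbb` and indices (0, 2), the word `bbbb` collides with nothing and is placed first in the outer order, while the inner order for `adad` starts with its rotation `dada`.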
The Utility of Shape Discords
[Figures: example datasets with the 1st Discord marked. Recoverable labels: Heliconius melpomene (The Postman); Heliconius erato (Red Passion Flower Butterfly); (Dacrocyte); panels A to G, grouped as A, D, E / B, C, F / G; time series axis from 0 to 900]
The Utility of Heuristic Ordered Search
• Datasets
– Homogeneous: 10,000 projectile points
– Heterogeneous: 5,844 objects
• Measurement
– number of distance function calls by each approach / number of distance
function calls by brute force
[Figure: two plots, Projectile Points and Heterogeneous Dataset; vertical axis from 0 to 1]
Just using early abandoning (which is an original idea in this context)
is 3 or 4 orders of magnitude faster; the Magic heuristic is a further
order of magnitude faster.
Conclusion & Future Work
• We define shape discords.
• We introduce a heuristic-based algorithm to
efficiently find discords and demonstrate its
utility in various domains
• Future Work
– Investigate image discords not only using shapes
but also texture/color
– Conduct field studies of shape discord discovery
in anthropology and archeology
Thank you!
Questions?