Scalability of Similarity Searching

Download Report

Transcript Scalability of Similarity Searching

Scalability
of Similarity Searching
the MUFIN approach
Pavel Zezula
Masaryk University
Brno, Czech Republic
Outline of the talk
• On the importance of similarity and searching
• Main memory and disk-oriented index structures
• Parallel and distributed implementations
• Future trends:
– Findability and retrieval in cloud services
– Self-organizing search networks
April 2014
Seminar Santiago da Compostela
2
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela
3
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela
4
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela
5
Real-life Similarity
• Are they similar?
April 2014
Seminar Santiago da Compostela
6
Real-Life Motivation
The social psychology view
• Any event in the history of organism is, in a sense,
unique.
• Recognition, learning, and judgment presuppose
an ability to categorize stimuli and classify
situations by similarity.
• Similarity (proximity, resemblance, communality,
representativeness, psychological distance, etc.) is
fundamental to theories of perception, learning,
judgment, etc.
• Similarity is subjective a context-dependent
April 2014
Seminar Santiago da Compostela
7
Contemporary Networked Media
The digital data view
• Almost everything that we see, read, hear, write,
measure, or observe can be digital.
• Users autonomously contribute to production of global
media and the growth is exponential.
• Sites like Flickr, YouTube, Facebook host user
contributed content for a variety of events.
• The elements of networked media are related by
numerous multi-facet links of similarity.
April 2014
Seminar Santiago da Compostela
8
Challenge
• Networked media database is getting close to the
human “fact-bases”
– the gap between physical and digital world has blurred
• Similarity data management is needed to connect,
search, filter, merge, relate, rank, cluster, classify,
identify, or categorize objects across various
collections.
WHY?
It is the similarity which is in the world revealing.
April 2014
Seminar Santiago da Compostela
9
State of the art in
Metric Searching technology
Hanan Samet
Foundation of Multidimensional and
Metric Data Structures
Morgan Kaufmann, 2006
P. Zezula, G. Amato, V. Dohnal, and M. Batko
Similarity Search: The Metric Space Approach
Springer, 2005
Teaching material:
http://www.nmis.isti.cnr.it/amato/similarity-searchbook/
April 2014
Seminar Santiago da Compostela
10
Last Similarity Search Conference
Keynote speakers:
Ricardo Baeza-Yates
Jiri Matas
April 2014
Seminar Santiago da Compostela
11
The Big Data
• Loads on a sharp rise – usage on decline
• The (3V) problem of: Volume, Variety, Velocity
• Issues:
– Acquisition: what to keep and what to discard
– Unstructured data: what content to extract
– Datafication: render into data many new aspects
– Inaccuracy: approximation, imprecision, noise
April 2014
Seminar Santiago da Compostela
12
The Big Data problem
• Shifts in thinking:
– from some to all (scalability)
– from clean to messy (approximate)
• Technological obstacles: heterogeneity, scale,
timeliness, complexity, and privacy aspects
• Foundational challenges: scalable and secure
data analysis, organization, retrieval, and
modeling
April 2014
Seminar Santiago da Compostela
13
Similarity & Geometry
• Figures that have the same shape but not
necessarily the same size are similar figures:
• Any two line segments are similar:
A
B
C
D
• Any two circles are similar:
April 2014
Seminar Santiago da Compostela
14
Similarity & Geometry
• Any two squares are similar:
• Any two equilateral triangles are similar:
April 2014
Seminar Santiago da Compostela
15
Similarity & Geometry
Definition:
• Two polygons are similar to each other, if:
1. Their corresponding angles are equal
2. The lengths of their corresponding sides are
proportional
Two geometric figures are either similar
or they are not similar at all
April 2014
Seminar Santiago da Compostela
16
Visual Similarity
• MPEG-7 multimedia content desc. standard
• Global feature descriptors:
– Color, shape, texture, …
– One high-dimensional vector per image
April 2014
Seminar Santiago da Compostela
17
Multiple Visual Aspects
April 2014
Seminar Santiago da Compostela
18
Visual Similarity
- Local feature descriptors – SIFT (Scale Invariant Feature Transform), SURF, etc.
- Invariant to image scaling, small viewpoint change, rotation, noise, illumination
April 2014
Seminar Santiago da Compostela
19
Visual Similarity - finding coresspondence
April 2014
Seminar Santiago da Compostela
20
Biometric Similarity
• Biometrics:
– methods of recognizing a person based on
physiological and behavioral characteristics
• Two types of recognition problems:
– Verification – authenticity of a person
– Identification – recognition of a person
• Examples:
– Finger prints, face, iris, retina, speech, gait, etc.
April 2014
Seminar Santiago da Compostela
21
Biometrics: Fingerprint
• Minutiae detection:
– Detect ridges (endings and branching)
– Represented as a sequence of minutiae
• P=( (r1,e1,θ1), …, (rm,em,θm) )
• Point in polar coordinates (r,e) and direction θ
• Matching of two sequences:
– Align input sequence with database one
– Compute weighted edit distance
• wins,del=620
• wrepl=[0;26] - depending on similarity of two minutiae
April 2014
Seminar Santiago da Compostela
22
Biometrics: Hand Recognition
• Hand image analysis
– Contour extraction, global registration
• Rotation, translation, normalization
– Finger registration
– Contour represented as a set of pixels
F={f1,…,fNF}
• Matching: modified Hausdorff distance
H F , G  max hF , G, hG, F 
h F , G  
April 2014
1
NF
 min f  g
f F gG
h G, F  
1
NG
Seminar Santiago da Compostela
 min
f g
gG f F
23
Remote Biometrics: Approaches
• Detection, normalization, extraction, recognition
• Face recognition
– Methods:
• Appearance-based – analyze the face as a whole
• Model-based – compare individual features (e.g., eyes, mouth)
• Gait recognition
– Less likely to be obscured, low resolution suffices
– Methods are based on shape or dynamics of the person:
• Appearance-based – analyze person’s silhouettes
• Model-based – compare features (e.g., trajectory, angular velocity)
April 2014
Seminar Santiago da Compostela
24
Face similarity
• Face detection
• Face recognition
• face detection – MPEG-7
April 2014
Seminar Santiago da Compostela
25
Search – the goals
1.
2.
3.
4.
We search to get results (papers, books, …)
We ask to find answers (what time … )
We use filters so that the right staff finds us
We browse while wandering and way-finding
in typically restricted space
• In reality, we move fluidly between modes of
ask, browse, filter, and search
April 2014
Seminar Santiago da Compostela
26
Search – some quantitative facts
• 85% of all web traffic comes from search
engines
• 450+ million searches/day are performed in
North America alone
• 70%+ of all searches are done on Google sites
Search is the most popular application
(second to E-mail??)
April 2014
Seminar Santiago da Compostela
27
Search – some experience
• 60% of searchers NEVER go past 1st page of
search results
• The top three results draw 80% of the
attention
• The first few results inordinately influence
query reformulation.
April 2014
Seminar Santiago da Compostela
28
Search - as an interaction
• When we search, our next actions are reactions
to the stimuli of previous search results
• What we find is changing what we seek
• In any case, search must be:
fast, simple, and relevant
April 2014
Seminar Santiago da Compostela
29
Search – changes our cognitive habits
1. We are increasingly handing off the job of
remembering to search engines
2. When we expect information to be easily found
again, we do not remember it well
3. Our original memory of facts is changing to a
memory of ways to find the facts
April 2014
Seminar Santiago da Compostela
30
Evolution of Search Engine Strategies
well established
cutting-edge
research
self-organized
peer-to-peer
distributed
parallel
Scalability
centralized
grade
high
data volume
number of users (queries)
variety of data types
multi-aspect queries
Determinism
exact match
precise
same answer
fixed infrastr.
► similarity
► approximate
► variable
► mapping
low
April 2014
Seminar Santiago da Compostela
31
The MUFIN Approach
MUFIN: MUlti-Feature Indexing Network
Extensibility
metric space
Scalability
P2P structure
SEARCH
infrastructure
Independence
Infrastructure as a service
April 2014
Seminar Santiago da Compostela
32
Extensibility:
Metric Abstraction of Similarity
• Metric space: M = (D,d)
– D – domain
– distance function d(x,y)
x,y,z  D
• d(x,y) > 0
• d(x,y) = 0  x = y
• d(x,y) = d(y,x)
• d(x,y) ≤ d(x,z) + d(z,y)
April 2014
Seminar Santiago da Compostela
- non-negativity
- identity
- symmetry
- triangle inequality
33
Examples of Distance Functions
• Lp Minkovski distance (for vectors)
• L1 – city-block distance
• L2 – Euclidean distance
• L – infinity
n
L1 ( x, y )   | xi  yi |
i 1
L2 ( x, y) 
n
2


x

y
 i i
i 1
n
L ( x, y )  max xi  yi
i 1
• Edit distance (for strings)
• minimal number of insertions, deletions and substitutions
• d(‘application’, ‘applet’) = 6
• Jaccard’s coefficient (for sets A,B)
April 2014
Seminar Santiago da Compostela
d  A, B   1 
A B
A B
34
Examples of Distance Functions
• Mahalanobis distance
– for vectors with correlated dimensions
• Hausdorff distance
– for sets with elements related by another distance
• Earth movers distance
– primarily for histograms (sets of weighted features)
• and many others
April 2014
Seminar Santiago da Compostela
35
Similarity Search Problem
• For X D in metric space M,
pre-process X so that the similarity queries
are executed efficiently.
In metric space: no total ordering exists!
April 2014
Seminar Santiago da Compostela
36
Similarity Range Query
r
q
• range query
– R(q,r) = { x  X | d(q,x) ≤ r }
… all museums up to 2km from my hotel …
April 2014
Seminar Santiago da Compostela
37
Nearest Neighbor Query
• the nearest neighbor query
– NN(q) = x
– x  X, y  X, d(q,x) ≤ d(q,y)
• k-nearest neighbor query
k=5
– k-NN(q,k) = A
– A  X, |A| = k
– x  A, y  X – A, d(q,x) ≤ d(q,y)
q
… five closest museums to my hotel …
April 2014
Seminar Santiago da Compostela
38
Basic Partitioning Principles
• Given a set X  D in M=(D,d), basic
partitioning principles have been defined:
– Ball partitioning
– Generalized hyper-plane partitioning
– Excluded middle partitioning
April 2014
Seminar Santiago da Compostela
39
Ball Partitioning
• Inner set: { x  X | d(p,x) ≤ dm }
• Outer set: { x  X | d(p,x) > dm }
dm
p
April 2014
Seminar Santiago da Compostela
40
Generalized Hyper-plane
• { x  X | d(p1,x) ≤ d(p2,x) }
• { x  X | d(p1,x) > d(p2,x) }
p2
p1
April 2014
Seminar Santiago da Compostela
41
Excluded Middle Partitioning
• Inner set: { x  X | d(p,x) ≤ dm -  }
• Outer set: { x  X | d(p,x) > dm +  }
2
dm
p
p
dm
• Excluded set: otherwise
April 2014
Seminar Santiago da Compostela
42
Clustering
• Cluster data into sets
– bounded by a ball region
– { x  X | d(pi,x) ≤ ric }
April 2014
Seminar Santiago da Compostela
43
Scalability: Peer-to-Peer Indexing
• Local search: Main memory structures
• Native metric techniques: GHT*, VPT*
• Transformation techniques: M-CAN, M-Chord
April 2014
Seminar Santiago da Compostela
44
Main memory structures
1.
ball partitioning methods
1.
2.
3.
4.
2.
Burkhard-Keller Tree
Fixed Queries Tree (Array)
Vantage Point Tree
Excluded Middle Vantage Point Forest
generalized hyper-plane partitioning approaches
1.
2.
3.
Bisector Tree
Generalized Hyper-plane Tree
exploiting pre-computed distances
1.
2.
3.
April 2014
AESA
Linear AESA
Other Methods – Shapiro, Spaghettis
Seminar Santiago da Compostela
45
The M-tree [Ciaccia, Patella, Zezula, VLDB 1997]
1)Paged organization
2)Dynamic
3) Suitable for arbitrary metric spaces
4) I/O and CPU optimization
- computing d can be time-consuming
April 2014
Seminar Santiago da Compostela
46
The M-tree Idea
Metric: L2 (Euclidean)
C
A B
A
F
E
B
D
CDEF
• Depending on the metric, the “shape” of index regions changes
L1 (city-block)
April 2014
L (max-metric)
weighted-Euclidean
Seminar Santiago da Compostela
quadratic form
47
M-tree: Example
o10
o5
o3
Covering
radius
o11
o2
o7
o1
o6
o8
o4
o9
Distance
to parent
o1 4.5 -.-
o1 1.4 0.0
o10 1.2 3.3
o1 0.0 o6 1.4
Leaf
Distance
to parent
entries
April 2014
o2 6.9 -.-
o7 1.3 3.8
o10 0.0 o3 1.2
o2 2.9 0.0
o4 1.6 5.3
o2 0.0 o8 2.9
o7 0.0 o5 1.3 o11 1.0
Seminar Santiago da Compostela
o4 0.0 o9 1.6
48
M-tree family
•
•
•
•
•
•
Bulk loading
Slim-tree
Multi-way insertion
PM-tree
M2-tree
etc.
April 2014
Seminar Santiago da Compostela
49
D-Index [Dohnal, Gennaro, Zezula, MTA 2002]
4 separable buckets at
the first level
2 separable buckets at
the second level
exclusion bucket of
the whole structure
April 2014
Seminar Santiago da Compostela
50
D-index: Insertion
April 2014
Seminar Santiago da Compostela
51
D-index: Range Search
r
r
q
q
r
r
q
q
r
q
r
q
April 2014
Seminar Santiago da Compostela
52
Parallel implementations
• Trees (parallel M-tree):
– Single input point – tree root, a possible bottleneck
– Inherently sequential processing steps
• Paths from the root to leaves – we never know which
node to access before the ancestor is processed
• Hash-based organizations:
– All processing steps can be parallelized
– Single input point remains
April 2014
Seminar Santiago da Compostela
53
Implementation Postulates of
Distributed Indexes
• dynamism – nodes can be added and removed
• no hot-spots – no centralized nodes, no
flooding by messages (transactions)
• update independence – network update at
one site does not require an immediate
change propagation to all the other sites
April 2014
Seminar Santiago da Compostela
54
Distributed
Similarity Search Structures
• Native metric structures:
– GHT* (Generalized Hyperplane Tree)
– VPT* (Vantage Point Tree)
• Transformation approaches:
– M-CAN (Metric Content Addressable Network)
– M-Chord (Metric Chord)
April 2014
Seminar Santiago da Compostela
55
GHT* Address Search Tree
• Based on the Generalized Hyperplane Tree
[Uhl91]
– two pivots for binary partitioning
p3
p4
p5
p6
p6
p3
April 2014
p2
p2
p5
p1
p1
p4
Seminar Santiago da Compostela
56
GHT* Address Search Tree
• Inner node
– two pivots (reference objects)
p1
• Leaf node
p3
p2
p4
p5
p6
– BID pointer to a bucket if
data stored on the current peer
BID1
BID2
BID3
NNID2
– NNID pointer to a peer if
data stored on a different peer
Peer 2
April 2014
Seminar Santiago da Compostela
57
GHT* Address Search Tree
Peer 1
Peer 3
Peer 2
April 2014
Seminar Santiago da Compostela
58
GHT* Range Query
• Range query R(q,r)
– traverse peer’s own AST
– search buckets for all BIDs found
– forward query to all NNIDs found
p3
p2
p5
p1
p2
p4
p5
p6
r
q
p3
p1
April 2014
p6
BID1
BID2
BID3
NNID2
Peer 2
p4
Seminar Santiago da Compostela
59
AST: Logarithmic replication
• Full AST on every peer is space consuming
– replication of pivots grows in a linear way
• Store only a part of the AST:
– all paths to local buckets p
• Deleted sub-trees:
p7
– replaced by NNID
BID
of the leftmost peer
April 2014
1
3
p8
NNID2
p1
p2
p4
p5
p9
NNID3
Seminar Santiago da Compostela
p10
NNID4
p11
NNID5
p12
NNID6
p6
p13
NNID7
p14
NNID8
60
AST: Logarithmic Replication (cont.)
• Resulting tree
– replication of pivots grows in a logarithmic way
p1
p3
p7
BID1
April 2014
p8
p4
p2
NNID5
NNID3
NNID2
Seminar Santiago da Compostela
61
VPT* Structure
• Similar to the GHT* - ball partitioning is used
for AST
– Based on the Vantage Point Tree [Yia93]
• inner nodes have one pivot and a radius
• different traversing conditions
r3
r2
p1 (r1)
p3
p2
r1
p2 (r2)
p3 (r3)
p1
April 2014
Seminar Santiago da Compostela
62
M-Chord: The Metric Chord
• Transform metric space to one-dimensional domain
– Use M-Index - a generalized version of the iDistance
• Divide the domain into intervals
– assign each interval to a peer
• Use the Chord P2P protocol for navigation
• The Skip graphs distributed protocol can be used,
alternatively
April 2014
Seminar Santiago da Compostela
63
M-Chord: Indexing the Distance
• iDistance – indexing technique for vector domains
– cluster analysis = centers = reference points pi
– assign iDistance keys to objects x  Ci
iDist ( x)  d ( pi , x)  i  c
•
– range query R(q,r): identify intervals of
interest
Generalization to metric spaces
– select pivots
{ p0 ,..., pn }
– then partition: Voronoi-style
April 2014
Seminar Santiago da Compostela
64
M-Chord: Chord Protocol
• Peer-to-Peer navigation protocol
• Peers are responsible for intervals of keys
• (log n) hops to localize a node storing a key

M-Chord


set the iDistance domain
make it uniform: function h
mchord ( x)  h(d ( pi , x)  i  c)

Use Chord on this domain
April 2014
Seminar Santiago da Compostela
65
M-Chord: Range Query
• Node Nq initiates the search
• Determine intervals
– generalized iDistance
• Forward requests to peers on
intervals
• Search in the nodes
– using local organization
• Merge the received partial
answers
April 2014
Seminar Santiago da Compostela
66
M-CAN: The Metric CAN
• Based on the Content-Addressable Network
(CAN)
–
a DHT navigating in an N-dimensional vector space
• The Idea:
1. Map the metric space to a vector space
– given N pivots: p1, p2 , … , pN, transform every o into
vector F(o)
2. Use CAN to
– distribute the vector space zones among the nodes
– navigate in the network
April 2014
Seminar Santiago da Compostela
67
• CAN – the principles
2-dimensional vector space
CAN: Principles & Navigation
– the space is divided in zones
– each node “owns” a zone
– nodes know their neighbors
• CAN – the navigation
– greedy routing
– in every step, move to the
neighbor closer to the
target location
April 2014
Seminar Santiago da Compostela
6
2
3
1
5
4
x,y
68
M-CAN: Contractiveness & Filtering
• Use the L∞ as a distance measure
– the mapping F is contractive
L ( F ( x), F ( y))  d ( x, y)
• More pivots  better filtering
– but, CAN routing is better for less dimensions
• Additional filtering
– some pivots are only used for filtering data (inside the
explored nodes)
– they are not used for mapping into CAN vector space
April 2014
Seminar Santiago da Compostela
69
Infrastructure Independence:
MESSIF
Metric Similarity Search Implementation Framework
Performance statistics
Distributed index structures
Centralized index structures
Metric space (D,d)
Operations
Storage
Communication
Net
Vectors
• Lp and quadratic form
Strings
• (weighted) edit and
protein sequence
April 2014
Insert, delete,
range query,
k-NN query,
Incremental k-NN
Volatile memory
Persistent memory
Seminar Santiago da Compostela
70
MUFIN demos
•
•
•
•
•
•
•
•
•
http://disa.fi.muni.cz/imgsearch/similar
http://www.pixmac.com/
http://disa.fi.muni.cz/twenga/
http://disa.fi.muni.cz/fingerprints/
http://disa.fi.muni.cz/subseq/
http://disa.fi.muni.cz/FaceMatch/
http://disa.fi.muni.cz/annotation-ui/
http://disa.fi.muni.cz/motion-match/
http://disa.fi.muni.cz/profimedianeural_network-20M/
May 29-30, 2014
Czech Technology Days at GT
71
MUFIN demos
•
•
•
•
•
•
•
•
http://mufin.fi.muni.cz/imgsearch/similar
http://www.pixmac.com/
http://mufin.fi.muni.cz/twenga/random
http://mufin.fi.muni.cz/fingerprints/random
http://mufin.fi.muni.cz/subseq/random
http://mufin.fi.muni.cz/mma-faces-extended/
http://mufin.fi.muni.cz/plugins/annotation
http://mufin.fi.muni.cz/motion-retrieval/random
April 2014
Seminar Santiago da Compostela
72
Problems with Current Applications
1. Current applications are implemented as
complex software projects; it is costly, highly
qualified specialists are needed
2. They fitfully need massive infrastructure to
build and run multiple indices
3. Applications are much more complex than a
search, which is an important supporting
service
April 2014
Seminar Santiago da Compostela
73
Similarity Data Management System
Application
Areas
Cloud
Services
findability
stimuli
effectiveness
Similarity
Search
Computing
web
enterprise
mobile
social
efficiency
operators
retrieval
April 2014
Seminar Santiago da Compostela
74
Similarity Searching in Clouds
• Retrieval – effectiveness, evaluation, operations, execution, and
efficiency
• Findability – effectiveness, matching, stimuli, extraction, and
efficiency
• Cloud way of computing:
–
–
–
–
Scalability – must balance load across servers and avoid bottlenecks
Elasticity – to allow adding (reducing) capacity to a running system
Availability – to provide high levels of usability and fault tolerance
Privacy – to safeguard data that is valuable or sensitive against
unauthorized access
April 2014
Seminar Santiago da Compostela
75
Self-organizing Search Systems
• Big data in computer networks:
– A very large number of users keeping data within their
devices (computers, mobiles, etc.)
– A very high degree of churning users
– Solution to storing and searching the data:
• Cloud-type systems – data stored within a single infrastructure
• Fully distributed systems – data stored within each user’s device
that brings additional computational power to the system
Facebook
YouTube
April 2014
76/4
Seminar Santiago da
Compostela
Self-organizing Search Systems
• A self-organizing system:
–
–
–
–
Devices are independent and equal in functionality
A set of devices interconnected by semantic links
There is no global control mechanism
Search mechanisms and link management based on
social-like interactions (acquaintance/friend links)
• Properties:
– Scalability
– Adaptability
– Robustness
April 2014
77/4
Seminar Santiago da
Compostela
Architecture
• Each peer stores its piece of data
– It asks and answers similarity queries
• Each peer maintains a query history
– An acquaintance tie - the best answer
– Friend ties – similar peer
P2
Q1
P1
Q2
Q1
Q3
P3
Q3
Q3
April 2014
Seminar Santiago da Compostela
78
Experiments

150 peers organizing 200,000 images


Extracted features: 45-dimensional vectors
Random initial state

April 2014
Five queries per peer
• Query history
– 100 queries limit
Seminar Santiago da Compostela
79
Self-organizing Search Systems
• Our Solution – Metric Social Network
– Resilience to disconnections of peers – initially 2 000
– After each 20th test series:
• S1 – 200 random peers were disconnected
• S2 – 200 the most knowledgeable peers were disconnected
S1
April 2014
S2
80/4
Seminar Santiago da
Compostela