Retrieval 2

K-Nearest Neighbors (kNN)
• Given: a case base CB, a new problem P, and a similarity metric sim
• Obtain: the k cases in CB that are most similar to P according to sim
• Reminder: we used a priority list holding the top k most similar cases found so far (a minimal sketch follows)
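A minimal sketch of this priority-list scheme in Python; case_base, sim, and the kNN signature are illustrative assumptions, not part of the slides:

import heapq

def knn(case_base, p, sim, k):
    # Min-heap keyed by similarity: the root is always the *least*
    # similar of the current top k, so it is the one to evict.
    heap = []
    for case in case_base:
        s = sim(p, case)
        if len(heap) < k:
            heapq.heappush(heap, (s, id(case), case))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, id(case), case))
    # Most similar first.
    return [c for _, _, c in sorted(heap, reverse=True)]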
Forms of Retrieval
• Sequential Retrieval
• Two-Step Retrieval
• Retrieval with Indexed Cases
Retrieval with Indexed Cases
Sources:
– Bergmann's book
– Davenport & Prusack's book on advanced data structures
– Samet's book on data structures
Range Search
[Figure: the space of known problems; a range query around the new problem ("Red light on? Yes; Beeping? Yes; …") retrieves nearby stored cases, e.g., one with solution "Transistor burned!".]
K-D Trees
• Idea: partition the case base into smaller fragments
• Represents a k-dimensional space as a binary tree
• Similar to a decision tree: cases are compared against node tests
• During retrieval:
  – Search descends to a leaf, but
  – unlike decision trees, backtracking may occur
Definition: K-D Trees
• Given:
  – k types T1, …, Tk for the attributes A1, …, Ak
  – A case base CB containing cases in T1 × … × Tk
  – A parameter b (the size of a bucket)
• A k-d tree T(CB) for a case base CB is a binary tree defined as follows:
  – If |CB| < b, then T(CB) is a leaf node (a bucket)
  – Else T(CB) defines a tree such that:
    - the root is marked with an attribute Ai and a value v of its type Ti, and
    - the two k-d trees T({c ∈ CB : c.Ai < v}) and T({c ∈ CB : c.Ai ≥ v}) are the left and right subtrees of the root
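A sketch of this definition in Python, assuming numeric attributes and cases stored as tuples; choosing the attribute round-robin and splitting at a median value are illustrative heuristics, since the definition only requires some Ai and v:

class Leaf:                          # a bucket
    def __init__(self, cases):
        self.cases = cases

class Node:
    def __init__(self, attr, value, left, right):
        self.attr, self.value = attr, value
        self.left, self.right = left, right   # < value / >= value

def build_kdtree(cases, b, depth=0):
    if len(cases) < b:               # |CB| < b: leaf node (bucket)
        return Leaf(cases)
    k = len(cases[0])
    attr = depth % k                 # pick A_i round-robin (a heuristic)
    value = sorted(c[attr] for c in cases)[len(cases) // 2]
    left  = [c for c in cases if c[attr] <  value]
    right = [c for c in cases if c[attr] >= value]
    if not left or not right:        # degenerate split: keep as a bucket
        return Leaf(cases)
    return Node(attr, value,
                build_kdtree(left,  b, depth + 1),
                build_kdtree(right, b, depth + 1))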
Example
[Figure: a k-d tree over a 100×100 space with the cities Denver (5,45), Omaha (25,35), Chicago (35,40), Mobile (50,10), Toronto (60,75), Buffalo (80,65), Atlanta (85,15), and Miami (90,5), plus the query point P(32,45). The root splits on A1 at 35: the <35 side is the bucket {Denver, Omaha}; the ≥35 side splits on A2 at 40; its <40 side splits on A1 at 85 into buckets {Mobile} and {Atlanta, Miami}; its ≥40 side splits on A1 at 60 into buckets {Chicago} and {Toronto, Buffalo}.]
Notes:
• Supports Euclidean distance
• May require backtracking (closest city to P(32,45)?)
• Priority lists are used for computing kNN
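A sketch of retrieval with backtracking over the Leaf/Node structure above (1-NN for brevity; the k-NN version keeps a priority list instead of a single best pair). The test on the way back up, revisiting the far branch only when the splitting plane is closer than the best case found so far, is what forces backtracking for queries like P(32,45):

import math

def nearest(tree, p, best=None):
    # best is a (distance, case) pair carried through the backtracking
    if isinstance(tree, Leaf):
        for c in tree.cases:
            d = math.dist(p, c)      # Euclidean distance
            if best is None or d < best[0]:
                best = (d, c)
        return best
    # Descend first into the side of the splitting plane containing p.
    near, far = ((tree.left, tree.right) if p[tree.attr] < tree.value
                 else (tree.right, tree.left))
    best = nearest(near, p, best)
    # Backtrack: the far side can only contain something better if the
    # splitting plane itself is closer to p than the current best case.
    if best is None or abs(p[tree.attr] - tree.value) < best[0]:
        best = nearest(far, p, best)
    return best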
Using Decision Trees as Index
Standard Decision Tree vs. Variant: InReCA Tree
[Figure: a standard decision-tree node tests an attribute Ai, with one branch per value v1, v2, …, vn; the InReCA variant adds an extra branch labeled "unknown" and, for numeric attributes, interval branches (>v1, >v2, …, >vn).]
Notes:
• Supports Hamming distance; can be combined with numeric attributes
• May require backtracking; operates in a similar fashion to k-d trees
• Priority lists are used for computing kNN
Variation: Point QuadTree
• Particularly suited for performing range search (i.e., similarity assessment)
• Adequate when only a few numeric attributes are known to be important
A node in a (point) quadtree contains:
• 4 pointers: quad['NW'], quad['NE'], quad['SW'], and quad['SE']
• point, of type DataPoint, which in turn contains:
  – name
  – (x,y) coordinates
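This node layout translates directly into Python; a minimal sketch (field names follow the slide):

class DataPoint:
    def __init__(self, name, x, y):
        self.name, self.x, self.y = name, x, y

class QTNode:
    def __init__(self, point):
        self.point = point
        # one pointer per quadrant, all initially empty
        self.quad = {'NW': None, 'NE': None, 'SW': None, 'SE': None}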
Example
[Figure: the same cities as points in a 100×100 space: Denver (5,45), Omaha (25,35), Chicago (35,40), Mobile (50,10), Toronto (60,75), Buffalo (80,65), Atlanta (85,15), and Miami (90,5).]
Insertion order: Chicago, Mobile, Toronto, Buffalo, Denver, Omaha, Atlanta, and Miami
Insertion in Quadtree
[Figure: the resulting quadtree: Chicago at the root; Denver (NW), Toronto (NE), Omaha (SW), and Mobile (SE) as its children; Buffalo under Toronto; Atlanta and Miami under Mobile.]
Insertion Procedure
We define a new type:
  quadrant: 'NW', 'NE', 'SW', 'SE'
function PT_compare(DataPoint dP, dR): quadrant
  // returns the quadrant where dP belongs relative to dR
  if (dP.x < dR.x) then
    if (dP.y < dR.y) then return 'SW'
    else return 'NW'
  else if (dP.y < dR.y) then return 'SE'
  else return 'NE'
Insertion Procedure (Cont.)
procedure PT_insert(Pointer P, R)
  // inserts node P into the tree rooted at R
  Pointer T   // points to the current node being examined
  Pointer F   // points to the parent of T
  Quadrant Q  // auxiliary variable
  T ← R
  F ← null
  while not(T == null) && not(equalCoord(P.point, T.point)) do
    F ← T
    Q ← PT_compare(P.point, T.point)
    T ← T.quad[Q]
  if (T == null) then
    F.quad[Q] ← P
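The same two routines in Python, built on the QTNode/DataPoint sketch above; the only additions are handling of an empty tree and the test data, which mirror the example's insertion order:

def pt_compare(dp, dr):
    # quadrant where dp belongs relative to dr
    if dp.x < dr.x:
        return 'SW' if dp.y < dr.y else 'NW'
    return 'SE' if dp.y < dr.y else 'NE'

def pt_insert(p, r):
    # insert node p into the tree rooted at r; returns the root
    if r is None:
        return p
    t, f, q = r, None, None
    while t is not None and (p.point.x, p.point.y) != (t.point.x, t.point.y):
        f = t
        q = pt_compare(p.point, t.point)
        t = t.quad[q]
    if t is None:                    # fell off the tree: attach p under f
        f.quad[q] = p
    return r

root = None
for name, x, y in [('Chicago', 35, 40), ('Mobile', 50, 10),
                   ('Toronto', 60, 75), ('Buffalo', 80, 65),
                   ('Denver', 5, 45), ('Omaha', 25, 35),
                   ('Atlanta', 85, 15), ('Miami', 90, 5)]:
    root = pt_insert(QTNode(DataPoint(name, x, y)), root)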
Search
Typical query: "find all cities within 50 miles of Washington, DC"
In the initial example: "find all cities within 8 data units of (83,13)"
Solution:
• Discard the NW, SW, and NE quadrants of Chicago (that is, only examine SE)
• There is no need to search the NW and SW quadrants of Mobile
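A hedged sketch of this pruning in Python: recurse into a quadrant only when the query circle intersects its bounding box. The 100×100 default bounds come from the example; the point-to-box distance is computed by coordinate clamping:

def range_search(node, cx, cy, r, xlo=0, ylo=0, xhi=100, yhi=100, out=None):
    # collect all points within distance r of (cx, cy)
    if out is None:
        out = []
    if node is None:
        return out
    if (node.point.x - cx) ** 2 + (node.point.y - cy) ** 2 <= r * r:
        out.append(node.point)
    x, y = node.point.x, node.point.y
    boxes = {'SW': (xlo, ylo, x, y), 'NW': (xlo, y, x, yhi),
             'SE': (x, ylo, xhi, y), 'NE': (x, y, xhi, yhi)}
    for quad, (a, b, c, d) in boxes.items():
        dx = max(a - cx, 0, cx - c)  # distance from center to box, per axis
        dy = max(b - cy, 0, cy - d)
        if dx * dx + dy * dy <= r * r:   # circle touches this quadrant
            range_search(node.quad[quad], cx, cy, r, a, b, c, d, out)
    return out

For the query above, range_search(root, 83, 13, 8) descends only into Chicago's SE quadrant and, under Mobile, only into the NE and SE quadrants, as described on the slide.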
Search (II)
[Figure: a query circle of radius r centered at point A, surrounded by twelve numbered regions (1–12) around the circle's bounding box.]
Let R be the root of the quadtree; which of R's quadrants need to be inspected if R lies in region:
• 1: SE
• 2: SW, SE
• 8: NW
• 11: NW, NE, SE
Priority Queues
• Typical example: printing in a Unix/Linux environment. Printing jobs have different priorities.
• These priorities may override the FIFO policy of the queues (i.e., jobs with the highest priorities get printed first).
Operations supported in a priority queue:
• Insert a new element
• Extract/delete the element with the minimum key (i.e., the highest-priority element)
• In search trees, the priority is based on the distance
• Insertion and deletion can be done in O(log N); looking at the top element takes O(1)
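In Python, the heapq module provides exactly these operations; a minimal sketch, keyed by distance as in the search below:

import heapq

pq = []
heapq.heappush(pq, (4225, 'Chicago'))   # insert: O(log N)
heapq.heappush(pq, (0, 'Mobile'))
heapq.heappush(pq, (25, 'Toronto'))
print(pq[0])                  # peek at the minimum: O(1)   -> (0, 'Mobile')
print(heapq.heappop(pq))      # extract the minimum: O(log N)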
Nearest-Neighbor Search
Problem:
Given a point quadtree T and a point P, find the node in T that is closest to P
Idea:
Traverse the quadtree maintaining a priority list of candidates, ordered by the distance from P to the quadrants containing the candidate nodes
[Figure: the city quadtree with query point P(95,15).]
Distance from P to a Quadrant
Let f⁻¹ be the inverse of the distance-similarity compatible function f
[Figure: a node at (x,y) splits the plane into four quadrants; P lies in the SE quadrant, and P1–P4 mark the point of each quadrant nearest to P.]
distance(P,SE) = 0 (P lies in this quadrant)
distance(P,NW) = f⁻¹(sim(P,(x,y)))
distance(P,SW) = f⁻¹(sim(P,(x,P.y)))
distance(P,NE) = f⁻¹(sim(P,(P.x,y)))
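Assuming squared Euclidean distance plays the role of f⁻¹(sim(·,·)), all four cases collapse into clamping P onto the quadrant's bounding box; a sketch:

def dist2_to_quadrant(px, py, xlo, ylo, xhi, yhi):
    # squared distance from P to the nearest point of the quadrant's
    # bounding box; 0 when P lies inside it
    dx = max(xlo - px, 0, px - xhi)
    dy = max(ylo - py, 0, py - yhi)
    return dx * dx + dy * dy

For example, for P(95,15) and Chicago's NW quadrant (x in [0,35], y in [40,100]) this gives 60² + 25² = 4225, matching Denver's entry in the walkthrough that follows.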
Idea of the Algorithm
[Figure: the quadtree example with query point P = (95,15); the search starts at the root, Chicago.]
Initially:
Candidates = [Chicago (4225)]
Buffer: null (∞)
After examining Chicago and computing the distances from P to its quadrants:
Candidates = [Mobile (0), Toronto (25), Omaha (60), Denver (4225)]
Buffer: Chicago (4225)
List of Candidates
• Termination test: Buffer.distance < distance(candidates.top, P)
  – if "yes", then return Buffer
  – if "no", then continue
• In this particular example, the answer is "no", since Mobile's quadrant is closer to P than Chicago
• Examine the node at the top of candidates (Mobile) and make it the new buffer:
[Figure: P(95,15) relative to Mobile (50,10): Atlanta (85,15) lies in Mobile's NE quadrant, with distance(P,NE) = 0; Miami (90,5) lies in its SE quadrant, with distance(P,SE) = 5.]
Buffer: Mobile (1625)
Finally the Nearest Neighbor is Found
Candidates = [Atlanta (0), Miami (5), Toronto (25), Omaha (60), Denver (4225)]
Buffer: Atlanta (100)
A new iteration:
Candidates = [Miami (5), Toronto (25), Omaha (60), Denver (4225)]
The algorithm terminates, since the distance from Atlanta to P is less than the distance from Miami to P
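Putting the pieces together: a sketch of the whole search in the style of the walkthrough, using the QTNode layout and dist2_to_quadrant from above; a buffer holds the best point so far, a min-heap holds (distance-to-quadrant, node) candidates, and the loop stops exactly when the buffer beats the top candidate:

import heapq, itertools

def nn_search(root, px, py, bounds=(0, 0, 100, 100)):
    tie = itertools.count()              # tie-breaker so nodes never compare
    cand = [(0, next(tie), root, bounds)]
    buffer = (float('inf'), None)        # (squared distance, best point)
    while cand and cand[0][0] < buffer[0]:    # termination test
        _, _, node, (a, b, c, d) = heapq.heappop(cand)
        dx, dy = node.point.x - px, node.point.y - py
        if dx * dx + dy * dy < buffer[0]:
            buffer = (dx * dx + dy * dy, node.point)
        x, y = node.point.x, node.point.y
        boxes = {'SW': (a, b, x, y), 'NW': (a, y, x, d),
                 'SE': (x, b, c, y), 'NE': (x, y, c, d)}
        for quad, box in boxes.items():
            child = node.quad[quad]
            if child is not None:
                heapq.heappush(cand, (dist2_to_quadrant(px, py, *box),
                                      next(tie), child, box))
    return buffer[1]

With root built as in the pt_insert example, nn_search(root, 95, 15) examines Chicago, Mobile, Atlanta, and Miami, then stops and returns Atlanta, as in the walkthrough.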
Complexity
• Experiments show that random insertion of N nodes takes roughly O(N log₄ N)
• Thus, inserting a single node is O(log₄ N)
• But the worst case (the actual complexity) can be much worse
• Range search can be performed in O(2√N)
Delete
First idea:
• Find the node N that you want to delete
• Delete N and all of its descendants, ND
• For each node N' in ND, add N' back into the tree
This is a terrible idea; it is far too inefficient!
Idealized Deletion in Quadtrees
If a point A is to be deleted, find a point B such that the region between A and B (the "hatched region") is empty, and replace A with B
[Figure: points A and B and the empty hatched region between them.]
Why? Because all the remaining points will be in the same quadrants relative to B as they are relative to A. For example, Omaha could replace Chicago as the root.
Problem with Idealized Situation
First problem: a lot of effort is required to find such a B. In the following example, which point (C, F, D, or E) has an empty hatched region with A?
[Figure: points C, F, E, and D scattered around A.]
Answer: none! Second problem: no such B may exist!
Problem with Defining a New Root
Several points will have to be re-positioned
[Figure: replacing the old root with a new root moves points between quadrants, e.g., NW → NE, SW → NW, SW → NE, SE → NE, and SW → SE.]
Deletion Process
Delete P:
1. If P is a leaf, then just delete it!
2. If P has a single child C, then replace P with C
3. For all other cases (a simplified sketch follows):
  3.1 Compute 4 candidate nodes, one for each quadrant under P
  3.2 Select one of the candidate nodes, N, according to certain criteria
  3.3 Delete several nodes under P and collect them in a list, ADD; also delete N
  3.4 Make N.point the new root: P.point ← N.point
  3.5 Re-insert all nodes in ADD
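Steps 1 and 2 are straightforward; the candidate selection of steps 3.1–3.2 is intricate. A simplified, hedged sketch in Python that keeps the shape of the procedure but, instead of the selective re-insertion of steps 3.3–3.5, collects and re-inserts the deleted node's whole subtree (correct, only less efficient, and omitting the single-child shortcut):

def collect(node, out):
    # gather every point in the subtree rooted at node
    if node is not None:
        out.append(node.point)
        for child in node.quad.values():
            collect(child, out)
    return out

def pt_delete(root, x, y):
    # returns the new root of the tree with point (x, y) removed
    if root is None:
        return None
    if (root.point.x, root.point.y) == (x, y):
        add = []                          # the ADD list of the slide
        for child in root.quad.values():
            collect(child, add)
        new_root = None
        for p in add:                     # step 3.5: re-insert
            new_root = pt_insert(QTNode(p), new_root)
        return new_root
    q = pt_compare(DataPoint('query', x, y), root.point)
    root.quad[q] = pt_delete(root.quad[q], x, y)
    return root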
A Word of Warning About Deletion
• In databases, deletion is frequently not done immediately because it is so time-consuming.
• Sometimes even insertions are not done immediately!
• Instead, a log of all deletions (and additions) is kept, and periodically (e.g., every night or weekend) the log is traversed to update the database. This technique is called differential databases.
• Deleting cases is part of the general problem of case-base maintenance.
Properties of Retrieval with Indexed Cases
• Advantages:
  – Efficient retrieval
  – Incremental: there is no need to rebuild the index every time a new case is entered
  – α-error does not occur (no sufficiently similar case is missed)
• Disadvantages:
  – Cost of construction is high
  – Only works for monotonic similarity relations