
Data Mining
Spring 2013
Data Quality: Missing Values Imputation Using Mean, Median, and the k-Nearest Neighbor Approach
Distance Measure
Distance Measures
• Remember that k-Nearest Neighbors are determined on the basis of some kind of “distance” between points.
• Two major classes of distance measure:
  1. Euclidean: based on the position of points in some k-dimensional space.
  2. Non-Euclidean: not related to position or space.
Scales of Measurement
• Applying a distance measure largely depends on the type of input data.
• Major scales of measurement:
  1. Nominal Data (aka Nominal Scale Variables)
     • Typically classification data, e.g. M/F.
     • No ordering, e.g. it makes no sense to state that M > F.
     • Binary variables are a special case of nominal scale variables.
  2. Ordinal Data (aka Ordinal Scale)
     • Ordered, but differences between values are not important.
     • e.g., political parties on a left-to-right spectrum given labels 0, 1, 2
     • e.g., Likert scales: rank your degree of satisfaction on a scale of 1..5
     • e.g., restaurant ratings
Scales of Measurement
• Applying a distance function largely depends on the type of input data.
• Major scales of measurement:
  3. Numeric Data (aka interval-scaled)
     • Ordered, with equal intervals; measured on a linear scale.
     • Differences make sense.
     • e.g., temperature (C, F), height, weight, age, date
Scales of Measurement
Only certain operations can be performed on certain scales of measurement:
1. Equality
2. Count
3. Rank (cannot quantify the difference)
4. Quantify the difference
• Nominal scale: equality, count.
• Ordinal scale: equality, count, rank.
• Interval scale: equality, count, rank, quantify the difference.
Axioms of a Distance Measure
d is a distance measure if it is a function from
pairs of points to reals such that:
1. d(x,x) = 0.
2. d(x,y) = d(y,x).
3. d(x,y) > 0 if x ≠ y.
Some Euclidean Distances
• L2 norm (also common or Euclidean distance):
  d(i, j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \cdots + |x_{ip} - x_{jp}|^2}
  – The most common notion of “distance.”
• L1 norm (also Manhattan distance):
  d(i, j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \cdots + |x_{ip} - x_{jp}|
  – The distance if you had to travel along coordinates only.
Examples L1 and L2 norms
• x = (5, 5), y = (9, 8)
• L2-norm: dist(x, y) = \sqrt{4^2 + 3^2} = 5
• L1-norm: dist(x, y) = 4 + 3 = 7
(Figure: the two points differ by 4 in the first coordinate and by 3 in the second; the straight-line distance between them is 5.)
Another Euclidean Distance
• L∞ norm: d(x, y) = the maximum of the absolute differences between x and y in any dimension (see the sketch below).
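A minimal Python sketch of the three norms above (standard library only; the point coordinates reuse the x = (5, 5), y = (9, 8) example from the earlier slide):

```python
import math

def l1_distance(x, y):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def l2_distance(x, y):
    """Euclidean distance: square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def linf_distance(x, y):
    """L-infinity distance: largest absolute difference in any dimension."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = (5, 5), (9, 8)
print(l2_distance(x, y))    # 5.0
print(l1_distance(x, y))    # 7
print(linf_distance(x, y))  # 4
```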
Non-Euclidean Distances
• Jaccard measure for binary vectors
• Cosine measure = angle between vectors
from the origin to the points in question.
• Edit distance = number of inserts and
deletes to change one string into another.
Jaccard Measure
• A note about Binary variables first
– Symmetric binary variable
• If both states are equally valuable and carry the same weight, that
is, there is no preference on which outcome should be coded as 0
or 1.
• Like “gender” having the states male and female
– Asymmetric binary variable:
• If the outcomes of the states are not equally important, such as
the positive and negative outcomes of a disease test.
• We should code the rarest one by 1 (e.g., HIV positive), and the
other by 0 (HIV negative).
– Given two asymmetric binary variables, the agreement of
two 1s (a positive match) is then considered more
important than that of two 0s (a negative match).
Jaccard Measure
• A contingency table for binary data (rows: object i, columns: object j):

            j = 1    j = 0    sum
  i = 1     a        b        a + b
  i = 0     c        d        c + d
  sum       a + c    b + d    p

• Simple matching coefficient (invariant if the binary variable is symmetric):
  d(i, j) = \frac{b + c}{a + b + c + d}
• Jaccard coefficient (non-invariant if the binary variable is asymmetric):
  d(i, j) = \frac{b + c}{a + b + c}
Jaccard Measure Example
• Example:

  Name   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   Y       N       P        N        N        N
  Mary   Y       N       P        N        P        N
  Jim    Y       P       N        N        N        N

  – All attributes are asymmetric binary.
  – Let the values Y and P be set to 1, and the value N be set to 0.
• Using the Jaccard coefficient d(i, j) = \frac{b + c}{a + b + c}, with a, b, c read off the contingency table of the previous slide:
  d(Jack, Mary) = \frac{0 + 1}{2 + 0 + 1} = 0.33
  d(Jack, Jim) = \frac{1 + 1}{1 + 1 + 1} = 0.67
  d(Jim, Mary) = \frac{1 + 2}{1 + 1 + 2} = 0.75
Cosine Measure
• Think of a point as a vector from the origin (0, 0, …, 0) to its location.
• Two points’ vectors make an angle, whose cosine is the normalized dot product of the vectors:
  dist(p1, p2) = \theta = \arccos\left(\frac{p1 \cdot p2}{|p1| \, |p2|}\right)
• Example: p1 \cdot p2 = 2; |p1| = |p2| = \sqrt{3}; so \cos(\theta) = 2/3 and \theta is about 48 degrees.
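A minimal sketch of the cosine measure in Python. The two binary vectors are hypothetical, chosen only so that p1·p2 = 2 and |p1| = |p2| = √3, reproducing the slide's numbers:

```python
import math

def cosine_distance_degrees(p1, p2):
    """Angle (in degrees) between the vectors from the origin to p1 and p2."""
    dot = sum(a * b for a, b in zip(p1, p2))
    norm1 = math.sqrt(sum(a * a for a in p1))
    norm2 = math.sqrt(sum(b * b for b in p2))
    return math.degrees(math.acos(dot / (norm1 * norm2)))

# Hypothetical binary vectors with p1.p2 = 2 and |p1| = |p2| = sqrt(3),
# so cos(theta) = 2/3 and theta is about 48 degrees.
p1 = (0, 0, 1, 1, 1)
p2 = (1, 0, 0, 1, 1)
print(round(cosine_distance_degrees(p1, p2), 1))  # 48.2
```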
Edit Distance
• The edit distance of two strings is the
number of inserts and deletes of characters
needed to turn one into the other.
• Equivalently, d(x,y) = |x| + |y| -2|LCS(x,y)|.
– LCS = longest common subsequence = longest
string obtained both by deleting from x and
deleting from y.
Example
• x = abcde ; y = bcduve.
• LCS(x,y) = bcde.
• d(x, y) = |x| + |y| - 2|LCS(x, y)| = 5 + 6 - 2*4 = 3.
• What is left? Normalize it to the range [0, 1]. We will study normalization formulas in the next lecture.
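A short Python sketch of this insert/delete edit distance via the LCS length, reproducing the example above:

```python
# Edit distance with inserts and deletes only: d(x, y) = |x| + |y| - 2*|LCS(x, y)|.
def lcs_length(x, y):
    """Length of the longest common subsequence, by dynamic programming."""
    m, n = len(x), len(y)
    table = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[m][n]

def edit_distance(x, y):
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(lcs_length("abcde", "bcduve"))    # 4  (LCS = "bcde")
print(edit_distance("abcde", "bcduve")) # 3
```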
Back to k-Nearest Neighbor (Pseudo-code)
• Missing values imputation using k-NN.
• Input: dataset D, number of neighbors K
• for each record x in D with at least one missing value:
  – for each data object y in D:
    • compute the distance(x, y)
    • save the distance and y in a similarity array S
  – sort the array S in ascending order of distance
  – pick the top K data objects from S
  – impute the missing attribute value(s) of x on the basis of the known values of S (use mean/median or mode), as in the sketch below
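A minimal Python sketch of this imputation loop, assuming a purely numeric dataset stored as lists of numbers with None marking missing values. The distance is Euclidean over the attributes both records have present, neighbors are restricted to complete records so that a mean can be taken, and the names are illustrative rather than the lecture's exact algorithm:

```python
# Sketch of k-NN missing-value imputation on a numeric dataset.
# A record is a list of numbers, with None marking a missing value.
import math

def distance(x, y):
    """Euclidean distance over the attributes present in both records."""
    shared = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    return math.sqrt(sum((a - b) ** 2 for a, b in shared))

def knn_impute(data, k):
    for x in data:
        if None not in x:
            continue                      # no missing value in this record
        # distances of x to every complete record, sorted from nearest to farthest
        neighbors = sorted(
            (y for y in data if y is not x and None not in y),
            key=lambda y: distance(x, y),
        )[:k]
        for i, value in enumerate(x):
            if value is None:             # impute with the neighbors' mean
                x[i] = sum(y[i] for y in neighbors) / len(neighbors)
    return data

data = [[1.0, 2.0, 3.0], [1.2, 2.1, 2.9], [0.9, None, 3.1], [5.0, 6.0, 7.0]]
print(knn_impute(data, k=2))  # the None is replaced by the mean of its 2 nearest rows
```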
K-Nearest Neighbor Drawbacks
• The major drawbacks of this approach are:
  – choosing an appropriate distance function;
  – considering all attributes when attempting to retrieve similar examples;
  – searching through the entire dataset to find instances of the same type;
  – algorithm cost: ?
Noisy Data
• Noise: random error; the data is present but not correct.
  – Data transmission errors
  – Data entry problems
• Removing noise:
  – Data smoothing (rounding, averaging within a window)
  – Clustering/merging and detecting outliers
• Data smoothing:
  – First sort the data and partition it into (equi-depth) bins.
  – Then smooth the values in each bin by bin means, bin medians, bin boundaries, etc.
Noisy Data (Binning Methods)
Sorted data for price (in dollars):
4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
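A short Python sketch of equi-depth binning with smoothing by bin means and bin boundaries, reproducing the price example above:

```python
# Equi-depth binning with smoothing by bin means and by bin boundaries.
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

def equi_depth_bins(values, n_bins):
    size = len(values) // n_bins
    return [values[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    return [[round(sum(b) / len(b)) for _ in b] for b in bins]

def smooth_by_boundaries(bins):
    # replace each value by the closer of the bin's minimum and maximum
    return [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b] for b in bins]

bins = equi_depth_bins(prices, 3)
print(bins)                       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(smooth_by_means(bins))      # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(smooth_by_boundaries(bins)) # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```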
Noisy Data (Clustering)
• Outliers may be detected by clustering, where similar values are organized into groups or “clusters”.
• Values that fall outside of the set of clusters may be considered outliers.
Data Discretization
• The task of attribute (feature)-discretization techniques is to
discretize the values of continuous features into a small number
of intervals, where each interval is mapped to a discrete symbol.
• Advantages:
  – Simplified data description, and easy-to-understand data and final data-mining results.
  – Only a small set of interesting rules is mined.
  – Processing time for the end results is decreased.
  – Accuracy of the end results is improved.
Effect of Continuous Data on Results Accuracy
Training data:

  age     income   age (continuous)   buys_computer
  <=30    medium   9                  no
  <=30    medium   10                 no
  <=30    medium   11                 no
  <=30    medium   12                 no

Data mining: discover only those rules whose support (frequency) is >= 1.

• If age <= 30 and income = ‘medium’ and age = ‘9’ then buys_computer = ‘no’
• If age <= 30 and income = ‘medium’ and age = ‘10’ then buys_computer = ‘no’
• If age <= 30 and income = ‘medium’ and age = ‘11’ then buys_computer = ‘no’
• If age <= 30 and income = ‘medium’ and age = ‘12’ then buys_computer = ‘no’

Test data:

  age     income   age (continuous)   buys_computer
  <=30    medium   9                  ?
  <=30    medium   11                 ?
  <=30    medium   13                 ?

Because the continuous value age = 13 never appears in the training dataset, no rule covers the third record, so the prediction accuracy decreases to 66.7%.
Entropy-Based Discretization
• Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2), with Ent(S_1) = -\sum_i p_i \log_2 p_i
• where p_i is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1 (and analogously for S2).
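A small Python sketch of these two formulas (class labels can be any hashable values); with the Grade data from Example 1 below it reproduces the T = 21.5 computation:

```python
import math
from collections import Counter

def entropy(labels):
    """Ent(S) = -sum_i p_i * log2(p_i), with p_i the class proportions in S."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(s1_labels, s2_labels):
    """E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)."""
    total = len(s1_labels) + len(s2_labels)
    return (len(s1_labels) / total) * entropy(s1_labels) \
         + (len(s2_labels) / total) * entropy(s2_labels)

# Boundary T = 21.5 from Example 1: S1 holds one F, S2 holds two Fs and six Ps.
print(round(split_entropy(["F"], ["F", "F", "P", "P", "P", "P", "P", "P"]), 4))  # 0.7211
```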
Example 1
  ID   Age   Grade
  1    21    F
  2    22    F
  3    24    P
  4    25    F
  5    27    P
  6    27    P
  7    27    P
  8    35    P
  9    41    P

• Let Grade be the class attribute. Use entropy-based discretization to divide the range of ages into different discrete intervals.
• There are 6 possible boundaries, taken at the midpoints between consecutive distinct ages, e.g. (21 + 22) / 2 = 21.5 and (22 + 24) / 2 = 23. They are 21.5, 23, 24.5, 26, 31, and 38.
• Let us consider the boundary at T = 21.5:
  S1 = {21}
  S2 = {22, 24, 25, 27, 27, 27, 35, 41}
Example 1 (cont’)
(The Age/Grade table of the previous slide is repeated on this slide.)
• The number of elements in S1 and S2 are: |S1| = 1, |S2| = 8
• The entropy of S1 is
  Ent(S_1) = -P(Grade = F) \log_2 P(Grade = F) - P(Grade = P) \log_2 P(Grade = P)
           = -(1) \log_2(1) - (0) \log_2(0) = 0   (taking 0 \cdot \log_2 0 = 0)
• The entropy of S2 is
  Ent(S_2) = -P(Grade = F) \log_2 P(Grade = F) - P(Grade = P) \log_2 P(Grade = P)
           = -(2/8) \log_2(2/8) - (6/8) \log_2(6/8) \approx 0.8113
Example 1 (cont’)
• Hence, the entropy after partitioning at T = 21.5 is
  E(S, T) = \frac{|S_1|}{|S|} Ent(S_1) + \frac{|S_2|}{|S|} Ent(S_2)
          = \frac{1}{9} Ent(S_1) + \frac{8}{9} Ent(S_2)
          = \frac{1}{9}(0) + \frac{8}{9}(0.8113) \approx 0.7211
Example 1 (cont’)
• The entropies after partitioning for all the boundaries are:
  T = 21.5: E(S, 21.5)
  T = 23:   E(S, 23)
  ...
  T = 38:   E(S, 38)
• Select the boundary with the smallest entropy; suppose the best is T = 23.
• Now recursively apply entropy-based discretization to both resulting partitions.
(The Age/Grade table of the previous slide is repeated on this slide.)
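A self-contained Python sketch that evaluates every candidate boundary on the Age/Grade data and reports the one with the smallest split entropy. (The slide's "suppose the best is T = 23" is an illustrative assumption; on this data the smallest E(S, T) actually occurs at T = 26, which separates the ages that still contain an F from the purely P-graded ages.)

```python
import math
from collections import Counter

ages   = [21, 22, 24, 25, 27, 27, 27, 35, 41]
grades = ["F", "F", "P", "F", "P", "P", "P", "P", "P"]

def entropy(labels):
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def split_entropy(t):
    """E(S, T) for the split age <= t versus age > t."""
    s1 = [g for a, g in zip(ages, grades) if a <= t]
    s2 = [g for a, g in zip(ages, grades) if a > t]
    return (len(s1) * entropy(s1) + len(s2) * entropy(s2)) / len(ages)

# Candidate boundaries: midpoints between consecutive distinct ages.
distinct = sorted(set(ages))
boundaries = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
for t in boundaries:                       # 21.5, 23.0, 24.5, 26.0, 31.0, 38.0
    print(t, round(split_entropy(t), 4))
print("best boundary:", min(boundaries, key=split_entropy))
```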
References
– G. Batista and M. Monard, “A Study of K-Nearest Neighbour as an Imputation Method”, 2002 (available in the course folder).
– Jeff D. Ullman, “CS345 Lecture Notes”, Stanford University. http://wwwdb.stanford.edu/~ullman/cs345-notes.html
– Vipin Kumar’s data mining course offered at the University of Minnesota.
– Official textbook slides of Jiawei Han and Micheline Kamber, “Data Mining: Concepts and Techniques”, Morgan Kaufmann Publishers, August 2000.