
ICS 278: Data Mining
Lecture 3: Measurement and Distance
Today’s lecture
• Questions on homework?
• Outline of today’s lecture:
– Initial discussion of project topics
– Chapter 2: Measurement and Data
• Types of measurement
• Distance measures
– Chapter 3: Exploratory Data Analysis
• Visualization methods
• Projection/Dimension-reduction methods
– Principal components analysis, multidimensional scaling
Data Mining Resources
• Online (free) KD Nuggets newsletter
– www.kdnuggets.com
– Tends to be more industry-oriented than research, but
nonetheless interesting
• ACM SIGKDD Conference
– Leading annual conference on DM and knowledge discovery
– Papers provide a snapshot of current DM research
• Machine learning resources
– Journal of Machine Learning Research, www.jmlr.org
– Annual proceedings of NIPS and ICML conferences
Class Projects
• Extensive information available on class Web pages:
– Please look at these Web pages for ideas
• Project proposal will be due no later than end of next week
– Details on format of proposal will be provided this Thursday
Measurement
Mapping domain entities to symbolic representations
[Diagram: entities and relationships in the real world are mapped, via measurement, to data and relationships in the data]
Nominal or Categorical Variable
Here, numerical values just "name" the attribute uniquely;
no ordering is implied.
e.g., jersey numbers in basketball: a player with number 30 is not more
of anything than a player with number 15, and certainly not twice whatever
number 15 is.
Measurements, cont.
ordinal measurement - attributes can be rank-ordered.
Distances between attributes do not have any meaning.
e.g., on a survey you might code Educational Attainment as
0=less than H.S.; 1=some H.S.; 2=H.S. degree; 3=some college;
4=college degree; 5=post college.
In this measure, higher numbers mean more education.
But is the distance from 0 to 1 the same as from 3 to 4? No.
The interval between values is not interpretable in an ordinal measure.
interval measurement - distance between attributes does have meaning.
e.g., when we measure temperature (in Fahrenheit), the distance from
30 to 40 is the same as the distance from 70 to 80. The interval between
values is interpretable. Averages make sense; however, ratios don't:
80 degrees is not twice as hot as 40 degrees.
Measurements, cont.
ratio measurement - there is an absolute zero that is meaningful. This
means that you can construct a meaningful fraction (or ratio) with a ratio
variable.
e.g., many "count" variables are ratio, for example, income (= number
of dollars). Why? Because you can have zero dollars and because it
is meaningful to say that "person A earns twice as much as person B."
Other examples: age, weight, height, etc.
Hierarchy of Measurements
Scales
Scale                 Legal transforms                          Example
Nominal/Categorical   Any one-to-one mapping                    Hair color, employment
Ordinal               Any order-preserving transform            Severity, preference
Interval              Multiply by a constant, add a constant    Temperature, calendar time
Ratio                 Multiply by a constant                    Weight, income
Why is this important?
• As we will see….
– Many algorithms require data to be represented in a specific
form
– e.g., real-valued vectors
• Linear regression, neural networks, support vector machines, etc
• These models implicitly assume interval-scale data (at least)
– What do we do with non-real-valued inputs?
• Nominal with M values:
– Not appropriate to "map" the values to 1 to M (this imposes an interval scale)
– Why? A term like w_1 x employment_type + w_2 x city_name is meaningless
– Could use M binary "indicator" variables instead (a sketch follows below)
» But what if M is very large? (e.g., cluster values into groups)
• Ordinal?
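A minimal sketch of the indicator-variable encoding mentioned above (pure Python; the example values are illustrative, not from the slides):

```python
def one_hot(values):
    """Encode a nominal variable with M distinct values as M binary indicators."""
    categories = sorted(set(values))                 # the M distinct values
    index = {c: i for i, c in enumerate(categories)}
    vectors = [[1 if index[v] == k else 0 for k in range(len(categories))]
               for v in values]
    return vectors, categories

# e.g., a hypothetical employment_type column with M = 3 values:
vectors, cats = one_hot(["clerical", "managerial", "clerical", "technical"])
print(cats)     # ['clerical', 'managerial', 'technical']
print(vectors)  # [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```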
Mixed data
• Many real-world data sets have multiple types of variables,
– e.g., medical diagnosis data on patients and controls
– Categorical (Nominal): employment type, ethnic group
– Ordinal: education level
– Interval: body temperature, pixel values in medical image
– Ratio: income, age, response to drug
• Unfortunately, many data analysis algorithms are suited to only one type of
data (e.g., interval)
• Exception: decision trees
– Trees operate by subgrouping variable values at internal nodes
– Can operate effectively on binary, nominal, ordinal, interval
– We will see more details in later lectures…..
Distance Measures
• Many data mining techniques are based on similarity or distance
measures between objects.
• Two methods for computing similarity or distance:
1. Explicit similarity measurement provided for each pair of objects
2. Similarity computed indirectly from vectors of object attributes
• Metric: d(i,j) is a metric iff
1. d(i,j) ≥ 0 for all i, j, and d(i,j) = 0 iff i = j
2. d(i,j) = d(j,i) for all i and j
3. d(i,j) ≤ d(i,k) + d(k,j) for all i, j, and k (the triangle inequality)
[Diagram: points i, j, k illustrating the triangle inequality]
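As a numerical aside (not from the slides), one can spot-check the three axioms for a candidate distance function on a small sample of points:

```python
import numpy as np

def satisfies_metric_axioms(d, points, tol=1e-12):
    """Spot-check non-negativity, symmetry, and the triangle inequality."""
    for x in points:
        for y in points:
            if d(x, y) < -tol or abs(d(x, y) - d(y, x)) > tol:
                return False
            for z in points:
                if d(x, z) > d(x, y) + d(y, z) + tol:   # triangle inequality
                    return False
    return True

euclid = lambda x, y: float(np.linalg.norm(x - y))
points = np.random.default_rng(0).normal(size=(8, 3))
print(satisfies_metric_axioms(euclid, points))  # True
```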
Vector data and distance matrices
• Data may be available as n "vectors", each p-dimensional
• Or the "data" itself may be an n x n matrix of similarities or distances
Notation
• n objects, each with p measurements
• data vector for ith object
x(i) = (x_1(i), x_2(i), \ldots, x_p(i))
• Data matrix
• x_j(i) is the entry in the ith row, jth column
• columns -> variables
• rows -> data points
• Can define distances/similarities
• between rows (data vectors i)
• between columns (variables j)
High-dimensional data example
(David Scott, Multivariate Density Estimation, Wiley, 1992)
[Figure: a hypersphere inscribed in a hypercube in d dimensions]
• Volume of sphere relative to cube in d dimensions?
Dimension:    2     3     4     5     6     7
Rel. Volume:  0.79  ?     ?     ?     ?     ?
High-dimensional data example
[Figure: a hypersphere inscribed in a hypercube in d dimensions]
Dimension:    2     3     4     5     6     7
Rel. Volume:  0.79  0.53  0.31  0.16  0.08  0.04
• high-d => most data points will be “out” at the corners
• high-d space is sparse and non-intuitive
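As an aside, these relative volumes follow from the standard formula for the volume of a d-ball, V_d(r) = \pi^{d/2} r^d / \Gamma(d/2 + 1), with radius r = 1/2 inside the unit cube; a quick sketch to reproduce the table:

```python
import math

def relative_sphere_volume(d):
    """Volume of the d-ball of radius 1/2 relative to the unit hypercube."""
    r = 0.5
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

for d in range(2, 8):
    print(d, round(relative_sphere_volume(d), 2))
# 0.79, 0.52, 0.31, 0.16, 0.08, 0.04 -- matches the table above to rounding
```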
Distance
• n objects with p measurements
x  ( x1 , x2 ,  , x p )
y  ( y1 , y 2 ,  , y p )
• Most common distance metric is Euclidean distance:
d_E(x, y) = \left( \sum_{k=1}^{p} (x_k - y_k)^2 \right)^{1/2}
• Makes sense in the case where the different measurements are
commensurate, i.e., each variable is measured in the same units.
• If the measurements are different, say length and weight,
Euclidean distance is not necessarily meaningful.
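A direct numpy sketch of d_E:

```python
import numpy as np

def euclidean(x, y):
    """Euclidean distance between two p-dimensional vectors."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return float(np.sqrt(np.sum((x - y) ** 2)))

print(euclidean([0, 0], [3, 4]))  # 5.0
```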
Standardization
When variables are not commensurate, we can standardize them by
dividing by the sample standard deviation. This makes them all equally
important.
The estimate for the standard deviation of xk :
1 n
2
ˆ k     xk  xk  
 n i 1

1
2
where xk is the sample mean:
1 n
xk   xk (i )
n i 1
(When might standardization *not* be such a good idea?
hint: think of extremely skewed data and outliers, e.g., Gates income)
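A sketch of column-wise standardization; note that numpy's default std uses the same 1/n convention as the formula above:

```python
import numpy as np

def standardize(X):
    """Divide each column of an n x p data matrix by its sample std (1/n convention)."""
    X = np.asarray(X, dtype=float)
    return X / X.std(axis=0)      # optionally center first: (X - X.mean(axis=0))

X = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0]])
print(standardize(X).std(axis=0))  # [1. 1.] -- both columns now weighted equally
```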
Weighted Euclidean distance
If we have some idea of the relative importance of
each variable, we can weight them:
d_{WE}(x, y) = \left( \sum_{k=1}^{p} w_k (x_k - y_k)^2 \right)^{1/2}
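The weighted version is a one-line change (the weight vector w is supplied by the caller):

```python
import numpy as np

def weighted_euclidean(x, y, w):
    """Weighted Euclidean distance; w_k reflects the importance of variable k."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    return float(np.sqrt(np.sum(w * (x - y) ** 2)))

# With all weights equal to 1 this reduces to ordinary Euclidean distance:
print(weighted_euclidean([0, 0], [3, 4], [1, 1]))  # 5.0
```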
Other Distance Metrics
• Minkowski or L_\lambda metric:

d(x, y) = \left( \sum_{k=1}^{p} |x_k - y_k|^{\lambda} \right)^{1/\lambda}

• Manhattan, city block, or L_1 metric:

d(x, y) = \sum_{k=1}^{p} |x_k - y_k|

• L_\infty metric:

d(x, y) = \max_{k} |x_k - y_k|
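All three metrics in a few lines of numpy (a sketch; lam stands for λ):

```python
import numpy as np

def minkowski(x, y, lam):
    """L_lambda metric; lam = 2 gives Euclidean, lam = 1 gives Manhattan."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.sum(diff ** lam) ** (1.0 / lam))

def l_infinity(x, y):
    """Limit of the L_lambda metric as lambda -> infinity."""
    diff = np.abs(np.asarray(x, dtype=float) - np.asarray(y, dtype=float))
    return float(np.max(diff))

x, y = [0, 0], [3, 4]
print(minkowski(x, y, 2), minkowski(x, y, 1), l_infinity(x, y))  # 5.0 7.0 4.0
```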
Additive Distances
• Each variable contributes independently to the measure of distance.
• May not always be appropriate… e.g., think of nearest neighbor
classifier
[Figure: objects i and j, each described by many near-duplicate variables height(i), height2(i), …, height100(i) plus a single diameter(i); with an additive distance, the repeated height measurements dominate the lone diameter variable]
Dependence among Variables
• Covariance and correlation measure linear dependence
(distance between variables, not objects)
• Assume we have two variables or attributes X and Y and n objects
taking on values x(1), …, x(n) and y(1), …, y(n). The sample
covariance of X and Y is:
\mathrm{Cov}(X, Y) = \frac{1}{n} \sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})
• The covariance is a measure of how X and Y vary together.
– it will be large and positive if large values of X are associated
with large values of Y, and small values of X with small values of Y
Correlation coefficient
• Covariance depends on ranges of X and Y
• Standardize by dividing by standard deviation
• Linear correlation coefficient is defined as:
\rho(X, Y) = \frac{\sum_{i=1}^{n} (x(i) - \bar{x})(y(i) - \bar{y})}{\left( \sum_{i=1}^{n} (x(i) - \bar{x})^2 \sum_{i=1}^{n} (y(i) - \bar{y})^2 \right)^{1/2}}
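Both quantities are one-liners in numpy; the 1/n factors cancel in ρ, so the result agrees with np.corrcoef (a sketch with made-up data):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.0, 9.8])

cov = np.mean((x - x.mean()) * (y - y.mean()))   # 1/n convention from the slide
rho = cov / (x.std() * y.std())                  # linear correlation coefficient

print(cov, rho)
print(np.corrcoef(x, y)[0, 1])                   # same value of rho
```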
Sample Correlation Matrix
[Figure: sample correlation matrix (scale -1 to +1) for data on characteristics of Boston suburbs; variables include business acreage, nitrous oxide, average # rooms, median house value, and percentage of large residential lots]
Mahalanobis distance (between objects)
d_{MH}(x, y) = \left( (x - y)^{T} \Sigma^{-1} (x - y) \right)^{1/2}

where (x - y) is the vector difference in p-dimensional space, \Sigma^{-1} is the
inverse covariance matrix, and the expression evaluates to a scalar distance.
1. It automatically accounts for the scaling of the coordinate axes
2. It corrects for correlation between the different features
Cost:
1. The covariance matrices can be hard to determine accurately
2. The memory and time requirements grow quadratically, O(p^2), rather
than linearly with the number of features.
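A direct sketch using numpy; here Σ is supplied by the caller, while in practice it would be estimated from the data:

```python
import numpy as np

def mahalanobis(x, y, cov):
    """d_MH(x, y) = ((x - y)^T Sigma^{-1} (x - y))^{1/2}."""
    diff = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

# With Sigma = identity (Example 1 below), MH distance reduces to Euclidean:
print(mahalanobis([0, 0], [3, 4], np.eye(2)))            # 5.0
# Diagonal, non-isotropic Sigma (Example 2): inverse-variance weighting
print(mahalanobis([0, 0], [3, 4], np.diag([9.0, 1.0])))  # sqrt(9/9 + 16/1) ~= 4.12
```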
Example 1 of Mahalanobis distance
Covariance matrix is
diagonal and isotropic
-> all dimensions have
equal variance
-> MH distance reduces
to Euclidean distance
Example 2 of Mahalanobis distance
Covariance matrix is
diagonal but non-isotropic
-> dimensions do not have
equal variance
-> MH distance reduces
to weighted Euclidean
distance with weights
= inverse variance
Example 2 of Mahalanobis distance
The two outer blue
points will have the same MH
distance to the center
blue point
What about…
[Figure: scatterplot of Y versus X]
Are X and Y dependent?
\rho(X, Y) = ?
(covariance and correlation capture only linear dependence)
Distances between Binary Vectors
• Matching coefficient: based on a 2 x 2 table of counts over the variables

            j = 1   j = 0
    i = 1    n11     n10
    i = 0    n01     n00

(e.g., n01 = number of variables where item i = 0 and item j = 1)

matching coefficient = (n11 + n00) / (n11 + n10 + n01 + n00)

• Jaccard coefficient (e.g., for sparse vectors, non-symmetric; ignores 0-0 matches):

Jaccard coefficient = n11 / (n11 + n10 + n01)
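A sketch computing both coefficients from the four counts:

```python
import numpy as np

def binary_counts(u, v):
    """The four cell counts of the 2 x 2 table for binary vectors u and v."""
    u, v = np.asarray(u), np.asarray(v)
    n11 = int(np.sum((u == 1) & (v == 1)))
    n10 = int(np.sum((u == 1) & (v == 0)))
    n01 = int(np.sum((u == 0) & (v == 1)))
    n00 = int(np.sum((u == 0) & (v == 0)))
    return n11, n10, n01, n00

def matching_coefficient(u, v):
    n11, n10, n01, n00 = binary_counts(u, v)
    return (n11 + n00) / (n11 + n10 + n01 + n00)

def jaccard_coefficient(u, v):
    n11, n10, n01, _ = binary_counts(u, v)   # 0-0 matches are ignored
    return n11 / (n11 + n10 + n01)

u = [1, 1, 0, 0, 0, 1]
v = [1, 0, 0, 0, 1, 1]
print(matching_coefficient(u, v), jaccard_coefficient(u, v))  # 0.666..., 0.5
```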
Other distance measures
• Categorical variables
– Number of matches divided by number of dimensions
• Distances between strings of different lengths
– e.g., "Patrick J. Smyth" and "Padhraic Smyth"
– Edit distance (see the sketch after this list)
• Distances between images and waveforms
– Shift-invariant, scale-invariant
– e.g., d(x, y) = min_{a,b} || (ax + b) - y ||
• More generally, kernel methods
• Distances between mixed vector types: combine distance measures
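Edit (Levenshtein) distance is the minimum number of single-character insertions, deletions, and substitutions needed to turn one string into the other; a standard dynamic-programming sketch:

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i                      # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j                      # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution (or match)
    return d[m][n]

print(edit_distance("Padhraic Smyth", "Patrick J. Smyth"))
```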
Transforming Data
• Duality between form of the data and the model
– Useful to bring data onto a “natural scale”
– Some variables are very skewed, e.g., income
• Common transforms: square root, reciprocal, logarithm, raising to a
power
– Often very useful when dealing with skewed real-world data
• Logit: transforms p in (0, 1) to the real line:

logit(p) = \log \frac{p}{1 - p}
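A sketch of the logit transform, with a log transform of skewed income data as a usage example:

```python
import numpy as np

def logit(p):
    """Map proportions in (0, 1) onto the whole real line."""
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

print(logit([0.1, 0.5, 0.9]))   # [-2.197  0.     2.197]

# For skewed positive data such as income, np.log or np.sqrt often helps:
income = np.array([1e4, 3e4, 1e5, 1e7])
print(np.log(income))
```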
Data Quality
• Individual measurements
– Random noise in individual measurements
• Variance (precision)
• Bias
• Random data entry errors
• Noise in label assignment (e.g., class labels in medical data sets)
– Systematic errors
• E.g., all ages > 99 recorded as 99
• More individuals aged 20, 30, 40, etc than expected
– Missing information
• Missing at random
– Questions on a questionnaire that people randomly forget to fill in
• Missing systematically
– Questions that people don’t want to answer
– Patients who are too ill for a certain test
Data Quality
• Collections of measurements
– Ideal case = random sample from population of interest
– Real case = often a biased sample of some sort
– Key point: patterns or models built on the training data may only be
valid on future data that comes from the same distribution
• Examples of non-randomly sampled data
– Medical study where subjects are all students
– Geographic dependencies
– Temporal dependencies
– Stratified samples
• E.g., 50% healthy, 50% ill
– Hidden systematic effects
• E.g., market basket data the weekend of a large sale in the store
• E.g., Web log data during finals week
Next Lecture
• Chapter 3
– Exploratory data analysis and visualization