Web Data Mining
Clustering of Web Documents:
Similarity Metrics
Academic Year 2006/2007
Learning: What Does It Mean?
Definition from Wikipedia:
Machine learning is an area of artificial
intelligence concerned with the development of
techniques which allow computers to "learn".
More specifically, machine learning is a method
for creating computer programs by the analysis
of data sets.
Machine learning overlaps heavily with
statistics, since both fields study the analysis of
data, but unlike statistics, machine learning is
concerned with the algorithmic complexity of
computational implementations.
And… What about
Clustering?
Again from Wikipedia:
Data clustering is a common technique for statistical
data analysis, which is used in many fields, including
machine learning, data mining, pattern recognition,
image analysis and bioinformatics. Clustering is the
classification of similar objects into different groups, or
more precisely, the partitioning of a data set into subsets
(clusters), so that the data in each subset (ideally) share
some common trait - often proximity according to some
defined distance measure.
Machine learning typically regards data clustering as a
form of unsupervised learning.
What is clustering?
Clustering
Example where the Euclidean distance is the distance metric (hierarchical clustering dendrogram).
Clustering: K-Means
Randomly generate k clusters and determine the
cluster centers, or directly generate k seed points
as cluster centers.
Assign each point to the nearest cluster
center.
Recompute the new cluster centers.
Repeat until some convergence criterion is
met (usually that the assignment hasn't
changed).
K-means Clustering
Each point is assigned to the cluster with the closest
center point
The number of clusters, K, must be specified
Basic algorithm
1. Select K points as the initial center points.
Repeat
2. Form K clusters by assigning all points to the closest center point.
3. Recompute the center point of each cluster
Until the center points don’t change
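A minimal sketch of this basic algorithm in Python (pure NumPy; the function name, the data, and the stopping-rule details are illustrative assumptions, not part of the slides):

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    # X is an (n_points, n_features) array; k is the number of clusters.
    rng = np.random.default_rng(seed)
    # 1. Select K points as the initial center points.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # 2. Form K clusters by assigning each point to the closest center point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # 3. Recompute the center point of each cluster (keep the old center if a cluster is empty).
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                                for j in range(k)])
        # Until the center points don't change.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Example usage on random 2-D data (illustrative only).
X = np.random.default_rng(1).normal(size=(200, 2))
labels, centers = kmeans(X, k=3)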
Example 1: data points
Imagine the k-means clustering result.
K-means clustering result
Importance of Choosing Initial Guesses (1)
Importance of Choosing Initial Guesses (2)
Supplement to K-means
The local optima of K-means
K-means aims to minimize the sum of point-to-centroid distances.
This optimization is difficult: after each iteration of K-means the
MSE (mean squared error) decreases, but K-means may converge to a
local optimum, so K-means is sensitive to the initial guesses.
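For reference, the "sum of point-centroid distances" that K-means minimizes can be written (assuming squared Euclidean distance, with $C_j$ the j-th cluster and $\mu_j$ its centroid) as:

$\mathrm{SSE} = \sum_{j=1}^{K} \sum_{x \in C_j} \lVert x - \mu_j \rVert^2$

Each iteration cannot increase this quantity, which is why the MSE decreases monotonically until a (possibly only local) optimum is reached.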
Example 2: data points
Imagine the clustering result.
Example 2: K-means clustering result
Limitations of K-means
K-means has problems when clusters are of differing
Sizes
Densities
Non-globular shapes
K-means has problems when the data contains outliers
One solution is to use many clusters:
find parts of clusters, which then need to be put together.
Clustering: K-Means
The main advantages of this algorithm are its
simplicity and speed, which allow it to run on large
datasets. Its disadvantage is that it does not yield
the same result with each run, since the resulting
clusters depend on the initial random assignments.
It maximizes inter-cluster (or, equivalently, minimizes
intra-cluster) variance, but does not ensure that the
result is a global minimum of the variance.
d_intra or d_inter? That
is the question!
d_intra is the distance among elements
(points or objects, or whatever) of the
same cluster.
d_inter is the distance among clusters.
Questions:
Should we use distance or similarity?
Should we care about inter-cluster distance?
Should we care about cluster shape?
Should we care about clustering?
Distance Functions
Informal definition:
The distance between two points is the length of a
straight line segment between them.
A more formal definition:
A distance between two points P and Q in a metric space
is d(P,Q), where d is the distance function that defines the
given metric space.
We can also define the distance between two sets A and B
in a metric space as being the minimum (or infimum) of
distances between any two points P in A and Q in B.
Distance or Similarity?
In a very straightforward way we can define
the Similarity function
sim: S×S → [0,1] as sim(o1,o2) = 1 - d(o1,o2)
where o1 and o2 are elements of the space S.
d = 0.3 → sim = 0.7
d = 0.1 → sim = 0.9
What does similar (or
distant) really mean?
Learning (either supervised or
unsupervised) is impossible without
ASSUMPTIONS
Watanabe’s Ugly Duckling theorem
Wolpert’s No Free Lunch theorem
Learning is impossible without some
sort of bias.
The Ugly Duckling theorem
The theorem gets its fanciful
name from the following
counter-intuitive statement:
assuming similarity is based
on the number of shared
predicates, an ugly duckling
is as similar to a beautiful
swan A as a beautiful swan B
is to A, given that A and B
differ at all.
It was proposed and proved
by Satosi Watanabe in 1969.
Satosi’s Theorem
Let n be the cardinality of the Universal set S.
We have to classify its objects without prior knowledge of the
essence of the categories.
The number of different classes, i.e. the different ways to
group the objects into clusters, is given by the cardinality of
the Power Set of S:
$|\mathrm{Pow}(S)| = 2^n$
Without any prior information, the most natural way to
measure the similarity between two distinct objects is to
count the number of classes they share.
Oooops… They share exactly the same number of classes,
namely $2^{n-2}$.
The ugly duckling and 3
beautiful swans
S={o1,o2,o3,o4}
Pow(S) = { {},{o1},{o2},{o3},{o4},
{o1,o2},{o1,o3},{o1,o4},
{o2,o3},{o2,o4},{o3,o4},
{o1,o2,o3},{o1,o2,o4},
{o1,o3,o4},{o2,o3,o4},
{o1,o2,o3,o4} }
How many classes do oi and oj (i ≠ j) have in common?
o1 and o3: 4
o1 and o4: 4
The ugly duckling and 3
beautiful swans
In binary: 0000, 0001, 0010, 0100, 1000, …
Choose two objects.
Reorder the bits so that the chosen
objects are represented by the first
two bits.
How many strings share the first two
bits set to 1?
$2^{n-2}$
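A quick sketch in Python (the function and the small universe size are illustrative) that checks the $2^{n-2}$ count by enumerating the power set directly:

from itertools import combinations

def shared_classes(n, a, b):
    # Count the subsets of {0, ..., n-1} that contain both objects a and b.
    count = 0
    for r in range(n + 1):
        for subset in combinations(range(n), r):
            if a in subset and b in subset:
                count += 1
    return count

# For n = 4 every pair of distinct objects shares 2^(4-2) = 4 classes.
print(shared_classes(4, 0, 2), shared_classes(4, 0, 3))  # -> 4 4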
Wolpert’s No Free Lunch
Theorem
For any two algorithms, A and B, there exist datasets for which
algorithm A outperforms algorithm B in prediction accuracy on
unseen instances.
Proof: take any Boolean concept. If A outperforms B on unseen
instances, reverse the labels and B will outperform A.
So Let’s Get Back to
Distances
In a metric space a distance is a function
d: S×S → R so that, if a, b, c are elements of S:
d(a,b) ≥ 0
d(a,b) = 0 iff a=b
d(a,b) = d(b,a)
d(a,c) ≤ d(a,b) + d(b,c)
The fourth property (the triangle inequality)
holds only if we are in a metric space.
Minkowski Distance
Let’s consider two elements of a set S
described by their feature vectors:
x=(x1, x2, …, xn)
y=(y1, y2, …, yn)
The Minkowski Distance is parametric in p ≥ 1:
$d_p(x, y) = \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p}$
p=1. Manhattan Distance
If p = 1 the distance is called Manhattan
Distance.
$d_1(x, y) = \sum_{i=1}^{n} |x_i - y_i|$
It is also called taxicab distance because it
is the distance a car would drive in a city
laid out in square blocks (if there are no
one-way streets).
p=2. Euclidean Distance
If p = 2 the distance is the well known
Euclidean Distance.
$d_2(x, y) = \sqrt{ \sum_{i=1}^{n} (x_i - y_i)^2 }$
p = ∞. Chebyshev Distance
If p = ∞ then we must take the limit:
$d_\infty(x, y) = \lim_{p \to \infty} \left( \sum_{i=1}^{n} |x_i - y_i|^p \right)^{1/p} = \max\left( |x_1 - y_1|, |x_2 - y_2|, \ldots, |x_n - y_n| \right)$
It is also called the chessboard distance.
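A minimal sketch of the whole Minkowski family in Python (NumPy; the example vectors are made up for illustration):

import numpy as np

def minkowski(x, y, p):
    # Minkowski distance: p=1 Manhattan, p=2 Euclidean, p=inf Chebyshev.
    x, y = np.asarray(x, float), np.asarray(y, float)
    if np.isinf(p):
        return np.abs(x - y).max()                  # Chebyshev (chessboard) distance
    return (np.abs(x - y) ** p).sum() ** (1.0 / p)

x, y = [1.0, 2.0, 3.0], [4.0, 0.0, 3.0]
print(minkowski(x, y, 1))       # Manhattan: 5.0
print(minkowski(x, y, 2))       # Euclidean: about 3.606
print(minkowski(x, y, np.inf))  # Chebyshev: 3.0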
2D Cosine Similarity
It is easy to explain in 2D: consider a = (x1, y1) and b = (x2, y2);
the similarity is the cosine of the angle between the two vectors a and b.
Cosine Similarity
Let’s consider two points x, y in Rn.
$\mathrm{sim}(x, y) = \cos\theta = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2} \, \sqrt{\sum_{i=1}^{n} y_i^2}}$
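A small sketch in Python (NumPy; the example vectors are illustrative):

import numpy as np

def cosine_similarity(x, y):
    # cos(theta) between two vectors of R^n.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

print(cosine_similarity([1, 0, 1], [1, 1, 0]))  # 0.5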
Jaccard Distance
Another commonly used distance is the
Jaccard Distance:
$d(x, y) = \frac{\sum_{i=1}^{n} \min(x_i, y_i)}{\sum_{i=1}^{n} \max(x_i, y_i)}$
Binary Jaccard Distance
In the case of binary feature vectors the
Jaccard Distance can be simplified to:
$d(x, y) = \frac{|x \cap y|}{|x \cup y|}$
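A small sketch of both formulas in Python (NumPy; the example vectors are made up). Note that the min/max ratio behaves as a similarity coefficient; a distance in [0, 1] is commonly obtained as one minus this ratio:

import numpy as np

def generalized_jaccard(x, y):
    # sum(min) / sum(max) for non-negative feature vectors.
    x, y = np.asarray(x, float), np.asarray(y, float)
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

def binary_jaccard(x, y):
    # |x AND y| / |x OR y| for binary feature vectors.
    x, y = np.asarray(x, bool), np.asarray(y, bool)
    return (x & y).sum() / (x | y).sum()

print(generalized_jaccard([4, 0, 2, 2], [2, 1, 2, 0]))  # 4/9
print(binary_jaccard([1, 0, 1, 1], [1, 1, 1, 0]))       # 2/4 = 0.5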
Edit Distance
The Levenshtein distance or edit distance
between two strings is given by the minimum
number of operations needed to transform one
string into the other, where an operation is an
insertion, deletion, or substitution of a single
character.
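A hedged sketch of the classical dynamic-programming computation in Python (function name and test strings are illustrative):

def levenshtein(s, t):
    # Minimum number of insertions, deletions, and substitutions turning s into t.
    m, n = len(s), len(t)
    prev = list(range(n + 1))          # edit distances between "" and prefixes of t
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            curr[j] = min(prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          prev[j - 1] + cost)  # substitution (or match)
        prev = curr
    return prev[n]

print(levenshtein("kitten", "sitting"))  # 3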
Binary Edit Distance
The binary edit distance, d(x, y), between a binary
vector x and a binary vector y is the minimum
number of single-bit flips required to transform one
vector into the other:
$d(x, y) = \sum_{i=1}^{n} \mathbf{1}[x_i \neq y_i]$
x=(0,1,0,0,1,1)
y=(1,1,0,1,0,1)
d(x,y)=3
The binary edit distance is equivalent to the Manhattan distance
(Minkowski p = 1) for binary feature vectors.
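A short check in Python, using the example vectors above (NumPy assumed):

import numpy as np

def binary_edit_distance(x, y):
    # Number of bit flips needed; equals the Manhattan distance for binary vectors.
    return int(np.abs(np.asarray(x) - np.asarray(y)).sum())

print(binary_edit_distance([0, 1, 0, 0, 1, 1], [1, 1, 0, 1, 0, 1]))  # 3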
The Curse of High
Dimensionality
Dimensionality is one of the main problems to
face when clustering data.
Roughly speaking, the higher the dimensionality, the
lower the power of recognizing similar objects.
Volume of the Unit-Radius Sphere
Sphere/Cube Volume Ratio
Unit-radius sphere vs. cube whose edge length is 2.
Sphere/Sphere Volume Ratio
Two embedded spheres, with radii 1 and 0.9.
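These ratios can be reproduced numerically. A small sketch in Python (the chosen dimensions are arbitrary), using the unit-ball volume formula $V_n = \pi^{n/2} / \Gamma(n/2 + 1)$:

import math

def unit_ball_volume(n):
    # Volume of the n-dimensional ball of radius 1.
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

for n in (2, 3, 5, 10, 20):
    ball = unit_ball_volume(n)
    cube = 2.0 ** n            # cube whose edge length is 2 (it encloses the unit ball)
    inner = 0.9 ** n           # volume of the radius-0.9 ball relative to the unit ball
    # sphere/cube ratio and the fraction of the unit ball lying in the thin outer shell
    print(n, ball, ball / cube, 1.0 - inner)

Both the sphere/cube ratio and the inner-sphere share collapse toward zero as the dimension grows, which is the geometric effect these slides illustrate.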
Concentration of the
Norm Phenomenon
Gaussian distributions of points (std. dev. = 1) in dimensions 1, 2, 3, 5, 10, and 20.
Probability density function of the distance r from the center of the distribution,
for a point drawn according to such a Gaussian.
Web Document
Representation
The Web can be characterized in three
different ways:
Content.
Structure.
Usage.
We are concerned with Web Content
information.
Bag-of-Words vs. Vector-Space
Let C be a collection of N documents
d1, d2, …, dN
Each document is composed of terms
drawn from a term-set of dimension T.
Each document can be represented in two
different ways:
Bag-of-Words.
Vector-Space.
Bag-of-Words
Document terms: Apple, Peach, Apple, Banana, Apple, Banana, Coffee, Apple, Coffee.
In the bag-of-words model the
document is represented as
d = {Apple, Banana, Coffee, Peach}
Each distinct term is represented.
No information on frequency.
Binary encoding: a t-dimensional
bit vector d = [1, 0, 1, 1, 0, 0, 0, 1].
Vector-Space
Document terms: Apple, Peach, Apple, Banana, Apple, Banana, Coffee, Apple, Coffee.
In the vector-space model the
document is represented as
d = {<Apple,4>, <Banana,2>, <Coffee,2>, <Peach,1>}
Information about frequency is recorded.
t-dimensional vectors:
d = [4, 0, 2, 2, 0, 0, 0, 1].
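A sketch of the two representations in Python, using the terms from the example above (the 8-term term-set and its ordering are assumptions chosen only to match the example vectors):

from collections import Counter

terms = ["Apple", "Peach", "Apple", "Banana", "Apple",
         "Banana", "Coffee", "Apple", "Coffee"]

# Bag-of-words: which terms occur, with no frequency information.
bag = set(terms)                 # {'Apple', 'Banana', 'Coffee', 'Peach'}

# Vector-space: term frequencies.
freq = Counter(terms)            # Apple: 4, Banana: 2, Coffee: 2, Peach: 1

# Encoding against a fixed term-set of dimension t (hypothetical ordering).
term_set = ["Apple", "Avocado", "Banana", "Coffee", "Grape", "Kiwi", "Mango", "Peach"]
bit_vector = [1 if t in bag else 0 for t in term_set]  # [1, 0, 1, 1, 0, 0, 0, 1]
tf_vector = [freq.get(t, 0) for t in term_set]         # [4, 0, 2, 2, 0, 0, 0, 1]
print(bit_vector, tf_vector)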
Typical Web
Collection Dimensions
No. of documents: 20B
No. of terms: approx. 150,000,000!!!
Very high dimensionality…
Each document contains from 100 to 1000
terms.
Classical clustering algorithms cannot be
used:
each document is very similar to the others, due
to the geometric properties just seen.
We Need to Cope with
High Dimensionality
Possible solutions:
JUST ONE… Reduce Dimensions!!!!
Dimensionality
reduction
Dimensionality reduction approaches can be
divided into two categories:
Feature selection approaches try to find a subset of the
original features. Optimal feature selection for supervised
learning problems requires an exhaustive search of all
possible subsets of features of the chosen cardinality.
Feature extraction applies a mapping of the
multidimensional space into a space of fewer dimensions.
This means that the original feature space is transformed,
e.g. by applying a linear transformation via principal
components analysis.
PCA - Principal
Component Analysis
In statistics, principal components analysis (PCA) is
a technique that can be used to simplify a dataset.
More formally it is a linear transformation that
chooses a new coordinate system for the data set
such that the greatest variance by any projection
of the data set comes to lie on the first axis (then
called the first principal component), the second
greatest variance on the second axis, and so on.
PCA can be used for reducing the dimensionality of a
dataset while retaining those characteristics of
the dataset that contribute most to its variance, by
eliminating the later principal components (by a
more or less heuristic decision).
The Method
Suppose you have a random vector
population x:
$x = (x_1, x_2, \ldots, x_n)^T$
and the mean:
$\mu_x = E\{x\}$
and the covariance matrix:
$C_x = E\{(x - \mu_x)(x - \mu_x)^T\}$
The Method
The components of Cx, denoted by cij, represent the
covariances between the random variable
components xi and xj. The component cii is the
variance of the component xi. The variance of a
component indicates the spread of the component
values around its mean value. If two components xi
and xj of the data are uncorrelated, their covariance
is zero (cij = cji = 0).
The covariance matrix is, by definition, always
symmetric.
Taking a sample of vectors x1, x2, …, xM, we can
calculate the sample mean and the sample
covariance matrix as estimates of the mean and
the covariance matrix.
Computationally
expensive
task
From a symmetric matrix such as the covariance
matrix, we can calculate an orthogonal basis by
finding its eigenvalues and eigenvectors. The
eigenvectors $e_i$ and the corresponding eigenvalues
$\lambda_i$ are the solutions of the equation:
$C_x e_i = \lambda_i e_i \;\Longrightarrow\; |C_x - \lambda I| = 0$
The Method
By ordering the eigenvectors in the order of
descending eigenvalues (largest first), one
can create an ordered orthogonal basis
with the first eigenvector having the
direction of largest variance of the data. In
this way, we can find directions in which
the data set has the most significant
amounts of energy.
To reduce the dimensionality from n to k, keep
only the first k eigenvectors.
A Graphical Example
PCA Eigenvectors
Projection
Project the data onto the selected
eigenvectors:
$\hat{x} = \sum_{i=1}^{k} a_i e_i, \qquad a_i = x \cdot e_i$
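A compact sketch of the whole procedure in Python (NumPy; the function name and the random sample are illustrative, not part of the slides):

import numpy as np

def pca_reduce(X, k):
    # Project the rows of X (samples x features) onto the first k principal components.
    mean = X.mean(axis=0)                 # sample mean
    Xc = X - mean                         # centered data
    C = np.cov(Xc, rowvar=False)          # sample covariance matrix
    eigvals, eigvecs = np.linalg.eigh(C)  # C is symmetric, so eigh applies
    order = np.argsort(eigvals)[::-1]     # sort eigenvalues in descending order
    E = eigvecs[:, order[:k]]             # first k eigenvectors as columns
    A = Xc @ E                            # coefficients a_i = x . e_i for each sample
    return A, E, mean

X = np.random.default_rng(0).normal(size=(100, 5))
A, E, mean = pca_reduce(X, k=2)
print(A.shape)  # (100, 2)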
Another Example
Singular Value
Decomposition
It is a technique used for reducing
dimensionality, based on some properties of
symmetric matrices.
It will be the subject of a talk given by one
of you!!!!
Locality-Sensitive
Hashing
The key idea of this approach is to create a
small signature for each document, to
ensure that similar documents have similar
signatures.
There exists a family H of hash functions
such that for each pair of pages u, v we
have Pr[mh(u) = mh(v)] = sim(u,v), where
the hash function mh is chosen at random
from the family H.
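One classical family with this property is min-wise hashing over the items (e.g. shingles) of each page, where sim(u, v) is the Jaccard coefficient. A hedged sketch in Python (hash parameters and the toy term sets are made up; the agreement rate of the signatures only estimates the similarity):

import random

def minhash_signature(items, num_hashes=200, prime=2_147_483_647, seed=0):
    # One min-hash value per random linear hash function (a*h + b) mod prime.
    rng = random.Random(seed)
    params = [(rng.randrange(1, prime), rng.randrange(prime)) for _ in range(num_hashes)]
    return [min((a * hash(x) + b) % prime for x in items) for a, b in params]

def estimated_similarity(sig_u, sig_v):
    # Fraction of signature positions on which the two pages agree.
    return sum(su == sv for su, sv in zip(sig_u, sig_v)) / len(sig_u)

u = {"web", "data", "mining", "cluster"}
v = {"web", "data", "mining", "graph"}
print(estimated_similarity(minhash_signature(u), minhash_signature(v)))  # roughly 0.6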
Locality-Sensitive
Hashing
It will be the subject of a talk given by one of
you!!!!!!!