Power Iteration Clustering - Carnegie Mellon School of Computer Science

Scalable Methods for Graph-Based
Unsupervised and Semi-Supervised Learning
Frank Lin
Language Technologies Institute
School of Computer Science
Carnegie Mellon University
PhD Thesis Proposal
October 8, 2010, Pittsburgh, PA, USA
Thesis Committee
William W. Cohen (Chair), Christos Faloutsos, Tom Mitchell, Xiaojin Zhu
Motivation
• Graph data is everywhere, and some of it is very big
• We want to do machine learning on them
• For non-graph data, it often makes sense to represent them as graphs for
unsupervised and semi-supervised learning: these methods often require a
similarity measure between data points, which is naturally represented as
edges between nodes
2
Motivation
• We want to do clustering and semi-supervised
learning on large graphs
• We need scalable methods that provide
quality results
3
Thesis Goal
• Make contributions toward fast, space-efficient, effective, and simple
graph-based learning methods that scale up to large datasets.
4
Road Map
[Diagram: basic learning methods (Power Iteration Clustering for
unsupervised/clustering, MultiRankWalk for semi-supervised learning), their
induced-graph variants (PIC with Path Folding, MRW with Path Folding),
supporting extensions (avoiding collision, hierarchical clustering),
distributed implementations, and applications to a web-scale knowledge base
(noun phrase categorization, learning synonym NPs, learning polyseme NPs,
finding new types and relations).]
5
Road Map
[The same diagram, annotated with prior vs. proposed work: Power Iteration
Clustering, PIC with Path Folding, and MultiRankWalk are prior work; MRW
with Path Folding, the knowledge-base applications, the supporting
extensions, and the distributed implementations are proposed work.]
6
Talk Outline
• Prior Work
• Power Iteration Clustering (PIC)
• PIC with Path Folding
• MultiRankWalk (MRW)
• Proposed Work
• MRW with Path Folding
• MRW for Populating a Web-Scale Knowledge Base
• Noun Phrase Categorization
• PIC for Extending a Web-Scale Knowledge Base
• Learning Synonym NP’s
• Learning Polyseme NP’s
• Finding New Ontology Types and Relations
• PIC Extensions
• Distributed Implementations
7
Power Iteration Clustering
• Spectral clustering methods are nice, and a natural choice for graph data
• But they are rather expensive (slow)
Power iteration clustering (PIC) can provide a similar solution at a very
low cost (fast)!
8
Background: Spectral Clustering
• Idea: instead of clustering data points in their
original (Euclidean) space, cluster them in the
space spanned by the “significant”
eigenvectors of a (Laplacian) similarity matrix
A popular spectral
clustering method:
normalized cuts (NCut)
9
Background: Spectral Clustering
[Figure: a three-cluster dataset and its normalized cut results; the 2nd and
3rd smallest eigenvectors plotted by cluster index, forming the clustering
space.]
10
Background: Spectral Clustering
Finding eigenvectors and eigenvalues of a matrix is very slow in general.
• Normalized Cut algorithm (Shi & Malik 2000):
1. Choose k and a similarity function s
2. Derive A from s, let W = I - D^-1 A, where I is the identity matrix and D
is a diagonal matrix with Dii = Σj Aij
3. Find eigenvectors and corresponding eigenvalues of W
4. Pick the k eigenvectors of W with the 2nd to kth smallest corresponding
eigenvalues as "significant" eigenvectors
5. Project the data points onto the space spanned by these vectors
6. Run k-means on the projected data points
Can we find a similar low-dimensional embedding for clustering without
eigenvectors?
11
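For concreteness, here is a minimal sketch of the NCut pipeline above in Python; NumPy and scikit-learn are assumed available, and the function name and the dense eigensolver are illustrative choices, not anything from the slides.

```python
# Minimal sketch of the NCut-style pipeline above (illustrative, not the
# thesis implementation). A is a symmetric, non-negative affinity matrix.
import numpy as np
from sklearn.cluster import KMeans

def ncut_spectral_clustering(A, k):
    d = A.sum(axis=1)                         # D_ii = sum_j A_ij
    W = np.eye(A.shape[0]) - A / d[:, None]   # W = I - D^-1 A  (step 2)
    vals, vecs = np.linalg.eig(W)             # step 3 (slow in general!)
    order = np.argsort(vals.real)
    embedding = vecs[:, order[1:k]].real      # 2nd..kth smallest (steps 4-5)
    return KMeans(n_clusters=k, n_init=10).fit_predict(embedding)  # step 6
```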
The Power Iteration
• Or the power method, is a simple iterative method for finding the dominant
eigenvector of a matrix:
v^(t+1) = c W v^t
where W is a square matrix, v^t is the vector at iteration t (v^0 is
typically a random vector), and c is a normalizing constant that keeps v^t
from getting too large or too small.
Typically converges quickly; fairly efficient if W is a sparse matrix.
12
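As a minimal sketch of this update (ours, purely illustrative), the power method in a few lines of Python, assuming NumPy:

```python
# A small power-method sketch: repeatedly apply W and renormalize so the
# vector neither blows up nor vanishes. Names are illustrative.
import numpy as np

def power_method(W, num_iters=100, seed=0):
    v = np.random.default_rng(seed).random(W.shape[0])  # v^0: random vector
    for _ in range(num_iters):
        v = W @ v                      # v^(t+1) = W v^t
        v /= np.abs(v).sum()           # c: normalizing constant
    return v                           # approaches the dominant eigenvector
```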
The Power Iteration
• Or the power method, is a simple iterative method for finding the dominant
eigenvector of a matrix:
v^(t+1) = c W v^t
What if we let W = D^-1 A, the row-normalized similarity matrix (like
Normalized Cut)?
13
The Power Iteration
14
Power Iteration Clustering
• The 2nd to kth eigenvectors of W=D-1A are
roughly piece-wise constant with respect to
the underlying clusters, each separating a
cluster from the rest of the data (Meila & Shi
2001)
• The linear combination of piece-wise constant
vectors is also piece-wise constant!
15
Spectral Clustering
[Figure repeated: dataset and normalized cut results; the 2nd and 3rd
smallest eigenvectors plotted by cluster index.]
16
[Figure: a · (2nd eigenvector) + b · (3rd eigenvector) = a vector that is
still roughly piece-wise constant across the clusters.]
17
Power Iteration Clustering
[Figure: dataset and PIC results, with the PIC vector v^t.]
Key idea: to do clustering, we may not need all the information in a full
spectral embedding (e.g., distance between clusters in a k-dimensional
eigenspace); we just need the clusters to be separated in some space.
18
When to Stop
Recall:
v^t = c1 λ1^t e1 + … + ck λk^t ek + ck+1 λk+1^t ek+1 + … + cn λn^t en
Then:
v^t / (c1 λ1^t) = e1 + … + (ck/c1)(λk/λ1)^t ek
+ (ck+1/c1)(λk+1/λ1)^t ek+1 + … + (cn/c1)(λn/λ1)^t en
At the beginning, v^t changes fast ("accelerating") as it converges locally,
due to the "noise terms" (k+1 … n) with small λ.
When the "noise terms" have gone to zero, v^t changes slowly ("constant
speed") because only the larger-λ terms (2 … k) are left, where the
eigenvalue ratios are close to 1.
Because they are raised to the power t, the eigenvalue ratios determine how
fast v^t converges to e1.
Power Iteration Clustering
• A basic power iteration clustering (PIC) algorithm:
Input: A row-normalized affinity matrix W and the number of clusters k
Output: Clusters C1, C2, …, Ck
1. Pick an initial vector v0
2. Repeat
• Set vt+1 ← Wvt
• Set δt+1 ← |vt+1 – vt|
• Increment t
• Stop when |δt – δt-1| ≈ 0
3. Use k-means to cluster points on vt and return clusters C1, C2, …, Ck
20
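A minimal sketch of this loop in Python might look as follows; NumPy and scikit-learn are assumed, and the tolerance, the normalization, and the helper name are illustrative choices not fixed by the slide.

```python
# Sketch of the basic PIC algorithm above: power iteration on W = D^-1 A,
# stopped when the change in velocity |delta^t - delta^(t-1)| is near zero,
# then k-means on the one-dimensional embedding v^t.
import numpy as np
from sklearn.cluster import KMeans

def pic(A, k, max_iters=1000, eps=1e-5):
    W = A / A.sum(axis=1, keepdims=True)       # row-normalized affinity matrix
    v = np.random.rand(A.shape[0])
    v /= np.abs(v).sum()
    delta_prev = np.full_like(v, np.inf)
    for _ in range(max_iters):
        v_new = W @ v
        v_new /= np.abs(v_new).sum()           # keep v from shrinking/growing
        delta = np.abs(v_new - v)              # delta^(t+1) = |v^(t+1) - v^t|
        v = v_new
        if np.max(np.abs(delta - delta_prev)) < eps:
            break                              # "acceleration" is ~0: stop
        delta_prev = delta
    return KMeans(n_clusters=k, n_init=10).fit_predict(v.reshape(-1, 1))
```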
PIC Runtime
[Table: runtime comparison of PIC against Normalized Cut and a faster
Normalized Cut implementation; some runs ran out of memory (24GB).]
21
PIC Accuracy on Network Datasets
[Scatter plot: PIC accuracy vs. NCut or NJW accuracy on network datasets.
Upper triangle: PIC does better; lower triangle: NCut or NJW does better.]
Clustering Text Data
• Spectral clustering methods are nice
• We want to use them for clustering (a lot of) text data
24
The Problem with Text Data
• Documents are often represented as feature vectors of words:
[Figure: three example documents ("The importance of a Web page is an
inherently subjective matter, which depends on the readers…", "In this
paper, we present Google, a prototype of a large-scale search engine which
makes heavy use…", "You're not cool just because you have a lot of followers
on twitter, get over yourself…") and their word-count feature vectors over a
vocabulary including cool, web, search, make, over, you.]
25
The Problem with Text Data
• Feature vectors are often sparse
• But the similarity matrix is not!
The feature vectors are mostly zeros - any document contains only a small
fraction of the vocabulary.
The similarity matrix is mostly non-zero - any two documents are likely to
have a word in common.
26
The Problem with Text Data
• A similarity matrix is the input to many clustering methods, including
spectral clustering
• Spectral clustering requires the computation of the eigenvectors of a
similarity matrix of the data
Too expensive! It does not scale up to big datasets:
– O(n^2) time to construct
– O(n^2) space to store
– more than O(n^2) time to operate on (in general O(n^3); approximation
methods are still not very fast)
27
The Problem with Text Data
• We want to use the similarity matrix for
clustering (like spectral clustering), but:
– Without calculating eigenvectors
– Without constructing or storing the similarity
matrix
Power Iteration
Clustering
+ Path Folding
28
Path Folding
Okay, we have a fast clustering method (PIC) – but there's the W that
requires O(n^2) storage space and construction and operation time!
• A basic power iteration clustering (PIC) algorithm:
Input: A row-normalized affinity matrix W and the number of clusters k
Output: Clusters C1, C2, …, Ck
1. Pick an initial vector v0
2. Repeat
• Set vt+1 ← Wvt (the key operation in PIC – note: a matrix-vector
multiplication!)
• Set δt+1 ← |vt+1 – vt|
• Increment t
• Stop when |δt – δt-1| ≈ 0
3. Use k-means to cluster points on vt and return clusters C1, C2, …, Ck
29
Path Folding
• What's so good about matrix-vector multiplication? Well, not much if this
matrix is dense…
• But if we can decompose the matrix…
v^(t+1) = W v^t = (ABC) v^t
• …then we arrive at the same solution doing a series of matrix-vector
multiplications:
v^(t+1) = (A(B(C v^t)))
Isn't this more expensive? Not if A, B, and C are sparse!
30
Path Folding
• As long as we can decompose the matrix into
a series of sparse matrices, we can turn a
dense matrix-vector multiplication into a
series of sparse matrix-vector multiplications.
This means that we can turn an
operation that requires O(n2)
storage and runtime into one that
requires O(n) storage and runtime!
This is exactly
the case for
text data
And many other
kinds of data as well!
31
Path Folding
• Example – inner product similarity:
W = D^-1 F F^T
where D is a diagonal matrix that normalizes W so its rows sum to 1
(storage: n), F is the original feature matrix (construction: given;
storage: O(n)), and F^T is the feature matrix transposed (construction:
given; storage: just use F).
32
Path Folding
• Example – inner product similarity:
W = D^-1 F F^T
Construction: O(n); storage: O(n); operation: O(n)
• Iteration update:
v^(t+1) = D^-1 (F (F^T v^t))
33
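A sketch of this folded update with a sparse feature matrix follows (SciPy is assumed; the helper name and the toy data are illustrative, not from the proposal).

```python
# One path-folded update v <- D^-1 (F (F^T v)) for inner-product similarity,
# using only the sparse feature matrix F; W = D^-1 F F^T is never formed.
import numpy as np
import scipy.sparse as sp

def folded_update(F, v, d=None):
    if d is None:
        # Row sums of F F^T via two sparse products (compute once in practice).
        d = F @ (F.T @ np.ones(F.shape[0]))
    return (F @ (F.T @ v)) / d

F = sp.random(1000, 500, density=0.05, format="csr")   # toy sparse features
v = folded_update(F, np.random.rand(1000))
```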
Okay…how about a similarity function we actually use for text data?
Path Folding
• Example – cosine similarity:
W = D^-1 N F F^T N
where N is the diagonal cosine-normalizing matrix.
Construction: O(n); storage: O(n); operation: O(n)
• Iteration update:
v^(t+1) = D^-1 (N (F (F^T (N v^t))))
Compact storage: we don't need a cosine-normalized version of the feature
vectors.
34
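And a corresponding sketch for the cosine variant (again purely illustrative; the norms and row sums would be precomputed once in practice):

```python
# One path-folded update v <- D^-1 (N (F (F^T (N v)))) for cosine similarity.
# N is the diagonal matrix of inverse feature-vector norms; only F, the
# norms, and the row sums are kept -- never a cosine-normalized copy of F.
import numpy as np

def cosine_folded_update(F, v):
    norms = np.sqrt(np.asarray(F.multiply(F).sum(axis=1)).ravel())
    n_inv = 1.0 / np.maximum(norms, 1e-12)        # diagonal of N, as a vector
    d = n_inv * (F @ (F.T @ n_inv))               # row sums of N F F^T N
    y = n_inv * (F @ (F.T @ (n_inv * v)))         # N (F (F^T (N v)))
    return y / np.maximum(d, 1e-12)               # multiply by D^-1
```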
Path Folding
• We refer to this technique as path folding due
to its connections to “folding” a bipartite
graph into a unipartite graph.
35
Results
• An accuracy result:
[Scatter plot: each point is the accuracy on a 2-cluster text dataset.
Diagonal: tied (most datasets); upper triangle: we win; lower triangle:
spectral clustering wins. No statistically significant difference - the same
accuracy overall.]
36
Results
• A scalability result:
[Plot: algorithm runtime (log scale, y-axis) vs. data size (log scale,
x-axis). Spectral clustering (red and blue) follows a quadratic curve; our
method (green) follows a linear curve.]
37
MultiRankWalk
• Classification labels are expensive to obtain
• Semi-supervised learning (SSL) learns from
labeled and unlabeled data for classification
• When it comes to network data, what is a
general, simple, and effective method that
requires very few labels?
Our Answer:
MultiRankWalk (MRW)
Random Walk with Restart
• Imagine a network, and starting at a specific
node, you follow the edges randomly.
• But with some probability, you “jump” back to
the starting node (restart!).
If you recorded the number of
times you land on each node,
what would that distribution look
like?
Random Walk with Restart
[Figure: the resulting visit distribution, shown for one start node. What if
we start at a different node?]
Random Walk with Restart
• The walk distribution r satisfies a simple equation:
r = (1 - d) u + d W r
where u indicates the start node(s), W is the transition matrix of the
network, (1 - d) is the restart probability, and d is the "keep-going"
probability (damping factor).
Random Walk with Restart
• Random walk with restart (RWR) can be solved simply and efficiently with
an iterative procedure:
r^(t+1) = (1 - d) u + d W r^t
RWR for Classification
• Simple idea: use RWR for classification
– Run RWR with the start nodes being the labeled points in class A
– Run RWR with the start nodes being the labeled points in class B
– Nodes frequented more by RWR(A) belong to class A; otherwise they belong
to class B
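A self-contained sketch of this scheme, one RWR per class followed by an argmax, might look like the following; the uniform seed weighting and the transition-matrix construction are illustrative assumptions.

```python
# RWR-based classification: run one restart-walk per class, seeded at that
# class's labeled nodes, then assign each node to the class whose walk
# visits it most (the MultiRankWalk idea, sketched with uniform seed weights).
import numpy as np

def rwr_classify(A, seeds_per_class, d=0.85, iters=200):
    W = A / np.maximum(A.sum(axis=0, keepdims=True), 1e-12)  # column-stochastic
    n = A.shape[0]
    scores = []
    for seeds in seeds_per_class:          # e.g., [[3, 17], [42, 8]] for A, B
        u = np.zeros(n)
        u[seeds] = 1.0 / len(seeds)        # restart on this class's seeds
        r = u.copy()
        for _ in range(iters):
            r = (1.0 - d) * u + d * (W @ r)
        scores.append(r)
    return np.argmax(np.vstack(scores), axis=0)   # predicted class per node
```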
Results
• Accuracy: MRW vs. the harmonic functions method (HF)
[Scatter plot comparing MRW and HF; point color and style roughly indicate
the number of labeled examples. With a larger number of labels both methods
do well; MRW does much better with only a few labels!]
45
Results
• How much better is MRW using the authoritative seed preference?
[Plot: y-axis is the MRW F1 score minus the wvRN F1 score; x-axis is the
number of seed labels per class. The gap between MRW and wvRN narrows with
authoritative seeds, but it is still prominent on some datasets with a small
number of seed labels.]
MRW with Path Folding
• MRW is based on random walk with restart (RWR):
r^(t+1) = (1 - d) u + d W r^t
The core operation is a matrix-vector multiplication.
• Most other methods preprocess induced graph data to create a sparse
similarity matrix.
• "Path folding" can be applied to MRW as well, to efficiently learn from
induced graph data! It can also be applied to iterative implementations of
the harmonic functions method.
48
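As a sketch of how the two ideas combine, assuming the same folded inner-product similarity W = D^-1 F F^T used in the PIC slides (names and iteration count illustrative):

```python
# RWR update with a path-folded W = D^-1 F F^T: the dense similarity matrix
# is never materialized, only the sparse feature matrix F and its row sums.
import numpy as np

def folded_rwr(F, u, d=0.85, iters=200):
    row_sums = F @ (F.T @ np.ones(F.shape[0]))     # row sums of F F^T
    r = u.copy()
    for _ in range(iters):
        Wr = (F @ (F.T @ r)) / row_sums            # W r via sparse products
        r = (1.0 - d) * u + d * Wr
    return r
```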
On to real, large, difficult
applications…
• We now have a basic set of tools to do
efficient unsupervised and semi-supervised
learning on graph data
• We want to apply them to cool, challenging problems…
49
A Web-Scale Knowledge Base
• Read the Web (RtW) project:
Build a never-ending system that
learns to extract information
from unstructured web pages,
resulting in a knowledge base of
structured information.
51
Noun Phrase and Context Data
• As part of the RtW project, two kinds of noun phrase (NP) and context
co-occurrence data were produced:
– NP-context co-occurrence data
– NP-NP co-occurrence data
• These datasets can be treated as graphs
52
Noun Phrase and Context Data
• NP-context data:
[Figure: the sentence "… know that drinking pomegranate juice may not be a
bad …" yields a before context ("know that drinking _") and an after context
("_ may not be a bad") for the NP "pomegranate juice". NPs such as
pomegranate juice, JELL-O, and Jagermeister connect to contexts such as
"_ is made from" and "_ promotes responsible" in a bipartite NP-context
graph, with co-occurrence counts on the edges.]
53
Noun Phrase and Context Data
• NP-NP data:
[Figure: the sentence "… French Constitution Council validates burqa ban …"
links the NPs "French Constitution Council" and "burqa ban" through a
context. NPs such as veil, hot pants, French Court, Jagermeister, and JELL-O
form an NP-NP graph; the context can be used for weighting edges or making a
more complex graph.]
54
Noun Phrase Categorization
• We propose using MRW (with path folding) on
the NP-context data to categorize NPs, given a
handful of seed NPs.
• Challenges:
– Large, noisy dataset (10m NPs, 8.6m contexts from
500m web pages).
– What’s the right function for NP-NP categorical
similarity?
– Which learned category assignment should we
“promote” to the knowledge base?
– How to evaluate it?
56
Noun Phrase Categorization
• Preliminary experiment:
– Small subset of the NP-context data
• 88k NPs
• 99k contexts
– Find category “city”
• Start with a handful of seeds
– Ground truth set of 2,404 city NPs created using …
58
Noun Phrase Categorization
• Preliminary result:
[Plot: precision (y-axis, 0.0-1.0) vs. recall (x-axis, 10%-100%) curves for
HF, coEM, MRW, and MRWb on the "city" category.]
59
Learning Synonym NP’s
• Currently the knowledge base sees NPs with different strings as different
entities
• It would be useful to know when two different surface forms refer to the
same semantic entity
Natural thing to try: use PIC to cluster NPs. But do we get synonyms?
61
Learning Synonym NP’s
The NP-context graph yields categorical clusters, and the NP-NP graph yields
semantically related clusters. But together they may help us to identify
synonyms (e.g., grouping President Obama, Barack Obama, and Senator Obama),
and other methods, e.g., string similarity, may help to further refine the
results.
[Figure: example clusters containing NPs such as George W. Bush, Tony Blair,
Michelle Obama, Nicolas Sarkozy, Senator McCain, African American,
Obamamania, Bailout, Democrat, Anti-Obama, and BarackObama.com.]
62
Learning Polyseme NP’s
• Currently a noun phrase found on a web page can
only be mapped to a single entity in the
knowledge base
• It would be useful to know when two different
entities share the same surface form.
• Example:
Jordan
?
64
Learning Polyseme NP’s
• Proposal: Cluster the NP’s neighbors:
– Cluster contexts in the NP-context graph and NPs in the NP-NP graph
– Identify salient clusters as reference to different entities
• Results from clustering neighbors of “Jordan” in the NP-NP graph:
Tru, morgue, Harrison, Jackie, Rick, Davis, Jack, Chicago, episode, sights, six NBA championships,
glass, United Center, brother, restaurant, mother, show, Garret, guest star, dead body, woman,
Khouri, Loyola, autopsy, friend, season, release, corpse, Lantern, NBA championships, Maternal
grandparents, Sox, Starr, medical examiner, Ivers, Hotch, maiden name, NBA titles, cousin John,
Scottie Pippen, guest appearance, Peyton, Noor, night, name, Dr . Cox, third pick, Phoebe, side,
third season, EastEnders, Tei, Chicago White Sox, trumpeter, day, Chicago Bulls, products, couple,
Pippen, Illinois, All-NBA First Team, Dennis Rodman, first retirement
1948-1967, 500,000 Iraqis, first Arab country, Palestinian Arab state, Palestinian President, Arab states, al-Qaeda, Arab World, countries, ethics, Faynan, guidelines, Malaysia, methodology,
microfinance, militias, Olmert, peace plan, Strategic Studies, million Iraqi refugees, two Arab
countries, two Arab states, Palestinian autonomy, Muslim holy sites, Jordan algebras, Jordanian
citizenship, Arab Federation, Hashemite dynasty, Jordanian citizens, Gulf Cooperation, Dahabi,
Judeh, Israeli-occupied West Bank, Mashal, 700,000 refugees, Yarmuk River, Palestine Liberation,
Sunni countries, 2 million Iraqi, 2 million Iraqi refugees, Israeli borders, moderate Arab states
65
Finding New Types and Relations
• The methods we propose for finding synonyms and polysemes may require
different, particular similarity functions to be effective on different
kinds of graphs and relationships. Can we identify categories of nodes and
relationships between nodes based on the characteristics of the graph
structure, using PIC clustering, efficiently?
• Tom's suggestion: given a set of NPs, can you find the k categories and l
relationships that cover the most NPs?
• Can we generalize these methods to find any unary or binary relationships
between nodes in the graph?
• Admittedly the most open-ended part of this proposal.
67
PIC Extension: Avoiding Collisions
• One robustness question for vanilla PIC as data size and complexity grow:
how many (noisy) clusters can you fit in one dimension without them
"colliding"?
[Figure: two PIC embeddings - one with cluster signals cleanly separated,
one where they are a little too close for comfort.]
69
PIC Extension: Avoiding Collisions
• We propose two general solutions:
1. Precondition the starting vector; instead of random:
• Degree
• Skewed (1, 2, 4, 8, 16, etc.)
• May depend on particular properties of the data
2. Run PIC d times with different random starts and construct a
d-dimensional embedding (see the sketch below)
• It is unlikely that two clusters collide on all d dimensions
• We can afford it because PIC is fast and space-efficient!
70
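A sketch of the second option, multiple random starts stacked into a d-dimensional embedding; the fixed iteration budget here stands in for the acceleration-based stopping rule and is an illustrative simplification.

```python
# Run the PIC update from several random starting vectors and stack the
# resulting one-dimensional embeddings into a d-dimensional one, so two
# clusters are unlikely to collide in every dimension.
import numpy as np
from sklearn.cluster import KMeans

def multi_start_pic(A, k, dims=4, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    W = A / A.sum(axis=1, keepdims=True)          # W = D^-1 A
    cols = []
    for _ in range(dims):
        v = rng.random(A.shape[0])
        v /= np.abs(v).sum()
        for _ in range(iters):                    # fixed budget here; the real
            v = W @ v                             # criterion is acceleration
            v /= np.abs(v).sum()
        cols.append(v)
    return KMeans(n_clusters=k, n_init=10).fit_predict(np.column_stack(cols))
```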
PIC Extension: Avoiding Collisions
• Preliminary results on network classification datasets:
[Figure: for each dataset (k = number of clusters), 1-dimensional PIC
embeddings (red: random start vector; green: degree start vector) lose
accuracy at higher k compared to NCut and NJW, but using 4 random start
vectors instead (blue) helps! Note that the number of vectors is much
smaller than k.]
71
PIC Extension: Avoiding Collisions
• Preliminary results on name disambiguation datasets:
[Figure: again, using 4 random start vectors seems to work; again, note that
the number of vectors is much smaller than k.]
72
PIC Extension: Hierarchical Clustering
• Real, large-scale data may not have a “flat”
clustering structure
• A hierarchical view may be more useful
Good News:
The dynamics of a PIC embedding
display a hierarchically convergent
behavior!
73
PIC Extension: Hierarchical Clustering
• Why?
• Recall the PIC embedding at time t:
v^t / (c1 λ1^t) = e1 + (c2/c1)(λ2/λ1)^t e2 + (c3/c1)(λ3/λ1)^t e3 + …
+ (cn/c1)(λn/λ1)^t en
The e's are eigenvectors (structure). Terms with a big (λ/λ1)^t - the more
salient structure - stick around; the less significant eigenvectors /
structures, with small (λ/λ1)^t, go away first, one by one. There may not be
a clear eigengap, but rather a gradient of cluster saliency.
74
PIC Extension: Hierarchical Clustering
[Figure: the same dataset you've seen before. PIC has already converged to 8
clusters… but let's keep on iterating… "N" is still a part of the "2009"
cluster… Yes (it might take a while).]
Similar behavior has also been noted in matrix-matrix power methods
(diffusion maps, mean-shift, multi-resolution spectral clustering).
75
Distributed / Parallel Implementations
• Distributed / parallel implementations of learning
methods are necessary to support large-scale data
given the direction of hardware development
• PIC, MRW, and their path folding variants have at their
core sparse matrix-vector multiplications
• Sparse matrix-vector multiplication lends itself well to
a distributed / parallel computing framework
• We propose to use an existing graph analysis tool: …
• Alternatives: …
77
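To make the claim concrete, here is a toy illustration (ours, not from the proposal) of why a sparse matrix-vector product distributes well: each edge contributes one independent message that is summed per destination row, which is exactly the pattern that MapReduce-style and vertex-centric frameworks parallelize across machines.

```python
# Toy illustration: one sparse matrix-vector product written as independent
# per-edge contributions grouped by destination row.
from collections import defaultdict

def edgewise_matvec(edges, v):
    """edges: iterable of (i, j, w) with W[i][j] = w; v: dict node -> value."""
    result = defaultdict(float)
    for i, j, w in edges:          # "map": each edge emits w * v[j], keyed by i
        result[i] += w * v[j]      # "reduce": sum the contributions per row i
    return dict(result)

print(edgewise_matvec([(0, 1, 0.5), (1, 0, 1.0), (1, 2, 0.25)],
                      {0: 1.0, 1: 2.0, 2: 4.0}))   # {0: 1.0, 1: 2.0}
```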
Questions?
Additional Information
80
PIC: Some "Powering" Methods at a Glance
Method | W | Iterate | Stopping | Final
information bottleneck method (Tishby & Slonim 2000) | W = D^-1 A | W^(t+1) = W^t W | rate of information loss |
matrix powering (Zhou & Woodruff 2004) | W = A | W^(t+1) = W^t W | a small t | a threshold ε
Gaussian blurring mean-shift (Carreira-Perpinan 2006) | W = D^-1 A | X^(t+1) = W X^t | entropy | a threshold ε
PIC | W = D^-1 A | v^(t+1) = W v^t | acceleration | k-means
How far can we go with a
one- or low-dimensional
embedding?
PIC: Versus Popular Fast Sparse Eigencomputation Methods
For Symmetric Matrices | For General Matrices | Improvement
Successive Power Method | Successive Power Method | Basic; numerically unstable, can be slow
Lanczos Method | Arnoldi Method | More stable, but requires lots of memory
Implicitly Restarted Lanczos Method (IRLM) | Implicitly Restarted Arnoldi Method (IRAM) | More memory-efficient
(Randomized sampling methods are also popular.)
Method | Time | Space
IRAM | (O(m^3) + (O(nm) + O(e)) × O(m-k)) × (# restarts) | O(e) + O(nm)
PIC | O(e) × (# iterations) | O(e)
82
PIC: Related Clustering Work
• Spectral Clustering
– (Roxborough & Sen 1997, Shi & Malik 2000, Meila & Shi 2001, Ng et al. 2002)
• Kernel k-Means (Dhillon et al. 2007)
• Modularity Clustering (Newman 2006)
• Matrix Powering
– Markovian relaxation & the information bottleneck method (Tishby & Slonim
2000)
– matrix powering (Zhou & Woodruff 2004)
– diffusion maps (Lafon & Lee 2006)
– Gaussian blurring mean-shift (Carreira-Perpinan 2006)
• Mean-Shift Clustering
– mean-shift (Fukunaga & Hostetler 1975, Cheng 1995, Comaniciu & Meer 2002)
– Gaussian blurring mean-shift (Carreira-Perpinan 2006)
83
PICwPF: Results
[Scatter plot: each point represents the accuracy result from a dataset.
Upper triangle: PIC wins; lower triangle: k-means wins.]
84
PICwPF: Results
[The two methods have almost the same behavior; overall, one method is not
statistically significantly better than the other.]
85
PICwPF: Results
[Lesson: approximate eigencomputation methods may require expertise to work
well. Not sure why NCUTiram did not work as well as NCUTevd.]
86
PICwPF: Results
• PIC is O(n) per iteration and the runtime curve looks linear…
• But I don't like eyeballing curves, and perhaps the number of iterations
increases with the size or difficulty of the dataset?
[Correlation plot: correlation statistic (0 = none, 1 = correlated).]
87
PICwPF: Results
• Linear run-time implies a constant number of iterations.
• The number of iterations to "acceleration convergence" is hard to analyze:
– Faster than a single complete run of power iteration to convergence.
– On our datasets, 10-20 iterations is typical; 30-35 is exceptional.
88
PICwPF: Related Work
• Faster spectral clustering methods (still not O(n) time):
– Approximate eigendecomposition (Lanczos, IRAM)
– Sampled eigendecomposition (Nyström)
• Sparser matrices (not O(n) space; still require O(n^2) construction in
general):
– Sparse construction: k-nearest-neighbor graph, k-matching
– Graph sampling / reduction
89
PICwPF: Results
90
MRW: RWR for Classification
We refer to this method as MultiRankWalk: it classifies
data with multiple rankings using random walks
MRW: Seed Preference
• Obtaining labels for data points is expensive
• We want to minimize cost for obtaining labels
• Observations:
– Some labels inherently more useful than others
– Some labels easier to obtain than others
Question: “Authoritative” or “popular” nodes
in a network are typically easier to obtain
labels for. But are these labels also more
useful than others?
Seed Preference
• Consider the task of giving a human expert (or
posting jobs on Amazon Mechanical Turk) a list of
data points to label
• The list (seeds) can be generated uniformly at
random, or we can have a seed preference,
according to simple properties of the unlabeled
data
• We consider 3 preferences:
– Random
– Link Count (nodes with the highest counts make the list)
– PageRank (nodes with the highest scores make the list)
MRW: The Question
• What really makes MRW and wvRN different?
• Network-based SSL often boils down to label propagation.
• MRW and wvRN represent two general propagation methods – note that they
are called by many names:
MRW: random walk with restart, regularized random walk, local & global
consistency, personalized PageRank
wvRN: reverse random walk, random walk with sink nodes, hitting time,
harmonic functions on graphs, iterative averaging of neighbors
Great…but we still don't know why the differences in their behavior on these
network datasets!
MRW: The Question
• It's difficult to answer exactly why MRW does better with a smaller number
of seeds.
• But we can gather probable factors from their propagation models:
1. MRW is centrality-sensitive; wvRN is centrality-insensitive.
(Centrality-sensitive: seeds have different scores and are not necessarily
the highest.)
2. MRW has an exponential drop-off / damping factor; wvRN has no drop-off /
damping.
3. In MRW, propagation of different classes is done independently; in wvRN,
propagation of different classes interacts.
MRW: The Question
• An example from a political blog dataset – MRW vs. wvRN scores for how
much a blog is politically conservative (seed labels underlined):
[Table: top-ranked blogs and their scores under each method; one column of
scores runs from 0.020 down to 0.005 and the other from 1.000 down to 0.509,
over blogs such as firstdownpolitics.com, neoconservatives.blogspot.com,
strangedoctrines.typepad.com, jmbzine.com, presidentboxer.blogspot.com,
millers_time.typepad.com, rooksrant.com, decision08.blogspot.com,
charlineandjamie.com, marksteyn.com, and others.]
2. Exponential drop-off: much less sure about nodes further away from the
seeds.
3. Classes propagate independently: a blog can come out both very likely
conservative and very likely liberal (good or bad?).
We still don't really understand it yet.
MRW: Related Work
• MRW is very much related to:
– "Local and global consistency" (Zhou et al. 2004): similar formulation,
different view
– "Web content categorization using link information" (Gyongyi et al.
2006): RWR ranking as features to an SVM
– "Graph-based semi-supervised learning as a generative model" (He et al.
2007): random walk without restart, heuristic stopping
• Seed preference is related to the field of active learning
– Active learning chooses which data point to label next based on previous
labels; the labeling is interactive
– Seed preference is a batch labeling method
Authoritative seed preference is a good baseline for active learning on
network data!