Transcript rinco

Evaluation of Bipartite-graph-based
Web Page Clustering
Shim Wonbo
M1 Chikayama-Taura Lab
Background
Open Directory Project


Used by Google, Lycos, etc.
Categorizing Web pages by hand

Accurate

Lately updated
Unscalable

World Wide Web

Rapid increase (= # of clusters changes)
Daily updated (= cluster centers move)

Due to these two properties of the Web..


A Web page clustering system without
human effort is needed.
Purpose

Constructing a Web page clustering
system which




finds clusters without human help
is scalable
clusters Web pages in high speed
clusters Web pages accurately
Agenda





Introduction
Related Work
Proposal
Comparison
Conclusion
Clustering Algorithm

Text-based clustering



Use of word as feature
Generally used algorithm
Link-based clustering


Focus on link structure
Especially used in clustering Web pages
k-means Algorithm
k=3
point: vector expression
of each document
Problems of k-means Algorithm


k depends on the data set.
Outliers sensitively effect clustering
result.
Hierarchical Clustering
BIRCH
[Zhang ’96], CURE [Guha ’98], Chameleon [Karypis ’99], ROCK [Guha ’00]
Hierarchical Clustering


# of clusters can be determined by
condition.
Clustering a large number of points
(pages) results in many I/O accesses.
Use of Link Structure


Web pages include not only text but
also links.
People link Web pages to other related
pages.
Linked Web pages may share
the same topic
Extraction of Web Community
based on Link Analysis

An Approach to Find Related
Communities Based on Bipartite Graphs
[P.Krishna Reddy et al., 2001]
Terminology

Fans and Centers
Fan
Center
p

q
Bipartite Graph


Complete BG
Dense BG
(a) CBG
(b) DBG
( p  pt , q  qt )
An Approach to Find Related Communities
Based on Bipartite Graphs

Definition
The set T contains the members of the
community if there exist a dense bipartite
graph DBG(T, I, p, q) where p  pt , q  qt




T: Fans
I: Centers
p: # of out-link
q: # of in-link
p
q
DBG(T, I, 2, 3)
DBG Extraction Algorithm
(pt = 2, qt = 3)
1.
Gathering related nodes
threshold = 1
DBG Extraction Algorithm
(pt = 2, qt = 3)
2.
Extracting a DBG
1
2
3
2
2
3
3
2
3
1
2
0
1
DBG-based Web Community
O High speed (O( #links ))
O Finding out topics over the Web
X Possibility of extracting disrelated Web
page group
Comparison

Text-based clustering



Accurate
Difficult to determine the center of cluster
Community topology based on DBG


Inaccurate
Can be used as topic selection
Refined Web Community

Center of Cluster
Agenda





Introduction
Related Word
Proposal
Comparison
Conclusion
Proposal
1.
2.
3.
Extract DBGs through link analysis
Refine communities and fix centers
with DBSCAN
Partition other pages to the nearest
center
Community Extraction

Extract DBGs from the Web Graph

Disallow the same page to be included in
more than one Web community
Web Graph
Cluster Center Refinement

Find meaningful page sets
1.
2.


Does the DBGs really have a topic?
Is there any page in the community that is not
related the topic?
Feature: terms of extracted pages
DBSCAN [Martin Easter et al., A Density-Based
Algorithm for Discovering Clusters in Large Spatial
Databases with Noise, 1999]
DBSCAN
radius: r
minP: m
Density
reachable
Community
(Center of cluster)
r
Core
Partitioning Remaining Pages

Feature: term’s appearance
1.
Calculate distance between a remaining
page and each center
If the distance to the nearest center is
shorter than threshold, attach the page to
that cluster
Otherwise, attach the page to “Unclassified
cluster”
2.
3.
Agenda





Introduction
Related Word
Proposal
Experimental Result
Conclusion
Target


Seed: 3,000 pages categorized to
Computer/Software by ODP
70,000 pages departed from seed
pages by 2 hops
Preprocess

Word ID






Use words of a dictionary as base vectors
Attribute the same ID to words sharing the same
derivation
Add terms which appear in many documents
(IDF <= 8)
Total: 29347
Link Extraction
Elimination of links to pages which are not
collected.
# Communities
1200
1000
(2,2)
(2,3)
(2,4)
(3,2)
(3,3)
(3,4)
(4,2)
(4,3)
(4,4)
# Communities
800
600
400
200
0
0
1
2
3
4
5
Threshold
6
7
8
9
10
# Community Members
(pt=3, qt=3)
45000
40000
# Community Members
35000
(3,3,2)
(3,3,3)
(3,3,4)
(3,3,5)
(3,3,6)
(3,3,7)
(3,3,8)
(3,3,9)
30000
25000
20000
15000
10000
5000
0
0
2
4
6
8
Nth Community
10
12
14
# Community Members
60000
# Community Members
50000
(2,2)
(2,3)
(2,4)
(3,2)
(3,3)
(3,4)
(4,2)
(4,3)
(4,4)
40000
30000
20000
10000
0
0
1
2
3
4
5
Threshold
6
7
8
9
10
Variance of Terms
After DBSCAN
Conclusion
Future Work

Applying to more large data set


This may need parallel processing
Analyzing with