Semantic, Hierarchical, Online Clustering
of Web Search Results
Yisheng Dong
Overview
Introduction
Previous Related Work
SHOC Approach
Prototype System
Conclusion
Introduction
Motivation
The Web is the biggest data source.
Search engines are the most commonly used tools for Web information retrieval.
Their current quality is far from satisfactory.
Solution
Clustering of Web search results would help a lot.
SHOC can generate clusters that are both reasonable and readable.
Basic requirements
(for a clustering approach to Web search results)
Semantic
Each cluster should correspond to a concept.
Avoid confining each Web page to only one cluster.
A label should describe the topic of the cluster well.
Hierarchical
An easy-to-browse tree structure.
Take advantage of the relationships between clusters.
Online
Provide fresh clustering results “just in time”.
Previous Related Work
Scatter/Gather system
A traditional heuristic clustering algorithm.
It has some limitations.
Hyperlink-based approaches
They need to download and parse the original Web pages.
They cannot cluster immediately.
STC (Suffix Tree Clustering)
It is not appropriate for Oriental languages.
It extracts many meaningless partial phrases.
Synonymy and polysemy are not considered.
SHOC steps
1. Data acquisition
2. Data cleaning
3. Feature extraction
4. Identifying base clusters
5. Combining base clusters
Data acquisition
The data acquisition task here is actually meta-search.
Use a 2-level parallelization mechanism:
1. Call several engines simultaneously.
2. Fetch all of each engine's result pages simultaneously.
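A minimal sketch of the 2-level parallel meta-search, assuming hypothetical engine URL templates and plain HTTP fetching (the slide does not name the engines or the transport):

# Illustrative 2-level parallel meta-search.  The engine URL templates are
# placeholders, not the engines used by the original system.
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

ENGINES = [
    "https://engine-a.example/search?q={q}&page={p}",   # hypothetical
    "https://engine-b.example/search?q={q}&page={p}",   # hypothetical
]

def fetch(url):
    with urlopen(url, timeout=5) as resp:
        return resp.read().decode("utf-8", errors="ignore")

def metasearch(query, pages=3):
    # Level 1: query several engines at once; level 2: fetch each engine's
    # result pages at once.  A single thread pool covers both levels.
    urls = [e.format(q=query, p=p) for e in ENGINES for p in range(1, pages + 1)]
    with ThreadPoolExecutor(max_workers=len(urls)) as pool:
        return list(pool.map(fetch, urls))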
Data cleaning
Sentence boundaries are identified via:
punctuation marks (e.g. ‘.’, ‘,’, ‘;’, ‘?’, etc.)
HTML tags (e.g. <p>, <br>, <li>, <td>, etc.)
Non-word tokens (punctuation marks and HTML tags) are stripped.
Redundant spaces are compressed.
A stemming algorithm may be applied (for English text).
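A minimal sketch of these cleaning rules; the mark and tag lists come from the slide, while the function name and regular expressions are illustrative assumptions:

# Illustrative cleaning sketch: split on the slide's punctuation marks and
# HTML tags, strip non-word tokens, and compress redundant spaces.
# Stemming for English text (e.g. a Porter stemmer) is omitted here.
import re

SENTENCE_BREAKS = re.compile(r"[.,;?!]|<\s*(?:p|br|li|td)[^>]*>", re.IGNORECASE)
NON_WORD = re.compile(r"<[^>]+>|[^\w\s]")

def clean(snippet):
    sentences = SENTENCE_BREAKS.split(snippet)       # sentence boundaries
    cleaned = []
    for s in sentences:
        s = NON_WORD.sub(" ", s)                     # strip tags and punctuation
        s = re.sub(r"\s+", " ", s).strip()           # compress redundant spaces
        if s:
            cleaned.append(s)
    return cleaned

# clean("Object-oriented programming.<br>See <b>OOP</b> basics, examples")
# -> ['Object oriented programming', 'See OOP basics', 'examples']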
Feature extraction (Overview)
Words
Most clustering algorithms treat a document as a “bag of words”,
ignoring word order and proximity.
Key phrases
Advantages
Improve the quality of the clusters.
Useful in constructing labels.
Data structures (for key phrase discovery)
Suffix tree
Its size is related to the alphabet size of the language.
Suffix array
Scalable over the alphabet size.
Feature extraction (key phrase discovery)
Completeness
Left-completeness
Right-completeness
Stability (Mutual Information)
S = “c1c2∙∙∙cp”, SL = “c1∙∙∙cp-1”, SR = “c2∙∙∙cp”
MI(S) = freq(S) / ( freq(SL) + freq(SR) - freq(S) )
Significance
se(S) = freq(S) * g(|S|), where
g(x) = 0 (x = 1); log2(x) (2 ≤ x ≤ 8); 3 (x > 8)
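The two measures above, written out as a small Python helper. The ‘+’ in the MI denominator is reconstructed from the slide's fraction, and freq() is taken here as a raw substring count over the cleaned text, which is an assumption of this sketch:

# Stability (mutual information) and significance of a candidate phrase S,
# following the slide's definitions.  freq() as a plain substring count is
# an assumption of this sketch.
import math

def freq(s, text):
    return text.count(s)

def mutual_information(s, text):
    if len(s) < 2:
        return 0.0
    s_left, s_right = s[:-1], s[1:]      # SL = c1...c(p-1), SR = c2...cp
    denom = freq(s_left, text) + freq(s_right, text) - freq(s, text)
    return freq(s, text) / denom if denom else 0.0

def significance(s, text):
    n = len(s)                           # |S|, the phrase length
    if n == 1:
        g = 0.0
    elif n <= 8:
        g = math.log2(n)
    else:
        g = 3.0
    return freq(s, text) * g

# mutual_information("to_be", "to_be_or_not_to_be") == 1.0  (fully stable)
# significance("to_be", "to_be_or_not_to_be") == 2 * log2(5) ≈ 4.64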
Feature extraction (Suffix array)
Suffix array
An array of all N suffixes, sorted alphabetically.
LCP (Longest Common Prefix)
Used to accelerate searching in the text.
[Figure: suffix array and LCP of “to_be_or_not_to_be”]
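A short sketch that builds the suffix array and LCP array for the slide's example string. The naive O(N² log N) construction is for illustration only, not the construction used by the authors:

# Suffix array and LCP array of the example string.  The extra 0 at the
# end of lcp acts as the sentinel that the discover_rcs scan below relies on.
def suffix_array(text):
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    # lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]
    lcp = [0] * (len(sa) + 1)
    for i in range(1, len(sa)):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

text = "to_be_or_not_to_be"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
for i, start in enumerate(sa):
    print(f"{lcp[i]:2d}  {text[start:]}")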
Feature extraction (Discover rcs)
#include <stdio.h>

// Discover right-complete substrings (rcs) by scanning the lcp array of the
// suffix array with a stack.  lcp has N+1 entries and lcp[N] = 0 serves as a
// sentinel that flushes the stack at the end.
void discover_rcs(const int lcp[], int N)   // N is the document's length
{
    typedef struct {
        int ID;           // position in the lcp array where this rcs was found
        int frequency;    // number of occurrences
    } RCSTYPE;

    RCSTYPE rcs_stack[N];
    int sp = -1;          // the stack pointer (-1 means the stack is empty)
    int i = 1;
    while (i < N + 1)
    {
        if (sp < 0) {                    // the stack is empty
            if (lcp[i] > 0) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
            }
            i++;
        }
        else {
            int r = rcs_stack[sp].ID;
            if (lcp[r] < lcp[i]) {       // longer common prefix: push a new rcs
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
                i++;
            }
            else if (lcp[r] == lcp[i]) { // the same rcs occurs once more
                rcs_stack[sp].frequency++;
                i++;
            }
            else {                       // shorter prefix: pop and output the rcs
                printf("%d %d\n", rcs_stack[sp].ID,
                       rcs_stack[sp].frequency);     // ID & frequency
                int f = rcs_stack[sp].frequency;
                sp--;
                if (sp >= 0) {
                    rcs_stack[sp].frequency += f - 1;
                }
            }
        }
    }
}
Feature extraction (Intersect lcs_rcs)
#include <stdio.h>
#include <string.h>

// One entry of the sorted lcs / rcs arrays.  For this version the substring
// itself is stored alongside its ID and frequency.
typedef struct {
    int ID;             // starting position of the substring in the text
    int frequency;      // number of occurrences
    const char *str;    // the substring itself
} CSTYPE;

// Intersect the two sorted arrays: substrings present in both lcs and rcs
// are complete substrings, i.e. the key-phrase candidates.
void intersect_lcs_rcs(const CSTYPE lcs[], int L, const CSTYPE rcs[], int R)
{
    int i = 0, j = 0;
    while (i < L && j < R) {
        int cmp = strcmp(lcs[i].str, rcs[j].str);
        if (cmp == 0) {          // in both arrays: output as a complete substring
            printf("%d %d %s\n", lcs[i].ID, lcs[i].frequency, lcs[i].str);
            i++;
            j++;
        }
        else if (cmp < 0) {
            i++;
        }
        else {
            j++;
        }
    }
}
rcs array
ID   frequency   RCS
 1       2       _be
 2       5       _
 6       2       be
 8       2       e
11       2       o_be
12       4       o
16       3       to_be
17       2       t

cs array
ID   frequency   CS
 2       5       _
12       4       o
16       3       t
17       2       to_be
Identifying base clusters
Terms (key phrases)
Documents
The association between terms and documents
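The slide only names the ingredients, so the sketch below builds the binary term-document association and, as a clearly simplified stand-in for SHOC's actual base-cluster identification, treats each key phrase together with the documents containing it as one base cluster:

# Term-document association plus a simplified stand-in for base-cluster
# identification (one base cluster per key phrase = the documents that
# contain it).  This is not necessarily the method used by SHOC itself.
def term_document_association(phrases, docs):
    # assoc[p] = set of document indices whose text contains phrase p
    return {p: {i for i, d in enumerate(docs) if p in d} for p in phrases}

def base_clusters(phrases, docs, min_docs=2):
    assoc = term_document_association(phrases, docs)
    # keep only phrases shared by enough documents to form a cluster
    return {p: ds for p, ds in assoc.items() if len(ds) >= min_docs}

# Example with hypothetical snippets:
docs = ["object oriented programming in java",
        "object oriented analysis and design",
        "functional programming basics"]
print(base_clusters(["object oriented", "programming"], docs))
# -> {'object oriented': {0, 1}, 'programming': {0, 2}}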
Combining base clusters
Combine base clusters X and Y:

if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
    X and Y are merged into one cluster;
}
else {
    if ( |X| > |Y| ) {
        if ( |X ∩ Y| / |Y| > t2 ) {
            let Y become X’s child;
        }
    }
    else {
        if ( |X ∩ Y| / |X| > t2 ) {
            let X become Y’s child;
        }
    }
}

Merging labels:

if ( label_x is a substring of label_y ) {
    label_xy = label_y;
}
else if ( label_y is a substring of label_x ) {
    label_xy = label_x;
}
else {
    label_xy = label_x + label_y;   // concatenate the two labels
}
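A direct Python transcription of the two rules above; the threshold values t1 and t2 are placeholders, since the slides do not give them:

# Transcription of the combine rule and label merging.  Base clusters are
# represented as sets of document indices; t1 and t2 are placeholder values.
def merge_labels(label_x, label_y):
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return label_x + " " + label_y

def combine(x, y, label_x, label_y, t1=0.5, t2=0.6):
    inter = len(x & y)
    if inter / len(x | y) > t1:
        return ("merge", merge_labels(label_x, label_y))
    if len(x) > len(y):
        if inter / len(y) > t2:
            return ("y_becomes_child_of_x", None)
    elif inter / len(x) > t2:
        return ("x_becomes_child_of_y", None)
    return ("unrelated", None)

# Example with the hypothetical base clusters from the previous sketch:
print(combine({0, 1}, {0, 2}, "object oriented", "programming"))
# -> ('unrelated', None): |X∩Y|/|X∪Y| = 1/3 and |X∩Y|/|X| = 1/2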
Prototype system
Created a prototype system named WICE (Web Information Clustering Engine).
It deals well with the special problems related to Chinese.
Output for the query “object oriented”:
object oriented programming
object oriented analysis, etc.
Conclusion
Main contributions
The benefit of using key phrases.
A method based on suffix arrays for key phrase discovery.
The concept of orthogonal clustering.
The WICE system is designed and implemented.
Further work
Detailed analysis.
Further experiments.
Interpretation of experimental results.
Comparison with other clustering algorithms.