Clustering Web Search Results


Semantic, Hierarchical, Online Clustering
of Web Search Results
Yisheng Dong
Overview

- Introduction
- Previous Related Work
- SHOC Approach
- Prototype System
- Conclusion
Introduction

- Motivation
  - The Web is the biggest data source.
  - The search engine is the most commonly used tool for Web information retrieval.
  - Its current quality is far from satisfactory.
- Solution
  - Clustering of Web search results would help a lot.
  - SHOC can generate clusters that are both reasonable and readable.
Basic requirements
(for a clustering approach over Web search results)

- Semantic
  - Each cluster should correspond to a concept.
  - Avoid confining each Web page to only one cluster.
  - A label should describe the topic of its cluster well.
- Hierarchical
  - An easy-to-browse tree structure.
  - Take advantage of the relationships between clusters.
- Online
  - Provide fresh clustering results "just-in-time".
Previous Related Work

- Scatter/Gather system
  - A traditional heuristic clustering algorithm.
  - It has some limitations.
- Hyperlink-based approaches
  - Need to download and parse the original Web pages.
  - Cannot cluster immediately.
- STC (Suffix Tree Clustering)
  - Not appropriate for Oriental languages.
  - Extracts many meaningless partial phrases.
  - Synonymy and polysemy are not considered.
SHOC steps

1. Data acquisition
2. Data cleaning
3. Feature extraction
4. Identifying base clusters
5. Combining base clusters
Data acquisition

- The data acquisition task here is actually meta-search.
- Use a 2-level parallelization mechanism:
  1. Call several search engines simultaneously.
  2. Fetch all of each engine's result pages simultaneously.
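The 2-level parallel fetch can be sketched with a thread pool. Everything below (the engine names, the `fetch_page` stub) is a hypothetical stand-in, since the slides do not name concrete engines or APIs; a real system would issue HTTP requests here.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_page(engine, query, page):
    """Hypothetical stand-in for fetching one result page from one engine."""
    return [f"{engine}:{query}:result{page * 10 + i}" for i in range(10)]

def meta_search(query, engines, pages_per_engine=2):
    """Level 1: query all engines in parallel.
    Level 2: fetch each engine's result pages in parallel."""
    with ThreadPoolExecutor() as pool:
        futures = [
            pool.submit(fetch_page, engine, query, page)
            for engine in engines                  # level 1: engines
            for page in range(pages_per_engine)    # level 2: result pages
        ]
        results = []
        for f in futures:
            results.extend(f.result())
    return results

hits = meta_search("object oriented", ["engineA", "engineB"])
```

Because every engine/page pair becomes its own task, the total latency is roughly that of the slowest single page rather than the sum of all fetches.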
Data cleaning

- Sentence boundaries are identified via the following:
  - punctuation marks (e.g. '.', ',', ';', '?', etc.)
  - HTML tags (e.g. <p>, <br>, <li>, <td>, etc.)
- Non-word tokens are stripped (e.g. punctuation marks and HTML tags).
- Redundant spaces are compressed.
- A stemming algorithm may be applied (for English text).
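A minimal sketch of this cleaning pipeline in Python. The regular expressions are my own illustrative choices (the slides list the rules but not an implementation), and stemming is omitted.

```python
import re

def clean_snippet(html):
    """Split on sentence-boundary punctuation and boundary HTML tags,
    strip remaining non-word tokens, and compress redundant spaces."""
    # HTML tags such as <p>, <br>, <li>, <td> act as sentence boundaries.
    text = re.sub(r"(?i)<\s*/?\s*(p|br|li|td)[^>]*>", ". ", html)
    # Drop any other HTML tags outright.
    text = re.sub(r"<[^>]+>", " ", text)
    sentences = re.split(r"[.,;?!]+", text)   # punctuation = boundaries
    cleaned = []
    for s in sentences:
        s = re.sub(r"[^\w\s]", " ", s)        # strip non-word tokens
        s = re.sub(r"\s+", " ", s).strip()    # compress redundant spaces
        if s:
            cleaned.append(s)
    return cleaned

parts = clean_snippet("Hello,  world!<br>This is <b>fine</b>.")
```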
Feature extraction (Overview)

- Words
  - Most clustering algorithms treat a document as a "bag of words",
    ignoring word order and proximity.
- Key phrases
  - Advantages
    - Improve the quality of the clusters.
    - Useful in constructing labels.
  - Data structures (for key phrase discovery)
    - Suffix tree: cost is related to the alphabet size of the language.
    - Suffix array: scalable over the alphabet size.
Feature extraction
(key phrase discovery)

- Completeness
  - Left-completeness
  - Right-completeness
- Stability (Mutual Information)
  - S = "c1 c2 ··· cp",  SL = "c1 ··· cp-1",  SR = "c2 ··· cp"
  - MI(S) = freq(S) / (freq(SL) + freq(SR) - freq(S))
- Significance
  - se(S) = freq(S) * g(|S|), where
      g(x) = 0         (x = 1)
             log2(x)   (2 ≤ x ≤ 8)
             3         (x > 8)
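The two formulas above can be checked with a few lines of Python. `freq` here is naive overlapping substring counting over the example string used later in the slides; a real implementation would read frequencies off the suffix array.

```python
from math import log2

def freq(text, s):
    """Number of (possibly overlapping) occurrences of s in text."""
    count, start = 0, 0
    while True:
        i = text.find(s, start)
        if i < 0:
            return count
        count += 1
        start = i + 1

def mutual_information(text, s):
    """Stability: MI(S) = freq(S) / (freq(SL) + freq(SR) - freq(S))."""
    sl, sr = s[:-1], s[1:]
    return freq(text, s) / (freq(text, sl) + freq(text, sr) - freq(text, s))

def g(x):
    """Length-weighting function from the significance formula."""
    if x == 1:
        return 0
    if x <= 8:
        return log2(x)
    return 3

def significance(text, s):
    """se(S) = freq(S) * g(|S|)."""
    return freq(text, s) * g(len(s))

text = "to_be_or_not_to_be"
```

For S = "to_be", SL = "to_b" and SR = "o_be" each occur exactly where S does, so the denominator collapses to freq(S) and MI(S) = 1 — a maximally stable phrase.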
Feature extraction (Suffix array)

- Suffix array
  - An array of all N suffixes, sorted alphabetically.
- LCP (Longest Common Prefix)
  - Used to accelerate searching in the text.

(Figure: suffix array and LCP of "to_be_or_not_to_be")
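A naive construction of both arrays for the slide's example string — fine for short snippets, though this O(n² log n) sketch is not a production suffix-array builder.

```python
def suffix_array(text):
    """All suffix start positions, sorted by the suffixes' alphabetical order."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def lcp_array(text, sa):
    """lcp[i] = length of the longest common prefix of suffixes sa[i-1] and sa[i]."""
    lcp = [0] * len(sa)
    for i in range(1, len(sa)):
        a, b = text[sa[i - 1]:], text[sa[i]:]
        k = 0
        while k < len(a) and k < len(b) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return lcp

text = "to_be_or_not_to_be"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
```

Because equal substrings end up adjacent in the sorted order, repeated substrings show up directly as nonzero LCP entries — which is what the rcs-discovery pass below exploits.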
Feature extraction (Discover rcs)

void discover_rcs()
{
    typedef struct {
        int ID;
        int frequency;
    } RCSTYPE;

    RCSTYPE rcs_stack[N];   // N is the document's length
    Initialize rcs_stack;
    int sp = -1;            // the stack pointer
    int i = 1;
    while (i < N + 1) {
        if (sp < 0) {                    // the stack is empty
            if (lcp[i] > 0) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
            }
            i++;
        }
        else {
            int r = rcs_stack[sp].ID;
            if (lcp[r] < lcp[i]) {
                sp++;
                rcs_stack[sp].ID = i;
                rcs_stack[sp].frequency = 2;
                i++;
            }
            else if (lcp[r] == lcp[i]) {
                rcs_stack[sp].frequency++;
                i++;
            }
            else {                       // lcp[r] > lcp[i]: pop and report
                Output rcs_stack[sp];    // ID & frequency
                int f = rcs_stack[sp].frequency;
                sp--;
                if (sp >= 0) {
                    rcs_stack[sp].frequency = rcs_stack[sp].frequency + f - 1;
                }
            }
        }
    }
}
Feature extraction (Intersect lcs_rcs)

void intersect_lcs_rcs(sorted lcs array, sorted rcs array)
{
    int i = 0, j = 0;
    while (i < L && j < R) {    // L and R are the arrays' lengths
        string str_l = the LCS denoted by lcs[i].ID;
        string str_r = the RCS denoted by rcs[j].ID;
        if (str_l == str_r) {
            Output lcs[i];
            i++;
            j++;
        }
        else if (str_l < str_r) {
            i++;
        }
        else {                  // str_l > str_r
            j++;
        }
    }
}
rcs array

  ID   frequency   RCS
   1       2       _be
   2       5       _
   6       2       be
   8       2       e
  11       2       o_be
  12       4       o
  16       3       t
  17       2       to_be

lcs array

  ID   frequency   LCS
   2       5       _
  12       4       o
  16       3       t
  17       2       to_be
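Running the same merge-style intersection over the example arrays above, with the strings standing in for the ID-referenced substrings:

```python
def intersect(lcs, rcs):
    """Merge-style intersection of two alphabetically sorted string lists."""
    out, i, j = [], 0, 0
    while i < len(lcs) and j < len(rcs):
        if lcs[i] == rcs[j]:
            out.append(lcs[i])
            i += 1
            j += 1
        elif lcs[i] < rcs[j]:
            i += 1
        else:
            j += 1
    return out

# Strings from the two example arrays for "to_be_or_not_to_be".
rcs = sorted(["_be", "_", "be", "e", "o_be", "o", "t", "to_be"])
lcs = ["_", "o", "t", "to_be"]          # already sorted
complete = intersect(lcs, rcs)
```

The strings present in both arrays are the candidate key phrases that survive both completeness tests.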
Identifying base clusters

- Terms (key phrases)
- Documents
- The association between terms and documents
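One simple way to realize the term–document association is a binary matrix over key phrases and documents. The documents and phrases below are toy data of my own, not from the slides:

```python
# Toy documents and key phrases; real inputs would come from the
# earlier feature-extraction step.
docs = [
    "object oriented programming in java",
    "object oriented analysis and design",
    "functional programming in haskell",
]
phrases = ["object oriented", "programming"]

# Association matrix A: A[t][d] = 1 if phrase t occurs in document d.
A = [[1 if p in d else 0 for d in docs] for p in phrases]
```

Each row of A is a base cluster: the set of documents containing that phrase, with the phrase itself as a natural label.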
Combining base clusters

Combine base clusters X and Y:

if ( |X ∩ Y| / |X ∪ Y| > t1 ) {
    X and Y are merged into one cluster;
}
else {
    if ( |X| > |Y| ) {
        if ( |X ∩ Y| / |Y| > t2 ) {
            let Y become X's child;
        }
    }
    else {
        if ( |X ∩ Y| / |X| > t2 ) {
            let X become Y's child;
        }
    }
}

Merging labels:

if ( label_x is a substring of label_y ) {
    label_xy = label_y;
}
else if ( label_y is a substring of label_x ) {
    label_xy = label_x;
}
else {
    label_xy = label_x + label_y;   // concatenate
}
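A direct Python rendering of the two rules above, treating base clusters as sets of document IDs. The threshold values t1 = 0.5 and t2 = 0.75 are illustrative assumptions, since the slides leave them unspecified, and labels are joined with a space for readability.

```python
def combine(X, Y, t1=0.5, t2=0.75):
    """Decide how base clusters X and Y (sets of document IDs) combine."""
    inter = len(X & Y)
    if inter / len(X | Y) > t1:               # high overlap: merge
        return "merge"
    if len(X) > len(Y):                        # otherwise maybe nest the smaller
        return "Y is X's child" if inter / len(Y) > t2 else "separate"
    return "X is Y's child" if inter / len(X) > t2 else "separate"

def merge_label(label_x, label_y):
    """Label for a merged cluster, per the substring rule above."""
    if label_x in label_y:
        return label_y
    if label_y in label_x:
        return label_x
    return label_x + " " + label_y
```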
Prototype system

- Created a prototype system named WICE (Web Information Clustering Engine).
- It handles the special problems related to Chinese well.
- Output for the query "object oriented":
  - object oriented programming
  - object oriented analysis, etc.
Conclusion

- Main contributions
  - The benefit of using key phrases.
  - A suffix-array-based method for key phrase discovery.
  - The concept of orthogonal clustering.
  - The WICE system, designed and implemented.
- Further work
  - Detailed analysis.
  - Further experiments.
  - Interpretation of experimental results.
  - Comparison with other clustering algorithms.