國立雲林科技大學
National Yunlin University of Science and Technology
Web-Page Summarization Using
Clickthrough Data
Advisor : Dr. Hsu
Graduate : Jing Wei Lin
Authors : Jian-Tao Sun, Dou Shen, Hua-Jun Zeng, Qiang Yang, Yuchang Lu, Zheng Chen
2005 ACM SIGIR
Intelligent Database Systems Lab
N.Y.U.S.T.
I. M.
Outline







Motivation
Objective
Summarize Web Pages Using Clickthrough Data
Experimental Results
Discussions
Conclusions
Personal Opinions
Motivation

Many summarization methods do not consider the hidden relationships in the Web.
─ Uncovering this hidden knowledge is important in building good Web-page summarizers.
Objective

We extract the extra knowledge from the
clickthrough data of a Web search engine to
improve Web-page summarization.
Empirical Study on Clickthrough Data
Clickthrough data can be represented by a set of triples <u, q, p>:
a user (u) issues a query (q) and clicks on the pages (p) of interest.
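As a minimal sketch of the triple representation, the Python below groups the query words of a clickthrough log by clicked page. The log entries, user IDs, and page names are made up for illustration; the function name `query_words_by_page` is hypothetical and not from the paper.

```python
from collections import Counter, defaultdict

# Hypothetical clickthrough log: (user, query, clicked page) triples.
clickthrough = [
    ("u1", "cheap flights taipei", "pageA"),
    ("u2", "taipei flights", "pageA"),
    ("u3", "horror movie", "pageB"),
]

def query_words_by_page(triples):
    """Map each clicked page to a Counter of the query words that led to it."""
    words = defaultdict(Counter)
    for _user, query, page in triples:
        # Assumption: queries are split on whitespace and lowercased.
        words[page].update(query.lower().split())
    return words

page_words = query_words_by_page(clickthrough)
print(page_words["pageA"].most_common())
```

The per-page counters are the "query word collection" that the later slides use to reweight terms.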
Experiment 1:
Among the 260,763 Web pages accessed by users during one month of clickthrough data, 109,694 contain "KEYWORD" metadata.
Result: 45.5% of the keywords occur in the query words, and 13.1% of the query words appear as keywords.
Experiment 2:
We collected 90 pages covered by the clickthrough data and asked three human evaluators to conduct a manual summarization task.
Result (evaluators not told the query words): 58% of the sentences in the original Web page contain query words, and each sentence contains 1.48 query words on average.
Result (evaluators told the query words): the percentage of sentences containing query words becomes 71.3%, and the average number of query words per sentence becomes 2.0.
Adapted Web-page Summarization
Methods--Adapted Significant Word

To compute the significance factor of each sentence, a set of significant words is constructed first.
Each candidate word is assigned a significance factor wi, given in Equation 1.

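In Luhn's method (example sentence: "On a certain day of a certain month of a certain year, Chen Yi-hsiung shot President Chen Shui-bian"):
(1) Set a limit L for the distance at which any two significant words could be considered as being significantly related.
(2) Find a portion of the sentence that is bracketed by significant words not more than L non-significant words apart (e.g. Chen Shui-bian, President / Chen Yi-hsiung).
(3) Count the number of significant words contained in the portion and divide the square of this number by the total number of words within the portion.
Significance factor of a sentence = Ns^2 / Nall, where Ns is the number of significant words in the portion and Nall is the total number of words in the portion.
Example: Ns = 6, Nall = 10 gives 6^2 / 10 = 3.6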
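Steps (1) to (3) can be sketched in Python as follows. This is an illustrative reading of Luhn's procedure as described in step (3) (square of the significant-word count divided by the portion length), not the authors' code. The function name, the default `limit`, and the choice to take the best-scoring portion as the sentence score are assumptions.

```python
def luhn_sentence_score(words, significant, limit=4):
    """Return the best Ns^2 / Nall over all portions of the sentence.

    A portion starts and ends with a significant word, and consecutive
    significant words inside it are at most `limit` non-significant
    words apart.
    """
    positions = [i for i, w in enumerate(words) if w in significant]
    if not positions:
        return 0.0
    best = 0.0
    start = prev = positions[0]
    count = 1
    for pos in positions[1:]:
        if pos - prev - 1 <= limit:
            # Close enough: the current portion keeps growing.
            count += 1
        else:
            # Gap too large: score the finished portion, start a new one.
            best = max(best, count ** 2 / (prev - start + 1))
            start, count = pos, 1
        prev = pos
    return max(best, count ** 2 / (prev - start + 1))

# Toy sentence: 10 words, 6 of them significant, all in one portion.
words = "a x b x c x d x e f".split()
significant = {"a", "b", "c", "d", "e", "f"}
print(luhn_sentence_score(words, significant))
```

With 6 significant words in a 10-word portion, the score is 6^2 / 10.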
Adapted Web-page Summarization
Methods--ALSA

Gong et al. proposed a summarization algorithm:
─ 1. A term-sentence matrix is constructed from the original text document.
─ 2. LSA is conducted on the matrix. Each element of the resulting sentence vectors measures the importance of the sentence on the corresponding latent concept.
─ 3. A document summary is produced incrementally.
We utilize the query-word knowledge by changing the term-sentence
matrix: if a term occurs as a query word, its weight is increased
according to its frequency in the query-word collection.
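The matrix adjustment can be sketched as below. The slide does not give the exact weighting, so the formula used here (raw term frequency scaled by `1 + boost * query frequency`) is an assumption for illustration only, as are the function name and the `boost` parameter. The LSA step itself is not shown; it would be applied to the returned matrix afterwards.

```python
from collections import Counter

def term_sentence_matrix(sentences, query_freq, boost=1.0):
    """Build a term-by-sentence matrix; rows are terms, columns are sentences.

    Assumed weighting (not from the paper): raw term frequency, scaled by
    (1 + boost * query frequency) when the term occurs as a query word.
    """
    counts = [Counter(s.lower().split()) for s in sentences]
    terms = sorted({term for c in counts for term in c})
    matrix = []
    for term in terms:
        factor = 1.0 + boost * query_freq.get(term, 0)
        matrix.append([factor * c[term] for c in counts])
    return terms, matrix

sentences = ["cheap flights to taipei", "the weather is mild"]
query_freq = {"flights": 2, "taipei": 1}  # hypothetical query-word counts
terms, matrix = term_sentence_matrix(sentences, query_freq)
for term, row in zip(terms, matrix):
    print(term, row)
```

Terms that never occur as query words keep their original weight, so the adjustment only shifts emphasis toward sentences that contain query words.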


Summarize Web Pages Not Covered
by Clickthrough Data


TS(c): a set of terms associated with category c
The thematic lexicon is the set of all TS
[Figure: example categories in the taxonomy: Arts, Movies, TV, Music]
Step 1: The TS corresponding to each category is set to empty.
Step 2: For each page covered by the clickthrough data, its query
words are added into the TS of the page's categories, and each query
word's frequency is added to its original weight in TS.
Step 3: Each term weight in each TS is multiplied by the term's Inverse
Category Frequency (ICF). The ICF value of a term is the
reciprocal of the number of categories of the hierarchical taxonomy
in which the term occurs.
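Steps 1 to 3 can be sketched in Python as below. The page names, categories, and query-word counts are invented for illustration, and `build_thematic_lexicon` is a hypothetical name. The sketch also assumes a flat mapping from page to categories, ignoring any parent-child structure in the taxonomy.

```python
from collections import Counter, defaultdict

def build_thematic_lexicon(page_categories, page_query_words):
    """Return {category: {term: weight}} following Steps 1 to 3."""
    # Step 1: every TS starts empty.
    lexicon = defaultdict(Counter)
    # Step 2: add each covered page's query-word frequencies to the TS
    # of every category the page belongs to.
    for page, query_words in page_query_words.items():
        for category in page_categories.get(page, ()):
            lexicon[category].update(query_words)
    # Step 3: multiply by ICF = 1 / (number of categories containing the term).
    category_count = Counter(term for ts in lexicon.values() for term in ts)
    return {
        category: {term: w / category_count[term] for term, w in ts.items()}
        for category, ts in lexicon.items()
    }

page_categories = {"p1": ["Movies"], "p2": ["Music"]}
page_query_words = {
    "p1": {"horror": 3, "review": 1},
    "p2": {"album": 2, "review": 1},
}
lexicon = build_thematic_lexicon(page_categories, page_query_words)
print(lexicon["Movies"])
```

In this toy data "review" appears in both categories, so its weight is halved, while category-specific terms such as "horror" keep their full weight.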


Look up the lexicon for the TS that matches the page's category.
Use the summarization methods proposed in the paper.
─ Weights of the terms in TS can be used to select significant words or to update
the term-sentence matrix. Ex. Final Destination 3 (thriller, Death, ...)
Experimental Results
[Figure: results comparing "Ignore clickthrough data" with "Only query word"]
Experimental Results (cont.)
[Figure: results comparing "Ignore clickthrough data" with "Using Lexicon"]
Experimental Results (cont.)
[Figure: four result panels, each labeled "Top"]
Discussions

The thematic lexicon built from clickthrough data can
discover the topic terms associated with a specific
category and the ICF-based approach can effectively
assign weights to terms of this category.
Conclusions

We leverage extra knowledge from clickthrough data to improve
Web-page summarization.

For the pages which are not covered by the clickthrough data, we
build a thematic lexicon using the clickthrough data in
conjunction with an available hierarchical Web directory.
Personal Opinions



Advantage
Drawback
Opinions: use clickthrough data to build a personalized Knowledge+ (知識+) service