Web Content Summarization


WEB CONTENT
SUMMARIZATION
A Look at Algorithms, Methodologies, and Live Systems
Timothy Washington
Why is web content summarization important?



The quantity of information available on the Internet
has grown enormously, and this trend will continue for
years to come.
In contrast, the amount of time that web
users are willing to spend looking through this
information continues to decrease.
Additionally, the spread of handheld web
devices (e.g. smartphones and iPads) creates the
need to reduce text for display on small screens.
Two Main Phases of Web Content Summarization


Content Selection - weighing the importance of
information across web documents, determining what is
crucial to the web user's general understanding of the
material, and identifying which information is redundant
or unnecessary.
Content Generation - structuring the selected
content with proper syntactic structure and semantic
representation.
Algorithms and Methodologies
used in Content Selection




TF*IDF (term frequency inverse document frequency)
Lexical Chains
Machine Learning (e.g. Noisy Channel Models and N-grams)
Clickthrough Data Evaluation
TF*IDF
TF*IDF is a method used for summarizing multiple web documents. It selects sentences for a
summary based on the statistical scores of the words they contain. This score is computed from the
combination of two quantities.
For term frequency, we count the number of occurrences of a given word in the web document
currently being evaluated.
For inverse document frequency, we count how many web pages (out of the group being
summarized) contain the given word, divide the total number of web pages in the group by this
count, and take the log of the quotient.
After this is computed for each word, sentences containing the highest-scoring words are selected for
the summary until some threshold is reached.
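A minimal sketch of this scoring scheme in Python; the tokenizer, the averaging of word scores per sentence, and the fixed sentence count are simplifications for illustration, not part of any particular system.

import math
import re

def tfidf_word_scores(documents):
    # documents: one string per web page in the group being summarized
    tokenized = [re.findall(r"[a-z0-9]+", d.lower()) for d in documents]
    n_docs = len(documents)
    # document frequency: how many pages in the group contain each word
    df = {}
    for tokens in tokenized:
        for w in set(tokens):
            df[w] = df.get(w, 0) + 1
    # per-page scores: term frequency * log(N / document frequency)
    scores = []
    for tokens in tokenized:
        tf = {}
        for w in tokens:
            tf[w] = tf.get(w, 0) + 1
        scores.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return scores

def top_sentences(document, word_scores, n=3):
    # Rank the sentences of one page by the average score of the words they contain.
    sentences = re.split(r"(?<=[.!?])\s+", document)
    def avg_score(s):
        words = re.findall(r"[a-z0-9]+", s.lower())
        return sum(word_scores.get(w, 0.0) for w in words) / max(len(words), 1)
    return sorted(sentences, key=avg_score, reverse=True)[:n]

tfidf_word_scores returns one score dictionary per page; top_sentences then ranks the sentences of a single page against those scores until the desired summary size is reached.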
Lexical Chains




Lexical chaining is used in the summarization of single and multiple web
documents. In lexical chaining, words with similar meanings are linked
together in a data structure.
WordNet, an online database that groups words into sets of synonyms, is a
primary tool used in determining which words should be “chained”.
The strength of a lexical chain is based on the number of terms contained
within it.
Usually, the first sentence from each strong chain is selected for the
summary until some threshold is reached.
Example passage:
An arctic cold wave, the worst in 10 years, hit parts of Europe, leaving
many residents dead. Hardest hit by the sub-zero temperatures were
Poland, Bulgaria, and Romania. Rare snowfall occurred in southern
Italy. In Poland, three weeks of chilling temperatures killed at least
85 people in November, 29 more citizens than in all of the previous
winter …
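A minimal sketch of the chaining step described above, assuming NLTK with the WordNet corpus downloaded (nltk.download('wordnet')); the relatedness test and the count-based strength measure are simplifications of full lexical chaining.

from nltk.corpus import wordnet as wn   # requires the WordNet corpus

def related(w1, w2):
    # Two words are treated as related if they share a synset, or if one word's
    # synsets overlap the other's hypernyms/hyponyms (a simplified relation).
    s1, s2 = set(wn.synsets(w1)), set(wn.synsets(w2))
    if s1 & s2:
        return True
    neighbours = set()
    for s in s1:
        neighbours.update(s.hypernyms())
        neighbours.update(s.hyponyms())
    return bool(neighbours & s2)

def build_chains(words):
    chains = []          # each chain is a list of words with related senses
    for w in words:
        for chain in chains:
            if any(related(w, member) for member in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    # strongest chains first: strength = number of terms in the chain
    return sorted(chains, key=len, reverse=True)

Candidate words from the passage (typically nouns such as cold, temperatures, snowfall, winter) would be grouped into chains, and the first sentence containing a member of each strong chain selected for the summary.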
Noisy Channel Model


In the noisy channel model, a summarization system is
trained on a set of documents and handwritten summaries,
recording the features of the sentences selected for the
summary (as well as the features of those not selected).
These features may include sentence length, average
TF*IDF per word, sentence position in a paragraph, etc.
They are represented in vectors as binary numbers,
decimals, and strings.
Noisy Channel Model (continued)
When web documents are run through the summarizer, a feature vector is created
for each sentence. Afterwards, Bayes' rule (shown below) is applied to each sentence
using the statistics gathered during training. The N sentences with the highest
values (or all sentences scoring above a threshold) are placed in the summary.
P(s ∈ S | F1,…,FN) = P(F1,…,FN | s ∈ S) * P(s ∈ S) / P(F1,…,FN)
s – the given sentence
S – the summary
F1, F2,…,FN – the sentence feature vector
P(s ∈ S | F1,…,FN) – the probability that s should be chosen for the summary
P(s ∈ S) – the statistical probability (not based on any features) of a sentence
being chosen for a summary
P(F1,…,FN) – the probability of a sentence vector in the document set
matching that of the given sentence
P(F1,…,FN | s ∈ S) – the probability of a sentence vector in the summary set
matching that of the given sentence
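A minimal sketch of applying this rule with binary feature vectors, assuming the training data has already been split into selected and non-selected sentences; exact-vector matching without smoothing or an independence assumption is a simplification for illustration.

def bayes_score(features, selected_vectors, all_vectors):
    # features:         feature tuple of the candidate sentence
    # selected_vectors: feature tuples of training sentences chosen for summaries
    # all_vectors:      feature tuples of every training sentence
    p_s = len(selected_vectors) / len(all_vectors)              # P(s in S)
    p_f = all_vectors.count(features) / len(all_vectors)        # P(F1,...,FN)
    if p_f == 0:
        return 0.0                                              # unseen feature vector
    p_f_given_s = selected_vectors.count(features) / len(selected_vectors)  # P(F1,...,FN | s in S)
    return p_f_given_s * p_s / p_f                              # P(s in S | F1,...,FN)

A production system would typically assume the features are independent (naive Bayes) or smooth these counts, since exact matches between whole feature vectors are rare.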
N-Grams






N-grams are sequences of N words in a given sentence that are evaluated based on the
probability that they will occur in a given order.
The most popular forms of N-grams are unigrams (single words), bigrams (sequences of two
words), and trigrams (sequences of three words).
In the case of summarization, N-gram probabilities are computed from the statistics
collected over the document-summary training set.
For example, the probability of the bigrams in a candidate sentence appearing
within sentences selected for the training summaries could be evaluated.
Once again, this probability can be compared to a threshold value to
determine whether or not a sentence is fit for the summary.
The shortcoming of such an implementation is that the subject matter of the document-summary
pairs must be closely related to that of the web content being summarized.
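A minimal sketch of a bigram model estimated from training summaries and used to score a candidate sentence; whitespace tokenization and the absence of smoothing are simplifications.

from collections import Counter

def bigram_model(summary_sentences):
    # Count unigrams and bigrams over sentences drawn from the training summaries.
    unigrams, bigrams = Counter(), Counter()
    for sentence in summary_sentences:
        words = sentence.lower().split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_probability(sentence, unigrams, bigrams):
    # P(w1..wn) approximated as the product of P(w_i | w_{i-1})
    # = count(w_{i-1}, w_i) / count(w_{i-1}).
    words = sentence.lower().split()
    prob = 1.0
    for prev, cur in zip(words, words[1:]):
        if unigrams[prev] == 0:
            return 0.0   # unseen context; a real system would smooth
        prob *= bigrams[(prev, cur)] / unigrams[prev]
    return prob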
Clickthrough Data Evaluation
Clickthrough data is created by recording data sets from search engines that represent user
activity.
Each record is a triple <u, q, p>, where a user (u) enters a query (q)
and selects a page (p) from the results. Once this data has been collected over a given
period of time, there will be millions of these records available for evaluation.
Sentences can then be selected based on the number of significant words they contain. In this case the
significance of each word is determined by the number of times a user selected the given
web page from the search engine after having entered a query containing that word
(i.e., all q in the data whose record has the current p as its page).
One obvious issue with using clickthrough data is that some web pages to be summarized
may never have been accessed through a search engine. In these cases it is useful
to employ some backup method (such as TF*IDF) to select useful material for the summary.
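A minimal sketch of the significance count described above, with clickthrough records taken as (user, query, page) triples; the field handling and whitespace tokenization are assumptions.

from collections import Counter

def word_significance(click_records, page_url):
    # click_records: iterable of (user, query, page) triples from a search log.
    # For each query word, count how often a user reached page_url with a
    # query containing that word.
    counts = Counter()
    for user, query, page in click_records:
        if page == page_url:
            counts.update(set(query.lower().split()))
    return counts

def count_significant_words(sentence, counts, threshold=1):
    # Number of "significant" words in the sentence: words whose click count
    # meets the threshold.
    return sum(1 for w in sentence.lower().split() if counts[w] >= threshold)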
News In Essence
Online news article summarization system developed at the University of Michigan,
Ann Arbor.
Unique in that it uses a Trolling Algorithm to determine for itself the group of documents
to summarize, rather than taking a preset group of documents.
News In Essence (continued)
Input:  SeedUrl, SitesToSearch, ExitConditions
Output: Cluster

Cluster <- SeedUrl
WeightedKeywords <- get_common_keywords(SeedUrl, SeedUrl)
LinkedUrls <- get_links(SeedUrl)

// primary search
while UrlToTest <- next(LinkedUrls) && PrimaryExitCondition != true
    if follows_useful_rules(UrlToTest)
        LinkedUrls <- LinkedUrls + get_links(UrlToTest)
    if follows_article_rules(UrlToTest)
            && (similarity(SeedUrl, UrlToTest) > threshold)
        Cluster <- Cluster + UrlToTest
        WeightedKeywords <- WeightedKeywords
            + get_common_keywords(SeedUrl, UrlToTest)

SecSearchKeywords <- max_n(WeightedKeywords)

// secondary search
while SearchSite <- next(SitesToSearch) && SecondaryExitCondition != true
    SearchPage <- generate_search(SearchSite, SecSearchKeywords)
    LinkedUrls <- get_links(SearchPage)
    while UrlToTest <- next(LinkedUrls) && SecondaryExitCondition != true
        if follows_useful_rules(UrlToTest)
            LinkedUrls <- LinkedUrls + get_links(UrlToTest)
        if follows_article_rules(UrlToTest)
                && (similarity(SeedUrl, UrlToTest) > threshold)
            Cluster <- Cluster + UrlToTest

return Cluster
Starts by looking through a single web page,
known as the SeedUrl. Searches this page for
links to other articles that share the same topic.
After adding the linked pages to the
document set, it uses TF*IDF to compute the
words of most importance between the web
documents.
In the next phase it takes these words and
uses them to query several news search
engines to find additional articles and add
them to the document set.
It additionally follows links from these
search results to see if there are additional
pages that can be used for summarization.
Uses a series of rules to determine whether
linked pages should be excluded from the set.
One such rule is that a URL ending with “.jpg”
is evidence of a page that contains no
textual content, which deems it unfit for
the document set.
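A minimal sketch of such a rule; only the “.jpg” check comes from the description above, while the other extensions and the helper name are illustrative assumptions.

# ".jpg" is the rule named above; the remaining extensions are assumed examples.
NON_TEXT_EXTENSIONS = (".jpg", ".jpeg", ".png", ".gif", ".pdf")

def looks_like_text_page(url):
    # Hypothetical helper standing in for one of the checks that
    # follows_useful_rules / follows_article_rules would perform.
    return not url.lower().endswith(NON_TEXT_EXTENSIONS)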
Web Page Summarization using Clickthrough Data




System developed at Tsinghua University.
Determines the importance of a word from a weighted combination of its term
frequency within the web pages and its term frequency within the clickthrough
query data.
The two frequencies are combined using a weighting variable between
0 and 1.
The closer this variable is to 0, the more a word's
importance is based on term frequency within the web pages.
Likewise, the closer the variable is to 1, the more the ranking
is based on the clickthrough data.
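A minimal sketch of this combination, with alpha standing in for the 0-to-1 variable; the original slide's formula is not reproduced here, so this linear mix is an assumption based on the description above.

def term_importance(word, page_tf, query_tf, alpha):
    # page_tf:  normalized term frequency of the word within the web pages
    # query_tf: normalized term frequency of the word within clickthrough queries
    # alpha near 0 -> importance driven by the web pages themselves
    # alpha near 1 -> importance driven by the clickthrough data
    return (1 - alpha) * page_tf.get(word, 0.0) + alpha * query_tf.get(word, 0.0)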
Web Page Summarization using Clickthrough Data
(continued)
After the system determines the important terms, it applies Luhn's algorithm (shown below)
to each sentence.
1) Set a limit L for the distance at which any two significant words can be considered
related.
2) Find a portion of the sentence that is bracketed by significant words no more than L
non-significant words apart.
3) Count the number of significant words contained in this portion and divide the square
of this number by the total number of words in the portion. The result is the significance
factor of the sentence.
Sample: A default could devastate the economy, leading to a crashing dollar and skyrocketing interest rates,
among other things.
L = 4
Left bracket word = economy
Right bracket word = interest
Significant words in portion = 3
Total words in portion = 7
Significance factor of sentence = 3^2 / 7 ≈ 1.29
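A minimal sketch of steps 1–3; which words count as significant is supplied by the caller (for instance, from the clickthrough-based importance scores above), and the token counting may differ slightly from the worked example.

def significance_factor(words, significant, limit=4):
    # words: tokens of the sentence; significant: set of significant words
    positions = [i for i, w in enumerate(words) if w.lower() in significant]
    best = 0.0
    cluster_start, count = None, 0
    for i, pos in enumerate(positions):
        # Start a new portion if this is the first significant word, or if the
        # gap to the previous one exceeds `limit` non-significant words.
        if cluster_start is None or pos - positions[i - 1] - 1 > limit:
            cluster_start, count = pos, 1
        else:
            count += 1
        span = pos - cluster_start + 1          # total words in the portion
        best = max(best, count ** 2 / span)     # (significant words)^2 / total words
    return best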
NewsBlaster
Weights sentences in web pages based on various features, including the
following:
Location: If a sentence appears late in a document, its weight is reduced, as these
sentences are more likely to contain information that is redundant or less important.
Publication Date: The more recent the article a sentence belongs to, the greater the
weight it is given. This helps ensure that web users receive the most up-to-date
information.
Length: Sentences are weighted negatively if they are longer than 30
words or shorter than 15 words. Long sentences are thought to contain redundant information,
whereas short sentences are thought to be limited to supporting information.
Pronoun: A negative weight is given to sentences that start with pronouns, as these
sentences usually contain supporting information rather than primary information.
Sentences with the highest weights are selected. Afterwards, they are fused together using a
separate algorithm.
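A minimal sketch of this style of feature weighting; the particular weights, cut-offs, and pronoun list are placeholders, not NewsBlaster's actual values.

PRONOUNS = {"he", "she", "it", "they", "this", "these", "those", "we"}

def sentence_weight(sentence, position, total_sentences, article_age_days):
    words = sentence.split()
    weight = 1.0
    # Location: sentences late in the document are down-weighted.
    if position > total_sentences * 0.5:
        weight -= 0.2
    # Publication date: sentences from fresher articles contribute more.
    weight += max(0.0, 0.3 - 0.01 * article_age_days)
    # Length: penalize sentences longer than 30 or shorter than 15 words.
    if len(words) > 30 or len(words) < 15:
        weight -= 0.2
    # Pronoun: penalize sentences that start with a pronoun.
    if words and words[0].lower() in PRONOUNS:
        weight -= 0.2
    return weight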
Tests Run on NewsBlaster
Three document summary sets from NewsBlaster were tested based on the
following criteria:
Summary Length – the length of the summary compared to the average length of
the original documents.
Precision – the proportion of information in the summary that belongs in the summary (in
this case, based on the number of sentences in each summary relevant to the
title of the summary).
Sentence structure – since this is an abstractive system (one that constructs and generates
new sentences), each sentence was tested for proper grammatical structure by
running it through the Stanford parser.
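A minimal sketch of the first two measurements; the title-relevance judgement is left to a caller-supplied function, since the slides do not specify how it was made.

def summary_length_ratio(summary, documents):
    # Ratio of summary length to the average length of the original documents, in words.
    avg_doc_len = sum(len(d.split()) for d in documents) / len(documents)
    return len(summary.split()) / avg_doc_len

def precision(summary_sentences, relevant_to_title):
    # relevant_to_title: predicate judging whether a sentence is relevant to the summary's title.
    relevant = sum(1 for s in summary_sentences if relevant_to_title(s))
    return relevant / len(summary_sentences)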
Results for Summary Length
Result Set for Precision
Conclusion


Many advances have been made in this field with
the development of algorithms and methodologies
such as TF*IDF, clickthrough data evaluation, and
machine learning.
It is also clear that further research is needed,
particularly in improving precision, as many
summarization systems still fall short in this respect.