
NUS at DUC 2007:
Using Evolutionary Models of Text
Ziheng Lin, Tat-Seng Chua, Min-Yen Kan,
Wee Sun Lee, Long Qiu and Shiren Ye
Department of Computer Science
National University of Singapore, Singapore
Summarization
• Traditionally, weighted heuristics were used to select sentences
• With the advent of machine learning, heuristic weights can be tuned
• In the last few years, graphical representations of text have shed new light on the summarization problem
• TextRank and LexRank allow us to naturally incorporate context as a continuum
• How can we enhance this representational model?
Prestige as sentence selection
• One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network
• PageRank used a random walk to smooth the effect of non-local context
  – This led to TextRank and LexRank
• Contrast with previous graphical approaches (Salton et al., 1994)
• Did we leave anything out of our representation for summarization?
  Yes: the notion of an evolving network
Social networks change!
Naturally evolving networks (Dorogovtsev and Mendes, 2001):
– Citation networks: new papers can cite old ones, but the old network is static
– The Web: new pages are added, with an old page connecting each new page to the web graph; old pages may update their links
Evolutionary models for summarization
Writers and readers often follow conventional rhetorical styles: articles are not written or read in an arbitrary way
Consider the evolution of texts using a very simplistic model:
– Writers write from the first sentence onwards of a text
– Readers read from the first sentence onwards of a text
• A simple model: sentences get added to the graph incrementally
Timestamped Graph Construction
Approach
– These assumptions suggest iteratively adding sentences to the graph in chronological order
– At each iteration, consider which edges to add to the graph
– For a single document, this is simple and straightforward: add the 1st sentence, followed by the 2nd, and so forth, until the last sentence is added
– For multi-document input, treat it as multiple instances of single documents evolving in parallel; i.e., add the 1st sentences of all documents, followed by all the 2nd sentences, and so forth (see the sketch below)
• This doesn't really model chronological ordering between articles; we fix this later with skewing
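A minimal sketch of this construction in Python, assuming e = 1 edge per new vertex and an arbitrary sentence-similarity function (the names and the use of networkx are illustrative, not the paper's actual implementation):

```python
# Sketch: timestamped graph (TSG) construction, adding the t-th
# sentence of every document in parallel at timestep t.
import networkx as nx

def build_tsg(documents, similarity):
    """documents: list of documents, each a list of sentence strings.
    similarity: function(str, str) -> float. Returns an undirected graph
    whose nodes are (doc_index, sent_index) pairs."""
    graph = nx.Graph()
    max_len = max(len(doc) for doc in documents)
    for t in range(max_len):
        new_nodes = [(i, t) for i, doc in enumerate(documents) if t < len(doc)]
        for i, j in new_nodes:
            graph.add_node((i, j), text=documents[i][j])
        # e = 1: connect each new vertex to its most similar existing node.
        for node in new_nodes:
            others = [v for v in graph.nodes if v != node]
            if not others:
                continue  # the very first sentence has nothing to link to
            best = max(others, key=lambda v: similarity(
                graph.nodes[node]["text"], graph.nodes[v]["text"]))
            graph.add_edge(node, best)
    return graph
```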
Timestamped Graph Construction
Model:
• Documents as columns
  – di = document i
• Sentences as rows
  – sj = the jth sentence of a document
Timestamped Graph Construction
• A multi-document example
[Figure: a grid of sentence nodes with columns doc1, doc2, doc3 and rows sent1, sent2, sent3; at each timestep one row of sentences joins the graph]
An example TSG: DUC 2007 D0703A-A
[Figure: the timestamped graph built from cluster D0703A-A]
Timestamped Graph Construction
Formalization of TSGs:
– The example above is just one instance of a TSG
– Let's generalize and formalize the TSG algorithm
– A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies an algorithm which takes as input the set of texts M and outputs a graph G
• Salient parameters for TSGs:
  – e: number of edges to add per vertex per timestep
  – u: unweighted or weighted edges
  – σ: vertex selection function σ(u, G)
  – s: skew degree
• For a description of the other parameters, see our TextGraphs-2 paper
Timestamped Graph Construction
• Vertex selection function σ(u, G)
  – One strategy: among those nodes not yet connected to u in G, choose the one that has the highest similarity with u
  – Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
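A sketch of this selection strategy with cosine similarity over bags of words (a stand-in; the system also supports Jaccard and concept links):

```python
import math
from collections import Counter

def cosine(tokens_a, tokens_b):
    """Cosine similarity between two token lists."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(c * c for c in a.values())) * \
           math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def sigma(u, graph, tokens):
    """Among nodes not yet connected to u in graph, return the one
    most similar to u; tokens maps node -> token list."""
    candidates = [v for v in graph.nodes
                  if v != u and not graph.has_edge(u, v)]
    if not candidates:
        return None
    return max(candidates, key=lambda v: cosine(tokens[u], tokens[v]))
```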
Timestamped Graph Construction
• Skew degree s
  – Models how nodes in multi-document graphs are added
  – Skew degree = how many iterations to wait before adding the 1st sentence of the next document
• Motivation
  – Up to now, the TSG models assume that the authors started writing all the documents at the same time
  – In reality, some documents are authored later than others, giving updates or reporting changes
  – We can infer this information from article timestamps or from date extraction on the articles themselves (see the sketch below)
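A fixed skew degree s can be realized by delaying each document's start, as in this sketch (documents assumed pre-sorted by publication time):

```python
def sentences_added_at(documents, t, skew_degree):
    """Return the (doc_index, sent_index) pairs that enter the graph at
    iteration t when document i starts i * skew_degree iterations late.
    skew_degree = 0 recovers the unskewed, parallel construction."""
    added = []
    for i, doc in enumerate(documents):
        j = t - i * skew_degree
        if 0 <= j < len(doc):
            added.append((i, j))
    return added
```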
Skew Degree Examples
• time(d1) < time(d2) < time(d3) < time(d4)
[Figure: three example graphs over documents d1-d4, skewed by 1, skewed by 2, and freely skewed]
• Freely skewed: only add a new document when it would be linked to by some node under the vertex selection function σ
Timestamped Graph Construction
• Representations
  – We can model a number of different algorithms using this 9-tuple formalism: (d, e, u, f, σ, t, i, s, τ)
  – The toy example given earlier: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
  – LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null)
    N = total number of sentences in the cluster; Lmax = the maximum document length
    i.e., all sentences are added to the graph in one timestep, each is connected to all others, and cosine scores are used as edge weights
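The two instantiations written out as plain tuples, with values copied from the slide (strings stand in for the functions and symbols of the formalism):

```python
# Field order: (d, e, u, f, sigma, t, i, s, tau)
TOY_EXAMPLE = ("f", 1, 0, 1, "max-cosine-based", "sentence", 1, 0, None)
# LexRank: every sentence enters in one timestep and links to all
# others, with cosine scores as edge weights.
LEXRANK = ("u", "N", 1, 1, "cosine-based", "sentence", "Lmax", 0, None)
```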
Summarization using TSGs
System Overview
• Sentence splitting
  – Detect and mark sentence boundaries
  – Annotate each sentence with the doc ID and the sentence number (see the sketch below)
  – E.g., XIE19980304.0061: 4 March 1998 from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
• Graph construction
  – Construct the TSG in this phase
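A sketch of the annotation step; the naive regex splitter here is a placeholder for whatever boundary detector the system actually used:

```python
import re

def annotate_sentences(doc_id, text):
    """Split text into sentences and tag each one '<docID>-<n>',
    e.g. 'XIE19980304.0061-14' for the 14th sentence."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [(f"{doc_id}-{n}", s)
            for n, s in enumerate(sentences, start=1) if s]
```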
System Overview
• Sentence ranking
  – Apply a topic-sensitive random walk on the graph to redistribute the weights of the nodes
• Sentence extraction
  – Extract the top-ranked sentences
  – Two different modified MMR rerankers are used, depending on whether it is the main or the update task
Differences for main and update task processing
Main task:
1. Construct a TSG for the input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply the first modified version of MMR to extract sentences

Update task:
• Cluster A:
  – Construct a TSG for cluster A
  – Run topic-sensitive PageRank on the TSG
  – Apply the second modified version of MMR to extract sentences
• Cluster B:
  – Construct a TSG for clusters A and B
  – Run topic-sensitive PageRank on the TSG; only retain sentences from B
  – Apply the second modified version of MMR to extract sentences
• Cluster C:
  – Construct a TSG for clusters A, B and C
  – Run topic-sensitive PageRank on the TSG; only retain sentences from C
  – Apply the second modified version of MMR to extract sentences
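A sketch of the update-task pipeline; rank_fn and rerank_fn are schematic stand-ins for the TSG + topic-sensitive PageRank stage and the second modified MMR:

```python
def summarize_update_task(clusters, rank_fn, rerank_fn):
    """clusters: clusters A, B, C, ... in reading order, each a list of
    documents. rank_fn(docs) -> {(doc_idx, sent_idx): score} builds a
    TSG over docs and runs topic-sensitive PageRank;
    rerank_fn(candidates, previous) applies the update-task MMR."""
    summaries, previous = [], []
    for k in range(len(clusters)):
        docs = [d for cluster in clusters[: k + 1] for d in cluster]
        first_new = len(docs) - len(clusters[k])  # first doc of cluster k
        scores = rank_fn(docs)
        # Only retain sentences from the newest cluster.
        candidates = {n: s for n, s in scores.items() if n[0] >= first_new}
        summary = rerank_fn(candidates, previous)
        summaries.append(summary)
        previous.extend(summary)
    return summaries
```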
Sentence Ranking
• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: an iterative method that allows the weights of the nodes to redistribute until stability is reached
• Similarities serve as weighted edges; the query Q makes the walk topic-sensitive:
$$p(s) \;=\; \underbrace{d \cdot \frac{\mathrm{rel}(s \mid Q)}{\sum_{z \in C} \mathrm{rel}(z \mid Q)}}_{\text{topic-sensitive (Q) portion}} \;+\; \underbrace{(1 - d) \sum_{v \in \mathrm{adj}(s)} \frac{\mathrm{sim}(s, v)}{\sum_{z \in \mathrm{adj}(v)} \mathrm{sim}(z, v)}\, p(v)}_{\text{standard random walk term}}$$

where C is the set of all sentences in the cluster and d is the damping factor
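A power-iteration sketch of this ranking in the two-term form above (the damping value d = 0.85 is an assumption, not taken from the slide):

```python
def topic_sensitive_pagerank(sim, rel, d=0.85, iters=100, tol=1e-6):
    """sim[i][j]: edge weight between sentences i and j (0 if none).
    rel[i]: relevance of sentence i to the topic Q.
    Returns prestige scores, iterated until stable."""
    n = len(rel)
    rel_total = sum(rel) or 1.0
    out = [sum(row) or 1.0 for row in sim]  # outgoing weight per node
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = [d * rel[i] / rel_total
                 + (1 - d) * sum(sim[j][i] / out[j] * p[j] for j in range(n))
                 for i in range(n)]
        if max(abs(a - b) for a, b in zip(new_p, p)) < tol:
            return new_p
        p = new_p
    return p
```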
Sentence Extraction – Main task
• Original MMR integrates a penalty based on the maximal similarity between the candidate and any single selected item
• Ye et al. (2005) introduced a modified MMR that integrates a penalty based on the total similarity of the candidate sentence to all selected sentences:
$$\mathrm{MMR}(s) \;=\; \lambda \cdot \mathrm{Score}(s) \;-\; (1 - \lambda) \underbrace{\sum_{s_i \in S} \mathrm{sim}(s, s_i)}_{\text{penalty: all previous sentence similarity}}$$

• Score(s) = PageRank score of s; S = the set of already-selected sentences
• This reranker is used in the main task
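A greedy sketch of this reranker; λ = 0.8 follows the evaluation slide, while the exact penalty normalization (the role of δ) is left out:

```python
def mmr_main(scores, texts, similarity, k, lam=0.8):
    """Greedily extract k sentences. scores: node -> PageRank score;
    penalty: total similarity to ALL already-selected sentences."""
    selected, candidates = [], set(scores)
    while candidates and len(selected) < k:
        def mmr(s):
            penalty = sum(similarity(texts[s], texts[t]) for t in selected)
            return lam * scores[s] - (1 - lam) * penalty
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```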
Sentence Extraction – Update task
• The update task assumes readers have already read the previous cluster(s)
  – This implies we should not select sentences that carry information redundant with the previous cluster(s)
• We propose a modified MMR for the update task:
  – Consider the total similarity of the candidate sentence with all selected sentences and with sentences in the previously-read cluster(s)
$$\mathrm{MMR}_{\mathrm{update}}(s) \;=\; \lambda \cdot \mathrm{Score}(s) \;-\; (1 - \lambda)\Big[\sum_{s_i \in S} \mathrm{sim}(s, s_i) \;+\; \underbrace{\sum_{p \in P} \mathrm{sim}(s, p)}_{\text{previous cluster overlap}}\Big]$$

• P contains the top-ranked sentences from the previously-read cluster(s)
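The update-task variant simply extends the penalty with the previously-read pool P (again a sketch; the weights δ and γ from the evaluation slide are not modeled exactly):

```python
def mmr_update(scores, texts, similarity, prev_pool, k, lam=0.8):
    """Like mmr_main, but also penalizes similarity to prev_pool (P),
    the top-ranked sentences of previously-read cluster(s)."""
    selected, candidates = [], set(scores)
    while candidates and len(selected) < k:
        def mmr(s):
            penalty = sum(similarity(texts[s], texts[t]) for t in selected)
            penalty += sum(similarity(texts[s], texts[p]) for p in prev_pool)
            return lam * scores[s] - (1 - lam) * penalty
        best = max(candidates, key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected
```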
Evaluation and Analysis
Macroscopic Evaluation
Main task parameterization:
– Graph construction: (u, 1, 1, 1, concept-link-based, sentence, 1, 0, null)
– Sentence extraction: λ = 0.8 and δ = 6, tuned on the DUC 2005 and 2006 datasets
– Ranked 12th for ROUGE-2 and 10th for ROUGE-SU4 among 32 systems
Macroscopic Evaluation
• Update task parameterization:
  – Graph construction: (u, 1, 1, 1, concept-link-based, sentence, 1, 0, null)
  – Sentence extraction: λ = 0.8, δ = 3 and γ = 6, based on our experience
  – Ranked 3rd in ROUGE-2, 4th in ROUGE-SU4, and 6th in average pyramid scores among 24 systems
What do we think?
• Better performance in the update task
• The TSG is better tailored to update summaries
• The second modified version of MMR works well at filtering out information that is redundant with previously-read cluster(s)
Conclusion
• Proposed a timestamped graph model for text understanding and summarization
  – Adds sentences one at a time
• Parameterized the model with nine variables
  – Explained several important variables of the iterative TSG formalism
• Modified MMR reranking to fit the update task
Future Work
• The freely skewed model
• Empirical and theoretical properties of TSGs (e.g., in-degree distribution)
Backup Slides
References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22.
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
• S. N. Dorogovtsev and J. F. F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics, March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.