Timestamped Graphs: Evolutionary Models of Text for Multi-document Summarization
Ziheng Lin and Min-Yen Kan
Department of Computer Science
National University of Singapore, Singapore
Summarization
• Traditionally, heuristics for extractive summarization
– Cue/stigma phrases
– Sentence position (relative to document, section,
paragraph)
– Sentence length
– TF×IDF, TF scores
– Similarity (with title, context, query)
• With the advent of machine learning, heuristic weights for
different features are tuned by supervised learning
• In the last few years, graphical representations of text have
shed new light on the summarization problem
Prestige as sentence selection
• One motivation for using graphical methods was to model the problem as finding the prestige of nodes in a social network
• PageRank used a random walk to smooth the effect of non-local context
• HITS and SALSA modeled hubs and authorities
• In summarization, these led to TextRank and LexRank
• Contrast with earlier graphical approaches (Salton et al., 1994)
• Did we leave anything out of our representation for summarization?
  Yes: the notion of an evolving network
Social networks change!
Naturally evolving networks (Dorogovtsev and Mendes, 2001)
– Citation networks: new papers can cite old ones, but the old network is static
– The Web: new pages are added, with an old page connecting each of them to the web graph; old pages may update their links
Evolutionary models for summarization
• Writers and readers often follow conventional rhetorical styles – articles are not written or read in an arbitrary way
• Consider the evolution of texts with a very simplistic model:
  – Writers write from the first sentence of a text onwards
  – Readers read from the first sentence of a text onwards
• A simple model: sentences are added to the graph incrementally
Timestamped Graph Construction
Approach
– These assumptions suggest iteratively adding sentences to the graph in chronological order.
– At each iteration, consider which edges to add to the graph.
– For a single document this is simple and straightforward: add the 1st sentence, followed by the 2nd, and so forth, until the last sentence is added.
– For multiple documents: treat them as multiple instances of single documents that evolve in parallel; i.e., add the 1st sentences of all documents, followed by all 2nd sentences, and so forth (see the sketch below).
• This doesn't really model chronological ordering between articles; we fix that later with skew.
Timestamped Graph Construction
Model:
• Documents as columns
  – di = document i
• Sentences as rows
  – sj = the jth sentence of a document
Timestamped Graph Construction
• A multi-document example
[Diagram: toy TSG with documents doc1–doc3 as columns and sentences sent1–sent3 as rows]
An example TSG: DUC 2007 D0703A-A
Timestamped Graph Construction
These are just single instances of TSGs
Let's generalize and formalize them
Def: A timestamped graph algorithm tsg(M) is a 9-tuple (d, e, u, f, σ, t, i, s, τ) that specifies an algorithm which takes a set of texts M as input and outputs a graph G
– (d, e, u, f) are properties of edges; (σ, t, i, s) are properties of nodes; τ is the input text transformation function
Edge properties (d, e, u, f)
• Edge Direction (d)
– Forward, backward, or undirected
• Edge Number (e)
– number of edges to instantiate per timestep
• Edge Weight (u)
– weighted or unweighted edges
• Inter-document factor (f)
– penalty factor for links between documents
in multi-document sets.
Node properties (σ, t, i, s)
• Vertex selection function σ(u, G)
  – One strategy: among those nodes not yet connected to u in G, choose the one with the highest similarity to u
  – Similarity functions: Jaccard, cosine, concept links (Ye et al., 2005)
• Text unit type (t)
– Most extractive algorithms use sentences as elementary units
• Node increment factor (i)
– How many nodes get added at each timestep
• Skew degree (s)
– Models how nodes in multi-document graphs are added
– Skew degree = how many iterations to wait before adding the 1st
sentence of the next document
– Let’s illustrate this …
Skew Degree Examples
time(d1) < time(d2) < time(d3) < time(d4)
[Diagrams: graphs over documents d1–d4 built skewed by 1, skewed by 2, and freely skewed]
• Skewed by 1 / skewed by 2: wait 1 (resp. 2) iterations before adding the 1st sentence of the next document
• Freely skewed: only add a new document when one of its nodes would be linked by some node using the vertex selection function σ
Input text transformation function (τ)
• Document segmentation function (τ)
  – Problem observed in some multi-document clusters: some documents are very long
  – It takes many timesteps to introduce all of their sentences, causing too many edges to be drawn
  – τ segments long documents into several sub-documents
[Diagram: long document d5 segmented into sub-documents d5a and d5b]
• This solution is rather ad hoc – we hope to investigate it further in current and future work
Timestamped Graph Construction
• Representations
  – We can model a number of different algorithms using this 9-tuple formalism: (d, e, u, f, σ, t, i, s, τ)
  – The toy example given earlier: (f, 1, 0, 1, max-cosine-based, sentence, 1, 0, null)
  – LexRank graphs: (u, N, 1, 1, cosine-based, sentence, Lmax, 0, null), where N = total number of sentences in the cluster and Lmax = the maximum document length; i.e., all sentences are added into the graph in one timestep, each connected to all others, with cosine scores as edge weights (see the sketch below)
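To make the formalism concrete, here is a hypothetical configuration record for the 9-tuple; the field names and placeholder values are our own, mirroring the two instantiations above, not the authors' code.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class TSGSpec:
    """The 9-tuple (d, e, u, f, sigma, t, i, s, tau) as a config record."""
    d: str                   # edge direction: 'f'orward, 'b'ackward, 'u'ndirected
    e: int                   # number of edges to instantiate per timestep
    u: int                   # 0 = unweighted, 1 = weighted edges
    f: float                 # inter-document penalty factor
    sigma: str               # vertex selection function, e.g. 'max-cosine'
    t: str                   # text unit type, e.g. 'sentence'
    i: int                   # node increment factor (nodes added per timestep)
    s: int                   # skew degree
    tau: Optional[Callable]  # input text transformation (None = identity)

# The toy example from the slides:
toy = TSGSpec('f', 1, 0, 1, 'max-cosine', 'sentence', 1, 0, None)

# LexRank graphs: all N sentences added in one timestep, fully connected,
# weighted. N and L_max depend on the cluster; placeholder values here.
N, L_max = 250, 30
lexrank = TSGSpec('u', N, 1, 1, 'cosine', 'sentence', L_max, 0, None)
```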
TSG-based summarization
Methodology
Evaluation
Analysis
System Overview
• Sentence splitting
  – Detect and mark sentence boundaries
  – Annotate each sentence with the document ID and the sentence number (see the sketch after this list)
  – E.g., XIE19980304.0061: the 4 March 1998 article from Xinhua News; XIE19980304.0061-14: the 14th sentence of this document
• Graph construction
  – Construct the TSG in this phase
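A minimal sketch of this annotation step, assuming a naive regex-based splitter (a real system would use a trained sentence splitter; the function name is ours):

```python
import re

def split_and_annotate(doc_id, text):
    """Naive sentence splitter; tags each sentence as docID-sentenceNumber."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    return [(f"{doc_id}-{n}", s) for n, s in enumerate(sentences, start=1)]

# split_and_annotate("XIE19980304.0061", raw_text) yields pairs like
# ("XIE19980304.0061-14", "<the 14th sentence of this document>")
```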
System Overview
• Sentence ranking
  – Apply a topic-sensitive random walk on the graph to redistribute the weights of the nodes
• Sentence extraction
  – Extract the top-ranked sentences
  – Two different modified MMR rerankers are used, depending on whether it is the main or the update task
Evaluation
• Datasets: DUC 2005, 2006 and 2007
• Evaluation tool: ROUGE, an n-gram-based automatic evaluation
• Each dataset contains 45 or 50 clusters; each cluster contains a query and 25 documents
• We evaluate several parameters:
  – Do different e values affect the summarization process?
  – How do topic-sensitivity and edge weighting perform in running PageRank?
  – How does skewing the graph affect the information flow in the graph?
Evaluation on number of edges (e)
• We tried different e values
• Optimal performance: e = 2
• At e = 1, the graph is too loosely connected and not suitable for PageRank → very low performance
• At e = N, the system reduces to LexRank
Evaluation (other edge parameters)
• PageRank: generic vs topic-sensitive
• Edge weight (u): unweighted vs weighted
• Optimal performance: topic-sensitive PageRank and weighted edges
Topic-sensitive   Weighted edges   ROUGE-1   ROUGE-2
No                No               0.39358   0.07690
Yes               No               0.39443   0.07838
No                Yes              0.39823   0.08072
Yes               Yes              0.39845   0.08282
Evaluation on skew degree (s)
• Different skew degrees: s = 0, 1 and 2
• Optimal performance: s = 1
• s = 2 introduces a delay interval that is too large
Skew degree   ROUGE-1   ROUGE-2
0             0.36982   0.07580
1             0.37268   0.07682
2             0.36998   0.07489
• We still need to try freely skewed graphs
Holistic Evaluation in DUC
• We participated in DUC 2007 with an extractive TSG-based system
• Main task: 12th for ROUGE-2, 10th for ROUGE-SU4 among 32 systems
• Update task: 3rd for ROUGE-2, 4th for ROUGE-SU4 among 24 systems
• We used a modified version of maximal marginal relevance to penalize links to previously read articles
  – An extension of the inter-document factor (f)
• The TSG formalism is better tailored to update / incremental text tasks
• A new method that may be competitive with current approaches
  – Other top-scoring systems may do sentence compression, not just extraction
Conclusion
• Proposed a timestamped graph model for text understanding and summarization
  – Adds sentences one at a time
• Parameterized the model with nine variables
  – Canonicalizes the representation of several graph-based summarization algorithms
Future Work
• Freely skewed model
• Empirical and theoretical properties of TSGs
(e.g., in-degree distribution)
Backup Slides
Differences for main and update task processing
Main task:
1. Construct a TSG for the input cluster
2. Run topic-sensitive PageRank on the TSG
3. Apply the first modified version of MMR to extract sentences
Update task (see the sketch after this list):
• Cluster A:
– Construct a TSG for cluster A
– Run topic-sensitive PageRank on the TSG
– Apply the second modified version of MMR
to extract sentences
• Cluster B:
– Construct a TSG for clusters A and B
– Run topic-sensitive PageRank on the TSG;
only retain sentences from B
– Apply the second modified version of MMR
to extract sentences
• Cluster C:
– Construct a TSG for clusters A, B and C
– Run topic-sensitive PageRank on the TSG;
only retain sentences from C
– Apply the second modified version of MMR
to extract sentences
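A condensed sketch of this incremental flow; rank_fn and mmr_update_fn are hypothetical stand-ins for TSG construction plus topic-sensitive PageRank and for the second modified MMR, not the authors' code.

```python
def summarize_update_task(clusters, query, rank_fn, mmr_update_fn):
    """clusters: ordered list of sentence lists, e.g. [A, B, C].
    rank_fn(sentences, query) -> {sentence: score}: stand-in for building
    a TSG over everything read so far and running topic-sensitive PageRank.
    mmr_update_fn(scored, previous) -> extracted sentences: stand-in for the
    second modified MMR, which penalizes overlap with previous clusters."""
    summaries, read_so_far = [], []
    for cluster in clusters:
        read_so_far.extend(cluster)          # TSG over A, then A+B, then A+B+C
        scores = rank_fn(read_so_far, query)
        members = set(cluster)
        fresh = {s: v for s, v in scores.items() if s in members}  # retain only the newest cluster
        previous = [s for summ in summaries for s in summ]         # proxy for P: sentences already output
        summaries.append(mmr_update_fn(fresh, previous))
    return summaries
```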
Sentence Ranking
• Once a timestamped graph is built, we want to compute a prestige score for each node
• PageRank: an iterative method that redistributes the weights of the nodes until stability is reached
• Similarities as edges → weighted edges; query → topic-sensitive
[Equation figure, annotated with a topic-sensitive (Q) portion and a standard random-walk term]
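The formula itself did not survive transcription; a hedged reconstruction of the standard topic-sensitive, weighted PageRank that the annotations describe (with damping factor d and query Q, following the usual topic-sensitive LexRank formulation rather than a verbatim copy of the slide):

```latex
\mathrm{PR}(u) = d \cdot \underbrace{\frac{\mathrm{rel}(u \mid Q)}{\sum_{v} \mathrm{rel}(v \mid Q)}}_{\text{topic-sensitive }(Q)\text{ portion}}
 + (1 - d) \cdot \underbrace{\sum_{v \in \mathrm{adj}(u)} \frac{w(v, u)}{\sum_{z \in \mathrm{adj}(v)} w(v, z)} \, \mathrm{PR}(v)}_{\text{standard random-walk term}}
```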
Sentence Extraction – Main task
• Original MMR: penalizes a candidate by its maximal similarity to any single already-selected item
• Ye et al. (2005) introduced a modified MMR: it penalizes a candidate sentence by its total similarity to all selected sentences
[Equation figure, annotated: the penalty sums the candidate's similarity to all previously selected sentences]
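The equation is lost in the transcript; a hedged reconstruction consistent with the description above (λ is an assumed balancing weight):

```latex
\mathrm{MMR}(s) = \mathrm{Score}(s) - \lambda \sum_{s_j \in S} \mathrm{Sim}(s, s_j)
```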
• Score(s) = PageRank score of s; S = selected sentences
• This is used in the main task
Sentence Extraction – Update task
• The update task assumes readers have already read the previous cluster(s)
  – This implies we should not select sentences whose information is redundant with the previous cluster(s)
• We propose a modified MMR for the update task:
  – Consider the total similarity of the candidate sentence with all selected sentences and with sentences in the previously-read cluster(s)
[Equation figure, annotated: an additional penalty for overlap with the previous cluster(s)]
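Again the formula is lost; a hedged reconstruction that extends the main-task reranker with a previous-cluster penalty (μ is an assumed second weight):

```latex
\mathrm{MMR}_{\text{update}}(s) = \mathrm{Score}(s) - \lambda \sum_{s_j \in S} \mathrm{Sim}(s, s_j) - \mu \sum_{s_p \in P} \mathrm{Sim}(s, s_p)
```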
• P contains some top-ranked sentences in previous cluster(s)
References
• Günes Erkan and Dragomir R. Radev. 2004. LexRank: Graph-based centrality as salience in text summarization. Journal of Artificial Intelligence Research, (22).
• Rada Mihalcea and Paul Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP 2004.
• S.N. Dorogovtsev and J.F.F. Mendes. 2001. Evolution of networks. Submitted to Advances in Physics on 6 March 2001.
• Sergey Brin and Lawrence Page. 1998. The anatomy of a large-scale hypertextual Web search engine. Computer Networks and ISDN Systems, 30(1-7).
• Jon M. Kleinberg. 1999. Authoritative sources in a hyperlinked environment. In Proceedings of the ACM-SIAM Symposium on Discrete Algorithms, 1999.
• Shiren Ye, Long Qiu, Tat-Seng Chua, and Min-Yen Kan. 2005. NUS at DUC 2005: Understanding documents via concept links. In Proceedings of DUC 2005.