W. Lee, H. Fang, Y. Li, Effective Information Access Over Public

Effective Access Over Public Email Conversations
William Lee, Hui Fang and Yifan Li
University of Illinois at Urbana-Champaign
Motivation
• Information within newsgroups and mailing lists has largely been underutilized.
• For now, access to that data is restricted to traditional searching and browsing.
Existing technologies
Search
Browse
Thread 1: Subject, Authors, Date
First Message
First Reply
Second Reply
Third Reply
Thread 2: Subject, Authors, Date
First Message
First Reply
Second Reply
Third Reply
• How can that information be accessed more effectively?
– Clustering
– Summarization
Clustering
• Goal:
– Find commonly-discussed topics from a set of conversations (threads)
• Method:
– Use agglomerative clustering with complete link
– Learn similarity functions from different “perspectives” of threads:
• authors, date, subject, contents, contents without quote, first message, reply, reply without quote.
• Use Linear and Logistic Regression to learn the combined similarity function
• Visualization---Conversation Map
– Derived from Treemap
– Clusters sorted by two time dimensions
– Allows the user to adjust the similarity threshold -- “zoom” to the more similar threads
• Experiment Design
– Data: 3 CS class newsgroups from UIUC
– Judgement file: manually created by three different human taggers
– Methodology: 3-way cross validation, using one group’s judgement file as the training set and testing on the other two.
– Evaluation: Use overall entropy as comparison metric
• Experiment Result
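The clustering method described above (complete-link agglomerative clustering over a combined similarity, evaluated with overall entropy) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the poster: the Jaccard measure used for each perspective, the two perspectives shown, the weight values (stand-ins for coefficients that regression would learn), the toy threads, and the entropy definition (cluster-size-weighted entropy of the human labels, one common formulation).

```python
import math
from collections import Counter
from itertools import combinations

def jaccard(a, b):
    """Jaccard overlap of two sets; 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical weights standing in for learned regression coefficients.
WEIGHTS = {"subject": 0.6, "authors": 0.4}

def combined_similarity(t1, t2, weights=WEIGHTS):
    """Weighted sum of per-perspective similarities between two threads."""
    return sum(w * jaccard(t1[p], t2[p]) for p, w in weights.items())

def complete_link_clusters(threads, threshold):
    """Merge clusters until the best complete-link similarity drops below threshold."""
    clusters = [[i] for i in range(len(threads))]

    def link(c1, c2):
        # Complete link: two clusters are only as similar as their least similar pair.
        return min(combined_similarity(threads[i], threads[j]) for i in c1 for j in c2)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        a, b = max(pairs, key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        if link(clusters[a], clusters[b]) < threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because a < b
    return clusters

def overall_entropy(clusters, labels):
    """Size-weighted average of the label entropy inside each cluster (lower is better)."""
    total = sum(len(c) for c in clusters)
    result = 0.0
    for c in clusters:
        counts = Counter(labels[i] for i in c)
        h = sum((n / len(c)) * math.log2(len(c) / n) for n in counts.values())
        result += len(c) / total * h
    return result

# Toy threads (hypothetical data).
THREADS = [
    {"subject": {"mp7", "debugger", "array"}, "authors": {"scott", "ann"}},
    {"subject": {"mp7", "debugger", "ddd"}, "authors": {"scott"}},
    {"subject": {"final", "exam"}, "authors": {"prof"}},
    {"subject": {"final", "exam", "room"}, "authors": {"prof", "ta"}},
]
LABELS = ["debugging", "debugging", "exam", "exam"]  # hypothetical human judgements

loose = complete_link_clusters(THREADS, threshold=0.3)
strict = complete_link_clusters(THREADS, threshold=0.55)
```

Raising the threshold keeps only the tighter merges, which corresponds to the “zoom” control of the Conversation Map.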
Summarization
• Goal:
– Find the gist of a conversation (i.e. a thread)
• Observation: Different types of conversations need different summarization methods
– Question-driven conversation
– Announcement-driven conversation
• Solution to Question-Driven summarization
– Key observation
• The question plays an important role
– Method
• Identify the question
• Detect topic shifts during the conversation
• Divide the conversation into segments based on the topic shifts
• Store all the segments containing the question
• Remove segments that contain a redundant question
• Return the remaining segments
• Solution to Announcement-Driven summarization
– Key observation
• The subject plays an important role
• Threads with similar subjects may have common pivot words in their summaries
– Method
• Training stage
– Cluster the threads by the similarity of their subjects
– Find frequent words (pivot words) in the summaries
– Extend the subjects by combining them with the corresponding pivot words
• Testing stage
– Given a thread, find the subject most similar to the current one
– Find the pivot words associated with that subject to extend the current one
– Select the sentences most similar to the extended subject as the summary
• Experiment Design
– Data: 3 CS class newsgroups from UIUC
– Judgement file: manually created by two different human taggers
– Evaluation: Precision and Recall or user study
• Examples of Experiment Results
Question-Driven
Subject: MP7: Viewing Array in debugger
From: Scott Stephens <[email protected]>
on Sat, 04 Dec 2004 22:49:06 -0600
I've been debugging things using DDD with dbx, but I'm running into a weird problem. My PatriciaTree class is basically a wrapper around a root pointer, so I observe that pointer, and dereference it to give me my first PatriciaNode, and then look at all that's inside that, and one of those things is a pointer to "data" within the Array object that's embedded in my PatriciaNode. An address shows up fine, but I'd like to dereference it to take a look at the actual array, so I can continue to examine the structure of my tree. But when I dereference it in DDD, i just shows up as "(nil)". I know for a fact that there's valid data in that array somewhere, because I can access it in my program, I just can't look at it in the debugger.
Anybody have any ideas? It'd be really nice to look at that info for debugging purposes.
-Scott
Announcement-Driven
SUBJECT: Final Exam - PLEASE READ
TIME: 1:30-4:30pm
PLACE: Regular classroom (SC 1404)
TOPICS: The exam is cummulative but with emphasis on the material after the midterm. Most important from earlier material are general techniques: divide-and-conquer, greedy, dynamic programming, randomization. Also topics like MST and shortest paths that have reappeared after the midterm.
Student having more than two consecutive examinations: No student should be required to take more than two consecutive final examinations. In a semester, this means that a student taking a final examination at 8:00 a.m. and another at 1:30 p.m. on the same day cannot be required to take an examination that same evening. However, the student could be required to take an examination beginning at 8:00 a.m. the next day ...
Conclusion
• Propose two ways to access the public email archive
– Clustering
– Summarization
• Use conversation map to visualize the clustering
result
• Future Work
– Derive better algorithms to learn the similarity
function
– Faster clustering algorithms that work on
mining patterns in conversations
– LM approach to email summarization
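For reference, the announcement-driven summarization method from the Summarization section (training stage, then testing stage) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the greedy single-pass grouping of subjects, the Jaccard measure, the `sim_threshold` and `min_count` parameters, the word-overlap sentence ranking, and all example data are hypothetical. A realistic version would also filter stop words before counting pivot words.

```python
import re
from collections import Counter

def words(text):
    """Lowercased set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def train(examples, sim_threshold=0.3, min_count=2):
    """examples: (subject, human summary) pairs.
    Returns a list of (subject words, pivot words), one entry per subject group."""
    groups = []
    for subject, summary in examples:
        s = words(subject)
        for g in groups:
            # Greedy grouping by subject similarity (an assumed simplification).
            if jaccard(s, g["subject"]) >= sim_threshold:
                g["subject"] |= s
                g["summaries"].append(words(summary))
                break
        else:
            groups.append({"subject": set(s), "summaries": [words(summary)]})
    model = []
    for g in groups:
        # Pivot words: words occurring in at least min_count summaries of the group.
        counts = Counter(w for summ in g["summaries"] for w in summ)
        pivots = {w for w, n in counts.items() if n >= min_count}
        model.append((g["subject"], pivots))
    return model

def summarize(subject, sentences, model, k=2):
    """Extend the subject with the pivot words of the most similar training
    subject, then return the k sentences that overlap most with it."""
    s = words(subject)
    _, pivots = max(model, key=lambda m: jaccard(s, m[0]))
    extended = s | pivots
    ranked = sorted(sentences, key=lambda x: len(words(x) & extended), reverse=True)
    return ranked[:k]

# Hypothetical training data and test thread.
TRAINING = [
    ("Final exam information", "Time: 1:30-4:30pm. Place: SC 1404."),
    ("Midterm exam information", "Time: 7pm. Place: DCL 1320."),
    ("MP7 debugger question", "Use print statements."),
]
MODEL = train(TRAINING)

SENTENCES = [
    "Hi all,",
    "The time of the exam is 7pm on Monday.",
    "The place is Siebel 1404.",
    "Good luck studying!",
]
SUMMARY = summarize("Exam 2 information", SENTENCES, MODEL)
```

In this toy run the two exam announcements fall into one group whose pivot words are "time" and "place", so the sentences mentioning them are selected for the new exam thread.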