Project presentation - dimacs

Monitoring Message Streams:
Retrospective and Prospective
Event Detection
Paul Kantor, Dave Lewis,
David Madigan, Fred Roberts
DIMACS, Rutgers University
1
DIMACS is a partnership of:
•Rutgers University
•Princeton University
•AT&T Labs
•Bell Labs
•NEC Research Institute
•Telcordia Technologies
http://dimacs.rutgers.edu
[email protected]
732-445-5928
2
OBJECTIVE:
Monitor streams of textualized communication to
detect pattern changes and "significant" events
Motivation:
• Sniffing and monitoring email traffic
3
TECHNICAL PROBLEM:
• Given stream of text in any language.
• Decide whether "new events" are present in the
flow of messages.
• Event: new topic or topic with unusual level of
activity.
• Retrospective or “Supervised” Event
Identification: Classification into pre-existing
classes.
4
More Complex Problem: Prospective Detection
or “Unsupervised” Learning
• Classes change: new classes appear or existing
classes change meaning
• A difficult problem in statistics
• Recent new C.S. approaches
1) Algorithm suggests a new class
2) Human analyst labels it; determines its
significance
5
COMPONENTS OF AUTOMATIC
MESSAGE PROCESSING
(1). Compression of Text -- to meet storage and
processing limitations;
(2). Representation of Text -- put in form
amenable to computation and statistical analysis;
(3). Matching Scheme -- computing similarity
between documents;
(4). Learning Method -- build on judged examples
to determine characteristics of document cluster
(“event”)
(5). Fusion Scheme -- combine methods (scores) to
yield improved detection/clustering.
6
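Components (2) and (3) can be illustrated with a minimal sketch. It assumes a TF-IDF weighting and cosine similarity, which are common choices but are not stated on the slide; the function names and the smoothed idf formula are illustrative only.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Turn tokenized documents into sparse {term: weight} dictionaries."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        term_freq = Counter(doc)
        # Smoothed idf (an assumed choice) keeps every weight positive.
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy messages, already tokenized.
docs = [
    ["bank", "merger", "announced"],
    ["bank", "merger", "report"],
    ["football", "score", "report"],
]
vecs = tfidf_vectors(docs)
print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

Messages that share topical terms score higher than messages that share none, which is the property a matching scheme needs before any learning method is applied.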
OUR APPROACH: WHY WE CAN DO BETTER
THAN STATE OF THE ART:
• Existing methods use some or all of the 5 automatic
processing components, but don’t exploit the full
power of the components and/or an understanding
of how to apply them to text data.
• Dave Lewis' method at TREC filtering used an off-the-shelf
support vector machine supervised learner,
but tuned it for frequency properties of the data.
• The combination still dominated competing
approaches in the TREC-2001 batch filtering
evaluation.
7
OUR APPROACH: WHY WE CAN DO BETTER II:
• Existing methods aim at fitting into available
computational resources without paying attention
to upfront data compression.
• We hope to do better by a combination of:
  - more sophisticated statistical methods
  - sophisticated data compression in a preprocessing stage
  - optimization of component combinations
8
COMPRESSION:
• Reduce the dimension before statistical analysis.
• Recent results: “One-pass” through data can reduce
volume significantly w/o degrading performance
significantly. (E.g.: use random projections.)
• This is unlike feature-extracting dimension reduction,
which can lead to bad results.
We believe that sophisticated dimension reduction
methods in a preprocessing stage followed by
sophisticated statistical tools in a detection/filtering
stage can be a very powerful approach.
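A minimal sketch of the random-projection idea mentioned above, assuming a dense Gaussian projection matrix; the function names, dimensions, and seeds are illustrative and not taken from the project. Each document vector is projected independently, so the reduction can be applied in a single pass over the stream.

```python
import math
import random


def random_projection_matrix(in_dim, out_dim, seed=0):
    """Gaussian matrix scaled so that distances are preserved on average."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(out_dim)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(matrix, vector):
    """Map a vector of length in_dim to a vector of length out_dim."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Hypothetical example: reduce 500 features to 200.
R = random_projection_matrix(500, 200, seed=1)
rng = random.Random(2)
a = [rng.random() for _ in range(500)]
b = [rng.random() for _ in range(500)]
pa, pb = project(R, a), project(R, b)
# The ratio of projected to original distance should be close to 1.
print(euclidean(pa, pb) / euclidean(a, b))
```

The projection does not look at the data when choosing directions, which is what distinguishes it from feature-extracting reduction.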
9
MORE SOPHISTICATED STATISTICAL
APPROACHES:
• Representations: Boolean representations; weighting
schemes
• Matching Schemes: Boolean matching; nonlinear
transforms of individual feature values
• Learning Methods: new kernel-based methods; more
complex Bayes classifiers; boosting
• Fusion Methods: combining scores based on ranks,
linear functions, or nonparametric schemes
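Two of the fusion options listed above, rank-based and linear combination, can be sketched as follows. This is an illustration only: the averaging of ranks, the weights, and the function names are assumptions, and every method is assumed to have scored the same set of documents.

```python
def ranks(scores):
    """Map each document id to its rank (1 = highest score)."""
    ordered = sorted(scores, key=lambda doc: scores[doc], reverse=True)
    return {doc: position + 1 for position, doc in enumerate(ordered)}


def rank_fusion(score_lists):
    """Average rank across methods; a lower fused value is better."""
    all_ranks = [ranks(scores) for scores in score_lists]
    docs = all_ranks[0].keys()
    return {doc: sum(r[doc] for r in all_ranks) / len(all_ranks)
            for doc in docs}


def linear_fusion(score_lists, weights):
    """Weighted sum of raw scores; a higher fused value is better."""
    docs = score_lists[0].keys()
    return {doc: sum(w * scores[doc] for w, scores in zip(weights, score_lists))
            for doc in docs}


# Hypothetical scores from two methods on different scales.
method_a = {"d1": 0.9, "d2": 0.5, "d3": 0.1}
method_b = {"d1": 2.0, "d2": 7.0, "d3": 1.0}
print(rank_fusion([method_a, method_b]))
print(linear_fusion([method_a, method_b], weights=[1.0, 0.1]))
```

Rank fusion ignores the scale of each method's scores, while linear fusion needs weights that compensate for it.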
10
OUTLINE OF THE APPROACH
• Identify best combination of newer methods through
careful exploration of variety of tools.
• Address issues of effectiveness (how well task is done)
and efficiency (in computational time and space)
• Use combination of new or modified algorithms and
improved statistical methods built on the algorithmic
primitives.
11
IN LATER YEARS
• Extend work to unsupervised learning.
• Still concentrate on new methods for the 5 components.
• Emphasize “semi-supervised learning” - human analysts
help to focus on features most indicative of anomaly or
change; algorithms assess incoming documents as to
deviation on those features.
• Develop new techniques to represent data to highlight
significant deviation:
  - Through an appropriately defined metric
  - With new clustering algorithms
  - Building on analyst-designated features
12
THE PROJECT TEAM:
Strong team:
Statisticians: David Madigan, Rutgers Statistics;
Ilya Muchnik, Rutgers CS
Experts in Info. Retrieval & Library Science &
Text Classification: Paul Kantor, Rutgers Info.
and Library Science; David Lewis, Private
Consultant
13
THE PROJECT TEAM:
Learning Theorists/Operations Researchers: Endre
Boros, Rutgers Operations Research
Computer Scientists: Muthu Muthukrishnan,
Rutgers CS, Martin Strauss, AT&T Labs, Rafail
Ostrovsky, Telcordia Technologies
Decision Theorists/Mathematical Modelers: Fred
Roberts, Rutgers Math/DIMACS
Homeland Security Consultants: David
Goldschmidt, IDA-CCR
14
IMPACT:
12 MONTHS:
• We will have established a state-of-the-art scheme for
classification of accumulated documents in relation to
known tasks/targets/themes and building profiles to
track future relevant messages.
• We are optimistic that by end-to-end experimentation,
we will discover synergies between new mathematical
and statistical methods for addressing each of the
component tasks and thus achieve significant
improvements in performance on accepted measures
that could not be achieved by piecemeal study of one
or two component tasks.
15
IMPACT:
3 YEARS:
• prototype code for testing the concepts and a
precise system specification for commercial or
government development.
• we will have extended our analysis to semi-supervised
discovery of potentially interesting clusters of documents.
• this should allow us to identify potentially
threatening events in time for cognizant
agencies to prevent them from occurring.
16
RISKS
•Data will not be realistic enough.
•We will find it harder than expected to
combine good approaches to the 5 components
•Multidisciplinary cooperation won’t work as
well as we think.
17
TOP ACCOMPLISHMENTS
TO DATE
Infrastructure Work to Date (1 of 2)
--Built platform for text filtering experiments
*Modified CMU Lemur retrieval toolkit
to support filtering
*Created newswire testset with test
information needs (250 topics, 240K documents)
*Wrote evaluation and adaptive thresholding
software
18
TOP ACCOMPLISHMENTS
TO DATE II
Infrastructure Work to Date (2 of 2):
--Implemented fundamental adaptive linear
classifier (Rocchio)
--Benchmarked it using our data sets and
submitted results to the NIST TREC evaluation
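For readers unfamiliar with Rocchio, the following is a minimal sketch of an adaptive Rocchio-style filter, not the project's Lemur-based implementation. The class name, the beta and gamma values, and the fixed threshold are assumptions; the slides mention adaptive thresholding separately, which this sketch omits.

```python
from collections import defaultdict


class RocchioFilter:
    """Linear profile: beta * mean(relevant) - gamma * mean(non-relevant)."""

    def __init__(self, beta=1.0, gamma=0.25, threshold=0.0):
        self.beta = beta
        self.gamma = gamma
        self.threshold = threshold  # fixed here; an assumed simplification
        self.pos_sum = defaultdict(float)
        self.neg_sum = defaultdict(float)
        self.n_pos = 0
        self.n_neg = 0

    def update(self, vector, relevant):
        """Fold one judged document (sparse {term: weight}) into the profile."""
        target = self.pos_sum if relevant else self.neg_sum
        for term, weight in vector.items():
            target[term] += weight
        if relevant:
            self.n_pos += 1
        else:
            self.n_neg += 1

    def weight(self, term):
        """Current profile weight of a single term."""
        pos = self.pos_sum.get(term, 0.0) / self.n_pos if self.n_pos else 0.0
        neg = self.neg_sum.get(term, 0.0) / self.n_neg if self.n_neg else 0.0
        return self.beta * pos - self.gamma * neg

    def score(self, vector):
        return sum(w * self.weight(term) for term, w in vector.items())

    def accept(self, vector):
        """True if the document would be passed on to the user."""
        return self.score(vector) > self.threshold


# Hypothetical judged examples.
profile = RocchioFilter()
profile.update({"merger": 1.0, "bank": 1.0}, relevant=True)
profile.update({"football": 1.0, "score": 1.0}, relevant=False)
print(profile.accept({"merger": 1.0}), profile.accept({"football": 1.0}))
```

Because the profile is kept as running sums, each new judgment updates it without revisiting earlier documents, which is what makes the classifier adaptive.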
19
TOP ACCOMPLISHMENTS
TO DATE III
Developed a Formal Framework for Monitoring
Message Streams:
•Cast Monitoring Message Streams as a multistage
decision problem
•For each message, decide to send to an analyst or
not
•Positive utility for sending an “interesting”
message; else negative…but
20
A Formal Framework for Monitoring
Message Streams Continued
•…positive “value of information” even for negative
documents
•Use Influence Diagrams as a modeling framework
•Key input is the learning curve
•Building simple learning curve models
•BinWorld – discrete model of feature space
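The per-message decision described in this framework can be sketched as an expected-utility rule. This is a simplified single-step illustration, not the multistage influence-diagram model; the parameter names and numeric values are hypothetical, and the value of information is treated as a fixed bonus for any message the analyst judges.

```python
def expected_utility_of_sending(p_interesting, gain, cost, info_value=0.0):
    """Expected utility of sending one message to the analyst.

    gain:       utility when the message turns out to be interesting
    cost:       penalty when it does not
    info_value: assumed learning benefit of obtaining any judgment
    """
    return p_interesting * gain - (1.0 - p_interesting) * cost + info_value


def should_send(p_interesting, gain, cost, info_value=0.0):
    """Send only when the expected utility is strictly positive."""
    return expected_utility_of_sending(
        p_interesting, gain, cost, info_value) > 0.0


# A message with only a 20% chance of being interesting is not worth
# sending on immediate utility alone, but may be once learning is valued.
print(should_send(0.2, gain=1.0, cost=1.0))
print(should_send(0.2, gain=1.0, cost=1.0, info_value=0.7))
```

Early in the stream the learning curve is steep, so the information term would typically be larger; how it shrinks over time is what the learning curve models are meant to supply.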
21
TOP ACCOMPLISHMENTS
TO DATE IV
In June, held a “data mining in homeland security”
tutorial and workshop at IDA-CCR Princeton.
Organized Algorithmic Approach to
Compression/Dimension Reduction
Beginning Work on Nearest Neighbor Search
Methods
22
S.O.W: FIRST 12 MONTHS:
• Prepare available corpora of data on which to uniformly
test different combinations of methods
• Concentrate on supervised learning and detection
• Systematically explore & compare combinations of
compression schemes, representations, matching
schemes, learning methods, and fusion schemes
• Test combinations of methods on common data sets and
exchange information among the team
• Develop and test promising dimension reduction
(compression) methods
23
S.O.W: FIRST 12 MONTHS:
Midterm Exam (by end of November):
• Reports on Algorithms: draft writeups
• Research Quality Code: Under Development
• Reports on Experimental Evaluation: Interim Project
Report
• Dissemination: draft writeups, interim report plus
website, workshop in June 2002 just prior to
beginning of project
24
S.O.W: FIRST 12 MONTHS:
Final Exam (by end of First 12 Months):
• Reports on Algorithms: formal writeups as technical
reports and research papers
• Research Quality Code: Made available to sponsors
and mission agencies on a web site
• Reports on Experimental Evaluation: Project Report
Summarizing end-to-end studies on effectiveness of
different components of our approach + their
effectiveness in combination
• Dissemination: technical reports, conference papers,
journal submissions, final reports on algorithms and
experimental evaluation, refinement of websites,
meetings with sponsors and mission agencies. End of
Year 1 Workshop for Sponsors/Practitioners.
25
S.O.W: YEARS 2 AND 3:
• Combine leading methods for supervised learning with
promising upfront dimension reduction methods
• Develop research quality code for the leading identified
methods for supervised learning
• Develop the extension to unsupervised learning:
  - Detect suspicious message clusters before an event
    has occurred
  - Use generalized stress measures indicating that a
    significant group of interrelated messages doesn’t fit
    into the known family of clusters
  - Concentrate on semi-supervised learning.
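The notion of messages that do not fit the known clusters can be illustrated with a simple stand-in: the fraction of recent messages lying farther than a chosen radius from every known cluster centroid. This is not the generalized stress measure the project refers to, which the slides do not define; the function names, the Euclidean metric, and the radius are assumptions.

```python
import math


def nearest_centroid_distance(vector, centroids):
    """Euclidean distance from a message vector to its closest centroid."""
    return min(math.dist(vector, centroid) for centroid in centroids)


def misfit_fraction(messages, centroids, radius):
    """Share of messages farther than `radius` from every known cluster."""
    far = [m for m in messages
           if nearest_centroid_distance(m, centroids) > radius]
    return len(far) / len(messages)


# Hypothetical 2-D message vectors and two known cluster centroids.
centroids = [(0.0, 0.0), (10.0, 10.0)]
messages = [(0.5, 0.0), (10.0, 9.5), (5.0, 5.0), (5.0, 6.0)]
print(misfit_fraction(messages, centroids, radius=2.0))
```

A rising misfit fraction over successive windows would be one crude signal that a new group of interrelated messages is forming.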
26
WE ARE OFF TO A GOOD START
The task we face is of great value in forensic activities.
We are bringing to bear on this task a multidisciplinary
approach with a large, enthusiastic, and experienced
team.
Preliminary results are very encouraging.
Work is needed to make sure that our ideas are of use to
analysts.
27