YALE LAW SCHOOL
POLICY SCIENCES CENTER ANNUAL
INSTITUTE
Using a New Method of Natural Language
Intelligence for Performing Wiretap Analysis
Amy Neustein, Ph.D.
Linguistic Technology Systems
[email protected]
WHY DO WE NEED A NEW NATURAL
LANGUAGE INTELLIGENCE
METHOD FOR MINING WIRETAP
RECORDINGS?
1) The volume of terrorism-related government
wiretap recordings far exceeds the capacity of
human intelligence agents to mine those
recordings; and
2) Most automated audio data mining programs
have a low rate of return when searching for
“keywords” in wiretap recordings because
terror suspects deliberately avoid
keywords that can identify names, places,
dates, etc.
Sequence Package Analysis--A New Method
of Natural Language Intelligence
HOW DOES SPA WORK?
1) Adds rather than replaces
SPA adds a layer of intelligence to
standard dialog systems.
2) Mines audio data
SPA goes beyond a conventional
search for words and word strings.
Identifies a Series of Related Speaking
Turns and Turn Construction Units
(parts of turns) that are Discretely
Packaged as a Sequence of
Conversational Interaction
WHAT IS THE METHODOLOGICAL
BASIS OF SPA?
SPA is a new natural language understanding method that has
been peer reviewed and cited by other researchers as an
important data mining method for captioning text. It draws
mainly on the field of conversation analysis: the study of the
orderly properties of interactive dialog that revolve around
the turn-taking system and the other sequentially based
features that are part of that system.
Conversation analysis has been called by some a subfield
of A.I. because it can detect the detailed structural
organization of dialog, which is a necessary precondition for
the design of dialog systems that simulate and understand
human dialog.
WHAT DOES SPA DO?
1) SPA permits the discovery of “key”
words (e.g., the name of a location
where a crucial meeting among
terrorists will take place) that are not
contained in the speech application’s
vocabulary.
2) SPA permits rapid and efficient data
mining of large volumes of audio text
by spotting sequence packages in
the dialog.
MINING THE DATA FOR
SEQUENCE PACKAGES
• A sudden increase in the speakers’ use of
pronouns in place of noun referents may indicate
that the speakers are going over familiar or
well-rehearsed subject matter.
• An unexpected increase in adjectival descriptors
used in place of nouns, serving as a kind of
privately shared “shorthand” label for a person or
enemy target, can flag terrorist plans and
activities.
• SPA, by looking for sequence patterns, can locate
these descriptors even when they are outside of
the speech application’s vocabulary.
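The first indicator above, a sudden rise in pronoun use, can be illustrated with a minimal Python sketch. It is not the SPA implementation: the pronoun list, the window size, and the two thresholds (`ratio`, `min_rate`) are assumptions chosen for illustration, and the turns are assumed to be already transcribed to text.

```python
import re

# Hypothetical, deliberately small pronoun list; a real system would more
# likely rely on a part-of-speech tagger than on a fixed word list.
PRONOUNS = {"he", "she", "it", "they", "him", "her", "them", "his", "their", "its"}


def pronoun_rate(turn):
    """Fraction of the word tokens in one speaking turn that are pronouns."""
    tokens = re.findall(r"[a-z']+", turn.lower())
    if not tokens:
        return 0.0
    return sum(token in PRONOUNS for token in tokens) / len(tokens)


def flag_pronoun_spikes(turns, window=3, ratio=2.0, min_rate=0.2):
    """Return the indices of turns whose pronoun rate is at least `min_rate`
    and at least `ratio` times the mean rate of the preceding `window` turns.
    All three parameters are assumed values, not figures from the slides."""
    rates = [pronoun_rate(turn) for turn in turns]
    flagged = []
    for i in range(window, len(rates)):
        baseline = sum(rates[i - window:i]) / window
        if rates[i] >= min_rate and rates[i] >= ratio * baseline:
            flagged.append(i)
    return flagged
```

Calling `flag_pronoun_spikes(turns)` on a list of transcribed turns returns the positions where pronouns abruptly displace noun referents; those positions are candidates for closer review, not conclusions.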
ADVANTAGES OF SPA
• SPA captures the predictable patterns of human
dialog, while all other methods depend on
spotting isolated key words or phrases, which
can vary from speaker to speaker;
• Can be applied to different languages because it
works by identifying conversational sequence
patterns, which cut to the heart of the social
architecture of language, rather than by matching
a preset glossary of words; and
• Has the potential of performing data mining in
real time, allowing a human analyst to act on the
spot when hearing high alarm content.
DEMONSTRATION
The following example shows
how applying an SPA approach
to wiretapped dialog can flag
important security information
that is cleverly disguised by the
suspects:
Speaker “A” is trying to educate Speaker “B” about a
new meeting place right at the tip of the Brooklyn Bridge.
Any confusion or misunderstanding about this meeting
place could spoil the plans.
But Speaker “A” is very clever:
First, he stays away from buzz words (such as naming a
bridge, a tunnel or a street).
Second, he refrains from making any prefatory remarks
or comments to the other speaker about how vital it is to
get these instructions right.
Dialog Example
Speaker “A”: Come to the intersection
near Juniors? (the question mark
shows an upward intonation)
0.2 - 0.5 second pause (speaker then
pauses briefly)
Speaker “B”: 1.2 second pause
Speaker “A”: You know the thoroughfare
with the big traffic light?
Speaker “B”: Juniors, yeah.
THE SEQUENCE PACKAGE
Speaker “A”: Come to the intersection near
Juniors? 0.2-0.5
Speaker “B”: 1.2 seconds of silence
• A noun referent (“Juniors”) with an upward
intonation
• A brief pause, giving the listener the chance to
show recognition or ask for clarification.
• Silence by the listener which indicates lack of
understanding or confusion.
Speaker “A”: You know the thoroughfare with the big
traffic light?
Speaker “B”: Juniors, yeah.
• Speaker “A” produces a clarification of the noun referent
(“Juniors”): “You know the thoroughfare with...”
• Speaker “B” produces a repeat of the noun referent
(“Juniors”), the source of the recognition trouble,
followed by a recognitional marker (“Yeah”), which
demonstrates to Speaker “A” that he has corrected the
misunderstanding.
• Had Speaker “B” simply produced a recognitional marker
(“yeah”) without mentioning the source of the trouble
(“Juniors”), there would be no indication to Speaker “A”
that Speaker “B” now recognizes the importance of the
meeting place.
Finding the Sequence Package in the
Dialog Example
Look for a concatenation of these
utterance components:
• noun referent with upward intonation
• brief pause
• silence
• clarification of noun referent
• repeat of noun referent that was initial
source of the recognition trouble
• recognitional marker
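The concatenation listed above can be sketched in Python as a scan over annotated dialog units. This is an illustrative sketch, not the SPA implementation: the `Unit` fields (intonation, gap length, noun referent, clarification target) are assumed to come from a hypothetical upstream speech-processing stage, the marker list is illustrative, and the 1.0-second silence threshold is an assumption. Only single-word referents are handled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """One turn construction unit or gap. Every field is an assumed
    annotation from an upstream stage, not something this sketch derives."""
    speaker: str
    text: str = ""                   # empty for pauses and silences
    gap_seconds: float = 0.0         # length of a pause or silence
    rising: bool = False             # upward intonation
    referent: Optional[str] = None   # noun referent named in the unit
    clarifies: Optional[str] = None  # referent this unit re-describes


RECOGNITION_MARKERS = {"yeah", "yes", "right", "okay"}  # illustrative list


def repeats_then_acknowledges(text, referent):
    """True if the referent is repeated and a recognitional marker follows."""
    words = [w.strip(".,?!").lower() for w in text.split()]
    ref = referent.lower()
    if ref not in words:
        return False
    return any(w in RECOGNITION_MARKERS for w in words[words.index(ref) + 1:])


def find_sequence_package(units, brief=(0.2, 0.5), min_silence=1.0):
    """Return (start_index, referent) for the first matching run, else None."""
    for i in range(len(units) - 4):
        first, pause, silence, clarification, reply = units[i:i + 5]
        ref = first.referent
        if (
            ref is not None and first.rising             # referent, rising
            and pause.speaker == first.speaker and not pause.text
            and brief[0] <= pause.gap_seconds <= brief[1]    # brief pause
            and silence.speaker != first.speaker and not silence.text
            and silence.gap_seconds >= min_silence       # listener silence
            and clarification.speaker == first.speaker
            and clarification.clarifies == ref           # clarification
            and reply.speaker != first.speaker
            and repeats_then_acknowledges(reply.text, ref)   # repeat + marker
        ):
            return i, ref
    return None
```

Encoding the dialog example as five units (Speaker “A” naming “Juniors” with rising intonation, a pause inside the 0.2 to 0.5 second range, 1.2 seconds of silence from Speaker “B”, the clarification, and “Juniors, yeah.”) makes the function return the start of the package together with the referent. The referent is recovered from its position in the sequence rather than from a preset vocabulary.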
CODA
The next step is the validation of
SPA as a necessary tool for
performing wiretap analysis
Research Question:
Do mining programs achieve a higher
rate of accuracy in spotting
terrorists when Sequence Package
Analysis is added as a new method
of natural language intelligence for
performing wiretap analysis?