Ontology-Guided Search and
Text Mining
for Intelligence Gathering
Kurt Godden, Ph.D.
MSR Lab, R&D
[email protected]
Outline
• Definitions of terms
• Customers (Who cares?)
• Finding Text – ontology-guided search
• Text Processing
– Content extraction
– Text Mining
• Temporal Data Mining at GM
• Multi-Lingual Text Processing
• Summary
What is Text Mining?
• Data Mining:
– The process of analyzing data to discover new patterns or relationships
– 1st International Conference was KDD-95
– http://www-aig.jpl.nasa.gov/public/kdd95/
• Text Mining is a Subfield of Data Mining
– As such, ideally TM is the process of analyzing unstructured text to discover
new patterns or relationships
– In practice, TM often refers simply to the Content Extraction (CE) of
structured data from unstructured text, usually from finite-state parsers.
Content Extraction:
Structured Data from Unstructured Text
“Company XYZ, is known to ship products
through the port of Dubai.”
[Figure: “From Text to Actionable Knowledge”. Stages shown: automatic multi-language scanning; entity and relation extraction/distillation; filtering. The scanned sources are drawn as a network of location nodes such as DubaiUAE, ShanghaiPort, Karachi, Hamburg and Cairo.]
<XYZ-Corp,exports-through,Dubai>
Who Cares?
• Government
– NSA, CIA, DIA, DHS, DARPA
• Industry
– Automotive
– Chemical
– Pharmaceutical
– Legal
– Consumer goods
– Aerospace
Why do they care?
• Intelligence and Security
– Valdis E. Krebs was able to manually map much of the
9/11 terrorist cell from public documents.
• http://vlado.fmf.uni-lj.si/pub/networks/doc/Seminar/Krebs.pdf
• Industrial
– Urban Legend: (Is it true?)
“80% of all corporate knowledge is in text.”
– Market research
– Fraud detection
– Root cause analysis
– Document clustering and categorization
– Competitive intelligence
– Patent analysis
– etc.
Before Mining Must Come Text
• How to find it?
Ontology-Guided Search (OGS)
• Oft-cited definition of ontology by T.R. Gruber:
– An ontology is a formal specification of a shared
conceptualization.
• www.vivisimo.com clusters search results according to
semantic categories
• OGS: use an ontology to guide the search for documents
to include not only keywords of interest, but also terms
that are semantically related to those keywords
What ontology to use?
• Public
– Wordnet: http://wordnet.princeton.edu/
• Organizes content words (N, V, Adj, Adv) into sets of semantically related concepts connected by relations
• Currently about 207k word-sense pairs
– <bank1, monetary institution>
– <bank2, land adjacent to river>
• Custom
– Parts
– Products
– Processes
• Tool: Protégé at http://protege.stanford.edu/
Ontology-Guided Search (OGS)
[Figure: query terms grouped with their semantically related variants]
– avoids, avoiding, avoided
– neighborhood, neighborhoods, suburb, suburbs
– riot, riots, “civil unrest”
– “driving through”, “drive through”, “drove through”
• Use ontology to search not only on keywords, but on
semantically-related keywords
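A minimal sketch of this kind of query expansion, assuming a tiny hand-built lookup table in place of a real ontology. The table `RELATED_TERMS` and the function names are hypothetical, and the Boolean `AND`/`OR` syntax is only an assumption about what the target search engine accepts.

```python
# Hypothetical mini-ontology: each keyword maps to itself plus related terms.
RELATED_TERMS = {
    "avoid": ["avoid", "avoids", "avoiding", "avoided"],
    "neighborhood": ["neighborhood", "neighborhoods", "suburb", "suburbs"],
    "riot": ["riot", "riots", "civil unrest"],
}


def expand_term(term):
    """Return the term with its related terms; unknown terms map to themselves."""
    return RELATED_TERMS.get(term, [term])


def quote(term):
    """Wrap multi-word terms in double quotes so they are searched as phrases."""
    return f'"{term}"' if " " in term else term


def build_boolean_query(keywords):
    """AND together one OR-group of related terms per keyword."""
    groups = []
    for keyword in keywords:
        alternatives = " OR ".join(quote(t) for t in expand_term(keyword))
        groups.append(f"({alternatives})")
    return " AND ".join(groups)


query = build_boolean_query(["riot", "neighborhood", "Paris"])
print(query)
```

Keywords missing from the table, such as "Paris" here, pass through unchanged, so the expansion only ever widens the query.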
Pitfalls of OGS
• Beware of semantically related terms
• Simulation of OGS using Wordnet
– Original query:
• Which neighborhoods of Paris are safe?
– One of several transformed queries was:
• Which suburbs of Paris are condoms?
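The failure above can be reproduced with a toy sense inventory. This is a sketch under stated assumptions: the `SENSES` table is hypothetical hand-entered data (it is not read from Wordnet), and the part-of-speech filter is supplied by hand, whereas a real system would need a tagger or word-sense disambiguation step to provide it.

```python
# Hypothetical sense inventory: word -> list of (part_of_speech, synonyms).
SENSES = {
    "safe": [("adj", ["secure"]), ("noun", ["strongbox", "condom"])],
    "neighborhoods": [("noun", ["suburbs", "districts"])],
}


def expand(word, pos=None):
    """Return substitutes for word; if pos is given, keep only senses with that POS."""
    substitutes = []
    for sense_pos, synonyms in SENSES.get(word, []):
        if pos is None or sense_pos == pos:
            substitutes.extend(synonyms)
    return substitutes


def rewrite(tokens, index, pos=None):
    """Build one query variant per substitute for tokens[index]."""
    return [
        " ".join(tokens[:index] + [substitute] + tokens[index + 1:])
        for substitute in expand(tokens[index], pos)
    ]


tokens = "Which neighborhoods of Paris are safe".split()
blind = rewrite(tokens, 5)                 # every sense of "safe" is used
filtered = rewrite(tokens, 5, pos="adj")   # only the adjective sense is used
print(blind)
print(filtered)
```

Sense-blind substitution mixes the noun senses of "safe" into the query, while restricting to the adjective sense keeps only the intended reading.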
Content Extraction Technology
• Regular Expressions Mapped to Semantic
Templates
• Regular Expression for Passives:
NP1 BE TV [by NP2]
“The lecture was presented by Kurt Godden”
• Mapping of Match Registers to Template
<NP2:agent, TV:relation, NP1:object>
<kg, presented, lecture>
Post-Processing Rule:
if NP2 is the empty string, then use ‘someone’:agent
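The passive pattern and its template mapping can be sketched with an ordinary regular expression. This is an illustration under simplifying assumptions, not the parser behind the slides: a noun phrase is approximated as an optional determiner plus words, BE as a few forms of "to be", and TV as any word ending in "-ed". The names `PASSIVE` and `extract_passive` are hypothetical, and the agent is returned as written rather than mapped to an identifier such as `kg`.

```python
import re

# NP1 BE TV [by NP2], with crude stand-ins for each category (assumptions).
PASSIVE = re.compile(
    r"^(?:(?:the|a|an|some)\s+)?(?P<np1>\w+(?:\s\w+)*?)\s+"  # NP1
    r"(?P<be>is|are|was|were)\s+"                            # BE
    r"(?P<tv>\w+ed)"                                         # TV (past participle)
    r"(?:\s+by\s+(?P<np2>[\w ]+?))?"                         # optional "by NP2"
    r"\.?$",
    re.IGNORECASE,
)


def extract_passive(sentence):
    """Map match registers to the template <NP2:agent, TV:relation, NP1:object>."""
    match = PASSIVE.match(sentence.strip())
    if match is None:
        return None
    # Post-processing rule: an empty NP2 becomes the agent 'someone'.
    agent = match.group("np2") or "someone"
    return (agent, match.group("tv"), match.group("np1"))


print(extract_passive("The lecture was presented by Kurt Godden"))
print(extract_passive("The lecture was presented."))
```

An active sentence such as "Kurt Godden presented the lecture" does not match this pattern, so it would need a separate pattern of its own.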
Content Extraction Example
“Some 40 vehicles were torched in the Val d'Oise area NW of Paris.”
http://www.breitbart.com/news/2005/11/04/D8DLFA780.html
For pattern: NP1 BE TV [by NP2]
– ‘vehicles’ matches NP1
– ‘were’ matches BE
– ‘torched’ matches TV
– No match for NP2
• Canonicalize tokens via a domain ontology (e.g. vehicles→vehicle, torched→burn)
<someone, burn, vehicle>
• Additional triples can be matched by other RegExp patterns, giving:
<vehicle, count, 40>
<vehicle, located-in, val-d’oise>
<val-d’oise, near, paris>
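The canonicalization step and one of the additional patterns can be sketched as follows. The `CANONICAL` table stands in for a domain-ontology lookup and the `COUNT` pattern is one plausible way to obtain the count triple; both are assumptions made for illustration, as are all function names.

```python
import re

# Hypothetical stand-in for a domain-ontology lookup.
CANONICAL = {"vehicles": "vehicle", "torched": "burn"}

# A number followed by a noun, e.g. "40 vehicles".
COUNT = re.compile(r"\b(?P<n>\d+)\s+(?P<noun>[A-Za-z]+)")


def canonicalize(token):
    """Lower-case a token and replace it with its canonical form if one is known."""
    token = token.lower()
    return CANONICAL.get(token, token)


def count_triples(sentence):
    """Emit <noun, count, n> for every 'number noun' match in the sentence."""
    return [
        (canonicalize(m.group("noun")), "count", int(m.group("n")))
        for m in COUNT.finditer(sentence)
    ]


sentence = "Some 40 vehicles were torched in the Val d'Oise area NW of Paris."
raw_triple = ("someone", "torched", "vehicles")  # from the passive pattern
triple = tuple(canonicalize(t) for t in raw_triple)
counts = count_triples(sentence)
print(triple)
print(counts)
```

Canonicalizing both triples with the same table is what lets them share the node `vehicle` and be joined later.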
Why Only Regular Expressions?
• Computational Efficiency
• Practical Adequacy
• Workaround for lack of recursion: lots of REs!
NP → NP and NP
becomes
NP → CN and CN
NP → CN and CN and CN
NP → NAME and NAME
NP → NAME and NAME and NAME
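The flat rules above can be generated mechanically up to a chosen bound, or folded into a single expression with bounded repetition. The sketch below assumes a toy definition of NAME (one capitalized word) and a bound of three conjuncts; the function names are hypothetical.

```python
import re


def coordination_rules(category, max_conjuncts):
    """List the flat right-hand sides 'X and X', 'X and X and X', ... up to the bound."""
    return [" and ".join([category] * n) for n in range(2, max_conjuncts + 1)]


def bounded_coordination_regex(unit, max_conjuncts):
    """One regex equivalent to all the flat rules: unit (and unit){1, max-1}."""
    return re.compile(rf"{unit}(?: and {unit}){{1,{max_conjuncts - 1}}}")


NAME = r"[A-Z][a-z]+"  # toy assumption: a name is a single capitalized word
rules = coordination_rules("NAME", 3)
pattern = bounded_coordination_regex(NAME, 3)

print(rules)
print(bool(pattern.fullmatch("Alice and Bob and Carol")))
```

Coordinations longer than the bound are simply not matched, which is the price of trading recursion for a finite list of expressions.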
After Text Must Come Mining
• Temporal Data Mining research by K.P.
Unnikrishnan (GM R&D) and P.S. Sastry
(IISc, Bangalore)
• TDMiner
– Proprietary tool
– Discovers frequent sequences of events from
symbolic data
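To make "frequent sequences of events" concrete, here is a small sketch that counts non-overlapped occurrences of an ordered episode in a symbolic event stream and keeps the episodes that reach a threshold. TDMiner is proprietary, so this is not its algorithm; the brute-force candidate enumeration, the function names and the sample data are all assumptions chosen for brevity.

```python
from itertools import permutations


def count_non_overlapped(events, episode):
    """Count occurrences of episode (in order, gaps allowed) that do not overlap."""
    count = 0
    position = 0  # index of the next episode symbol we are waiting for
    for event in events:
        if event == episode[position]:
            position += 1
            if position == len(episode):
                count += 1
                position = 0  # start looking for the next occurrence
    return count


def frequent_episodes(events, length, min_count):
    """Brute force: test every ordered tuple of distinct event types."""
    alphabet = sorted(set(events))
    result = {}
    for episode in permutations(alphabet, length):
        occurrences = count_non_overlapped(events, episode)
        if occurrences >= min_count:
            result[episode] = occurrences
    return result


events = list("ABCABDABC")  # hypothetical symbolic event stream
print(frequent_episodes(events, length=2, min_count=3))
```

Enumerating every candidate is only workable for small alphabets and short episodes; a practical miner would build longer candidates from shorter frequent ones.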
For More Info:
• 4th Workshop on Temporal Data Mining: Network
Reconstruction from Dynamic Data
– http://www.kdd2006.com/workshops.html
• Laxman, Sastry and Unnikrishnan. “Discovering
Frequent Episodes and Learning Hidden Markov
Models: a Formal Connection.” IEEE
Transactions on Knowledge and Data
Engineering, vol. 17, no. 11, pp. 1505-1517. 2005
Network Reconstruction
• How to determine directed, acyclic graphs
from sequential event data
[Figure: example directed graph with nodes z, x, a, g, n, p]
Multilingual Problem
• What if source text is not in English?
Machine Translation (MT)
• Free, web-based tools are not state-of-the-art
– e.g. http://babelfish.altavista.com/
• LanguageWeaver uses statistical MT
– Spin-off of USC Information Sciences Institute
– www.languageweaver.com
Hypothesis
• Effective Content Extraction rules can be
custom-developed for raw machine-translated text.
Summary
• Text Mining Can Offer Real Value
– Used Extensively by Gov’t Intel Agencies
– Several COTS tools available for Content
Extraction:
• SAS Text Miner
• AeroText (Lockheed Martin)
• ClearForest
• Attensity
• etc.
– GATE – Univ. of Sheffield, open-source
– http://gate.ac.uk/