Transcript NSF NGDM

Relational Data Mining with
Inductive Logic Programming for
Link Discovery
Ray Mooney, Prem Melville, Rupert Tang
University of Texas at Austin
Jude Shavlik, Inês de Castro Dutra, David Page,
Vítor Santos Costa
University of Wisconsin at Madison
1
EELD Program
• Evidence Extraction
• Link Discovery
• Pattern Learning
2
Link Discovery Task
(from Jim Antonisse, GITI)
Evidence
request(s)
Evidence
Link Discovery
Vetted hyp
cases
Core:
Pattern Matching
Queries
Legend:
Pattern(s) of Interest Domain Patterns
Ontologies
Alerts based on
Hypothesized
cases
Problem
Context
pre-run-time processing
run-time processing
3
Link Discovery
• Data is multi-relational with many people,
places, objects and actions and numerous
types of relations between them.
• Link analysis in intelligence and
criminology investigates exploring and
visualizing such data as a graph with many
nodes and edges of various types.
• Link discovery entails finding new links
and recognizing threatening patterns in such
highly-relational data.
4
EELD Program
• Evidence Extraction
• Link Discovery
• Pattern Learning
5
Pattern Learning for Link Discovery
• Automated discovery of “patterns of
interest” that indicate potentially
threatening activities in large amounts of
heterogeneous, multi-relational data.
• Requires inducing multi-relational patterns
that characterize multiple entities and
multiple links between them.
6
Limitations of Traditional Data Mining
• Traditional KDD methods assume the data
to be mined is in a single relational table
and that examples are flat tuples of attribute
values.
• This assumption stems from:
– 1) Properties of the typical data mining tasks
like market basket analysis.
– 2) Focus in machine learning and statistics on
classification or regression using feature
vectors as inputs.
7
Relational Data Mining
• Data contains multiple relations.
• Patterns to be discovered contain multiple
relations.
• Knowledge to be discovered may be the
definition of another relation rather than a
classification or regression function.
8
Relational Data Mining Example
Bob
Male
Female
Alice
Z
Alice
Married
Mary
Fred
Tom
X
Jane
W
Jane
John
Sue
Jack
Parent
Carol
Y
Carol
Uncle(tom, carol)
Uncle(X,Y) :- Parent(Z,X), Parent(Z,W), Parent(W,Y), Male(X), not(X=W).
9
Relational Data Mining Example (cont)
Bob
Male
Female
Alice
W
Alice
Married
Mary
Tom
V
Fred
John
Y
Jane
Z
Jane
Sue
Jack
X
Parent
Carol
Uncle(jack, john)
Uncle(X,Y) :- Married(X,Z), Parent(W,Z), Parent(W,V), Parent(V,Y)
, Male(X), not(Z=V).
10
Most KDD Ignores RDM
• KDD textbooks barely mention RDM:
– Han & Kamber, 2001
– Hand, Mannila, & Smyth, 2001
– Witten & Frank, 1999
• But there is a recent edited collection on
RDM:
– S. Džeroski & N. Lavrač, eds. Relational Data
Mining, Springer Verlag, 2001.
11
Inductive Logic Programming
(ILP)
• Standard formal language for representing
relational knowledge is first-order predicate
logic.
• ILP studies the induction of hypotheses in
first-order predicate logic.
• Logic programs (e.g. Prolog) or functionfree logic programs (e.g. Datalog), are a
useful, reasonably-tractable subset of firstorder predicate logic.
• ILP is the most well-studied approach to
relational data mining.
12
ILP Problem Definition
Given
• Positive Example Set:
P
• Negative Example Set: N
• Background Knowledge: B
Find
• Hypothesis, H, such that
p  P : B  H  p
n  N : B  H  n
P, N, B and H are all sets of rules in first-order logic (i.e.
Horn clauses, logic programs)
13
ILP Algorithms
• We have utilized two ILP systems for EELD
problems in link discovery.
– Aleph (Srinivasan, 2001) A variant of the
popular Progol algorithm (Muggleton, 1995)
– mFoil+ (Tang and Mooney, 2002) A variant of
the popular Foil algorithm (Quinlan, 1990)
14
EELD Russian Nuclear Smuggling Data
• Data manually extracted from new sources about
events related to nuclear smuggling (developed by
Veridian Inc.)
• Size of data set:
– 40 relational tables
– 2 to 800 tuples per relation
• Translated Access database to Prolog, mapping
each relational table to a predicate.
• Used Aleph to learn rules for the relation
Linked(A,B)which determines whether or not
two events are part of the same incident.
– 143 positive examples
– 517 negative examples
15
Illustration of Linked Relation
New Event
Partial Incident N
Partial Incident M
16
Find Correct Incident for New Event
Expanded Incident N
Partial Incident M
17
Sample Rule
linked(EventA,EventB) :lk_event_material(_,EventA,_,_,_, ConcealmentG,DescH),
lk_event_person(_,EventB,PersonD,_,C,C,_),
lk_person_material(_,PersonD,MatF,EvE,_,_,_,_,_),
lk_event_material(_,EvE,MatF,I,_, ConcealmentG,DescH),
l_relations(I,_,"Stolen").
If A is linked to a specific type of material <G,H>, and B is
linked to a person linked to the same specific type of material,
through an event in which that material was stolen, then events
A and B are linked.
18
Linked(A,B)
A
B
Event
Material
Person
19
Linked(A,B)
A
B
Material
Type GH
Event
Material
Person
20
Linked(A,B)
A
B
E
Material
Type GH
D
Material
Type GH
Event
Material
Person
21
Linked(A,B)
A
B
E
D
Stolen
Material
Type GH
Material
Type GH
Event
Material
Person
22
Linked(A,B)
A
B
E
D
Stolen
Material
Type GH
Material
Type GH
Event
Material
Person
23
Accuracy Results for Learning Linked
for Nuclear Smuggling Data
• Experimental Method: 5-fold cross validation.
• Also tried bagging Aleph to produce an
ensemble of 25 hypotheses.
Majority Class
(not Linked)
78%
Aleph
Bagged Aleph
83%
86%
24
Synthetic Contract Killing Data
• Data generated by a plan-based simulator that
generates evidence emulating contract killings and
other types of murders (developed by IET Inc.).
• Simulator used to generate evidence from 200
murder events of three types:
– Murder for Hire (71 exs)
– First Degree (75 exs)
– Second Degree (54 exs)
• Use mFoil+ to classify events into one of these
three categories.
25
Sample Rules
• Murder For Hire(A):groupMemberMaleficiary(A, B),
subEvents(A, C), crimeMotive(C, economic).
• First Degree Murder(A):subEvents(A, B), performedBy(B, C),
loves(C,D).
• Second Degree Murder(A):subEvents(A, B),
eventOccursAtLocationType(B,publicProperty),
crimeMotive(B, rival),
occurrentSubeventType(B, stealing_Generic).
26
Results on Synthetic Contract Killing Data
MurderForHire
FirstDegree
SecondDegree
PRECISION
85.52%
91.17%
95.83%
RECALL
91.07%
88.48%
59.45%
ACCURACY
Majority Class
mFOIL+
37.50%
76.67%
27
Recent Result from
EELD Challenge Problem
murder_for_hire(A) :eventOccursAt(A,B), perpetrator(A,C),
agentPhoneNumber(C,D),callerNumber(E,D),
accountHolder(F,C), to_Generic(G,F),
from_Generic(G,H), to_Generic(I,H).
• Says an event is a “murder for hire” if it has a recorded location
and perpetrator, we have a recorded phone call to the perpetrator,
and there was a chain of bank transfers resulting in money
reaching the perpetrator’s account.
• 100% accuracy on a held-out test set.
• Similar pattern found manually by LD researchers working on
this challenge problem.
28
Future Research
• Scaling to larger datasets
– Stochastic search
– Logic program optimization
– Integration with relational and deductive
database technology.
• Integrating probabilistic reasoning
– Logic programs with Bayes-net constraints
• Active Learning
• Theory Refinement
29
Related Research
• Graph-based Relational Data Mining
– Subdue (Cook & Holder, UT Arlington)
• Probabilistic Relational Models
– PRMs (Koller, Stanford)
• Relational Feature Construction
– PROXIMITY (Jensen, UMass)
30
Record Linkage
• Identify and merge duplicate field values and
duplicate records in a database.
• Applications
– Duplicates in mailing lists
– Merging multiple databases of stores, restaurants, etc.
– Matching bibliographic references in research papers
(Cora/ResearchIndex)
– Identifying individuals who are trying to hide their
identity by providing slightly erroneous personal
information.
31
Record Linkage Examples
Author
Title
Venue
Address Year
Yoav Freund, H.
Sebastian Seung,
Eli Shamir, and
Naftali Tishby
Information,
prediction, and
query by committee
Advances in Neural
Information
Processing System
San Mateo,
CA
Freund, Y.,
Seung, H.S.,
Shamir, E. &
Tishby, N.
Information,
prediction, and
query by committee
Advances in Neural
Information
Processing Systems
San Mateo,
CA.
Name
Address
City
1993
Cusine
Second Avenue
Deli
156 2nd Ave. at
10th
New York
Delicatessen
Second Avenue
Deli
156 Second Ave.
New York City
Delis
32
Trainable Record Linkage
• MARLIN (Multiply Adaptive Record Linkage
using INduction)
• Learn parameterized similarity metrics for
comparing each field.
– Trainable edit-distance
• Use EM to set edit-operation costs
• Learn to combine multiple similarity metrics for
each field to determine equivalence.
– Use SVM to decide on duplicates
33
MARLIN
Record Linkage Framework
Trainable
duplicate
detector
Trainable
similarity
metrics
m1
…
mk
A.Field1 B.Field1
m1
…
mk
A.Field2 B.Field2
…
…
m1
…
mk
A.Fieldn B.Fieldn
34
Conclusions
• Pattern Learning for Link Discovery is an
important application of data mining for counterterrorism.
• Learning for Link Discovery requires Relational
Data Mining (RDM).
• Other problem domains require RDM
– Bioinformatics
– Web
– Natural Language Understanding
• RDM is an important next-generation KDD
capability.
35