Statistical Relational Learning for Link Prediction


Statistical Relational Learning for Link Prediction
Alexandrin Popescul and Lyle H. Ungar
Presented by Ron Bjarnason
11 November 2003
Link Prediction
• Link Prediction is an important problem
arising in many domains
– Web pages
– Computers
– Scientific publications
– Organizations
– People
Predicting the presence of links or connections in a domain is both important and difficult to do well.
Characteristics of Link Prediction Domains
• Their nature is inherently multi-relational
– This makes the standard “flat” file domain
representation inadequate
• Data is often noisy or partially observed
– e.g. articles may be cited for any number of reasons, and those reasons are not fully observed
Typical Learning Approaches
• Assume one-table “flat” domain
representation
• Process of feature creation is decoupled
from feature selection (and is often
performed manually)
• Relevant features may not be readily apparent to a human analyst
The “Full Join” Approach
• Perform a full join on the entire database
and statistically analyze the entries
– Both impractical and incorrect
• Size is prohibitive
• Notion of an object is lost (stored across multiple
rows)
• Entries will be atomic attribute values, rather than
results from a complex search
• Negates option to introduce intelligent search
heuristics
The Relational Method
• Integrates standard statistical modeling (logistic
regression) with a process for systematically generating
features from relational data
• Feature generation is formulated as search in the space
of relational database queries
• Space bias can be controlled by specifying valid query
types
– Aggregations or statistical operations
– Groupings
– Richer join conditions
– Arg-max based queries
• Allows for discovery of complex, interesting relationships
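As a concrete illustration, here is a minimal sketch of one query-plus-aggregate feature: the number of documents cited by both endpoints of a candidate link. It assumes a Citation(citing, cited) table in sqlite3; the table and column names are illustrative, not the paper's exact schema.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Citation (citing TEXT, cited TEXT)")
    conn.executemany("INSERT INTO Citation VALUES (?, ?)",
                     [("d1", "d3"), ("d2", "d3"), ("d1", "d4")])

    def common_citations(doc_a, doc_b):
        """One candidate feature: COUNT over a self-join of Citation,
        i.e. how many documents are cited by both doc_a and doc_b."""
        row = conn.execute(
            """SELECT COUNT(*)
                 FROM Citation c1 JOIN Citation c2 ON c1.cited = c2.cited
                WHERE c1.citing = ? AND c2.citing = ?""",
            (doc_a, doc_b)).fetchone()
        return row[0]

    print(common_citations("d1", "d2"))  # 1 (both cite d3)

Each such query, paired with an aggregate, yields one scalar column that can be fed to the statistical model.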
Link Prediction in the Citeseer Domain
• Can be used as a citation recommendation
service
– User would provide an abstract, author names,
possibly a partial reference list
• Citeseer provides a rich set of relational data
– Texts of titles
– Abstracts and documents
– Citation information
– Author names and affiliations
– Conference or journal names
Methodology
• Couple the two main processes
– Generation of feature candidates from
relational data
– Their selection with statistical model selection
criteria
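A minimal sketch of that coupling, assuming a candidate generator (the refinement search described next) and a BIC scoring function; this illustrates the loop rather than the authors' exact implementation:

    def coupled_search(generate_candidates, bic_score, max_features=20):
        """Greedily grow the feature set: each candidate produced by the
        relational search is kept only if it improves the model's BIC."""
        selected = []
        best_bic = bic_score(selected)
        for feature in generate_candidates():      # refinement-graph search
            candidate_bic = bic_score(selected + [feature])
            if candidate_bic < best_bic:           # lower BIC = better model
                selected.append(feature)
                best_bic = candidate_bic
            if len(selected) >= max_features:
                break
        return selected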
Relational Feature Generation
• Main principle of search formulation is
based on the concept of refinement
graphs
• Start with the most general clauses and
progress by refining them into more
specialized clauses
Relational Feature Generation – Refinement Graphs
• Directed acyclic graphs specifying search space
• Constrained by specifying legal clauses
– Negation and recursion disallowed
• Structured by partial ordering of clauses
• A search node is expanded (refined) to produce
the most general specializations
• ILP systems using refinement graph search
usually apply two refinement operators
– Add a predicate to a clause
– A single variable substitution
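A hedged Python sketch of these two operators, with a clause represented as a list of (predicate, arguments) literals; the predicate names and variable handling are simplified assumptions:

    def add_predicate(clause, predicate, args):
        """Refinement 1: specialize a clause by conjoining one more literal."""
        return clause + [(predicate, tuple(args))]

    def substitute_variable(clause, old_var, new_var):
        """Refinement 2: specialize a clause by a single variable substitution."""
        return [(pred, tuple(new_var if a == old_var else a for a in args))
                for pred, args in clause]

    # Start from the most general clause for link prediction and refine it.
    root = [("cites", ("D1", "D2"))]
    refined = add_predicate(root, "author", ("D1", "A"))
    refined = substitute_variable(refined, "A", "A1")
    print(refined)  # [('cites', ('D1', 'D2')), ('author', ('D1', 'A1'))]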
Relational Feature Generation – Aggregates
• Query results are aggregated to produce scalar
numeric values to be used in statistical learning
• Any statistical aggregate can be valid, but some
are expected to be more useful than others
– Count
– Average
– Max
– Min
– Mode
– Empty
• Aggregations are considered for inclusion at
each node, but not factored into further search
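A minimal sketch of the aggregation step: a query returns a bag of values, and each aggregate turns that bag into one scalar feature. The operator set mirrors the list above; "empty" is treated here as an indicator that the result set is empty.

    from statistics import mean, mode

    AGGREGATES = {
        "count":   len,
        "average": lambda vals: mean(vals) if vals else 0.0,
        "max":     lambda vals: max(vals) if vals else 0.0,
        "min":     lambda vals: min(vals) if vals else 0.0,
        "mode":    lambda vals: mode(vals) if vals else None,
        "empty":   lambda vals: int(len(vals) == 0),
    }

    query_result = [3, 5, 5, 9]   # e.g. values returned by one relational query
    features = {name: agg(query_result) for name, agg in AGGREGATES.items()}
    print(features)
    # {'count': 4, 'average': 5.5, 'max': 9, 'min': 3, 'mode': 5, 'empty': 0}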
Relational Feature Selection
• Logistic Regression is used for binary
classification problems
• Regression coefficients are learned to
maximize the likelihood function
• Stepwise model selection and Bayesian
Information Criterion (BIC) are used to
avoid overfitting
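A hedged sketch of forward stepwise selection with BIC on top of logistic regression, using numpy and statsmodels (assumed available); the feature matrix is synthetic, just to make the example self-contained:

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(0)
    X = rng.normal(size=(500, 5))                       # 5 candidate features
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=500) > 0).astype(int)

    selected, remaining, best_bic = [], list(range(X.shape[1])), np.inf
    while remaining:
        # Try adding each remaining feature; take the one with the lowest BIC.
        bic, j = min((sm.Logit(y, sm.add_constant(X[:, selected + [k]]))
                        .fit(disp=0).bic, k) for k in remaining)
        if bic >= best_bic:                             # no improvement: stop
            break
        selected.append(j)
        remaining.remove(j)
        best_bic = bic

    print("selected features:", selected, "BIC:", round(best_bic, 1))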
Tasks and Data – IID Violation
• The relational structure violates the
assumption of independence
• This can be remedied by choosing the
right features
• When the right features are used, the
observations are independent given the
features
Two Prediction Tasks
1. The identity of all objects is known.
Some link structure is known. Predict
unobserved links.
2. New objects arrive. Predict their links.
– What do we know about the objects?
  – Some of their links
  – Some of their attributes
This paper presents results for task 1.
The Citeseer Environment
• 271,343 documents
• 1,092,200 citations
• Five data sets defined
– Four data sets consist of links among
documents containing a certain query phrase
(e.g. “artificial intelligence”)
– Fifth data set includes all documents
Learning Methodology
• Populate three relations: Citation, Author, and PublishedIn
• Sample 2,500 examples each of
– Positive training examples (from available
links)
– Negative training examples (absence of a
link)
– Positive test examples
– Negative test examples
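A minimal sketch of this sampling scheme, with made-up document ids and the Citation relation held as a set of (citing, cited) pairs; a full version would also keep training and test samples disjoint and strip the sampled citations from the background data, as the next slide describes.

    import random

    documents = [f"doc{i}" for i in range(1000)]
    citations = {(random.choice(documents), random.choice(documents))
                 for _ in range(5000)}                   # toy Citation relation

    def sample_examples(n=2500):
        positives = random.sample(sorted(citations), n)  # observed links
        negatives = []
        while len(negatives) < n:                        # absence of a link
            pair = (random.choice(documents), random.choice(documents))
            if pair not in citations and pair[0] != pair[1]:
                negatives.append(pair)
        return positives, negatives

    train_pos, train_neg = sample_examples()
    test_pos, test_neg = sample_examples()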
Learning Methodology
• Remove the test-set citations from the background data (but keep all other relevant information)
• Remove the training-set citations from the background data (so the answers are not contained in the background knowledge)
• Perform learning
– Using citations only
– Using all relevant information (citation,
authors and venue)
Results: Training and Test set accuracies – balanced priors

Dataset                       BK: Citation          BK: All
                              Train     Test        Train     Test
“artificial intelligence”     90.24     89.68       92.60     92.14
“data mining”                 87.40     87.20       89.70     89.18
“information retrieval”       85.98     85.34       88.88     88.82
“machine learning”            89.40     89.14       91.42     91.14
Entire collection             92.80     92.28       93.66     93.22
The End