Introduction – First-Order Probabilistic Models
Download
Report
Transcript Introduction – First-Order Probabilistic Models
IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
First-Order Probabilistic Models for
Information Extraction
Bhaskara Marthi, Brian Milch,
Stuart Russell
Computer Science Div.
University of California
NIPS 15th, 2003
Identity Uncertainty and Citation
Matching
Hanna Pasula, Bhaskara Marthi,
Brian Milch, Stuart Russell, Ilya
Shpitser
Computer Science Div.
University of California
Advisor: Hsin-His Chen
Reporter: Chi-Hsin Yu
Date: 2007.06.21
Outlines
Introduction
Related works
Models for the bibliography domain
Experiment on model A
Desiderata for a FOPL
Conclusions
2/18
Introduction –
Citation Matching Problem
Citation matching:
the problem of deciding which citations correspond to the
same publication
Difficulties
Different citation styles
An imperfect copy of the book’s title
Different ways to refer an object (identity)
Ambiguity
“Wauchope, K. Eucalyptus: Integrating Natural language
Input with a Graphical User Interface”
Tasks
Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.” ?
Parsing
Disambiguation
Matching
3/18
Introduction –
Citation Matching Problem: Examples
Journal of Artificial Intelligence Research,
or Artificial Intelligence Journal ??
4/18
Introduction –
First-Order Probabilistic Models
Logic
Propositio
nal
First-order
Probabilistic Model
Formula
A B
P(j m a b e),
Bayesian Network
Inference/
Algorithms
Resolution, Model Checking,
Forward chaining, DPLL,
WalkSAT …
Bayes’ rule, Summing-out,
smoothing, prediction,
approximation (likely-hood,
MCMC …), …
Formula
x King(x) Greedy(x)
Evil(x)
x King(x) Greedy(x) Evil(x)
= 0.8766
Inference/
Algorithms
Unification, Resolution, …
Learning, Approximation, …
System/
Languages
Prolog, Rule Engine
(JBoss), …
FOPL, RPM
5/18
Introduction –
Result of Model B
6/18
Related Works
IE
Bayesian modeling
the Message Understanding Conferences
[DARPA,1998]
finding stochastically repeated patterns (motifs)
in DNA sequences [Xing et al., 2003]
Robot localization [Anguelov et al., 2002]
FOPL/RPM (Relational Prob. Model)
A. Pfeffer. Probabilistic Reasoning for Complex
Systems. PhD thesis, Stanford, 2000.
7/18
Models for the Bibliography Domain –
Model A
[Pasula et al. 2003]
8/18
Models for the Bibliography Domain –
Model A (Cont.)
Suggest a declarative approach to
identity uncertainty using a formal
language
Algorithm
Steps
Generate objects/instances
Parse and fill attributes
Inference (Approximation, MCMC)
Cluster the identity (publication)
9/18
Models for the Bibliography Domain –
Model A (Cont.)
Attributes using unconditional probability
learn several bigram models
using the following resources
letter-based models of first names, surnames, and title words
the 2000 Census data on US names
a large A.I. BibTeX bibliography
a hand-parsed collection of 500 citations
Attributes using conditional probability
Using noise channels for some attributes
Citation.parse
It keeps track of the segmentation of Citation.text
An author segment, a title segment, and three filler segments (one
before, one after, and one in between)
Citation.text
the corruption models of Citation.obsTitle, AuthorAsCited.surname, and
AuthorAsCited.fnames
The parameters of the corruption models are learnt online, using stochastic
EM
Be constrained by Citation.parse, Paper.pubType, …
These models were learned using our pre-segmented file.
10/18
Models for the Bibliography Domain –
Model B
Publisher
Name, City
Collection
Authors
Name, Type, Date,
Publisher
Name
Area+ (Fields)
Publication
Citation Groups
Title
Area
Type (Book/conf. …)
AuthorList
Collection
Type (Area, Author)
Style
PublicationList
CitationList
Citation
Publication
TitleAsCited
AuthorsAsCited
Text
Parse
11/18
Models for the Bibliography Domain –
Model B (Cont.)
Generating objects
The set of Author objects, and the set of
Collection objects are generated independently.
the set of Publication objects is generated
conditional on the Authors and Collections.
CitationGroup objects are generated
conditional on the Authors and Collections.
Citation objects are generated from the
CitationGroups.
12/18
Models for the Bibliography Domain –
Model B (Cont.)
Fill attributes
Author.Name
Publications.Title
is chosen from a mixture of a letter
bigram distribution with a distribution that
chooses from a set of commonly occurring
names
is generated from an n-gram model,
conditioned on Publications.area
More specific relations and conditions
between attributes
13/18
Experiment on model A –
Experiment Setting
Dataset
Citeseer’s hand-matched datasets
Each of these datasets contains several
hundred citations of machine learning papers
Citeseer’s phrase matching algorithm
a greedy agglomerative clustering method
based on a metric that measures the degrees
to which the words and phrases of any two
citations overlap
half of them in clusters ranging in size from
two to twenty-one citations
14/18
Experiment on model A –
Experiment Result
15/18
Desiderata for a FOPL
Contains
A probability distribution over possible worlds
The expression power to model the relational structure of
the world
An efficient inference algorithm
A learning procedure which allows priors over the
parameters
Has the ability
to answer queries
to make inferences about the existence or nonexistence of
objects having particular properties
to represent common types of compound objects
to represent probabilistic dependencies
to incorporate domain knowledge into the inference
algorithms
16/18
Conclusions
First-order probabilistic models
a useful, probably necessary, component of
any system that extracts complex relational
information from unstructured text data
Some of the directions we plan to pursue
in the future
defining a representation language that allows
such models to be specified declaratively,
scaling up the inference procedure to handle
large knowledge bases
17/18
Thanks!!