Introduction – First-Order Probabilistic Models

Download Report

Transcript Introduction – First-Order Probabilistic Models

IJCAI 2003 Workshop on Learning Statistical Models from Relational Data
First-Order Probabilistic Models for
Information Extraction
Bhaskara Marthi, Brian Milch,
Stuart Russell
Computer Science Div.
University of California
NIPS 15th, 2003
Identity Uncertainty and Citation
Matching
Hanna Pasula, Bhaskara Marthi,
Brian Milch, Stuart Russell, Ilya
Shpitser
Computer Science Div.
University of California
Advisor: Hsin-His Chen
Reporter: Chi-Hsin Yu
Date: 2007.06.21
Outlines






Introduction
Related works
Models for the bibliography domain
Experiment on model A
Desiderata for a FOPL
Conclusions
2/18
Introduction –
Citation Matching Problem

Citation matching:


the problem of deciding which citations correspond to the
same publication
Difficulties




Different citation styles
An imperfect copy of the book’s title
Different ways to refer an object (identity)
Ambiguity

“Wauchope, K. Eucalyptus: Integrating Natural language
Input with a Graphical User Interface”


Tasks



Author: “Wauchope, K. Eucalyptus” or “Wauchope, K.” ?
Parsing
Disambiguation
Matching
3/18
Introduction –
Citation Matching Problem: Examples
Journal of Artificial Intelligence Research,
or Artificial Intelligence Journal ??
4/18
Introduction –
First-Order Probabilistic Models
Logic
Propositio
nal
First-order
Probabilistic Model
Formula
A  B
P(j  m  a  b  e),
Bayesian Network
Inference/
Algorithms
Resolution, Model Checking,
Forward chaining, DPLL,
WalkSAT …
Bayes’ rule, Summing-out,
smoothing, prediction,
approximation (likely-hood,
MCMC …), …
Formula
x King(x)  Greedy(x) 
Evil(x)
x King(x)  Greedy(x)  Evil(x)
= 0.8766
Inference/
Algorithms
Unification, Resolution, …
Learning, Approximation, …
System/
Languages
Prolog, Rule Engine
(JBoss), …
FOPL, RPM
5/18
Introduction –
Result of Model B
6/18
Related Works

IE


Bayesian modeling



the Message Understanding Conferences
[DARPA,1998]
finding stochastically repeated patterns (motifs)
in DNA sequences [Xing et al., 2003]
Robot localization [Anguelov et al., 2002]
FOPL/RPM (Relational Prob. Model)

A. Pfeffer. Probabilistic Reasoning for Complex
Systems. PhD thesis, Stanford, 2000.
7/18
Models for the Bibliography Domain –
Model A

[Pasula et al. 2003]
8/18
Models for the Bibliography Domain –
Model A (Cont.)


Suggest a declarative approach to
identity uncertainty using a formal
language
Algorithm

Steps
Generate objects/instances
 Parse and fill attributes
 Inference (Approximation, MCMC)


Cluster the identity (publication)
9/18
Models for the Bibliography Domain –
Model A (Cont.)

Attributes using unconditional probability

learn several bigram models


using the following resources




letter-based models of first names, surnames, and title words
the 2000 Census data on US names
a large A.I. BibTeX bibliography
a hand-parsed collection of 500 citations
Attributes using conditional probability

Using noise channels for some attributes



Citation.parse



It keeps track of the segmentation of Citation.text
An author segment, a title segment, and three filler segments (one
before, one after, and one in between)
Citation.text


the corruption models of Citation.obsTitle, AuthorAsCited.surname, and
AuthorAsCited.fnames
The parameters of the corruption models are learnt online, using stochastic
EM
Be constrained by Citation.parse, Paper.pubType, …
These models were learned using our pre-segmented file.
10/18
Models for the Bibliography Domain –
Model B
Publisher
Name, City
Collection
Authors
Name, Type, Date,
Publisher
Name
Area+ (Fields)
Publication
Citation Groups
Title
Area
Type (Book/conf. …)
AuthorList
Collection
Type (Area, Author)
Style
PublicationList
CitationList
Citation
Publication
TitleAsCited
AuthorsAsCited
Text
Parse
11/18
Models for the Bibliography Domain –
Model B (Cont.)

Generating objects




The set of Author objects, and the set of
Collection objects are generated independently.
the set of Publication objects is generated
conditional on the Authors and Collections.
CitationGroup objects are generated
conditional on the Authors and Collections.
Citation objects are generated from the
CitationGroups.
12/18
Models for the Bibliography Domain –
Model B (Cont.)

Fill attributes

Author.Name


Publications.Title


is chosen from a mixture of a letter
bigram distribution with a distribution that
chooses from a set of commonly occurring
names
is generated from an n-gram model,
conditioned on Publications.area
More specific relations and conditions
between attributes
13/18
Experiment on model A –
Experiment Setting

Dataset



Citeseer’s hand-matched datasets
Each of these datasets contains several
hundred citations of machine learning papers
Citeseer’s phrase matching algorithm


a greedy agglomerative clustering method
 based on a metric that measures the degrees
to which the words and phrases of any two
citations overlap
half of them in clusters ranging in size from
two to twenty-one citations
14/18
Experiment on model A –
Experiment Result
15/18
Desiderata for a FOPL

Contains





A probability distribution over possible worlds
The expression power to model the relational structure of
the world
An efficient inference algorithm
A learning procedure which allows priors over the
parameters
Has the ability





to answer queries
to make inferences about the existence or nonexistence of
objects having particular properties
to represent common types of compound objects
to represent probabilistic dependencies
to incorporate domain knowledge into the inference
algorithms
16/18
Conclusions

First-order probabilistic models


a useful, probably necessary, component of
any system that extracts complex relational
information from unstructured text data
Some of the directions we plan to pursue
in the future


defining a representation language that allows
such models to be specified declaratively,
scaling up the inference procedure to handle
large knowledge bases
17/18
Thanks!!