Russell--Pasula-et-a.. - Computer Science Division

Transcript Russell--Pasula-et-a.. - Computer Science Division

Identity uncertainty and
citation matching
Pasula, H., Marthi, B., Milch, B., Russell, S., Shpitser, I.
NIPS 2002
1
Some relevant news…
 Facebook rolled out Graph Search on Tuesday. The
natural-language search engine can, for example,
search for "music that people who like Mitt Romney
like," or "photos of my friends in 1989." Although it's
limited to four topics at the moment - people, places,
interests and photos - the queries that users can run
against the millions of photos and connections stored
among Facebook's billion users are powerful, powerful
tools. … Bing - like Google - has begun to try and
provide answers to questions, rather than lists of links.
But that's the same territory Zuckerberg and Co. have
staked out, too.
2
Outline







Background of research
Key contributions
Citation matching and information extraction
Identity uncertainty
Generative model
Experimental results
Implications for information extraction more generally
3
Background of research
 Record linkage (Felegi & Sunter 1969):
 Naïve Bayes model for record-pair match/mismatch vector
given entity match/mismatch
 Trained on matched and unmatched pairs
 Sensitive to population sizes in train/test
 Bayesian analysis of identity:
 Data association literature (multitarget tracking)
 Huang and R 97 (freeway surveillance)
 Previous work on RPMs (Koller and Pfeffer)
 Previous work on MCMC for RPM++ (Pasula & R 01)
 CiteSeer not working too well
4
CiteSeer02: Russell w/4 Norvig

Russell S, Norvig P (1995) Artificial Intelligence: A Modern Approach,
Prentice Hall Series in Artificial Intelligence. Englewood Cliffs, New Jersey

Stuart Russell and Peter Norvig, Artificial Intelligence: A Modern Approach,
Prentice Hall, 1995.

Russell S.; Norvig, P. Articial Intelligence - A Modern Approach. PrenticeHall International Editions, 1995.

Russell S.J., Norvig P., (1995) Artificial Intelligence, A Modern Approach.
Prentice Hall.

S. Russell and P. Norvig. Articial Intelligence, a Modern Approach. Prentice
Hall, New Jersey, NJ, 1995.

Stuart Russell and Peter Norvig. Artificial intelligence: A modern approach. PrenticeHall Series on Artificial Intelligence. Prentice-Hall, Englewood Cliffs, New Jersey,
1995.

S. Russell and P Norvig. Artifical Intelligence: a Modern Approach. Prentice Hall,
1995. Book Details from Amazon or Barnes \& Noble

Stuart Russell and Peter Norvig. Articial Intelligence: A Modern Approach. Prentice
Hall, 1995.

S. J. Russell and P. Norvig. Artificial Intelligence, a modern approach. Prentice Hall,
Upper Saddle River, New Jersey 07458, 1995.

Stuart Russell and Peter Norvig. Artificial Intelligence. A modern approach. PrenticeHall, 1995.

S. J. Russell and P. Norvig. Articial Intelligence: A Modern Approach. Prentice Hall.
1995.

S. Russell and P. Norvig, Artificial Intelligence A Modern Approach Prentice Hall
1995.

S. Russell and P. Norvig. Introduction to Artificial Intelligence. Prentice Hall, 1995.

Stuart Russell and Peter Norvig. Artficial Intelligence: A Modern Approach. PrenticeHall, Saddle River, NJ, 1995.

Stuart Russell and Peter Norvig. Articial Intelligence a modern approach. Prentice
Hall series in articial intelligence. Prentice Hall, Upper Saddle River, New Jersey,
1995.

Chapter 18 Artificial Intelligence: A Modern Approach by Stuart Russell and Peter
Norvig, Prentice-Hall, 2000.

Dynamics of computational ecosystems. Physical Review A 40:404--421. Russell, S.,
and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice Hall.

S. Russell, P. Norvig: Artificial Intelligence -- A Modern Approach, Prentice Hall,
1995.

Russell, S. \& Norvig, P. (1995) Artificial Intelligence: A Modern Appraoch
(Englewood Cliffs, NJ: Prentice-Hall). Book Details from Amazon or Barnes \& Noble

Stuart Russell and Peter Norvig. AI: A Modern Approach. Prentice Hall, NJ, 1995.

S. Russell, P. Norvig. Artificial Intelligence: A Modem Approach. Prentice- Hall, Inc.,
1995.

391-414. Russell SJ, Norvig P (

Russell and Peter Norvig, "Artificial Intelligence - A Modern Approach
(AIMA)", pp. 33

Stuart Russell and Peter Norvig: Artificial Intelligence: A Modern Approach,
Prentice-Hall, 1994.

Russell, S. \& Norvig, P., An Introduction to Artificial Intelligence: A Modern
Approach, Prentice Hall International, 1996.

S. Russell, P. Norvig. Artician Intelligence. A modern approach. Prentice
Hall, 1995.

Stuart Russell and Peter Norvig. Artificial Intelligence: A Modern Approach.
Prentice Hall, 1995. Contributing writers: John F. Canny, Jitendra M. Malik,
Douglas D. Edwards. ISBN 0-13-103805-2.

Stuart Russell and Peter Norvig. Artificial Intelligence: A Mordern Approach.
Prentice Hall, Englewood Cliffs, New Jersey 07632, 1995.

In Proceedings of the Third Annual Conference on Evolutionary Programming
(pp. 131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P.
1995. Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ:
Prentice Hall.

John Wiley. Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A Modern
Approach. Prentice-Hall, Inc.

Stuart Russell and Peter Norvig: Artifcial Intelligence A Modern Approach,
Englewood Clioes, NJ: Prentice Hall, 1995.

In Scherer, K.R. \& Ekman, P. Approaches to Emotion, 13--38. Hillsdale, NJ:
Lawrence Erlbaum. Russell, S.J. and Norvig, P. 1995. Artificial Intelligent: A
Modern Approach. Englewood Cliffs, NJ: Prentice Hall.

Rosales E, Forthcoming Masters dissertation, Department of Computer
Science, University of Essex, Colchester UK Russell S and Norvig P (1995)
Artificial Intelligence: A Modern Approach. Prentice Hall: Englewood Cliffs,
New Jersey.

S. Russell and P. Norvig (1995) Artificial Intelligence; A Modern Approach,
Prentice Hall, New Jersey.

S. Russell and P. Norvig. Articial Intelligence. A Modern Approach. PrenticeHall, 1995. ISBN 0-13-360124-2.

Stuart J. Russell and Peter Norvig. Articial Intelligence: A Modern Approach, chapter
17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.

Stuart J. Russell and Peter Norvig. Articial Intelligence A Modern Approach. Prentice
Hall, Englewood Cli s, New Jersey, USA, 1995. 32

Morgan Kaufmann Publishers. Russell, S., and Norvig, P. 1995. Artificial Intelligence:
A Modern Approach. Prentice Hall.

Stuart J. Russell and Peter Norvig. Articial Intelligence: AModern Approach,chapter
17. Number 0-13-103805-2 in Series in Articial Intelligence. Prentice Hall, 1995.

W. Shavlik and T. G. Dietterich, eds., Morgan Kaufmann, San Mateo, CA. Russell, S.
and Norvig, P. (1995). Artificial Intelligence - A Morden Approach. Englewood Cliffs,
NJ: Prentice-Hall.

KeyGraph: Automatic indexing by co-occurrence graph based on building
construction metaphor. In Advanced Digital Library Conference. to appear. Russell,
S., and Norvig, P. 1995. Artificial Intelligence --A Modern Approach--.
Prentice-Hall.


Formal derivation of rule-based programs. IEEE Transactions on Software
Engineering 19(3):277--296. Russell, S., and Norvig, P. 1995. Artificial Intelligence: A
Modern Approach. Prentice Hall.

Russell, Stuart and Peter Norvig, Artificial Intelligence, A Modern Approach, New
Jersey, Prentice Hall, 1995.

S. Russell, P. Norvig: Articial Intelligence: A modern approach; Prentice Hall (1995).

Rechenberg, I. (89). Artificial evolution and artificial intelligence. In Forsyth, R. (Ed.),
Machine Learning, pp. 83--103 London. Chapman. Russell, S., \& Norvig, P. (1995).
Artificial Intelligence: A Modern Approach. Prentice Hall.

Russell, S and Norvig, P. 1995. Articial Intelligence: A Modern Approach PrenticeHall, Englewood Cli s, New Jersey, 1995.

Russell, S., \& Norvig, P. (1995) . Artificial intelligence: A modern monitoring methods
for information retrieval systems: From search approach. Prentice-Hall series on
artificial intelligence. Upper Saddle product to search process. Journal
of the American Society for Information Science, 47, 568 -- 583. River, NJ: PrenticeHall.


Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach, chapter
17. Number 0-13-103805-2 in Series in Artificial Intelligence. Prentice Hall, 1995.

S. Russell and P. Norvig. Articial Intelligence A Modern Approach. Prentice Hall,
Englewood Cli s, 1995.

Russell, Stuart and Norvig, Peter: Artificial Intelligence: A Modern Approach, Prentice
Hall, Englewood Cliffs NJ, 1995

S. Russell and P. Norvig. ????????? ????????????? ? ?????? ????????. Prentice
Hall, Englewood Cli s, NJ, 1995.

S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach - The Intelligent
Agent Book, Prentice Hall, NY, 1995.

S. Russell and P. Norvig. Artificial Intelligence-aModern Approach. Prentice Hall
International, Englewood Cliffs, NJ,USA,1995.

S.J.Russell, P.Norvig: Arti cial intelligence. A modern approach", Prentice-Hall
International, 1995.

In Proceedings of the Third Annual Conference on Evolutionary Programming (pp.
131--139). River Edge, NJ: World Scientific. Russell, S.J., \& Norvig, P. 1995.
Artificial Intelligence, A Modern Approach. Englewood Cliffs, NJ: Prentice
Hall.


In Working Notes of the IJCAI-95 Workshop on Entertainment and AI/ALife, 19--24.
Russell, S., and Norvig, P. 1995. Artificial Intelligence: A Modern Approach. Prentice
Hall.

Stuart J. Russell and Peter Norvig. Artiilcial Intelligence: A Modern
Approach. Prentice Hall, Englewood Cliffs, N J, 1995.

Academic Press. 359--380. Russell, S., and Norvig, P. 1994. Artificial
Intelligence: A Modern Approach. Prentice Hall.

Stuart J. Russell, Peter Norvig, Artifical Intelligence: A Modern Appraoch,
Prentice-Hall, Englewood Cliffs, New Jersey. 1994.

Cambridge, MA: MIT Press. Russell, S. J., and Norvig, P. (1994). Artificial
Intelligence: A Modern Approach. Englewood Cliffs, NJ: Prentice-Hall.

Morgan Kauffman. Russell, S., and Norvig, P. 1994. Artificial Intelligence: A
Modern Approach. Prentice Hall.

Fast Plan Generation Through Heuristic Search Russell, S., \& Norvig, P.
(1995). Artificial Intelligence: A Modern Approach. Prentice-Hall, Englewood
Cliffs, NJ.

Hoffmann \& Nebel Russell, S., \& Norvig, P. (1995). Artificial Intelligence: A
Modern Approach. Prentice-Hall, Englewood Cliffs, NJ.

Stuart Russel and Peter Norvig. Artificial Intelligence: A Modern Approach,
chapter 12.1 - 12.3, pages 367--380. Prentice Hall, 1995.

Stuart Russel and Peter Norvig. Artificial Intelligence, A Modern Approach.
PrenticeHall, 1996. 2

Stuart Russel, Peter Norvig, Articial Intelligence: A Modern Approach,
Prentice Hall, New Jersey, US, 1995

Russel, S., and Norvig, P. Articial Intelligence. A Modern Approach.
Prentice Hall Series in Artificial Intelligence. 1995.

S. Russel and P. Norvig. Artificial Intelligence, A Modern Approach, Prentice
Hall: 1995. Book Details from Amazon or Barnes \& Noble

S. J. Russel and P. Norvig. Articial Intelligence A Modern Approach, chapter
14, pages 426-435. Prentice Hall Series in Articial Intelligence. Prentice Hall
International, Inc., London, UK, rst edition, 1995. Exercise 14.3.

Russel, S. and P. Norvig. Articial intelligence: A modern approach, Prentice
Hall, 1995. Book Details from Amazon or Barnes \& Noble

S. Russel and P. Norvig Artificial Intelligence: A Modern Approach, MIT Press 1995.

Russel, S. and Norvig, P., "Artificial Intelligence: A Modern Approch," p. 111-114,
Prentice-Hall.

J. Russel and P. Norvig. Artificial Intelligence, A Modern Approach. Prentice Hall,
Upper Saddle River, NJ, 1995. 71

Stuart Russel and Peter Norvig. A Modern, Agent-Oriented Approach to Introductory
Artificial Intelligence. 1995.

Stuart J. Russel and Peter Norvig. Artificial Intelligence---A Modern Approach,
chapter 14, pages 426--435. Prentice Hall Series in Artificial Intelligence. Prentice
Hall Internationall, Inc., London, UK, first edition, 1995. Excersice 14.3.

Russel S. and Norvig P. (1995). Articial Intelligence. A Modern Approach. Prentice
Hall Series in Artificial Intelligence.

S. Russel, P. Norvig Articial Intelligence - A Modern Approach Prentice Hall, 1995

Russel, S., P. Norvig. Artificial Intelligence: A Modern Approach Prentice Hall 1995.

Artificial Intelligence, S Russel \& P Norvig, Prentice Hall, 1995 21

Russel, S.J, Norvig P: Artificial Intelligence. A Modern Approach, Prentice
Hall Inc. 1995

Russel, S., Norvig, P. (1995) Artificial Intellience - A modern approach.
(Englewood Cliffs: Prentice Hall International).
Key contributions
 A “vertically integrated,” declarative, generative
model from objects, relations, and
interpretations all the way down to text strings
 Clear distinction between entities and strings
 Probability of identity calculated as a direct
consequence of the model (no “similarity
heuristics”)
 Semantically driven “collective disambiguation”
during parsing occurs automatically
 MCMC over possible worlds (object existence and
relational structures), scaled linearly with #citations
17
Citation information extraction
 Given a set of citation strings scraped from
online papers
 Determine
 What distinct papers exist (including offline papers)
 What authors (venues, publishers, etc.) exist
 What the titles and authors (etc.) of the papers are
 Which paper cites which paper
18
Example
[Lashkari et al 94] Collaborative Interface Agents, Yezdi
Lashkari, Max Metral, and Pattie Maes, Proceedings of
the Twelfth National Conference on Articial
Intelligence, MIT Press, Cambridge, MA, 1994.
Metral M. Lashkari, Y. and P. Maes. Collaborative
interface agents. In Conference of the American
Association for Artificial Intelligence, Seattle, WA,
August 1994.
 Are these descriptions of the same object? Who are
the authors? What is the venue?
 Core task in CiteSeer, Google Scholar, over 300
record linkage companies
 CiteSeer (2001) asserted that Russell and Norvig
wrote over 100 distinct books
19
Relational probability models
 Taxonomic hierarchy of object classes (fixed)
 Named instances in each class (fixed)
 Complex attributes A denote typed relations between
objects (e.g., AuthorOf(author,paper))
 Paper allows for nonfunctional relations and uncertainty over
the (bounded) number #[A] of values for A for each object
 Simple attributes B denote typed fixed-range functions
 Probability models P(B | Parents(B)) where Parents(B)
are attribute chains A1.….An.B’
 E.g., height ~ mother.height, father.height
20
RPM for citations
21
Priors for object attributes
Names:
0.9 x (~ US census) +
0.1 x letter bigram
Titles:
Word+letter bigrams from
AI BibTeX database
#authors, #fnames, pubType:
Counts in hand-parsed
training set of 500 citations
22
Observation models
Corruption of names:
Letter deletion/insertion/change
First name(s) replaced by initials
Omission/reordering of authors
Corruption of titles:
Letter and word
deletion/insertion/replacement
Parameters learned online by EM
without requiring ground truth
23
Models for Citation.parse and Citation.text
 Range of parse attribute has two parts:
 Style specifies order between title and authors plus how to
write the authors (J. Smith vs Smith, J.).
 Prior estimated from hand-segmented training set.
 Segmentation (tiny grammar) specifies cut points:
 <filler1>|<title>|<filler2>|<authors>|<filler3>
[A50] “|I, Robot|.” By |I. Asimov|. Gnome Press, 1950.
 Prior for cut points is uniform
 Filler: letter + word bigrams estimated from training set
 Filler 3 depends on PubType
 Style, segmentation, obsTitle, names of obsAuthors,
and fillers jointly determine text
24
Comment
 The model contains no “features” or other
heuristics related to determining if two citations
match each other
25
Identity uncertainty
 We name papers P1, P2, … cited by C1, C2, …
 Uncertainty as to whether P1 = P2 etc.
 If P1 = P2, just one title, author list; otherwise two
 Possible world includes an assignment ι that
specifies equivalence classes of papers:
 {{P1},{P2}} vs. {{P1,P2}}
 Probability model factors as P(ι)P(world | ι)
 P(ι) = ΠCP(ιC); “known” classes (e.g., citations)
are fixed sets of singletons (e.g., {{C1},…,{Ck}})
26
Identity uncertainty contd.
 Specify prior on number of objects |ιC|
 E.g., number of authors ~ lognormal[6.9,2.3]
 Make appropriate a priori uniformity assumption
 Each paper equally likely to be cited
 Each author equally likely to write a paper
 Suppose ιC,k,m maps k named instances to m
distinct objects in C out of n total; then
 P(ιC,k,m) = n! / (n-m)!nk
27
Identity uncertainty contd.
 Why does the number of objects matter?
 Suppose, for any object, we observe just one
attribute: an integer in the range [1,…,1 000 000]
 Prior distribution for values is uniform
 Observed objects A and B both have value 526881
 What is the probability that A=B?
28
Identity uncertainty contd.
 Why does the number of objects matter?
 Suppose, for any object, we observe just one
attribute: an integer in the range [1,…,1 000 000]
 Prior distribution for values is uniform
 Observed objects A and B both have value 526881
 What is the probability that A=B?
 If there are a thousand objects, 0.999
29
Identity uncertainty contd.
 Why does the number of objects matter?
 Suppose, for any object, we observe just one
attribute: an integer in the range [1,…,1 000 000]
 Prior distribution for values is uniform
 Observed objects A and B both have value 526881
 What is the probability that A=B?
 If there are a thousand objects, 0.999
 If there a billion objects, 0.001
30
Identity uncertainty contd.
 Why does the number of objects matter?
 Suppose, for any object, we observe just one
attribute: an integer in the range [1,…,1 000 000]
 Prior distribution for values is uniform
 Observed objects A and B both have value 526881
 What is the probability that A=B?
 If there are a thousand objects, 0.999
 If there a billion objects, 0.001
 Probability of identity is affected by
 Number of objects, size of measurement space
 Measurement accuracy, similarity between objects
31
Inference
 Given
 a fixed assignment ι
 a fixed “relational skeleton” (complex attributes)
 RPM can be grounded as a Bayes net
32
Inference contd.
 Metropolis-Hastings MCMC
 First part of proposal is a split/merge on ιC :
 Drop two clusters of citations (i.e., papers) Pa, Pb
 Add two empty clusters P1, P2
 Put citations from Pa, Pb u.a.r. into P1, P2
 Second part fills in attribute values:
 Names, title: apply perturbation model in reverse
(and occasionally pick a nearby census name)
 Parse: sample directly from precomputed list
33
MCMC inference
34
Scaling
 Naïve M-H merge proposals fail more
frequently for larger worlds
 Canopies (McCallum et al KDD-00; see also
blocking methods in data deduplication): sets of
entities that have >ε chance of matching
according to a simple distance metric (i.e., don’t
try merging “Smith” and “Baeza-Yates”).
 Pick a canopy first, then a pair of papers in it
 Runtime scales
35
Citation Matching Results
Error
(Fraction of Clusters Not Recovered Correctly)
0.25
0.2
Phrase Matching
[Lawrence et al. 1999]
0.15
Generative Model + MCMC
[Pasula et al. 2002]
Conditional Random Field
[Wellner et al. 2004]
0.1
0.05
0
Reinforce
Face
Reason
Constraint
Four data sets of ~300-500 citations, referring to ~150-300 papers
36
Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural
Language Input with a Graphical User
Interface. NRL Report NRL/FR/5510-94-9711
(1994).
Is "Eucalyptus" part of the title, or is the author
named K. Eucalyptus Wauchope?
Cross-Citation Disambiguation
Wauchope, K. Eucalyptus: Integrating Natural
Language Input with a Graphical User
Interface. NRL Report NRL/FR/5510-94-9711
(1994).
Is "Eucalyptus" part of the title, or is the author
named K. Eucalyptus Wauchope?
Kenneth Wauchope (1994). Eucalyptus:
Integrating natural language input with a
graphical user interface. NRL Report
NRL/FR/5510-94-9711, Naval Research
Laboratory, Washington, DC, 39pp.
Second citation makes it clear how to parse the first one
Cross-citation disambiguation
 Later experiments (Milch) showed
 33% of singleton citations are parsed correctly
 64% of non-singleton citations are parsed correctly
39
What about Russell and Norvig?
 Still generates several clusters:
 Correct version
 Introduction to Artificial Intelligence
 S. Russel and P. Norvig, Artificial Intelligence
 Etc.
 Copying of erroneous citations produces data
whose best explanation has multiple clusters
 Could add a copying model, plus times of citing
and cited papers
40
Summary
 A somewhat nontrivial application of a




declarative, relational, open-universe language
Relatively simple model for existence – did not
allow for dependence on other objects
Inference results suggest scalability is not the
huge barrier for model-based methods
Later work by Pasula raised accuracy to ~99%
(comparable to humans); errors were mainly on
partial or concatenated citation strings
Building the model was initially difficult
41

Russell--Pasula-et-a.. - Computer Science Division

Transcript Russell--Pasula-et-a.. - Computer Science Division

Directory