Record Linkage Survey

Tan Yee Fan
2007 February 9
WING Group Meeting
Contents
  Introduction
  Record linkage using internal knowledge
    String matching
    Classification or clustering
    Graphical formalisms
    Blocking
    Adaptive methods
  Record linkage using search engine
  Conclusion
Introduction
  Ambiguous representation of named entities
    Citation records
    Web pages
    People names
    Products
    Customer records
    Images
    …
  Merging information from various sources
  Typos? Errors?
  Think of large-scale databases with millions of records!
Commercial world
  Addresses
    Dongwon Lee, 110 E. Foster Ave. #410, State College, PA, 16802
    LEE Dong, 110 East Foster Avenue Apartment 410, University Park, PA 16802-2343
  Products
    Honda Fit vs. Honda Jazz
    T-Fal vs. Tefal
    Apple iPod Nano 4GB vs. 4GB iPod nano 4GB
Examples courtesy of Dongwon Lee (Penn State University)
Author names and citations
Jeffrey D. Ullman
(Stanford University)
Some images courtesy of Dongwon Lee
(Penn State University)
Web pages
Note:
I did not author any of these web pages!
More on person names
  Highly ambiguous
    Only 90,000 different names for 100 million people (U.S. Census Bureau)
  Valid changes:
    Customs: Lee, Dongwon vs. Dongwon Lee vs. LEE Dongwon
    Marriage: Carol Dusseau vs. Carol Arpaci-Dusseau
    Misc.: Sean Engelson vs. Shlomo Argamon
Examples courtesy of Dongwon Lee (Penn State University)
Record linkage
  Input
    Two lists of records, A and B
  Output
    For each record a in A and for each record b in B, do a and b refer to the same entity?
  Note
    Entities do not come with unique identifiers
    To disambiguate (deduplicate) items in a single list L, we set A = B = L
Fellegi-Sunter model
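[Figure: record pairs ordered by sim(a, b), with true matches (*) and true non-matches (○). Pairs below a lower threshold are designated as definite non-matches, pairs above an upper threshold as definite matches, and pairs in between fall into the no-decision region (hold for human review). True matches designated as non-matches are false non-matches; true non-matches designated as matches are false matches.]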
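The three regions of the Fellegi-Sunter model can be sketched as a two-threshold rule on sim(a, b). This is only an illustration: the function name classify_pair and the threshold values lower and upper are assumptions made for this sketch, not values from the talk, and in practice the thresholds are chosen to trade off false matches against false non-matches.

```python
def classify_pair(similarity, lower=0.3, upper=0.8):
    """Three-way decision on a similarity score.

    The default thresholds are illustrative assumptions, not tuned values.
    """
    if lower > upper:
        raise ValueError("lower threshold must not exceed upper threshold")
    if similarity >= upper:
        return "match"       # designate as definite match
    if similarity <= lower:
        return "non-match"   # designate as definite non-match
    return "review"          # no-decision region: hold for human review
```

Widening the gap between the two thresholds sends more pairs to human review and lowers both kinds of error; narrowing it does the opposite.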
String matching
  String similarity
    Strings as ordered sequences
      Edit distance
      Jaro and Jaro-Winkler
      ([a], [b], [c]) ≠ ([c], [b], [a])
    Strings as unordered sets
      Jaccard similarity
      Cosine similarity
      {[a], [b], [c]} = {[c], [b], [a]}
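To make the contrast between the two views concrete, here is a minimal sketch of one measure from each family: Levenshtein edit distance (ordered sequence of characters) and Jaccard similarity (unordered set of tokens). The function names and the choice of lowercase whitespace tokens for Jaccard are assumptions made for this sketch.

```python
def edit_distance(s, t):
    """Levenshtein distance: treats strings as ordered character sequences."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(s, t):
    """Jaccard similarity: treats strings as unordered sets of word tokens."""
    a, b = set(s.lower().split()), set(t.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)
```

For example, jaccard("Dongwon Lee", "Lee Dongwon") is 1.0 because token order is ignored, whereas the edit distance between the same two strings is non-zero.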
  Abbreviation matching
    Usually pattern detection in texts
    e.g. “Almost Locked Sets (ALS)”
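As a rough illustration of pattern detection for abbreviations, the sketch below looks for a parenthesized all-capitals short form and checks it against the initials of the words just before it. The function name find_abbreviations and the initials-only rule are simplifying assumptions for this sketch; published algorithms handle many more cases.

```python
import re


def find_abbreviations(text):
    """Return (long form, short form) pairs for patterns like 'Long Form (LF)'.

    Simplifying assumption: the short form is all capitals and equals the
    initials of the words immediately preceding the parenthesis.
    """
    pairs = []
    for match in re.finditer(r"\(([A-Z]{2,})\)", text):
        short = match.group(1)
        words = re.findall(r"[A-Za-z]+", text[:match.start()])
        candidate = words[-len(short):]
        if len(candidate) == len(short) and all(
            word[0].upper() == letter for word, letter in zip(candidate, short)
        ):
            pairs.append((" ".join(candidate), short))
    return pairs
```

This rule finds the pair in "Almost Locked Sets (ALS)" but fails when the long form contains words that do not contribute a letter, such as "on" in a conference name.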
Classification or clustering
  Feature engineering + model selection
    Features
      String similarity, relationships (e.g. collaborators)
    Models
      Naïve Bayes, Support Vector Machine, K-means, Agglomerative Clustering, …
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
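The feature engineering step can be sketched as turning a pair of citation records into a small feature vector that a classifier or clustering algorithm then consumes. The record layout (a dict with "authors" and "title") and the two features chosen here are assumptions made for illustration, not the features used in any particular system.

```python
def pair_features(rec_a, rec_b):
    """Feature vector for a pair of citation records.

    Each record is assumed to be a dict with an 'authors' list and a 'title'
    string. The features are illustrative choices only.
    """
    authors_a, authors_b = set(rec_a["authors"]), set(rec_b["authors"])
    title_a = set(rec_a["title"].lower().split())
    title_b = set(rec_b["title"].lower().split())
    union = title_a | title_b
    return {
        # relationship feature: number of author names the records share
        "shared_authors": len(authors_a & authors_b),
        # string similarity feature: Jaccard overlap of title tokens
        "title_jaccard": len(title_a & title_b) / len(union) if union else 0.0,
    }
```

For the two citations above, the records share one author name ("Dongwon Lee") and have little title overlap; a model such as Naïve Bayes or an SVM would be trained on vectors like this for labelled pairs.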
Classification or clustering
  Graphical models
    Structure
      Nodes: Record fields or entire records
      Edges: Join fields with similar values, fields to records, etc.
    Characteristic
      Model global knowledge
      Propagate information around the graph until convergence
      Usually very time consuming
    Examples
      Conditional random field
      Dependency graph
      Generative probabilistic model
Social network analysis
  Social network
    Nodes: entities (e.g. author names)
    Edges: relationships (e.g. coauthored a paper)
[Figure: coauthorship network with nodes J. C. Latombe, T.-H. Chiang, D. Hsu, A. Dhanik, Y. Wang, L. Qiu, M.-Y. Kan, H. Cui, T.-S. Chua and Y. F. Tan]
Social network analysis
  Analysis
    Connected components
    Distance between nodes
    Node/edge centrality
    Cliques
    Bipartite subgraphs
    …
[Figure: the same coauthorship network with nodes J. C. Latombe, T.-H. Chiang, D. Hsu, A. Dhanik, Y. Wang, L. Qiu, M.-Y. Kan, H. Cui, T.-S. Chua and Y. F. Tan]
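Two of the analyses listed for a coauthorship network, connected components and distance between nodes, can be sketched with a plain breadth-first search. The edge-list input format and the function names are assumptions for this sketch, and the node labels in the usage note are placeholders rather than real coauthorship data.

```python
from collections import defaultdict, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    graph = defaultdict(set)
    for u, v in edges:
        graph[u].add(v)
        graph[v].add(u)
    return dict(graph)


def connected_components(edges):
    """Group nodes into connected components using breadth-first search."""
    graph = build_graph(edges)
    seen, components = set(), []
    for start in graph:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for neighbour in graph[node] - seen:
                seen.add(neighbour)
                queue.append(neighbour)
        components.append(component)
    return components


def shortest_distance(edges, source, target):
    """Number of edges on a shortest path, or None if no path exists."""
    graph = build_graph(edges)
    if source not in graph or target not in graph:
        return None
    seen, queue = {source}, deque([(source, 0)])
    while queue:
        node, hops = queue.popleft()
        if node == target:
            return hops
        for neighbour in graph[node] - seen:
            seen.add(neighbour)
            queue.append((neighbour, hops + 1))
    return None
```

With edges [("a1", "a2"), ("a2", "a3"), ("b1", "b2")], the nodes a1 and a3 lie in the same component at distance 2, while a1 and b1 are in different components, which is weak evidence that they are unrelated entities.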
Social network analysis
  Connected triple
  Maximum flow
  Random walk
  Clustering
[Figures: small example graphs illustrating these techniques, with nodes labelled x1, x2, x3, s and t]
Scalability Issues
  Pairwise comparisons
    Requires O(n²) time
    Major bottleneck

    Input: d1, d2, …, dn
    for i = 1 to n
      for j = (i + 1) to n
        compute sim(di, dj)

  Possible solutions
    Blocking techniques
    Avoiding pairwise comparisons altogether
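A blocking technique can be sketched as follows: records are first grouped by a cheap key, and sim is computed only for pairs inside the same group. The blocking key used here (first letter of the last token) is a hypothetical choice for illustration; real systems pick or learn keys suited to their data.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first letter of the last token, lowercased."""
    tokens = record.lower().split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(records, key=blocking_key):
    """Yield only the pairs of records that fall into the same block."""
    blocks = defaultdict(list)
    for record in records:
        blocks[key(record)].append(record)
    for block in blocks.values():
        yield from combinations(block, 2)
```

For five records split into blocks of sizes 2, 2 and 1, only 2 pairs are compared instead of all 10. The cost is that true matches placed in different blocks (for example "Lee, Dongwon" under this key) are never compared.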
Record linkage using search engine
  Previously…
    We assumed input data records contain sufficient information to perform linkage
  What if…
    There is insufficient or only noisy information?
    e.g. linking short forms to long forms
  Ask other people!
    Use web as collective knowledge of people
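Record linkage using search engine
[Figure: annotated search engine results page showing the number of results and a ranked list, where each result has a title, snippet, URL and web page]
Example queries: sudoku strategies; sudoku OR strategies; “sudoku strategies”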
Examples
  Counts
    Co-occurrence measure between count(q), count(q’) and count(q and q’)
  Snippets or web pages
    Cosine similarity using the tokens
    Counts of specific terms
      e.g. number of snippets for the query q containing the string q’
    Do natural language processing
  Hostnames from URLs
    Overlap between the hostnames of results of queries q and q’
    Inverse Host Frequency
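Two of these evidence sources can be sketched in a few lines: a Jaccard-style co-occurrence measure over result counts, and the overlap between the hostnames of two result lists. The function names are assumptions, the Jaccard form is just one possible co-occurrence measure, and obtaining the counts and URLs from a search engine is left out.

```python
from urllib.parse import urlparse


def count_jaccard(count_q, count_q2, count_both):
    """Jaccard-style co-occurrence from count(q), count(q') and count(q and q')."""
    union = count_q + count_q2 - count_both
    return count_both / union if union > 0 else 0.0


def host_overlap(urls_q, urls_q2):
    """Jaccard overlap between the hostname sets of two result lists."""
    hosts_q = {urlparse(url).hostname for url in urls_q}
    hosts_q2 = {urlparse(url).hostname for url in urls_q2}
    hosts_q.discard(None)   # drop URLs without a parsable hostname
    hosts_q2.discard(None)
    union = hosts_q | hosts_q2
    return len(hosts_q & hosts_q2) / len(union) if union else 0.0
```

Because reported result counts are estimates, a count-based measure is best treated as a noisy signal rather than an exact probability.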
Googled name linkage
  Suppose e and e’ refer to the same entity
  Then web pages of e and web pages of e’ are likely to share some representative data

  Query                          Results
  “Jeffrey D. Ullman”            384,000 pages
  “Jeffrey D. Ullman” + “aho”    174,000 pages
  “J. Ullman”                    124,000 pages
  “J. Ullman” + “aho”             41,000 pages
  “Shimon Ullman”                 27,300 pages
  “Shimon Ullman” + “aho”             66 pages
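The page counts above can be turned into a simple signal: the fraction of each name's pages that also mention the representative term "aho". This ratio is only one possible measure, chosen here for illustration; the function name cooccurrence_ratio is an assumption.

```python
# Page counts as reported on the slide: (count(name), count(name + "aho")).
PAGE_COUNTS = {
    "Jeffrey D. Ullman": (384_000, 174_000),
    "J. Ullman": (124_000, 41_000),
    "Shimon Ullman": (27_300, 66),
}


def cooccurrence_ratio(count_name, count_name_and_term):
    """Fraction of a name's pages that also contain the representative term."""
    return count_name_and_term / count_name if count_name else 0.0


ratios = {name: cooccurrence_ratio(*counts) for name, counts in PAGE_COUNTS.items()}
```

The ratios come out at roughly 0.45 for "Jeffrey D. Ullman", 0.33 for "J. Ullman" and 0.002 for "Shimon Ullman", which suggests that the first two names share representative data while the third does not.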
Query probing
  Googling and web page downloads are expensive in time!
  Consider
    Joint Conference on Digital Libraries
    European Conference on Digital Libraries
    Digital Libraries
  Query probing
    Use common n-gram “digital libraries” as query probe
    If we can obtain information on all three conferences, we save two queries
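Choosing a probe can be sketched as finding the word n-gram shared by the most names. The function names and the tie-breaking rule are assumptions for this sketch; whether a single probe actually returns useful results for every covered name depends on the search engine.

```python
from collections import Counter


def word_ngrams(text, n):
    """Set of lowercase word n-grams in a string."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def best_probe(names, n=2):
    """Return the n-gram shared by the most names and the names it covers."""
    counter = Counter()
    for name in names:
        counter.update(word_ngrams(name, n))
    if not counter:
        return None, []
    # Ties are broken alphabetically so the result is deterministic.
    probe, _ = max(counter.items(), key=lambda item: (item[1], item[0]))
    covered = [name for name in names if probe in word_ngrams(name, n)]
    return probe, covered
```

For the three venue names above, the selected bigram is "digital libraries" and it covers all three, so one query replaces three.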
Adaptive querying
  Methods
    Ms: stronger method but very slow (e.g. web page similarity)
    Mw: weaker method but fast (e.g. host overlap)
  Aim
    Accuracy close to Ms
    Significantly shorter running time than Ms
  Algorithm
    Execute Mw
    If heuristic suggests that Mw results are likely incorrect
      Execute Ms
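The algorithm can be sketched as a wrapper that runs the fast method first and falls back to the slow one only when a heuristic distrusts the result. The heuristic used here (a weak score falling between two thresholds) and all names are hypothetical; the talk does not specify the heuristic.

```python
def adaptive_link(pair, weak_method, strong_method, low=0.2, high=0.8):
    """Run Mw first; run Ms only when the Mw score looks unreliable.

    Hypothetical heuristic: a weak score strictly between `low` and `high`
    is treated as likely incorrect. Returns (score, method_used).
    """
    score = weak_method(pair)
    if low < score < high:
        return strong_method(pair), "strong"
    return score, "weak"
```

Here weak_method and strong_method are any callables that map a record pair to a similarity score, for example a host overlap function and a web page similarity function. The running time saved depends on how often the heuristic fires.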
Comment
  These techniques
    Blocking
    Query probing
    Adaptive querying
  Combine different methods to obtain the better aspects of each
Conclusion
  Comment
    This survey is brief and broad, and many aspects are still not covered
  History
    Record linkage became a research issue in the 1940s, possibly due to analysis of census data or medical records
  Research directions
    Graphical models utilize global knowledge, but how can they be made scalable for large datasets?
    Utilizing external knowledge in an effective and scalable manner
    Adaptive methods
Thank You
Selected Bibliography

General and surveys
Ivan P. Fellegi and Alan B. Sunter. A theory for record linkage. Journal of the
American Statistical Association, 64(328):1183–1210, December 1969.
William E. Winkler and Yves Thibaudeau. An application of the Fellegi-Sunter
Model of record linkage to the 1990 U.S. Decennial Census. Technical
Report RR91/09, U.S. Bureau of the Census, 1991.
Ahmed K. Elmagarmid, Panagiotis G. Ipeirotis, and Vassilios S. Verykios.
Duplicate record detection: A survey. IEEE Transactions on Knowledge and
Data Engineering (TKDE), 19(1):1–16, January 2007.
William E. Winkler. Overview of record linkage and current research
directions. Technical Report RRS2006/02, U.S. Bureau of the Census,
February 2006.
Mikhail Bilenko, Raymond J. Mooney, William W. Cohen, Pradeep
Ravikumar, and Stephen E. Fienberg. Adaptive name matching in
information integration. IEEE Intelligent Systems, 18(5):16–23,
September/October 2003.
Min-Yen Kan and Yee Fan Tan. Record Matching in Digital Library Metadata.
To appear in Communications of the ACM (CACM).
Selected Bibliography

String matching
Robert A. Wagner and Michael J. Fischer. The string-to-string correction problem. Journal of the Association for
Computing Machinery, 21(1):168–173, January 1974.
Saul B. Needleman and Christian D. Wunsch. A general method applicable to the search for similarities in the
amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, March 1970.
Temple F. Smith and Michael S. Waterman. Identification of common molecular subsequences. Journal of Molecular
Biology, 147(1):195–197, March 1981.
Andrés Marzal and Enrique Vidal. Computation of normalized edit distance and applications. IEEE Transactions on
Pattern Analysis and Machine Intelligence, 15(9):926–932, September 1993.
Alvaro E. Monge and Charles Elkan. The field matching problem: Algorithms and applications. In ACM SIGKDD
International Conference on Knowledge Discovery and Data Mining, pages 267–270, August 1996.
Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(3):311–321, March
2004.
Mikhail Bilenko and Raymond J. Mooney. Adaptive duplicate detection using learnable string similarity measures. In
ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 39–48, August 2003.
Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field For Discriminatively-Trained
Finite-State String Edit Distance. In Conference on Uncertainty in Artificial Intelligence (UAI), July 2005.
William W. Cohen, Pradeep Ravikumar, and Stephen E. Fienberg. A comparison of string distance metrics for
name-matching tasks. In Information Integration on the Web (IIWeb), pages 73–78, August 2003.
Ariel S. Schwartz and Marti A. Hearst. A simple algorithm for identifying abbreviation definitions in biomedical text. In
Pacific Symposium on Biocomputing (PSB), pages 451–462, January 2003.
Youngja Park and Roy J. Byrd. Hybrid text mining for finding abbreviations and their definitions. In Conference on
Empirical Methods in Natural Language Processing (EMNLP), pages 126–133, June 2001.
Jeffrey T. Chang, Hinrich Schütze, and Russ B. Altman. Creating an online dictionary of abbreviations from MEDLINE.
Journal of the American Medical Informatics Association, 9(6):612–620, November/December 2002.
Hiroko Ao and Toshihisa Takagi. ALICE: An algorithm to extract abbreviations from MEDLINE. Journal of the American
Medical Informatics Association, 12(5):576–586, September/October 2005.
Selected Bibliography

Direct classification or clustering, and blocking
Hui Han, Hongyuan Zha, and C. Lee Giles. A model-based K-means algorithm for name disambiguation. In Workshop
on Semantic Web Technologies for Searching and Retrieving Scientific Data, October 2003.
Hui Han, C. Lee Giles, Hongyuan Zha, Cheng Li, and Kostas Tsioutsiouliklis. Two supervised learning approaches for
name disambiguation in author citations. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 296–305,
June 2004.
Hui Han, Wei Xu, Hongyuan Zha, and C. Lee Giles. A hierarchical naive bayes mixture model for name disambiguation
in author citations. In ACM Symposium on Applied Computing (SAC), pages 1065–1069, March 2005.
Hui Han, Hongyuan Zha, and C. Lee Giles. Name disambiguation in author citations using a K-way spectral clustering
method. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 334–343, June 2005.
Dongwon Lee, Byung-Won On, Jaewoo Kang, and Sanghyun Park. Effective and scalable solutions for mixed and split
citation problems in digital libraries. In ACM SIGMOD Workshop on Information Quality in Information Systems (IQIS),
pages 69–76, June 2005.
Byung-Won On, Dongwon Lee, Jaewoo Kang, and Prasenjit Mitra. Comparative study of name disambiguation problem
using a scalable blocking-based framework. In ACM/IEEE Joint Conference on Digital Libraries (JCDL), pages 344–
353, June 2005.
Andrew McCallum, Kamal Nigam, and Lyle Ungar. Efficient clustering of high-dimensional data sets with application to
reference matching. In ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 169–
178, August 2000.
Matthew Michelson and Craig A. Knoblock. Learning blocking schemes for record linkage. In National Conference on
Artificial Intelligence (AAAI), July 2006.
Mikhail Bilenko, Beena Kamath, and Raymond J. Mooney. Adaptive Blocking: Learning to Scale Up Record Linkage
and Clustering. In IEEE International Conference on Data Mining (ICDM), December 2006.
Selected Bibliography

Graphical models
Jie Wei. Markov edit distance. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 26(3):311–321, March 2004.
John Lafferty, Andrew McCallum, and Fernando Pereira. Conditional random fields:
Probabilistic models for segmenting and labeling sequence data. In International
Conference on Machine Learning (ICML), pages 282–289, June/July 2001.
Andrew McCallum and Ben Wellner. Object consolidation by graph partitioning with a
conditionally-trained distance metric. In ACM SIGKDD Workshop on Data Cleaning,
Record Linkage, and Object Consolidation, pages 19–24, August 2003.
Ben Wellner, Andrew McCallum, Fuchun Peng, and Michael Hay. An integrated,
conditional model of information extraction and coreference with application to citation
matching. In Conference on Uncertainty in Artificial Intelligence (UAI), pages 593–601,
July 2004.
Andrew McCallum, Kedar Bellare, and Fernando Pereira. A Conditional Random Field
For Discriminatively-Trained Finite-State String Edit Distance. In Conference on
Uncertainty in Artificial Intelligence (UAI), July 2005.
Xin Dong, Alon Halevy, and Jayant Madhavan. Reference reconciliation in complex
information spaces. In ACM SIGMOD International Conference on Management of
Data, pages 85–96, June 2005.
Indrajit Bhattacharya and Lise Getoor. A latent dirichlet model for unsupervised entity
resolution. In SIAM International Conference on Data Mining, pages 47–58, April 2006.
Selected Bibliography

Social network analysis
H. A. Kautz, B. Selman, and M. A. Shah. The hidden web. AI Magazine, 18(2):27–36, 1997.
P. Mutschke. Mining networks and central entities in digital libraries. A graph theoretic approach applied to co-author
networks. In Intelligent Data Analysis (IDA), pages 155–166, August 2003.
M. E. J. Newman. Who is the best connected scientist? A study of scientific coauthorship networks. In Complex
Networks, pages 337–370, February 2004.
E. Otte and R. Rousseau. Social network analysis: a powerful strategy, also for the information sciences. Journal of
Information Science, 28(6), December 2002.
T. Krichel and N. Bakkalbasi. A social network analysis of research collaboration in the economics community. In
International Workshop on Webometrics, Informetrics and Scientometrics & Seventh COLLNET Meeting, May 2006.
R. Rousseau and M. Thelwall. Escher staircases on the world wide web. First Monday, 9(6), June 2004.
D. G. Feitelson. On identifying name equivalences in digital libraries. Information Research, 9(4), October 2004.
R. Bekkerman and A. McCallum. Disambiguating web appearances of people in a social network. In International
conference on World Wide Web (WWW), pages 463–470, May 2005.
R. Holzer, B. Malin, and L. Sweeney. Email alias detection using social network analysis. In Workshop on Link
Discovery: Issues, Approaches and Applications (LinkKDD), August 2005.
B. Malin, E. Airoldi, and K. M. Carley. A network analysis model for disambiguation of names in lists. Computational
and Mathematical Organization Theory, 11(2):119–139, July 2005.
G. Flake, S. Lawrence, and C. L. Giles. Efficient identification of web communities. In ACM SIGKDD International
Conference on Knowledge Discovery and Data Mining, pages 150–160, August 2000.
P. K. Reddy and M. Kitsuregawa. An approach to build a cyber-community hierarchy. In SIAM ICDM Workshop on Web
Analysis, April 2002.
Patrick Reuther. Personal name matching: New test collections and a social network based approach. Technical
Report Mathematics/Computer Science 06-01, University of Trier, March 2006.
Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida,
and Mitsuru Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International
conference on World Wide Web (WWW), pages 397–406, May 2006.
Selected Bibliography

Web-based methods
Jamie P. Callan, Margie E. Connell, and Aiqun Du. Automatic discovery of language models for text databases. In ACM SIGMOD
International Conference on Management of Data, pages 479–490, June 1999.
Jamie P. Callan and Margie E. Connell. Query-based sampling of text databases. ACM Transactions on Information Systems
(TOIS), 19(2):97–130, April 2001.
Panagiotis G. Ipeirotis and Luis Gravano. Distributed search over the hidden-web: Hierarchical database sampling and selection. In
International Conference on Very Large Databases (VLDB), pages 394–405, August 2002.
Luis Gravano, Panagiotis G. Ipeirotis, and Mehran Sahami. QProber: A system for automatic classification of hidden-web
databases. ACM Transactions on Information Systems (TOIS), 21(1):1–41, January 2003.
Aron Culotta, Ron Bekkerman, and Andrew McCallum. Extracting social networks and contact information from email and the web.
In Conference on Email and Anti-Spam (CEAS), July 2004.
Philipp Cimiano, Siegfried Handschuh, and Steffen Staab. Towards the self-annotating web. In International conference on World
Wide Web (WWW), pages 462–471, May 2004.
Philipp Cimiano, Günter Ladwig, and Steffen Staab. Gimme the context: Context-driven automatic semantic annotation with
C-PANKOW. In International conference on World Wide Web (WWW), pages 332–341, May 2005.
Yutaka Matsuo, Junichiro Mori, Masahiro Hamasaki, Keisuke Ishida, Takuichi Nishimura, Hideaki Takeda, Kôiti Hasida, and Mitsuru
Ishizuka. POLYPHONET: an advanced social network extraction system from the web. In International conference on World Wide
Web (WWW), pages 397–406, May 2006.
Yee Fan Tan, Min-Yen Kan, and Dongwon Lee. Search engine driven author disambiguation. In ACM/IEEE Joint Conference on
Digital Libraries (JCDL), June 2006.
Ergin Elmacioglu, Min-Yen Kan, Dongwon Lee, and Yi Zhang. Googled name linkage. 2007.
Yee Fan Tan, Ergin Elmacioglu, Min-Yen Kan, and Dongwon Lee. Record Linkage of Short Forms to Long Forms: A Case Study of
Publication Venues. 2007.
Min-Yen Kan. Web page classification without the web page. In International conference on World Wide Web (WWW), pages 262–
263, May 2004.
Min-Yen Kan and Hoang Oanh Nguyen Thi. Fast webpage classification using url features. In International Conference on
Information and Knowledge Management (CIKM), pages 325–326, October/November 2005.
Panagiotis G. Ipeirotis, Eugene Agichtein, Pranay Jain, and Luis Gravano. To search or to crawl? Towards a query optimizer for
text-centric tasks. In ACM SIGMOD International Conference on Management of Data, pages 265–276, June 2006.
