Disambiguation Problems
in Digital Libraries
Tan Yee Fan
2006 August 11
WING Group Meeting
Introduction
Bibliographic digital libraries
DBLP, Citeseer, ACM Portal, …
Metadata records
Authors, title, venue, year, …
Inconsistencies and errors
Typographical errors
Abbreviation
Different entities sharing same name
…
Problem formulation
General disambiguation problem
Given a list of data items X
Find a function δ : X × X → {0, 1} such that
δ(x1, x2) = 1 if x1 and x2 match
δ(x1, x2) = 0 otherwise
Matching relation is not necessarily transitive
e.g. δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0
If transitive, it is clustering/classification
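A minimal sketch of why such a matching relation need not be transitive. The rule used here (two strings match when they share at least one character) is an assumption chosen only to reproduce the “ab”, “bc”, “cd” example; it is not a definition from the talk, and the function names are hypothetical.

```python
def delta(x1: str, x2: str) -> int:
    """Hypothetical matcher: 1 if the two strings share a character, else 0."""
    return 1 if set(x1) & set(x2) else 0


def is_transitive(items, match) -> bool:
    """True if match(a, b) = match(b, c) = 1 always implies match(a, c) = 1."""
    return all(
        match(a, c) == 1
        for a in items
        for b in items
        for c in items
        if match(a, b) == 1 and match(b, c) == 1
    )


items = ["ab", "bc", "cd"]
print(delta("ab", "bc"), delta("bc", "cd"), delta("ab", "cd"))
print(is_transitive(items, delta))
```

When the check fails, as it does for these three items, the matches cannot be read directly as a partition of X into clusters.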
Related fields
String similarity
Edit distance, Jaro-Winkler, …
Abbreviation matching
Mostly deals with biomedical texts in predefined formats
Data cleaning
High-level architectures by database people
Social network analysis
Collaboration graphs of authors
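As an illustration of the string similarity measures named on this slide, here is a small sketch of Levenshtein edit distance using the standard dynamic-programming recurrence. The normalisation in `similarity` is an assumption added for illustration; Jaro-Winkler is not shown.

```python
def edit_distance(s: str, t: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions and substitutions turning s into t."""
    prev = list(range(len(t) + 1))  # distances from "" to each prefix of t
    for i, cs in enumerate(s, start=1):
        curr = [i]  # distance from s[:i] to ""
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(s: str, t: str) -> float:
    """Assumed normalisation: 1.0 for identical strings, 0.0 for maximal distance."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(edit_distance("Citeseer", "CiteSeer"))
print(similarity("Dongwon Lee", "Dongwon Le"))
```

A measure like this handles typographical errors reasonably well but says nothing about abbreviations or about different people sharing one name.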
Citation matching, author name disambiguation
Can be cast as classification/clustering
Usual information source
Coauthor information, titles and venues
i.e. within the records themselves (internal)
Models
Naïve Bayes, K-means, SVM, vector space model, graphical models, …
Some apply methods to reduce the number of comparisons required
Resources
Internal resources
May contain insufficient information
Information may be difficult to extract
External resources
Web resources, ontologies
Contain additional freely available information
Objective
Combine internal and external resources
Mixed citation problem
Given an ambiguous name X (belonging to k different authors)
Given a list of citations C containing X
Which citations in C belong to which author?
Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
Search engine results
For each citation c in C
Query search engine with title of c to obtain relevant URLs
Represent c by a feature vector of relevant URLs
Each URL weighted by its inverse host frequency
Cosine similarity between feature vectors
Perform clustering on C to derive k clusters
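The feature construction above can be sketched as follows. Several things here are assumptions rather than details from the talk: the search results are hard-coded instead of fetched from a search engine, the URLs are made up, and inverse host frequency is defined by analogy with inverse document frequency as log(1 + N / hf), where hf is the number of citations whose results include the URL's host. The clustering step is not shown.

```python
import math
from collections import Counter
from urllib.parse import urlparse

# Hypothetical search results: citation id -> URLs returned for its title.
results = {
    "c1": ["http://a.example.edu/~lee/pubs.html", "http://dl.example.org/p/1"],
    "c2": ["http://a.example.edu/~lee/pubs.html", "http://a.example.edu/~lee/cv.html",
           "http://dl.example.org/p/2"],
    "c3": ["http://b.example.com/staff/lee", "http://dl.example.org/p/3"],
}


def host(url: str) -> str:
    return urlparse(url).netloc


def host_frequency(results: dict) -> Counter:
    """Number of citations whose result list contains each host."""
    hf = Counter()
    for urls in results.values():
        hf.update({host(u) for u in urls})
    return hf


def ihf_vector(urls, hf: Counter, n: int) -> dict:
    """Sparse feature vector: URL -> assumed IHF weight of its host."""
    return {u: math.log(1 + n / hf[host(u)]) for u in set(urls)}


def cosine(v: dict, w: dict) -> float:
    dot = sum(v[k] * w[k] for k in v.keys() & w.keys())
    norm = (math.sqrt(sum(x * x for x in v.values()))
            * math.sqrt(sum(x * x for x in w.values())))
    return dot / norm if norm else 0.0


hf = host_frequency(results)
vectors = {c: ihf_vector(urls, hf, len(results)) for c, urls in results.items()}
print(cosine(vectors["c1"], vectors["c2"]))  # share one URL
print(cosine(vectors["c1"], vectors["c3"]))  # share no URL
```

Under this weighting, a URL on a host that shows up for almost every citation (such as a large digital library) contributes less than a URL on a rarely seen host such as a personal home page.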
External coauthor network
Coauthor network from DBLP metadata
Each node represents a name
Two nodes are connected if they are coauthors in some DBLP citation
Delete the node representing X and its edges
Similarity between two author names computed as an inverse of their distance
Similarity between two citations is the pairwise sum of their author similarities
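A minimal sketch of the coauthor-network similarity, under stated assumptions: the records are toy data rather than real DBLP metadata, distance is taken to be shortest-path length, "inverse of their distance" is read as 1/d (with 1.0 for identical names and 0.0 for unreachable ones), and all function names are hypothetical.

```python
from collections import deque
from itertools import combinations

# Toy stand-in for DBLP metadata: each entry is one citation's author list.
dblp_records = [["X", "A", "B"], ["A", "C"], ["B", "D"], ["C", "D"], ["E", "F"]]


def build_network(records, ambiguous):
    """Coauthor graph with the ambiguous name's node and edges left out."""
    graph = {}
    for authors in records:
        names = [a for a in authors if a != ambiguous]
        for a in names:
            graph.setdefault(a, set())
        for a, b in combinations(names, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Shortest-path length by breadth-first search, or None if unreachable."""
    if src not in graph or dst not in graph:
        return None
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nxt in graph[node]:
            if nxt == dst:
                return d + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, d + 1))
    return None


def name_similarity(graph, a, b):
    if a == b:
        return 1.0  # assumption: identical names are maximally similar
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / d


def citation_similarity(graph, authors1, authors2, ambiguous):
    """Pairwise sum of author similarities, ignoring the ambiguous name."""
    return sum(name_similarity(graph, a, b)
               for a in authors1 if a != ambiguous
               for b in authors2 if b != ambiguous)


graph = build_network(dblp_records, "X")
print(citation_similarity(graph, ["X", "A"], ["X", "D"], "X"))
print(citation_similarity(graph, ["X", "A"], ["X", "E"], "X"))
```

Leaving X out of the graph matters: otherwise every citation in C would be trivially connected through X itself.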
Results
[Bar chart comparing IHF (IP address, single link), Coauthor linkage (complete link) and Combined (hybrid); data labels 0.850, 0.844 and 0.836; axis range 0.83 to 0.86]
Venue name disambiguation
To determine e.g. “TREC” = “Text Retrieval Conference”
Problems
Not using other parts of the citation records
Abbreviations are extremely common
Venues change name over time
Experiments using Google in progress
Using URL features
Using Google snippets