
Disambiguation Problems in Digital Libraries
Tan Yee Fan
2006 August 11
WING Group Meeting
Introduction

Bibliographic digital libraries


DBLP, Citeseer, ACM Portal, …
Metadata records


Authors, title, venue, year, …
Inconsistencies and errors




Typographical errors
Abbreviations
Different entities sharing the same name
…
Problem formulation

General disambiguation problem


Given a list of data items X
Find a function δ : X × X → {0, 1} such that
δ(x1, x2) = 1 if x1 and x2 match
δ(x1, x2) = 0 otherwise

Matching relation is not necessarily transitive
δ(“ab”, “bc”) = 1 and δ(“bc”, “cd”) = 1, but δ(“ab”, “cd”) = 0
If transitive, the problem becomes clustering/classification
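
The non-transitivity example can be reproduced with a toy matching function. The sketch below is only an illustration: the rule "two strings match if they share a character" is an assumption chosen to fit the “ab”/“bc”/“cd” example, and `transitive_clusters` is a hypothetical helper showing how forcing transitivity turns matching into clustering.

```python
def delta(x1, x2):
    """Toy matching function (assumed rule): 1 if the strings share a character."""
    return 1 if set(x1) & set(x2) else 0


def transitive_clusters(items):
    """Group items by the transitive closure of delta, using union-find."""
    parent = {x: x for x in items}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, a in enumerate(items):
        for b in items[i + 1:]:
            if delta(a, b):
                parent[find(a)] = find(b)

    clusters = {}
    for x in items:
        clusters.setdefault(find(x), []).append(x)
    return sorted(clusters.values())


# delta itself is not transitive ...
print(delta("ab", "bc"), delta("bc", "cd"), delta("ab", "cd"))
# ... but its transitive closure puts "ab" and "cd" in the same cluster.
print(transitive_clusters(["ab", "bc", "cd", "xy"]))
```

Under the closure, “ab” and “cd” end up together even though δ(“ab”, “cd”) = 0, which is the difference between pairwise matching and clustering.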
Related fields

String similarity
Edit distance, Jaro-Winkler, …

Abbreviation matching
Mostly deals with biomedical texts and predefined formats

Data cleaning
High-level architectures by the database community

Social network analysis
Collaboration graphs of authors
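
As a concrete instance of the string similarity measures listed on this slide, here is a minimal sketch of edit (Levenshtein) distance. The normalisation in `similarity` is an assumption made for illustration; the slides do not say how distances are turned into similarities.

```python
def edit_distance(s, t):
    """Levenshtein distance: minimum number of insertions, deletions
    and substitutions needed to turn s into t."""
    prev = list(range(len(t) + 1))  # distances from the empty prefix of s
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(s, t):
    """Assumed normalisation: 1 - distance / length of the longer string."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - edit_distance(s, t) / longest


print(edit_distance("Citeseer", "CiteSeer"))  # one substitution
print(similarity("Citeseer", "CiteSeer"))
```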
Citation matching, author name disambiguation

Can be cast as classification/clustering

Usual information source
Coauthor information, titles and venues
i.e. within the records themselves (internal)

Models
Naïve Bayes, K-means, SVM, vector space model, graphical models, …
Some apply methods to reduce number of comparisons required
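
The slide does not name the methods used to reduce the number of comparisons. One common technique of this kind is blocking: only records that share a cheap key are compared. The sketch below assumes a hypothetical key (last name plus first initial), and the name variants "D. Lee" and "J. Park" are made up for illustration.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(name):
    """Hypothetical blocking key: (last name, first initial), lower-cased."""
    parts = name.lower().split()
    return (parts[-1], parts[0][0])


def candidate_pairs(names):
    """Yield only the pairs of names that fall into the same block."""
    blocks = defaultdict(list)
    for name in names:
        blocks[blocking_key(name)].append(name)
    for block in blocks.values():
        yield from combinations(block, 2)


names = ["Dongwon Lee", "D. Lee", "Jinsoo Park", "J. Park", "Sudha Ram"]
pairs = list(candidate_pairs(names))
print(pairs)       # pairs that still need a full comparison
print(len(pairs))  # compared with 10 pairs without blocking
```

A coarser key keeps more true matches but prunes fewer pairs, so the choice of key trades recall against cost.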
Resources

Internal resources
May contain insufficient information
Information may be difficult to extract

External resources
Web resources, ontologies
Contain additional freely available information

Objective
Combine internal and external resources
Mixed citation problem



Given an ambiguous name X (belonging to k different authors)
Given a list of citations C containing X
Which citations in C belong to which author?

Yoojin Hong, Byung-Won On and Dongwon Lee. System Support for Name Authority Control Problem in Digital Libraries: OpenDBLP Approach. ECDL 2004.
Sudha Ram, Jinsoo Park and Dongwon Lee. Digital Libraries for the Next Millennium: Challenges and Research Directions. Information Systems Frontiers 1999.
Search engine results

For each citation c in C
Query search engine with title of c to obtain relevant URLs
Represent c by a feature vector of relevant URLs
Each URL weighted by its inverse host frequency

Cosine similarity between feature vectors
Perform clustering on C to derive k clusters
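
The pipeline on this slide can be sketched as follows. Several details are assumptions, since the slides do not define them: inverse host frequency is taken to be log(N / h) by analogy with inverse document frequency, hosts are identified by hostname, the clustering is a simple agglomerative procedure, and the example URLs are invented. The search engine query itself is left out; `results` stands in for its output.

```python
import math
from collections import Counter
from urllib.parse import urlparse


def host_of(url):
    return urlparse(url).netloc.lower()


def ihf_weights(url_lists):
    """Assumed IHF: log(N / h), where h is the number of citations
    whose result list contains the host."""
    n = len(url_lists)
    host_count = Counter()
    for urls in url_lists:
        host_count.update({host_of(u) for u in urls})
    return {host: math.log(n / h) for host, h in host_count.items()}


def feature_vector(urls, ihf):
    """Sparse vector: one dimension per URL, weighted by the IHF of its host."""
    return {u: ihf[host_of(u)] for u in set(urls)}


def cosine(v1, v2):
    dot = sum(w * v2[u] for u, w in v1.items() if u in v2)
    n1 = math.sqrt(sum(w * w for w in v1.values()))
    n2 = math.sqrt(sum(w * w for w in v2.values()))
    return dot / (n1 * n2) if n1 and n2 else 0.0


def agglomerative(vectors, k, linkage=max):
    """Merge the most similar clusters until k remain.
    linkage=max is single link, linkage=min is complete link."""
    clusters = [[i] for i in range(len(vectors))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                score = linkage(cosine(vectors[i], vectors[j])
                                for i in clusters[a] for j in clusters[b])
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters


# Invented search results for three citations (one URL list per citation).
results = [
    ["http://a.example/p1", "http://common.example/x"],
    ["http://a.example/p1", "http://common.example/y"],
    ["http://b.example/q", "http://common.example/z"],
]
ihf = ihf_weights(results)
vectors = [feature_vector(urls, ihf) for urls in results]
print(agglomerative(vectors, k=2))
```

With this weighting, a host that appears for every citation gets weight zero, so ubiquitous sites do not make unrelated citations look similar.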
External coauthor network

Coauthor network from DBLP metadata
Each node represents a name
Two names are connected if they are coauthors in some DBLP citation

Delete the node representing X and its edges
Similarity between two author names computed as an inverse of their distance
Similarity between two citations is the pairwise sum of their author similarities
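
A minimal sketch of these steps, under stated assumptions: distance is the shortest-path length in the coauthor graph, "an inverse of their distance" is taken to be 1 / (1 + d) so that identical names get similarity 1, unreachable names get similarity 0, and the author names "X", "A", "B", "C", "D" are placeholders.

```python
from collections import deque
from itertools import combinations


def build_graph(citations, ambiguous_name):
    """Coauthor graph as adjacency sets, with the ambiguous name's node removed."""
    graph = {}
    for authors in citations:
        names = [a for a in authors if a != ambiguous_name]
        for a in names:
            graph.setdefault(a, set())
        for a, b in combinations(names, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def distance(graph, src, dst):
    """Shortest-path length by breadth-first search; None if unreachable."""
    if src == dst:
        return 0
    if src not in graph or dst not in graph:
        return None
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == dst:
                return d + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, d + 1))
    return None


def name_similarity(graph, a, b):
    """Assumed form of the inverse: 1 / (1 + distance), 0 if unreachable."""
    d = distance(graph, a, b)
    return 0.0 if d is None else 1.0 / (1 + d)


def citation_similarity(graph, authors1, authors2, ambiguous_name):
    """Sum of name similarities over all cross pairs of coauthors."""
    c1 = [a for a in authors1 if a != ambiguous_name]
    c2 = [a for a in authors2 if a != ambiguous_name]
    return sum(name_similarity(graph, a, b) for a in c1 for b in c2)


# Placeholder citations standing in for DBLP metadata.
dblp = [["X", "A", "B"], ["B", "C"], ["X", "D"]]
graph = build_graph(dblp, "X")
print(citation_similarity(graph, ["X", "A"], ["X", "C"], "X"))
print(citation_similarity(graph, ["X", "A"], ["X", "D"], "X"))
```

Removing X's node matters: otherwise every pair of X's coauthors would sit at distance 2 through X, regardless of which real author they worked with.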
Results
[Chart: axis range 0.83 to 0.86; data labels 0.850, 0.844, 0.836]
IHF (IP address, single link)
Coauthor linkage (complete link)
Combined (hybrid)
Venue name disambiguation

To determine e.g. “TREC” = “Text Retrieval Conference”

Problems
Not using other parts of the citation records
Abbreviations are extremely common
Venues change name over time

Experiments using Google in progress
Using URL features
Using Google snippets
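
To illustrate why abbreviations are hard to handle with string tests alone, here is a deliberately naive check. It is not the method from the experiments above, which use search engine features; `is_subsequence_abbrev` is a hypothetical helper. Note that the initials of “Text Retrieval Conference” are “TRC”, not “TREC”, so an initials-only test would already miss the slide's own example.

```python
def initials(long_form):
    """First letter of each word, upper-cased."""
    return "".join(word[0] for word in long_form.split()).upper()


def is_subsequence_abbrev(short, long_form):
    """Naive test: the letters of `short` appear in order in `long_form`."""
    remaining = iter(long_form.lower())
    # `ch in remaining` consumes the iterator up to the first match,
    # so each letter must be found after the previous one.
    return all(ch in remaining for ch in short.lower())


print(initials("Text Retrieval Conference"))
print(is_subsequence_abbrev("TREC", "Text Retrieval Conference"))
print(is_subsequence_abbrev("SIGIR", "Text Retrieval Conference"))
```

A subsequence test like this accepts many wrong expansions and knows nothing about venues that were renamed, which is one reason to look at evidence beyond the strings themselves.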