Citation duplicates

Citation matching
Vu Tinh Ky
Citation duplicates
Different representations may appear in
Author names
Title of the publication
Year of publication
Conference name
Different representations may be due to
Typographical errors
Misspelled words
Individual preferences for the citation format
Motivation for resolving citation duplicates
Improve the performance of information retrieval
Attribute credit accurately to the authors of publications
Example of a citation duplicate
Citation 1 : Hang Cui, Min-Yen Kan and Tat-Seng Chua (2004) Unsupervised Learning of Soft Patterns for Generating
Definitions from Online News. In Proceedings of the 13th International World Wide Web Conference (WWW2004), May
2004. New York, New York, USA.
Citation 2 : Cui Hang, Min-Yen Kan and Tat-Seng Chua. Unsupervised learning of soft patterns for generating definitions
from online news. 13th Int’l World Wide Web conference (WWW'04). May. New York, USA. 90-99.
Approach
Focus on citation de-duplication on web pages
Three-step clustering approach: the first two steps are coarse-grained clustering, the last step is fine-grained de-duplication.
Step 1. Soft-cluster the sets of citations into blocks (all citations on one web page are considered one set)
Use co-author names as the clustering criterion
All authors who have co-authored a publication are put into one block.
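Step 1 can be sketched roughly as follows. This is only an illustration, not the author's implementation: it assumes each web page has already been parsed into a list of author lists, links co-authors with a union-find structure, and assigns a page to every block that contains one of its authors (so a page may land in more than one block). The names `normalize_author`, `UnionFind` and `block_pages_by_coauthors`, and the token-sorting normalization, are hypothetical.

```python
from collections import defaultdict


def normalize_author(name):
    # Assumed normalization: lowercase, drop periods, sort the name tokens,
    # so that "Hang Cui" and "Cui Hang" map to the same key.
    return " ".join(sorted(name.lower().replace(".", "").split()))


class UnionFind:
    """Minimal union-find used to link authors who share a publication."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def block_pages_by_coauthors(pages):
    """pages: dict page_id -> list of citations, each a list of author names.

    Returns a dict mapping a block representative to the set of page ids
    whose citations mention an author in that block.
    """
    links = UnionFind()
    for citations in pages.values():
        for authors in citations:
            keys = [normalize_author(a) for a in authors]
            for key in keys:
                links.find(key)  # register every author, even solo ones
            for key in keys[1:]:
                links.union(keys[0], key)

    blocks = defaultdict(set)
    for page_id, citations in pages.items():
        for authors in citations:
            for author in authors:
                blocks[links.find(normalize_author(author))].add(page_id)
    return dict(blocks)
```

In this sketch, a page citing "Hang Cui, Min-Yen Kan, Tat-Seng Chua" and another citing "Cui Hang, Min-Yen Kan" end up in the same block, while a page with unrelated authors forms its own block.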
Step 2. In each block, cluster the individual citations into sub-blocks
Use the content of the web page as the clustering criterion
Compute the cosine similarity between the texts of the web pages.
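The cosine similarity used in Step 2 could look like the sketch below. It assumes plain term-frequency vectors over lowercase alphanumeric tokens; the slides do not say how the text is tokenized or weighted, nor how sub-blocks are formed from the similarities. The greedy grouping in `group_pages` and its `threshold` of 0.5 are therefore purely illustrative assumptions.

```python
import math
import re
from collections import Counter


def term_vector(text):
    # Assumed tokenization: lowercase runs of letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the term-frequency vectors of two texts."""
    vec_a, vec_b = term_vector(text_a), term_vector(text_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text is treated as similar to nothing
    return dot / (norm_a * norm_b)


def group_pages(page_texts, threshold=0.5):
    """Greedily place each page in the first sub-block whose first page is
    at least `threshold` similar; otherwise start a new sub-block."""
    sub_blocks = []
    for page_id, text in page_texts.items():
        for group in sub_blocks:
            if cosine_similarity(text, page_texts[group[0]]) >= threshold:
                group.append(page_id)
                break
        else:
            sub_blocks.append([page_id])
    return sub_blocks
```

Citations would then inherit the sub-block of the page they appear on. A real system would likely tune the threshold and might use TF-IDF weighting instead of raw counts.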
Step 3. Fine-grained de-duplication within each sub-block
Use a tuned string-edit distance as the criterion for resolving ambiguity
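A minimal sketch of Step 3, assuming "tuned" means adjustable operation costs and a decision threshold. The slides do not give the actual tuning, so `sub_cost`, `indel_cost` and the `threshold` of 0.2 are placeholder values, and the function names are hypothetical.

```python
def edit_distance(a, b, sub_cost=1, indel_cost=1):
    """Levenshtein-style distance with configurable operation costs."""
    prev = [j * indel_cost for j in range(len(b) + 1)]
    for i, char_a in enumerate(a, start=1):
        curr = [i * indel_cost]
        for j, char_b in enumerate(b, start=1):
            curr.append(min(
                prev[j] + indel_cost,        # delete char_a
                curr[j - 1] + indel_cost,    # insert char_b
                prev[j - 1] + (0 if char_a == char_b else sub_cost),
            ))
        prev = curr
    return prev[-1]


def normalized_distance(a, b):
    # Case-insensitive distance divided by the longer string's length.
    # With the default unit costs the result lies between 0 and 1.
    a, b = a.lower(), b.lower()
    longest = max(len(a), len(b))
    return edit_distance(a, b) / longest if longest else 0.0


def is_duplicate(a, b, threshold=0.2):
    """Treat two citation strings as duplicates if they are close enough."""
    return normalized_distance(a, b) <= threshold
```

Applied to the two titles in the example citation pair, which differ only in capitalization, the normalized distance is zero, so they would be flagged as duplicates. Comparing whole citation strings instead would need a looser threshold or field-by-field comparison.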