Author Name Disambiguation


Author Name Disambiguation for
Citations Using Topic and Web
Correlation
Prior work
• Supervised classification approaches:
Model all authors’ patterns from a set of training
data.
• Unsupervised classification approaches:
Ambiguous citations are clustered into groups of
distinct authors by measuring the similarities
between the attributes in the citations.
Proposed Approach
• Topic Correlation
• Web Correlation
• Pair-Wise Grouping Algorithm
Topic Correlation
• Build a topic association network
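1.Use the Apriori algorithm to build a directed graph whose
edge weights are confidence values (the result is a hypergraph).
2.Use a k-way hypergraph partitioning algorithm to split the
hypergraph into clusters.
3.These clusters are called the topic association network; the
correlation strength between research topics is the distance
between the citations in this network.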
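The first step can be illustrated with a minimal sketch. It computes only the confidence of single-topic rules a -> b (a simplification: full Apriori also mines larger itemsets, which is what produces hyperedges), and the function name, the `min_support` parameter, and the sample topics are hypothetical.

```python
from itertools import permutations


def rule_confidences(transactions, min_support=2):
    """Return {(a, b): confidence} for single-topic rules a -> b.

    confidence(a -> b) = count(a and b together) / count(a).
    Pairs seen together fewer than min_support times are dropped.
    """
    item_count = {}
    pair_count = {}
    for topics in transactions:
        unique = set(topics)
        for t in unique:
            item_count[t] = item_count.get(t, 0) + 1
        # Ordered pairs, because a -> b and b -> a are different rules.
        for a, b in permutations(sorted(unique), 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1
    return {
        (a, b): n / item_count[a]
        for (a, b), n in pair_count.items()
        if n >= min_support
    }


# Hypothetical topic sets, one per citation.
citations = [
    {"databases", "data mining"},
    {"databases", "data mining"},
    {"databases", "information retrieval"},
    {"machine learning", "data mining"},
]
edges = rule_confidences(citations)
```

Each key in `edges` is a directed edge and each value is its weight. Partitioning the resulting graph is not shown here.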
Web Correlation
• Use each title to query a search engine.
• Filter the URLs of several digital libraries.
• If two citations appear in the same URL, we
use them as an instance of Web correlation.
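A minimal sketch of these steps, assuming the search results have already been retrieved (no real search engine is called) and assuming that "filter" means digital-library URLs are excluded, since such pages list many authors. The host list and the citation ids are hypothetical.

```python
from itertools import combinations
from urllib.parse import urlparse

# Hypothetical list of digital-library hosts to exclude.
EXCLUDED_HOSTS = {"dblp.org", "dl.acm.org"}


def keep_url(url, excluded_hosts=EXCLUDED_HOSTS):
    """True if the URL is not hosted by an excluded digital library."""
    return urlparse(url).hostname not in excluded_hosts


def web_correlated_pairs(search_results, excluded_hosts=EXCLUDED_HOSTS):
    """Return pairs of citation ids that share at least one kept URL.

    search_results maps a citation id to the URLs returned for its title.
    """
    kept = {
        cid: {u for u in urls if keep_url(u, excluded_hosts)}
        for cid, urls in search_results.items()
    }
    return {
        (a, b)
        for a, b in combinations(sorted(kept), 2)
        if kept[a] & kept[b]
    }


# Hypothetical results: c1 and c2 appear on the same personal page.
search_results = {
    "c1": ["https://example.org/~alice/pubs.html", "https://dblp.org/pid/00/1"],
    "c2": ["https://example.org/~alice/pubs.html"],
    "c3": ["https://dblp.org/pid/00/1"],
}
pairs = web_correlated_pairs(search_results)
```

Here c1 and c3 share only a digital-library URL, so they are not counted as Web correlated.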
Pair-Wise Grouping Algorithm
• Generate pairs of citations by using similarity
metrics
• Use the training data to train a binary classifier
• Apply the classifier to determine whether the
pairs are matched
• Combine the predicted results to group the
citations into appropriate clusters.
• Filter out the pairs that would make the clusters
sparse.
Pair-Wise Similarity Metrics
• similarity metrics for Coauthor, Title, and
Venue:
1.CSM
2.MSF
• Similarity metrics for topic correlation:
TSM
• Similarity metrics for web correlation:
MNDF
Binary Classifier
• A binary classifier is used to learn the
distribution of pair-wise vectors.
• The pairs predicted as matched are used to
build citation clusters (by constructing an
undirected graph).
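One way to turn matched pairs into clusters is to treat each pair as an edge and take the connected components of the undirected graph. The sketch below does this with a union-find structure; the classifier itself is not shown, and the function name and citation ids are hypothetical.

```python
def cluster_matched_pairs(citations, matched_pairs):
    """Group citations into connected components of the match graph."""
    parent = {c: c for c in citations}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for c in citations:
        clusters.setdefault(find(c), set()).add(c)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical classifier output: pairs predicted as "same author".
citation_ids = ["c1", "c2", "c3", "c4", "c5"]
matched = [("c1", "c2"), ("c2", "c3"), ("c4", "c5")]
clusters = cluster_matched_pairs(citation_ids, matched)
```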
Cluster Filter
• A threshold is set for choosing which bridges
should be removed.
• A bridge is removed if the number of vertices
in each of the two components it connects
is above the given threshold.
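The filter can be sketched as follows, under one reading of the rule: a bridge is an edge whose removal disconnects the graph, and it is dropped when both sides would hold more vertices than the threshold. The function names are hypothetical, and the recursive search is only suitable for small graphs.

```python
from collections import defaultdict


def find_bridges(adj):
    """Return the bridges of an undirected simple graph as sorted tuples."""
    disc, low, bridges = {}, {}, set()
    counter = [0]

    def dfs(u, parent):
        disc[u] = low[u] = counter[0]
        counter[0] += 1
        for v in adj[u]:
            if v == parent:
                continue
            if v in disc:
                low[u] = min(low[u], disc[v])
            else:
                dfs(v, u)
                low[u] = min(low[u], low[v])
                if low[v] > disc[u]:  # no back edge climbs above u
                    bridges.add(tuple(sorted((u, v))))

    for node in list(adj):
        if node not in disc:
            dfs(node, None)
    return bridges


def side_size(adj, start, blocked_edge):
    """Count vertices reachable from start without crossing blocked_edge."""
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if tuple(sorted((u, v))) == blocked_edge or v in seen:
                continue
            seen.add(v)
            stack.append(v)
    return len(seen)


def filter_bridges(edges, threshold):
    """Return the edges kept after removing bridges between large sides."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    kept = {tuple(sorted(e)) for e in edges}
    for bridge in find_bridges(adj):
        a, b = bridge
        if (side_size(adj, a, bridge) > threshold
                and side_size(adj, b, bridge) > threshold):
            kept.discard(bridge)
    return kept


# Two triangles joined by the bridge (3, 4), plus a pendant vertex 7.
edges = [(1, 2), (2, 3), (1, 3), (3, 4), (4, 5), (5, 6), (4, 6), (6, 7)]
kept = filter_bridges(edges, threshold=2)
```

In this example the bridge (3, 4) is removed because both sides exceed two vertices, while (6, 7) stays because vertex 7 is alone on its side.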
Detecting Ambiguous Author Names
in Crowdsourced Scholarly Data
Prior Work
• Name disambiguation has been cast into the
problem of clustering a set of publications into
profiles such that each profile corresponds to
a single author.
Name Variations and Citations
• Extract the name variations from a collection
of publications
• Sort them by number of citations
• Look at the percentage of the total citations
that are attributed to the top name
variations. (A high percentage suggests that
the name is not ambiguous.)
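The percentage in the last step can be computed as in the sketch below. The function name, the `top_k` parameter, and the citation counts are hypothetical; the slides do not say how many top variations are counted or what cutoff is used.

```python
def top_variation_share(citations_by_variation, top_k=1):
    """Fraction of all citations attributed to the top_k name variations."""
    counts = sorted(citations_by_variation.values(), reverse=True)
    total = sum(counts)
    if total == 0:
        return 0.0
    return sum(counts[:top_k]) / total


# Hypothetical citation counts per name variation.
variations = {"J. Smith": 90, "John Smith": 8, "Smith, J.": 2}
share_top1 = top_variation_share(variations, top_k=1)
share_top2 = top_variation_share(variations, top_k=2)
```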
Topic Consistency
• Leverage the discipline tags crowdsourced
from the users of the Scholarometer system
• Detect different but related disciplines
associated with an author name:
• Map an author’s publications to topics, and
measure the similarity between these topics.
• Derive an author’s topic profile
A brief survey of automatic methods
for author name disambiguation
Two problems
• Synonyms: the same author may appear
under distinct names
• Polysems: distinct authors may have similar
names.
Proposed taxonomy
Author Grouping Methods
• Defining a similarity function:
1.Using predefined functions: the Levenshtein distance,
Jaccard coefficient, cosine similarity, soft-TFIDF and others.
2.Learning a similarity function: Use the training data to
produce a similarity function S from R × R (R: the set of
references) to {0, 1}, where 1 means that the two
references do refer to the same author and 0 means that
they do not.
3.Exploiting graph-based similarity functions: Create a
coauthorship graph G=(V, E) for each ambiguous group. Each
distinct coauthor name is represented by a single vertex, and
the weight of an edge is related to the number of articles
coauthored by the author names represented by its two
vertices.
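Three of the predefined functions from item 1 and the coauthorship graph from item 3 can be sketched in plain Python. The function names are hypothetical, and the edge weight here is simply the count of coauthored articles, which is one possible choice.

```python
import math
from collections import Counter
from itertools import combinations


def levenshtein(a, b):
    """Minimum number of single-character edits turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 1.0


def cosine(tokens_a, tokens_b):
    """Cosine similarity of two token lists using raw term counts."""
    va, vb = Counter(tokens_a), Counter(tokens_b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def coauthorship_graph(articles):
    """Map each pair of author names to the number of shared articles."""
    weights = Counter()
    for authors in articles:
        for a, b in combinations(sorted(set(authors)), 2):
            weights[(a, b)] += 1
    return weights


# Hypothetical author lists, one per article.
graph = coauthorship_graph([
    ["A. Gupta", "B. Li"],
    ["A. Gupta", "B. Li", "C. Wu"],
])
```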
Author Grouping Methods
• Clustering Techniques:
1.Partitioning
2.Hierarchical agglomerative clustering
3.Density-based clustering
4.Spectral clustering
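As an illustration of technique 2, here is a minimal sketch of hierarchical agglomerative clustering with average linkage that stops when the best merge falls below a threshold. The linkage choice, the threshold, the Jaccard similarity over coauthor sets, and the sample data are assumptions made for the example.

```python
def agglomerative_clusters(items, similarity, threshold):
    """Repeatedly merge the two most similar clusters (average linkage)."""
    clusters = [[x] for x in items]
    while len(clusters) > 1:
        best, best_pair = float("-inf"), None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sims = [similarity(a, b)
                        for a in clusters[i] for b in clusters[j]]
                score = sum(sims) / len(sims)
                if score > best:
                    best, best_pair = score, (i, j)
        if best < threshold:
            break  # no remaining pair of clusters is similar enough
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Hypothetical references described by their coauthor sets.
coauthors = {
    "r1": {"li", "wu"},
    "r2": {"li", "wu", "chen"},
    "r3": {"garcia"},
    "r4": {"garcia", "lopez"},
}


def coauthor_jaccard(x, y):
    a, b = coauthors[x], coauthors[y]
    return len(a & b) / len(a | b)


groups = agglomerative_clusters(sorted(coauthors), coauthor_jaccard, 0.4)
```

With these inputs the references end up in two groups, r1 with r2 and r3 with r4.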
Author assignment methods
• Classification: Assign the references to their
authors using a supervised machine learning
technique.
• Clustering: Use probabilistic techniques to
determine the author in an iterative way to fit
the model.
Explored evidence
• Citation information: the attributes directly
extracted from the citations, such as
author/coauthor names, work title, publication
venue title, year, and so on.
• Web information: Data retrieved from the web
that is used as additional information about an
author publication profile.
• Implicit evidence: Evidence inferred from visible
elements of attributes, such as the latent topics
of a citation.
Summary of characteristics-Author
grouping methods
Summary of characteristics-Author
assignment methods
Open challenges
• Very little data in the citations
• Very ambiguous cases -- ambiguous references will
have coauthors who also have ambiguous names
(especially Asian names)
• Citations with errors
• Efficiency
• Different knowledge areas -- our focus is only on
computer science
• Incremental disambiguation
• Author profile changes
• New authors
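pandasearch name disambiguation: research plan
• Read the related papers and identify the approach
best suited to the current problem.
• Focus on implicit evidence and web information
(especially scholars' personal homepages and CVs).
• Work on both efficiency and accuracy, with the
emphasis on accuracy.
• Study the fundamentals of data mining and machine
learning.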
pandasearch name disambiguation: implementation plan
• Type of approach: author grouping methods,
learning a similarity function.
• Explored evidence: citation information, web
information, implicit evidence.