Author linkage

Vetle I. Torvik
PubMed/MEDLINE is topic-driven
• Articles in MEDLINE are assigned Medical Subject Headings (MeSH)
• PubMed converts free text into a query that uses MeSH
• One can search for an author's last name and initials, restricted by title words, MeSH, first author's affiliation, etc.
➢ CANNOT restrict by corresponding author's affiliation, full first names, or sole, last, or first author
➢ NOT sufficient when searching for papers by a particular individual
Until 2002, MEDLINE author names were encoded as (last name, initials)
– 1200+ articles with the name JA Smith
– hard to find papers by an author with a common name
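A minimal sketch of why the (last name, initials) encoding loses information: distinct individuals collapse onto the same key. The names below are hypothetical, and `medline_key` is an illustrative helper, not a MEDLINE or PubMed function.

```python
from collections import defaultdict


def medline_key(full_name):
    """Collapse 'First Middle Last' into a (last name, initials) style key."""
    *given, last = full_name.split()
    initials = "".join(part[0].upper() for part in given)
    return f"{last} {initials}"


# Hypothetical authors; three different people share one encoded name.
authors = ["John A. Smith", "Jane Alice Smith", "James Andrew Smith", "Carol Friedman"]

groups = defaultdict(list)
for name in authors:
    groups[medline_key(name)].append(name)

for key, people in groups.items():
    print(key, "->", people)
```

Any search on the key "Smith JA" returns papers by all three people, which is the disambiguation problem the rest of the talk addresses.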
[Figure: log-log plot of Number of Author Names (y-axis, 10^0 to 10^6) versus Number of Articles (x-axis, 10^0 to 10^4)]
If we knew who published what, we could...
▪ Study the structure of the scientific enterprise (e.g. collaboration graphs)
▪ Improve citation analysis
▪ Link authors across disciplines A and C
– to find a collaborator in another field, with expertise (A) complementary to yours (C)
– to make a list of invitees for a cross-disciplinary meeting/workshop
▪ Etc.
Author-ity: A probabilistic model for author name disambiguation
▪ Does a pair of author names (sharing last name and first initial), on two different MEDLINE articles, refer to the same individual?
▪ Automatically generate two large reference sets of pairs of matching and non-matching papers that are unbiased samples of MEDLINE as a whole
▪ Capture multiple aspects of similarity between a pair of articles (title, journal, co-authors, MeSH, language, affiliation, middle initial, suffix)
x = (2, 0, 0, 1, 1, 1, 2, 0)
Pr{x|Match}/Pr{x|Non-match} = 22.6
For the name C. Friedman, the overall (prior) probability of a match is Pr{Match} = 0.021
Bayes' theorem then gives Pr{Match|x} = 0.32
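The Bayes step can be checked with a short calculation in odds form: posterior odds = likelihood ratio × prior odds. This is only a sketch of the arithmetic on this slide; the function name is illustrative and is not part of Author-ity.

```python
def posterior_match_probability(likelihood_ratio, prior):
    """Combine a likelihood ratio with a prior match probability via Bayes' theorem."""
    prior_odds = prior / (1.0 - prior)
    posterior_odds = likelihood_ratio * prior_odds
    return posterior_odds / (1.0 + posterior_odds)


# Values from the slide: Pr{x|Match}/Pr{x|Non-match} = 22.6, Pr{Match} = 0.021
posterior = posterior_match_probability(22.6, 0.021)
print(f"Pr(Match | x) is about {posterior:.2f}")
```

With the rounded inputs shown, this comes out at approximately 0.33, close to the 0.32 on the slide; the small difference is presumably due to rounding in the displayed inputs.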
AUTHOR-ITY
INPUT:
1) a last name and first initial
2) click on a MEDLINE article
OUTPUT:
All articles with that name, ranked by decreasing match probability
Turns out that...
▪ Even though matching papers tend to have much more in common than non-matching ones,
▪ almost 40% of all matching papers have nothing in common other than last name, initials, and language, partly because
– only 40% of MEDLINE records have affiliations (mostly older ones)
– middle initial is often omitted
▪ That is,
– The pairwise model is not sufficient
– MEDLINE information alone is not sufficient
Can we partition all of MEDLINE by author-individuals?
▪ Using clustering algorithms
▪ Using supplemental information
– from publishers (EBSCO, OVID, Elsevier, ...)
• full author names
• affiliations for all authors
– on the web
• automatically recognizing scientists' home pages
• extracting information into database form (e.g. list of publications)
Improving accuracy by using clustering algorithms
▪ To create clusters of papers by individuals
▪ Takes into account higher-order interactions between papers
– even though the pair of papers (P2, P3) has a low match probability, both match paper P1 strongly, so all three are likely to refer to the same individual
[Figure: three papers with pairwise match probabilities (P1, P2) = 0.9, (P1, P3) = 0.9, (P2, P3) = 0.2]
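The P1/P2/P3 example can be sketched with a simple transitive grouping: link any pair whose match probability clears a threshold, then take connected components. This is an illustrative stand-in (single-linkage via union-find with an assumed threshold of 0.5), not necessarily the clustering algorithm used in Author-ity.

```python
def cluster_papers(pair_probs, threshold=0.5):
    """Group papers into clusters by linking pairs with probability >= threshold."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for (a, b), prob in pair_probs.items():
        root_a, root_b = find(a), find(b)
        if prob >= threshold and root_a != root_b:
            parent[root_a] = root_b

    clusters = {}
    for paper in list(parent):
        clusters.setdefault(find(paper), []).append(paper)
    return sorted(sorted(members) for members in clusters.values())


# Pairwise match probabilities from the slide's example.
pair_probs = {("P1", "P2"): 0.9, ("P1", "P3"): 0.9, ("P2", "P3"): 0.2}
print(cluster_papers(pair_probs))
```

P2 and P3 end up in the same cluster despite their low direct match probability, because each is strongly linked to P1. A purely pairwise decision on (P2, P3) would have split them.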
Improving accuracy with supplemental information from publishers' web sites and scientists' home pages
▪ Supplemental information can be automatically extracted from the internet
▪ Original articles most often encode
– full first names
– all affiliations for all authors (by superscripts)
▪ Scientists' home pages often include
– their affiliation
– a list of their publications
Author linkage: the story of how Professor Cohen found Professor Gould
▪ Professor Cohen, a vascular surgeon, has had a number of patients who presented with aortic aneurysm and retinal detachment
▪ He searches the literature and finds some articles describing similar cases but nothing directly explaining the connection between the two conditions
▪ He then performs an Arrowsmith search and finds many potential connections among the B-terms, such as Marfan's syndrome, Ehlers-Danlos syndrome, and amyloidosis
▪ He wants to find an expert in retinal detachment who would be interested in studying these potential connections with a vascular surgeon
Who would be a good candidate collaborator to study these connections?
▪ Professor Cohen then
– defines the A-literature by narrowing down the retinal detachment literature to include some of the interesting B-terms
– defines the C-literature by aortic aneurysm
– performs an Arrowsmith author-mode search and finds that Professor Gould
• has published a number of articles on retinal detachment in relation to several of the interesting B-terms (but not to aortic aneurysm)
• and has co-authored papers with Dr. Williams, who has separately published articles on aortic aneurysms
▪ Turns out that Arrowsmith has a link to Professor Gould's home page with contact information and everything.
▪ Professor Cohen then picks up the phone...
Four degrees of B-authors
▪ 0th - with papers in the direct A ∩ C literature
▪ 1st - with papers in both A and C, but not in the direct A ∩ C literature
▪ 2nd - with papers in either A or C, but not both, who have co-authored with individuals who have papers in the other literature (e.g., Professor Gould)
▪ 3rd - no papers in either A or C, but have co-authored papers with an A-author and a C-author
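The four degrees can be written as a small classification over sets of papers. This is a sketch under stated assumptions: author names are taken to be already disambiguated, `papers` maps hypothetical paper IDs to author sets, and `lit_a` / `lit_c` are the paper IDs in the A and C literatures. The toy data (apart from the names Gould and Williams from the story) is invented for illustration, and the 3rd-degree rule here accepts any co-authors that together cover both A and C.

```python
def b_author_degree(author, papers, lit_a, lit_c):
    """Return 0, 1, 2, or 3 for a B-author, or None if no degree applies."""
    own = {p for p, authors in papers.items() if author in authors}
    if own & lit_a & lit_c:
        return 0  # at least one paper in the direct A-and-C literature
    has_a, has_c = bool(own & lit_a), bool(own & lit_c)
    if has_a and has_c:
        return 1  # separate papers in A and in C
    coauthors = set().union(*(papers[p] for p in own)) - {author}

    def publishes_in(person, literature):
        return any(person in papers[p] for p in literature)

    co_a = any(publishes_in(c, lit_a) for c in coauthors)
    co_c = any(publishes_in(c, lit_c) for c in coauthors)
    if (has_a and co_c) or (has_c and co_a):
        return 2  # one literature, plus a co-author in the other
    if not has_a and not has_c and co_a and co_c:
        return 3  # neither literature, but co-authors in both
    return None


# Hypothetical toy data: paper ID -> set of (disambiguated) authors.
papers = {
    "p1": {"Gould"},                      # in A (retinal detachment)
    "p2": {"Williams"},                   # in C (aortic aneurysm)
    "p3": {"Gould", "Williams", "Chen"},  # in neither literature
    "p4": {"Lee"},                        # in both A and C
    "p5": {"Park"},                       # in A
    "p6": {"Park"},                       # in C
}
lit_a = {"p1", "p4", "p5"}
lit_c = {"p2", "p4", "p6"}

for name in ["Lee", "Park", "Gould", "Chen"]:
    print(name, b_author_degree(name, papers, lit_a, lit_c))
```

In this toy data Gould comes out as a 2nd-degree B-author, matching the story: papers in A only, plus a co-author (Williams) who publishes in C.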
What type of research are the B-authors conducting?
▪ 0th - a research project that crosses A and C?
▪ 1st - somewhat disparate research in each of A and C?
▪ 2nd - research in A or C, and collaborating with individuals working in the other discipline?
▪ 3rd - research in a collaborative discipline (e.g., bioethics, statistics, or bioinformatics)?
Who would want to identify B-authors, and why?
▪ scientists looking for information that is related to A and C but is not in the public domain (e.g., raw data, failed experiments, personal research notes)
▪ scientists looking for collaborators who are specialists in a different discipline
▪ administrators (e.g., program directors for funding agencies) or meeting organizers looking for individuals who may facilitate research collaborations across two disciplines
▪ Etc.
Are you looking to branch into
another discipline?
Who ya gonna call?