Author linkage
Vetle I. Torvik
PubMed/MEDLINE is topic-driven
• Articles in MEDLINE are assigned Medical Subject Headings (MeSH)
• PubMed converts free text into a query that uses MeSH
• One can search for an author's last name and initials, restricted by title words, MeSH, first author's affiliation, etc.
One CANNOT restrict by corresponding author's affiliation, full first names, or sole, last, or first author
This is NOT sufficient when searching for papers by a particular individual
Until 2002, MEDLINE author names were encoded as (last name, initials)
– 1200+ articles with the name JA Smith
– hard to find papers by an author with a common name
[Figure: log-log plot of Number of Author Names (vertical axis, 10^0 to 10^6) against Number of Articles (horizontal axis, 10^0 to 10^4)]
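To illustrate why the (last name, initials) encoding makes common names hard to search, here is a minimal sketch. The function name medline_name_key and the three example names are hypothetical, chosen only for illustration; this is not MEDLINE's actual indexing code.

```python
def medline_name_key(full_name):
    """Collapse 'First Middle Last' into a (last name, initials) pair.

    Hypothetical helper: it mimics the pre-2002 style of encoding,
    not any real MEDLINE routine.
    """
    parts = full_name.split()
    last = parts[-1]
    initials = "".join(p[0].upper() for p in parts[:-1])
    return (last, initials)


# Three made-up individuals with different full names.
names = ["John Adam Smith", "Jane Alice Smith", "Julia A. Smith"]

# All three collapse to the same key, so a search cannot tell them apart.
keys = {medline_name_key(n) for n in names}
```

Here keys contains the single pair ("Smith", "JA"), which is the kind of collision behind the JA Smith example above.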
If we knew who published what,
we could...
Study the structure of the scientific
enterprise (e.g. collaboration graphs)
Improve citation analysis
Link authors across disciplines A and C
– to find a collaborator in another field, with
expertise (A) complementary to yours (C)
– to make a list of invitees for a
cross-disciplinary meeting/workshop
ETC.
Author-ity: A probabilistic model for
author name disambiguation
Does a pair of author names (sharing last name, first initial),
on two different MEDLINE articles, refer to the same
individual?
Automatically generate two large reference sets of pairs of
matching and non-matching papers, unbiasedly representing
MEDLINE as a whole
Capture multiple aspects of similarity between a pair of articles
(title, journal, co-authors, MeSH, language, affiliation, middle initial, suffix)
x = (2, 0, 0, 1, 1, 1, 2, 0)
Pr{x|Match}/Pr{x|Non-match} = 22.6
For the name C. Friedman, the overall probability of a match is Pr{Match} = 0.021
Bayes' theorem then gives Pr{Match|x} = 0.32
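The Bayes step can be checked with a few lines of arithmetic: convert the prior to odds, multiply by the likelihood ratio, and convert back to a probability. The function name below is an assumption made for this sketch. With the rounded inputs shown on the slide the result is about 0.33, close to the quoted 0.32; the small gap presumably comes from rounding of the displayed inputs.

```python
def posterior_match_probability(likelihood_ratio, prior):
    """Posterior Pr{Match|x} from Pr{x|Match}/Pr{x|Non-match} and Pr{Match}."""
    prior_odds = prior / (1.0 - prior)                # Pr{Match} / Pr{Non-match}
    posterior_odds = likelihood_ratio * prior_odds    # Bayes' theorem in odds form
    return posterior_odds / (1.0 + posterior_odds)    # odds back to probability


# Values taken from the slide: ratio 22.6, prior 0.021.
p = posterior_match_probability(22.6, 0.021)
print(round(p, 2))  # prints 0.33
```

Note how a likelihood ratio well above 1 still yields a posterior below 0.5 when the prior is as small as 0.021.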
AUTHOR-ITY
INPUT:
1) a last name and first initial
2) click on a MEDLINE article
OUTPUT:
All articles with that name, ranked by decreasing match probability
Turns out that...
Even though matching papers tend to have much
more in common than non-matching ones,
Almost 40% of all matching papers have nothing
in common other than last name, initials and
language, partly because
– only 40% of MEDLINE records have affiliations
(mostly older ones)
– middle initial is often omitted
That is,
– The pairwise model is not sufficient
– MEDLINE information alone is not sufficient
Can we partition all of MEDLINE
by author-individuals?
Using clustering algorithms
Using supplemental information
– from publishers (EBSCO, OVID, Elsevier, ...)
• full author names
• affiliations for all authors
– on the web
• automatically recognizing scientists' home pages
• extracting information into database form
(e.g. list of publications)
Improving accuracy by using
clustering algorithms
To create clusters of papers by individuals
Takes into account higher-order interactions between papers
– even though the pair of papers (P2, P3) has a low match probability, both match paper P1 strongly, so they are likely to refer to the same individual
[Figure: triangle of papers P1, P2, P3 with pairwise match probabilities P1-P2 = 0.9, P1-P3 = 0.9, P2-P3 = 0.2]
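One simple way to see the effect of the P1, P2, P3 example is single-linkage clustering: link every pair whose match probability clears a threshold, then take connected components. This is a sketch of that general technique, not necessarily the clustering algorithm used in Author-ity; the function name cluster_papers and the 0.5 threshold are assumptions.

```python
def cluster_papers(papers, match_prob, threshold=0.5):
    """Group papers into connected components of high-probability links.

    match_prob maps (paper, paper) pairs to pairwise match probabilities.
    The threshold of 0.5 is an arbitrary choice for illustration.
    """
    parent = {p: p for p in papers}

    def find(p):
        # Follow parent pointers to the root, compressing the path.
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for (a, b), prob in match_prob.items():
        if prob >= threshold:
            parent[find(a)] = find(b)  # merge the two components

    clusters = {}
    for p in papers:
        clusters.setdefault(find(p), set()).add(p)
    return sorted(clusters.values(), key=lambda c: sorted(c))


# Pairwise probabilities from the example above.
probs = {("P1", "P2"): 0.9, ("P1", "P3"): 0.9, ("P2", "P3"): 0.2}
clusters = cluster_papers(["P1", "P2", "P3"], probs)
```

P2 and P3 end up in the same cluster despite their 0.2 pairwise probability, because each is strongly linked to P1. Single linkage can also chain unrelated papers together, so a practical system would likely need a more careful merging criterion.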
Improving accuracy with supplemental information
from publishers' web sites and scientists' home pages
Supplemental information can be
automatically extracted from the internet
Original articles most often encode
– full first names
– all affiliations for all authors (by superscripts)
Scientists’ home pages often include
– their affiliation
– a list of their publications
Author linkage: the story of how
Professor Cohen found Professor Gould
Professor Cohen, a vascular surgeon, has had a number of
patients who presented with aortic aneurysm and retinal
detachment
He searches the literature and finds some articles describing
similar cases but nothing directly explaining the connection
between the two symptoms
He then performs an Arrowsmith search and finds many
potential connections among the B-terms, such as Marfan's
syndrome, Ehlers-Danlos syndrome, and amyloidosis
He wants to find an expert in retinal detachment who would
be interested in studying these potential connections with a
vascular surgeon
Who would be a good candidate
collaborator to study these connections?
Professor Cohen then
– defines the A-literature by narrowing down the retinal detachment
literature to include some of the interesting B-terms
– defines the C-literature by aortic aneurysm
– performs an Arrowsmith author-mode search and finds that
Professor Gould
• has published a number of articles on retinal detachment in relation to
several of the interesting B-terms (but not to aortic aneurysm)
• and has co-authored papers with Dr. Williams who has separately
published articles on aortic aneurysms
Turns out that Arrowsmith has a link to Professor Gould’s
home page with contact information and everything.
Professor Cohen then picks up the phone...
Four degrees of B-authors
0th - with papers in the direct A∩C literature
1st - with papers in both A and C, but not in the direct A∩C literature
2nd - with papers in either A or C, but not both, who have co-authored with individuals who have papers in the other literature (e.g., Professor Gould)
3rd - with no papers in either A or C, but who have co-authored papers with an A-author and a C-author
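The four degrees can be expressed as a small classification over sets of paper identifiers. The sketch below assumes that the direct literature is the set of papers belonging to both A and C, and that co-authorship means sharing at least one paper. The function name, the paper IDs, and all author names other than Gould and Williams are hypothetical; this is not Arrowsmith's implementation.

```python
def b_author_degree(author, papers_by_author, lit_a, lit_c):
    """Return 0, 1, 2, or 3 for a B-author, or None if no degree applies."""
    papers = papers_by_author[author]
    in_a = bool(papers & lit_a)
    in_c = bool(papers & lit_c)

    if papers & lit_a & lit_c:
        return 0  # has a paper in the direct (both A and C) literature
    if in_a and in_c:
        return 1  # separate papers in A and in C

    # Co-authors are other authors who share at least one paper.
    coauthors = [o for o, theirs in papers_by_author.items()
                 if o != author and papers & theirs]
    co_in_a = any(papers_by_author[o] & lit_a for o in coauthors)
    co_in_c = any(papers_by_author[o] & lit_c for o in coauthors)

    if (in_a and co_in_c) or (in_c and co_in_a):
        return 2  # works in one literature, collaborates with the other
    if not in_a and not in_c and co_in_a and co_in_c:
        return 3  # bridges A-authors and C-authors without papers in either
    return None


# Made-up paper IDs: r* retinal detachment (A), a* aortic aneurysm (C).
papers_by_author = {
    "Gould": {"r1", "r2", "gw1"},
    "Williams": {"gw1", "a1"},
    "Lee": {"ac1"},
    "Park": {"r3", "a2"},
    "Kim": {"s1", "s2"},
    "Ito": {"s1", "r4"},
    "Diaz": {"s2", "a3"},
}
lit_a = {"r1", "r2", "r3", "r4", "ac1"}
lit_c = {"a1", "a2", "a3", "ac1"}

degrees = {name: b_author_degree(name, papers_by_author, lit_a, lit_c)
           for name in papers_by_author}
```

In this toy data Gould comes out as a 2nd-degree B-author, matching the story: papers in A only, plus a shared paper with Williams, who has published in C.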
What type of research are
the B-authors conducting?
0th - a research project that crosses A and C?
1st - somewhat disparate research in each of A and C?
2nd - research in A or C, and collaborating with
individuals working in the other discipline?
3rd - research in a collaborative discipline (e.g.,
bioethics, statistics, or bioinformatics)?
Who would want to identify
B-authors, and why?
scientists looking for information that is related to A and
C but is not in the public domain (e.g., raw data,
failed experiments, personal research notes)
scientists looking for collaborators who are
specialists in a different discipline
administrators (e.g., program directors for funding
agencies) or meeting organizers looking for
individuals that may facilitate research
collaborations across two disciplines
Etc.
Are you looking to branch into
another discipline?
Who ya gonna call?