1
Developing a Web Integrated Database
~Data mining, data quality and performance evaluation~
國立中山大學 (National Sun Yat-Sen University)
Kaohsiung, December 2006
Y. Adachi (足立泰)
Regional Manager, Asia Pacific
Elsevier
2
Agenda
 Developing a web integrated citation database
   User Centered Design
   Why a publisher?
   Searching – four domains of information space
   Integrating our all-science search engine
   Ensuring quality of citations
 Technology behind citation count and citation matching
   Data flow and processing
   References matching
 Evaluating scientific research output
   Why is evaluation so important?
   Case study – evaluating an author
Define: data mining
Developing a web integrated database
5
Researching Research
Research carried out on behalf of Elsevier by Redesign Research at the University of Toronto, Department of Pharmacology & Pharmaceutical Sciences.
6
User Centered Design Approach…
Focus on what users do, not on what they say…
Users “think aloud”
7
Starting from the users’ needs
If we understand the researcher workflow, we can design better products.
 Understand – users, their tasks, and their work environments
 Design – user interfaces that enable users to achieve their goals efficiently
 Evaluate – product designs with users throughout the product lifecycle
Why a publisher?
9
The publishing cycle, with its scale:
 Organize editorial boards – 7,000 editors; 70,000 editorial board members; hundreds of new editors per year
 Launch new specialist journals – 10-20 new journals per year
 Solicit and manage submissions – 500,000+ article submissions; 6.5 million author/publisher communications per year
 Manage peer review – 200,000 referees; 1 million referee reports per year; 40-90% of articles rejected
 Edit and prepare / Production – 250,000 new articles produced per year; 2.5 million print pages per year
 Publish and disseminate – 1,800+ journals reaching 20 million researchers at 6,000+ institutions in 180+ countries; 240 million+ downloads per year
 Archive and promote – 7.5 million articles; 180 years of back issues scanned, processed and data-tagged
10
Technologies that drive the process
The same publishing cycle (organize editorial boards; launch new specialist journals; solicit and manage submissions; manage peer review; edit and prepare; production; publish and disseminate; archive and promote) is supported by dedicated systems: a Production Tracking System, an Electronic Warehouse, eJournal Backfiles and eReference Works.
11
How do users cope with this complex environment?
12
Searching the four domains
Research information about science, medicine, technology and the social sciences lives in four domains:
 Peer reviewed literature
 Websites and digital archives
 Patents
 Institutional repositories
13
Increased use of web documents
Increase in the number of Web citations:
 2.4% of references in all biomedical journals published between Aug ’97 and April ’05 are pure URL citations*
 The percentage of articles in oncology journals that include one or more web citations increased from 9% in 2001, to 11% in 2002, to 16% in 2003**
Type of Web content cited in Scopus abstracts: Theses 29%; Patents 29%; Standards 11%; Other (cited more than 10x) 31%
*“Webcitations archived with WebCite: going, going, still there”, Gunther Eysenbach
**“Internet Citations in Oncology Journals: A Vanishing Resource?”, Eric J. Hester, Journal of the National Cancer Institute, Vol. 96, No. 12
14
Web search engine – Scirus
Coverage:
 Scientific Web pages (200M+): .edu, .ac.uk, .org, and – only when relevant to research – .gov, .com and other domains
 Repositories: NDLTD, DiVA
 Pre-print servers: ArXiv, CogPrints
 Patents: JPO, USPTO
 Proprietary content (50M+):
   Elsevier: ScienceDirect
   Other publishers: AIP, BioMedCentral
   Societies: SIAM
Searching:
 Functionality: author, journal
 Ranking: optimised for science content
 Classification: document and subject
Named “Best Directory or Search Engine” at the WebAwards – won 3 consecutive years from the Web Marketing Association.
15
Pinpointing Results: The Inverted Pyramid
Seed list creation & maintenance → focused crawling / OAI harvesting → database load → classification → Scirus index → query → ranking → results
16
Seed list creation and maintenance
 An automatic URL extractor tool identifies new scientific seeds via link analysis of the most popular sites in specific subject areas (see the sketch below)
 Elsevier publishing units supply a list of sites in their subject area
 Scientific, Library and Technical Advisory Boards provide input
 Webmasters and Scirus users regularly submit suggestions for new sites
 Easily identifiable URLs are added on a regular basis; example: www.newscientist.com
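To make the link-analysis idea concrete, here is a minimal, illustrative Python sketch of an automatic URL extractor: it pulls outbound links from a popular page and keeps hosts whose domain suffix suggests academic or scientific content. The function names and the suffix list are assumptions for illustration, not the actual Scirus tool.

```python
# A minimal sketch of automatic seed extraction: collect outbound links from
# a popular page in a subject area so they can be reviewed as candidate seeds.
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collects absolute outbound URLs from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = set()

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.add(urljoin(self.base_url, value))


def candidate_seeds(base_url, html_text,
                    allowed_suffixes=(".edu", ".ac.uk", ".gov", ".org")):
    """Return outbound hosts whose domain suffix suggests scientific content."""
    parser = LinkExtractor(base_url)
    parser.feed(html_text)
    hosts = {urlparse(link).netloc for link in parser.links}
    return sorted(h for h in hosts if h.endswith(allowed_suffixes))
```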
17
Focused crawling
The Scirus robot crawls the Web to find new documents and update existing ones. A scheduler coordinates the crawl: it prioritizes documents for crawling, tracks the rules set by webmasters in their robots.txt files, and limits the number of requests the robot sends to a server. Independent machine nodes crawl the Web in tandem, sharing link and meta information.
The robot collects documents and sends them to the index. A copy of each page is stored so that Scirus can show the portion of the document that actually contains the search query term.
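The scheduler's three jobs – prioritising, honouring robots.txt, and throttling per host – can be sketched as follows. This is an illustrative sketch using Python's standard urllib.robotparser; the class name, priority scheme and one-second delay are assumptions, not Scirus internals.

```python
# A minimal sketch of a polite crawl scheduler: prioritise URLs, honour
# robots.txt, and limit the request rate per host.
import heapq
import time
import urllib.robotparser
from urllib.parse import urlparse


class CrawlScheduler:
    def __init__(self, user_agent="ExampleBot", delay_seconds=1.0):
        self.user_agent = user_agent
        self.delay = delay_seconds
        self.queue = []            # (priority, url) min-heap
        self.robots = {}           # host -> RobotFileParser
        self.last_request = {}     # host -> timestamp of last fetch

    def add(self, url, priority=0):
        heapq.heappush(self.queue, (priority, url))

    def _allowed(self, url):
        host = urlparse(url).netloc
        if host not in self.robots:
            rp = urllib.robotparser.RobotFileParser()
            rp.set_url(f"http://{host}/robots.txt")
            try:
                rp.read()
            except OSError:
                pass  # unreachable robots.txt: parser defaults to "allowed"
            self.robots[host] = rp
        return self.robots[host].can_fetch(self.user_agent, url)

    def next_url(self):
        """Pop the highest-priority URL that robots.txt permits, sleeping
        if the host was contacted too recently."""
        while self.queue:
            _, url = heapq.heappop(self.queue)
            if not self._allowed(url):
                continue
            host = urlparse(url).netloc
            wait = self.delay - (time.time() - self.last_request.get(host, 0))
            if wait > 0:
                time.sleep(wait)
            self.last_request[host] = time.time()
            return url
        return None
```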
18
Results ranking – terms and links
Term frequency
 Is the term in the title?
 Is the term in the text of a link?
 Where is the term located in the text (top, bottom)?
 How many times is the term used?
Link analysis
 The number of links to a page is analyzed: the importance of a page is determined by the number of links pointing to it
 Scirus analyses the anchor text – the text of a link or hyperlink – to determine the relevance of a site
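As a rough illustration of how such term and link signals might be combined into a single score, here is a Python sketch. The weights, field names and log damping are invented for illustration; the real Scirus ranking function is not public.

```python
# A minimal sketch combining term signals (title, anchor text, position,
# frequency) with a link-analysis signal into one ranking score.
import math


def term_score(term, doc):
    """doc: dict with 'url', 'title', 'body' (list of words), 'anchor_texts'."""
    term = term.lower()
    body = [w.lower() for w in doc["body"]]
    score = 0.0
    if term in doc["title"].lower().split():
        score += 3.0                           # term appears in the title
    if any(term in a.lower() for a in doc["anchor_texts"]):
        score += 2.0                           # term appears in link text
    if term in body:
        position = body.index(term) / max(len(body), 1)
        score += 1.0 - position                # earlier in the text scores higher
        score += math.log1p(body.count(term))  # diminishing returns on frequency
    return score


def rank(term, docs, inlink_counts):
    """Order documents by term score plus a log-damped inlink count."""
    def combined(doc):
        return term_score(term, doc) + 0.5 * math.log1p(inlink_counts.get(doc["url"], 0))
    return sorted(docs, key=combined, reverse=True)
```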
19
Do a search on ‘nanotube’ using Google™. What do you do with 2,170,000 results?
20
A ‘nanotube’ search on Google Scholar™. What do you do with 77,900 results?
Search ‘nanotube’ using an integrated database
The integrated search covers all four domains – peer reviewed literature, websites and digital archives, patents, and institutional repositories – across science, medicine, technology and the social sciences.
21
Results overview from peer reviewed literature
22
Results overview from selected web sources
23
Results overview from patent offices
24
Results overview from NSYSU selected sources
NSYSU + NDLTD
NSYSU – eThesys (Electronic Theses Harvestable and Extensible System)
25
26
Limit documents to NSYSU
NSYSU Only
Link to eThesys
27
Web citations
From the abstract page, a web search on article title, author name and keywords links back to all four domains: peer reviewed literature, websites and digital archives, patents, and institutional repositories.
Ensuring the quality of citations
Technology behind citation count
and citation matching
29
Our database figures
15,670 titles
 13,500+ academic journals
 750+ conference proceedings
 600+ trade publications
28 million abstracts
250 million references
How do we maintain quality?
30
Why is accurate citation so important?
Citation navigation
 Accurate forward citation links (cited by) and backward citation links (references)
Citation count
 Accuracy in the number of times an article is cited
The accuracy of the references determines the quality of a citation database.
31
Flow of bibliographic data
Issues arrive either as FTP’d e-issues from the publisher or as printed issues (roughly a 60/40 split between the two routes); printed issues are scanned, read by OCR (text reader) or retyped. After registration, the data flows through:
 Content indexing with controlled vocabularies; enriching records
 OPSbank: Elsevier’s content repository for A&I; quality checks
 Database Warehouse: citation matching
 FAST: search engine indexing
 Dayton server: ultimate storage
32
We have a data processing function specifically for citation matching
Data is first exported from OPSbank to the Database Warehouse, where it is processed (de-duplication, citation matching) and then exported to our database server. A sketch of the de-duplication idea follows below.
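As a rough illustration of the de-duplication pass, here is a Python sketch that keys records on a normalised (title, year) pair so trivially different copies of the same record collapse into one. The key choice and tie-breaking rule are assumptions for illustration, not Elsevier's actual algorithm.

```python
# A minimal de-duplication sketch: collapse records that share a normalised
# (title, year) key, keeping the copy with the most filled-in fields.
import re


def dedup_key(record):
    """Normalise the title (case, punctuation, whitespace) and pair it with year."""
    title = re.sub(r"[^a-z0-9]+", " ", record["title"].lower()).strip()
    return (title, record["year"])


def completeness(record):
    return sum(1 for v in record.values() if v)


def deduplicate(records):
    seen = {}
    for rec in records:
        key = dedup_key(rec)
        if key not in seen or completeness(rec) > completeness(seen[key]):
            seen[key] = rec     # keep the more complete duplicate
    return list(seen.values())
```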
33
Matching references
Each new piece of reference data is compared against the existing clusters (see the sketch below):
 If a match is found, the item is added to the cluster
 If no match is found, a new cluster is created
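The cluster-assignment logic can be sketched as follows. The fuzzy title comparison, the 0.85 threshold and the field names are illustrative assumptions, not the production algorithm.

```python
# A minimal sketch of the clustering step: an incoming reference joins the
# first cluster whose representative it matches, otherwise it starts a new one.
from difflib import SequenceMatcher


def similar(a, b, threshold=0.85):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


def matches(ref, rep):
    """Tolerate a missing volume (as in the first example on the next slide)
    by falling back to title + year agreement."""
    if ref["year"] != rep["year"]:
        return False
    if ref.get("volume") and rep.get("volume") and ref["volume"] != rep["volume"]:
        return False
    return similar(ref["title"], rep["title"])


def assign(ref, clusters):
    """clusters: list of lists of reference dicts; first member is the representative."""
    for cluster in clusters:
        if matches(ref, cluster[0]):
            cluster.append(ref)     # match found: add to the cluster
            return cluster
    clusters.append([ref])          # no match: start a new cluster
    return clusters[-1]
```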
34
Matching examples
The system overcomes a missing volume number and uses the title to confirm the match:
REF: Aracil R. et al., "Multirate sampling technique in digital control systems simulation." IEEE Trans. Systems Man Cybernet., p. 776, 1984.
ITEM: Aracil R. et al., "MULTIRATE SAMPLING TECHNIQUE IN DIGITAL CONTROL SYSTEMS SIMULATION." IEEE Trans Syst Man Cybern, v. SMC-14, p. 776, 1984.
There are page, author, article title and journal discrepancies, but a match is still found:
REF: Keller-Wood M.K., Stenstrom B., Shinsako J. et al., "Interaction between CRF and AngII in control ACTH and adrenal steroids." Am. J. Physiol., v. 250, pp. 306-402, 1986.
ITEM: Keller-Wood, M., Kimura B., Shinsako J. et al., "Interaction between CRF and angiotensin II in control of ACTH and adrenal steroids." American Journal of Physiology - Regulatory Integrative and Comparative Physiology, v. 250, pp. 19/3, 1986.
35
Linking references to records (items)
Volume/issue number tagging, journal abbreviation, author initial:
 ref: R. Oliver, "The spots and stains of plate tectonics" Earth Sci. Rev. v. 2, p. 77-106, 1992
 item: J. Oliver, "The spots and stains of plate tectonics" Earth-Science Reviews. v. 32, n. 1-2, p. 77-106, 1992
Author typo, incomplete page info:
 ref: X. Malague, "Pipe inspection by infrared thermography" Mater Eval. v. 57, n. 9, p. 899-902, 1999.
 item: Xavier Maldague, "Pipe inspection by infrared thermography" Materials Evaluation. (Mater Eval) v. 57, n. 9, (6 pp), 1999.
Reference linking results: over 95% of possible links were found, and over 99.9% of links are correct.
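As a rough sketch of the normalisation that makes pairs like "X. Malague" and "Xavier Maldague" comparable, here is an illustrative Python fragment: case-fold, strip punctuation, reduce authors to surname plus first initial, and compare fields loosely. The thresholds and the crude surname heuristic are assumptions, not the production matcher.

```python
# A minimal sketch of field normalisation for reference-to-item linking.
import re
from difflib import SequenceMatcher


def norm_title(title):
    return re.sub(r"[^a-z0-9 ]", "", title.lower())


def norm_author(author):
    """'Xavier Maldague' or 'Maldague, X.' -> 'maldague x'."""
    parts = [p for p in re.split(r"[,.\s]+", author.strip().lower()) if p]
    surname = max(parts, key=len)          # crude heuristic: longest token
    initials = [p[0] for p in parts if p != surname]
    return surname + (" " + initials[0] if initials else "")


def link(ref, item):
    """True when the title matches strongly and the author survives a typo."""
    title_ok = SequenceMatcher(None, norm_title(ref["title"]),
                               norm_title(item["title"])).ratio() > 0.9
    author_ok = SequenceMatcher(None, norm_author(ref["author"]),
                                norm_author(item["author"])).ratio() > 0.7
    return title_ok and author_ok and ref["year"] == item["year"]
```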
36
Bridging clusters
These two original clusters (dummy items) could not be merged previously.
Original dummy item for a cluster with 4 refs:
 Naragan R. et al., In: Supercritical Fluid Science and Technology, pp. 226-241, 1989. (Joohnston, K. P., Penninger, J. M. L., Eds.; American Chemical Society: Washington, DC)
Original dummy item for a cluster with 6 refs:
 Narayan R. et al., "Kinetic elucidation of the acid-catalyzed mechanism of 1-propanol dehydration in supercritical water." In: ACS Symposium Series, v. 406, pp. 226-241, 1989 (Johnston, K. P., Penninger, J. M. L., Eds.; American Chemical Society: Washington, DC)
The following new reference came in with both an article title and a book title, which is sufficient to bridge the two clusters (despite the omitted word in the book title):
 Narayan R. et al., "A Kinetic Elucidation of the Acid-Catalyzed Mechanism of 1-Propanol Dehydration in Supercritical Water." In: Supercritical Science and Technology, pp. 226-241, 1989. (Johnston, K. P., Penninger, J. M. L., Eds.; ACS Symposium Series 406; American Chemical Society: Washington, DC)
37
Dummy records
A cluster may not match any record in the database; such clusters are called 'dummy records'. A dummy record contains all the information about the item that can be taken from the references. In our database you will see them as:
 dummy record – no link to an abstract
 real record – link to an abstract
A dummy record also has a "cited by" count.
38
As a result you will see…
 More accurate references
 More citations: references that look different (e.g. a typo, or a missing volume/issue/page) but cite the same item are counted together
39
Highly cited records
71,846 citations: Laemmli U.K., "Cleavage of structural proteins during the assembly of the head of bacteriophage T4." Nature, v. 227, pp. 680-685, 1970.
61,429 citations: Bradford M.M., "A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein dye binding." Analytical Biochemistry, v. 72, pp. 248-254, 1976.
37,823 citations: Chomczynski P., Sacchi N., "Single-step method of RNA isolation by acid guanidinium thiocyanate-phenol-chloroform extraction." Analytical Biochemistry, v. 162, pp. 156-159, 1987.
40
Highly cited dummy records
75,452 citations: Sambrook J. et al., "Molecular Cloning: A Laboratory Manual." Cold Spring Harbor Laboratory Press, Cold Spring Harbor, NY, 1989.
39,227 citations: Lowry O.H. et al., "Protein measurement with the Folin phenol reagent." Journal of Biological Chemistry, v. 193, pp. 265-275, 1951.
37,571 citations: "SAS/STAT User's Guide." SAS Institute, Cary, NC, 1989.
37,405 citations: "Diagnostic and Statistical Manual of Mental Disorders." Washington, DC: American Psychiatric Association, 1994.
35,659 citations: Sheldrick, G.M., "SHELXL-97 crystal structure refinement program." University of Göttingen, Germany, 1997.
Conclusions: data mining
 Data mining tools are in the hands of the end users
 Technology is enabling researchers to complete complicated tasks in a matter of a few clicks
 General search engines are not the answer
Evaluating scientific research output
Why is evaluation so important?
Case study – evaluating an author
43
Why do we evaluate scientific output?
Who evaluates: government, funding agencies, institutions, faculties, libraries, researchers.
Why: funding allocations, grant allocations, policy decisions, benchmarking, promotion, collection management.
44
Criteria for effective evaluation
Objective
Quantitative
Relevant variables
Independent variables (avoid bias)
Globally comparative
45
Data requirements for evaluation
Bibliographic data:
 Broad title coverage
 Affiliation names
 Author names, including co-authors
 References
 Subject categories
 ISSN (e and print)
 Article length (page numbers)
 Publication year
 Language
 Keywords
 Article type
 Etcetera…
Metrics:
 Citation counts
 Article counts
 Usage counts
There are limitations that complicate author evaluation.
46
Data limitations
 Author disambiguation
 Normalising affiliations
 Subject allocations may vary
 Matching authors to affiliations
 Deduplication/grouping
 Etcetera
Finding and matching all the relevant information needed to evaluate authors is difficult.
47
The Challenge: finding an author
 How do you distinguish between results belonging to one author and results belonging to other authors who share the same name?
 How can you be confident that your search has captured all results for an author when their name is recorded in different ways?
 How can you be sure that names with unusual characters, such as accents, have been included – including all variants?
48
The Solution: Author Disambiguation
We have approached these problems by using the data available in the publication records, such as:
 Author names
 Affiliation
 Co-authors
 Self citations
 Source title
 Subject area
… and used this data to group articles that belong to a specific author (a sketch of the idea follows below).
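As a rough illustration of grouping by these signals, here is a Python sketch that merges two articles into the same group when they share enough evidence. The weights and threshold are invented, and the actual disambiguation algorithm is more sophisticated.

```python
# A minimal sketch of evidence-based grouping: shared co-authors, affiliation,
# source title and subject area each contribute points toward a merge.
def evidence(a, b):
    score = 0
    score += 2 * len(set(a["coauthors"]) & set(b["coauthors"]))  # shared co-authors
    score += 1 if a["affiliation"] == b["affiliation"] else 0
    score += 1 if a["source"] == b["source"] else 0
    score += 1 if a["subject"] == b["subject"] else 0
    return score


def group_articles(articles, threshold=2):
    """Greedy single-link grouping: favours precision by requiring at least
    `threshold` points of evidence before two articles are merged."""
    groups = []
    for art in articles:
        for g in groups:
            if any(evidence(art, member) >= threshold for member in g):
                g.append(art)
                break
        else:
            groups.append([art])   # no group matched: start a new one
    return groups
```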
Case Study 1:
An approach to Author Searching
50
Step 1: Searching for an author
Professor Chua-Chin Wang, National Sun Yat-sen University
Group: System-on-Chip Group
Academic specialties: integrated circuit design, communication interface circuit design, artificial neural networks
Laboratory: VLSI Design Laboratory
Office extension: 4144
Enter the name in the Author Search box.
51
Step 2: Select Professor Wang
Available information
Which author are you looking for?
52
Step 3: Details of Professor Wang
Unique Author ID & matched documents
53
No 100% recall…
The same author with different author IDs
54
Why were these not matched?
Quality above all:
 Precision (>99%) was given priority over recall (>95%)
 There was not enough information to match with sufficient certainty
 As there are many millions of authors, there will be unmatched papers and authors
55
Precision and Recall
Let D be the set of all items in the database, A the set of retrieved items, and B the set of relevant items. Then:
Recall = |A∩B| / |B|
Precision = |A∩B| / |A|
56
Precision and Recall
For the author matching described above, recall |A∩B| / |B| is about 95% and precision |A∩B| / |A| is about 99%.
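Both measures are straightforward to compute from sets; here is a small Python sketch with a toy example matching the 95%/99% figures quoted above (the data is invented for illustration).

```python
# A minimal sketch of the two formulas: A is the retrieved set, B the relevant
# set, and both measures are fractions of the overlap A∩B.
def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved) if retrieved else 0.0


def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant) if relevant else 0.0


# Toy example: 95% recall and ~99% precision.
relevant = set(range(100))              # B: 100 relevant items
retrieved = set(range(95)) | {900}      # A: 95 true hits plus 1 false positive
assert abs(recall(retrieved, relevant) - 0.95) < 1e-9
print(precision(retrieved, relevant))   # 95/96 ≈ 0.99
```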
57
Recall and precision are inversely related
As recall goes up, precision goes down; conversely, as recall goes down, precision goes up.
(Plot: precision, 0–100%, against recall, 0–100%, showing the inverse relationship.)
58
Solution: Author Feedback
59
The feedback loop includes a check by a dedicated team to ensure accuracy: the team investigates feedback requests to guarantee quality.
60
… we have matched the author to documents – now what?
Instant citation overview for an author
Evaluation Data
61
Step 4: The citation overview
Excluding self citations
62
Export to Excel for further analysis
Case Study 2:
Analyzing Dr. Liang’s research…
The H-index: an alternative approach to evaluation
65
Issues around using single-number criteria
Total number of papers
 Advantage: measures productivity
 Disadvantage: does not measure the importance or impact of papers
Total number of citations
 Advantage: measures total impact
 Disadvantage: hard to find, and may be inflated by a small number of "big hits" which may not be representative of the individual if he or she is a coauthor with many others on those papers
Citations per paper
 Advantage: allows comparison of scientists of different ages
 Disadvantage: hard to find, rewards low productivity, and penalizes high productivity
66
The Hirsch Index
The h-index measures the broad impact of an individual's work while avoiding the disadvantages of the single-number criteria above. It was proposed by Prof. J.E. Hirsch in August 2005.
A scientist has index h if h of his or her Np papers have at least h citations each, and the other (Np – h) papers have no more than h citations each.
Example calculation (Np = 11 papers, h = 5):
Rank:      1   2   3   4   5   6   7   8   9   10  11
Citations: 18  16  15  11  7   4   3   3   1   0   0
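The h-index is easy to compute from a list of citation counts; here is a short Python sketch that reproduces the worked example above.

```python
# A minimal sketch of the h-index: sort citation counts in descending order
# and find the largest rank h whose paper still has at least h citations.
def h_index(citations):
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank          # this paper still has >= rank citations
        else:
            break
    return h


# The worked example above: 11 papers, h = 5.
print(h_index([18, 16, 15, 11, 7, 4, 3, 3, 1, 0, 0]))  # -> 5
```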
67
Step 1: The Author details
Professor Ting Peng Liang
National Sun Yat-sen University, Dept. of Information Management
68
Step 2: Rank articles by cited-by count
69
Step 3: Scroll to H-index
“Dr. Liang has index 10, as 10 of the 30 papers have at least 10 citations each, and the remaining papers have no more than 10 citations each.”
70
Citation links to patents and websites
PatentCites are citations of articles that appear in official patent documents:
 US Patent Office
 European Patent Office
 World Intellectual Property Organisation
WebCites are citations of articles that appear in selected web sources:
 Theses and Dissertations (NDLTD)
 Courseware (MIT)
 Institutional Repositories (DiVA, U of Toronto, Caltech….)
71
Information on Dr. Ting Peng Liang
 Published 30 papers matched to his author ID
 288 citations received from 272 documents
 H-index = 10
 Matched to 30 co-authors
 Cited once in patents
 27 credible web citations, in high quality sources
72
Evaluating authors is at the root of any evaluation of scientific output
Starting with authors, we can aggregate upward to researchers, libraries, faculties, institutions, funding agencies and government. Aggregated data is the field of bibliometrics – the next step.
Thank you!
Questions and answers