Transcript Slide 1

Extracting and re-using
research data
from chemistry e-theses:
the SPECTRa-T project
Peter Morgan
SPECTRa-T Project Director
Head of Medical and Science Libraries
Cambridge University Library
www.lib.cam.ac.uk/spectra-t/
ETD 2008_Morgan_The SPECTRa-T Project
www.lib.cam.ac.uk/spectra-t/
Outline
• Why SPECTRa-T?
• Getting started
• Mining the text
– PDFs
– .docx
• Workflows
• Further thoughts
Why SPECTRa-T?
“theses should be semantic and interactive”
- Peter Murray-Rust
(ETD 2007 keynote address)
SPECTRa-T background
• SPECTRa-T = Submission, Preservation, & Exposure
of Chemistry Teaching and Research data from
Theses
• SPECTRa-T funded by JISC Digital Repositories
Programme
• 1 year project (April 2007 – March 2008)
• partners:
– University of Cambridge (Chemistry + Library)
– Imperial College London (Chemistry + ICT)
• team had previously worked together on “SPECTRa”
Why SPECTRa-T?
• research chemists produce experimental data
(materials, reactions, properties = “recipes”)
• these data are the basis of further research
• theses are a rich source of data
– c.10k chemistry papers p.a. worldwide
– a typical thesis contains 50-60 preparations
– 20% will be published in research papers
– 80% are not published
Why SPECTRa-T?
• text-mining can retrieve these data
• two basic data types:
– Named Chemical Entities (NCEs) (e.g. words/phrases
describing properties, procedures, instruments, etc)
– Chemical Objects (COs) (e.g. molecules, spectra)
• our Semantic Web aim:
– extract both data types
– create RDF triples and chemical objects
– link them to enable semantic querying
RDF triples
• RDF triples are statements containing a
subject (resource), predicate (property), and
object (value)
• “water boils at 100 degrees Celsius”
• the value of one property can be used as the
resource for another
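As a minimal sketch of this idea, the snippet below models triples as plain Python tuples and follows a chain from one triple to the next. The identifiers (`water`, `boilingPoint`, `bp1`, `value`, `unit`) are illustrative assumptions, not terms from the SPECTRa-T vocabulary.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str    # the resource being described
    predicate: str  # the property of that resource
    obj: str        # the value of the property


# "water boils at 100 degrees Celsius", split so that the value of the
# first triple (bp1) is itself the resource of two further triples.
# All identifiers here are hypothetical.
triples = [
    Triple("water", "boilingPoint", "bp1"),
    Triple("bp1", "value", "100"),
    Triple("bp1", "unit", "degrees Celsius"),
]


def describe(resource: str, store: list[Triple]) -> dict[str, str]:
    """Return every property/value pair recorded for one resource."""
    return {t.predicate: t.obj for t in store if t.subject == resource}


# Follow the chain: water -> bp1 -> value and unit.
boiling_point = describe("water", triples)["boilingPoint"]
details = describe(boiling_point, triples)
```

Here `details` holds the value and unit recorded for `bp1`, which shows how the object of one statement can become the subject of others.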
Getting started
Test material
• 100 PDF chemistry theses from CalTech, MIT,
St Andrews & Stirling
– some MIT theses OCR-derived (later removed from
analysis because of misassigned characters)
• 20 Word chemistry theses from Cambridge
(converted to Office Open XML .docx mark-up
format)
Software
• OSCAR3 (Open Source Chemistry Analysis
Routines) as text-mining tool
– developed by SciBorg Project (Cambridge)
– natural language processing to identify chemical
terms
– converts human-readable text into XML marked-up
content that machines can manipulate
– prefers SciXML documents
– uses ChEBI Ontology for chemical name recognition
OSCAR3 parsing
Highlighted experimental procedures created by OSCAR3
Mining the text
PDF ...
• wraps text in simple high-level elements
• is optimized for human, not machine,
readability
• produces poor SciXML
– line breaks = loss of continuous text and paragraph structures
– chemical drawings replaced by text and disconnected lines
– loss of subscript and superscript characters
– non-printing characters
– OCR-derived text produces erroneous character assignment
(e.g. i, l, 1)
PDF processing
• SPECTRa-T tools...
– removed line-breaks
– removed non-printing characters
– removed text fragments resulting from broken
drawings
– used UTF-8 Unicode to preserve Greek characters
(lost in ASCII)
• (note: PDF/A can avoid some but not all such problems)
• text then converted to SciXML
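Two of the clean-up steps listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the SPECTRa-T tools themselves: the function name `clean_pdf_text` is hypothetical, blank lines are assumed to mark paragraph boundaries, and the removal of fragments left by broken drawings is not attempted.

```python
import re
import unicodedata


def clean_pdf_text(raw: str) -> str:
    """Tidy text extracted from a PDF before further mark-up."""
    # Drop non-printing characters (Unicode categories starting with "C":
    # control, format, and so on), keeping newlines and tabs for the
    # whitespace handling below.
    kept = "".join(
        ch for ch in raw
        if ch in "\n\t" or not unicodedata.category(ch).startswith("C")
    )
    # Assumption: a blank line separates paragraphs; a single line break
    # inside a paragraph is only a layout artefact and becomes a space.
    paragraphs = re.split(r"\n\s*\n", kept)
    joined = [" ".join(p.split()) for p in paragraphs]
    return "\n\n".join(p for p in joined if p)


raw = "The \u03b1-anomer was\x0c obtained in\n85% yield.\n\nNext paragraph\u200b here."
cleaned = clean_pdf_text(raw)

# Writing the result as UTF-8 keeps the Greek alpha; ASCII cannot hold it.
utf8_bytes = cleaned.encode("utf-8")
ascii_bytes = cleaned.encode("ascii", errors="ignore")
```

Comparing `utf8_bytes` with `ascii_bytes` shows why a Unicode encoding matters here: the ASCII version silently loses the Greek character.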
SciXML from PDF
• OSCAR retrieves Named Chemical Entities
• OSCAR creates SAFXML (Standoff Annotated
Format XML) output
• NCE metadata transformed by XSL
stylesheets into RDF triples
• RDF triplestore can be queried
BUT...
• OSCAR cannot identify Chemical Objects
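The step from annotations to triples can be illustrated roughly as follows. The project used XSL stylesheets for this transformation; the sketch swaps in Python's standard `xml.etree.ElementTree` purely for illustration. The XML layout, the predicate names (`mentions`, `hasType`, `hasSurfaceText`) and the sample entities are made-up simplifications, not the real SAFXML schema or OSCAR output.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified standoff annotations: NOT the real SAFXML schema.
ANNOTATIONS = """
<annotations document="thesis-001">
  <annotation id="a1" type="compound" start="120" end="127">toluene</annotation>
  <annotation id="a2" type="reaction" start="310" end="319">oxidation</annotation>
</annotations>
"""


def annotations_to_triples(xml_text: str) -> list[tuple[str, str, str]]:
    """Turn each annotation into (subject, predicate, object) triples."""
    root = ET.fromstring(xml_text.strip())
    doc = root.get("document")
    triples = []
    for ann in root.findall("annotation"):
        entity = f"{doc}#{ann.get('id')}"  # one identifier per annotation
        triples.append((doc, "mentions", entity))
        triples.append((entity, "hasType", ann.get("type")))
        triples.append((entity, "hasSurfaceText", ann.text))
    return triples


triples = annotations_to_triples(ANNOTATIONS)

# A very small "query": which compounds does the thesis mention?
compound_ids = {s for s, p, o in triples if p == "hasType" and o == "compound"}
compounds = [o for s, p, o in triples if p == "hasSurfaceText" and s in compound_ids]
```

A real triplestore would answer such questions through a query language rather than list comprehensions; the comprehension only stands in for that lookup.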
.docx processing
• Word theses converted to Office Open XML (.docx)
using MS Word 2007
• XML is converted into rich SciXML
• SciXML structure enables OSCAR3 to identify
“Experimental” sections and extract Chemical Objects
• XML converted to CML (Chemical Markup Language)
• URIs assigned to CO metadata & associated with NCEs
• CML COs deposited in lightweight data repository
• RDF triplestore and CO data repository, linked
by URIs, can now be queried semantically
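The linking of the two stores by URI can be sketched as below. Everything in it is an assumption made for illustration: the `example.org` URIs, the predicate names, the record fields and the function `objects_for_term` are hypothetical, and plain Python containers stand in for the real triplestore and data repository.

```python
# Stand-in for the CO data repository: chemical objects keyed by URI.
# URIs and records are hypothetical.
CO_REPOSITORY = {
    "http://example.org/co/1": {"format": "CML", "kind": "molecule", "name": "toluene"},
    "http://example.org/co/2": {"format": "CML", "kind": "spectrum", "name": "1H NMR of toluene"},
}

# Stand-in for the RDF triplestore: named chemical entities, some of which
# point at chemical objects through their URIs.
TRIPLES = [
    ("thesis-001#a1", "hasSurfaceText", "toluene"),
    ("thesis-001#a1", "hasChemicalObject", "http://example.org/co/1"),
    ("thesis-001#a1", "hasChemicalObject", "http://example.org/co/2"),
    ("thesis-001#a2", "hasSurfaceText", "oxidation"),
]


def objects_for_term(term: str) -> list[dict]:
    """Find chemical objects linked to any entity whose text matches term."""
    entities = {s for s, p, o in TRIPLES if p == "hasSurfaceText" and o == term}
    uris = [o for s, p, o in TRIPLES if p == "hasChemicalObject" and s in entities]
    # Resolve each URI against the repository, skipping dangling links.
    return [CO_REPOSITORY[uri] for uri in uris if uri in CO_REPOSITORY]


# "Which spectra are associated with toluene?"
spectra = [co for co in objects_for_term("toluene") if co["kind"] == "spectrum"]
```

The point of the design is that neither store needs to contain the other's data: the triples carry only the URI, and the repository resolves it to the full chemical object.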
Workflows
PDF workflow
[Flow diagram: THESIS → input PDF document (text) → SPECTRa-T text processing tools → SciXML → OSCAR3 → SAFXML → XSL stylesheet → RDF Triplestore (NCEs) → Query]
Processing of PDF e-theses to yield
named chemical entities in a queryable RDF Triplestore
(Text and lines in red indicate SPECTRa-T tools)
.docx workflow
[Flow diagram, NCE branch: THESIS → input .docx document (XML markup) → SPECTRa-T text processing tools → SciXML → OSCAR3 → SAFXML → XSL stylesheet → RDF → Triplestore (NCEs)]
[Flow diagram, CO branch: OSCAR3 → Data XML → CML Chemical Objects → Data Repository (COs); a URI is created for each chemical object and added as a link in the triplestore; both stores serve the Semantic Query]
Processing of DOCX e-theses to yield
named chemical entities and linked chemical objects
in a semantically queryable linked RDF triplestore and data repository
(Text and lines in red indicate SPECTRa-T tools)
Further thoughts
Caveats
• SPECTRa-T a proof-of-concept approach
• restricted to a few chemistry sub-disciplines
• investigated only 2 file formats
• dangerous to generalise too far
• but our specific observations raise questions
about broader implications ...
File formats
• PDF has some value for text-mining
• born-digital PDF is better than OCR-derived
• PDF/A will resolve some problems
• but both still contain broken text and
unreliable structure for text-mining
– (and most legacy material is still only in PDF)
• XML better at providing structured documents
for text-mining
– (and may be good for preservation as well)
Role of institutional repository
• preservation versus re-usability?
• should a central IR require both PDF and Word/XML
versions of a thesis?
• which file format(s) should be openly accessible?
– cf. UKPMC XML policy for research papers
• should subject data be held in subject-specific data
repositories managed by domain experts?
• can subject-based departmental repositories co-exist
with a central IR?
• how can librarians and repository managers understand
researchers’ needs?
IPR
• institutions can best realise the value of their
research data assets by encouraging their
discovery
• facts cannot be copyrighted
• derived data and databases raise complex legal
issues
• ownership and licensing issues need urgent
clarification
Fit for purpose?
• need to be clear why we collect theses
• are they intended to be fully re-usable?
• what does this entail for each subject?
• do librarians understand researchers?
• do thesis regulations ensure appropriate
formats and submission processes?
• do IPR policies facilitate re-use?
• in short, are our e-theses fit for purpose?
Thanks...
• thanks to my colleagues on the Project team
– at Cambridge:
• Jim Downing, Peter Murray-Rust, Diana Stewart, Alan
Tonge, Joe Townsend
– at Imperial College London:
• Matt Harvey, Henry Rzepa
• thanks to the Joint Information Systems
Committee (JISC) for funding the project
(see www.lib.cam.ac.uk/spectra-t for Final Report)
... and thanks to you for listening!