Transcript Powerpoint

From research data to new knowledge: a lifecycle approach.
Dr Liz Lyon, Director
UKOLN, University of Bath, UK
JISC/SURF/CNI Conference May 2005, Amsterdam.
UKOLN is supported by:
www.ukoln.ac.uk
a centre of expertise in digital information management
www.bath.ac.uk
Overview
1. Scholarly communications in flux
2. e-Research and the diversity of data
3. Repositories & meta-functionality
• Realising the link to learning: eBank UK
• Providing value-added services
• Enabling knowledge extraction & post-processing
4. Look at (some of) the issues en route
JISC/SURF/CNI Conference May 2005
1. Scholarly communications in flux
A medieval scriptorium…
The scholarly knowledge cycle (Liz Lyon, Ariadne, July 2003). Diagram labels:
• Presentation services: subject, media-specific, data, commercial portals
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Data analysis, transformation, mining, modelling
• Research & e-Science workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: national, commercial
• Searching, harvesting, embedding
• Resource discovery, linking, embedding
• Validation
• Publication
• Peer-reviewed publications: journals, conference proceedings
Diagram (learning & teaching workflows). Labels:
• Presentation services: subject, media-specific, data, commercial portals
• Learning object creation, re-use
• Learning & Teaching workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: national, commercial
• Searching, harvesting, embedding
• Resource discovery, linking, embedding
• Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
• Validation
• Peer-reviewed publications: journals, conference proceedings
• Quality assurance bodies
Diagram (research & e-Science and learning & teaching workflows together). Labels:
• Presentation services: subject, media-specific, data, commercial portals
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Data analysis, transformation, mining, modelling
• Research & e-Science workflows
• Learning object creation, re-use
• Learning & Teaching workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: national, commercial
• Searching, harvesting, embedding
• Resource discovery, linking, embedding
• Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
• Validation
• Publication
• Peer-reviewed publications: journals, conference proceedings
• Quality assurance bodies
2. e-Research and the diversity of data
Assuring permanent open access to the records of science & the humanities?
Long-term access to primary data
• Increasing data volumes from e-Science and Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven science, “big science”
• Observational data, simulations, large-scale experimentation, computations
• Multi-media resources, statistical data, surveys, geo-spatial data…
Diversity of data collections
• Very large, relatively homogeneous: Large Hadron Collider (LHC) outputs from CERN
• Smaller, heterogeneous and richer collections: World Data Centre for Solar-terrestrial Physics, CCLRC
• Small-scale laboratory results: “jumping robots” project at the University of Bath
• Population survey data: UK Biobank
• Highly sensitive, personal data: patient care records
Taxonomy of data collections
• Research collections: jumping robots
• Community collections: FlyBase at Indiana (with UC Berkeley)
• Reference collections: Protein Data Bank
Source: NSF Long-Lived Digital Data Collections, draft report, March 2005
Taxonomy of data collections
• Research collections: jumping robots
• Community collections: FlyBase at Indiana (with UC Berkeley)
• Reference collections: Protein Data Bank
Evolution…
Source: NSF Long-Lived Digital Data Collections, draft report, March 2005
Repository evolution:
• 1971: Research collection, <12 files
• 2005: Reference collection, >2700 structures deposited in 6 months
1. Issues: research data as content
• Sharing it!
• Data diversity
– Homo- or heterogeneous
– Raw and derived / processed
– Sensitivity
– Fast or slow growth in volume
• Repository evolution:
– Likelihood to scale up (from bytes to petabytes)
– Quality assurance (from the start)
– Community-based standards development (“folksonomies”)
– Build robust services
3. Repositories & meta-functionality
eBank UK: linking research data to learning
• JISC-funded September 2003, Phase 2 February 2005
• UKOLN at the University of Bath (lead), University of Southampton, University of Manchester
• Exemplar: e-Science testbed ‘Combechem’
– Grid-enabled combinatorial chemistry
– Crystallography, laser and surface chemistry examples
– Development of an e-Lab using pervasive computing technology
– National Crystallography Service
• Resource Discovery Network / PSIgate physical sciences portal
• http://www.ukoln.ac.uk/projects/ebank-uk/
Diagram (the cycle with eBank UK as the aggregator service). Labels:
• Presentation services: subject, media-specific, data, commercial portals
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Data analysis, transformation, mining, modelling
• Research & e-Science workflows
• Learning object creation, re-use
• Learning & Teaching workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: eBank UK
• Searching, harvesting, embedding
• Resource discovery, linking, embedding
• Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
• Validation
• Publication
• Peer-reviewed publications: journals, conference proceedings
• Quality assurance bodies
Data Flow in eBank UK
Diagram labels:
• Create; Submit (data files, metadata)
• Institutional repository: Store/link; present (HTML)
• Harvest (XML) via OAI-PMH
• eBank aggregator: Index and Search; present (HTML)
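The harvest step in this data flow names OAI-PMH, so a minimal sketch of what an aggregator does at that point may help. It builds a standard ListRecords request and extracts Dublin Core fields from a response. The endpoint URL, the sample record and the function names are illustrative assumptions, not details of the eBank UK implementation; only the verb, the oai_dc prefix and the XML namespaces come from the OAI-PMH 2.0 and Dublin Core specifications.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by the OAI-PMH 2.0 and Dublin Core specifications.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, shaped like an OAI-PMH ListRecords reply (oai_dc format).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:repository.example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure dataset</dc:title>
          <dc:creator>A. Researcher</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for a repository endpoint."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return base_url + "?" + query


def parse_records(xml_text):
    """Extract identifier, titles and creators from a ListRecords response."""
    root = ET.fromstring(xml_text.strip())
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier",
                                       default="", namespaces=NS),
            "titles": [e.text for e in rec.iterfind(".//dc:title", NS)],
            "creators": [e.text for e in rec.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    # The base URL is a placeholder; a real harvester would fetch this URL.
    print(list_records_url("http://repository.example.org/oai"))
    for record in parse_records(SAMPLE_RESPONSE):
        print(record)
```

A real harvester would also follow resumption tokens and handle protocol errors; both are left out here to keep the sketch short.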
Comb-e-Chem Project
Diagram labels: Video, Simulation, Diffractometer, Properties, Analysis, Structures Database, X-Ray e-Lab, Properties e-Lab, Grid Middleware
The digital repository
ecrystals.chem.soton.ac.uk
Acknowledgement: Simon Coles
Access to the underlying data
Harvesting: OAIster
Aggregating: search & discover
Linking to publications
eBank embedded in a science portal
eBank Phase 2: linking to learning
• Embedding in e-Learning processes
• Evaluating the pedagogical benefits
– MChem course
– Chemical informatics course
2. Issues: generic data models, metadata schema & terminology
• Validation against other schema
– CCLRC Scientific Data Model Version 2
• Complex digital objects and packaging options
– METS
– MPEG 21 DIDL
• Terminologies
– Domain: crystallography
– Inter-disciplinary e.g. biomaterials
– Metadata enhancement: subject keyword additions to datasets based on knowledge of keywords in related publications
– Meaningful resource discovery?
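The metadata enhancement idea in the list above can be illustrated with a small sketch: a dataset record gains the subject keywords of the publications it is linked to. The record layout, field names and sample values are hypothetical and chosen only for illustration; the slides do not describe how eBank UK performs this step.

```python
def enhance_keywords(dataset, publications):
    """Return the dataset's keywords plus keywords from its related publications.

    Matching is case-insensitive, and the dataset's own keywords come first.
    """
    enhanced = list(dataset["keywords"])
    seen = {keyword.lower() for keyword in enhanced}
    for publication in publications:
        # Only publications explicitly linked to the dataset contribute keywords.
        if publication["id"] not in dataset["related_publications"]:
            continue
        for keyword in publication["keywords"]:
            if keyword.lower() not in seen:
                seen.add(keyword.lower())
                enhanced.append(keyword)
    return enhanced


# Hypothetical records, for illustration only.
dataset = {
    "id": "dataset-1",
    "keywords": ["crystallography"],
    "related_publications": ["pub-1"],
}
publications = [
    {"id": "pub-1", "keywords": ["Crystallography", "biomaterials"]},
    {"id": "pub-2", "keywords": ["climate"]},
]

if __name__ == "__main__":
    print(enhance_keywords(dataset, publications))
```

Whether keywords borrowed this way make resource discovery more meaningful is the open question the slide raises; the sketch only shows the mechanics.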
3. Issues: linking and identifiers
• Links to individual datasets within an experiment
• Links to all datasets associated with an experiment or a data collection
• Links to derived eprints and published literature
• Context sensitive linking: find me
– Datasets by this author / creator
– Datasets related to this subject
– Learning objects by this author / creator
– Learning objects related to this subject
• Identifiers and persistence
– “generic”
– domain: International Chemical Identifier (InChI code)
• Resource discovery: Google Scholar?
• Provenance: authenticity, authority, integrity?
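The four "find me" queries listed above share one shape: filter resources by type, then by creator or subject. A minimal sketch under that assumption follows; the Resource fields, the identifiers and the find_me function are hypothetical and do not represent any eBank UK interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Resource:
    identifier: str       # hypothetical persistent identifier
    kind: str             # "dataset" or "learning object"
    creator: str
    subjects: tuple = ()


def find_me(resources, kind, creator=None, subject=None):
    """Return identifiers of resources of one kind, filtered by creator and/or subject."""
    results = []
    for resource in resources:
        if resource.kind != kind:
            continue
        if creator is not None and resource.creator != creator:
            continue
        if subject is not None and subject not in resource.subjects:
            continue
        results.append(resource.identifier)
    return results


# Hypothetical catalogue, for illustration only.
CATALOGUE = [
    Resource("ds-1", "dataset", "A. Researcher", ("crystallography",)),
    Resource("ds-2", "dataset", "B. Chemist", ("crystallography", "biomaterials")),
    Resource("lo-1", "learning object", "A. Researcher", ("crystallography",)),
]

if __name__ == "__main__":
    print(find_me(CATALOGUE, "dataset", creator="A. Researcher"))
    print(find_me(CATALOGUE, "learning object", subject="crystallography"))
```

In practice the creator and subject values would need the persistent identifiers and shared terminologies that the surrounding bullets discuss; plain string matching is used here only to keep the example self-contained.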
4. Issues: embedding and workflow
• Into the crystallographic publishing community
– International Union of Crystallography
• Into the chemistry research workflow
– SMART TEA Digital Lab Book e-synthesis Lab
– Other analytical techniques and instrumentation
• Into the curriculum and e-Learning workflows
– MChem course
– Undergraduate Chemical Informatics courses
Repositories and digital curation
• Data preservation: static; for later use?
• Data curation: dynamic; in use now (and the future)?
“maintaining and adding value to a trusted body of digital information for current and future use”
Provide value-added services
Annotation
• e-Lab books (Smart Tea Project in chemistry)
• Gene and protein sequences
Enable “post-processing” and knowledge extraction
The acquisition of newly-derived information and knowledge from repository content
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)
5. Issues: “knowledge services”
• Layered over repositories
– Annotation
– Mining, modelling, analysis
– Visualisation
• Across multiple repositories
– Grid enabled applications
– Highly distributed, dynamic and collaborative
• Associated with curatorial responsibility
– UK Digital Curation Centre: http://www.dcc.ac.uk
Issues summary
1. Research data is diverse, increasing rapidly in volume and complexity
2. Repository collections are dynamic and evolve
3. Technical challenges associated with interoperability, persistence, provenance, resource discovery and infrastructure provision
4. Embedding in workflow is critical: scholarly communications, research practice, learning
5. Knowledge extraction tools will generate new discoveries based on repository content
6. Repository solutions must scale: M2M processing will become the norm…