Bell: Getting metadata to work harder

GETTING METADATA TO WORK HARDER: re-use,
standardisation and streamlining, a data archive perspective
……………………………………………………….………………………………..................................................................................................
LUCY BELL
………………………………………...
MANAGEMENT INFORMATION MANAGER
UK DATA ARCHIVE
UNIVERSITY OF ESSEX
………………………………………...
THE VALUE OF CATALOGUING, CIG 2012,
UNIVERSITY OF SHEFFIELD
10 – 11 SEPTEMBER 2012
Introduction
……………………………………………………………………………………………………………………………….……………………………..
• recent changes to 45 years’ worth of cataloguing and
indexing – and to indexing practices
• changes are large, wide-ranging – and still underway!
• we hope they will both enhance the user’s experience
and create organisational efficiencies
Themes
……………………………………………………………………………………………………………………………….……………………………..
• the UK Data Archive: what it is
• current practice: metadata schema and tools used at
the Archive
• recent internal initiatives
• generally: the problems we encountered; the solutions
we have employed
• next steps
The UK Data Archive
……………………………………………………………………………………………………………………………….……………………………..
• based at the University of Essex since 1967
• curator of the largest collection of digital data in the
social sciences and humanities in the UK
• holds several thousand datasets relating to society,
both historical and contemporary, making these
available via its services:
• UK Data Service from October 2012
• previously, the Economic and Social Data Service
(ESDS)
• it is a place of national deposit for The National
Archives
• www.data-archive.ac.uk / (www.esds.ac.uk)
The UK Data Archive: current cataloguing standards
……………………………………………………………………………………………………………………………….……………………………..
• the Archive provides access to over 5000 digital data
collections
• all of these items are catalogued at study level, and
many at variable level
• using the de facto standard data cataloguing schema,
DDI (Data Documentation Initiative, see
http://www.ddialliance.org/)
• currently, the Archive uses:
• DDI 2.1 (now known as DDI-C, for codebook)
• the Humanities and Social Science Electronic Thesaurus
(HASSET), © University of Essex, based on the UNESCO Thesaurus
• internally-controlled authority lists and CVs
HASSET
……………………………………………………………………………………………………………………………….……………………………..
• multidisciplinary thesaurus developed to support the UK
Data Archive collection
• coverage in the core subject areas of social science
disciplines
• uses standard hierarchical relationships: TT (top term); BT
(broader term); NT (narrower term); RT (related term) etc.
• role of HASSET in the Archive is twofold:
• used internally for indexing studies and series with HASSET
terms
• also a separate product licensed to others
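The relationship types listed above can be sketched as a small data structure. This is a minimal illustration in Python, not the Archive's implementation: the class and function names are hypothetical, the example terms are invented for illustration rather than taken from HASSET, and the hierarchy is assumed to contain no cycles.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    """One thesaurus term and its standard relationships."""
    label: str
    broader: list = field(default_factory=list)   # BT
    narrower: list = field(default_factory=list)  # NT
    related: list = field(default_factory=list)   # RT


def add_broader(thesaurus, child, parent):
    """Record a BT link and its inverse NT link together."""
    for label in (child, parent):
        thesaurus.setdefault(label, Term(label))
    thesaurus[child].broader.append(parent)
    thesaurus[parent].narrower.append(child)


def add_related(thesaurus, a, b):
    """Record a symmetric RT link between two terms."""
    for label in (a, b):
        thesaurus.setdefault(label, Term(label))
    thesaurus[a].related.append(b)
    thesaurus[b].related.append(a)


def top_terms(thesaurus, label):
    """Follow BT links upwards and return the top term(s) (TT)."""
    term = thesaurus[label]
    if not term.broader:
        return {label}
    found = set()
    for parent in term.broader:
        found |= top_terms(thesaurus, parent)
    return found


# Illustrative terms only (assumed, not quoted from HASSET).
thesaurus = {}
add_broader(thesaurus, "HIGHER EDUCATION", "EDUCATION")
add_broader(thesaurus, "UNIVERSITIES", "HIGHER EDUCATION")
add_related(thesaurus, "UNIVERSITIES", "STUDENTS")

print(top_terms(thesaurus, "UNIVERSITIES"))
```

Storing BT and NT as inverse pairs means a study indexed with a narrow term can still be retrieved, or grouped, under its top term.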
Significant recent metadata/indexing developments
……………………………………………………………………………………………………………………………….……………………………..
1. May – October 2010: a review was carried out of the
UK Data Archive’s resource discovery tools.
• 2011: a project was started to apply the review’s results
to the Archive’s resource discovery applications.
2. 2011 onwards: work was started to move from the
DDI-C to the DDI-L (for lifecycle) metadata schema.
3. June 2012 – January 2013: SKOS-HASSET, a JISC-funded
project, is being undertaken to apply SKOS to
HASSET and to test its automated indexing capacity
Shared requirements…
……………………………………………………………………………………………………………………………….……………………………..
• it became clear that most of these initiatives were
pointing at one thing:
The need for more controlled, and harder-working, metadata
1. Resource discovery review
……………………………………………………………………………………………………………………………….……………………………..
• How do researchers find data?
• trends in information-seeking behaviour show that users
prefer simple, Google-like interfaces…
• …but which still return acutely-focused and highly-relevant results.
• the look and feel of the interfaces should be simple but
the results must achieve academic rigour.
Result of the review: the metadata conundrum
……………………………………………………………………………………………………………………………….……………………………..
• for data services to produce simple interfaces - which
still return highly-relevant results - metadata are
required which are both:
• extremely powerful
• increasingly invisible
• a conceptual shift has taken place: the work to focus
searches has moved behind the interface
The previous Archive search context
……………………………………………………………………………………………………………………………….……………………………..
[Diagram: the previous search context, made up of 21 separate interfaces]
• SEARCH: ESDS Data Catalogue; Variable Search (ESDS Data Catalogue and ESDS Government); Survey Question Bank; CESSDA catalogue; RELU-DSS; Census data catalogue; SDS; UKDA-Store; HDS
• SEARCH (Data exploration): Nesstar; Quali Online
• BROWSE: Major Studies; Subject Headings; New releases; Thematic pages
• service search interfaces: ESDS Qualidata (search and free text search); ESDS Government (search and Survey Finder); ESDS International; ESDS Longitudinal (with comparable indicators and comparable geography)
• lists of publications citing data, for ESDS Government, ESDS International and ESDS Longitudinal
• HASSET and other CVs may be used in the majority of search and browse activities.
The vision: use CVs to enhance the user’s experience
……………………………………………………………………………………………………………………………….……………………………..
• We wanted:
• a single search interface
• the ability to move seamlessly from one type of resource
to another:
• via faceted browsing and
• directly from within each resource type
• This required:
• cross-referencing data collections with publications, with
research outputs, with support guides, with case studies
using metadata
• Many controlled vocabularies!
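A minimal sketch of why faceted browsing depends on controlled values: facet counts only group correctly when every record uses exactly the same string for the same concept. The records, field names and function names below are hypothetical and written in Python for illustration; they are not the Archive's data model.

```python
from collections import Counter

# Hypothetical records of different resource types sharing common facets.
records = [
    {"title": "Study A", "resource_type": "data collection",
     "country": "United Kingdom", "kind_of_data": "Numeric"},
    {"title": "Study B", "resource_type": "data collection",
     "country": "United Kingdom", "kind_of_data": "Textual"},
    {"title": "Paper C", "resource_type": "publication",
     "country": "United Kingdom"},
    {"title": "Case study D", "resource_type": "case study",
     "country": "France"},
]


def facet_counts(items, facet):
    """Count how many items carry each value of one facet."""
    return Counter(item[facet] for item in items if facet in item)


def apply_filters(items, **selected):
    """Keep only the items matching every selected facet value."""
    return [
        item for item in items
        if all(item.get(name) == value for name, value in selected.items())
    ]


type_counts = facet_counts(records, "resource_type")
narrowed = apply_filters(
    records, country="United Kingdom", resource_type="data collection"
)
kind_counts = facet_counts(narrowed, "kind_of_data")

print(type_counts)
print([item["title"] for item in narrowed])
print(kind_counts)
```

If one record said "UK" and another "United Kingdom", the country facet would split into two entries, which is the problem the controlled vocabularies are meant to remove.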
The result: single faceted search/browse interface
……………………………………………………………………………………………………………………………….……………………………..
• We are moving from this:
• To this:
Facets needing controlled vocabularies
……………………………………………………………………………………………………………………………….……………………………..
• Some were already in a fit state:
• Depositor (existing authority list)
• Country (existing authority list)
• Others needed mapping to high levels:
• Subject categories (116 categories mapped to 21 top
terms)
• Many were populated with freetext:
• Observation unit
• Spatial unit
• Kind of data
• Time method
Freetext to controlled vocabularies mapping
……………………………………………………………………………………………………………………………….……………………………..
• Mapping freetext values to controlled values
(all metadata held in SQL tables)
• Same principles for all:
• Obtain dump of metadata and manipulate in Excel
• Identify CV to be used
• Use Google Refine to identify existing, similar, freetext
entries
• Re-export into Excel and apply mapping (at item level
or, if possible, at value level)
• CVs to be used in the future
• So far, it has taken 2 staff members, working c. 0.4 FTE,
4 months to clean 3 elements
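The "identify similar freetext entries" and "apply mapping at value level" steps can be sketched in Python. This is an illustration only: the `fingerprint` function is a simplified key in the spirit of key-collision clustering, not Google Refine's own code, and the freetext values are invented. The target terms come from the DDI CVG list for unit of observation.

```python
import re
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a freetext value to a key that ignores case, accents,
    punctuation, word order and repeated words."""
    text = unicodedata.normalize("NFKD", value.strip().lower())
    text = text.encode("ascii", "ignore").decode("ascii")
    text = re.sub(r"[^\w\s]", " ", text)  # punctuation becomes a space
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group the distinct freetext values that share a fingerprint."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return dict(groups)


# Invented freetext entries, as they might appear in a metadata dump.
freetext = [
    "Individuals", "individuals.", " Individuals",
    "Households; families", "Families, households",
    "Organisations",
]
clusters = cluster(freetext)

# A person decides which controlled term each cluster maps to.
cv_for_cluster = {
    "individuals": "Individuals",
    "families households": "Families/households",
    "organisations": "Organizations",
}

# Value-level mapping; values in an unmapped cluster map to None
# so that they can be reviewed by hand at item level.
value_map = {v: cv_for_cluster.get(fingerprint(v)) for v in freetext}

for old, new in value_map.items():
    print(repr(old), "->", new)
```

Clustering only proposes candidates; the choice of controlled term for each cluster remains a cataloguing decision.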
The mappings
……………………………………………………………………………………………………………………………….……………………………..
• Spatial unit <geogUnit>
• Previous Archive project, U.Geo, had created a spatial unit CV
• 653 unique values, now mapped to 194
• This has now been used for all items:
The mappings
……………………………………………………………………………………………………………………………….……………………………..
• Unit of observation <anlyUnit>
• 183 unique values, now mapped to 11, using DDI
CVG recommended list:
• Individuals
• Organizations
• Families/households
• Housing Units
• Events/Processes
• Geographic Units
• Time Units
• Text units
• Groups
• Objects
• Other
The mappings
……………………………………………………………………………………………………………………………….……………………………..
• Kind of data <dataKind>
• 294 unique values, now mapped to 7:
• Alpha-numeric
• Audio
• GIS
• Image
• Numeric
• Textual
• Video
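As an illustration of what these mappings do to a record, the Python sketch below rewrites the `<anlyUnit>` and `<dataKind>` elements of a simplified DDI-C-style fragment. The fragment, its freetext values and the mapping table are assumptions made for the example (the namespace is omitted and the nesting is reduced). The slides state that the Archive's metadata are held in SQL tables, so this is not a description of the actual process.

```python
import xml.etree.ElementTree as ET

# Simplified, hypothetical fragment; real DDI-C records are namespaced
# and contain far more elements.
FRAGMENT = """
<codeBook>
  <stdyDscr>
    <stdyInfo>
      <sumDscr>
        <anlyUnit>individuals (adults)</anlyUnit>
        <geogUnit>Local authority districts</geogUnit>
        <dataKind>survey data, numeric</dataKind>
      </sumDscr>
    </stdyInfo>
  </stdyDscr>
</codeBook>
"""

# Hypothetical freetext-to-CV mapping, keyed by element name.
VALUE_MAP = {
    "anlyUnit": {"individuals (adults)": "Individuals"},
    "dataKind": {"survey data, numeric": "Numeric"},
}


def apply_cv(root, value_map):
    """Replace mapped freetext values and report every change made."""
    changes = []
    for tag, mapping in value_map.items():
        for element in root.iter(tag):
            old = (element.text or "").strip()
            if old in mapping:
                element.text = mapping[old]
                changes.append((tag, old, mapping[old]))
    return changes


root = ET.fromstring(FRAGMENT.strip())
changes = apply_cv(root, VALUE_MAP)

for tag, old, new in changes:
    print(f"{tag}: {old!r} -> {new!r}")
```

Returning the list of changes keeps the original freetext available, so it can be stored elsewhere rather than discarded.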
The mappings
……………………………………………………………………………………………………………………………….……………………………..
• More to come…
• Method of data collection
• Access/restrictions (Secure data; standard access
conditions etc.)
• Method of access (Explore online or download)
• Faceted search/browse will be released as a beta in
late 2012
• More development will occur during its beta phase
following user feedback
2. Metadata schema: DDI-C to DDI-L
……………………………………………………………………………………………………………………………….……………………………..
• Simultaneously, the Archive has been preparing for
the move from DDI-C to DDI-L
• DDI-C is similar to a traditional metadata schema
• DDI-L is more flexible – to the benefit of users:
• permits data as well as metadata to be encoded
• captures survey lifecycles
• gives users a fully-rounded view of a survey from inception
to results
• broad and flexible, allowing groupings to be made – re-use
is key
• to support all this, it requires CVs to be used in several
elements (the DDI Alliance Controlled Vocabularies Group
is working on these)
3. CVs for organisational efficiency: SKOS application
……………………………………………………………………………………………………………………………….……………………………..
• JISC project: SKOS-HASSET
• 8 months (June 2012 – January 2013)
• part of the JISC Research Tools Programme
• Multi-disciplinary project team:
• Information Scientists, Data/text Mining Programmer,
Linguist, RDF specialist, Developers
SKOS-HASSET
……………………………………………………………………………………………………………………………….……………………………..
• three aims:
• apply SKOS to HASSET – making the thesaurus more
flexible
• improve its online presence
• test its automated indexing capabilities; corpora:
• questions
• questionnaires
• abstracts
• publications
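To give a sense of what "applying SKOS" to a thesaurus involves, the Python sketch below writes a few concepts out as SKOS in Turtle syntax, using the standard `skos:prefLabel`, `skos:broader` and `skos:related` properties. The concepts, their identifiers and the `http://example.org/thesaurus/` namespace are placeholders; the output is not the published SKOS-HASSET data.

```python
BASE = "http://example.org/thesaurus/"  # placeholder namespace (assumption)

# Illustrative concepts only, keyed by a made-up identifier.
concepts = {
    "education": {
        "prefLabel": "EDUCATION", "broader": [], "related": [],
    },
    "higher-education": {
        "prefLabel": "HIGHER EDUCATION",
        "broader": ["education"], "related": ["students"],
    },
    "students": {
        "prefLabel": "STUDENTS",
        "broader": [], "related": ["higher-education"],
    },
}


def to_turtle(items, base=BASE):
    """Serialise thesaurus concepts as SKOS in Turtle syntax."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        f"@prefix ex: <{base}> .",
        "",
    ]
    for concept_id, concept in items.items():
        label = concept["prefLabel"]
        statements = ["a skos:Concept", f'skos:prefLabel "{label}"@en']
        statements += [f"skos:broader ex:{b}" for b in concept["broader"]]
        statements += [f"skos:related ex:{r}" for r in concept["related"]]
        lines.append(f"ex:{concept_id} " + " ;\n    ".join(statements) + " .")
    return "\n".join(lines)


turtle = to_turtle(concepts)
print(turtle)
```

Once the BT/NT/RT relationships are expressed this way, any RDF-aware tool can read the thesaurus, which is one sense in which SKOS makes it more flexible.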
SKOS-HASSET
……………………………………………………………………………………………………………………………….……………………………..
• Progress so far:
• SKOS has been applied to HASSET
• Texts prepared for the automated indexing case study
• Manual indexing of questions, to provide a gold standard, is
taking place
• TF/IDF, KEA and WEKA all being used for term
extraction – work underway
• Next steps:
• SKOS product licensing
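For readers unfamiliar with TF/IDF, the Python sketch below shows the general idea behind using it for term suggestion. It scores each word in a text by how frequent it is in that text and how rare it is across the corpus, then keeps only words that are also in a controlled vocabulary. The questions and the three-word vocabulary are invented, only single-word terms are handled, and this is not the project's pipeline, which also involves KEA and WEKA.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case a text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(documents):
    """Return one {word: score} dictionary per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)
    # Number of documents in which each word occurs at least once.
    doc_freq = Counter(w for tokens in tokenized for w in set(tokens))
    scores = []
    for tokens in tokenized:
        if not tokens:
            scores.append({})
            continue
        counts = Counter(tokens)
        scores.append({
            word: (count / len(tokens)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def suggest_terms(doc_scores, vocabulary, limit=3):
    """Suggest the highest-scoring words that are also controlled terms."""
    candidates = {
        w: s for w, s in doc_scores.items() if w in vocabulary and s > 0
    }
    return sorted(candidates, key=candidates.get, reverse=True)[:limit]


# Invented survey questions and a toy vocabulary (assumptions).
questions = [
    "How many hours of paid employment did you do last week?",
    "How would you rate your general health over the last year?",
    "Does your health limit the kind of employment you can do?",
]
vocabulary = {"employment", "health", "income"}

scores = tf_idf(questions)
suggestions = [suggest_terms(s, vocabulary) for s in scores]
for question, terms in zip(questions, suggestions):
    print(terms, "<-", question)
```

Suggestions like these would then be compared with the manually indexed gold standard to judge how well automated indexing performs.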
SKOS-HASSET
……………………………………………………………………………………………………………………………….……………………………..
• Communication:
• SKOS-HASSET blog: http://hassetukda.wordpress.com/
• [email protected] email list
• Project web site: http://www.data-archive.ac.uk/find/ourprojects/skos-hasset
• Webinar planned for the winter
• User guidance
• Please contribute, give feedback!
Developments… to issues … to improvements
……………………………………………………………………………………………………………………………….……………………………..
• For users:
• the faceted search/browse interface exposed a lack
of standardisation in the underlying metadata
• …freetext terms have been used over 45 years; these
are now being standardised
• ...rich freetext metadata has not been lost
• the move from DDI-C (DDI 2.1) to DDI-L (DDI 3.1)
brings in a conceptually different type of schema to the
users’ benefit…
• …but which also requires more controlled vocabularies
Developments… to issues … to improvements
……………………………………………………………………………………………………………………………….……………………………..
• For us:
• Applying more CVs will provide efficiencies:
• ...the Archive wants to introduce an online deposit form for
its depositors which will include CV dropdowns
• ...create more ways of suggesting terms for the cataloguers
• SKOS gives the opportunity to work more flexibly with
the thesaurus
• …automated indexing using CVs is being tested
• ...SKOS will allow for easier future thesaurus
development
The future: analysis and reporting enhanced
……………………………………………………………………………………………………………………………….……………………………..
[Diagram: the future metadata workflow, with analysis and reporting enhanced]
• Web deposit form captures more and more controlled metadata from depositors
• Input programs automatically generate SN user guides and title pages
• Manual metadata created, auto metadata checked; record completed with descriptors
• Additional metadata created through text mining; geographic coordinates
• Metadata record
• User queries database
• Related terms automatically searched and ‘similar’ results returned ‘just-in-time’
• Metadata results returned
• User questioned about usefulness of results
• Results of user quality evaluation of search analysed
• Search and browse activity monitored to inform data acquisition
• Evaluation of metadata systems
• Management Information: analysis and reporting and future acquisitions decisions supported
Conclusion
……………………………………………………………………………………………………………………………….……………………………..
• we all NEED metadata so that we can find stuff
• there is too much stuff (or not enough bodies) to
create all the metadata ourselves in time these days
• searchers/users often expect the applications to do the
work for them
• use the tools at our disposal to make this happen by:
• employing more CVs where appropriate
• sharing and using RDF-enabled CVs
• and, crucially, continuing the creation of quality-assured
metadata using fewer resources
Conclusion
……………………………………………………………………………………………………………………………….……………………………..
• JISC Intrallect report; quotation from Vic Lyte:
• “A new researcher wishing to approach scholarly
inquiry to determine the impact of global warming on
penguin populations in South Antarctica doesn’t walk
up to a Librarian and shout ‘Penguins!’.”
(Duncan, C. & Douglas, P. (2009). Automatic metadata generation: use
cases and tools/priorities. Intrallect (for JISC).)
CONTACT
……………………………………………………………………………………………………………………………….……………………………..
UK DATA ARCHIVE
UNIVERSITY OF ESSEX
WIVENHOE PARK
COLCHESTER
ESSEX CO4 3SQ
……..……………………………….…..
T +44 (0)1206 872001
E [email protected]
www.data-archive.ac.uk