Transcript Example

The Natural History Museum
http://www.nhm.ac.uk
Speaker:
Charles Hussey
Science Data Co-ordinator
Department of Information
and Library Systems
© The Trustees of The Natural History Museum, 2002
Data Access challenges and opportunities
Move towards networks connecting distributed sources
Two components to this presentation
Start by drawing upon work for European Natural History
Specimen Information Network
(Personal view of what is achievable)
Then look at some of the approaches we have taken
within The NHM
Acknowledgements
Nicolas Bailly, MNHN Paris, ENHSIN
David Gee, originator of DSML
Dilshat Hewzullah, NHM, DSML & Querying distributed databases
Anne Hume, NHM, Online databases and DSML
Andrew Jones, University of Cardiff, SPICE for Species 2000
Mike Lowndes, NHM, Museum Information Locator System
Rachel Perkins, NHM, Collections Level Descriptions
Mike Sadka, NHM, Fast-track programme
Darrell Siebert, NHM, Fish Collection Database
Chris Sleep, NHM, DSML
Neil Thomson, NHM, BioCASE
Nature of Data
What do we have to deal with?
First Challenge: Integrating disparate sources
NHM Survey in 2000: 87 institutions responded:
33 different products; 40% using bespoke solutions;
5 using spreadsheets
BioCISE Survey in 1998/99: 292 institutions responded:
60 different products; 75% using bespoke solutions;
Only 8% providing web access to unit level data
Nature of Data
First Challenge: Integrating disparate sources
Do data providers have the means to:
1. Implement and maintain a local Internet server
providing 24-hour-a-day access?
2. Compile metadata (collections level or unit level)?
3. Supply additional data (such as resolving localities or
providing elements of higher taxonomy)?
4. Maintain the quality of datasets?
5. Construct views of their data or implement wrappers?
6. Handle version control?
Nature of Data
Second Challenge: Comparing like with like
1. Authorities for names
2. Personal names
3. Geographic co-ordinates
4. Place names
5. Language and spelling
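To illustrate what "comparing like with like" can involve for personal names and for language and spelling, here is a minimal Python sketch. The helper names `fold` and `name_key` are hypothetical, and the rules are deliberately simple assumptions: accents and case are ignored, and a personal name is reduced to a surname plus initials.

```python
import unicodedata
from typing import Tuple


def fold(text: str) -> str:
    """Lower-case, strip accents and collapse whitespace so spelling variants compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_accents.lower().split())


def name_key(name: str) -> Tuple[str, str]:
    """Reduce a non-empty personal name to (surname, initials).

    Accepts 'Surname, Forenames' or 'Forenames Surname'.
    """
    cleaned = fold(name).replace(".", " ")
    if "," in cleaned:
        surname, _, forenames = cleaned.partition(",")
    else:
        parts = cleaned.split()
        surname, forenames = parts[-1], " ".join(parts[:-1])
    initials = "".join(word[0] for word in forenames.split())
    return surname.strip(), initials
```

With these rules "Hussey, C." and "Charles Hussey" yield the same key. Multi-word surnames, titles and authority abbreviations would need extra handling, which is part of why this challenge is hard in practice.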
Architectures
1. Single client/server database used by all providers and
users
2. Central summary system
3. Central Gateway to distributed databases
4. Peer-to-peer databases
5. Web directory pointing to data sources
Architectures
1. Single client/server database used by all providers
and users
Single database, subscribers have local client
Allows detailed and complex interaction with data
Example: NHM Palaeontology Collections Management System
Example: Packages for Observers – Recorder 2000, MapMate
Architectures
2. Central summary system
Contributors maintain their own systems and post copies
of data to centrally maintained database
Example: NBN Species Dictionary
Architectures
3. Central Gateway to distributed databases
No central database
…but “Common Access System” may store metadata
Example: Species 2000
Example: Biodiversity on the Web
Biodiversity on the Web: Selection of Searchable Databases
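The gateway architecture (no central database, with only metadata held centrally) can be sketched roughly as follows. This is an illustration under stated assumptions, not how Species 2000 or Biodiversity on the Web is implemented: the `Gateway` class, the source functions and the sample record are all hypothetical, and each provider is represented by a plain Python function standing in for a wrapper around its database.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
Source = Callable[[str], Iterable[Record]]  # stands in for a wrapper around one provider


class Gateway:
    """Holds only metadata about sources; the records stay with the providers."""

    def __init__(self) -> None:
        self._sources: Dict[str, Source] = {}
        self.metadata: Dict[str, str] = {}

    def register(self, name: str, description: str, search: Source) -> None:
        self._sources[name] = search
        self.metadata[name] = description

    def search(self, taxon: str) -> List[Record]:
        """Send the query to every registered source and merge the answers."""
        results: List[Record] = []
        for name, source in self._sources.items():
            try:
                for record in source(taxon):
                    results.append({**record, "source": name})
            except Exception as err:  # one unavailable provider should not fail the query
                results.append({"source": name, "error": str(err)})
        return results


# Hypothetical providers: one that answers and one that is offline.
def fish_db(taxon: str) -> List[Record]:
    data = [{"taxon": "Salmo trutta", "unit": "fish-001"}]
    return [r for r in data if r["taxon"] == taxon]


def offline_db(taxon: str) -> List[Record]:
    raise ConnectionError("provider offline")


gateway = Gateway()
gateway.register("fish", "Hypothetical fish collection", fish_db)
gateway.register("offline", "Hypothetical unavailable provider", offline_db)
hits = gateway.search("Salmo trutta")
```

The design choice to note is that a failing provider is reported alongside the results rather than aborting the search, which matters when providers cannot guarantee 24-hour availability.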
Architectures
4. Peer-to-peer databases
Multiple Z39.50 servers and clients
Example: Species Analyst
Example: AHDS
Architectures
5. Web directory pointing to data sources
Essentially, a portal
Example: BIODIV
Other Issues
• Scalability
• Sustainability
• Access
• Quality Control
• Terminology Control
“Gaps” in data:
Still parts of collection not yet databased
Collection not suitable for databasing at unit level
Inadequate data dictionary
Data not available for a specimen
Data needs interpretation
Indicators for Quality
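One possible quality indicator for gaps of this kind is field completeness: the fraction of records in which a field actually holds a value. The sketch below is an assumption about what such an indicator could look like, not a measure defined in the presentation; the function name `completeness` and the sample records are hypothetical.

```python
from typing import Dict, Iterable, List, Optional


def completeness(
    records: List[Dict[str, Optional[str]]], fields: Iterable[str]
) -> Dict[str, float]:
    """Fraction of records in which each field holds a non-blank value."""
    total = len(records)
    report: Dict[str, float] = {}
    for field in fields:
        # A missing key, None and a blank string all count as a gap.
        filled = sum(1 for r in records if (r.get(field) or "").strip())
        report[field] = filled / total if total else 0.0
    return report


# Hypothetical unit-level records with typical gaps.
sample = [
    {"Collector": "Smith", "Place": ""},
    {"Collector": None, "Place": "Thames"},
    {"Collector": "Jones", "Place": "Thames"},
    {"Collector": "Brown"},
]
report = completeness(sample, ["Collector", "Place"])
```

A figure like this says nothing about whether the values present are correct or need interpretation, so it would be only one indicator among several.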
A Case in Point:
Wrapping a dataset for ENHSIN Pilot
• Copy table from Access to SQL Server
• Restructure table to add “new” fields
• Perform conversions:
  • Place = Waterbody + Locality (verbatim) + Site.Ref.
  • Split Collection date to DAY, MONTH, YEAR
  • Convert Lat & Long to decimal degrees
  • Convert Altitude to metres and deal with altitude ranges
  • Shape = Material + "(" + Preservation Method + ")"
  • Collector = Collector Surname + Initials + Title
  • Determiner = Determiner Surname + Initials + Title
  • Populate blank fields with static data by creating view (e.g. for
    Kingdom, Collection Name, Contact Info.)
• Delete fields not required after conversion
• Rename fields to match ENHSIN element names
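The conversions listed above can be sketched in Python as follows. This is a hedged illustration, not the SQL Server procedure actually used for the pilot. Several details are assumptions because the slide does not state them: the source date format (day/month/year), the source altitude unit (feet), the separator used when concatenating fields, the input column names, and the static Kingdom value.

```python
from datetime import datetime
from typing import Dict, Optional, Tuple

FEET_TO_METRES = 0.3048  # assumption: source altitudes are recorded in feet
STATIC_FIELDS = {"Kingdom": "Animalia"}  # hypothetical static value a view would supply


def split_date(text: str, fmt: str = "%d/%m/%Y") -> Dict[str, int]:
    """Split a collection date into DAY, MONTH, YEAR (the input format is assumed)."""
    parsed = datetime.strptime(text, fmt)
    return {"DAY": parsed.day, "MONTH": parsed.month, "YEAR": parsed.year}


def dms_to_decimal(degrees: float, minutes: float, seconds: float, hemisphere: str) -> float:
    """Convert degrees, minutes, seconds to signed decimal degrees (S and W are negative)."""
    value = degrees + minutes / 60 + seconds / 3600
    return -value if hemisphere.upper() in ("S", "W") else value


def altitude_to_metres(text: str) -> Tuple[float, float]:
    """Accept '1500' or a range such as '1000-2000' in feet; return (min, max) in metres."""
    low, _, high = text.partition("-")
    low_m = round(float(low) * FEET_TO_METRES, 1)
    high_m = round(float(high) * FEET_TO_METRES, 1) if high.strip() else low_m
    return low_m, high_m


def join_parts(*parts: Optional[str], sep: str = ", ") -> str:
    """Concatenate the non-blank parts; the separator is an assumption."""
    return sep.join(p.strip() for p in parts if p and p.strip())


def wrap(row: Dict[str, str]) -> Dict[str, object]:
    """Build the converted fields for one source row (input column names are hypothetical)."""
    alt_min, alt_max = altitude_to_metres(row["altitude"])
    wrapped: Dict[str, object] = {
        "Place": join_parts(row["waterbody"], row["locality"], row["site_ref"]),
        "Shape": f'{row["material"]}({row["preservation"]})',
        "Collector": join_parts(row["surname"], row["initials"], row["title"]),
        "AltitudeMin": alt_min,
        "AltitudeMax": alt_max,
    }
    wrapped.update(split_date(row["collection_date"]))
    wrapped.update(STATIC_FIELDS)  # populate fields that are blank in the source
    return wrapped
```

Altitude ranges are returned as a (min, max) pair so that a single value and a range share one representation. Altitudes below sea level written with a leading minus sign would need a different range parser.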
NHM Initiatives
• Imaging of Primary Sources
• Zoology Accession Ledgers
• Entomology Card indexes (VIADOCS project)
• Rapid Data Entry
• Fish Collection
• Botany Pilot
• Collections Level Description
• Darwin Centre
• Entomology Index to Collections
• Integrated Access
• Data Locator
Links
ENHSIN: http://www.nhm.ac.uk/science/rco/enhsin/index.html
SPICE Project: http://www.systematics.reading.ac.uk/spice
Biodiversity on the Web: http://www.biodiversity.org.uk/ibs/
Species Analyst: http://habanero.nhm.ukans.edu
NBN Species Dictionary: http://yaw.nhm.ac.uk/nhm/
AHDS Gateway: http://prospero.ahds.ac.uk:8080/ahds_live/
BIODIV: http://www.br.fgov.be/biodiv/
NHM Collection Level Descriptions: http://www.nhm.ac.uk/cld/index.shtml
NHM Data Locator: http://internt.nhm.ac.uk/cgi-bin/locator/
Online databases at NHM: http://www.nhm.ac.uk/science/projects.html