Sikes_DS_2012x - Alaska Entomological Society

Arctos at the University of Alaska Museum Insect Collection
Derek Sikes (1), Gordon Jarrell (2), Dusty McDonald (1)
(1) University of Alaska Museum, Fairbanks, AK
(2) Museum of Southwestern Biology, NM
Alaska Entomological Society
5th Annual Meeting, Anchorage, AK
27-28 Jan 2012
Major repositories using the Arctos database:
(43 collections of specimens or observations, 1.4M records)
in partnership with
which is a member of
TeraGrid – A nationwide network
of 11 supercomputing facilities
which is sponsored by
U. S. National Science Foundation’s
Office of Cyberinfrastructure
Arctos: A 15-year history

MVZ: 1995 - Hired Stan Blum to develop relational data model (following modeling
by Assoc. Systematic Collections).

MVZ: 1997 - Hired John Wieczorek to implement model (desktop application) using
Sybase and Versata. Partial implementation (e.g., no loans).

UAM: 1998-2000 - John W. migrated mammal data to Oracle, set up Versata.

UAM: 2002 - Dusty McDonald replaced Versata with ColdFusion, implemented full
model (first web-based instance, aka Arctos).

MSB: 2003 – Joined Arctos at UAM (first multi-hosting instance).

MVZ and MCZ: 2005-2007 - Implemented separate instances of Arctos at Berkeley
and Harvard (MVZ: first Postgres, then Oracle).

MVZ: 2009 - Moved hosting of data to Alaska (Virtual Private Database version).
Arctos
• Specimens (objects) - body parts, tissues, containers, etc.
• Images, media (stored at TACC)
• Projects, permits, publications
• Accessions, loans, usage
• Labels, as PDF files
• Agents, agent activity
[Diagram: components of Arctos]
Specimen Catalog: label data (and more)
Accessions
Loans, usage
Projects: contribute and/or use specimens
Citations: publications cite specimens
The rest of Cyberspace: GenBank, federated portals, BerkeleyMapper
“Media” in TeraGrid
BerkeleyMapper & Google Maps, with error circles
Breadth of Data in Arctos
Fish, amphibians, reptiles, mammals, birds and
bird eggs/nests, plants, arthropods, fossils,
molluscs AND their parasites
• Specimens and observations
• Media (images, audio, video)
• Publications, fieldnotes

Arctos is constantly evolving to incorporate new kinds
of data, e.g.:
• Better representation of non-publication
documents (fieldnotes, correspondence)
• Cultural collections (art, anthropology...)
Nearly all that is known about an object (or
observation) can be included in Arctos.
Linking specimen records to archival documentation…
ECN Session – Arthropod Collections Databases
1) What is the primary user audience? - large / small museum
management? taxonomic research? Is a dedicated IT /
programmer required? Single vs multi-user? (Annual cost?)
2) GBIF - does the database provide data to GBIF?
3) Barcoding - does the database handle batch processing of
specimens using barcodes? ('speed / ease of use')
4) Georeferencing - does it conform to the recommended 'best
practices' guide published by GBIF?
5) What is the ease / difficulty of web setup?
6) Security - can a data entry technician accidentally delete or
change (corrupt) large amounts of data? Is/are the database
server(s) protected from disaster (e.g. floods, fires)?
7) Likes / dislikes & pros/cons
1a) What is the primary user audience?
Museums / collections data management (also: observations, Federal
collections [USFWS], large private collections associated with a public institution)
1b) is a dedicated IT / programmer required?
Yes, but the IT staff are shared among all participants.
1c) Single vs multi-user?
Multi-user without practical limits.
1d) Annual cost?
Negotiated per institution based on size and maintenance
needs; currently ranging from $1,300 to $27,000.
2) GBIF - does the database provide data to GBIF?
Arctos does this automagically every minute.
3) Barcoding - does the database handle batch processing of
specimens using barcodes? ( 'speed / ease of use')
Arctos attaches barcodes to “parts.” This lets you track
things like tissues, extractions, slides and pinned bodies of
each cataloged specimen separately.
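A minimal sketch of the "barcodes on parts" idea, in Python rather than in the Arctos schema itself. The class names, catalog number, and barcode strings are hypothetical; the point is only that one cataloged specimen owns several parts and that a scanned barcode resolves to exactly one of them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Part:
    name: str     # e.g. "pinned body", "tissue", "DNA extraction", "slide"
    barcode: str  # hypothetical barcode string attached to this part


@dataclass
class Specimen:
    catalog_number: str  # hypothetical catalog number
    parts: List[Part] = field(default_factory=list)


def find_by_barcode(specimens, barcode):
    """Return (specimen, part) for a scanned barcode, or None if unknown."""
    for specimen in specimens:
        for part in specimen.parts:
            if part.barcode == barcode:
                return specimen, part
    return None


# One specimen, two separately tracked parts.
beetle = Specimen(
    "UAM:Ento:0001",
    [Part("pinned body", "UAM100001"), Part("tissue", "UAM100002")],
)
```

Scanning "UAM100002" would resolve to the tissue of that specimen, so the tissue can go out on loan while the pinned body stays in its drawer.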
4) Georeferencing - does it conform to the recommended 'best
practices' guide published by GBIF?
Arctos fully supports georeferencing "best practices," in part
because the authors of that document and of Arctos' spatial
data structure are one and the same. (John Wieczorek)
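The "error circles" shown on the maps suggest a coordinate paired with an uncertainty radius. A rough Python sketch of that idea follows; the field names are assumptions, not Arctos column names, and the distance is a plain haversine great-circle estimate on a spherical Earth.

```python
import math
from dataclasses import dataclass

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (spherical approximation)


@dataclass(frozen=True)
class Georeference:
    lat: float            # decimal degrees
    lon: float            # decimal degrees
    uncertainty_m: float  # radius of the error circle, in metres


def distance_m(a, b):
    """Great-circle (haversine) distance between two georeferences, in metres."""
    phi1, phi2 = math.radians(a.lat), math.radians(b.lat)
    dphi = phi2 - phi1
    dlmb = math.radians(b.lon - a.lon)
    h = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(h))


def could_be_same_place(a, b):
    """True when the two error circles overlap or touch."""
    return distance_m(a, b) <= a.uncertainty_m + b.uncertainty_m
```

Two records with small error circles a few hundred kilometres apart cannot be the same locality; two vague records with very large circles might be. Keeping the radius with the point lets that question be asked at all.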
5) What is the ease / difficulty of web setup?
Acquire a password. Enter data. (Arctos is only available via the web.)
Preservation of specimens
and their associated data
for perpetuity
NSF will help us get our data online but ensuring they stay online
forever is a problem that hasn’t been solved
33,090 specimens
28 institutions / private collections
736 images
4,516 bibliographic images
428 users
DMNS Arachnology Data
In-house -> NSD -> Crash -> K EMu
Database errors...
Cabinets: antiquated, wooden, damaged = unsafe
Database: home-made, weak security, mine alone, not online = unsafe
Arctos
[Diagram: components of Arctos]
Specimen Catalog: label data (and more)
Accessions
Loans, usage
Projects: contribute and/or use specimens
Citations: publications cite specimens
The rest of Cyberspace: GenBank, federated portals, BerkeleyMapper
“Media” in TeraGrid
6) Security - can a data entry technician accidentally delete or change
(corrupt) large amounts of data?
No. Data entry technicians enter data into a staging area.
Data must be vetted by someone with more access privileges
before being loaded.
All non-select transactions are audited. We can (theoretically) roll
back to any point in history, or roll any user's updates back to
any point in history. We can re-create all actions by all users.
6) Security - Is/are the database server(s) protected from disaster (eg
floods, fires)?
Yes – running a RAID array
Backups
– continuous logs to a remote NAS
– local drives
– Texas Advanced Computing Center
– San Diego Supercomputer Center
“If we lose all the nightly backups (3 tectonic plates), I'm betting
nobody will be overly worried about Arctos data.
Or breathing.” – D. McDonald
7) Likes / dislikes & pros/cons
DISLIKES:
- Learning curve fairly steep -> back to kindergarten
- Can’t customize to my heart’s content, each change must be
voted on & prioritized by other users
- Web access generally slower than I like (we are all more
critical of others than ourselves)
- Only available when networked. Field work in remote areas
requires special solutions if data are to be accessed.
- User interface is somewhat garish, clunky, industrial (but works)
LIKES:
- Rock-solid security, the data will outlive me
- Web-published
- Cutting-edge web integration (mapping, GenBank, etc)
- No responsibility on my part to maintain backups, software
updates, etc. Need only a networked computer
- Arctos programmers & designers are biologists / users who
really care about “doing it right”
6) Security - can a data entry technician accidentally delete or change
(corrupt) large amounts of data?
There are multiple roles and partitions at various levels. A data entry technician has write access to exactly one table, the
bulkloader. Additionally, one VPD limits his access to his own collection, another limits access to his own rows, and yet
another prevents him from marking records to load. In short, he can only un-do anything he's done, and then only in a
"staging area" separate from "real" data.
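The staging-area rules above can be mimicked in a few lines of Python. This is an illustration only: in Arctos the rules are described as living in the database (roles and VPD policies), not in application code, and every class and method name below is made up for the sketch.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    collection: str
    can_load: bool = False  # only reviewers may mark rows to load


@dataclass
class StagedRow:
    entered_by: str
    collection: str
    data: dict
    marked_to_load: bool = False


class Bulkloader:
    """Toy staging area; names and rules are illustrative only."""

    def __init__(self):
        self.staged = []   # rows awaiting review
        self.catalog = []  # stand-in for the "real" data

    def enter(self, user, data):
        # Rows are always stamped with the technician's own collection.
        row = StagedRow(user.name, user.collection, data)
        self.staged.append(row)
        return row

    def undo(self, user, row):
        # A technician may only remove rows they entered themselves.
        if row.entered_by != user.name:
            raise PermissionError("not your row")
        self.staged.remove(row)

    def mark_to_load(self, user, row):
        # Marking is reserved for a reviewer of the same collection.
        if not user.can_load or user.collection != row.collection:
            raise PermissionError("only a reviewer for this collection may mark rows")
        row.marked_to_load = True

    def load(self):
        # Move only the vetted rows from staging into the catalog.
        ready = [r for r in self.staged if r.marked_to_load]
        self.catalog.extend(r.data for r in ready)
        self.staged = [r for r in self.staged if not r.marked_to_load]
        return len(ready)
```

The worst a technician can do in this model is delete or garble their own unvetted rows; nothing reaches the catalog until someone with more privileges marks it.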
A similar model is used throughout Arctos. We control access at the table and row level, and can easily implement finer-grained control if such becomes necessary. Users (theoretically) get only the rights that they need, to the data they need, and only after
demonstrating that they understand those rights, all the while having full access to shared data (like agents).
Data like agents and taxonomy - things where character strings rather than data concepts matter to collections - are
trigger-protected based on usage. You can't update an agent name after it's been used as an author, for example. This is
pretty basic referential integrity, and Arctos is the only thing that has it.
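As a rough picture of "protected based on usage": once an agent name has been used as an author, renaming is refused. The Python below is a stand-in for what the slide describes as database triggers; the registry class and its methods are hypothetical.

```python
class AgentRegistry:
    """Toy model of usage-based protection for shared agent names."""

    def __init__(self):
        self.names = {}              # agent_id -> name string
        self.used_as_author = set()  # agent_ids already cited as authors

    def add(self, agent_id, name):
        self.names[agent_id] = name

    def cite_as_author(self, agent_id):
        # From this point on, the name string matters to someone else.
        self.used_as_author.add(agent_id)

    def rename(self, agent_id, new_name):
        if agent_id in self.used_as_author:
            raise ValueError("agent name is in use as an author and is locked")
        self.names[agent_id] = new_name
```

A typo can be fixed freely while nobody depends on the name; after first use, the change has to go through someone who can judge the consequences.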
Data and user rules are all handled by the RDBMS, so we can plug in forms written by other people/projects, offer SQL
command-line access, webservices, etc., without worrying too much about security or referential integrity. (Specify, for
example, cannot safely support such access as all data and access rules live in the application layer.)
All non-select transactions are audited. We can (theoretically) roll back to any point in history, or roll any user's updates
back to any point in history. We can re-create all actions by all users.
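One way to picture "audit everything, then replay": keep an append-only log of writes and rebuild any past state from it. The sketch below is a simplification under stated assumptions (in-memory, writes only, no deletes) and is not how Oracle auditing is actually implemented.

```python
class AuditedTable:
    """Toy key/value table whose every write is appended to a log."""

    def __init__(self):
        self.rows = {}
        self.log = []  # append-only list of (user, key, value)

    def write(self, user, key, value):
        self.log.append((user, key, value))
        self.rows[key] = value
        return len(self.log)  # position in history after this write

    def state_at(self, position):
        """Rebuild the table as it was after the first `position` writes."""
        state = {}
        for _user, key, value in self.log[:position]:
            state[key] = value
        return state

    def without_user(self, user):
        """Replay the whole history, skipping every write by one user."""
        state = {}
        for who, key, value in self.log:
            if who != user:
                state[key] = value
        return state
```

`state_at` corresponds to rolling back to a point in history, and `without_user` to rolling back one user's updates; the log itself is the record of all actions by all users.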
In addition to ColdFusion's Application Security, we take full advantage of Oracle security - a breach of one just leads to
another layer. Oracle handles things like secondary user access and brute-force password crack attempts. An independent
semi-intelligent (and slightly paranoid) security wrapper watches for malicious behavior and blocks IP access if it detects
anything anomalous.
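A toy version of a "slightly paranoid" wrapper might look like the following. The heuristics, the one-strike threshold, and all names are assumptions made for illustration; the actual Arctos wrapper is not described in enough detail here to reproduce.

```python
from collections import defaultdict


class ProbeBlocker:
    """Blocks an IP address once it sends a suspicious-looking request."""

    # Crude, illustrative patterns only; a real filter would be far more careful.
    SUSPICIOUS = ("../", "<script", "' or ", "union select")

    def __init__(self, max_strikes=1):
        self.max_strikes = max_strikes
        self.strikes = defaultdict(int)
        self.blocked = set()

    def looks_malicious(self, path):
        lowered = path.lower()
        return any(token in lowered for token in self.SUSPICIOUS)

    def allow(self, ip, path):
        """Return True if the request may proceed, False if it is refused."""
        if ip in self.blocked:
            return False
        if self.looks_malicious(path):
            self.strikes[ip] += 1
            if self.strikes[ip] >= self.max_strikes:
                self.blocked.add(ip)
            return False
        return True
```

With `max_strikes=1` the first probe gets the address blocked, and later attempts from it never reach the application at all.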
6) Security - Is/are the database server(s) protected from disaster (eg
floods, fires)?
The server is running a RAID array - we can lose a disk or two and not lose any data (or stop working). Rollback logs are
continuously written to a remote NAS (Network-Attached Storage) system. Daily backups are stored on the local drives, on
the NAS, and on tape in GVEA's "bunker." (They won't tell us what or where that is, but your electric bill and medical records
are in there and it makes the Department of Homeland Security happy.) Daily backups are also copied to the Texas Advanced
Computing Center at Austin (one copy on disk and another on tape) and to tape at the San Diego Supercomputer Center. We
may have another copy going to massively redundant disk at the National Center for Supercomputing Applications (University
of Illinois at Urbana-Champaign) by the time you get to Reno.
We can recover to the point of failure, or at least to within a couple minutes of it, with one copy of the most recent daily backup
and one copy of the rollback logs. (Depending on recent activity, we can usually actually recover from a week-or-so old daily +
the rollbacks.) We'll lose <24H of data if we lose all the rollbacks - the server and the NAS. Those are in two buildings, both
with serious security, separated by about a hundred yards of gravel parking lot. If we lose all the nightly backups (3 tectonic
plates), I'm betting nobody will be overly worried about Arctos data. Or breathing.
There are a couple dozen probes per day - I think it's fairly safe to say that Arctos security has been tested. (Actual attacks are
now kind of hard to detect due to the aforementioned paranoid IP killer, which generally shuts them off at the first probe, but
we used to get one per week or so.) A big DDoS attack would easily take us down, but (1) we're too boring to attract such a
thing, and (2) so what? - those things just eat servers, not data.
We've lost a few disks over the years, but never lost data or had a server go down due to it. (We've had lots of downtime, just
not equipment-related.) Our biggest threat is probably a disgruntled employee with too much access and a long-term plan, but
we could probably (with expensive consultant help) even recover from that, and there's no lack of tools to detect such
behavior.
That might all be a little overkill - I'd settle for daily backups on 2 major tectonic plates if absolutely necessary –
but I certainly think that you have an obligation to do more than install [database X] on some junker computer and maybe buy
a tape drive when you take public money to create or curate digital data.
[database X] may be free, but supporting it takes a real commitment in
hardware, infrastructure, and expertise that most Universities are poorly
equipped to make.
I don't know of a single large project that hasn't at some point lost digital data.
- Dusty McDonald, Arctos programmer
Lessons Learned
1) Proprietary software is generally a bad idea unless you have
guaranteed, sustained budget for staff and upgrades.
2) Back-ups cannot merely be performed/scripted with the
assumption that the job is done.
3) Back-ups should NOT be incremental, MUST be stored
offsite, and MUST include separate images of operating
system and databases
4) Restoration from bare metal must be fully documented and
periodically performed to verify that the process DOES
work.
5) Source code must be in a distributed public repository like
GitHub.
- D. Shorthouse
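Lesson 2 says a scripted backup is not done just because the script ran. A small Python sketch of one possible check follows: copy a file, then compare SHA-256 checksums of source and copy. The function names and the file layout are assumptions, and a checksum only shows the copy is intact; it does not replace the periodic bare-metal restore of lesson 4.

```python
import hashlib
import shutil
from pathlib import Path


def sha256(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, backup_dir):
    """Copy `source` into `backup_dir` and fail loudly if the copy differs."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / Path(source).name
    shutil.copy2(source, target)
    if sha256(source) != sha256(target):
        raise IOError(f"backup of {source} does not match the original")
    return target
```

Running the same comparison again later, against the offsite copy, would also catch a backup that was good on the day it was written and has since gone bad.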
University of Connecticut Bird Collection data
were found... on a single floppy
2031 records in a flat file
University of Connecticut Bird Collection data
were found... and made available on-line
But... Something with the server setup is not
stable.