EMu on a Diet
[Title slide photo: Yale campus]
Peabody Collections
Counts & Functional Cataloguing Unit

Discipline                   Count        Cataloguing Unit
Anthropology                 325,000      Lot
Botany                       350,000      Individual
Entomology                   1,000,000    Lot
Invertebrate Paleontology    300,000      Lot
Invertebrate Zoology         300,000      Lot
Mineralogy                   35,000       Individual
Paleobotany                  150,000      Individual
Scientific Instruments       2,000        Individual
Vertebrate Paleontology      125,000      Individual
Vertebrate Zoology           185,000      Lot / Individual

2.7 million database-able units => ~11 million items
Peabody Collections
Functional Units Databased

Discipline                   Count        % Databased
Anthropology                 325,000      90 %
Botany                       350,000      1 %
Entomology                   1,000,000    3 %
Invertebrate Paleontology    300,000      60 %
Invertebrate Zoology         300,000      25 %
Mineralogy                   35,000       85 %
Paleobotany                  150,000      60 %
Scientific Instruments       2,000        100 %
Vertebrate Paleontology      125,000      60 %
Vertebrate Zoology           185,000      95 %

990,000 of 2.7 million => 37 % overall
The four YPM buildings

• Peabody (YPM)
• Environmental Science Center (ESC)
• 175 Whitney (Anthropology)
• Geology / Geophysics (KGL)

• Kristof Zyskowski (Vert. Zool. - ESC)
• Greg Watkins-Colwell (Vert. Zool. - ESC)
• Shae Trewin (Scientific Instruments - KGL)
• Mary Ann Turner (Vert. Paleo. - KGL / YPM)
• Maureen DaRos (Anthro. - YPM / 175 Whitney)
[Charts: % Databased vs. Collection Size (in 1000s of items); a second view highlights Botany, Entomology, Invertebrate Paleontology, and Invertebrate Zoology]
Peabody Collections
Approximate Digital Timeline

• 1991  Systems Office created & staffed
• 1992  Argus collections databasing initiative started
• 1994  Gopher services launched for collections data
• 1997  Gopher mothballed, Web / HTTP services launched
• 1998  Physical move of many collections “begins”
• 2002  Physical move of many collections “ends”
• 2003  Search for Argus successor commences
• 2003  Informatics Office created & staffed
• 2004  KE EMu to succeed Argus, data migration begins
• 2005  Argus data migration ends, go-live in KE EMu
Big events

• EMu migration in '05 (all disciplines went live simultaneously)
• Physical move in '98-'02 (primarily neontological disciplines)
What do you do …
… when your EMu is out of shape & sluggish ?

The Peabody Museum Presents
What clued us in that we should put our EMu on a diet ?
Area of Server Occupied by Catalogue

980 megabytes in Argus … 10,400 megabytes in EMu ?
Default EMu “cron” maintenance job schedule

[Schedule grid: columns Mo-Su; rows late night, workday, evening]
Legend:
• emulutsrebuild
• emumaintenance batch
• emumaintenance compact
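The slides show the shape of that default schedule but not the exact times; as a rough illustration only, a weekly layout like the one pictured could be written in cron as follows. The times, days, and paths are assumptions, not KE's shipped defaults.

    # Hypothetical crontab for the EMu server account -- days, times, and
    # paths are illustrative assumptions; the real defaults come with the
    # KE-supplied maintenance scripts.

    # rebuild lookup list entries late on Saturday night
    30 23 * * 6   /home/emu/bin/emulutsrebuild

    # "batch" maintenance on weekday evenings
    0 20 * * 1-5  /home/emu/bin/emumaintenance batch

    # "compact" maintenance late on Sunday night
    0 1 * * 0     /home/emu/bin/emumaintenance compact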
Three Fabulously Easy Steps !

• 1. The Legacy Data Burnoff ( best quick loss plan ever ! )
• 2. The Darwin Core Binge & Purge ( eat the big enchilada and still end up thin ! )
• 3. The Validation Code SlimDing ( your Texpress metabolism is your friend ! )
1. The Legacy Data Burnoff

980 MB … 10,400 MB
Anatomy of the ecatalogue database

File Name                        Function
~/emu/data/ecatalogue/data       the actual data
~/emu/data/ecatalogue/rec        indexing (part)
~/emu/data/ecatalogue/seg        indexing (part)

The combined size of these was 10.4 GB -- 4 GB in data and 3 GB in each of rec and seg.
The ecatalogue database was a rate limiter.

[Screenshot: a typical EMu data directory -- 23 files, 2 subdirs]
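Sizes like these are easy to confirm from the shell; a quick check against the layout in the table above, assuming the stock ~/emu/data location:

    # Report the on-disk size of the three large ecatalogue components,
    # then the whole table directory for comparison.
    du -sh ~/emu/data/ecatalogue/data \
           ~/emu/data/ecatalogue/rec \
           ~/emu/data/ecatalogue/seg
    du -sh ~/emu/data/ecatalogue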
Closer Assessment of Legacy Data

In 2005, we had initially adopted many of the existing formats for data elements from the USNM’s EMu client, to allow for rapid development of the Peabody’s modules by KE prior to migration -- the Legacy Data fields were among them.
Closer Assessment of Legacy Data

[Screenshots: Legacy Data fields in sites records (round 2) holding constant data, lengthy prefixes, and data of only temporary use in migration; chart of catalogue (round 2) data / rec / seg sizes]
How did we do the Legacy Data Burnoff in 2005 ?

• Repetitive scripting of texexport & texload jobs ( sketched after the step diagrams below )
• Conducted around a million updates of records
• Manually adjusted nightly cron jobs to accommodate
• Did the work at night over a six-month-long period
• Watched the process closely to keep from filling server disks
[Diagrams: ecatalogue data / rec / seg sizes shrinking as each step is applied]

• 2. Crunch: delete nulls from AdmOriginalData
• 3. Shorten labels on AdmOriginalData
• 4. Delete prefixes on AdmOriginalData
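A minimal sketch of the text-processing piece of one such nightly pass, under the assumption of a flat "Field: value" export format. The file names, the example label, and the prefix are inventions for illustration, and the texexport / texload calls that bracketed each pass are not shown.

    # Hypothetical trim pass over one exported batch of ecatalogue records.
    # The "Field: value" layout, the example label, and the prefix are all
    # assumptions; the real 2005 burnoff wrapped passes like this between
    # repeated, scripted texexport and texload runs.

    IN=/tmp/ecatalogue_batch.txt        # written by a texexport job (not shown)
    OUT=/tmp/ecatalogue_batch_trim.txt  # reloaded by a texload job (not shown)

    # The three edits mirror steps 2-4 above: drop null legacy values,
    # shorten a verbose label, strip a constant prefix.
    sed -e '/^AdmOriginalData:[[:space:]]*$/d' \
        -e 's/^AdmOriginalData: Original Catalogue Number:/AdmOriginalData: Cat No:/' \
        -e 's/^AdmOriginalData: LEGACY_/AdmOriginalData: /' \
        "$IN" > "$OUT"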
2. The Darwin Core Binge & Purge

[Photo: Charles Darwin, 1809-1882]

Natural History Metadata Standard: “ DwC ”
• Affords interoperability of different database systems
• Widely used in collaborative informatics initiatives
• Circa 40-50 fields depending on particular version
• Directly analogous to the Dublin Core standard
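For a sense of what those fields look like, here are a few genuine Darwin Core terms filled with invented example values; the values, and the particular terms chosen, are illustrative only.

    # A handful of real Darwin Core terms with invented example values,
    # roughly the kind of thing that gets populated per catalogue record.
    declare -A dwc=(
        [institutionCode]="YPM"
        [collectionCode]="IZ"
        [catalogNumber]="12345"
        [scientificName]="Homarus americanus"
        [country]="United States"
        [year]="1998"
    )
    for term in "${!dwc[@]}"; do
        printf '%s: %s\n' "$term" "${dwc[$term]}"
    done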
Populate DwC fields at 3.2.02 upgrade in 2006 … so what ?

IZ Department: total characters existing data    43,941,006
IZ Department: est. new DwC characters           20,000,000
IZ Department: est. expansion factor             45 %

We’re about to gain back most of the pounds we just lost in the Legacy Data Burnoff !
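Estimates like that 45 % figure come from simple character counting; a minimal sketch over a hypothetical tab-delimited export of the existing IZ records (the file name and layout are assumptions):

    # Sum the characters of data in a tab-delimited export of existing records.
    # "iz_export.txt" and its layout are assumptions for illustration.
    awk -F'\t' '{ for (i = 1; i <= NF; i++) chars += length($i) }
                END { printf "existing characters: %d\n", chars }' iz_export.txt

    # Estimated expansion factor = estimated new DwC characters / existing
    # characters, e.g. 20,000,000 / 43,941,006 is roughly 45 %.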
[Charts: catalogue – round 2; data / rec / seg sizes shown before and after actions in ecollectionevents, eparties, and ecatalogue]
[Screenshots: the ExtendedData and SummaryData fields on catalogue records]

The ExtendedData field is a full duplication of the IRN + SummaryData fields … delete the ExtendedData field, and use SummaryData when in “thumbnail mode” on records.
Populate DwC fields at 3.2.02 upgrade … so what ?

IZ Department: total characters existing data    43,941,006
IZ Department: est. new DwC characters           20,000,000
IZ Department: est. expansion factor             45 %

IZ Department: total characters modified data    43,707,277
IZ Department: total new DwC characters          22,358,461
IZ Department: actual expansion factor           - 0.1 %

The Purge of the duplicated ExtendedData field absorbed the Binge of more than 22 million new DwC characters, leaving the catalogue essentially no larger than before.
3. The Validation Code SlimDing

We’ve taken off the easiest pounds … any other fields to trim ?
Some sneakily subversive texpress tricks:

• Can history of query behavior by users help identify some EMu soft spots ?
• If so, can we slip EMu a “dynamic diet pill” into its computer code ?

[Screenshot: texadmin]
EMu actions in the background you don’t see:

• … you make certain common types of changes to any record in any EMu module
• … and automatic changes then propagate via “emuload” to numerous records in linked modules
• … those linked modules can grow a lot and slow EMu significantly between maintenance runs
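That growth between maintenance runs is easy to keep an eye on from the shell; a minimal sketch, assuming the same ~/emu/data layout as earlier (the log file name is an invention):

    # Record each table directory's size once a day so that growth between
    # emumaintenance runs is visible over time.
    for table in ~/emu/data/*/ ; do
        printf '%s  %s\n' "$(date +%F)" "$(du -sh "$table")"
    done >> ~/emu_growth.log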
Why not harness EMu’s continuously ravenous appetite for pushing local copies of linked fields into remote modules … and put it to work slimming for us !

Need to first understand how different EMu queries work.
Drag and Drop Query: checks the link field.

Straight Text Entry Query: instead checks a local copy of the SummaryData from the linked record that has been inserted into the catalogue.
EMu’s audit log: a gigantic activity trail

How often do users employ these two very different query strategies, on what fields, and are there distinctly divergent patterns ?

[Screenshot: catalogue audit records]

In this one-week sample, only 7 of 52 queries for accessions from inside the catalogue module used text queries; the other 45 were drag & drops. Of those 7 text queries, every one asked for the primary id number for the accession, or the numeric piece of that number, but not for any other type of data from within those accessions.
Over a full year of catalogue audit data, less than 1 % of all the queries into accessions used other than the primary id of the accession record as the keyword(s).

This is where we gain our SlimDing advantage !

We don’t need more than the primary id of the accession record in the local copy of the accession module data stored in the catalogue module.

This pattern also held true for queries launched from the catalogue against the bibliography and loans modules !
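Tallies like the ones above can be pulled from the audit trail once it has been flattened to text; a minimal sketch, assuming a hypothetical tab-separated extract with one query per row giving the target module, the query style, and the fields searched. The file name, column layout, and the literal values tested are all inventions, not EMu's real audit format.

    # Count drag & drop vs. straight text queries against accessions, and how
    # many of the text queries searched something other than the primary id.
    # "audit_queries.tsv" and its columns (module, query_style, fields_searched)
    # are assumptions for illustration only.
    awk -F'\t' '$1 == "accessions" {
            total++
            if ($2 == "drag_drop") dragdrop++
            else { text++; if ($3 !~ /primary_id/) other++ }
        }
        END {
            printf "accession queries: %d (drag & drop: %d, text: %d)\n", total, dragdrop, text
            printf "text queries not on the primary id: %d\n", other
        }' audit_queries.tsv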
[Diagrams: the Catalogue database and the Internal Movements database]
Default EMu “cron” maintenance job schedule

[Schedule grids: columns Mo-Su; rows late night, workday, evening; several slots marked * in the revised views]
Legend:
• emulutsrebuild
• emumaintenance batch
• emumaintenance compact

Quick backup
A Happy EMu Means Happy Campers
finis