CalBug - College of Natural Resources, UC Berkeley

© PT Oboyski
CalBug: Digitizing California’s Terrestrial Arthropods
Peter T Oboyski, Joan Ball, Rosemary Gillespie, Joyce Gross, Traci Grzymala, Gordon Nishida, Kipling Will
Essig Museum of Entomology, University of California at Berkeley, USA
© Joyce Gross
Workflow
Summary
Databasing of entomology collections has lagged behind that of other disciplines primarily
due to large collection sizes and the highly abbreviated and inconsistent data on very small
specimen labels. CalBug is a National Science Foundation-funded collaboration of the eight
major entomology collections in California* that intends to capture 1.1 million specimen-level
data records from our combined holdings. Data from all institutions will be combined in
a single online cache. We will analyze these data using geospatial technology to explore the
relationship between changes in distribution and habitat modification. Developing time-saving
methods and technology for getting data from specimen labels into databases is
paramount. We have focused on developing and testing methods and workflows to increase
the rate of data capture, while maximizing data quality. Digital imaging of labels provides an
easy-to-view verbatim archive of specimen data and allows remote data entry from image
files through manual entry, crowd-sourcing, and automated OCR and data parsing. Specimen
handling remains a significant obstacle for efficient data capture from entomological
collections because of costs in time and risk to specimens. Georeferencing is also a challenge
due to the highly abbreviated and inconsistent nature of location data on specimen labels. To
address these challenges we are exploring strategies that combine computer and human
data handling.
Label Image Capture
Figure 1. (upper left) DinoLite® digital microscope and software
used to capture images of specimens and labels. (upper right)
Essig database data entry screen with specimen image – clicking
on image icon makes image appear in a separate movable
window. Yellow fields are carried over to the next specimen.
(lower right) Dragonfly with labels removed for imaging.
Improving image & data acquisition
Minimize imaging time: We are currently developing high-throughput assembly lines to increase
the rate of image capture by spatially arranging handling tasks and automating file naming and
saving.
Online crowd-sourcing: We are collaborating with the Zooniverse citizen science program to
engage thousands of volunteers in label data entry from digital images. Each label is
transcribed multiple times by different volunteers, and the entries are then compared for
consistency (as a proxy for accuracy).
OCR and automated data parsing: We are developing user dictionaries for Optical Character
Recognition software to increase percent recognition and accuracy. We are also looking for
programmers to create a “smart” parsing program that can assign data elements to appropriate
database fields based on context and dictionary terms.
Developing a data cache: Data from each collaborating institution will be added to a combined
online cache (see required fields in Figure 4).
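The consistency comparison of repeated volunteer entries described above could be sketched as a per-field majority vote. This is only an illustration: the function names `normalize` and `consensus`, the normalization rule, and the two-thirds agreement threshold are assumptions, not part of the CalBug or Zooniverse workflow.

```python
from collections import Counter


def normalize(value):
    """Collapse case and whitespace so trivial differences do not count as disagreement."""
    return " ".join(value.lower().split())


def consensus(entries, min_agreement=2 / 3):
    """Return (value, agreement) for one field transcribed by several volunteers.

    value is None when no single normalized answer reaches min_agreement
    (an assumed threshold), so the label can be sent for manual review.
    """
    if not entries:
        return None, 0.0
    counts = Counter(normalize(e) for e in entries)
    best, n = counts.most_common(1)[0]
    agreement = n / len(entries)
    return (best if agreement >= min_agreement else None), agreement


# Three hypothetical transcriptions of the same locality field
value, agreement = consensus(["Cloverdale", "cloverdale ", "Cloverdal"])
print(value, round(agreement, 2))
```

Fields with no consensus value would be the natural candidates for a second look by museum staff.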
1. Select taxa for databasing
2. Sort specimens by location & date (optional)
3. Tease apart labels to view all text, add catalog # label
4. Take, name, and save digital image of labels
5a. Manually enter data into MySQL database with some error checking
5b. Online crowd-sourcing of manual data entry
5c. Optical Character Recognition & data parsing
6. Error Checking
7. Georeference locality
8. Upload data to cache (in development)
9. Temporospatial analyses (in development)
Assessment and Progress
Specimen handling: A significant share of time is spent retrieving individual specimens,
positioning labels for viewing, adding a catalog number label, and returning each specimen to
its unit tray.
Digital Imaging: Protocols for entering data directly from specimens into a verbatim field
followed by parsing into interpreted fields proved slow. Digital imaging of specimen labels
provides advantages, including a true verbatim digital archive, the ability to enlarge labels
onscreen, and the opportunity for remote data entry and/or Optical Character Recognition (OCR)
to automate data extraction. Using a naming convention that includes the specimen catalog
number, digital images are automatically linked to database records. Each specimen takes ~2
seconds to photograph, but naming and saving files adds ~7-10 seconds/specimen.
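A minimal sketch of how a file naming convention can link images to records by catalog number. The pattern used here (letters followed by digits, with an optional `_suffix`) and the names `catalog_number_from_image` and `link_images` are hypothetical; the poster does not state the actual convention.

```python
import re
from pathlib import Path

# Hypothetical convention: <letters><digits>, optionally followed by "_<suffix>",
# e.g. "ABC12345_labels.jpg". The real CalBug convention is not given in the source.
CATALOG_PATTERN = re.compile(r"^([A-Za-z]+\d+)(?:_.*)?$")


def catalog_number_from_image(path):
    """Return the catalog number embedded in an image file name, or None."""
    match = CATALOG_PATTERN.match(Path(path).stem)
    return match.group(1).upper() if match else None


def link_images(image_paths, records):
    """Attach each image path to the record with the same catalog number.

    records maps catalog number -> dict. Images with no matching record
    are returned so they can be checked by hand.
    """
    unmatched = []
    for path in image_paths:
        number = catalog_number_from_image(path)
        if number in records:
            records[number].setdefault("images", []).append(str(path))
        else:
            unmatched.append(str(path))
    return unmatched


records = {"ABC12345": {}}
leftover = link_images(["img/ABC12345_labels.jpg", "img/notes.jpg"], records)
print(records, leftover)
```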
Databasing: Several fields, including higher taxonomy and “higher geography”, are automatically
filled from names already in the database. Data are carried over from one specimen to the next
(yellow fields in Figure 1). These features, along with pick lists and controlled fields, reduce
errors.
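The carry-over behavior described above can be illustrated with a small sketch. The set of carried-over fields and the function `new_record` are assumptions made for the example; the poster identifies the real carry-over fields only as the yellow fields in Figure 1.

```python
# Hypothetical choice of carry-over fields (the actual set is shown in yellow in Figure 1)
CARRY_OVER_FIELDS = (
    "country", "stateProvince", "county", "locality",
    "collectionBeginDate", "recordedBy",
)


def new_record(catalog_number, previous=None, **entered):
    """Start a record, pre-filling carry-over fields from the previous specimen.

    Values typed for the current specimen (entered) override carried-over ones.
    """
    record = {"catalogNumber": catalog_number}
    if previous:
        record.update({f: previous[f] for f in CARRY_OVER_FIELDS if f in previous})
    record.update(entered)
    return record


first = new_record("ABC1", locality="Cloverdale", county="Sonoma")
second = new_record("ABC2", previous=first, sex="F")  # locality and county carried over
print(second)
```

Sorting specimens by location and date before entry makes consecutive records more alike, which is what lets a carry-over of this kind save typing.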
Progress: 27,000 Hymenoptera; 8,400 Odonata; 7,000 Lepidoptera entered into Essig Database.
4,000 specimens fully georeferenced. 36,000 images taken with 24,000 awaiting data entry.
Figure 3. General workflow for image capture, databasing, georeferencing, and analysis. See Methods for workflow details.
Methods
Taxa and localities to database: Priority species were selected to address urgent environmental
issues and target localities to examine changes in biodiversity at sites with long-term sampling,
including Natural Area Reserves.
Sort specimens by location and date (optional): A “carry-over” function reduces time spent
typing when consecutive specimens have similar data.
Digital imaging: DinoLite® digital microscopes (Figure 1) capture images of label data in JPEG
format.
Manual data entry into MySQL database: Label data are interpreted and entered into appropriate
database fields (Figure 4).
Error checking: Records are successively sorted by locality and date to identify typographic
errors/inconsistencies.
Georeference locality data: Database records are uploaded to BioGeomancer georeferencing
software (Figure 5), which suggests coordinates and an error radius for each locality based on
standardized protocols.
Upload data to cache (in development): At the completion of the project each institution will
upload records to a central cache for inter-institution analyses (Figure 4).
Temporospatial analyses (in development): GIS tools will be used to correlate species
distributions with climate and habitat factors and to predict changes in species distributions
based on climate change projections (Figure 6).
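The error-checking step in the Methods (sorting records by locality so that near-identical spellings end up next to each other) could be partly automated along the following lines. This is a sketch under stated assumptions: the function `suspicious_pairs` and the 0.9 similarity threshold are illustrative and not part of the Essig database.

```python
from difflib import SequenceMatcher


def suspicious_pairs(localities, threshold=0.9):
    """Flag neighbouring locality strings that are similar but not identical.

    Sorting puts near-duplicates side by side, mirroring the manual check;
    the similarity threshold is an assumed value that would need tuning.
    """
    unique = sorted(set(localities))
    pairs = []
    for a, b in zip(unique, unique[1:]):
        if SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold:
            pairs.append((a, b))
    return pairs


print(suspicious_pairs(["Cloverdale", "Cloverdal", "Sacramento", "Cloverdale"]))
```

Flagged pairs are only candidates; a person still has to decide whether two spellings refer to the same place.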
Collecting Event Data
Specimen Data
eventID (DC)
country (DC)
stateProvince (DC)
county (DC)
locality (DC)
minimumElevationMeters (DC)
maximumElevationMeters (DC)
decimalLatitude (DC)
decimalLongitude (DC)
coordinateUncertaintyMeters (DC)
geodeticDatum (DC)
verbatimCoordinateSystem (DC)
georeferenceSources (DC)
georeferencedBy (DC)
georeferencedDate
georeferenceRemarks (DC)
collectionBeginDate (*)
collectionEndDate (*)
recordedBy (DC) = collectors
samplingProtocol (DC)
associatedTaxa (DC)
sex (DC)
individualCount (DC)
catalogNumber (DC)
institutionCode (DC)
kingdom (DC)
phylum (DC)
class (DC)
order (DC)
family (DC)
genus (DC)
specificEpithet (DC)
subspecies
taxonIDCertainty
scientificNameAuthorship (DC)
identifiedBy (DC)
dateIdentified (DC)
eventID (DC)
Database
Georeferencing and Mapping
Figure 5. Semi-automated programs, such as BioGeomancer, estimate latitude-longitude coordinates
with an adjustable error radius based on text descriptions (above example: 15 miles E of
Cloverdale, CA). Queries of georeferenced specimens are mapped “on-the-fly” using Berkeley
Mapper (right example: specimens near Sacramento, California of Libellula luctuosa Burmeister
dragonflies in the Essig Database).
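To illustrate what turning a description like "15 miles E of Cloverdale, CA" into coordinates involves, the sketch below displaces a starting point by a distance and compass heading on a spherical Earth. It is not BioGeomancer's algorithm, it does not estimate the error radius that a full georeference requires, and the starting coordinates for Cloverdale are approximate values assumed for the example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius
KM_PER_MILE = 1.609344
BEARINGS = {"N": 0, "NE": 45, "E": 90, "SE": 135, "S": 180, "SW": 225, "W": 270, "NW": 315}


def offset_point(lat, lon, miles, heading):
    """Move `miles` from (lat, lon) along a compass heading on a spherical Earth."""
    d = miles * KM_PER_MILE / EARTH_RADIUS_KM  # angular distance in radians
    theta = math.radians(BEARINGS[heading.upper()])
    lat1, lon1 = math.radians(lat), math.radians(lon)
    lat2 = math.asin(
        math.sin(lat1) * math.cos(d) + math.cos(lat1) * math.sin(d) * math.cos(theta)
    )
    lon2 = lon1 + math.atan2(
        math.sin(theta) * math.sin(d) * math.cos(lat1),
        math.cos(d) - math.sin(lat1) * math.sin(lat2),
    )
    return math.degrees(lat2), math.degrees(lon2)


# Approximate coordinates for Cloverdale, CA, used only for illustration
lat, lon = offset_point(38.8055, -123.0172, 15, "E")
print(round(lat, 4), round(lon, 4))
```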
Response to climate change
Bold = required
Normal = recommended
(DC) = Darwin Core field
(*) = Darwin Core recommends
one field that accommodates
several date options. We prefer
“begin” and “end” dates.
Figure 4. Each institution uses its own database system. Records will
be collected into a Darwin Core-compliant, flat-file cache with
required fields for collecting event data and specimen data as
indicated in the above tables from the Essig database. Labels are
often highly abbreviated – unrecognized abbreviations are entered
“as is” and bulk-updated after data entry is completed.
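A rough sketch of how records from different institutional databases might be checked and written to a flat-file cache with Darwin Core column names. The `REQUIRED` and `OPTIONAL` tuples are a hypothetical subset chosen for the example (the poster marks the real required fields in bold, which is not preserved here), and the CSV layout is an assumption rather than the project's actual cache format.

```python
import csv
import io

# Hypothetical subset of columns; the actual required set is marked in bold on the poster
REQUIRED = ("catalogNumber", "institutionCode", "country", "stateProvince", "locality")
OPTIONAL = ("county", "decimalLatitude", "decimalLongitude", "genus", "specificEpithet")


def missing_required(record):
    """List the required fields that are absent or blank in a record."""
    missing = []
    for field in REQUIRED:
        value = record.get(field)
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


def to_flat_file(records):
    """Write complete records as CSV text; return (csv_text, rejected_records)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=REQUIRED + OPTIONAL,
        extrasaction="ignore",  # drop columns the cache does not define
        lineterminator="\n",
    )
    writer.writeheader()
    rejected = []
    for record in records:
        if missing_required(record):
            rejected.append(record)
        else:
            writer.writerow(record)
    return buffer.getvalue(), rejected


good = {"catalogNumber": "ABC1", "institutionCode": "XYZ", "country": "USA",
        "stateProvince": "California", "locality": "15 miles E of Cloverdale"}
bad = {"catalogNumber": "ABC2", "locality": ""}
text, rejected = to_flat_file([good, bad])
print(text)
```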
Figure 6. Annual average high temperatures under a high emissions scenario of climate change
(Source: Cal-Adapt and the Public Interest Energy Research program, California Energy Commission).
Records of arthropod collections over the past 100 years along with projections of future climates
will be used to predict the impact of climate change on arthropod distributions.
*Collaborators: Bohart Museum – UC Davis, California Academy of Sciences, California State Collection of Arthropods, Entomology Research Museum – UC Riverside, Essig Museum of Entomology – UC Berkeley, LA County Natural History Museum, San Diego Natural History Museum, Santa Barbara Museum of Natural History