OBIS/LifeWatch quality control procedures

Quality control of biodiversity data:
tools & techniques
Leen Vandepitte
On behalf of WoRMS, EurOBIS
& LifeWatch data management teams
What needs to be checked?
Quality control procedures for OBIS data - background
• Aim:
  1. Help data providers & data managers
     • Checking quality
     • Checking completeness
     • Detecting (possible) errors
  2. Assign quality flags to each available record
     • Evaluation of fitness for purpose & use (NOT: good or bad)
     • Filter out records with a certain quality standard
Example
Abra alba at latitude 24.53 & longitude 67.94 in 1983
=> Record suitable for general distribution analysis (species occurrence)
=> Record suitable for general temporal analysis (yearly trends)
=> Record not suitable for seasonal analysis
=> Record not suitable for abundance-related analyses (presence only)
Quality control procedures for OBIS data - background
• Approach:
  1. Automated process, within the database
     • Allows creation of filters
     • Allows feedback to data providers
  2. Online web services
     • Freely available to everyone
     • Allows direct feedback to user (result reports)
• Technical:
  – 18 quality control steps, on individual record level
  – 10 outlier checks, on dataset or species level
  – Each QC step = yes (1)/no (0) question
  – Creation of a bit-sequence: QC step x contributes 2^(x-1) when answered yes
    => stored as an integer value for the QC
    => unique value for each possible combination
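Example 1: QC steps 1, 2 and 4 answered yes
QC step   Value   Bit-seq.   Result
1         1       2^(1-1)    = 1
2         1       2^(2-1)    = 2
3         0       -          = 0
4         1       2^(4-1)    = 8
5         0       -          = 0
TOTAL                        = 11

Example 2: all five QC steps answered yes
QC step   Value   Bit-seq.   Result
1         1       2^(1-1)    = 1
2         1       2^(2-1)    = 2
3         1       2^(3-1)    = 4
4         1       2^(4-1)    = 8
5         1       2^(5-1)    = 16
TOTAL                        = 31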
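The bit-sequence idea can be sketched in a few lines of Python. This is an illustration only, not the OBIS implementation: the function names are hypothetical, and the only rule taken from the slides is that step x contributes 2^(x-1) when its answer is yes.

```python
def encode_qc(answers):
    """Combine yes (1) / no (0) answers into one integer flag.

    answers[0] is QC step 1, answers[1] is QC step 2, and so on;
    step x adds 2^(x-1) to the total when it was answered yes.
    """
    total = 0
    for step, value in enumerate(answers, start=1):
        if value not in (0, 1):
            raise ValueError(f"QC step {step}: expected 0 or 1, got {value!r}")
        total += value * 2 ** (step - 1)
    return total


def passed_step(flag, step):
    """Return True if the 1-based QC step is set in the stored integer."""
    return bool(flag & (1 << (step - 1)))


def decode_qc(flag, n_steps):
    """Recover the list of yes/no answers from the stored integer."""
    return [int(passed_step(flag, step)) for step in range(1, n_steps + 1)]


print(encode_qc([1, 1, 0, 1, 0]))  # 11, as in the first table
print(encode_qc([1, 1, 1, 1, 1]))  # 31, as in the second table
print(decode_qc(11, 5))            # [1, 1, 0, 1, 0]
```

Because every combination of answers maps to a different integer, a filter such as "step 4 must be yes" becomes a single bit test on the stored value.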
LifeWatch: home for a multitude of web services
• Part of the European Strategy Forum on Research Infrastructures (ESFRI)
• Distributed virtual laboratory:
  – Biodiversity research
  – Climatological & environmental impact studies
  – Support development of ecosystem services
  – Provide information for policy makers
  – Biodiversity observatories, databases, web services and modelling tools
  – Integration of existing systems, upgrades, new systems
• LifeWatch wants
  – Standardization of species data
  – Integration of distributed biodiversity data repositories & operating facilities
• LifeWatch needs
  – Species information services
• LifeWatch offers compilation and combination of several web services
• These services = taxonomic backbone
  – Taxonomy access services
  – Taxonomic editing environment
  – Species occurrence services
  – Catalogue services
• LifeWatch infrastructure:
  – Identify, analyze and design online data services, models and applications
  – Make use of all LifeWatch data
  – = interactive part of LifeWatch
LifeWatch web services
• Login / password required
• System keeps track of all your “jobs”
Taxonomic quality control
• Taxon match: World Register of Marine Species (WoRMS)
• Taxon match: LifeWatch taxon match, against:
  – World Register of Marine Species (WoRMS)
  – Integrated Taxonomic Information System (ITIS)
  – Catalogue of Life (CoL)
  – International Plant Names Index (IPNI)
  – Index Fungorum (IF)
  – PalaeoBiology Database (Palaeo-DB)
  – Pan-European Species directories Infrastructure (PESI)
  – …
Taxonomic QC step by step
[Flowchart: decision tree starting from TAXON NAME X.
 Decision nodes (yes/no): “Match with WoRMS?”, “Match with other registers?”, “Is the taxon marine?”
 Action nodes: Document LSID; Check habitat (marine/non-marine); Check tax level (genus/species); Contact the data provider for secondary check; Contact taxonomic editor: add taxon to WoRMS; Add taxon to annotated list; Go through matches again]
WoRMS Taxon Match Tool
Freely available, no password/login required
This tool uses the following components:
– TAXAMATCH fuzzy matching algorithm by Tony Rees
– PHP/MySQL port of TAXAMATCH by Michael Giddens
– Scientific Names Parser by Dmitry Mozzherin
To use the tool:
– Prepare your own file (Plain text [TXT], Comma Separated [CSV] & Excel Sheet [XLS, XLSX])
– For convenience => name the column “scientific_name”
– Upload onto website
• WoRMS taxon match results:
  – Exact match
  – Phonetic match
  – Near_1 match
  – Near_2 match
  – No match
Check and verify everything that is not an exact match…
• Some examples:
  – Phonetic: Fragilaria aurivillii => Fragilaria aurivilii
  – Near_1: Chaetoceros seychellarum => Chaetoceros seychellarus
  – Near_2: Gammarus finnmarchius => Gammarus finmarchicus
            Syllis armoricanus => Syllis armoricana
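To give a feel for why every non-exact match needs checking, here is a rough stand-in for fuzzy name matching. It does NOT use TAXAMATCH: it relies on the similarity ratio from Python's standard `difflib` module, and the reference list, function name and 0.85 cutoff are assumptions made for this illustration.

```python
import difflib

# Hypothetical mini reference list, built from the names in the examples above.
REFERENCE_NAMES = [
    "Fragilaria aurivilii",
    "Chaetoceros seychellarus",
    "Gammarus finmarchicus",
    "Syllis armoricana",
]


def match_name(name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Return (status, matched_name) with status 'exact', 'fuzzy' or 'none'."""
    cleaned = " ".join(name.split())  # collapse stray whitespace
    if cleaned in reference:
        return "exact", cleaned
    # difflib ranks candidates by similarity ratio; keep only the best one.
    candidates = difflib.get_close_matches(cleaned, reference, n=1, cutoff=cutoff)
    if candidates:
        return "fuzzy", candidates[0]
    return "none", None


for submitted in ["Syllis armoricana", "Gammarus finnmarchius", "Abra alba"]:
    print(submitted, "=>", match_name(submitted))
```

Anything returned as "fuzzy" is only a suggestion and still has to be verified by a person, exactly as for the phonetic and near matches of the real tool.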
LifeWatch taxon match tool
• Currently available taxon services
If a taxon is not in WoRMS:
- Send email to [email protected]
- Let us know if it is available in any of the other registers
Use this report as feedback to your provider / WoRMS
Taxonomic quality control – ambiguous matches
Scientific name: Chondracanthus, unknown species
Kingdom Plantae (Rhodophyta)
Kingdom Animalia (Crustacea)
Scientific name: Alebion
Alebion Krøyer, 1863
=> Animalia, Crustacea, parasitic copepods
Alebion Gray, 1867
=> Animalia, Porifera
=> Accepted as Iophon Gray, 1867
Taxonomic quality control – its importance illustrated…
“… In total, 6,172 unique taxon names were submitted …. After a thorough QC, however, this number
was reduced to 4,525, mostly due to spelling variations and synonymy.”
“ … Such [taxonomic] quality control is highly needed, since a misspelled or obsolete name could
be compared to the introduction of a rare species, with adverse effects on further (biodiversity)
calculations…”
Source: Vandepitte et al. (2010). Data integration for European marine biodiversity research: creating a database on benthos and
plankton to study large-scale patterns and long-term changes. Hydrobiologia 644: 1-13
Geographic quality control
• LifeWatch: Show on map
• LifeWatch: Marine Regions Gazetteer services
  – Get lat-lon by MrgID
  – Get lat-lon by name
  – Get Gazetteer name by lat-lon
  – Get lat-lon by accepted name
Geographic QC – the concept
Communication with provider
Before quality control     After quality control
18°30’25’’N – 5°15’E       18.51 ; 5.25
54,23N – 16.5S             54.23 ; -16.5
WGS84 = World Geodetic System 1984; most used geographical reference system
Decimal degrees => easy to work with
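The conversion to decimal degrees shown above can be sketched as follows. The function names are hypothetical, and the sign convention (south and west negative) is the usual one for WGS84 decimal degrees.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus hemisphere letter to decimal degrees."""
    hemisphere = hemisphere.upper()
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    if not (0 <= minutes < 60 and 0 <= seconds < 60):
        raise ValueError("minutes and seconds must be in the range 0-60")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and west are negative in decimal degrees.
    return -value if hemisphere in ("S", "W") else value


def parse_decimal(text):
    """Read a number written with a decimal comma or a decimal point."""
    return float(text.strip().replace(",", "."))


print(round(dms_to_decimal(18, 30, 25, "N"), 2))  # 18.51
print(dms_to_decimal(5, 15, 0, "E"))              # 5.25
print(parse_decimal("54,23"))                     # 54.23
```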
Coordinates are indispensable
• Coordinates = basis of a biogeographic information system
• When no coordinates are provided…
  Check with the data provider / the source
  • When existing: complete the file & run QC
  • When not existing:
    – Derive from provided map
    – Check Marine Regions to assign coordinates
Marine Regions
• = Standard, relational list of geographic names
• Coupled with information and maps of the geographic location
• Improve access and clarity of the different geographic, mainly marine names such as seas, sandbanks, ridges and bays
http://www.marineregions.org
Fish species “A” present in Kenya
Marine species on land?
Link with adjacent sea area: EEZ
Indicate precision!!!!
Geographical quality control – its importance illustrated
• Some examples
“Monitoring in Kongsfjorden area”
“Monitoring in Belgian part of the North Sea”
“+” & “-” signs switched
Latitude & longitude switched
Geographical quality control – its importance illustrated
Sightings and strandings of marine turtles around the coast of UK and Ireland
Left: coordinates as received; right: corrected. Errors due to missing minus sign
What else to check…?
• Use common sense…
• LifeWatch: Data format validation
  – Checks if lat-lon are completed
  – Checks if lat-lon are within possible boundaries (-90/+90 & -180/+180)
  – Checks date-format (ISO standard)
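The three format checks listed above could look roughly like this. It is a minimal sketch, not the LifeWatch service itself: the function name and the wording of the messages are assumptions, and the ISO check uses `date.fromisoformat` from the Python standard library.

```python
from datetime import date


def check_record(lat, lon, event_date):
    """Return a list of format problems; an empty list means all checks passed."""
    problems = []
    if lat is None or lon is None:
        problems.append("lat-lon not completed")
    else:
        if not -90 <= lat <= 90:
            problems.append("latitude outside -90/+90")
        if not -180 <= lon <= 180:
            problems.append("longitude outside -180/+180")
    try:
        date.fromisoformat(event_date)  # expects e.g. "1983-05-17"
    except (TypeError, ValueError):
        problems.append("date is not in ISO format")
    return problems


print(check_record(51.3, 2.8, "1983-05-17"))   # []
print(check_record(95.0, 2.8, "1983-05-17"))   # latitude problem
print(check_record(None, 2.8, "17/05/1983"))   # missing lat and bad date
```

A latitude outside -90/+90 combined with a longitude inside that range is a common hint that the two columns were switched.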
Dates
• What can go wrong?
  – Year: “1972” vs “72” vs “972”
  – Month: between 1-12
  – Day: between 1-31, check takes into account the given month
• Also…
  – Dataset from 1990, with a few records in 1909…
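A minimal sketch of the year, month and day checks, using the standard `calendar` module so that the day check takes the month (and leap years) into account. The function name and the plausible year range (`min_year`, `max_year`) are assumptions for this example, not OBIS rules.

```python
import calendar


def check_date_parts(year, month, day, min_year=1700, max_year=2100):
    """Return a list of problems found in the separate date parts."""
    problems = []
    year_ok = min_year <= year <= max_year
    if not year_ok:
        problems.append(f"year {year} is implausible (truncated or mistyped?)")
    if not 1 <= month <= 12:
        problems.append("month not between 1 and 12")
    else:
        # With an implausible year, fall back to leap year 2000 so that the
        # day check stays lenient and only reports impossible days.
        days_in_month = calendar.monthrange(year if year_ok else 2000, month)[1]
        if not 1 <= day <= days_in_month:
            problems.append(f"day not between 1 and {days_in_month} for month {month}")
    return problems


print(check_date_parts(1972, 2, 29))  # [] (1972 was a leap year)
print(check_date_parts(72, 6, 15))    # implausible year
print(check_date_parts(1990, 2, 30))  # impossible day for February
```

A check like this does not catch the "1990 dataset with a few records in 1909" case: those dates are valid on their own and only stand out when compared with the rest of the dataset.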
Units
• OBIS can capture:
  – Counts
  – Biomass
  – Depth
• Are units defined?
  – Counts: individuals per m², cm², liter, m³
  – Biomass: wet weight, dry weight, ash-free dry weight
  – Depth: meter, centimeter
“I collected 4 individuals of species X from location Z”
=> Sample size? 10 cm² - 50 cm² - 1 m² - …?
• Significance:
  – Needs thorough documenting
  – Know what you are dealing with
  – Comparison
  – Convert to OBIS standards
    • Depth: in meters, positive values
    • Abundance: NULL versus 0 (absence); positive values
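The conversion rules above can be illustrated with a small sketch. The unit labels ("m", "cm"), the function names and the choice to reject negative values rather than silently flip their sign are assumptions made for this example.

```python
# Assumed unit labels and their divisor to reach meters.
DEPTH_DIVISOR = {"m": 1, "cm": 100}


def depth_in_meters(value, unit):
    """Convert a depth to meters; depth is expected as a positive value."""
    if value is None:
        return None  # no depth recorded
    if unit not in DEPTH_DIVISOR:
        raise ValueError(f"undefined depth unit: {unit!r}")
    if value < 0:
        raise ValueError("depth must be a positive value")
    return value / DEPTH_DIVISOR[unit]


def describe_abundance(count):
    """Keep NULL (not recorded) apart from 0 (recorded absence)."""
    if count is None:
        return "not recorded"
    if count < 0:
        raise ValueError("abundance must be a positive value")
    return "absence" if count == 0 else "presence"


print(depth_in_meters(250, "cm"))  # 2.5
print(describe_abundance(None))    # not recorded
print(describe_abundance(0))       # absence
```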
Quality control procedures for OBIS data - automated
• Remember - technical:
  – 18 quality control steps, on individual record level
  – 10 outlier checks, on dataset or species level
  – Each QC step = yes (1)/no (0) question
  – Creation of a bit-sequence: QC step x contributes 2^(x-1) when answered yes
    => stored as an integer value for the QC
    => unique value for each possible combination
• Some steps not (yet) available through web services
  – Will be identified once data is in OBIS
  – Harvest report will give indication of possibly erroneous records
• Geographic quality control – 3-dimensional check
  – Latitude and longitude <> 0
  – Latitude between -90/+90 and longitude between -180/+180
  – Latitude – longitude within sea area (20 km buffer)
  – Depth value possible
    • Plot corresponding latitude-longitude on the General Bathymetric Chart of the Oceans (GEBCO)
    • Compare GEBCO depth with actual sampling depth
    • Take into account 100 m margin
Taxon                      Given depth (m)   GEBCO depth (m)   Difference (m)   Evaluation
Desmoscolex                2080              510               +1570            Negative evaluation in QC
Halieutichthys aculeatus   110               1140              -1030            Needs to be looked at…
Halieutichthys aculeatus = Pancake batfish => usually bottom-dwelling…
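The depth comparison can be sketched as below. It assumes both depths are given in meters as positive values and applies the 100 m margin mentioned on the slide; the function name and the returned dictionary are hypothetical.

```python
def depth_check(sampling_depth_m, gebco_depth_m, margin_m=100.0):
    """Compare a sampling depth with the GEBCO depth at the same position.

    The depth value is considered possible when the sample is not deeper
    than the sea floor plus the margin.
    """
    difference = sampling_depth_m - gebco_depth_m
    return {"difference_m": difference, "possible": difference <= margin_m}


print(depth_check(2080, 510))   # Desmoscolex: +1570 m, not possible
print(depth_check(110, 1140))   # Halieutichthys aculeatus: -1030 m, possible
```

The second record passes this check, yet the slide still marks it for a closer look: a bottom-dwelling species reported about 1 km above the sea floor is an ecological question that a simple depth comparison cannot answer.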
Quality control procedures for OBIS data - outlier analyses
• Only performed on OBIS database (=global coverage)
• Geographic outliers – dataset level
  – Analysis on dataset level
  – Possible location outlier(s)
  – Methodology based on centroid calculations and assuming normal distribution => not applicable for strongly asymmetric datasets…
  – Communication with provider on results
[Figure: map of the dataset “ICES Biological Community” (DOME). Legend: Centroid; No outlier; Possible outlier.
 Annotation 1: Also identified as incorrect in record-level check of lat-lon (=land)
 Annotation 2: Not identified through record-level check of lat-lon (=sea), but seen as potential outlier through geographic outlier check]
Vandepitte et al. (2015). Fishing for data and sorting the
catch […]. Database. DOI: 10.1093/database/bau125
Provider communication:
- Antarctic locations are incorrect (data error)
- Northern locations are correct (sampling bias)
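A centroid-based outlier check could be sketched as follows. The slides only state that the method relies on centroid calculations and assumes a normal distribution; the details here are assumptions for illustration: a plain average of latitude and longitude as centroid, great-circle (haversine) distances, and a threshold of `k` standard deviations above the mean distance.

```python
import math
import statistics


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km on a sphere with radius 6371 km."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(min(1.0, math.sqrt(a)))


def geographic_outliers(points, k=3.0):
    """Return the indices of (lat, lon) points flagged as possible outliers."""
    if len(points) < 3:
        return []
    # Simple arithmetic centroid; not suitable for data crossing the date line.
    c_lat = statistics.fmean(lat for lat, _ in points)
    c_lon = statistics.fmean(lon for _, lon in points)
    dists = [haversine_km(lat, lon, c_lat, c_lon) for lat, lon in points]
    mean = statistics.fmean(dists)
    sd = statistics.pstdev(dists)
    if sd == 0:
        return []
    return [i for i, d in enumerate(dists) if (d - mean) / sd > k]


# 20 records in the southern North Sea plus one with a flipped latitude sign.
records = [(51.2 + 0.01 * i, 2.5 + 0.02 * i) for i in range(20)] + [(-51.3, 2.8)]
print(geographic_outliers(records))  # [20]
```

A flagged record is only a candidate: as the provider feedback above shows, some flagged locations turn out to be data errors while others are correct and reflect sampling bias.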
• Environmental outliers – species level
  => Check for outliers within the available distribution records of a species
  => Geography, depth, sea surface salinity (SSS), sea surface temperature (SST)
[Figure: outlier maps for Verruca stroemia (Crustacea: Cirripedia). Panels: Depth; Geography. Legend: Centroid; No outlier; Possible outlier. Source: Vandepitte et al. (2015)]
Dubious results => additional verification:
- World Register of Marine Species: literature & expert-based species distribution information
- Expert: ecological information
Questions?
Hands-on sessions: Wednesday & Thursday
Vandepitte L., Hernandez F., Claus S., Vanhoorne B., De Hauwere N., Deneudt K., Appeltans W., Mees J. (2011). Analysing the content of the European Ocean Biogeographic Information System (EurOBIS): available data, limitations, prospects and a look at the future. Hydrobiologia 667(1): 1-14.
Vandepitte L., Bosch S., Tyberghein L., Waumans F., Vanhoorne B., Hernandez F., De Clerck O., Mees J. (2015). Fishing for data and sorting the catch: assessing the data quality, completeness and fitness for use of data in marine biogeographic databases. Database (2015), 1-14 (doi: 10.1093/database/bau125).