OBIS/LifeWatch quality control procedures
Quality control of biodiversity data:
tools & techniques
Leen Vandepitte
On behalf of WoRMS, EurOBIS
& LifeWatch data management teams
What needs to be checked?
Quality control procedures for OBIS data - background
• Aim:
1. Help data providers & data managers
  • Checking quality
  • Checking completeness
  • Detecting (possible) errors
2. Assign quality flags to each available record
  • Evaluation of fitness for purpose & use (NOT: good or bad)
  • Filter out records with certain quality standard
Example
Abra alba at latitude 24,53 & longitude 67,94 in 1983
Record suitable for general distribution analysis (species occurrence)
Record suitable for general temporal analysis (yearly trends)
Record not suitable for seasonal analysis
Record not suitable for abundance-related analyses (presence only)
Quality control procedures for OBIS data - background
• Approach:
1. Automated process, within the database
  • Allows creation of filters
  • Allows feedback to data providers
2. Online web services
  • Freely available for use to everyone
  • Allows direct feedback to user (result reports)
• Technical:
– 18 quality control steps, on individual record level
– 10 outlier checks, on dataset or species level
– Each QC step = yes (1)/no (0) question
– Creation of a bit-sequence (2^(x-1), with x the number of the QC step)
  => stored as an integer value for the QC
  => unique value for each possible combination
QC step   Bit-seq.         Example 1: Value   Result   Example 2: Value   Result
1         2^(1-1) = 1      1                  1        1                  1
2         2^(2-1) = 2      1                  2        1                  2
3         2^(3-1) = 4      0                  0        1                  4
4         2^(4-1) = 8      1                  8        1                  8
5         2^(5-1) = 16     0                  0        1                  16
TOTAL                                         11                          31
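The bit-sequence described above can be sketched in a few lines of Python. This is a minimal illustration, not the OBIS implementation: the function names `encode_qc` and `passed` are hypothetical, and the five-step examples simply reproduce the totals 11 and 31 from the slide.

```python
def encode_qc(answers):
    """Combine yes (1) / no (0) answers for QC steps 1..n into one integer.

    Step x contributes 2^(x-1) when its answer is 1.
    """
    return sum(value * 2 ** (step - 1) for step, value in enumerate(answers, start=1))


def passed(qc_value, step):
    """Return True if the given QC step (counted from 1) was answered 'yes'."""
    return bool(qc_value & (1 << (step - 1)))


print(encode_qc([1, 1, 0, 1, 0]))  # 11
print(encode_qc([1, 1, 1, 1, 1]))  # 31
print(passed(11, 3))               # False: step 3 was answered 'no'
```

Because every combination of answers gives a different integer, a filter on the stored value can recover which individual steps a record passed.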
LifeWatch: home for a multitude of web services
• Part of European Strategy Forum on Research Infrastructures (ESFRI)
• Distributed virtual laboratory:
– Biodiversity research
– Climatological & environmental impact studies
– Support development of ecosystem services
– Provide information for policy makers
– Biodiversity observatories, databases, web services and modelling tools
– Integration of existing systems, upgrades, new systems
• LifeWatch wants
– Standardization of species data
– Integration of distributed biodiversity data repositories & operating facilities
• LifeWatch needs
– Species information services
• LifeWatch offers compilation and combination of several web services
• These services = taxonomic backbone
– Taxonomy access services
– Taxonomic editing environment
– Species occurrence services
– Catalogue services
• LifeWatch infrastructure:
– Identify, analyze and design online data services, models and applications
– Make use of all LifeWatch data
– = interactive part of LifeWatch
LifeWatch web services
• Login / password required
• System keeps track of all your “jobs”
Taxonomic quality control
• Taxon match: World Register of Marine Species (WoRMS)
• LifeWatch taxon match:
– World Register of Marine Species (WoRMS)
– Integrated Taxonomic Information System (ITIS)
– Catalogue of Life (CoL)
– International Plant Name Index (IPNI)
– Index Fungorum (IF)
– PalaeoBiology Database (Palaeo-DB)
– Pan-European Species Infrastructure (PESI)
– …
Taxonomic QC step by step
[Flowchart, starting from TAXON NAME X]
1. Match with WoRMS? (yes / no)
– Document LSID
– Check habitat (marine/non-marine)
– Check tax level (genus/species)
2. Match with other registers? (yes / no)
– Is the taxon marine? (yes / no)
– Contact taxonomic editor: add taxon to WoRMS
– Add taxon to annotated list
– Contact the data provider for secondary check
3. Go through matches again
WoRMS Taxon Match Tool
Freely available, no password/login required
This tool uses the following components:
TAXAMATCH fuzzy matching algorithm by Tony Rees
PHP/MySql port of TAXAMATCH by Michael Giddens
Scientific Names Parser by Dmitry Mozzherin
Prepare your own file (Plain text [TXT], Comma Separated [CSV] & Excel Sheet [XLS, XLSX])
For convenience => column “scientific_name”
Upload onto website
• WoRMS taxon match results:
– Exact match
– Phonetic match
– Near_1 match
– Near_2 match
– No match
Check and verify everything that is not an exact match…
• Some examples:
– Phonetic: Fragilaria aurivillii => Fragilaria aurivilii
– Near_1: Chaetoceros seychellarum => Chaetoceros seychellarus
– Near_2: Gammarus finnmarchius => Gammarus finmarchicus
          Syllis armoricanus => Syllis armoricana
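To give a feel for why non-exact matches need checking, the sketch below flags submitted names that are close to, but not identical with, a reference name. It is only an illustration: it uses Python's generic `difflib` similarity, not the TAXAMATCH algorithm the tool relies on, so it cannot distinguish phonetic, Near_1 and Near_2 matches. The reference list, the function name `suggest_name` and the cutoff of 0.85 are assumptions made for the example.

```python
import difflib

# Hypothetical mini reference list; a real check would query WoRMS.
REFERENCE_NAMES = [
    "Fragilaria aurivilii",
    "Chaetoceros seychellarus",
    "Gammarus finmarchicus",
    "Syllis armoricana",
]


def suggest_name(name, reference=REFERENCE_NAMES, cutoff=0.85):
    """Return (match_type, suggestion) for a submitted scientific name."""
    if name in reference:
        return "exact", name
    # Generic string similarity, standing in for a real fuzzy taxon matcher.
    close = difflib.get_close_matches(name, reference, n=1, cutoff=cutoff)
    if close:
        return "near", close[0]
    return "none", None


print(suggest_name("Gammarus finnmarchius"))
print(suggest_name("Syllis armoricana"))
```

Anything returned as "near" is a candidate for manual verification rather than an automatic correction.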
LifeWatch taxon match tool
• Currently available taxon services
If a taxon is not in WoRMS:
- Send email to [email protected]
- Let us know if it is available in any of the other registers
Use this report as feedback
to your provider / WoRMS
Taxonomic quality control – ambiguous matches
Scientific name: Chondracanthus, unknown species
Kingdom Plantae (Rhodophyta)
Kingdom Animalia (Crustacea)
Scientific name: Alebion
Alebion Krøyer, 1863
=> Animalia, Crustacea, parasitic copepods
Alebion Gray, 1867
=> Animalia, Porifera
=> Accepted as Iophon Gray, 1867
Taxonomic quality control – its importance illustrated…
“… In total, 6,172 unique taxon names were submitted …. After a thorough QC, however, this number
was reduced to 4,525, mostly due to spelling variations and synonymy.”
“ … Such [taxonomic] quality control is highly needed, since a misspelled or obsolete name could
be compared to the introduction of a rare species, with adverse effects on further (biodiversity)
calculations…”
Source: Vandepitte et al. (2010). Data integration for European marine biodiversity research: creating a database on benthos and
plankton to study large-scale patterns and long-term changes. Hydrobiologia 644: 1-13
Geographic quality control
• LifeWatch: Show on map
• LifeWatch: Marine Regions Gazetteer services
– Get lat-lon by MrgID
– Get lat-lon by name
– Get Gazetteer name by lat-lon
– Get lat-lon by accepted name
Geographic QC – the concept
Communication with provider
Before quality control      After quality control
18°30’25’’N – 5°15’E        18.51 ; 5.25
54,23N – 16.5S              54.23 ; -16.5
WGS84 = World Geodetic System 1984; most used geographical reference system
Decimal degrees => easy to work with
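The conversion from degrees, minutes and seconds to decimal degrees can be sketched as follows. The function name `dms_to_decimal` is hypothetical; the sketch assumes the hemisphere is given as a single letter and that southern and western values become negative.

```python
def dms_to_decimal(degrees, minutes=0.0, seconds=0.0, hemisphere="N"):
    """Convert degrees/minutes/seconds plus hemisphere letter to signed decimal degrees."""
    if hemisphere not in ("N", "S", "E", "W"):
        raise ValueError(f"unknown hemisphere: {hemisphere!r}")
    value = abs(degrees) + minutes / 60 + seconds / 3600
    # South and west are stored as negative values.
    return -value if hemisphere in ("S", "W") else value


lat = dms_to_decimal(18, 30, 25, "N")
lon = dms_to_decimal(5, 15, 0, "E")
print(round(lat, 2), round(lon, 2))  # 18.51 5.25
```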
Coordinates are indispensable
• Coordinates = basis of a biogeographic information system
• When no coordinates are provided…
Check with the data provider / the source
  • When existing: complete the file & run QC
  • When not existing:
    – Derive from provided map
    – Check Marine Regions to assign coordinates
Marine Regions
• = Standard, relational list of geographic names
• Coupled with information and maps of the geographic location
• Improve access and clarity of the different geographic, mainly marine names such as seas, sandbanks, ridges and bays
http://www.marineregions.org
Fish species “A” present in Kenya
Marine species on land?
Link with adjacent sea area: EEZ
Indicate precision!!!!
Geographical quality control – its importance illustrated
• Some examples
“Monitoring in Kongsfjorden area”
“Monitoring in Belgian part of the North Sea”
“+” & “-” signs switched
Latitude & longitude switched
Geographical quality control – its importance illustrated
Sightings and strandings of marine turtles around the coast of UK and Ireland
Left: coordinates as received; right: corrected. Errors due to missing minus sign
What else to check…?
• Use common sense…
• LifeWatch: Data format validation
– Checks if lat-lon are completed
– Checks if lat-lon are within possible boundaries (-90/+90 & -180/+180)
– Checks date-format (ISO standard)
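The three format checks listed above could look roughly like this for a single record. This is a sketch under assumptions, not the LifeWatch service itself: the function name `validate_record` and the wording of the messages are made up, and the date check only accepts full ISO dates such as 1983-06-15.

```python
from datetime import date


def validate_record(lat, lon, event_date):
    """Return a list of problems found in one record (empty list = passes)."""
    problems = []
    if lat is None or lon is None:
        problems.append("lat-lon missing")
    else:
        if not -90 <= lat <= 90:
            problems.append("latitude out of range")
        if not -180 <= lon <= 180:
            problems.append("longitude out of range")
    try:
        date.fromisoformat(event_date)  # expects YYYY-MM-DD
    except (TypeError, ValueError):
        problems.append("date not in ISO format")
    return problems


print(validate_record(24.53, 67.94, "1983-06-15"))
print(validate_record(95.0, 67.94, "15/06/1983"))
```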
Dates
• What can go wrong?
– Year: “1972” vs “72” vs “972”
– Month: between 1-12
– Day: between 1-31, check takes into account the given month
• Also…
– Dataset from 1990, with a few records in 1909…
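A possible way to check the date parts listed above is sketched below. The function names and the 50-year gap used to flag suspicious years are assumptions chosen for the example, not documented OBIS thresholds.

```python
import calendar
from collections import Counter


def check_date_parts(year, month, day):
    """Return problems for a year/month/day triple (empty list = plausible)."""
    problems = []
    year_ok = 1000 <= year <= 9999
    if not year_ok:
        problems.append("year is not four digits")
    if not 1 <= month <= 12:
        problems.append("month outside 1-12")
    else:
        # Fall back to a leap year when the year itself is suspect.
        last_day = calendar.monthrange(year if year_ok else 2000, month)[1]
        if not 1 <= day <= last_day:
            problems.append(f"day outside 1-{last_day}")
    return problems


def suspicious_years(years, max_gap=50):
    """Flag years lying more than max_gap years from the dataset's most common year."""
    most_common_year = Counter(years).most_common(1)[0][0]
    return sorted({y for y in years if abs(y - most_common_year) > max_gap})


print(check_date_parts(1990, 2, 30))
print(suspicious_years([1990] * 5 + [1909, 1909]))
```

Flagged years are only candidates for a question to the provider: "1909" in a 1990 dataset may be a typing error, but it could also be a genuine historical record.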
Units
• OBIS can capture:
– Counts
– Biomass
– Depth
• Are units defined?
– Counts: individuals per m², cm², liter, m³
– Biomass: wet weight, dry weight, ash-free dry weight
– Depth: meter, centimeter
“I collected 4 individuals of species X from location Z”
=> Sample size? 10 cm² - 50 cm² - 1 m² - …?
• Significance:
– Needs thorough documenting
– Know what you are dealing with
– Comparison
– Convert to OBIS standards
  • Depth: in meter, positive values
  • Abundance: NULL versus 0 (absence); positive values
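The conversion to the standards listed above might be sketched as follows. The function names and the unit table are hypothetical. Treating a negative depth as a sign-convention difference (and taking its absolute value) is an assumption; in practice that should be confirmed with the data provider.

```python
# Number of source units per meter (assumed unit labels).
UNITS_PER_METER = {"m": 1, "cm": 100}


def standardize_depth(value, unit):
    """Return depth in meters as a positive value, or None if no depth was given."""
    if value is None:
        return None
    return abs(value) / UNITS_PER_METER[unit]


def standardize_abundance(value):
    """Keep None (not recorded) distinct from 0 (absence); reject negative values."""
    if value is None:
        return None
    if value < 0:
        raise ValueError("abundance must be zero or positive")
    return value


print(standardize_depth(-250, "cm"))  # 2.5
print(standardize_abundance(None))    # None
print(standardize_abundance(0))       # 0
```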
Quality control procedures for OBIS data - automated
• Remember - technical:
– 18 quality control steps, on individual record level
– 10 outlier checks, on dataset or species level
– Each QC step = yes (1)/no (0) question
– Creation of a bit-sequence (2^(x-1), with x the number of the QC step)
  => stored as an integer value for the QC
  => unique value for each possible combination
• Some steps not (yet) available through web services
– Will be identified once data is in OBIS
– Harvest report will give indication of possibly erroneous records
• Geographic quality control – 3-dimensional check
– Latitude and longitude are not 0
– Latitude and longitude lie between -90/+90 and -180/+180
– Latitude and longitude fall within a sea area (20 km buffer)
– Depth value is possible
  Plot corresponding latitude-longitude on the General Bathymetric Chart of the Oceans (GEBCO)
  Compare GEBCO depth with actual sampling depth
  Take into account a 100 m margin
Taxon                      Given depth (m)   GEBCO depth (m)   Difference (m)
Desmoscolex                2080              510               + 1570
Halieutichthys aculeatus   110               1140              - 1030
Negative evaluation in QC
Needs to be looked at…
Pancake batfish => usually bottom-dwelling…
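The comparison between the reported sampling depth and the GEBCO depth could be sketched like this. The function name is hypothetical, and the rule that a record only fails when it lies deeper than the charted seabed plus the margin is an assumption about how the 100 m margin is applied.

```python
def depth_is_plausible(given_depth_m, gebco_depth_m, margin_m=100):
    """True if the sampling depth is not deeper than the charted seabed plus the margin."""
    return given_depth_m <= gebco_depth_m + margin_m


print(depth_is_plausible(2080, 510))   # False: 1570 m below the charted seabed
print(depth_is_plausible(110, 1140))   # True: within the water column
```

A record that passes this test can still deserve a closer look: a depth of 110 m where the seabed lies at 1140 m is physically possible, but unexpected for a species that usually lives on the bottom.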
Quality control procedures for OBIS data - outlier analyses
• Only performed on OBIS database (=global coverage)
• Geographic outliers – dataset level
– Analysis on dataset level
– Possible location outlier(s)
– Methodology based on centroid calculations and assuming normal distribution => not applicable for strongly asymmetric datasets…
– Communication with provider on results
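A centroid-based outlier check of the kind described above can be sketched as follows. This is not the published OBIS procedure: the arithmetic-mean centroid, the haversine distance and the threshold of 3 standard deviations are all assumptions made for illustration, and the simple centroid breaks down for datasets that straddle the 180° meridian.

```python
import math
import statistics


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in decimal degrees."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def possible_outliers(points, z_threshold=3.0):
    """Return indices of points unusually far from the dataset centroid."""
    centroid_lat = statistics.mean(lat for lat, _ in points)
    centroid_lon = statistics.mean(lon for _, lon in points)
    distances = [haversine_km(lat, lon, centroid_lat, centroid_lon) for lat, lon in points]
    mean_d = statistics.mean(distances)
    sd_d = statistics.pstdev(distances)
    if sd_d == 0:
        return []
    # Assumes distances are roughly normally distributed around their mean.
    return [i for i, d in enumerate(distances) if (d - mean_d) / sd_d > z_threshold]


# Twenty records in the southern North Sea plus one at 60°S.
points = [(51.0 + 0.05 * i, 2.0 + 0.05 * i) for i in range(20)] + [(-60.0, 2.5)]
print(possible_outliers(points))  # [20]
```

As the slide notes, a flagged point is only a possible outlier: it may be a data error or a genuine record from an undersampled area, which is why the result goes back to the provider.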
Dataset: “ICES Biological Community” (DOME)
[Map legend: centroid; no outlier; possible outlier]
– Some possible outliers were also identified as incorrect in the record-level check of lat-lon (= land)
– Others were not identified through the record-level check of lat-lon (= sea), but were seen as potential outliers through the geographic outlier check
Vandepitte et al. (2015). Fishing for data and sorting the
catch […]. Database. DOI: 10.1093/database/bau125
Provider communication:
- Antarctic locations are incorrect (data error)
- Northern locations are correct (sampling bias)
• Environmental outliers – species level
=> Check for outliers within the available distribution records of a species
=> Geography, depth, sea surface salinity (SSS), sea surface temperature (SST)
[Figure: Verruca stroemia (Crustacea: Cirripedia); panels: Depth and Geography; legend: centroid, no outlier, possible outlier]
Vandepitte et al. (2015)
Dubious results => additional verification:
- World Register of Marine Species: literature & expert-based species distribution information
- Expert: ecological information
Questions?
Hands-on sessions: Wednesday & Thursday
Vandepitte L., Hernandez F., Claus S., Vanhoorne B., De Hauwere N., Deneudt K., Appeltans W., Mees J. (2011). Analysing the content of the European Ocean Biogeographic Information System (EurOBIS): available data, limitations, prospects and a look at the future. Hydrobiologia 667(1): 1-14
Vandepitte L., Bosch S., Tyberghein L., Waumans F., Vanhoorne B., Hernandez F., De Clerck O., Mees J. (2015). Fishing for data and sorting the catch: assessing the data quality, completeness and fitness for use of data in marine biogeographic databases. Database (2015), 1-14 (doi: 10.1093/database/bau125)