Experience from Mapping Existing Models to the Transfer Schema
Robert Kukla
Introduction
• Three test databases:
– ITIS (plants part)
– Berlin Model (mosses/higher plants)
– Taxonomer (fishes)
• Imported into MySQL
• Java program to generate XML
• Three main aspects:
– Identifying concepts
– Extracting relationships
– Concept details
• No CharacterCircumscription, SpecimenCircumscription
• No hybrids as implications are not fully understood
ITIS
• Integrated Taxonomic Information System
• “authoritative” taxonomic information
• Continuously evolving:
– New records get added
– Existing records get updated (!)
• 331886 taxonomic units (97741 plants) – 206649 concepts
• Most explored DB
ITIS - Identifying Concepts
• ITIS’ own concepts (type = revision)
– taxonomic unit
– usage = “accepted”
• Synonyms (type = referenced)
– usage = “not accepted”
– referenced from synonym table
• Vernaculars (type = vernacular)
– from vernacular table
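
The three concept types above can be sketched as a small classification step. This is only an illustration in Python (the original program was written in Java against MySQL); the `Concept` class, the tuple layouts and the `v` prefix for vernacular ids are assumptions, and only the usage values and type names come from the slide.

```python
from dataclasses import dataclass


@dataclass
class Concept:
    source_id: str  # hypothetical identifier: a TSN or a vernacular row id
    name: str
    type: str       # "revision", "referenced" or "vernacular"


def concept_type(usage: str) -> str:
    """Map the ITIS usage flag to a concept type, as listed on the slide."""
    if usage == "accepted":
        return "revision"    # ITIS' own concept
    if usage == "not accepted":
        return "referenced"  # synonym, referenced from the synonym table
    raise ValueError(f"unexpected usage value: {usage!r}")


def concepts_from_units(units):
    """units: iterable of (tsn, name, usage) tuples (assumed layout)."""
    return [Concept(str(tsn), name, concept_type(usage))
            for tsn, name, usage in units]


def concepts_from_vernaculars(rows):
    """rows: iterable of (row_id, vernacular_name) tuples (assumed layout)."""
    return [Concept(f"v{row_id}", name, "vernacular") for row_id, name in rows]
```

Raising on an unknown usage value, rather than guessing a type, makes unexpected records visible early in the conversion.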
ITIS: Extracting Relationships
• Concept Circumscription
– parent_tsn field
• Synonymy Relationships
– Explicit synonyms
– Vernaculars
• Lineage Relationships
– to concept of same name according to different publication
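
A minimal sketch of how two of these relationship kinds might be derived, assuming the records have already been read into Python lists. The tuple layouts and function names are hypothetical; only `parent_tsn` and the "same name, different publication" rule come from the slide.

```python
from collections import defaultdict
from itertools import combinations


def circumscription(units):
    """units: list of (tsn, parent_tsn) pairs. Returns {parent_tsn: [child tsn, ...]}."""
    known = {tsn for tsn, _ in units}
    children = {}
    for tsn, parent in units:
        # Skip roots and parents that are not in the data set, so that no
        # relationship points at a non-existent concept.
        if parent is None or parent not in known:
            continue
        children.setdefault(parent, []).append(tsn)
    return children


def lineage_pairs(concepts):
    """concepts: list of (concept_id, name, publication_id) tuples.

    Returns pairs of concept ids that share a name but cite different publications.
    """
    by_name = defaultdict(list)
    for concept_id, name, publication_id in concepts:
        by_name[name].append((concept_id, publication_id))
    pairs = []
    for members in by_name.values():
        for (a, pub_a), (b, pub_b) in combinations(members, 2):
            if pub_a != pub_b:
                pairs.append((a, b))
    return pairs
```

Dropping dangling parents here is one possible policy; reporting them to a taxonomic expert would be an equally reasonable choice.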
ITIS – Concept Details
• Names:
– up to 4 epithets (only 3 used) plus 4 category indicators to be interpreted depending on rank
– authorTeam from separate table
– NameSimple calculated
• Publications:
– Multiple publications per taxon_unit
– Not completely atomised - compromise
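
One plausible way to calculate NameSimple from the atomised parts is to interleave indicators and epithets and skip whatever is empty. The sketch below assumes the parts arrive as ordered (indicator, epithet) pairs; it does not attempt the rank-dependent interpretation of the indicators mentioned above, and the sample name is a placeholder.

```python
def name_simple(parts):
    """parts: ordered (indicator, epithet) pairs; either element may be None or blank."""
    tokens = []
    for indicator, epithet in parts:
        for token in (indicator, epithet):
            if token and token.strip():
                tokens.append(token.strip())
    return " ".join(tokens)


# Placeholder name with three used epithets and an unused fourth slot.
example = name_simple([(None, "Aus"), (None, "bus"), ("var.", "cus"), (None, None)])
```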
Berlin Model Mosses/(German Higher Plants)
• Database of Taxonomic Concepts
– Records will not change
– Explicit concept relationships + (name-)synonymy
– 24368 concepts – 24368 concepts
Berlin Model - Identifying Concepts
• From table pTaxon
Taxonomer
• Relational data model for managing
information relevant to taxonomic research
• Records get added; not changed
• “Assertion” – mention of a taxonomic
name in the taxonomic literature
• “Protonym” – taxonomic name in the
context of its first publication
• Relationships between assertions
• 36305 assertions – 14971 concepts
Taxonomer - Identifying Concepts
• Concepts (type=referenced)
– from table tbl_Assertions
– ReliabilityID >= 4 (4 = revision, 5 = original/new combination)
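
The selection rule above amounts to a single filtered query. The sketch uses an in-memory SQLite database as a stand-in for the MySQL import; `tbl_Assertions` and `ReliabilityID` come from the slide, while the `AssertionID` column and the sample rows are assumptions.

```python
import sqlite3


def referenced_concept_ids(conn, min_reliability=4):
    """Return ids of assertions reliable enough to be exported as concepts."""
    cur = conn.execute(
        "SELECT AssertionID FROM tbl_Assertions "
        "WHERE ReliabilityID >= ? ORDER BY AssertionID",
        (min_reliability,),
    )
    return [row[0] for row in cur]


# Stand-in data: three assertions with reliability 2, 4 and 5.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl_Assertions (AssertionID INTEGER PRIMARY KEY, ReliabilityID INTEGER)"
)
conn.executemany("INSERT INTO tbl_Assertions VALUES (?, ?)", [(1, 2), (2, 4), (3, 5)])
concept_ids = referenced_concept_ids(conn)
```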
Taxonomer – Extracting Relationships
• ConceptCircumscription
– ParentAssertionID
• Relationships
– Table not populated
Taxonomer – Concept Details
• Number of fields in the database suggested a complexity that was not supported by the data (not all fields filled)
• Atomised name difficult to recreate as only terminal epithet is stored – omitted it
• Use of cheat fields for NameSimple
• Large number of AccordingTo (>4000)
• Publication data transferred 1:1
Technical Aspects
• Database consistency e.g.
– getting all publication records
– no relationships to non-existent concepts
• Charset
– assume Windows-1252 code page
• Slow!
– indexes essential
– fewer queries with big result sets are faster
• Recursive approach is more suitable for wrapper
– guarantees small, consistent subset
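
Two of the points above lend themselves to a short sketch: decoding text under the Windows-1252 assumption, and the recursive approach, which starts from one concept and follows its relationships so that everything it refers to ends up in the exported subset. The function names and the shape of the `related` mapping are hypothetical.

```python
def decode_field(raw: bytes) -> str:
    """Decode a raw database field, assuming the Windows-1252 code page."""
    return raw.decode("cp1252")


def collect_subset(start_id, related, seen=None):
    """Recursively collect every concept reachable from start_id.

    related: dict mapping a concept id to the ids of the concepts it refers to.
    The result is closed under `related`, so no exported relationship can
    point outside the subset.
    """
    if seen is None:
        seen = set()
    if start_id in seen:
        return seen  # already visited; also stops cycles
    seen.add(start_id)
    for target in related.get(start_id, []):
        collect_subset(target, related, seen)
    return seen


# Concept 4 refers to 1, but nothing reachable from 1 refers to 4.
related = {1: [2], 2: [3, 1], 4: [1]}
subset = collect_subset(1, related)
```

For very deep hierarchies Python's default recursion limit could be reached; an explicit stack would avoid that at the cost of a slightly longer function.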
Mapping software
• Universal transformation software to convert relational data to XML (XMlizer)
– Often GUI based; filling in a skeleton XML file
– Relate a single query (table or join) to collection of XML nodes
– Map fields from that query to attributes or child elements of the XML node
• Problems
– No mechanism to use multiple sources (queries) for one XML node
– No conditional transformation
– No splitting of fields
– Limited merging of fields
• Write our own universal mapping software
– addresses first 2 problems
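
A rough sketch of what a mapper addressing the first two problems could look like: one row set is related to a collection of XML nodes, fields are mapped to attributes or child elements, a condition can suppress a node, and extra sources can add to the same node. It is written in Python with the standard library rather than the Java actually used, and every function, field and element name in it is hypothetical (the element names are not taken from the transfer schema).

```python
import xml.etree.ElementTree as ET


def build_nodes(parent, tag, rows, attributes, children,
                condition=None, extra_sources=()):
    """Create one <tag> element under parent for each row that passes condition.

    attributes / children: dicts mapping a field name to an attribute name
    or a child element name. extra_sources: callables (node, row) that add
    content from further queries to the same node.
    """
    for row in rows:
        if condition is not None and not condition(row):
            continue  # conditional transformation
        node = ET.SubElement(parent, tag)
        for field, attr in attributes.items():
            node.set(attr, str(row[field]))
        for field, child_tag in children.items():
            if row.get(field) is not None:
                ET.SubElement(node, child_tag).text = str(row[field])
        for source in extra_sources:
            source(node, row)  # multiple sources for one node
    return parent


# Placeholder data standing in for two query results.
rows = [{"id": 1, "name": "Aus bus", "usage": "accepted"},
        {"id": 2, "name": "Aus cus", "usage": "unknown"}]
publications = {1: ["Pub A"]}


def add_publications(node, row):
    for title in publications.get(row["id"], []):
        ET.SubElement(node, "AccordingTo").text = title


root = build_nodes(ET.Element("Concepts"), "Concept", rows,
                   attributes={"id": "id"}, children={"name": "Name"},
                   condition=lambda r: r["usage"] == "accepted",
                   extra_sources=[add_publications])
```

Splitting and merging of fields, the remaining two problems, are deliberately left out of this sketch.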
Conclusion
• Conversion of legacy data is possible but
– information missing
– information will be lost
• Data in original DB is open to interpretation so expert should be consulted
• Required computing resources should not be underestimated