Narang Knuepffer Poster Mansfeld DB

Standardizing Mansfeld's World Database of Agricultural and Horticultural
Crops by Implementing a Concept-Based Data Model
Ram Narang and Helmut Knüpffer
Leibniz Institute of Plant Genetics and Crop Plant Research, D-06466 Gatersleben, Germany
[email protected]
Introduction
The Berlin Data Model for Taxonomic Information
The integration of species-related information from multiple sources in federated information systems or
web portals is complicated by the different taxonomic approaches those sources use. Many global and local
taxonomic databases, among them ITIS and Species2000, provide information about species, based on
a single taxonomic view, where information is attached to a single accepted (or preferred) name.
Taxonomic opinions and standards vary with time, place, and investigator, and depend upon many
factors like geographical range of study, interpretation of collected specimens, the fossil record,
morphology, genetics and molecular phylogeny. New classifications may arise from more detailed
studies of specimens, the discovery of new taxonomic information, or the description of new species
and groupings. Consequently, biological taxa often have multiple names, which in turn may have been
applied to multiple taxon concepts. When combining such data from diverse sources into a single
database or portal, one needs to reconcile those different standards. In addition, the increasing use of
DNA sequence comparison as a tool to analyse phylogenetic relationships is accelerating the rate of
taxonomic revision, which is thus unlikely to stabilize in the foreseeable future. Therefore, the availability
and implementation of a data model representing multiple, alternative taxonomic views is crucial for
sound taxonomic information management.
Mansfeld’s World Database of Agricultural and Horticultural Crops
The Mansfeld Database (http://mansfeld.ipk-gatersleben.de) is an online database developed at IPK
since 1998, initially as a contribution to the project “Federal Information System on Genetic Resources”
(BIG, http://www.big-flora.de/). It reflects the contents of “Mansfeld’s Encyclopedia of Agricultural and
Horticultural Crops” (Hanelt and IPK 2001) and contains information on ca. 6,100 crop plant species,
excluding forestry and ornamental plants. Each species entry provides nomenclature and synonymy,
common names in different languages, the distribution of the species in the wild and regions of
cultivation, uses, images and references, as well as the ancestral species and notes on phylogeny,
variation and history.
Originally developed under Microsoft Visual FoxPro, the Mansfeld Database has recently been migrated
to the database platform Oracle 10g, and the procedures for the web interface were re-programmed.
Various data models have been developed to support the representation of multiple, alternative
taxonomic views in taxonomic databases (cf. Kennedy et al. 2006), among them the Berlin Model
(Berendsohn et al. 2003), based on the IOPI model. The Berlin Model allows alternative
taxonomic concepts (potential taxa) to be used for species information. A number of projects, such as the
Euro+Med PlantBase, AlgaTerra, MoReTax, the IOPI Global Plant Checklist, the Dendroflora of El
Salvador and Med-Checklist, implemented the core of the Berlin Model as a taxonomic backbone for
their databases and contributed to its continuous development and optimization
(http://www.bgbm.org/biodivinf/docs/bgbm-model/). In addition, the Berlin Model is the underlying model
of several tools dedicated to taxonomic data management such as taxonomic revisions, data import
from external sources, data integrity checking and data publishing on the World Wide Web.
The Core of the Berlin Model contains four central functional sections: (1) Taxon Names, (2) Potential
Taxon (taxonomic concepts), (3) Facts and (4) References. Taxon names are the botanical names
according to the International Code of Botanical Nomenclature (ICBN). The combination of such a name
with a reference forms a taxonym (or potential taxon, taxon concept). An auxiliary section, Authors,
assembles author teams for the nomenclatural references. Finally, the fact component can be used to
store any kind of factual information.
[Figure: core of the Berlin Model, with the elements Name, Reference, Taxon Concept, Relation and Facts]
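The idea that a name combined with a reference forms a potential taxon, to which facts are then attached, can be illustrated with a minimal Python sketch. This is not the Berlin Model schema itself: the class names, the field names and the placeholder values ("Aus bus L.", "Author A 1990", "Author B 2005") are assumptions chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaxonName:
    full_name: str  # botanical name with author citation (placeholder data below)
    rank: str


@dataclass(frozen=True)
class Reference:
    citation: str


@dataclass(frozen=True)
class PotentialTaxon:
    """A taxon concept: a name used in the sense of one particular reference."""
    name: TaxonName
    sec: Reference  # the reference according to which the name is applied


@dataclass(frozen=True)
class Fact:
    taxon: PotentialTaxon  # facts hang on the concept, not on the bare name
    category: str
    text: str


def facts_for(concept, facts):
    """Return only the facts attached to exactly this concept."""
    return [f for f in facts if f.taxon == concept]


# Hypothetical placeholder data: one name, two alternative taxonomic views.
name = TaxonName("Aus bus L.", "species")
concept_a = PotentialTaxon(name, Reference("Author A 1990"))
concept_b = PotentialTaxon(name, Reference("Author B 2005"))
facts = [Fact(concept_a, "use", "placeholder text on uses")]

print(concept_a == concept_b)            # same name, different concepts
print(len(facts_for(concept_a, facts)))  # facts follow the concept
```

Because facts reference the concept rather than the name, information recorded under one taxonomic view is not silently carried over to another view that happens to use the same name.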
Basic data integrity rules in the Berlin Model are
implemented at the level of tables, keys, and relations
within the database model. For example, the rule that
every botanical name should have a rank can be
assured with a foreign key to the table defining the list
of valid ranks. More complex rules and functions, e.g. to
construct syntactically correct botanical names, are
implemented using stored procedures and trigger
functions. Triggers are functions executed automatically
when certain database events occur. For example, one
of the triggers automatically rebuilds an author team
when one of its author names is changed.
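The two kinds of rules described above can be sketched with Python's built-in sqlite3 module. This is only an illustration under stated assumptions: the Berlin Model itself runs on MS SQL Server and Oracle, and the table names, column names and placeholder values used here are hypothetical, not those of the actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only on request
conn.executescript("""
CREATE TABLE rank (rank_id INTEGER PRIMARY KEY, label TEXT NOT NULL UNIQUE);
CREATE TABLE name (
    name_id   INTEGER PRIMARY KEY,
    name_text TEXT NOT NULL,
    rank_id   INTEGER NOT NULL REFERENCES rank(rank_id)  -- every name needs a valid rank
);
CREATE TABLE author (author_id INTEGER PRIMARY KEY, abbrev TEXT NOT NULL);
CREATE TABLE author_team (team_id INTEGER PRIMARY KEY, team_text TEXT);
CREATE TABLE author_team_member (
    team_id   INTEGER NOT NULL REFERENCES author_team(team_id),
    author_id INTEGER NOT NULL REFERENCES author(author_id),
    seq       INTEGER NOT NULL
);
-- Rebuild the cached team string whenever a member's abbreviation changes.
CREATE TRIGGER rebuild_team AFTER UPDATE OF abbrev ON author
BEGIN
    UPDATE author_team
    SET team_text = (
        SELECT group_concat(abbrev, ' & ')
        FROM (SELECT a.abbrev
              FROM author_team_member m
              JOIN author a ON a.author_id = m.author_id
              WHERE m.team_id = author_team.team_id
              ORDER BY m.seq)
    )
    WHERE team_id IN (SELECT team_id FROM author_team_member
                      WHERE author_id = NEW.author_id);
END;
""")

# Rule 1: a name with an unknown rank is rejected by the foreign key.
conn.execute("INSERT INTO rank VALUES (1, 'species')")
conn.execute("INSERT INTO name VALUES (1, 'Aus bus', 1)")
try:
    conn.execute("INSERT INTO name VALUES (2, 'Aus cus', 99)")
    fk_rejected = False
except sqlite3.IntegrityError:
    fk_rejected = True

# Rule 2: changing an author abbreviation rebuilds the team string.
conn.executemany("INSERT INTO author VALUES (?, ?)", [(1, "A."), (2, "B.")])
conn.execute("INSERT INTO author_team VALUES (1, 'A. & B.')")
conn.executemany("INSERT INTO author_team_member VALUES (1, ?, ?)", [(1, 1), (2, 2)])
conn.execute("UPDATE author SET abbrev = 'Bee' WHERE author_id = 2")
team_text = conn.execute(
    "SELECT team_text FROM author_team WHERE team_id = 1").fetchone()[0]
print(fk_rejected, team_text)
```

Keeping the simple rule declarative (a foreign key) and the derived value procedural (a trigger) mirrors the division of labour described in the text.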
[Figure: Entity-relationship model of the potential taxon (concept-oriented database core). Labelled elements: Taxon Name, Reference Title, Potential Taxon Name, Reference Status Assignment, Assigned Status, Taxon Rank. Labelled relationships: "is accepted name", "is classified", "is higher taxon in classification", "assigns accepted name", "gives status and other taxonomic information of".]
Implementing the Multiple Taxonomic Concepts Model
[Figure: Web screenshots of the Mansfeld Database before the transformation to the Berlin Model]
In a first step of implementation, the latest version of the Berlin Core Model, a database model under
MS SQL Server, was migrated to Oracle 10g. All database procedures, functions and triggers that
implement taxonomic logic were translated into their PL/SQL equivalents.
Nomenclatural and bibliographical data of the Mansfeld Database were atomised using Java
programs. The parsed information was tagged and stored in an XML file. The resulting soft-schema
XML file was read with JDOM and corrected manually (a time-consuming task) to produce a strict-schema
XML file, which was used to populate the tables in the Taxon, Reference and Potential Taxon
sections of the Berlin taxonomic model. After completion of the taxonomic core, the remaining
information from the Mansfeld Database, such as textual information on geographical distribution and
uses, was linked to the potential taxon as factual data. Finally, the web interface was reprogrammed to work with the new data model.
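The atomisation step can be pictured with a small sketch. The project used Java programs and JDOM; Python is substituted here purely for illustration. The layout of the input string, the regular expression and the XML element names are assumptions, and entries that do not match are only flagged, standing in for the manual correction step.

```python
import re
import xml.etree.ElementTree as ET

# Assumed input layout: "Genus epithet Author, Publication (year) page"
PATTERN = re.compile(
    r"^(?P<genus>[A-Z][a-z]+)\s+(?P<epithet>[a-z-]+)\s+"
    r"(?P<author>[^,]+),\s*(?P<publication>.+?)\s*"
    r"\((?P<year>\d{4})\)\s*(?P<page>\d+)\.?$"
)


def atomise(raw):
    """Split one nomenclatural string into tagged parts.

    Unparsed entries keep their raw text and are marked parsed="false",
    so that they can be found and corrected by hand later.
    """
    entry = ET.Element("name")
    ET.SubElement(entry, "raw").text = raw
    match = PATTERN.match(raw.strip())
    if match is None:
        entry.set("parsed", "false")
        return entry
    entry.set("parsed", "true")
    for field, value in match.groupdict().items():
        ET.SubElement(entry, field).text = value
    return entry


parsed = atomise("Triticum aestivum L., Sp. Pl. (1753) 85")
unparsed = atomise("Triticum sp.")

root = ET.Element("names")
root.extend([parsed, unparsed])
print(ET.tostring(root, encoding="unicode"))
```

A tolerant first pass of this kind corresponds to the soft-schema file; only after every flagged entry has been resolved would the data satisfy a strict schema suitable for loading into the database.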
Taxonomy Module of the Mansfeld Database
Like many other global taxonomic checklists, the Mansfeld Database represents a single taxonomic
view of nomenclatural information. It incorporates classifications that have gained broad acceptance in
taxonomic literature and by taxonomists working with the taxa concerned, and thus offers the
opportunity of standardizing scientific nomenclature and taxonomy for cultivated plant species.
Alternative taxonomic views (reflected by phrases such as sensu, emend., etc.) are presently stored as
part of the nomenclatural reference. Similarly, authors and bibliographical references are not yet
atomized into individual attributes. These information items need to be parsed and abstracted into the
entity-relationship model to allow a conceptual view on the taxon.
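As a rough illustration of abstracting a taxonomic view out of a nomenclatural reference, the sketch below splits a citation at a concept marker. The marker list, the function name and the placeholder citation are assumptions; real citations are far more varied and would need additional rules and manual checking.

```python
import re

# Assumed list of markers that introduce a taxonomic view; both spellings
# "emend." and "amend." are accepted.
MARKER = re.compile(r"\s+(sensu|emend\.|amend\.)\s+")


def split_concept(citation):
    """Return (nomenclatural part, marker, concept reference).

    If no marker is present, the marker and concept reference are None.
    """
    parts = MARKER.split(citation, maxsplit=1)
    if len(parts) == 1:
        return citation.strip(), None, None
    nomenclatural, marker, concept_ref = parts
    return nomenclatural.strip(), marker, concept_ref.strip()


# Hypothetical placeholder citations, not taken from the database.
print(split_concept("Aus bus L. sensu Author A 1990"))
print(split_concept("Aus bus L."))
```

The nomenclatural part would feed the name and its reference, while the marker and the trailing reference identify the concept in whose sense the name is used.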
[Figure: Mansfeld Database – Taxonomy module. Table diagram of the original schema (field names in German), including the tables botnam, taxa, syno, syn_stat, pp_stat, taxrang, gruppe, soi, taxa_soi, autoren, publikat, publ_bphtl, volksnam, vnam_tax, vnamtyp and dubl_botnam, with their primary keys, foreign keys and indexes.]
[Figure: Implementation steps: (I) XML, soft schema; (II) XML, strict schema; (III) conceptual database model]
Outlook
The implementation of the Berlin Model in the Mansfeld Database facilitates standardisation and
improves the quality of the taxonomic information by increasing accuracy, resolution and interpretability.
In addition, existing standard taxonomy management tools such as web editors can be adapted to
work with the new conceptual Mansfeld Database model for updating the contents of the
database. The detailed information on ca. 6,100 species of agricultural and horticultural crop plants will thus
become more easily accessible to global portals on biodiversity information.
The Encyclopedia of Life (http://www.eol.org), launched in 2007, is developing “species
pages” for all known organisms, with content to be provided and edited by experts
from all over the world using a wiki-like editor. Its initial content is being gathered
from existing web resources. The rich information in the Mansfeld Database on >6,000 of the
economically most important plant species was offered for inclusion at the
EoL Plant Species Pages Meeting (St. Louis, Missouri, 31 October to 2 November 2007).
The Global Biodiversity Information Facility (http://www.gbif.org) aims to provide
free access to biodiversity information on the web, using standardised web services.
GBIF has approached the Mansfeld Database developers to make the database's ca.
38,000 common names of crop plant species in many languages available to GBIF, as a
first step towards an interface that would allow the world’s biodiversity data to be
queried via common names as well as scientific names. Integrating the Mansfeld
Database fully into GBIF would also make its rich crop species information accessible
along with data from other providers of taxon-related data.
References
Berendsohn, W.G., M. Döring, M. Geoffroy, K. Glück, A. Güntsch, A. Hahn, W.-H. Kusber, J.L. Li, D. Röpert and F. Specht.
2003. The Berlin Model: a concept-based taxonomic information model. Pp. 15-26 in Berendsohn, W.G. (ed), MoReTax.
Handling Factual Information Linked to Taxonomic Concepts in Biology. Schriftenreihe für Vegetationskunde 39, Bonn.
Hanelt, P. and Institute of Plant Genetics and Crop Plant Research (eds), 2001. Mansfeld’s Encyclopedia of Agricultural and
Horticultural Crops (Except Ornamentals). 6 vols. 1st Engl. ed. Springer, Berlin, Heidelberg, New York, etc. (LXX+3645 pp.)
Kennedy, J., R. Hyam, R. Kukla and T. Paterson, 2006. Standard data model representation for taxonomic information.
OMICS. A Journal of Integrative Biology 10 (Special Issue on Data Standards), 220-230.