Complex Data Transformations in Digital Libraries with Spatio-Temporal Information
B. Martins, N. Freire, J. Borbinha
Instituto Superior Técnico, Technical University of Lisbon
2008 International Conference on Asia-Pacific Digital Libraries
Introduction and Motivation
• The DIGMAP project addressed the development of a digital
library for materials related to old maps
– Collecting metadata from different providers (e.g. OAI-PMH servers)
– Processing the metadata and enriching it with inferred spatio-temporal information
• Challenges in handling heterogeneous metadata
– Transforming the original sources into the DIGMAP format (i.e., TEL profile)
– Dealing with data inconsistency, non-uniformity, incorrectness and incompleteness
– Handling the spatio-temporal information (e.g. dates and geospatial coordinates)
• Challenges in DIGMAP service interoperability
– Using the results from DIGMAP services to enrich the metadata
• DIGMAP required appropriate XML processing technology
for dealing with the above challenges
The Proposed Solution
• Use XML processing languages like XSLT and XQuery
• Extend the XPath 2.0 function library
– Functions for managing geospatial information
– Functions for managing temporal information
– Functions for text processing
– Other miscellaneous functions
• All the advantages of declarative languages like XSLT
and XQuery, together with powerful methods for
handling complex transformations
Outline
• Introduction
• Proposed Extensions to the XPath Function Library
• Implementation Issues
• Test Cases Within the DIGMAP Project
• Conclusions and Future Work
The Proposed Extensions
• Extensions for geospatial data handling
– Combining spatial elements according to geospatial predicates such as distance or intersection
– Input given in GML, KML or textual strings with geospatial coordinates
• Extensions for temporal reasoning
– Combining temporal information according to the predicates of Allen’s Algebra for temporal intervals
– Input given in GML or string encodings (e.g. the ISO 8601 formats)
• Extensions for text mining
– Keyword matching and textual similarity
– Standard text mining operations (e.g. language recognition)
• Other miscellaneous extensions
– Handling JDBC calls and calls to external Web services
Geospatial Data Handling
• Operators for performing geospatial analysis based on the OGC Simple
Features and Filter Encoding specifications
– Distance, union, intersection or difference between two geometries
– Validity of a given spatial filter
– Check if two geometries are spatially related (e.g. containment or overlap)
– Check if two geometries fall below a given distance threshold
– Area, length, buffer, centroid, boundary or envelope of a geometry
– Geometric computations (e.g. translation or scaling) over a geometry
– Conversion between GML, KML, C-Square, Geohash or WKT encodings
– Transformations on the coordinate systems used in geometries
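As a rough illustration of two of the operators listed above (distance between geometries and the distance-threshold check), here is a minimal sketch restricted to points given as WGS84 latitude/longitude pairs and to axis-aligned envelopes. It uses the haversine formula on a spherical Earth rather than a full geometry library, so it only approximates what a real implementation would return. The class name `GeoSketch`, its method names and the envelope array layout are assumptions made for this example and are not taken from DIGMAP.

```java
// Sketch only: point and envelope checks on WGS84 coordinates, standard library only.
// All names here are hypothetical and chosen for illustration.
final class GeoSketch {
    // Mean Earth radius in kilometres (spherical approximation).
    private static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle (haversine) distance in km between two lat/lon points given in degrees. */
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against rounding errors before asin.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    /** True if the two points are at most thresholdKm apart. */
    static boolean withinDistance(double lat1, double lon1, double lat2, double lon2,
                                  double thresholdKm) {
        return distanceKm(lat1, lon1, lat2, lon2) <= thresholdKm;
    }

    /** Envelopes are {minLon, minLat, maxLon, maxLat}; true if they share at least one point. */
    static boolean envelopesIntersect(double[] a, double[] b) {
        return a[0] <= b[2] && b[0] <= a[2] && a[1] <= b[3] && b[1] <= a[3];
    }

    /** True if the inner envelope lies entirely inside the outer one (borders included). */
    static boolean envelopeContains(double[] outer, double[] inner) {
        return outer[0] <= inner[0] && outer[1] <= inner[1]
            && outer[2] >= inner[2] && outer[3] >= inner[3];
    }
}
```

Exposed as an extension function, a check such as `withinDistance` could filter records whose coordinates fall near a gazetteer entry; polygons, coordinate system transformations and encodings such as GML or KML are out of scope for this sketch.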
Temporal Data Handling
• Operators for temporal analysis based on Allen's interval algebra
– Distance, union, intersection or difference between temporal intervals
– Check if two intervals are related (e.g. containment or overlap)
• Other operators for temporal data handling
– Compute lengths for temporal intervals (e.g. return seconds or years)
– Conversion between GML and string encodings
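To make the interval predicates concrete, the following sketch classifies a pair of intervals into one of the thirteen relations of Allen's interval algebra and computes an interval length in a chosen unit. It assumes intervals are written as two ISO 8601 calendar dates separated by `/` (for example `1500-01-01/1600-01-01`) and that every interval has a start strictly before its end. The names `AllenSketch`, `Interval` and `relate` are hypothetical, and GML input is not covered.

```java
import java.time.LocalDate;
import java.time.temporal.ChronoUnit;

// Sketch only: Allen's thirteen interval relations over calendar dates.
final class AllenSketch {
    enum Relation {
        BEFORE, MEETS, OVERLAPS, STARTS, DURING, FINISHES, EQUALS,
        FINISHED_BY, CONTAINS, STARTED_BY, OVERLAPPED_BY, MET_BY, AFTER
    }

    static final class Interval {
        final LocalDate start;
        final LocalDate end;

        Interval(LocalDate start, LocalDate end) {
            if (!start.isBefore(end)) {
                throw new IllegalArgumentException("start must precede end");
            }
            this.start = start;
            this.end = end;
        }

        /** Parses "start/end" where both parts are ISO 8601 dates (yyyy-MM-dd). */
        static Interval parse(String text) {
            String[] parts = text.split("/");
            if (parts.length != 2) {
                throw new IllegalArgumentException("expected start/end: " + text);
            }
            return new Interval(LocalDate.parse(parts[0]), LocalDate.parse(parts[1]));
        }

        /** Length of the interval in whole units, e.g. ChronoUnit.DAYS or ChronoUnit.YEARS. */
        long lengthIn(ChronoUnit unit) {
            return unit.between(start, end);
        }
    }

    /** Returns the relation that holds from a to b. */
    static Relation relate(Interval a, Interval b) {
        int startVsStart = a.start.compareTo(b.start);
        int endVsEnd = a.end.compareTo(b.end);
        int endVsStart = a.end.compareTo(b.start);
        int startVsEnd = a.start.compareTo(b.end);
        if (endVsStart < 0) return Relation.BEFORE;
        if (endVsStart == 0) return Relation.MEETS;
        if (startVsEnd > 0) return Relation.AFTER;
        if (startVsEnd == 0) return Relation.MET_BY;
        if (startVsStart == 0 && endVsEnd == 0) return Relation.EQUALS;
        if (startVsStart == 0) return endVsEnd < 0 ? Relation.STARTS : Relation.STARTED_BY;
        if (endVsEnd == 0) return startVsStart > 0 ? Relation.FINISHES : Relation.FINISHED_BY;
        if (startVsStart > 0 && endVsEnd < 0) return Relation.DURING;
        if (startVsStart < 0 && endVsEnd > 0) return Relation.CONTAINS;
        return startVsStart < 0 ? Relation.OVERLAPS : Relation.OVERLAPPED_BY;
    }
}
```

Coarser checks such as "the two intervals are related by containment or overlap" can then be expressed as membership of the returned relation in a chosen subset.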
Textual Data Handling
• Keyword matching and textual similarity
– Tokenization and keyword-based search
– Phonetic similarity (Soundex and Double Metaphone)
– String similarity (e.g. Edit Distance, Jaro, Jaro-Winkler, Q-grams, …)
• Standard text mining operations
– Language recognition
– Keyword extraction (statistically significant keywords)
– Named entity recognition (regexp, dictionaries or machine learning)
– Text classification (machine learning)
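Of the string similarity measures named above, edit distance is the simplest to show. The sketch below computes the Levenshtein distance with a two-row dynamic programming table and derives a similarity score in the range 0 to 1 from it. It is written against the standard library only; the class and method names are assumptions for this example, and the normalisation by the longer string's length is one common choice among several.

```java
// Sketch only: Levenshtein edit distance and a derived similarity score.
final class TextSimilaritySketch {

    /** Minimum number of single-character insertions, deletions or substitutions. */
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        // Distance from the empty prefix of a to each prefix of b.
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            // Reuse the two rows instead of allocating a full matrix.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** 1.0 for identical strings, 0.0 when every position differs. */
    static double similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        return longest == 0 ? 1.0 : 1.0 - (double) editDistance(a, b) / longest;
    }
}
```

A score like this can help flag near-duplicate place names, for instance `Lisboa` and `Lisbon`, which are one substitution apart.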
Miscellaneous Functions
• Calling external Web services (REST and SOAP)
• Conversion from XML to JavaScript Object Notation (JSON)
• Handling Java DataBase Connectivity (JDBC) calls
• Reading malformed HTML
• Converting MARC formats into XML (MarcXml or MarcXchange)
• …
Implementation Issues
• Proposed extensions implemented on top of SAXON
– SAXON is an open source XSLT/XQuery processor
– Extension functions coded in Java (static methods)
– Extension functions called by binding the Java class to a specific namespace
– SAXON takes care of converting the arguments to make the functions fit
• Most extensions are wrappers over existing open-source libraries
– GeoTools and Java Topology Suite (JTS) for the geospatial functions
– Lucene and Nux for keyword matching
– SimPack for textual similarity
– NGramJ and LingPipe for text mining
– MARC4J for metadata crosswalks (i.e. handling MARC formats)
– Apache AXIS for external Web service calls
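The binding mechanism described above can be sketched with a small static method. Saxon's Java extension mechanism lets a query bind a prefix to a namespace URI of the form `java:` followed by the fully qualified class name, and maps hyphenated function names to camelCase method names; whether this mechanism is available depends on the Saxon edition and version in use. The class `KeywordExtensions`, the package `org.example` mentioned in the comment and the function itself are hypothetical and do not reproduce the actual DIGMAP extensions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/*
 * Sketch only: a static method written so that it could be called from XQuery.
 * Assuming the class were compiled as a public class in a hypothetical package
 * org.example, a query could bind and call it roughly like this:
 *
 *   declare namespace kw = "java:org.example.KeywordExtensions";
 *   //record[kw:contains-keyword(string(title), "lisboa")]
 *
 * The class is package-private here only to keep the snippet in a single unit;
 * for use from Saxon it would need to be declared public in its own file.
 */
final class KeywordExtensions {

    /** Case-insensitive whole-token match; strings and booleans map onto XPath types. */
    public static boolean containsKeyword(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        for (String token : tokenize(text)) {
            if (token.equals(needle)) {
                return true;
            }
        }
        return false;
    }

    /** Splits on anything that is not a letter or digit and drops empty tokens. */
    private static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Keeping the exposed signature to simple types such as strings and booleans leaves the argument conversion to the processor, which is the division of labour the slide describes.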
Test Cases Within DIGMAP
• Conversion between different metadata standards
– Converting UNIMARC, MARC21 and other formats into the DIGMAP format
– Geospatial coordinates were often given originally in general textual fields
– DIGMAP currently indexes over 40,000 metadata records from different sources
• Wrappers around DIGMAP XML service interfaces
– The DIGMAP Gazetteer uses formats like Alexandria DL Gazetteer Service format, KML, geoRSS, …
– The DIGMAP GeoParser uses formats like SpatialML, geoRSS, OGC GeoParser, …
– Converting between the different formats and calling the services for processing the metadata records
• Internal development of several DIGMAP services
– Data integration within the DIGMAP Gazetteer
– Convert different input sources into the Alexandria DL Gazetteer Content Standard
– Handling duplicates and small corrections to the data
• The proposed approach was found to be expressive and computational
performance was within acceptable bounds
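The case of coordinates buried in general textual fields can be illustrated with a small extraction routine. The sketch below scans free text for hemisphere letters followed by degrees, optional minutes and optional seconds, and converts each match to signed decimal degrees. The input notation it accepts, for example `W 9°30'--W 6°00'/N 42°15'--N 37°00'`, is an assumption about how such statements might look in a catalogue record; the class name, method name and pattern are hypothetical and not taken from DIGMAP.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only: pulling degree/minute/second coordinates out of free text.
final class CoordinateTextSketch {
    // Hemisphere letter, degrees, then optional minutes (') and seconds (").
    // The degree sign is written as a Unicode escape to avoid source-encoding issues.
    private static final Pattern DMS = Pattern.compile(
        "\\b([NSEW])\\s*(\\d{1,3})\u00B0\\s*(?:(\\d{1,2})')?\\s*(?:(\\d{1,2})\")?");

    /** Returns every coordinate found, in text order, as signed decimal degrees. */
    static List<Double> extractDecimalDegrees(String text) {
        List<Double> values = new ArrayList<>();
        Matcher m = DMS.matcher(text);
        while (m.find()) {
            double degrees = Integer.parseInt(m.group(2));
            if (m.group(3) != null) {
                degrees += Integer.parseInt(m.group(3)) / 60.0;
            }
            if (m.group(4) != null) {
                degrees += Integer.parseInt(m.group(4)) / 3600.0;
            }
            // South and West are negative by the usual convention.
            boolean negative = m.group(1).equals("S") || m.group(1).equals("W");
            values.add(negative ? -degrees : degrees);
        }
        return values;
    }
}
```

The routine does not validate ranges or pair longitudes with latitudes; in a real crosswalk those steps would follow before the values are written into the target format.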
An Example XQuery
An XQuery for reading gazetteer data from an HTML source and converting the data
into the Alexandria DL Gazetteer Content format
Conclusions
• Data transformations in Digital Libraries can be very complex
– Standard XML processing technology is often not enough
– But simple extensions can add the required extra functionality
• We propose using extension functions to the XPath 2.0 library
– Declarative syntax of XSLT and XQuery is not affected
– Extension functions add the required extra functionality
• Used in DIGMAP collection building and service composition
– Converting between different metadata formats
– Handling the spatio-temporal information included in the metadata
– Calling DIGMAP services to enrich the metadata records
Currently Ongoing Work
• Implementing a visual interface for encoding the metadata transformations
• Visual “pipelines” converted into XQuery instructions
• Hide the complexity of the XSLT/XQuery languages from non-expert users
Thanks for your attention.
www.digmap.eu
http://transform.digmap.eu