Transcript Slide 1

The current state of Metadata
- as far as we understand it
Peter Wittenburg
The Language Archive - Max Planck Institute
CLARIN Research Infrastructure
Nijmegen, The Netherlands
Old Concept
• of course "metadata" is an old concept
• library cards were introduced to cope with mass and anonymity
• not surprising that library people started thinking about this to describe all kinds of web-accessible resources
• DC and qualified DC were the results
• however, the research world is different - not just search
• therefore in many domains solutions were developed
• 2 years ago CLARIN revised its 15-year-old set & framework
Big Ideas
• of course managing increasing amounts of data
• of course finding valuable data in the growing haystacks
• but also
  • machine usage of metadata
    • automatic profile matching
    • research statistics - virtual sub-collection building
    • etc.
  • multilinguality in a multilingual European society
  • interdisciplinary research
    • biodiversity people should find information in linguistic archives
    • etc.
  • linking with contextual information
  • document lifecycle management (provenance)
Big Change
• until now researchers informed each other
• culture of personal exchange
• claim: this will only work partially in the future
• have distributed centers storing lots of data (national and discipline dimensions)
• depositors upload their data into these centers
• will have an anonymous landscape of data & tools
• all offered as services
• what do we have to find things:
  • proper metadata descriptions
  • social tagging by virtual organizations
  • content to operate on by "smart" data mining
Big Question
• are we ready to meet these wishes and changes?
• probably not
• some major issues
  • quality
  • interoperability
  • registry and reference stability
  • functional
  • multilingual
  • scalability
  • IT principles
Quality Issue
• lack of quality in descriptions
  • not all elements filled in (researchers are lazy, lack of tool support)
  • often not schema based (XLS) and thus inconsistent
  • lack of agreed and standardized vocabularies
    • ISO 639-3 - about 6000 language codes
    • what about subject classification schemes
    • what about institution names
  • thus many errors and inconsistencies
  • ontologies are expensive to maintain
  • misinterpretations/misuse of element semantics
  • etc.
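The two quality problems named above (unfilled elements and values outside an agreed vocabulary) can be checked mechanically. The following is a minimal sketch only: the list of required elements is hypothetical, and the language-code set is a tiny excerpt of ISO 639-3 rather than the full code table.

```python
from __future__ import annotations

# Hypothetical required elements; a real metadata profile defines its own.
REQUIRED_ELEMENTS = ("title", "creator", "language", "institution")

# Small illustrative excerpt of ISO 639-3 codes, not the complete table.
KNOWN_LANGUAGE_CODES = {"nld", "deu", "eng", "fra"}


def check_record(record: dict) -> list[str]:
    """Return human-readable quality problems found in one metadata record."""
    problems = []
    # Completeness: every required element must be present and non-empty.
    for element in REQUIRED_ELEMENTS:
        value = record.get(element)
        if value is None or not str(value).strip():
            problems.append(f"missing element: {element}")
    # Vocabulary: a filled-in language must come from the agreed code list.
    language = record.get("language")
    if language and language not in KNOWN_LANGUAGE_CODES:
        problems.append(f"unknown language code: {language}")
    return problems


if __name__ == "__main__":
    sample = {"title": "Field recordings", "creator": "", "language": "dutch"}
    for problem in check_record(sample):
        print(problem)
```

A check of this kind only works when records follow a schema and vocabularies are agreed upon, which is the point of the slide.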
Interoperability Issue
• hampered by different approaches (closed DB, no modularity, embedded ontologies)
• structural difficulties up to context dependency
• difficult semantic mapping
  • different description dimensions
  • bad element definitions
  • bad vocabulary definitions
• only little support of OAI-PMH
• reliance on DC semantics - but useless for research etc.
• often "hardwired" mappings
• lack of a flexible framework to create/share/use relations
• little is standardized - what about lifetime then
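One alternative to "hardwired" mappings is to let each schema point its elements at shared concept identifiers and derive the element-to-element mapping from those links. The sketch below illustrates the idea only; the schema names, element names, and `concept:` identifiers are all invented for the example and do not come from any real registry.

```python
# Each schema declares which shared concept each of its elements refers to.
# All element names and concept identifiers here are hypothetical.
SCHEMA_A = {"title": "concept:title", "lang": "concept:language"}
SCHEMA_B = {
    "Name": "concept:title",
    "SubjectLanguage": "concept:language",
    "Genre": "concept:genre",
}


def derive_mapping(source: dict, target: dict) -> dict:
    """Map source element names to target element names via shared concepts."""
    by_concept = {concept: element for element, concept in target.items()}
    return {
        element: by_concept[concept]
        for element, concept in source.items()
        if concept in by_concept
    }


def translate(record: dict, source: dict, target: dict) -> dict:
    """Rename the elements of a record; elements without a mapping are dropped."""
    mapping = derive_mapping(source, target)
    return {mapping[name]: value for name, value in record.items() if name in mapping}


if __name__ == "__main__":
    print(derive_mapping(SCHEMA_A, SCHEMA_B))
    print(translate({"title": "Field recordings", "lang": "nld"}, SCHEMA_A, SCHEMA_B))
```

Because the mapping is computed from the concept links, adding a third schema requires only its own concept links, not a new pairwise mapping for every existing schema.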
Registry and Reference Stability Issue
• flexibility only when we separate things
  • define & register all concepts in open registries (we are using ISO 12620 - ISOcat)
  • define & register all components/profiles (we are using CLARIN registry)
  • register all mappings (nothing yet)
• but if we do this we need to refer
• are our references stable??
  • some are using Cool URIs - are they just URLs?
  • some using explicit Handles - are they maintained?
  • who takes care? (we are using EPIC - European PID Consortium)
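The stability question comes down to indirection: a reference stays valid only if someone keeps the identifier-to-location record up to date. The sketch below shows that idea with a Handle-style identifier of the form prefix/suffix. The handle value, the URLs, and the in-memory `PidRegistry` class are hypothetical stand-ins; it does not use the EPIC service or any real resolver API, and only builds the public proxy address `https://hdl.handle.net/` as a string.

```python
from urllib.parse import quote

HANDLE_PROXY = "https://hdl.handle.net/"


def split_handle(handle: str) -> tuple:
    """Split a handle into (prefix, suffix); raise ValueError if malformed."""
    prefix, sep, suffix = handle.partition("/")
    if not sep or not prefix or not suffix:
        raise ValueError(f"not a handle of the form prefix/suffix: {handle!r}")
    return prefix, suffix


def proxy_url(handle: str) -> str:
    """Build the proxy address for a handle (string construction only)."""
    prefix, suffix = split_handle(handle)
    return HANDLE_PROXY + prefix + "/" + quote(suffix)


class PidRegistry:
    """Hypothetical in-memory stand-in for a maintained PID service."""

    def __init__(self):
        self._locations = {}

    def register(self, handle: str, url: str) -> None:
        split_handle(handle)  # validate the syntax before storing
        self._locations[handle] = url

    def resolve(self, handle: str) -> str:
        return self._locations[handle]


if __name__ == "__main__":
    registry = PidRegistry()
    pid = "12345/example-record"  # made-up handle for illustration
    registry.register(pid, "https://old.example.org/records/1")
    # The resource moves; only the registry entry changes, not the reference.
    registry.register(pid, "https://new.example.org/archive/1")
    print(proxy_url(pid), "->", registry.resolve(pid))
```

Metadata that cites the handle keeps working after the move, but only because the registry entry was updated, which is the "who takes care?" question.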
Functional Issue
• do we address new functional requirements?
• what about provenance information - is it automatically generated?
• what about versions - are they visible?
• what about LTP (long-term preservation) information?
• what about formal access information?
• do we know what is needed for the web services scenario? (profile matching, deployment information, etc.)
Multilingual Issue
• what does it really include?
  • localizing all software
  • multilingual definitions of all concepts, elements and vocabulary terms (no translations of proper names of course, or?)
  • or do we simply rely on some lingua franca
  • answer probably discipline dependent
  • how much is (should be) the public involved
• whatever we do it is a lot of work
• CLARIN: ISOcat covers almost all major EU languages
Scalability Issue
• are our solutions scalable?
• in EUROPEANA millions of metadata records
• in CLARIN about 270,000
  • how to structure the offer
  • how to present this to naive users
• do we share the same granularity (metadata at collection and/or resource level)
  • can we deal with aggregations in the same way
• can we apply semantic web technology
  • automatic mapping
  • automatic quality improvement
IT Principles
• we need to disseminate the message of some basic IT principles
• define and register your semantics
• specify and register your syntax
• use a stable reference scheme
• in some areas separate definitions and relations
• get things standardized or use standards such as
  • XML, some schema language
  • ISO 12620, etc.
  • URI, Handles
What can we do?
• listen to each other first
• increase awareness about metadata and basic principles
• see how we can create an interoperable landscape
  • harmonizing approaches
  • harmonizing along major issues
  • making things explicit and scalable
  • look for proper interdisciplinary solutions
moving towards an
ideal e-Science domain
Üm nicht to end in Babylonish scenario
nous avons still algo time om sistemas
te improve.
Thanks for your attention.