Comparability of language data and
analysis
Using an ontology for linguistics
Scott Farrar, U Bremen
Terry Langendoen, U Arizona
Jan 9, 2004
Symposium on Best Practice
LSA, Boston, MA
Multiple language resources
Symposium focus so far has been on
digital preservation of the work of
individual projects.
Imagine there are 100,000 or more Web-accessible
digital language archives
covering most of the world’s languages:
annotated texts, lexicons, grammatical
descriptions, research papers, typological
comparisons, ...
Limits on access to content
Metadata gets you only a little way in.
String searching gets results, but it’s often
not reliable (low “precision” and “recall”).
Database searches typically can only be
carried out one site at a time.
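To make “precision” and “recall” concrete, here is a minimal Python sketch. The document IDs and the search scenario are invented for illustration and do not come from any real archive.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for retrieved vs. truly relevant items."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant
    # Precision: what fraction of the results is actually relevant.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    # Recall: what fraction of the relevant items was actually found.
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical string search for "PST": it matches unrelated strings and
# misses documents that write "past" or "preterite" instead.
retrieved = ["doc1", "doc2", "doc3", "doc4"]
relevant = ["doc1", "doc5", "doc6", "doc7", "doc8"]

precision, recall = precision_recall(retrieved, relevant)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here only one of four results is relevant, and only one of five relevant documents is found, so both measures are low.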
Smart searches need smart data
Use informational, not presentational,
markup (cf. presentations by Simons and
Lewis).
XML can be used to represent linguistic
analyses to any desired degree of
refinement.
Analyses in other formats (e.g. relational
databases) can be migrated to XML for
both archiving and smart Web searching.
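As a rough illustration of informational markup, the Python sketch below encodes one analyzed word as XML using only the standard library. The element and attribute names (`word`, `morpheme`, `type`, `gloss`) and the Spanish example are hypothetical choices, not a scheme proposed in this talk.

```python
import xml.etree.ElementTree as ET


def encode_word(form, morphemes):
    """Build a <word> element whose children describe each morpheme.

    `morphemes` is a list of (text, type, gloss) tuples.
    """
    word = ET.Element("word", {"form": form})
    for text, mtype, gloss in morphemes:
        # The markup records what each piece IS, not how it should look.
        m = ET.SubElement(word, "morpheme", {"type": mtype, "gloss": gloss})
        m.text = text
    return word


# Hypothetical analysis of Spanish "gatos" ("cats").
word = encode_word("gatos", [("gato", "stem", "cat"), ("s", "suffix", "PL")])
xml_text = ET.tostring(word, encoding="unicode")
print(xml_text)

# Because the markup is informational, a search can target structure
# rather than raw strings: find every morpheme glossed as plural.
plural_markers = [m.text for m in word.iter("morpheme") if m.get("gloss") == "PL"]
print(plural_markers)
```

The degree of refinement is a design choice: the same word could be marked up with only a gloss, or with further nested elements for features and values.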
Smart markup isn’t enough
Meaning and use of structural markup
vary from site to site.
Same term used with different meanings.
Different terms used with the same
meaning.
Markup element and attribute names and
values, and structural content may be in
different natural languages.
Sites are encoded at different levels of
granularity.
How to say what you mean
Markup is syntax; its meaning can only be
inferred for individual sites, or groups of
sites that use a common markup scheme
(e.g. TEI).
So if markup term T means “x” in archive A
and “y” in archive B, then we need:
A resource (called an ontology) that provides
the definitions “x” and “y” in a systematic and
machine-interpretable format.
A mechanism to link T to “x” in A and T to “y”
in B.
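The two requirements above can be sketched in Python as a pair of lookup tables. Everything here is a hypothetical illustration: the concept names, definitions, archive names, and markup terms are invented, and a real ontology would hold far richer, formally structured definitions.

```python
# Hypothetical ontology: concept IDs with (toy) definitions.
ONTOLOGY = {
    "PastTense": "event time precedes speech time",
    "PassiveVoice": "patient argument is realized as subject",
}

# Hypothetical linking mechanism: each archive maps its own markup
# terms onto concepts in the shared ontology.
LINKS = {
    "ArchiveA": {"P": "PastTense"},
    "ArchiveB": {"P": "PassiveVoice", "PST": "PastTense"},
}


def resolve(archive, term):
    """Return (concept, definition) for a markup term in a given archive."""
    concept = LINKS[archive][term]
    return concept, ONTOLOGY[concept]


def find_terms(concept):
    """Return every (archive, term) pair that denotes the given concept."""
    return sorted(
        (archive, term)
        for archive, terms in LINKS.items()
        for term, linked in terms.items()
        if linked == concept
    )


# The same term "P" means different things in the two archives ...
print(resolve("ArchiveA", "P"))
print(resolve("ArchiveB", "P"))
# ... while a concept-based search finds past tense in both, whatever it is called.
print(find_terms("PastTense"))
```

Searching by concept rather than by term is what lets one query span archives with incompatible markup vocabularies.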
What is an ontology?
A computational artifact;
A conceptualization of a domain;
A theory of what is;
The types in a knowledge base.
There can be many ontologies for a given
domain.
Why an ontology for linguistics?
Language documentation
need to decipher markup
semantics and markup
Semantic Web implementation
Natural language processing
conceptual basis for semantics (grounding)
as a common framework for linguistic and
non-linguistic knowledge
GOLD
General Ontology for Linguistic
Description—http://emeld.org/gold
Incorporated in EMELD’s FIELD tool.
Built using an upper ontology (SUMO)
http://ontology.teknowledge.com
Currently in a very early stage of
development.
Partial SUMO taxonomy
Entity
  Abstract
    Relation
    Proposition
    SetOrClass
    Quantity
    Attribute
  Physical
    Object
      Region
      Agent
      SelfConnectedObject
      Collection
    Perdurant
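A taxonomy like this is naturally stored as a child-to-parent map. The Python sketch below does so for the classes named on this slide; the parent assignments are an assumption based on the published SUMO hierarchy, with the slide's Perdurant placed under Physical.

```python
# Child -> parent links (assumed; see the note above).
PARENT = {
    "Abstract": "Entity",
    "Physical": "Entity",
    "Relation": "Abstract",
    "Proposition": "Abstract",
    "SetOrClass": "Abstract",
    "Quantity": "Abstract",
    "Attribute": "Abstract",
    "Object": "Physical",
    "Perdurant": "Physical",
    "Region": "Object",
    "Agent": "Object",
    "SelfConnectedObject": "Object",
    "Collection": "Object",
}


def ancestors(cls):
    """Return the chain of superclasses from the nearest up to the root."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


def is_a(cls, ancestor):
    """True if `cls` equals `ancestor` or inherits from it."""
    return cls == ancestor or ancestor in ancestors(cls)


print(ancestors("Agent"))
print(is_a("Region", "Physical"))
print(is_a("Region", "Abstract"))
```

An ontology for linguistics built on such an upper ontology attaches its own categories below these classes, so the same inheritance check applies to linguistic concepts.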
What currently is in GOLD?
Categories for:
linguistic form
morphosyntactic categories
semantics for morphosyntactic categories
features
values
using SUMO
documentation
Format of GOLD
Semantic Web initiative
http://w3.org/2001/sw/
Web Ontology Language (OWL)
An emerging Web standard and growing
user base
Extensible
Lots of visualization tools and APIs are
available for OWL.
What’s still needed
Buildout of GOLD (and/or development of
companion ontologies) to cover the entire
field.
Mechanisms to link sites to ontologies.
Can be done in part using metadata.
Development of additional ontology-aware
tools for data creation and migration.
A way of ensuring that ontologies endure
just like the data they help interpret.