
The GOLD Effort So Far
Terry Langendoen
Brian Fitzsimons
Emily Kidder
Department of Linguistics
University of Arizona
July 1-3, 2005
E-MELD 2005
Ontologies in Linguistic Annotation
Acknowledgments

Everyone else who’s worked on E-MELD at U
Arizona 2001-05, especially:

    Graduate students: Scott Farrar, Will Lewis, Peter Norquest, Ruby Basham
    Undergraduate students: Jesse Kirchner, Shauna Eggers, Alexis Lanham, Sandy Chow

Everyone who’s worked on E-MELD
elsewhere, especially:

    Gary, Helen, Anthony, Laura, Zhenwei, Baden, Doug
Whalen’s problem

“We want to be able to describe the data in
just the way we want, but we don’t want to
program it.”

Doug Whalen, at 2001 E-MELD Workshop
Our problem


We want to be able to describe the data in
just the way we want, and we want to be able
to use everybody else’s data described in just
the way they want, and we want to be able to
process it in all kinds of ways that make
sense to us as scientists and teachers.
Call this the interoperability problem.
TEI’s data interchange solution

Create a “data interchange” format such as
the Text Encoding Initiative’s P3.

Require projects that wish to share data to define
mappings to and from the interchange format.
        φ              ψ⁻¹
X ----------> P3 ----------> Y
        ψ              φ⁻¹
Y ----------> P3 ----------> X
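
The pivot arrangement in the diagram can be sketched in Python. This is only an illustration under stated assumptions: the record layouts for projects X and Y, all field names, and the plain dictionary standing in for the interchange format are hypothetical, not anything defined by the TEI.

```python
from typing import Callable, Dict

Record = Dict[str, str]

# Hypothetical project formats: X and Y store the same information
# under different field names. The pivot dict stands in for P3.

def x_to_pivot(rec: Record) -> Record:
    """Plays the role of phi: project X -> interchange format."""
    return {"form": rec["wordform"], "gloss": rec["meaning"]}

def pivot_to_x(rec: Record) -> Record:
    """Plays the role of phi inverse: interchange format -> project X."""
    return {"wordform": rec["form"], "meaning": rec["gloss"]}

def y_to_pivot(rec: Record) -> Record:
    """Plays the role of psi: project Y -> interchange format."""
    return {"form": rec["tx"], "gloss": rec["ge"]}

def pivot_to_y(rec: Record) -> Record:
    """Plays the role of psi inverse: interchange format -> project Y."""
    return {"tx": rec["form"], "ge": rec["gloss"]}

def compose(first: Callable[[Record], Record],
            second: Callable[[Record], Record]) -> Callable[[Record], Record]:
    """Apply `first`, then `second`."""
    return lambda rec: second(first(rec))

# X -> Y and Y -> X never need a direct converter; both go through the pivot.
x_to_y = compose(x_to_pivot, pivot_to_y)
y_to_x = compose(y_to_pivot, pivot_to_x)

x_record = {"wordform": "perros", "meaning": "dogs"}
y_record = x_to_y(x_record)
```

The design point is that each project writes one pair of mappings, so n projects need 2n mappings rather than a converter for every ordered pair of projects.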
Two lessons from the TEI

Use a standard markup language.
    Our choice (like theirs): XML.
    Individual projects don’t have to use XML, but their software should export to XML.
XML markup is syntax

In TEI, the tags <s>, <w> and <m> were
designed to delimit sentences, words and
morphemes respectively.

But they can be used to describe any three-level
hierarchy over character strings, such as:

    <s> = sentence, <w> = word, <m> = morpheme
    <s> = paragraph, <w> = sentence, <m> = word
    <s> = chapter, <w> = paragraph, <m> = morpheme
    <s> = big chunk, <w> = middle-size chunk, <m> = small chunk
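
A short Python sketch makes the point concrete. The sample string and the two label tables are made up for illustration; the only library used is the standard `xml.etree.ElementTree` parser. The parser sees nothing but nesting, so the interpretation has to be supplied from outside the markup.

```python
import xml.etree.ElementTree as ET

# A made-up three-level document using the <s>, <w>, <m> tags.
doc = ET.fromstring("<s><w><m>un</m><m>lock</m></w><w><m>it</m></w></s>")

def label_units(root, labels):
    """Return (label, text) pairs in document order.

    `labels` is a caller-supplied table from tag names to meanings;
    nothing in the XML itself fixes those meanings.
    """
    return [(labels[el.tag], "".join(el.itertext())) for el in root.iter()]

# Two of the readings listed on the slide, as hypothetical lookup tables.
as_morphology = {"s": "sentence", "w": "word", "m": "morpheme"}
as_chunks = {"s": "big chunk", "w": "middle-size chunk", "m": "small chunk"}

morph_view = label_units(doc, as_morphology)
chunk_view = label_units(doc, as_chunks)
```

Both views contain exactly the same strings in the same order; only the labels differ, which is the sense in which the markup is syntax.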
Two avenues to markup
semantics

The syntax is the semantics (SIS)
    This is essentially the TEI solution.
Leave the semantics to us (LSU)
    Essentially the “Semantic Web” idea
Problems with SIS



Hard sell. Based on the TEI experience, it’ll
be hard to convince linguists to use it.
Expensive. It will be costly to retrofit existing
resources to conform to it.
Fragile. Future changes will be likely to break
existing applications.
Advantages of LSU



Easier sell. Can have lots of special purpose
markup schemas for different purposes,
which will be easier to use.
Cheap. Migration to best practice much less
costly.
Robust. Changes are less likely to break
existing applications.
Place of a linguistic ontology as
part of LSU

The central component of LSU is a linguistic
ontology that:

    defines the common concepts used in linguistic analysis and description,
    expresses the relations that hold among those concepts,
    relates those concepts to concepts of common-sense understanding (“upper” ontology) and concepts in other disciplines.
Proof of concept that it works

Last year, the Arizona team, together with
Gary, Scott, and Will’s team at CSU Fresno,
showed that GOLD could be used for smart
searching across massive cross-linguistic
databases created from XML documents of
different types.
    Interlinear glossed texts
    Lexicons
The GOLD Summit

Last November, Will hosted a summit meeting
of researchers most involved with GOLD to
plan for its further development and
maintenance after Arizona’s E-MELD funding
ran out yesterday. It recommended:

    Creating a GOLD website.
    Forming a GOLD Council with oversight responsibility, and putting procedures in place using the OLAC model to foster and evaluate development and maintenance.
    Focusing the E-MELD 2005 workshop on GOLD.
Current state of play

We’re proposing to move GOLD “out of the
lab” effective with this meeting despite the
fact that:

    GOLD version 0.2 has very small coverage, even within morphosyntax, and many areas of the field are not covered at all.
    Several important design issues have not been settled.
        What upper ontology should we use? (Currently SUMO)
        Some “core GOLD” concepts are in flux.
        We broke last year’s applications with our redesign of the treatment of grammatical features.
Classes and instances in GOLD
0.1 (“Old GOLD”)

Reasoning with classes and instances

    If i is of type A and A is a subclass of B, then i is of type B.
    For example, a search for instances of Verb will find all instances of both TransitiveVerb and IntransitiveVerb.
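
The inference rule above can be sketched in a few lines of Python. The two lookup tables and the sample words are hypothetical stand-ins chosen for illustration; only the class names Verb, TransitiveVerb and IntransitiveVerb come from the slide.

```python
# Hypothetical mini-ontology: subclass links and typed instances.
SUBCLASS_OF = {"TransitiveVerb": "Verb", "IntransitiveVerb": "Verb"}
INSTANCE_OF = {"devour": "TransitiveVerb", "sleep": "IntransitiveVerb", "dog": "Noun"}

def superclasses(cls):
    """Yield cls itself and then every class above it in the hierarchy."""
    while cls is not None:
        yield cls
        cls = SUBCLASS_OF.get(cls)

def instances_of(cls):
    """All instances whose declared type is cls or a subclass of cls."""
    return sorted(i for i, t in INSTANCE_OF.items() if cls in superclasses(t))

verbs = instances_of("Verb")
```

A search on Verb returns the instances declared under either subclass, while a search on TransitiveVerb returns only its own.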
A problem with saying what we
want about language X

In language X, verbs are inflected only for
tense.

Verb inflectedFor Tense?
    This won’t do if both subject and object of the relation are classes.
    Fails to represent the claim that tense is the only feature that verbs are inflected for in X.

XVerb inflectedFor XTense?
    OK, since XVerb and XTense are both instances (of the GOLD classes Verb and Tense respectively).
    Lack of other inflectional features will show up in response to query.
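
As a rough sketch of the instance-level treatment, the statements can be held as triples and queried. The triple store, the `instanceOf` relation name and the `objects` helper are assumptions made for this example; the slides do not say how GOLD 0.1 stored or queried these statements.

```python
# Instance-level statements as (subject, relation, object) triples.
# XVerb and XTense are instances of the GOLD classes Verb and Tense.
TRIPLES = {
    ("XVerb", "instanceOf", "Verb"),
    ("XTense", "instanceOf", "Tense"),
    ("XVerb", "inflectedFor", "XTense"),
}

def objects(subject, relation):
    """Every object o such that (subject, relation, o) is asserted."""
    return sorted(o for s, r, o in TRIPLES if s == subject and r == relation)

# Which features are verbs in language X inflected for?
x_verb_features = objects("XVerb", "inflectedFor")
```

The query returns XTense and nothing else. Reading that as "only tense" assumes the store is complete for language X, which is an assumption of this sketch rather than something the data can guarantee.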
A problem with saying what we
want in GOLD

XTense hasValue XFutureTense
    OK, since here hasValue relates instances.
Tense hasValue FutureTense
    Not OK, since here hasValue would relate classes.
Parallel structures for GOLD and
language-specific concepts


Allow certain GOLD concepts to be
instances of other GOLD classes. In
particular, define atomic feature values as
instances of particular feature classes.
Allow certain language-specific concepts to
be classes that are instantiated by other
language-specific concepts. In particular,
define language-specific features as classes
instantiated by their language-specific
values.
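
One way to picture the two parallel structures is with ordinary Python classes and objects. This is an interpretation, not the GOLD encoding: the `Feature` base class and the `has_value` helper are hypothetical, and the sketch deliberately asserts no link between `Tense` and `XTense`.

```python
class Feature:
    """Hypothetical base for feature classes; each instance is one atomic value."""

    def __init__(self, name):
        self.name = name

    def __repr__(self):
        return f"{type(self).__name__}({self.name!r})"

class Tense(Feature):
    """GOLD feature class."""

class XTense(Feature):
    """Language-specific feature for language X, modelled as a class."""

# Atomic feature values are instances of their feature classes.
FutureTense = Tense("FutureTense")       # GOLD-level value
XFutureTense = XTense("XFutureTense")    # language-specific value

def has_value(feature_class, value):
    """True if `value` is one of the values of `feature_class`."""
    return isinstance(value, feature_class)
```

With this layout the same question can be asked on both sides: `has_value(Tense, FutureTense)` and `has_value(XTense, XFutureTense)` both hold, while `has_value(Tense, XFutureTense)` does not.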
Feature systems as
substructures
           Any
          / | \
         /  |  \
        /   |   \
       /    |    \
   NonP   HodP   PreHodP
TenseSystem-x as a substructure of TenseFeature
Mapping from a language class
to a GOLD class
+------------+    +------------+
| Any     <--+----+-- XAny     |
|            |    |            |
| NonP    <--+----+-- XPres    |
|            |    |            |
| HodP    <--+----+-- XRecP    |
|            |    |            |
| PreHodP <--+----+-- XRemP    |
+------------+    +------------+
Mapping to GOLD TenseSystem-x from XTense
Isomorphism between a language
system and a GOLD system
           XAny
          / | \
         /  |  \
        /   |   \
       /    |    \
  XPres  XRecP  XRemP
XTense system isomorphic to TenseSystem-x
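
The isomorphism claim can be checked mechanically. In this sketch each system is written as a parent map and the mapping is the one drawn on the previous slide; the dictionary encoding and the `is_isomorphism` helper are assumptions made for illustration.

```python
# Each system as a parent map: value -> parent (None for the root).
TENSE_SYSTEM_X = {"Any": None, "NonP": "Any", "HodP": "Any", "PreHodP": "Any"}
X_TENSE = {"XAny": None, "XPres": "XAny", "XRecP": "XAny", "XRemP": "XAny"}

# Mapping from the language-specific values to the GOLD values.
TO_GOLD = {"XAny": "Any", "XPres": "NonP", "XRecP": "HodP", "XRemP": "PreHodP"}

def is_isomorphism(mapping, source, target):
    """True if `mapping` is a bijection that preserves the parent relation."""
    if set(mapping) != set(source):
        return False  # not defined on exactly the source values
    if set(mapping.values()) != set(target):
        return False  # misses a target value or maps outside the target
    if len(set(mapping.values())) != len(mapping):
        return False  # two source values share one target value

    def image(value):
        return None if value is None else mapping[value]

    # The parent of each image must be the image of the parent.
    return all(target[mapping[v]] == image(source[v]) for v in source)

xtense_matches_gold = is_isomorphism(TO_GOLD, X_TENSE, TENSE_SYSTEM_X)
```

For the systems shown, the check succeeds. It would fail if, say, XRecP were placed under XPres rather than directly under XAny.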
Future of GOLD