A Common Ontology for Linguistic Concepts - E

An Ontology for Linguistic
Representation
Scott Farrar, Terry Langendoen,
William Lewis
University of Arizona
Overview
• Discuss a proposal of how linguistic data
can be shared over the Semantic Web.
• Endangered language data (EMELD)
• Special focus on a linguistic ontology
(conceptual modeling in linguistics)
Ontology and Linguistics
• How is the lexicon reflected in top-level
distinctions? (Gangemi, Guarino, Masolo,
and Oltramari 2001)
• How are functional/grammatical categories
reflected in the ontology?
EMELD
• EMELD (Electronic Metastructure for
Endangered Languages Data)
• As many as half of the world’s languages (3000
out of 6000) are in danger of disappearing
(LaPolla 1998).
• Purpose is to preserve endangered language
data by creating a community of practice via the
Semantic Web.
Linguistic Data
• Field linguists collect data.
subject language
• Hopi:
analysis (markup)
sivu-’ikwiw-ta-qa
[vessel-carry: on: back-DUR-REL]
‘kachina’
gloss (markup)
• Linguistic data includes: grammars,
dictionaries, text, sound and video
recordings, glossed corpora
Grammatical Descriptions
• English has subject-verb agreement in
number and person.
• Warumungu is ergative-absolutive.
• Archi has extensive spatial cases.
• Spanish is SVO.
Facts about Language
Nouns represent objects.
general concepts
Tense relates an event with a point in time.
Case is a relation between a predicate and its
argument, e.g., He knows him.
SOV, SVO, OVS are possible natural language
word orders.
linguistic concepts
Challenges to Creating a
Community of Practice
• Language data should be searchable and
comparable—broad access.
• Few standardized methods for encoding of
language data (cf. EAGLES and TEI).
• Authors or communities want control over
their data.
Local control should be balanced with data
interoperability
Semantic Web
Example of the Problem
Google query:
[“past tense” Australian languages]
Markup found on the Web:
Umpila: [Prehodiernal Tense]
Warumungu: [PAST]
Dyirbal: [PST]
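The mismatch above can be illustrated with a minimal Python sketch in which each local tag is mapped to a shared concept and the query runs over concepts rather than tag strings. The concept names (`PrehodiernalPastTense`, `PastTense`) and the subclass links between them are assumptions made for illustration, not part of the proposal itself.

```python
# Hypothetical mapping from (language, local tag) to a shared concept.
TAG_TO_CONCEPT = {
    ("Umpila", "Prehodiernal Tense"): "PrehodiernalPastTense",
    ("Warumungu", "PAST"): "PastTense",
    ("Dyirbal", "PST"): "PastTense",
}

# Hypothetical subclass links between concepts (child -> parent).
SUBCLASS_OF = {
    "PrehodiernalPastTense": "PastTense",
    "PastTense": "Tense",
}


def is_a(concept, ancestor):
    """Return True if `concept` equals `ancestor` or inherits from it."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False


def tags_for(concept):
    """List every (language, tag) pair whose concept falls under `concept`."""
    return sorted(key for key, c in TAG_TO_CONCEPT.items() if is_a(c, concept))


print(tags_for("PastTense"))
```

A query for `PastTense` then reaches all three languages even though none of them shares a tag string with the others.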
Other Examples of the Problem
• homonymous terminology:
A search for PA intended to mean
‘Partitive’ might return PA meaning Past or
PerfectiveAspect.
• “covert” markup:
Hopi CAUS really means Causative
combined with PerfectiveAspect
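Both problems can be sketched in Python by letting each data source declare what its own tags mean, so that a tag is always resolved relative to its source and may expand to more than one concept. The source names (`source_a`, `source_b`, `hopi_source`) are hypothetical placeholders; the concept names are taken from the examples above.

```python
# Hypothetical per-source tag tables: each source states the meaning of its tags.
SOURCE_TAGS = {
    "source_a": {"PA": ["Partitive"]},
    "source_b": {"PA": ["Past"]},
    # "Covert" markup: one tag stands for two concepts.
    "hopi_source": {"CAUS": ["Causative", "PerfectiveAspect"]},
}


def resolve(source, tag):
    """Return the concepts a tag denotes within one source (empty if unknown)."""
    return SOURCE_TAGS.get(source, {}).get(tag, [])


def sources_marking(concept):
    """Return the sources in which some tag denotes `concept`."""
    return sorted(
        source
        for source, tags in SOURCE_TAGS.items()
        if any(concept in concepts for concepts in tags.values())
    )


print(resolve("source_a", "PA"), resolve("source_b", "PA"))
print(sources_marking("PerfectiveAspect"))
```

Searching by concept rather than by the string PA keeps the two readings apart, and a search for `PerfectiveAspect` finds the Hopi data even though no tag there mentions aspect.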
Some Further Complications
• A search for present tense forms in
English returns future tense, e.g.,
“Tomorrow, John goes to Holland.”
• Some languages have past, present, and
habitual analyzed as grammatical tenses
(e.g., Hopi).
• Searching for habitual aspect in Hopi does
not guarantee anything to do with aspect.
Solution Strategy for Providing
Broad Access to Language Data
• Integrate the data using metastructure—a
linguistic ontology.
• Make the data available in standard format
without imposing a standard for markup.
• Build tools to access and process the
data—query engines, expert systems…
Challenges to Conceptual Modeling
of Linguistic Domain
• domain is large
• most concepts are abstract
• linguistic objects are hierarchical
• language is symbolic
• field is fragmented: few standards (cf.
chemistry, computer science)
Linguistic Ontology
• Our starting point is morpho-syntax (word
parts, tense, case, aspect, inflection)
• linguistic segments, grammatical
concepts, data structures
• Built on top of the Standard Upper Merged
Ontology (SUMO)
Standard Upper Merged Ontology
(SUMO)
• extensible resource
• already includes a number of concepts related
to semiotics and linguistics
• connection with the NLP community
(WordNet; B. Levin’s verb classes; Allen’s
tense logic)
• developed by an IEEE working group and is
freely available (http://suo.ieee.org)
(Niles and Pease 2001)
Physical Entities
What entities are physical, i.e., exist in
space-time?
• The word and its parts (written or spoken):
Stems, Affixes, Roots, Phonetic segments
• Larger constituents:
Phrases, Clauses, Texts
Taxonomy for LinguisticExpression
Entity
  Physical
    Object
      SelfConnectedObject
        ContentBearingObject
          Icon
          SymbolicString
          LinguisticExpression
            WrittenLinguisticExpression
            SpokenLinguisticExpression
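As a rough illustration, the taxonomy above can be stored as child-to-parent links and walked upward to list everything a class inherits from. The nesting used here is one assumed reading of the slide (Icon, SymbolicString, and LinguisticExpression as siblings under ContentBearingObject), and the helper name `ancestors` is made up for this sketch.

```python
# Child -> parent links; the nesting is an assumed reading of the slide.
PARENT = {
    "Physical": "Entity",
    "Object": "Physical",
    "SelfConnectedObject": "Object",
    "ContentBearingObject": "SelfConnectedObject",
    "Icon": "ContentBearingObject",
    "SymbolicString": "ContentBearingObject",
    "LinguisticExpression": "ContentBearingObject",
    "WrittenLinguisticExpression": "LinguisticExpression",
    "SpokenLinguisticExpression": "LinguisticExpression",
}


def ancestors(cls):
    """Return the superclasses of `cls`, nearest first."""
    chain = []
    while cls in PARENT:
        cls = PARENT[cls]
        chain.append(cls)
    return chain


print(ancestors("WrittenLinguisticExpression"))
```

Because a written expression ends up under Physical, it counts as something that exists in space-time, which is the point of placing it in this branch.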
Taxonomy for LinguisticExpression
WrittenLinguisticExpression
  WordPart
    SimpleWordPart
      Root
      Affix
        Prefix (instance: ‘un-’)
        Infix
        Suffix
      Clitic
    Stem
  Word
    SimpleWord (instance: ‘dog’)
    ComplexWord
    Compound
  Phrase
Other Possibilities
recall:
Word
  SimpleWord (instances: ‘dog’, ‘jump’)
  ComplexWord
  Compound
alternative:
Word
  Noun
    SimpleNoun (instance: ‘dog’)
    ComplexNoun
  Verb
    SimpleVerb (instance: ‘jump’)
    ComplexVerb
  Adjective
  …
Part of Speech as Property
• Language exhibits categorical ambiguity
(e.g., “fish”).
• No noun/verb distinction in some
languages—language specific (e.g.,
Lummi).
• Related to the notions of “rigid” and “anti-rigid”
properties in OntoClean
(Guarino and Welty 2002)
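A minimal Python sketch of the idea: part of speech is stored as a set of attributes on a word rather than as the class the word belongs to, so an ambiguous form carries several values and a language without the distinction can leave the set empty. The `Word` class and its field names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Word:
    """A word whose part of speech is a property, not its class (hypothetical model)."""
    form: str
    language: str
    parts_of_speech: set = field(default_factory=set)


def has_pos(word, pos):
    """Return True if `pos` is among the word's part-of-speech attributes."""
    return pos in word.parts_of_speech


# "fish" is categorially ambiguous, so it carries both attributes.
fish = Word("fish", "English", {"noun", "verb"})
dog = Word("dog", "English", {"noun"})

print(has_pos(fish, "noun"), has_pos(fish, "verb"), has_pos(dog, "verb"))
```

Under a class-based model, ‘fish’ would have to be an instance of two sibling classes at once; as attributes, the two readings simply coexist.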
Mereological Relations for
LinguisticExpression
A Stem has-part Root.
(=>
(instance ?STEM Stem)
(exists (?PART)
(and
(part ?PART ?STEM)
(instance ?PART Root))))
Mereological Relations for
LinguisticExpression
A Word has-part WordPart (e.g., a Stem).
(=>
(instance ?WORD Word)
(exists (?PART)
(and
(part ?PART ?WORD)
(instance ?PART WordPart))))
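The two has-part axioms can be read as well-formedness checks on analyzed data. The Python sketch below is one way to express that, assuming a hypothetical `Expression` record with a `kind` label and a list of direct parts; it only inspects direct parts, whereas the axioms themselves place no such restriction on `part`.

```python
from dataclasses import dataclass, field

# Kinds counted as WordPart here (an assumption based on the taxonomy).
WORD_PART_KINDS = {"Root", "Affix", "Prefix", "Infix", "Suffix", "Clitic", "Stem"}


@dataclass
class Expression:
    """A linguistic expression with its direct parts (hypothetical model)."""
    form: str
    kind: str
    parts: list = field(default_factory=list)


def satisfies_axioms(expr):
    """Check the two has-part axioms against an expression's direct parts."""
    kinds = {part.kind for part in expr.parts}
    if expr.kind == "Stem":
        return "Root" in kinds                # a Stem has a Root as part
    if expr.kind == "Word":
        return bool(kinds & WORD_PART_KINDS)  # a Word has some WordPart
    return True                               # no axiom for other kinds


root = Expression("dog", "Root")
stem = Expression("dog", "Stem", [root])
dogs = Expression("dogs", "Word", [stem, Expression("-s", "Suffix")])
bare_stem = Expression("dog", "Stem")

print(satisfies_axioms(dogs), satisfies_axioms(stem), satisfies_axioms(bare_stem))
```

A check of this kind could flag analyses in which a stem has been entered without any root.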
Other Axioms for
LinguisticExpression
• None for WrittenLinguisticExpression,
without referring to abstract section of
ontology.
• Phonetic overlay (intonation contour)
Abstract Entities
• What entities are abstract, i.e., qualities or
attributes?
Grammatical attributes:
Tense, Aspect, Case
Linguistic data structures:
Paradigms, Feature Structures,
PhonemeTables, Derivations (as in
“Minimalism”)
Mental entities:
Morpheme, Phoneme, Lexeme
Taxonomy of GrammaticalProperty
Abstract
  Relation
  Proposition
  Attribute
    InternalAttribute
    RelationalAttribute
    GrammaticalAttribute (? internal or relational)
      PartOfSpeech
      Tense
      Aspect
      Case
      …
Relational
SyntacticRole
subject
object
CaseRole
agent
patient
These are relational much the same as
PositionalAttribute, e.g., above.
Internal
PartOfSpeech
noun
verb
determiner
Within a grammar, these do not appear to be
relational, cf. ShapeAttribute.
Problematic Cases
Tense
Aspect
Case
• As grammatical attributes, these do not
appear to be relational.
• But what about meaning?
• After all, that’s the goal of the Semantic
Web.
Grammar-Meaning Distinction
• GrammaticalAttribute should be tied to a
language (e.g., to construct paradigms).
• Semantic notions do not have to be tied to
a specific language.
• Tendency in language is to match these
up, e.g., verbs pick out predicates, nouns
pick out terms and functions.
• Salishan languages—no verb-noun
distinction.
Tense
AbsolutePastTense:
(=>
(AbsolutePastTense ?SENTENCE)
(exists (?INTERVAL)
(and
(instance ?INTERVAL TimeInterval)
(during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE)))
(before ?INTERVAL (WhenFn ?SENTENCE)))))
HodiernalPastTense: ?INTERVAL has to be during
‘today’
RemotePastTense: ?INTERVAL has to be before a certain
time point (language-specific)
Tense
RelativePastTense:
(=>
(RelativePastTense ?SENTENCE)
(exists (?INTERVAL ?POINT)
(and
(instance ?INTERVAL TimeInterval)
(instance ?POINT TimePoint)
(during ?INTERVAL (WhenFn (ProcessFn ?SENTENCE)))
(before ?INTERVAL ?POINT))))
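To make the temporal conditions in the tense axioms concrete, the Python sketch below models time as plain numbers and an interval as a `(start, end)` pair, then tests whether a candidate interval can serve as the witness that each axiom asks for. The numeric encoding, the strict reading of `during`, and the function names are all assumptions made for illustration.

```python
def during(inner, outer):
    """True if `inner` lies strictly inside `outer` (assumed reading of during)."""
    return outer[0] < inner[0] and inner[1] < outer[1]


def before(first, second):
    """True if interval `first` ends before interval `second` starts."""
    return first[1] < second[0]


def absolute_past_witness(interval, event_time, utterance_time):
    """Does `interval` fall within the event and precede the utterance?"""
    return during(interval, event_time) and before(interval, utterance_time)


def relative_past_witness(interval, event_time, reference_point):
    """Does `interval` fall within the event and precede a reference point?"""
    return during(interval, event_time) and interval[1] < reference_point


event = (0, 10)       # time of the described process
utterance = (20, 21)  # time of the sentence itself

print(absolute_past_witness((2, 5), event, utterance))
print(absolute_past_witness((2, 5), event, (4, 6)))
print(relative_past_witness((2, 5), event, 7))
```

The only difference between the two checks is the anchor: absolute past compares against the time of the sentence, relative past against some other reference point.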
Taxonomy of Case
Case
  EventiveCase
    AffectorCase (instance: ErgativeCase)
    AffecteeCase (instance: DativeCase)
  NoneventiveCase
    xxxxCase (instance: ExistentialCase)
    SpatialCase
      DirectionalCase (instance: IllativeCase)
      PositionalCase
Predicates in the SUMO
• Linguistic/Semiotic Predicates:
refers
represents
containsInformation, e.g.,
(containsInformation ?SENT ?PROP)
realization
representsInLanguage, e.g.,
(representsInLanguage ?THING ?ENTITY ?LANGUAGE)
Additional Predicates
Need a more linguistically-centered
predicate that acts deictically and ‘picks
out’ or points to a particular instance.
(designates ?LinguisticExpression ?Entity)
Future directions
• Extend ontology into the domains of
phonology and syntax.
• Recommend markup approach (XML)
• Explore applications of the ontology
beyond the immediate EMELD project
Contact Info
• Terry Langendoen
• Scott Farrar
• See our website:
http://emeld.douglass.arizona.edu:8080