Spanish FrameNet Project

Autonomous University of Barcelona
Marc Ortega




Spanish FrameNet (SFN) is a research project sponsored by the Department of Education of Spain (Grant No. TSI2005-01200) from December 2005 to December 2006.
A new grant proposal has been submitted to the Spanish Department of Education for the period 2007-2009.
SFN is developed at the Autonomous University of Barcelona (Spain) and the International Computer Science Institute (Berkeley, CA) in cooperation with the FrameNet Project.
Team: Carlos Subirats (PI), Marc Ortega (System Analyst), and 2 linguists.
SFN Goals




The Spanish FrameNet Project is creating an online lexical resource for Spanish, based on frame semantics and supported by corpus evidence.
SFN will be available to the public by July 2007.
SFN will contain at least 1,000 lexical items (approx.): verbs, predicative nouns, adjectives, adverbs, prepositions and entities, representative of a wide range of semantic domains.
The aim is to document the range of semantic and syntactic combinatory possibilities (valences) of each word in each of its senses.
Frame Semantics

Spanish FrameNet (SFN) uses FrameNet frames, modifying them where needed to fit Spanish

Some SFN frames are the same as English FN (with Spanish examples)

Some SFN frames have the same name as in English FN but differ from it
(slightly different definition, different FEs, or different core sets)

To adapt FN to Spanish, we defined some new frames (in the same FN format)
and left some FN frames unused. New frames include:

Cause_to_halt

Change_emotional_state

Collapse

Inventing

Motion_backwards, Motion_interruption, Motion_manner,
Motion_medium, Motion_up_downwards

Return

Social_interaction

Think_up
Current Project Status

Frames Defined: 92

Lexical Units: 624



Annotated: 413
Subcorporated: 130
Created but without subcorporation: 23
Spanish FrameNet Corpus and Tools

Spanish FrameNet uses a 350-million-word corpus
• It includes both European and New World Spanish (40% and 60%, respectively)
• The SFN Corpus has been developed by the SFN research team, since no large public-domain Spanish corpora are available

The SFN Corpus is lemmatized and tagged with a set of
in-house tools

FNDesktop

Web Reports

Sato Tool
The SFN tagging and chunking system

The SFN Corpus is tagged and lemmatized using:

• An electronic dictionary of Spanish with 600,000 forms, expanded from a dictionary of 93,000 lemmas:
  • 66,000 single-word lexical units, like unir (unite), inmoralidad (immorality), allí (there), etc.;
  • 26,000 multi-word lexical units (MWLU), like muerte cerebral (brain death), etc., which are automatically expanded into 55,000 inflected MWLU forms.
• A corpus tagger that converts plain text to deterministic finite-state automata (DFSAs)
• 2,000 finite-state transducers (FSTs) for multi-word verbs
• Transducers for the heads of verbal phrases (compound verbal tenses)
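The automatic expansion of a multi-word lexical unit into its inflected forms can be illustrated with a minimal Python sketch. It is not the SFN dictionary code: the pluralization rule, the function names and the `sg`/`pl` labels are assumptions made for illustration, and the rule only covers regular cases such as the example above.

```python
VOWELS = "aeiou"

def pluralize(word):
    # Toy rule (assumption): -s after an unstressed vowel, -es after a consonant.
    return word + ("s" if word[-1] in VOWELS else "es")

def expand_mwlu(lemma):
    """Map each inflected form of a noun + adjective MWLU to (lemma, number)."""
    plural = " ".join(pluralize(word) for word in lemma.split())
    return {lemma: (lemma, "sg"), plural: (lemma, "pl")}

forms = expand_mwlu("muerte cerebral")
for form, analysis in forms.items():
    print(form, analysis)
```

Each inflected form points back to its lemma, so a tagger can look up a surface form and recover the lexical unit it belongs to.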
The SFN tagging and chunking system

The POS tagging process produces two
corpus formats:

• Automata Corpus
• IMS-CWB (Institut für Maschinelle Sprachverarbeitung Corpus Workbench)
Automata Corpus

• Lexical tagging (part-of-speech, lemma)
• Word ambiguities are represented in deterministic finite state automata (DFSAs) as different possible transitions between two consecutive states
• Very efficient process rates
• Allows efficient word disambiguation
• Allows extended lexical tagging using automata transduction:
  • Compound verbal forms tagging
  • Multi-word verb recognition
• Human access is almost impossible
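A minimal Python sketch of how word ambiguities can be stored as alternative transitions between two consecutive states of a sentence automaton. The toy lexicon, its lemmas, the `V` tag and the function names are assumptions for illustration, not the SFN dictionary or tagger.

```python
# Toy lexicon (assumption): surface form -> list of (lemma, POS) analyses.
TOY_LEXICON = {
    "la": [("el", "DET"), ("lo", "PRON")],
    "casa": [("casa", "N"), ("casar", "V")],
}

def sentence_automaton(tokens, lexicon):
    """Return {(state, next_state): [(form, lemma, pos), ...]} for a sentence.

    Token i is read between state i and state i + 1. Each analysis of the
    token is a distinct transition between those two states.
    """
    transitions = {}
    for i, form in enumerate(tokens):
        analyses = lexicon.get(form.lower(), [(form.lower(), "UNKNOWN")])
        transitions[(i, i + 1)] = [(form, lemma, pos) for lemma, pos in analyses]
    return transitions

def count_readings(transitions):
    """Number of distinct paths (tag sequences) through the automaton."""
    total = 1
    for arcs in transitions.values():
        total *= len(arcs)
    return total

automaton = sentence_automaton(["la", "casa"], TOY_LEXICON)
print(automaton[(0, 1)])
print(count_readings(automaton))
```

Disambiguation then amounts to removing transitions until a single path is left.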
DFSA of the sentence Al habérselo propuesto a tiempo
FST for compound verb form tagging
Transduced DFSA of the sentence Al habérselo propuesto a tiempo
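The effect of compound verb form tagging can be approximated with a short Python sketch. It works on a flat, already disambiguated token list rather than on an automaton, so it is a simplification and not the SFN transducer. The tag names (`V:INF`, `V:PP`), the analysis of "Al" and the function name are assumptions.

```python
def tag_compound_tenses(tokens):
    """Return (start, end, main_lemma) spans for auxiliary haber + past participle.

    tokens is a list of (form, lemma, pos) triples. Clitics attached to the
    auxiliary (as in "habérselo") are assumed to stay inside its token.
    """
    spans = []
    for i in range(len(tokens) - 1):
        _, lemma, pos = tokens[i]
        _, next_lemma, next_pos = tokens[i + 1]
        if lemma == "haber" and pos.startswith("V") and next_pos == "V:PP":
            spans.append((i, i + 2, next_lemma))
    return spans

sentence = [
    ("Al", "al", "PREP"),
    ("habérselo", "haber", "V:INF"),
    ("propuesto", "proponer", "V:PP"),
    ("a", "a", "PREP"),
    ("tiempo", "tiempo", "N"),
]
print(tag_compound_tenses(sentence))
```

The span groups the auxiliary and the participle and reports the lemma of the main verb (proponer), which is the unit of interest for frame annotation.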
CWB Corpus

Lexical tagging (part-of-speech, lemma)

Text DFSAs are disambiguated
and converted to XML format

Unambiguous corpus

Allows human access to
corpus contents

Allows human corpus search

Corpus contents are codified
and indexed for an efficient
corpus search
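A sketch of serializing a disambiguated sentence for indexing. The Corpus Workbench encodes corpora from one-token-per-line text with TAB-separated annotation columns and XML-style tags for structure. The exact SFN XML schema is not given here, so the column order, the `<text>` and `<s>` tags and the function name below are assumptions.

```python
from xml.sax.saxutils import escape

def to_vertical(sentences):
    """Serialize sentences of (form, pos, lemma) triples, one token per line."""
    lines = ["<text>"]
    for sentence in sentences:
        lines.append("<s>")
        for token in sentence:
            # Escape &, < and > so that tokens cannot be mistaken for tags.
            lines.append("\t".join(escape(value) for value in token))
        lines.append("</s>")
    lines.append("</text>")
    return "\n".join(lines) + "\n"

print(to_vertical([[("a", "PREP", "a"), ("tiempo", "N", "tiempo")]]))
```

Because each token carries exactly one analysis, the result is readable by humans and can be searched by word form, part of speech or lemma.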
Multi-word verb recognition
DFSA of the sentence Le hacían siempre el vacío en la empresa before the transduction
Output DFSA of the sentence after the intersection and transduction
• Inflectional morphological properties are kept
• The adverb siempre is detected between the core verb and the idiom
Subsequential FST that detects the multi-word verb hacer el vacío
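The recognition of hacer el vacío with an intervening adverb can be sketched in Python as follows. This is a plain list scan over disambiguated tokens, not a subsequential FST, and the tag names, the `max_gap` parameter and the function name are assumptions for illustration.

```python
def find_multiword_verb(tokens, verb_lemma, idiom_forms, max_gap=1):
    """Find verb_lemma followed by idiom_forms, allowing up to max_gap adverbs.

    tokens is a list of (form, lemma, pos) triples. The inflected verb form
    is returned unchanged, so its morphological properties are kept.
    """
    size = len(idiom_forms)
    for i, (form, lemma, pos) in enumerate(tokens):
        if lemma != verb_lemma or not pos.startswith("V"):
            continue
        for gap in range(max_gap + 1):
            start = i + 1 + gap
            gap_tokens = tokens[i + 1:start]
            if any(token[2] != "ADV" for token in gap_tokens):
                break  # only adverbs may separate the verb from the idiom
            window = [token[0].lower() for token in tokens[start:start + size]]
            if window == idiom_forms:
                return {
                    "verb_index": i,
                    "verb_form": form,
                    "inserted": [token[0] for token in gap_tokens],
                    "idiom_span": (start, start + size),
                }
    return None

sentence = [
    ("Le", "le", "PRON"), ("hacían", "hacer", "V"), ("siempre", "siempre", "ADV"),
    ("el", "el", "DET"), ("vacío", "vacío", "N"), ("en", "en", "PREP"),
    ("la", "el", "DET"), ("empresa", "empresa", "N"),
]
print(find_multiword_verb(sentence, "hacer", ["el", "vacío"]))
```

Matching on the lemma of the verb but on the surface forms of the idiom lets any inflected form of hacer trigger the match while the fixed part stays fixed.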
Subcorporation Process

The in-house tools GramCreator and XQS are used to create
subcorporation grammars
# Request: solicitud
# N-de-GN-de
# <PALABRA>* = 4
{
}
<%NPRED%> ( <APRED> + <PALABRA>* )
<de.PREP>
(
(<PRON>
+
(
( <E> + <PREDET> )
( <E> + <DET> + <APOS> )
( <E> + <APRED> + <VPRED:PP> )
))
<N> + (<NPROP> ( <E> + <NPROP> ))
)
<de.PREP>
Solicitud grammar example: the syntactic structure N-de-GN-de is detected
Subcorporation Process




Each grammar (regular expression) is converted to a Finite State Transducer (FST).
Each LU's subcorpus is transduced with the set of grammar FSTs to produce a set of subcorpora.
The transduction process is very efficient (100 transductions per second).
The subcorporation set is converted to XML and imported into FNDesktop.
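The idea of selecting sentences with a grammar can be sketched in Python with the standard `re` module standing in for the compiled FSTs. The pattern below is a simplified reading of the N-de-GN-de structure, not a translation of the SFN grammar, and the `lemma.POS` encoding, the `A` tag for adjectives and the function names are assumptions.

```python
import re

def encode(tokens):
    """Encode (form, lemma, pos) tokens as a 'lemma.POS lemma.POS ...' string."""
    return " ".join(f"{lemma}.{pos}" for _, lemma, pos in tokens)

def compile_n_de_gn_de(pred_lemma):
    """Predicate noun + de + noun group + de (simplified)."""
    # Noun group: optional determiner, optional adjective, noun, optional adjective.
    gn = r"(?:\S+\.DET )?(?:\S+\.A )?\S+\.N(?: \S+\.A)?"
    return re.compile(
        rf"(?<!\S){re.escape(pred_lemma)}\.N de\.PREP {gn} de\.PREP(?!\S)"
    )

def subcorporate(sentences, pattern):
    """Keep the sentences in which the pattern is found."""
    return [s for s in sentences if pattern.search(encode(s))]

match_sentence = [
    ("La", "el", "DET"), ("solicitud", "solicitud", "N"), ("de", "de", "PREP"),
    ("asilo", "asilo", "N"), ("de", "de", "PREP"), ("los", "el", "DET"),
    ("refugiados", "refugiado", "N"),
]
other_sentence = [
    ("La", "el", "DET"), ("solicitud", "solicitud", "N"),
    ("fue", "ser", "V"), ("rechazada", "rechazar", "V:PP"),
]
pattern = compile_n_de_gn_de("solicitud")
print(len(subcorporate([match_sentence, other_sentence], pattern)))
```

Running one such pattern per grammar over the sentences of a lexical unit yields one subcorpus per syntactic structure.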
Subcorporation Process
N-de-GN-de structure detection
Annotation Tool


SFN uses the FN annotation tool (FNDesktop) to add semantic annotation to the LU subcorporation sets.
The FNClassifier has been adapted to Spanish: the classifier has new rules adapted to the Spanish tags and Spanish local syntactic contexts.
Annotation search tools (Web Reports)
Annotation search tools (Sato Tool)