FipsRomanian: Towards a Romanian Version of the Fips Syntactic Parser
Violeta Seretan, Eric Wehrli, Luka Nerima, Gabriela Soare
LATL – Language Technology Laboratory
Romanian language
Extending Fips to Romanian: two main tasks
Vocabulary
• Latin origin (fundamental vocabulary)
• Slavic origin
• Neologisms: French, Italian, …
• Loanwords: Turkish, Greek, Hungarian, Albanian, ...
Morphology
• Case system inherited from Latin
nominative-accusative, genitive-dative, vocative
• Three grammatical genders
masculine, feminine, neuter
[Figure: Europe - Romance languages]
Sample text
Prezentul regulament intră în vigoare în a douăzecea zi de la publicarea în Jurnalul Oficial al Uniunii Europene.
http://wt.jrc.it/lt/Acquis/
{violeta.seretan, eric.wehrli, luka.nerima, [email protected]}
This Regulation shall enter into force on the twentieth day following that of its publication in the Official Journal of the European Union.
• Rich inflection of determiners, nouns, adjectives, and verbs
e.g., about 35 forms for a verb
• The definite article is enclitic, i.e., suffixed to nouns and adjectives:
casă/house – casa/house-the
mare/big – marea/big-the
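The enclitic article and the case endings lend themselves to paradigm-driven generation: strip the ending of the base form and attach each ending listed in the inflection paradigm. The following is a minimal sketch of that idea, not the actual FipsRomanian lexicon tool. The paradigm table, its feature labels, and the function name `generate_forms` are assumptions made for illustration, and the table covers only regular feminine nouns in -ă (no vowel alternations, no vocative).

```python
# Hypothetical inflection paradigm for regular feminine nouns ending in -ă
# (e.g. casă "house", mamă "mother"). Feature labels are illustrative only.
FEM_A_PARADIGM = {
    "base_ending": "ă",
    "endings": {
        ("sg", "nom-acc", "indef"): "ă",
        ("sg", "nom-acc", "def"): "a",      # enclitic definite article
        ("sg", "gen-dat", "indef"): "e",
        ("sg", "gen-dat", "def"): "ei",
        ("pl", "nom-acc", "indef"): "e",
        ("pl", "nom-acc", "def"): "ele",
        ("pl", "gen-dat", "indef"): "e",
        ("pl", "gen-dat", "def"): "elor",
    },
}


def generate_forms(base, paradigm):
    """Return a dict mapping feature tuples to inflected forms of `base`."""
    base_ending = paradigm["base_ending"]
    if not base.endswith(base_ending):
        raise ValueError(f"{base!r} does not fit this paradigm")
    # Strip the base ending to obtain the stem, then attach every ending.
    stem = base[: len(base) - len(base_ending)]
    return {
        features: stem + ending
        for features, ending in paradigm["endings"].items()
    }


forms = generate_forms("casă", FEM_A_PARADIGM)
for features, form in forms.items():
    print(features, form)
```

A real lexicon would need many such paradigms, plus a treatment of stem alternations, which this sketch deliberately leaves out.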
Orthography
• phonemic; Latin alphabet (since 1859)
• Diacritics: ă/ə, â/ɨ, î/ɨ; cedilla: ş/ʃ, ţ/ʦ
Syntax
• VSO language, relatively free word order
Lexicon construction
• list of headwords (DEX, 1998)
• morphological generation: given a base word form, generates all its forms according to the appropriate inflection paradigm
• manual and semi-automatic insertion
• manual insertion for verbs (specific information: subcategorization, selectional features, thematic function, …)
• Current status:
– simple entries: 60K lexemes / 380K words (10K proper nouns)
– complex entries: multi-word expressions (compounds and collocations):
de jur împrejurul “around”
problemă – a se pune “problem – to arise”
Grammar implementation
• Specifications (Soare, 2005)
• Attachment rules: constraints on the main parser operation, Merge, which combines two adjacent structures into a larger structure
• Customisation of FipsRomanian grammar for standard operations (syntactic transformations: relativization, interrogation, passivization, ...)
• Similarities and differences. Examples:
– clitic system
– wh-fronting
• Current status: about 100 rules specified; nearly half implemented and tested
Fips: a multilingual parsing architecture (Wehrli, 2007)
Underlying theory
• Generative Grammar (Chomsky, 1995)
• Similarities:
– Simpler Syntax (Culicover and Jackendoff, 2005)
– Lexical Functional Grammar (Bresnan, 2001)
Output
• Rich sentence representation:
– constituent structure
– predicate-argument table
– co-indexation chains
– intra-sentential pronoun resolution
FipsRomanian: Sample results
[Figure: sample parse tree produced by Fips, with subject, predicate, and direct object labelled]
Implementation
• Left-to-right, bottom-up tabular parsing algorithm, relying on detailed lexical information
• Language-independent core + language-specific implementation
• Component Pascal, OOP paradigm, BlackBox IDE
• Supported languages: French, English, German, Spanish, Italian, Greek; others in progress
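The left-to-right, bottom-up tabular strategy, with Merge constrained by attachment rules, can be illustrated with a toy chart parser. This is only a sketch under strong simplifying assumptions and not the Fips implementation (which is written in Component Pascal). The lexicon, the category labels, and the three rules are hypothetical, and each rule is reduced to a pair of adjacent categories yielding a new category.

```python
from collections import deque

# Hypothetical toy lexicon and attachment rules (not taken from Fips).
LEXICON = {"o": "D", "casă": "N", "mare": "A"}   # "a", "house", "big"
RULES = {
    ("D", "N"): "DP",    # determiner + noun
    ("N", "A"): "NP",    # noun + postnominal adjective
    ("D", "NP"): "DP",   # determiner + modified noun
}


def parse(tokens):
    """Return the chart: a set of edges (start, end, category)."""
    chart = set()
    for i, token in enumerate(tokens):            # read words left to right
        agenda = deque([(i, i + 1, LEXICON[token])])
        while agenda:
            edge = agenda.popleft()
            if edge in chart:
                continue
            chart.add(edge)
            start, end, cat = edge
            # Merge: combine the new structure with every structure that
            # ends exactly where it starts, if an attachment rule allows it.
            for left_start, left_end, left_cat in list(chart):
                if left_end == start and (left_cat, cat) in RULES:
                    agenda.append((left_start, end, RULES[(left_cat, cat)]))
    return chart


tokens = ["o", "casă", "mare"]
chart = parse(tokens)
full_parses = [e for e in chart if e[0] == 0 and e[1] == len(tokens)]
print(full_parses)
```

Because words are consumed left to right, every new edge ends at the current position, so checking only left-adjacent edges is enough. When no edge spans the whole input, the chart still holds the partial analyses, which is how a partial parse could be reported.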
Preliminary results
Screen captures
Parsing experiment
• data: journalistic texts, 1.05M words
• average sentence length: 26.9 tokens
• 16.2% full parses (FipsFrench, FipsEnglish: about 80%)
• average partial parse length: 5.3 tokens
• unknown words: 6.5% (of which 39.2% proper nouns)
• satisfactory lexical coverage
• grammatical coverage needs to be improved (work in progress!)
[Screen capture: parsing output]
Task-based evaluation
• Collocation extraction from parsed data (Seretan, 2008)
• Collocations are half-idioms: idioms of encoding, but not of decoding
• Used by the parser and by the in-house rule-based machine translation system
• Precision for top 2000 results: 30.3% (precision for French data: 65.9%, top 500 results)
[Figure: sample collocations extracted]
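Precision over the top-ranked results is the share of the first k extracted candidates that are judged to be true collocations. The sketch below shows the computation only. The function name `precision_at_k` and the sample judgments are hypothetical, and the annotation protocol behind the reported figures is not described on the poster.

```python
def precision_at_k(judgments, k):
    """Share of true collocations among the first k ranked candidates.

    `judgments` is a list of booleans ordered by the extractor's ranking:
    True if the candidate was judged a collocation, False otherwise.
    """
    top = judgments[:k]
    if not top:
        raise ValueError("no candidates to evaluate")
    return sum(top) / len(top)


# Made-up judgments for illustration; not data from the experiment.
sample = [True, False, True, True, False, False, True, False]
print(precision_at_k(sample, 4))
```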
Related work & Useful resources
• Data-driven dependency parser for Romanian based on MaltParser, which learns dependencies from manual annotations (Călăcean and Nivre, 2009). Problem: reduced treebank size and grammatical coverage (simple structures, no subordination, average sentence length of only 9 words).
• Sketch Engine for Romanian: shallow parsing (POS patterns), http://www.sketchengine.co.uk/
• Dependency treebank construction, work in progress at the University of Iaşi, Romania
• Text processing web services, RACAI – Research Institute for Artificial Intelligence, Romanian Academy, Bucharest, Romania. http://www.racai.ro/webservices/TextProcessing.aspx
• A repository of tools for Romanian: ConsILR - Consortium for the Romanian Language: Resources & Tools, research groups from Iaşi, Bucharest and Chişinău. http://consilr.info.uaic.ro/
Faculté des Lettres, Département de Linguistique
[Screen captures: POS-tagging output, Fips interface, Lexicon interface]
References
Bresnan, J. 2001. Lexical-Functional Syntax. Blackwell, Oxford.
Chomsky, N. 1995. The Minimalist Program. MIT Press, Cambridge, Mass.
Culicover, P. and R. Jackendoff. 2005. Simpler Syntax. Oxford University Press, Oxford.
Călăcean, M. and J. Nivre. 2009. A data-driven dependency parser for Romanian. In Proceedings of the 7th International Workshop on Treebanks and Linguistic Theories (TLT 7), pages 65–76, Groningen, the Netherlands.
DEX. 1998. Dicţionarul explicativ al limbii române. Academia Română, Bucharest.
Seretan, V. 2008. Collocation extraction based on syntactic parsing. Ph.D. thesis, University of Geneva.
Soare, G. 2005. Romanian syntax. Technical report, University of Geneva.
Wehrli, E. 2007. Fips, a “deep” linguistic multilingual parser. In ACL 2007 Workshop on Deep Linguistic Processing, pages 120–127, Prague, Czech Republic.