Tools and resources (not only) for
French, Italian and Spanish
Thomas Koller
NCLT seminar series, 22.11.2005
Overview
Plurilingual learning
Existing resources
Created resources
Developed tools
Software architecture
Plurilingual learning
• Exploits learners’ knowledge of similar
languages
• Raises language awareness by showing
similar properties in several languages
• Aims to avoid learners’ typical errors
related to transfer processes
Plurilingual learning: Fields of similarity
• Pan-Romance vocabulary (dormir, sang, vin)
– 39 words in all languages
– 141 words in 8-9 languages
– 227 words in 5-7 languages
• Sound correspondences
– Sp. ñ → Fr. / It. gn, n:
señor, campaña → seigneur, campagne / signore, campagna
año → an / anno
• Morphosyntactic elements
Plurilingual learning: Example
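Sp. El padre habla con su hijo de la escuela
It. Il padre parla con suo figlio della scuola
Fr. Le père parle avec son fils de l’école
Linking cues shown on the slide:
– paternal, Pater (padre / père)
– su- (su / suo)
– parl- (parla / parle)
– fil- (figlio / fils)
– de la, de l’ (de la / della / de l’)
– school, Schule (escuela / scuola / école)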
Existing resources: Linguistic tools
• POS tagger
– TreeTagger
– SVMTool (Spanish, English, Catalan)
• IBM JFrost lemmatiser
– provides possible base forms + POS
– morphological information (no POS tagging)
• Verb conjugator
– English, German, French, Italian and Spanish
– generates all forms for all tenses
Existing plurilingual resources
• Pan-Romance wordlist: 840 words
eau agua acqua -- utiliser utilizar utilizzare
• Profile words: 340 words
avec con con -- presque casi quasi
• Sound correspondences (number of rules per language pair, with one example each):
– Italian → Spanish: 19 (chi- → ll-: chiamare → llamar)
– Italian → French: 19 (-ott- → -uit-: notte → nuit)
– Spanish → Italian: 23 (-ue- → -uo-: bueno → buono)
– Spanish → French: 31 (ll- → pl-: llorar → pleurer)
– French → Italian: 17 (qu- → ch-: que → che)
– French → Spanish: 27 (-ein → -eno: plein → lleno)
Existing resources
• Bilingual wordlists
– wordlists can easily be converted into
• different XML formats
• relational databases
– used to create multilingual XML lexicons
• Plurilingual lexicon
– French, Italian, Spanish (Portuguese,
Romanian)
– 1800 entries
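The conversion of a wordlist into XML can be sketched as follows. This is only a minimal illustration: the element names (lexicon, entry, form), the lang attribute and the column order French, Spanish, Italian are assumptions taken from the sample rows, not the format actually used in the project.

```python
import xml.etree.ElementTree as ET

# Assumed column order, read off the sample rows (French, Spanish, Italian).
LANGS = ("fr", "es", "it")

def wordlist_to_xml(lines):
    """Turn whitespace-separated wordlist rows into a small XML lexicon."""
    root = ET.Element("lexicon")  # hypothetical element name
    for line in lines:
        forms = line.split()
        if len(forms) != len(LANGS):
            continue  # skip rows that do not have one form per language
        entry = ET.SubElement(root, "entry")
        for lang, form in zip(LANGS, forms):
            ET.SubElement(entry, "form", lang=lang).text = form
    return ET.tostring(root, encoding="unicode")

xml_text = wordlist_to_xml(["eau agua acqua", "utiliser utilizar utilizzare"])
print(xml_text)
```

The same rows could be written to a relational table instead by replacing the XML calls with INSERT statements.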
Existing resources: Plurilingual lexicon
– [1]actuar, [2]tratarse
agir [v] {1 intransitif, 2 pronominal impers.}
[1]agire, [2]trattarsi
– [caldo->'bouillon'], caliente
chaud [adj]
caldo
– contar [+'raconter']
compter [v]
contare [+'raconter']
Created resources
• Multilingual XML lexicon
– 43 topics
– French: 11,500 lemmas / 14,900 entries
– Italian: 13,400 lemmas / 17,800 entries
– Spanish: 14,600 lemmas / 19,700 entries
– English: 17,600 lemmas / 25,900 entries
– German: 5,200 lemmas / 7,300 entries
– POS: nouns (m, n, f), verbs, adverbs, adjectives,
conjunctions, articles, pronouns, prepositions,
interjections, numerals
– Language levels: 1 - 4
Multilingual XML lexicon: sample entry
Created resources: verb lexicons
Verb lexicons with 500 verbs for each language
containing verb pattern information
accepter
<vt> <v pron>
[de + INF]
[de faire qch]
[par]
[qch de qn]
[que]
Created resources: verb lexicons
Full-form verb lexicons for 1500 – 1700 verbs
échappe
échapper:pres:1s
échapper:pres:3s
échapper:subj_pres:1s
échapper:subj_pres:3s
échapper:impe:2s
abandonner
1s_abandonne
2s_abandonnes
3s_abandonne
1p_abandonnons
2p_abandonnez
3p_abandonnent
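A full-form lexicon of this kind can be derived from the paradigms. The sketch below inverts the abandonner paradigm into a form-to-analysis index in the lemma:tense:person style of the échappe example. It assumes that the listed paradigm is the present tense, which the slide does not state, and the function name is hypothetical.

```python
from collections import defaultdict

# Paradigm in the "person+number_form" notation from the slide.
PARADIGMS = {
    "abandonner": ["1s_abandonne", "2s_abandonnes", "3s_abandonne",
                   "1p_abandonnons", "2p_abandonnez", "3p_abandonnent"],
}

def build_full_form_index(paradigms, tense="pres"):
    """Map each inflected form to all of its lemma:tense:person analyses."""
    index = defaultdict(list)
    for lemma, cells in paradigms.items():
        for cell in cells:
            person, form = cell.split("_", 1)
            index[form].append(f"{lemma}:{tense}:{person}")
    return dict(index)

index = build_full_form_index(PARADIGMS)
print(index["abandonne"])  # ['abandonner:pres:1s', 'abandonner:pres:3s']
```

Ambiguous forms such as abandonne keep all their analyses, as in the échappe entry.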
Overview
Developed tools
Animated grammar presentations
Dictionary tools
Plurilingual analysis module
Animated grammar presentations
• Dynamic representation of grammatical
properties / processes
• Tailor-made presentations
– Replacing indications of place
– Emphasising the subject
– Irregular verb conjugations
– Spatial prepositions and movements
• Authoring tool for creation of slide-based
learning materials with animated content
– produces slide-based learning materials
– animated and/or static text can be included
Authoring tool: Presenter
• Can be embedded in web page or used
as standalone tool in Windows
• XML data can be created automatically
and then fed into the presenter
→ suitable for flexible feedback
• Several XML files can be provided for
use in one page and then e.g. chosen
via PHP or JavaScript
Dictionary tools
• Input: any text in French, Italian or
Spanish
• Provide word-by-word translations
• Multilingual dictionary tool
– Tense, number, person for verb forms
– POS
– Topic
• Plurilingual dictionary tool
– Similar word forms
– Profile words
Multilingual dictionary: Resources
• Used resources
– Multilingual XML lexicons, multilingual
MySQL database
– Full-form verb lexicons
• Dictionary tool can easily be used with
any other database
– special language dictionaries
– monolingual definition dictionaries
Plurilingual dictionary:
Tools and resources
• TreeTagger provides most likely POS
• Pan-Romance wordlist and list of profile
words
• Tool makes use of
– sound correspondences
– Levenshtein string similarity measure
– multilingual MySQL database
to automatically detect graphically
similar words with the same meaning
Plurilingual dictionary: Word detection
• Basically all words of target language
with “distance” ≤ 2 are displayed
• Sp. posibilidad -- Fr. possibilité
→ Normal distance: 4
• Sound correspondence:
Sp. -dad -- Fr. -té
→ Intermediate form: posibilité
• Distance between intermediate form
and French form is now only 1
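The detection step can be reproduced with a plain Levenshtein distance plus one rewrite rule. This is a minimal sketch, not the tool's code: the rule table holds only the Sp. -dad → Fr. -té correspondence from the slide, and the function names and the candidate list are hypothetical.

```python
import unicodedata

def levenshtein(a, b):
    """Edit distance with unit costs for insert, delete and substitute."""
    a = unicodedata.normalize("NFC", a)  # treat é as a single character
    b = unicodedata.normalize("NFC", b)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]

# Hypothetical rule table: only the Sp. -dad -> Fr. -té rule from the slide.
SUFFIX_RULES = [("dad", "té")]

def intermediate_form(word, rules=SUFFIX_RULES):
    """Apply the first matching suffix correspondence, if any."""
    for old, new in rules:
        if word.endswith(old):
            return word[:-len(old)] + new
    return word

def similar_words(source, targets, max_distance=2):
    """Keep target words within max_distance of the intermediate form."""
    bridged = intermediate_form(source)
    return [t for t in targets if levenshtein(bridged, t) <= max_distance]

print(levenshtein("posibilidad", "possibilité"))                     # 4
print(intermediate_form("posibilidad"))                              # posibilité
print(levenshtein(intermediate_form("posibilidad"), "possibilité"))  # 1
```

With the threshold of 2, the French word is missed on the raw forms and found once the sound correspondence has been applied.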
Plurilingual analysis module
• Exploits similar sentence structures in
Romance languages
• Analyses learner input up to (paragraphs
of) simple sentences and gives detailed
feedback
Resources
• JFrost:
– possible lemmas + POS
– (extended morphological information)
• Verb lexicons
• Hand-crafted grammar
Parser type
Robust island parser
Hoy la madre no ha vuelto a hablar con su hijo.
POS marked on the slide: ha (V) vuelto (V) a (P) hablar (V)
Sentence parts: subject / verb group / object
Verb group:
• has a fixed position and extension in the sentence
• only contains verbs and certain POS
→ the sentence is split at potential verb groups
→ only the parts before and after the verb group are
actually parsed
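The splitting step can be illustrated on the example sentence. This is a sketch under stated assumptions: the tag names, the set of POS allowed inside a verb group (here prepositions and negation) and the decision to attach a preceding negation to the verb group are all hypothetical, since the slide only says "verbs and certain POS".

```python
VERB = "V"
INSIDE = {VERB, "P", "NEG"}  # assumed tags that may occur inside a verb group
LEADING = {"NEG"}            # assumed tags that may precede the first verb

def split_at_verb_group(tagged):
    """Return (before, verb_group, after) for a list of (word, tag) pairs."""
    verbs = [i for i, (_, tag) in enumerate(tagged) if tag == VERB]
    if not verbs:
        return tagged, [], []
    start = end = verbs[0]
    k = start + 1
    # Extend to the right, but let the group end on a verb.
    while k < len(tagged) and tagged[k][1] in INSIDE:
        if tagged[k][1] == VERB:
            end = k
        k += 1
    # Attach a directly preceding negation.
    while start > 0 and tagged[start - 1][1] in LEADING:
        start -= 1
    return tagged[:start], tagged[start:end + 1], tagged[end + 1:]

sentence = [("Hoy", "ADV"), ("la", "DET"), ("madre", "N"), ("no", "NEG"),
            ("ha", "V"), ("vuelto", "V"), ("a", "P"), ("hablar", "V"),
            ("con", "P"), ("su", "DET"), ("hijo", "N")]
before, group, after = split_at_verb_group(sentence)
print([w for w, _ in group])  # ['no', 'ha', 'vuelto', 'a', 'hablar']
```

Only the chunks before and after the group would then be handed to the NP rules.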
Analysis module: Recognised errors
• Agreement errors
– inside NPs
– between sentence components
• Subcategorisation errors
– too many/few sentence components
– wrong preposition
– wrong non-finite verb form
• Position errors
– Negation
– Adverbs
• ...
Error recognition
• Constraint relaxation
– no constraints during parsing
– suite of tests after parsing
• Agreement
• Position of adverbs
• Correctness of Verb group
• Error rules
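The idea of testing after parsing can be shown with an agreement check. In this sketch a phrase has already been accepted by the unconstrained parser, and a separate test compares the features of its words. The feature names, the dictionary layout and the message wording are assumptions for illustration only.

```python
def agreement_errors(phrase, features=("gender", "number")):
    """Report every feature on which the words of a parsed phrase disagree."""
    errors = []
    for feature in features:
        # Collect the distinct values of this feature across the phrase.
        values = {tok[feature] for tok in phrase if feature in tok}
        if len(values) > 1:
            forms = " ".join(tok["form"] for tok in phrase)
            errors.append(f"{feature} mismatch in '{forms}'")
    return errors

# "las casa": parsed as an NP without constraints, then tested.
learner_np = [
    {"form": "las", "gender": "f", "number": "pl"},
    {"form": "casa", "gender": "f", "number": "sg"},
]
print(agreement_errors(learner_np))  # ["number mismatch in 'las casa'"]
```

Because the parse succeeds first, the feedback can name the exact phrase and feature instead of rejecting the whole sentence.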
Modules
• Grammar reader
– Reads in grammar file
– Extrapolates phrase structure rules
NP -> (det) n (AP)
– Provides direct access to subparts of the grammar
“give me all NP rules for Spanish”
• Verb group divider
– Divides sentence at its verbal group
– Returns the sentence chunks before and after the VG
• NP finder
– Finds all possible NP occurrences in sentence
chunks
– Returns positions of NPs in sentence chunks
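The grammar reader can be sketched as follows, reading "extrapolates" as expanding optional elements into concrete rules. The one-rule-per-line file format with a language prefix, the class name and the AP rule are assumptions; only the NP rule is taken from the slide.

```python
from collections import defaultdict
from itertools import product

def expand_rule(rule):
    """Expand 'NP -> (det) n (AP)' into all variants without optional marks."""
    lhs, rhs = (part.strip() for part in rule.split("->"))
    choices = []
    for symbol in rhs.split():
        if symbol.startswith("(") and symbol.endswith(")"):
            choices.append([symbol[1:-1], None])  # present or absent
        else:
            choices.append([symbol])
    expansions = [[s for s in combo if s is not None]
                  for combo in product(*choices)]
    return lhs, expansions

class GrammarReader:
    """Reads lines such as 'es: NP -> (det) n (AP)' (hypothetical format)."""

    def __init__(self, lines):
        self._rules = defaultdict(list)
        for line in lines:
            line = line.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            lang, rule = line.split(":", 1)
            lhs, expansions = expand_rule(rule)
            self._rules[(lang.strip(), lhs)].extend(expansions)

    def rules(self, lang, category):
        """E.g. rules('es', 'NP'): all NP rules for Spanish."""
        return list(self._rules.get((lang, category), []))

reader = GrammarReader(["es: NP -> (det) n (AP)", "es: AP -> adj"])
print(reader.rules("es", "NP"))
# [['det', 'n', 'AP'], ['det', 'n'], ['n', 'AP'], ['n']]
```

An NP finder could then try each expanded rule at every position of a sentence chunk and return the matching spans.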
Software architecture: Interaction of software components
[Diagram with the labels: Server, Client, XML, PHP, MySQL, Perl, Java, Flash, NLP, Web page, PDF, Shared Object]
Software architecture: Pros
• Uniform representation on several
platforms, browser-independent
• Easy integration of different media types
(audio, video, images, animation)
• Embed fonts for many character sets
(Cyrillic, Hebrew, Arabic, Chinese,
Japanese, Korean)
• Flash Remoting: sending complex data
structures (Java objects, arrays,
hashes) to and from server
Software architecture: Pros
• Flash files can interact mutually via
JavaScript, LocalConnection class or
using the same Local Shared Objects
• Local Shared Objects provide the
opportunity to save structured data (e.g.
XML data) on the client side
• No reload necessary for incoming
server data
• Can read XML files; XPath and regular
expressions can be used
Software architecture: Cons
• (Requires browser plug-in)
• Steep learning curve at the beginning
• Contents cannot be read by search
engines
• Software is not free