TOWARDS A CORPUS-BASED
ONLINE DICTIONARY OF
ITALIAN WORD COMBINATIONS
The CombiNet project
SARA CASTAGNOLI
FRANCESCA MASINI
(UNIVERSITY OF BOLOGNA)
GIANLUCA E. LEBANI
ALESSANDRO LENCI
(UNIVERSITY OF PISA)
MALVINA NISSIM
(UNIVERSITY OF GRONINGEN)
VALENTINA PIUNNO
(UNIVERSITY OF ROMA TRE)
ENeL meeting @ Herstmonceux Castle, 13 August 2015
THIS PRESENTATION
• INTRODUCING CombiNet, an ongoing project aimed at
building a corpus-based, lexicographic resource for Italian
Word Combinations (Universities of Roma Tre, Pisa, Bologna)
  • an innovative resource for the Italian language
  • relevance for ENeL-WG3:
    • an electronic resource
    • an integrated computational-lexicographic approach:
      1) automatic extraction of candidate WoCs from corpora
      2) manual evaluation and compilation
• OUTLINE:
  • our view of Word Combinations (WoCs)
  • AKA: extracting WoCs from corpora – methods
  • evaluation of AKA: automatic and manual
WORD COMBINATIONS (WoCs)
The whole range of combinatory possibilities associated with
a word, including:
• Multiword Expressions (MWEs), i.e. a variety of WoCs
characterised by different degrees of fixedness and
idiomaticity that act as a single unit at some level of linguistic
analysis, e.g.:
• idioms
• phrasal lexemes
• collocations
• preferred combinations
• More abstract combinations, i.e. the distributional
properties of a word at the level of e.g.:
• argument structure
• subcategorization frames
• selectional preferences
EXTRACTING WoCs - METHODS
Using POS PATTERNS
(P-BASED methods)
- POS-tagged corpus
- list of POS patterns
NOUN PREP NOUN: punto di vista ‘point of view’
NOUN ADJ: anno accademico ‘academic year’
VERB DET (ADJ) NOUN: costruire un piccolo impero ‘build a small empire’
Using SYNTACTIC INFO
(S-BASED methods)
- parsed corpus
- list of syntactic relations
SUBJ – VERB: guerra – scoppiare ‘war – burst’
VERB – OBJ: perdere – vista ‘lose – (one’s) sight’
VERB – COMP_DI: parlare – di sport ‘talk – about sport’
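The P-based idea on this slide can be illustrated with a minimal sketch. It is not the EXTra tool: the pattern list, the tag names and the toy sentence are assumptions made for illustration, and the optional ADJ slot is expanded into two separate patterns because the sketch only matches fixed, contiguous sequences.

```python
from collections import Counter

# Hypothetical POS patterns based on the slide examples.
# "VERB DET (ADJ) NOUN" is expanded into two fixed patterns.
PATTERNS = [
    ("NOUN", "PREP", "NOUN"),
    ("NOUN", "ADJ"),
    ("VERB", "DET", "NOUN"),
    ("VERB", "DET", "ADJ", "NOUN"),
]

def extract_p_based(tagged_sentence, patterns=PATTERNS):
    """Return (pattern, lemmas) for each contiguous span whose tags match a pattern."""
    hits = []
    for pattern in patterns:
        n = len(pattern)
        for i in range(len(tagged_sentence) - n + 1):
            window = tagged_sentence[i:i + n]
            if tuple(pos for _, pos in window) == pattern:
                hits.append((pattern, tuple(lemma for lemma, _ in window)))
    return hits

# Toy lemmatised, POS-tagged sentence (invented for illustration).
sentence = [
    ("il", "DET"), ("punto", "NOUN"), ("di", "PREP"), ("vista", "NOUN"),
    ("di", "PREP"), ("il", "DET"), ("anno", "NOUN"), ("accademico", "ADJ"),
]

# Candidate WoCs with their raw frequency in the toy data.
candidates = Counter(extract_p_based(sentence))
```

Counting the hits over a whole corpus would give the raw frequencies that an association measure can then rank.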
COMPARING EXTRACTION METHODS
Using POS PATTERNS
(P-BASED methods)
- satisfactory results for relatively fixed | adjacent | short WoCs
- patterns need to be specified a priori
- noise, even after applying AMs
- cannot capture complex and flexible WoCs
- dismissing abstract combinatory information (e.g. argument structure)
Using SYNTACTIC INFO
(S-BASED methods)
- also target discontinuous and syntactically flexible WoCs
- abstracting away from information such as linear order, morphosyntactic features etc.
- no information about how exactly words combine
- cannot distinguish frequent but productive combinations from idiomatic ones with the very same syntactic structure
Castagnoli et al. 2015; Lenci et al. 2014, 2015
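A small sketch may help to show why S-based extraction reaches discontinuous WoCs and why it loses surface information. The parse format, the relation labels and the toy sentence are assumptions made for illustration; they are not the output of the parser or of the LexIt tool.

```python
from collections import Counter

# Toy dependency parse: (index, lemma, POS, head index, relation); head 0 = root.
# Sentence and relation labels are invented for illustration.
parsed = [
    (1, "lui", "PRON", 2, "subj"),
    (2, "perdere", "VERB", 0, "root"),
    (3, "completamente", "ADV", 2, "mod"),
    (4, "il", "DET", 5, "det"),
    (5, "vista", "NOUN", 2, "obj"),
]

def extract_s_based(tokens, relations=("subj", "obj")):
    """Return (relation, head lemma, dependent lemma) for the selected relations."""
    lemma_at = {idx: lemma for idx, lemma, _, _, _ in tokens}
    pairs = []
    for _, lemma, _, head, rel in tokens:
        if rel in relations and head in lemma_at:
            pairs.append((rel, lemma_at[head], lemma))
    return pairs

pairs = Counter(extract_s_based(parsed))
```

Here "perdere" and "vista" are paired although an adverb and a determiner stand between them, which a contiguous POS pattern would miss. The resulting triple no longer records word order, the determiner or any inflection, which is the loss of surface information listed above.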
AUTOMATIC EXTRACTION OF
CANDIDATE WoCs - DATA
• La Repubblica corpus (Baroni et al. 2004)
• approx. 380M tokens, POS-tagged and dependency parsed
• “clean” corpus, but only newspaper language
• POS-based extraction:
• 122 POS sequences deemed representative of Italian WoCs, in 3
subsets (nominal, verbal, prepositional WoCs)
• Independent extraction rounds, using the EXTra tool
• contiguous sequences, no optional slots, log-likelihood (LL) ranking, freq>5
• Syntax-based extraction:
• contiguous and discontinuous sequences, LL ranking, freq>5
• distributional profiles, containing the syntactic slots (subject,
complements, modifiers, etc.) and the combinations of slots (frames)
with which words co-occur, abstracted away from their surface
morphosyntactic patterns
• each slot is associated with lexical sets formed by its most
prototypical fillers
• LexIt tool
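The "LL ranking, freq>5" setting can be sketched as follows, assuming that LL stands for the log-likelihood ratio computed on a 2x2 contingency table of co-occurrence counts. The function names and all counts are invented for illustration; this is not the code of the EXTra or LexIt tools.

```python
import math

def log_likelihood(o11, o12, o21, o22):
    """Log-likelihood ratio (G2) for a 2x2 contingency table of observed counts."""
    n = o11 + o12 + o21 + o22
    rows = (o11 + o12, o21 + o22)
    cols = (o11 + o21, o12 + o22)
    observed = ((o11, o12), (o21, o22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            o = observed[i][j]
            if o > 0:  # a zero cell contributes nothing to the sum
                expected = rows[i] * cols[j] / n
                total += o * math.log(o / expected)
    return 2 * total

def rank_candidates(pair_freq, w1_freq, w2_freq, n_pairs, min_freq=5):
    """Keep pairs with freq > min_freq and sort them by descending LL score."""
    ranked = []
    for (w1, w2), f in pair_freq.items():
        if f <= min_freq:
            continue
        o12 = w1_freq[w1] - f          # w1 with another second word
        o21 = w2_freq[w2] - f          # another first word with w2
        o22 = n_pairs - f - o12 - o21  # neither w1 nor w2
        ranked.append(((w1, w2), f, log_likelihood(f, o12, o21, o22)))
    return sorted(ranked, key=lambda r: r[2], reverse=True)

# Invented counts for three NOUN ADJ candidates.
pair_freq = {("anno", "accademico"): 30, ("anno", "nuovo"): 8, ("anno", "lungo"): 3}
w1_freq = {"anno": 100}
w2_freq = {"accademico": 35, "nuovo": 400, "lungo": 200}
ranked = rank_candidates(pair_freq, w1_freq, w2_freq, n_pairs=10000)
```

With these toy counts "anno accademico" is ranked first, because it occurs far more often than chance would predict, while "anno lungo" is dropped by the frequency filter.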
DATA FOR LEXICOGRAPHERS
• 1) All sequences corresponding to the mentioned patterns
are extracted from the corpus.
• 2) Lists of candidate WoCs are filtered to extract lines
containing specific Target Lemmas (i.e. future headwords)
• Headwords: “fundamental” 2,100 words from the Senso Comune
lexicon (http://www.sensocomune.it/)
• Nouns, Verbs, Adjectives
• 3) Lexicographers are provided with structured lists:
• lemmatised candidate WoCs for a given TL
• ranked according to their LL score
• raw frequency of each combination in the corpus
• underlying POS pattern or syntactic relation
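Steps 2 and 3 above could look roughly like this. The `Candidate` record, the helper names and all figures are hypothetical; the sketch only shows how a list for one Target Lemma might be filtered, ranked and written out in a tab-separated form that a spreadsheet can import.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    lemmas: tuple   # lemmatised candidate WoC
    pattern: str    # underlying POS pattern or syntactic relation
    freq: int       # raw frequency in the corpus
    ll: float       # LL score

def candidates_for(target_lemma, candidates):
    """Keep candidates containing the Target Lemma, ranked by descending LL score."""
    rows = [c for c in candidates if target_lemma in c.lemmas]
    return sorted(rows, key=lambda c: c.ll, reverse=True)

def to_tsv(rows):
    """Serialise a ranked list as tab-separated text with a header row."""
    header = "candidate\tpattern\tfreq\tLL"
    lines = [f"{' '.join(c.lemmas)}\t{c.pattern}\t{c.freq}\t{c.ll:.2f}" for c in rows]
    return "\n".join([header] + lines)

# Invented candidates and scores, for illustration only.
candidates = [
    Candidate(("punto", "di", "vista"), "NOUN PREP NOUN", 120, 950.0),
    Candidate(("perdere", "vista"), "VERB-OBJ", 15, 80.5),
    Candidate(("anno", "accademico"), "NOUN ADJ", 60, 400.0),
    Candidate(("vista", "mare"), "NOUN NOUN", 9, 210.25),
]
rows = candidates_for("vista", candidates)
table = to_tsv(rows)
```

Keeping the pattern as a separate column is what lets a lexicographer later filter the list by POS pattern or syntactic relation.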
POS-BASED DATA
SYNTAX-BASED DATA
LEXICOGRAPHERS’ USE OF DATA
• Candidate lists for each TL are imported into a
spreadsheet.
• As our current lexicographic layout groups WoCs on
the basis of their function and syntactic configuration,
lexicographers can scroll candidate lists or filter
them to observe and evaluate only candidate WoCs
corresponding to specific POS patterns and/or syntactic
relations.
• Candidates considered as valid WoCs are manually
selected and edited before being recorded in the
relevant part of the lexicographic record
LEXICOGRAPHERS’ EVALUATION - 1
(“highly impressionistic feedback from our
lexicographers”)
• LL ranking is generally helpful, as most higher-ranking
candidates represent (or contain, or suggest) proper
WoCs which deserve inclusion in the dictionary.
• However, it is difficult to set thresholds, since WoCs which
they would intuitively include in the entry also appear
in the middle and lower part of the ranking.
• POS-based data are more useful to compile the
entries for nominal and adjectival TLs, whereas
SYNTAX-based data would be more helpful for verbal
TLs.
• No systematic evidence provided.
AUTOMATIC EVALUATION - 1
• We tested and compared the performance of the two extraction
methods using an existing Italian combinatory dictionary as
a benchmark (25 TLs).
• Recall, (R-)precision, thresholds, systems’ overlap
• Interesting findings supporting the lexicographers’ intuition:
• Recall is rather high for both systems
• Recall of P-based method is higher for N and A, while S-based method
has higher recall for V
• Recall for P-based method appears to plateau at 2,000 hits (*)
• P-based and S-based method often extract/don’t extract the same WoCs
(performance is identical for 76% of gold standard combinations) (*)
• But they also extract different gold standard combinations, with a
complementary distribution (P-based: N+A, S-based: V) (*)
• R-precision is higher for S-based method
• Crowdsourcing evaluation: nearly 25% of candidates are valid WoCs
even if they are not included in the benchmark dictionary (*)
Castagnoli et al. 2015
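The measures named on this slide can be sketched under their standard definitions: recall is the share of gold-standard combinations found anywhere in a candidate list, and R-precision is the precision among the top R candidates, where R is the size of the gold standard. The `agreement` helper is an assumed reading of "performance is identical", and all data below are invented for illustration.

```python
def recall(ranked, gold):
    """Share of gold-standard combinations that appear anywhere in the ranked list."""
    return len(set(ranked) & gold) / len(gold)

def r_precision(ranked, gold):
    """Precision among the top R candidates, with R = size of the gold standard."""
    r = len(gold)
    return len(set(ranked[:r]) & gold) / r

def agreement(ranked_a, ranked_b, gold):
    """Share of gold combinations that both systems extract or both miss."""
    a, b = set(ranked_a), set(ranked_b)
    same = sum((g in a) == (g in b) for g in gold)
    return same / len(gold)

# Invented gold standard and system outputs (ranked lists of candidate WoCs).
gold = {"punto di vista", "anno accademico", "perdere la vista", "a prima vista"}
p_based = ["punto di vista", "vista di", "anno accademico", "il anno", "a prima vista"]
s_based = ["punto di vista", "perdere la vista", "vista essere"]

p_recall = recall(p_based, gold)          # 3 of 4 gold items found
p_rprec = r_precision(p_based, gold)      # 2 gold items in the top 4
s_recall = recall(s_based, gold)          # 2 of 4 gold items found
both = agreement(p_based, s_based, gold)  # only "punto di vista" is treated alike
```

In the toy data each system finds gold items that the other misses, which is the kind of complementary behaviour the findings describe.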
LEXICOGRAPHERS’ EVALUATION - 2
• Lexicographers report adding WoCs that “should
intuitively be there” but are not extracted from the corpus.
• More research is needed to:
a) analyse the nature of these WoCs
• Patterns we haven’t thought of? (Long) idioms?
b) assess the impact of extraction techniques and settings
• Min. frequency?
c) assess the impact of corpus type and size
• Limited to a single newspaper corpus
• Virtually no difference with the PAISÀ corpus (250M
words, copyright-free web content)
• Maybe a huge web corpus?
OTHER LIMITATIONS
• Still a lot of manual work for lexicographers
• No automatic import / conversion of acquired data into
an editing database / interface
• We are not using a proper Dictionary Writing System
• Many other ideas that came up listening to some eLex
presentations…
THANK YOU!