STATISTICAL METHODS IN COMPUTATIONAL LINGUISTICS
Massimo Poesio
Università di Venezia
Course objectives
An introduction to the use of corpora and to statistical methods
Course outline
Fundamentals of statistics, use of corpora
Basic tasks & techniques: word prediction, n-grams, smoothing, spelling, Bayesian inference
POS tagging: tagsets, Brill tagger, HMM tagging
System evaluation
The lexicon
Probabilistic grammars, statistical parsing
Today
Statistics and Linguistics (Abney, 1996)
Fundamentals of probability
Corpora
Practical details
Schedule: 10:30-13, 14:30-17
Labs: from 17 to 18 (not today)
Office hours: 9:30-10:30, 18-19
Email: [email protected]
Web page (temporary):
csstaff.essex.ac.uk/staff/poesio/Courses/Venezia/Stat_NLP/
Empiricism vs. Rationalism
Chomskyan linguistics:
– Assumption: linguistic knowledge mostly innate
– Emphasis on explanation
– Primary goal: simplicity of the theory
Empirical methods:
– Assumption: linguistic knowledge primarily derives from generalizations over experience
– Emphasis on data
– Primary goal: fact discovery
Computational Linguistics between 1960 & 1980: mostly Chomskyan
Problems statistical methods are meant to address
Ambiguity resolution: previous choices were
– Narrow domains to avoid ambiguity
– Hand-coded rules
– Hand-tuned preference weights
Adaptation to new domains
Measuring improvement
Case study: POS tagging
“Time flies like an arrow”
Time    N/V
flies   N/V
like    V/N/CJ
an      Det
arrow   N

Number of tags    Number of word types
1                 35340
2                 3760
3                 264
4                 61
5                 12
6                 2
7                 1
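
The table can be summarised with a little arithmetic. The sketch below assumes the counts pair up with the tag numbers in decreasing order (1 tag: 35340 word types, down to 7 tags: 1 word type); the variable names are illustrative only.

```python
# Word types per number of possible tags, as read from the table above
# (assumed pairing: 1 tag -> 35340, 2 -> 3760, ..., 7 -> 1).
types_by_tag_count = {1: 35340, 2: 3760, 3: 264, 4: 61, 5: 12, 6: 2, 7: 1}

total_types = sum(types_by_tag_count.values())
# A word type is ambiguous if it can carry more than one tag.
ambiguous_types = sum(n for tags, n in types_by_tag_count.items() if tags > 1)
ambiguous_share = ambiguous_types / total_types

print(f"{ambiguous_types} of {total_types} word types are ambiguous "
      f"({ambiguous_share:.1%})")
# prints: 4100 of 39440 word types are ambiguous (10.4%)
```

Under this reading only about one word type in ten is ambiguous. The figure counts types, not running words, so it says nothing about how often ambiguous words occur in text.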
The rise of statistical methods
The first area in which statistical techniques truly proved their worth was Automatic Speech Recognition (ASR)
ASR techniques were then used for POS tagging, and then in all areas of CL
A synthesis of statistical methods and linguistic insights is now underway
Modern empiricism in Computational Linguistics
Large data collections
Rigorous collection techniques (interannotator agreement)
Rigorous evaluation techniques
Discovery of generalizations: via learning techniques
Statistics & the study of language?
Theoretical advances
– Language acquisition: the role of experience
– Linguistic theory: graded grammaticality
– Language change: shifts in grammaticality
Empirical
– Quantify linguistic phenomena
– Analyze data
– Test hypotheses
Psychological
– Express preferences
Some interesting statistics about language
Lexical biases
– Category: “bank” = Noun 85%, Verb 15%
– Sense: Bank(river) 22%, Bank(money) 78%
Syntax
– Subcategorization of “realised”: NP 20%, S 65%, Other 15%
Semantics / discourse
– “he” in subject position 65% of the time
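
Percentages like these are typically relative frequencies counted over an annotated corpus. As a minimal sketch, assuming a hypothetical toy list of (word, tag) pairs rather than real corpus data, the tag distribution of a word could be estimated like this:

```python
from collections import Counter

# Hypothetical toy sample of (word, tag) observations; not real corpus data.
tagged = [("bank", "N"), ("bank", "N"), ("bank", "V"), ("bank", "N"),
          ("run", "V"), ("run", "N"), ("run", "V")]

def tag_distribution(word, tagged_tokens):
    """Relative frequency of each tag among the occurrences of `word`."""
    counts = Counter(tag for w, tag in tagged_tokens if w == word)
    total = sum(counts.values())
    # An unseen word yields an empty Counter, hence an empty dict.
    return {tag: n / total for tag, n in counts.items()}

print(tag_distribution("bank", tagged))
# prints: {'N': 0.75, 'V': 0.25}
```

The same counting scheme would apply to senses or subcategorization frames, given a corpus annotated with that information.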
Corpora
The use of statistical techniques has been made possible by the availability of CORPORA: large collections of text, typically ANNOTATED with linguistic information:
– The Brown corpus (1M words) and British National Corpus (100 million words), annotated with POS tags (English)
– Penn Treebank (4M words), syntactically annotated (English)
– SEMCOR (250K words), annotated with word sense information
– The MapTask, annotated with dialogue information
– Italian: CORIS (100M+ words, Bologna), Si-TAL (220K words, written, annotated with syntactic and word sense information), IPAR (“MapTask Italiano”)
Basic uses of corpora: Collocations
COMPOUNDS: “computer program”, “disk drive”, “calcio di rigore” (Italian: “penalty kick”)
PHRASAL VERBS: “wake up”, “come on”
PHRASAL EXPRESSIONS: “bacon and eggs”, “the bees’ knees”, “siamo alla frutta” (Italian idiom, literally “we are at the fruit”)
Bigrams: New York
Frequency   Word 1   Word 2
80871       of       the
58841       in       the
26430       to       the
…           …        …
15494       to       be
…           …        …
12622       from     the
11428       New      York
…           …        …
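
A frequency table of this kind can be produced by counting adjacent word pairs. The following is a minimal sketch; the function name `bigram_counts` and the toy sentence are illustrative assumptions, and a real corpus would need proper tokenisation.

```python
from collections import Counter

def bigram_counts(tokens):
    """Count adjacent word pairs in a list of tokens."""
    return Counter(zip(tokens, tokens[1:]))

# Hypothetical toy text, standing in for a real corpus.
text = "the train to New York left from the station in New York"
counts = bigram_counts(text.split())

# Print in the same layout as the table: frequency, word 1, word 2.
for (w1, w2), freq in counts.most_common(3):
    print(freq, w1, w2)
```

As the table shows, raw frequency mostly surfaces pairs of function words ("of the", "in the"), so a genuine collocation such as "New York" only appears further down the list.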
Statistical Language Processing
Statistical inference:
– Collect statistics about occurrence of X
– Predict new occurrences
Example: language modeling
– Problem: predict word that follows, given previous ones
– Find W_n that maximizes P(W_n | W_1 ... W_{n-1})
Applications:
– Speech recognition
– Spell-checking
– POS tagging …
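
The prediction problem can be illustrated with a deliberately simplified model. The sketch below assumes a bigram approximation, conditioning only on the single previous word instead of the full history W_1 ... W_{n-1}, and uses unsmoothed relative frequencies; the function names and the toy sentence are illustrative.

```python
from collections import Counter, defaultdict

def train_bigram_model(tokens):
    """Map each word to a Counter of the words observed right after it."""
    followers = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        followers[prev][nxt] += 1
    return followers

def predict_next(model, prev):
    """Return (word, estimated probability) maximizing P(word | prev), or None."""
    if prev not in model:
        return None  # unseen context; smoothing would be needed here
    counts = model[prev]
    word, freq = counts.most_common(1)[0]
    return word, freq / sum(counts.values())

# Hypothetical toy training text.
tokens = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(tokens)
print(predict_next(model, "the"))  # "cat" follows "the" in 2 of 3 cases
```

Any context not seen in training returns no prediction, which is the gap that the smoothing techniques listed in the course outline are meant to fill.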
Bibliography
Steven Abney, Statistical Methods and Linguistics, in Judith Klavans and Philip Resnik (eds.), The Balancing Act, The MIT Press, Cambridge, Mass., 1996.
Textbooks:
– Daniel Jurafsky and James Martin, Speech and Language Processing, Prentice-Hall
  More general, and easier to follow
– Christopher Manning and Hinrich Schütze, Foundations of Statistical Natural Language Processing, MIT Press
  More comprehensive, and written from a more linguistic perspective, but technically more advanced