Transcript: PowerPoint presentation

Corpus-based computational linguistics
or computational corpus linguistics?
Joakim Nivre
Uppsala University
Department of Linguistics and Philology
Outline
• Different worlds?
– Corpus-based computational linguistics
– Computational corpus linguistics
– Similarities and differences
– Opportunities for collaboration
• Computational linguistics – an example
– Dependency-based syntactic analysis
– Machine learning
Different worlds?
Corpora and computers
• The empirical revolution in (computational) linguistics:
– Increased use of empirical data
– Development of large corpora
– Annotation of corpus data (syntactic, semantic)
• Underlying causes:
– Technical development:
• Availability of machine-readable text (and digitized speech)
• Computational capacity:
– Storage
– Processing
– Scientific shift:
• Criticism of armchair linguistics
• Development of statistical language models
Computational corpus linguistics
• Goal:
– Knowledge of language
• Descriptive studies
• Theoretical hypothesis testing
• Means:
– Corpus data as a source of knowledge of language
• Descriptive statistics
• Statistical inference for hypothesis testing
– Computer programs for processing corpus data
• Corpus development and annotation
• Search and visualization (for humans)
• Statistical analysis (descriptive and inferential)
Corpus-based computational linguistics
• Goal:
– Computer programs that process natural language
• Practical applications (translation, summarization, …)
• Models of language learning and use
• Means:
– Corpus data as a source of knowledge of language:
• Statistical inference for model parameters (estimation)
– Computer programs for processing corpus data
• Corpus development and annotation
• Search and information extraction (for computers)
• Statistical analysis (estimation/machine learning)
Corpus processing 1
• Corpus development:
– Tokenization (minimal units, words, etc.)
– Segmentation (on several levels)
– Normalization (e.g., abbreviations, orthography, multi-word
units; graphical elements, metadata, etc.)
• Annotation:
– Part-of-speech tagging (word → word class)
– Lemmatization (word → base form/lemma)
– Syntactic analysis (sentence → syntactic representation)
– Semantic analysis (word → sense, sentence → proposition)
• Standard methodology:
– Automatic analysis (often based on other corpus data)
– Manual validation (and correction)
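The corpus development steps above (tokenization, segmentation) can be sketched in a few lines. This is a minimal illustration with hypothetical rules, not the actual pipeline described in the talk:

```python
import re

# Hypothetical token pattern: words (allowing internal hyphens and
# apostrophes) or single punctuation marks as minimal units.
TOKEN_RE = re.compile(r"\w+(?:[-']\w+)*|[^\w\s]")

def tokenize(text):
    """Split raw text into minimal units (words and punctuation)."""
    return TOKEN_RE.findall(text)

def segment(tokens):
    """Group tokens into sentences at sentence-final punctuation."""
    sentence, sentences = [], []
    for tok in tokens:
        sentence.append(tok)
        if tok in {".", "!", "?"}:
            sentences.append(sentence)
            sentence = []
    if sentence:  # trailing material without final punctuation
        sentences.append(sentence)
    return sentences
```

Real tokenizers must also handle abbreviations and multi-word units (the normalization step above), which this sketch deliberately omits.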
Corpus processing 2
• Searching and sorting:
– Search methods:
• String matching
• Regular expressions
• Dedicated query languages
• Special-purpose programs
– Results:
• Concordances
• Frequency lists
• Visualization:
– Textual:
• Concordances, etc.
– Graphical:
• Diagram, syntax trees, etc.
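The two result types named above, concordances and frequency lists, can be sketched directly (a minimal keyword-in-context illustration, not any particular corpus tool):

```python
from collections import Counter

def frequency_list(tokens):
    """Frequency list: (word, count) pairs, most frequent first."""
    return Counter(tokens).most_common()

def kwic(tokens, keyword, width=3):
    """Keyword-in-context concordance lines for one search term."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}")
    return lines
```

Dedicated query languages generalize this kind of matching to annotated attributes (part of speech, lemma) rather than raw word forms.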
Corpus processing 3
• Statistical analysis:
– Descriptive statistics
• Frequency tables and diagrams
– Statistical inference
• Hypothesis testing (t-test, χ², Mann-Whitney, etc.)
• Machine learning:
– Probabilistic: Estimate probability distributions
– Discriminative: Approximate the mapping from input to output
– Induction of lexical and grammatical resources
(e.g. collocations, valency frames)
User Requirements
• Corpus linguists
– Software
• Accessible
• Easy to use
• General
– Output
• Suitable for humans
• Perspicuous (graphical visualization)
– Functions
• Specific search
• Descriptive statistics
• Computational linguists
– Software
• Efficient
• Modifiable
• Specific
– Output
• Suitable for computers
• Well-defined format (annotated text)
– Functions
• Exhaustive search
• Statistical learning
Summary
• Different goals:
– Study language
– Create computer programs
• … give (partly) different requirements:
– Accessible and usable (for humans)
– Efficient and standardized (for computers)
• … but (partly) the same needs:
– Corpus development and annotation
– Searching, sorting, and statistical analysis
Symbiosis?
• What can computational linguists do for corpus linguists?
– Technical and general linguistic competence
– Software for automatic analysis (annotation)
• What can corpus linguists do for computational linguists?
– Linguistic and language specific competence
– Manual validation of automatic analysis
• What can they achieve together?
– Automatic annotation improves precision in corpus linguistics
– Manual validation improves precision in computational linguistics
– A virtuous circle?
Computational linguistics – an example
Dependency analysis
[Figure: dependency tree for the sentence "Economic news had little effect on financial markets ." (tokens 1–9), with part-of-speech tags JJ NN VBD JJ NN IN JJ NNS . and dependency labels SBJ, OBJ, NMOD, PMOD, P, ROOT]
Inductive dependency parsing
• Deterministic syntactic analysis (parsing):
– Algorithm for deriving dependency structures
– Requires decision function in choice situations
– All decisions are final (deterministic)
• Inductive machine learning:
– Decision function based on previous experience
– Generalize from examples (successive refinement)
– Examples = annotated sentences (treebank)
– No grammar – just analogy
Algorithm
• Data structures:
– Queue of unanalyzed words (next = first in queue)
– Stack of partially analyzed words (top = on top of stack)
• Start state:
– Empty stack
– All words in queue
• Algorithm steps:
– Shift: Put next on top of stack (push)
– Reduce: Remove top from stack (pop)
– Right: Put next on top of stack (push); link top → next
– Left: Remove top from stack (pop); link next → top
Algorithm example
[Figure: step-by-step transition sequence (Shift, Left-Arc, Right-Arc, Reduce) deriving the dependency tree for "Economic news had little effect on financial markets ." shown earlier]
Decision function
• Non-determinism:
[Figure: parsing "… eats pizza with …" – should the parser attach "with" by RA(ATT), or reduce "pizza" (RE)?]
• Decision function: (Queue, Stack, Graph) → Step
• Possible approaches:
– Grammar?
– Inductive generalization!
Machine learning
• Decision function:
– (Queue, Stack, Graph) → Step
• Model:
– (Queue, Stack, Graph) → (f1, …, fn)
• Classifier:
– (f1, …, fn) → Step
• Learning:
– { ((f1, …, fn), Step) } → Classifier
Model
[Figure: feature model over the stack (…, top) and the queue (next, n1, n2, n3), with head (hd), left dependent (ld), and right dependent (rd) links]
• Parts of speech: t1, top, next, n1, n2, n3
• Dependency types: t.hd, t.ld, t.rd, n.ld
• Word forms: top, next, top.hd, n1
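The model, mapping a parser configuration to a feature vector (f1, …, fn), can be sketched as follows. The exact attribute bookkeeping is assumed for illustration; the feature choice follows the slide (parts of speech, dependency types, and word forms around the stack top and the front of the queue):

```python
def extract_features(stack, queue, tags, words, deps):
    """Map (Queue, Stack, Graph) to a feature vector (f1, ..., fn).

    `tags` and `words` give token attributes by position; `deps` maps a
    token to the label of the arc linking it into the partial graph.
    Missing positions yield the dummy value 'NONE'.
    """
    def tag(i):  return tags[i]  if i is not None else "NONE"
    def word(i): return words[i] if i is not None else "NONE"
    def dep(i):  return deps.get(i, "NONE") if i is not None else "NONE"

    top = stack[-1] if stack else None
    nxt = queue[0]  if queue else None
    n1  = queue[1]  if len(queue) > 1 else None
    return (tag(top), tag(nxt), tag(n1),   # parts of speech
            dep(top), dep(nxt),            # dependency types
            word(top), word(nxt))          # word forms
```

Training pairs each such vector with the transition taken in the treebank, which is exactly the { ((f1, …, fn), Step) } set the learner consumes.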
Memory-based learning
• Memory-based learning and classification:
– Learning is storing experiences in memory.
– Problem solving is achieved by reusing solutions of
similar problems experienced in the past.
• TIMBL (Tilburg Memory-Based Learner):
– Basic method: k-nearest neighbor
– Parameters:
• Number of neighbors (k)
• Distance metrics
• Weighting of attributes, values and instances
Learning example
• Instance base:
1. (a, b, a, c) → A
2. (a, b, c, a) → B
3. (b, a, c, c) → C
4. (c, a, b, c) → A
• New instance:
5. (a, b, b, a)
• Distances:
D(1, 5) = 2
D(2, 5) = 1
D(3, 5) = 4
D(4, 5) = 3
• k-NN:
1-NN(5) = B
2-NN(5) = A/B
3-NN(5) = A
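The worked example above can be reproduced in a few lines. This is a sketch of k-nearest-neighbor classification with plain Hamming (overlap) distance, not TiMBL itself:

```python
from collections import Counter

def hamming(x, y):
    """Number of attribute positions where two instances differ."""
    return sum(a != b for a, b in zip(x, y))

def knn(instances, query, k):
    """Class votes among the k stored instances nearest to `query`."""
    nearest = sorted(instances, key=lambda item: hamming(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common()

# The instance base and new instance from the slide:
instances = [(("a", "b", "a", "c"), "A"), (("a", "b", "c", "a"), "B"),
             (("b", "a", "c", "c"), "C"), (("c", "a", "b", "c"), "A")]
query = ("a", "b", "b", "a")
```

With k = 1 the single nearest neighbor (instance 2, distance 1) gives B; with k = 3 the majority among {B, A, A} gives A, matching the slide.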
Experimental evaluation
• Inductive dependency analysis:
– Deterministic algorithm
– Memory-based decision function
• Data:
– English:
• Penn Treebank, WSJ (1M words)
• Converted to dependency structure
– Swedish:
• Talbanken, Professional prose (100k words)
• Dependency structure based on MAMBA annotation
Results
• English:
– 87.3% of all words got the correct head
– 85.6% of all words got the correct head and label
• Swedish:
– 85.9% of all words got the correct head
– 81.6% of all words got the correct head and label
Dependency types: English
• High precision (86% ≤ F):
VC (auxiliary verb → main verb) 95.0%
NMOD (noun modifier) 91.0%
SBJ (verb → subject) 89.3%
PMOD (complement of preposition) 88.6%
SBAR (complementizer → verb) 86.1%
• Medium precision (73% ≤ F ≤ 83%):
ROOT 82.4%
OBJ (verb → object) 81.1%
VMOD (adverbial) 76.8%
AMOD (adj/adv modifier) 76.7%
PRD (predicative complement) 73.8%
• Low precision (F ≤ 70%):
DEP (other)
Dependency types: Swedish
• High precision (84% ≤ F):
IM (infinitive marker → infinitive) 98.5%
PR (preposition → noun) 90.6%
UK (complementizer → verb) 86.4%
VC (auxiliary verb → main verb) 86.1%
DET (noun → determiner) 89.5%
ROOT 87.8%
SUB (verb → subject) 84.5%
• Medium precision (76% ≤ F ≤ 80%):
ATT (noun modifier) 79.2%
CC (coordination) 78.9%
OBJ (verb → object) 77.7%
PRD (verb → predicative) 76.8%
ADV (adverbial) 76.3%
• Low precision (F ≤ 70%):
INF, APP, XX, ID
Corpus annotation
• How good is 85%?
– Good enough to save time for manual annotators
– Good enough to improve search precision
– Recent release: SUC with syntactic annotation
• How can accuracy be improved further?
– By annotating more data, which facilitates machine learning
– By refined linguistic analysis of the structures to be annotated and the errors made
MaltParser
• Software for inductive dependency parsing:
– Freely available (open source)
• http://maltparser.org
– Evaluated on close to 30 different languages
– Used for annotating corpora at Uppsala University