Issues in Computational Linguistics:
Grammar Engineering
Dick Crouch and Tracy King
Outline
What is a deep grammar?
How to engineer them:
– robustness
– integrating shallow resources
– ambiguity
– writing efficient grammars
– real world data
What is a shallow grammar?
Often trained automatically from marked-up corpora:
– part of speech tagging
– chunking
– trees
POS tagging and Chunking
Part of speech tagging:
I/PRP saw/VBD her/PRP duck/VB ./PUNCT
I/PRP saw/VBD her/PRP$ duck/NN ./PUNCT
Chunking:
– general chunking
[I begin] [with an intuition]: [when I read] [a sentence], [I read it] [a chunk] [at a time]. (Abney)
– NP chunking
[NP President Clinton] visited [NP the Hermitage] in [NP Leningrad]
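As a concrete (non-slide) illustration, this kind of shallow analysis can be produced with off-the-shelf tools; the sketch below uses NLTK's tagger and a small regular-expression NP chunker, with the example sentence and chunk pattern chosen purely for illustration.

    import nltk
    # Requires the NLTK models: nltk.download('punkt'); nltk.download('averaged_perceptron_tagger')

    tokens = nltk.word_tokenize("President Clinton visited the Hermitage in Leningrad.")
    tagged = nltk.pos_tag(tokens)            # part-of-speech tagging, e.g. ('Clinton', 'NNP')
    print(tagged)

    # NP chunking with a small regular-expression grammar over the POS tags
    chunker = nltk.RegexpParser(r"NP: {<DT|PRP\$>?<JJ>*<NN.*>+}")
    print(chunker.parse(tagged))             # roughly [NP President Clinton] ... [NP the Hermitage] ...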
Treebank grammars
Phrase structure tree (c-structure)
Annotations for heads, grammatical functions
Collins parser output
Deep grammars
Provide detailed syntactic/semantic analyses
– LFG (ParGram), HPSG (LinGO, Matrix)
– Grammatical functions, tense, number, etc.
Mary wants to leave.
subj(want~1,Mary~3)
comp(want~1,leave~2)
subj(leave~2,Mary~3)
tense(leave~2,present)
Usually manually constructed
– linguistically motivated rules
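A minimal sketch (not the ParGram format) of how such a dependency analysis could be represented and queried in Python:

    # "Mary wants to leave." as (relation, head, dependent) triples, following the slide.
    analysis = {
        ("subj",  "want~1",  "Mary~3"),
        ("comp",  "want~1",  "leave~2"),
        ("subj",  "leave~2", "Mary~3"),
        ("tense", "leave~2", "present"),
    }

    def subjects_of(predicate, facts):
        """Who does the leaving? Read the SUBJ relation off the analysis."""
        return [dep for rel, head, dep in facts if rel == "subj" and head == predicate]

    print(subjects_of("leave~2", analysis))   # ['Mary~3']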
Why would you want one?
Meaning sensitive applications
– overkill for many NLP applications
Applications that use shallow methods for English may not be able to do so for "free" word order languages
– can read many functions off of trees in English
SUBJ: NP sister to VP [S [NP Mary] [VP left]]
OBJ: first NP sister to V [S [NP Mary] [VP saw [NP John]]]
– need other information in German, Japanese, etc.
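The configurational heuristics above can be sketched directly; the function below (illustrative only, using NLTK's Tree class) reads SUBJ and OBJ off an English-style tree:

    import nltk

    def subject_and_object(bracketing):
        """Read SUBJ/OBJ off an English phrase-structure tree (configurational heuristics)."""
        tree = nltk.Tree.fromstring(bracketing)
        subj = obj = None
        for child in tree:
            # SUBJ: the NP sister of VP directly under S
            if isinstance(child, nltk.Tree) and child.label() == "NP":
                subj = " ".join(child.leaves())
            if isinstance(child, nltk.Tree) and child.label() == "VP":
                # OBJ: the first NP sister of V inside the VP
                for vp_child in child:
                    if isinstance(vp_child, nltk.Tree) and vp_child.label() == "NP":
                        obj = " ".join(vp_child.leaves())
                        break
        return subj, obj

    print(subject_and_object("(S (NP Mary) (VP saw (NP John)))"))   # ('Mary', 'John')
    print(subject_and_object("(S (NP Mary) (VP left))"))            # ('Mary', None)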
Deep analysis matters…
if you care about the answer
Example:
A delegation led by Vice President Philips, head of the chemical
division, flew to Chicago a week after the incident.
Question: Who flew to Chicago?
Candidate answers:
– division (closest noun; shallow but wrong)
– head (next closest)
– V.P. Philips (next)
– delegation (furthest away, but the subject of "flew"; deep and right)
Applications of Language Engineering
[Chart: applications plotted by Domain Coverage (Narrow to Broad) vs. Functionality (Low to High). Applications shown include: Manually-tagged Keyword Search, Alta Vista, Google, AskJeeves, Post-Search Sifting, Document Base Management, Autonomous Knowledge Filtering, Restricted Dialogue, Knowledge Fusion, Good Translation, Useful Summary, Natural Dialogue, Microsoft Paperclip]
Traditional Problems
Time consuming and expensive to write
Not robust
– want output for any input
Ambiguous
Slow
Other gating items for applications that need
deep grammars
Why deep analysis is difficult
Languages are hard to describe
– Meaning depends on complex properties of words and sequences
– Different languages rely on different properties
– Errors and disfluencies
Languages are hard to compute
– Expensive to recognize complex patterns
– Sentences are ambiguous
– Ambiguities multiply: explosion in time and space
How to overcome this?
Engineer the deep grammars
– theoretical vs. practical
– what is good enough
Integrate shallow techniques into deep
grammars
Experience based on broad-coverage LFG
grammars (ParGram project)
Robustness: Sources of Brittleness
missing vocabulary
– you can't list all the proper names in the world
missing constructions
– there are many constructions theoretical linguistics
rarely considers (e.g. dates, company names)
– easy to miss even core constructions
ungrammatical input
– real world text is not always perfect
– sometimes it is really horrendous
Real world Input
Other weak blue-chip issues included Chevron, which
went down 2 to 64 7/8 in Big Board composite trading
of 1.3 million shares; Goodyear Tire & Rubber, off 1
1/2 to 46 3/4, and American Express, down 3/4 to 37
1/4. (WSJ, section 13)
``The croaker's done gone from the hook –
(WSJ, section 13)
(SOLUTION 27000 20) Without tag P-248 the W7F3
fuse is located in the rear of the machine by the
charge power supply (PL3 C14 item 15.
(Eureka copier repair tip)
Missing vocabulary
Build vocabulary based on the input of
shallow methods
– fast
– extensive
– accurate
Finite-state morphologies
Part of Speech Taggers
Finite State Morphologies
Finite-state morphologies
falls -> fall +Noun +Pl
fall +Verb +Pres +3sg
Mary -> Mary +Prop +Giv +Fem +Sg
vienne -> venir +SubjP +SG {+P1|+P3} +Verb
Build lexical entry on-the-fly from the
morphological information
– have canonicalized stem form
– have significant grammatical information
– do not have subcategorization
Building lexical entries
Lexical entries:
  -unknown  N      @(COMMON-NOUN %stem).
  +Noun     N-SFX  @(PERS 3).
  +Pl       N-NUM  @(NUM pl).
Rule:
  NOUN -> N N-SFX N-NUM.
Templates:
  – COMMON-NOUN :: (^ PRED)='%stem'
                   (^ NTYPE)=common
  – PERS(3) :: (^ PERS)=3
  – NUM(pl) :: (^ NUM)=pl
Building lexical entries
F-structure for falls:
  [ PRED  'fall'
    NTYPE common
    PERS  3
    NUM   pl ]
C-structure for falls:
  NOUN
    N      fall
    N-SFX  +Noun
    N-NUM  +Pl
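For illustration, the same mapping from morphological tags to features can be sketched in Python; the tag names follow the slides, while the function and dictionary names are invented for the sketch (and cover only the noun case shown above).

    # Build a lexical entry on the fly from an FST analysis such as "fall +Noun +Pl".
    TEMPLATES = {
        "+Noun": lambda fs: fs.update({"PERS": 3}),    # cf. N-SFX  @(PERS 3)
        "+Pl":   lambda fs: fs.update({"NUM": "pl"}),  # cf. N-NUM  @(NUM pl)
    }

    def build_entry(morph_analysis):
        stem, *tags = morph_analysis.split()
        fs = {"PRED": stem, "NTYPE": "common"}         # cf. @(COMMON-NOUN %stem), noun case only
        for tag in tags:
            if tag in TEMPLATES:
                TEMPLATES[tag](fs)                     # apply the template for each morph tag
        return fs

    print(build_entry("fall +Noun +Pl"))
    # {'PRED': 'fall', 'NTYPE': 'common', 'PERS': 3, 'NUM': 'pl'}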
Guessing words
Use FST guesser if the morphology doesn't
know the word
– Capitalized words can be proper nouns
» Saakashvili -> Saakashvili +Noun +Proper +Guessed
– -ed words can be past tense verbs or adjectives
» fumped -> fump +Verb +Past +Guessed
fumped +Adj +Deverbal +Guessed
Languages with more morphology allow for
better guessers
Using the lexicons
Rank the lexical lookup
1. overt entry in lexicon
2. entry built from information from morphology
3. entry built from information from guesser
Use the most reliable information
Fall back only as necessary
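A toy version of this ranked lookup, with a shape-based guesser as the last resort (the lexicon contents, feature names, and stand-in morphology are illustrative):

    import re

    LEXICON = {"dog": [{"PRED": "dog", "CAT": "noun"}]}        # overt, hand-written entries

    def from_morphology(word):
        """Stand-in for the finite-state morphology (returns [] for unknown words)."""
        return []

    def from_guesser(word):
        """Shape-based guesser for words the morphology does not know."""
        guesses = []
        if word[0].isupper():                                  # capitalized: maybe a proper noun
            guesses.append({"PRED": word, "NTYPE": "proper", "GUESSED": True})
        if re.search(r"ed$", word):                            # -ed: past verb or deverbal adjective
            guesses.append({"PRED": word[:-2], "CAT": "verb", "TENSE": "past", "GUESSED": True})
            guesses.append({"PRED": word, "CAT": "adj", "DEVERBAL": True, "GUESSED": True})
        return guesses

    def lookup(word):
        """Ranked lookup: 1. overt lexicon, 2. morphology, 3. guesser."""
        return LEXICON.get(word) or from_morphology(word) or from_guesser(word) or []

    print(lookup("dog"))           # overt entry wins
    print(lookup("Saakashvili"))   # guessed proper noun
    print(lookup("fumped"))        # guessed past verb or deverbal adjective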
Missing constructions
Even large hand-written grammars are not
complete
– new constructions, especially with new corpora
– unusual constructions
Generally longer sentences fail
– one error can destroy the parse
Build up as much as you can; stitch together
the pieces
Grammar engineering approach
First try to get a complete parse
If fail, build up chunks that get complete
parses
Have a fall back for things without even chunk
parses
Link these chunks and fall backs together in a
single structure
Fragment Chunks: Sample output
Input: "the the dog appears."
Split into:
– "token" the
– sentence "the dog appears"
– ignore the period
C-structure
F-structure
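A rough sketch of the fragment strategy on this example; the parse function is just a stand-in for the real grammar, and the output format is invented for the sketch:

    def parse(tokens):
        """Stand-in for the deep parser: returns an analysis or None on failure."""
        return ("S", tokens) if tokens == ["the", "dog", "appears"] else None

    def fragment_parse(tokens):
        """Try a full parse; otherwise stitch maximal parsable chunks and leftover tokens."""
        full = parse(tokens)
        if full:
            return full
        fragments, i = [], 0
        while i < len(tokens):
            # greedily take the longest chunk starting at i that the grammar can parse
            for j in range(len(tokens), i, -1):
                chunk = parse(tokens[i:j])
                if chunk:
                    fragments.append(chunk)
                    i = j
                    break
            else:
                fragments.append(("TOKEN", tokens[i]))   # fall back to a bare token
                i += 1
        return ("FRAGMENTS", fragments)

    print(fragment_parse("the the dog appears".split()))   # the period is already stripped here
    # ('FRAGMENTS', [('TOKEN', 'the'), ('S', ['the', 'dog', 'appears'])])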
Ungrammatical input
Real world text contains ungrammatical input
– typos
– run ons
– cut and paste errors
Deep grammars tend to only cover
grammatical output
Two strategies
– robustness techniques: guesser/fragments
– dispreferred rules for ungrammatical structures
Rules for ungrammatical structures
Common errors can be coded in the rules
– want to know that error occurred
(e.g., feature in f-structure)
Disprefer parses of ungrammatical structure
– tools for grammar writer to rank rules
– two+ pass system
1. standard rules
2. rules for known ungrammatical constructions
3. default fall back rules
Sample ungrammatical structures
Mismatched subject-verb agreement:
  Verb3Sg = { SUBJ PERS = 3
              SUBJ NUM = sg
            | BadVAgr }
Missing copula:
  VPcop ==> { Vcop: ^=!
            | e: (^ PRED)='NullBe<(^ SUBJ)(^ XCOMP)>'
                 MissingCopularVerb }
            { NP: (^ XCOMP)=!
            | AP: (^ XCOMP)=!
            | … }
Robustness summary
Integrate shallow methods
– for lexical items
– morphologies
– guessers
Fall back techniques
– for missing constructions
– fragment grammar
– dispreferred rules
Ambiguity
Deep grammars are massively ambiguous
Example: 700 sentences from section 23 of the WSJ
– average # of words: 19.6
– average # of optimal parses: 684
» for 1-10 word sentences: 3.8
» for 11-20 word sentences: 25.2
» for 50-60 word sentences: 12,888
Managing Ambiguity
Use packing to parse and manipulate the
ambiguities efficiently
(more tomorrow)
Trim early with shallow markup
– fewer parses to choose from
– faster parse time
Choose most probable parse for applications
that need a single input
Shallow markup
Part of speech marking as filter
I saw her duck/VB.
– accuracy of tagger (v. good for English)
– can use partial tagging (verbs and nouns)
Named entities
– <company>Goldman, Sachs & Co.</company> bought IBM.
– good for proper names and times
– hard to parse internal structure
Fall back technique if parsing with the markup fails
– slows parsing
– accuracy vs. speed
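For illustration, filtering lexical lookup with partial POS markup might look like the sketch below (the lexicon and tag map are invented); note the fall back to the unfiltered entries when the tag would eliminate every analysis:

    LEXICON = {"duck": [{"PRED": "duck", "CAT": "noun"},
                        {"PRED": "duck", "CAT": "verb"}]}

    TAG_TO_CAT = {"NN": "noun", "VB": "verb"}     # partial tag map; untagged words are left alone

    def lookup(token):
        """Prune lexical entries with shallow POS markup like 'duck/VB'."""
        word, _, tag = token.partition("/")
        entries = LEXICON.get(word, [])
        if tag in TAG_TO_CAT:
            filtered = [e for e in entries if e["CAT"] == TAG_TO_CAT[tag]]
            if filtered:                          # fall back to all entries if the tag was wrong
                return filtered
        return entries

    print(lookup("duck/VB"))   # only the verb reading survives
    print(lookup("duck"))      # untagged: both readings kept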
Example shallow markup: Named entities
Allow tokenizer to accept marked up input:
parse {<person>Mr. Thejskt Thejs</person> arrived.}
tokenized string:
Mr. Thejskt Thejs TB +NEperson
Mr(TB). TB Thejskt TB Thejs
TB arrived TB . TB
Add lexical entries and rules for NE tags
Resulting C-structure
Resulting F-structure
Results for shallow markup
(paired columns give Full/All values)

               % Full   Optimal     Best    Time %
               parses   sol'ns      F-sc
  Unmarked       76     482/1753    82/79   65/100
  Named ent      78     263/1477    86/84   60/91
  POS tag        62     248/1916    76/72   40/48
  Lab brk        65     158/774     85/79   19/31

(Kaplan and King 2003)
Choosing the most probable parse
Applications may want one input
– or at least just a handful
Use stochastic methods to choose
– efficient (XLE English grammar: 5% of parse time)
Need training data
– partially labelled data ok
[NP-SBJ They] see [NP-OBJ the girl with the telescope]
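The parse selection step can be pictured as a simple log-linear ranking over parse properties; the properties and weights below are purely illustrative, not the actual XLE disambiguation model:

    # Toy log-linear ranking: each parse is described by counts of properties
    # (rule and feature occurrences), and each property has a learned weight.
    WEIGHTS = {"PP-attach-to-NP": 0.3, "PP-attach-to-VP": -0.1, "OBJ-with-PP": 0.5}

    def score(parse_properties):
        return sum(WEIGHTS.get(p, 0.0) * count for p, count in parse_properties.items())

    def most_probable(parses):
        """Pick the single highest-scoring analysis to hand to an application."""
        return max(parses, key=lambda p: score(p["properties"]))

    parses = [
        {"reading": "saw [the girl with the telescope]",
         "properties": {"PP-attach-to-NP": 1, "OBJ-with-PP": 1}},
        {"reading": "saw [the girl] [with the telescope]",
         "properties": {"PP-attach-to-VP": 1}},
    ]
    print(most_probable(parses)["reading"])   # matches the labelled training example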
Run-time performance
Many deep grammars are slow
Techniques depend on the system
– LFG: exploit the context-free backbone
ambiguity packing techniques
Speed vs. accuracy trade off
– remove/disprefer peripheral rules
– remove fall backs for shallow markup
Development expense
Grammar porting
Starter grammar
Induced grammar bootstrapping
How cheap are shallow grammars?
– training data can be expensive to produce
Grammar porting
Use an existing grammar as the base for a
new language
Languages must be typologically similar
– Japanese-Korean
– Balkan
Lexical porting via bi-lingual dictionaries
Main work in testing and evaluation
Starter grammar
Provide basic rules and templates
– including for robustness techniques
Grammar writer:
– chooses among them
– refines them
Grammar Matrix for HPSG
Grammar induction
Induce a core grammar from a treebank
– compile rule generalizations
– threshold rare rules
– hand augment with features and fallback
techniques
Requires
– induction program
– existing resources (treebank)
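A toy induction step along these lines (using NLTK trees as the stand-in treebank): count the productions, threshold the rare ones, and keep the rest as the core rule set.

    from collections import Counter
    import nltk

    # Treebank trees (inlined here; in practice read from an existing treebank).
    trees = [nltk.Tree.fromstring(s) for s in [
        "(S (NP (DT the) (NN dog)) (VP (VBZ appears)))",
        "(S (NP (NNP Mary)) (VP (VBD saw) (NP (NNP John))))",
    ]]

    # Compile rule generalizations: count the phrase-structure productions.
    counts = Counter(prod for t in trees for prod in t.productions())

    # Threshold rare rules; with real data a higher threshold prunes noisy rules.
    THRESHOLD = 1
    core_rules = [rule for rule, n in counts.items() if n >= THRESHOLD]
    for rule in core_rules:
        print(rule)          # e.g. S -> NP VP, NP -> DT NN, ...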
Conclusions
Grammar engineering makes deep grammars
feasible
– robustness techniques
– integration of shallow methods
Many current applications can use shallow
grammars
Fast, accurate, broad-coverage deep
grammars enable new applications