Transcript Document
Advanced Artificial Intelligence
Part II. Statistical NLP
Introduction and Grammar Models
Wolfram Burgard, Luc De Raedt, Bernhard
Nebel, Kristian Kersting
Some slides taken from Helmut Schmid, Rada Mihalcea, Bonnie Dorr,
Leila Kosseim, Peter Flach and others
Topic
Statistical Natural Language Processing
Applies
• Machine Learning / Statistics to
• Natural Language Processing
Learning: the ability to improve one's behaviour at a specific task over time - involves the analysis of data (statistics)
Following parts of the book
• Statistical NLP (Manning and Schuetze), MIT Press, 1999.
Contents
Motivation
Zipf’s law
Some natural language processing tasks
Non-probabilistic NLP models
• Regular grammars and finite state automata
• Context-Free Grammars
• Definite Clause Grammars
Motivation for statistical NLP
Overview of the rest of this part
Rationalism versus Empiricism
Rationalist
• Noam Chomsky - innate language structures
• AI: hand-coding NLP
• Dominant view 1960-1985
• Cf. e.g. Steven Pinker's The Language Instinct (popular science book)
Empiricist
• Ability to learn is innate
• AI: language is learned from corpora
• Dominant 1920-1960 and becoming increasingly important
Rationalism versus Empiricism
Noam Chomsky:
• But it must be recognized that the notion of "probability of a sentence" is an entirely useless one, under any known interpretation of this term
Fred Jelinek (IBM 1988)
• Every time a linguist leaves the room the recognition rate goes up.
• (Alternative: Every time I fire a linguist the recognizer improves)
This course
Empiricist approach
• Focus will be on probabilistic models for learning of natural language
No time to treat natural language in depth!
• (though this would be quite useful and interesting)
• Deserves a full course by itself
• Covered in more depth in Logic, Language and Learning (SS 05, prob. SS 06)
Ambiguity
NLP and Statistics
Statistical Disambiguation
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative
NLP and Statistics
Statistical Methods deal with uncertainty.
They predict the future behaviour of a system
based on the behaviour observed in the past.
Statistical Methods require training data.
The data in Statistical NLP are the Corpora
Corpora
Corpus: text collection for linguistic purposes
Tokens
How many words are contained in Tom Sawyer?
71,370
Types
How many different words are contained in T.S.?
8,018
Hapax Legomena
words appearing only once
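These counts are easy to reproduce. Below is a minimal sketch (not from the slides; the filename and the tokenization are assumptions) that computes tokens, types and hapax legomena for a plain-text corpus; exact numbers will differ slightly depending on the tokenizer.

import re
from collections import Counter

def corpus_stats(path):
    # read the corpus, lower-case it, and split it into word tokens
    text = open(path, encoding="utf-8").read().lower()
    tokens = re.findall(r"[a-z']+", text)          # crude tokenization
    counts = Counter(tokens)
    hapax = [w for w, c in counts.items() if c == 1]
    return len(tokens), len(counts), len(hapax)

# hypothetical filename; any plain-text copy of Tom Sawyer will do
# tokens, types, hapax = corpus_stats("tom_sawyer.txt")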
Word Counts

word   freq      word   freq
the    3332      in      906
and    2972      that    877
a      1775      he      877
to     1725      I       783
of     1440      his     772
was    1161      you     686
it     1027      Tom     679
The most frequent words are function words
Word Counts
How many words appear f times?

f         nf
1         3993
2         1292
3         664
4         410
5         243
6         199
7         172
8         131
9         82
10        91
11-50     540
51-100    99
> 100     102

About half of the words occur just once
About half of the text consists of the 100 most common words
Word Counts (Brown corpus)
Zipf's Law
Zipf's Law: f ~ 1/r   (f * r = const)

word    f      r     f*r
the     3332   1     3332
and     2972   2     5944
a       1775   3     5235
he      877    10    8770
but     410    20    8400
be      294    30    8820
there   222    40    8880
one     172    50    8600
about   158    60    9480
more    138    70    9660
never   124    80    9920
Oh      116    90    10440
two     104    100   10400

word        f    r     f*r
turned      51   200   10200
you'll      30   300   9000
name        21   400   8400
comes       16   500   8000
group       13   600   7800
lead        11   700   7700
friends     10   800   8000
begin       9    900   8100
family      8    1000  8000
brushed     4    2000  8000
sins        2    3000  6000
Could       2    4000  8000
Applausive  1    8000  8000
Minimize effort (Zipf's explanation: the principle of least effort)
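To see Zipf's law on real data, here is a minimal sketch (an assumption, not part of the slides) that computes f, r and f*r for a few ranks from any plain-text corpus.

import re
from collections import Counter

def zipf_table(path, ranks=(1, 2, 3, 10, 100, 1000)):
    text = open(path, encoding="utf-8").read().lower()
    counts = Counter(re.findall(r"[a-z']+", text))
    by_freq = counts.most_common()                 # words sorted by decreasing f
    for r in ranks:
        if r <= len(by_freq):
            word, f = by_freq[r - 1]               # rank r starts at 1
            print(f"{word:12s} f={f:6d} r={r:5d} f*r={f * r}")

# zipf_table("tom_sawyer.txt")   # hypothetical filename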
Language and sequences
Natural language processing
• Is concerned with the analysis of
sequences of words / sentences
• Construction of language models
Two types of models
• Non-probabilistic
• Probabilistic
Key NLP Problem: Ambiguity
Human Language is highly ambiguous at all levels
• acoustic level
recognize speech vs. wreck a nice beach
• morphological level
saw: to see (past), saw (noun), to saw (present, inf)
• syntactic level
I saw the man on the hill with a telescope
• semantic level
One book has to be read by every student
Language Model
A formal model about language
Two types
• Non-probabilistic
Allows one to compute whether a certain sequence
(sentence or part thereof) is possible
Often grammar based
• Probabilistic
Allows one to compute the probability of a certain
sequence
Often extends grammars with probabilities
Example of a bad language model
A good language model
Non-Probabilistic
• “I swear to tell the truth” is possible
• “I swerve to smell de soup” is impossible
Probabilistic
• P(I swear to tell the truth) ~ .0001
• P(I swerve to smell de soup) ~ 0
Why language models ?
Consider a Shannon Game
• Predicting the next word in the sequence
Statistical natural language ….
The cat is thrown out of the …
The large green …
Sue swallowed the large green …
…
Model at the sentence level
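A minimal sketch of the Shannon game with a bigram model (an illustrative assumption, not the models treated later in the course): predict the next word from the previous one using relative frequencies.

from collections import Counter, defaultdict

def train_bigrams(tokens):
    # count, for every word, which words follow it and how often
    following = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        following[prev][nxt] += 1
    return following

def predict_next(following, prev, k=3):
    # return the k most likely continuations of prev with their probabilities
    total = sum(following[prev].values())
    return [(w, c / total) for w, c in following[prev].most_common(k)]

tokens = "the cat is thrown out of the house the cat sleeps".split()
model = train_bigrams(tokens)
print(predict_next(model, "the"))   # [('cat', 0.66...), ('house', 0.33...)]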
Applications
Spelling correction
Mobile phone texting
Speech recognition
Handwriting recognition
Disabled users
…
Spelling errors
They are leaving in about fifteen minuets to go to her
house.
The study was conducted mainly be John Black.
Hopefully, all with continue smoothly in my absence.
Can they lave him my messages?
I need to notified the bank of….
He is trying to fine out.
Handwriting recognition
Assume a note is given to a bank teller, which
the teller reads as I have a gub. (cf. Woody
Allen)
NLP to the rescue ….
• gub is not a word
• gun, gum, Gus, and gull are words, but gun has a
higher probability in the context of a bank
For Spell Checkers
Collect list of commonly substituted words
• piece/peace, whether/weather, their/there ...
Example:
“On Tuesday, the whether …’’
“On Tuesday, the weather …”
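A minimal sketch of how such a context-based choice could work (the confusion sets and counts below are made up for illustration): for each commonly substituted pair, pick the variant that is more probable given the preceding word.

from collections import Counter

CONFUSION_SETS = [{"whether", "weather"}, {"piece", "peace"}, {"their", "there"}]

def correct(sentence, bigram_counts):
    words = sentence.lower().split()
    out = []
    for prev, word in zip(["<s>"] + words, words):
        for conf in CONFUSION_SETS:
            if word in conf:
                # choose the alternative with the highest bigram count
                word = max(conf, key=lambda w: bigram_counts[(prev, w)])
        out.append(word)
    return " ".join(out)

# toy counts standing in for corpus statistics
counts = Counter({("the", "weather"): 120, ("the", "whether"): 1})
print(correct("On Tuesday, the whether was fine", counts))
# -> "on tuesday, the weather was fine"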
Another dimension in language models
Do we mainly want to infer
(probabilities) of legal sentences /
sequences ?
• So far
Or, do we want to infer properties of
these sentences ?
• E.g., parse tree, part-of-speech-tagging
• Needed for understanding NL
Let’s look at some tasks
Sequence Tagging
Part-of-speech tagging
• He drives with his bike
• N  V      PR   PN  N
(noun, verb, preposition, pronoun, noun)
Text extraction
• The job is that of a programmer
•  X   X  X   X   X  X  JobType
• The seminar is taking place from 15.00 to 16.00
•  X    X      X    X     X    X    Start   X   End
Sequence Tagging
Predicting the secondary structure
of proteins, mRNA, …
• X = A,F,A,R,L,M,M,A,…
• Y = he,he,st,st,st,he,st,he, …
Parsing
Given a sentence, find its parse tree
Important step in understanding NL
Parsing
In bioinformatics, allows one to predict (elements of) structure from sequence
Language models based on Grammars
Grammar Types
• Regular grammars and Finite State Automata
• Context-Free Grammars
• Definite Clause Grammars
A particular type of Unification Based Grammar (Prolog)
Distinguish lexicon from grammar
• Lexicon (dictionary) contains information about
words, e.g.
word - possible tags (and possibly additional information)
flies - V(erb) - N(oun)
• Grammar encodes rules
Grammars and parsing
Syntactic level best understood and formalized
Derivation of grammatical structure: parsing
(more than just recognition)
Result of parsing is usually a parse tree:
showing the constituents of a sentence, e.g. verb
or noun phrases
Syntax usually specified in terms of a grammar
consisting of grammar rules
Regular Grammars and
Finite State Automata
Lexical information - which words are:
• Det(erminer)
• N(oun)
• Vi (intransitive verb) - no argument
• Pn (pronoun)
• Vt (transitive verb) - takes an argument
• Adj (adjective)
Now accept
• The cat slept
• Det N Vi
As regular grammar
• S -> [Det] S1
• S1 -> [N] S2
• S2 -> [Vi]
Lexicon
• The - Det
• Cat - N
• Slept - Vi
• …
% [ ] : terminal
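The regular grammar above can be run directly as a finite state automaton. A minimal sketch (an assumption, using the toy lexicon from the slide):

LEXICON = {"the": "Det", "cat": "N", "slept": "Vi"}

# FSA transitions: state -> {tag: next state}; "accept" is the final state
TRANSITIONS = {"S": {"Det": "S1"}, "S1": {"N": "S2"}, "S2": {"Vi": "accept"}}

def accepts(sentence):
    state = "S"
    for word in sentence.lower().split():
        tag = LEXICON.get(word)                      # look the word up in the lexicon
        state = TRANSITIONS.get(state, {}).get(tag)  # follow the transition, if any
        if state is None:
            return False
    return state == "accept"

print(accepts("The cat slept"))   # True
print(accepts("The cat"))         # False: stops before the accepting state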
Finite State Automaton
Sentences
• John smiles - Pn Vi
• The cat disappeared - Det N Vi
• These new shoes hurt - Det Adj N Vi
• John liked the old cat - PN Vt Det Adj N
Phrase structure
Parse tree for "the dog chased a cat into the garden":
(S (NP (D the) (N dog))
   (VP (V chased)
       (NP (D a) (N cat))
       (PP (P into) (NP (D the) (N garden)))))
Notation
S: sentence
D or Det: Determiner (e.g., articles)
N: noun
V: verb
P: preposition
NP: noun phrase
VP: verb phrase
PP: prepositional phrase
Context Free Grammar
S  -> NP VP
NP -> D N
VP -> V NP
VP -> V NP PP
PP -> P NP
D  -> [the]
D  -> [a]
N  -> [dog]
N  -> [cat]
N  -> [garden]
V  -> [chased]
V  -> [saw]
P  -> [into]
Terminals ~ Lexicon
Phrase structure
Formalism of context-free grammars
• Nonterminal symbols: S, NP, VP, ...
• Terminal symbols: dog, cat, saw, the, ...
Recursion
• „The girl thought the dog chased the cat“
VP -> V, S
N  -> [girl]
V  -> [thought]
Top-down parsing
S -> NP VP
S -> Det N VP
S -> The N VP
S -> The dog VP
S -> The dog V NP
S -> The dog chased NP
S -> The dog chased Det N
S -> The dog chased the N
S -> The dog chased the cat
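The derivation above can be automated with a small recursive-descent recognizer. A minimal sketch (an assumption, not the course's parser), using the CFG from the previous slide:

GRAMMAR = {
    "S":  [["NP", "VP"]],
    "NP": [["D", "N"]],
    "VP": [["V", "NP", "PP"], ["V", "NP"]],
    "PP": [["P", "NP"]],
    "D":  [["the"], ["a"]],
    "N":  [["dog"], ["cat"], ["garden"]],
    "V":  [["chased"], ["saw"]],
    "P":  [["into"]],
}

def derives(symbols, words):
    # True if the symbol sequence can derive exactly the word sequence
    if not symbols:
        return not words
    first, rest = symbols[0], symbols[1:]
    if first not in GRAMMAR:                        # terminal symbol
        return bool(words) and words[0] == first and derives(rest, words[1:])
    return any(derives(rhs + rest, words) for rhs in GRAMMAR[first])

print(derives(["S"], "the dog chased a cat into the garden".split()))  # True
print(derives(["S"], "the dog chased".split()))                        # False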
Context-free grammar
S   --> NP, VP.
NP  --> PN.              % proper noun
NP  --> Art, Adj, N.
NP  --> Art, N.
VP  --> VI.              % intransitive verb
VP  --> VT, NP.          % transitive verb
Art --> [the].
Adj --> [lazy].
Adj --> [rapid].
PN  --> [achilles].
N   --> [turtle].
VI  --> [sleeps].
VT  --> [beats].
Parse tree
Parse tree for "the rapid turtle beats achilles":
(S (NP (Art the) (Adj rapid) (N turtle))
   (VP (Vt beats) (NP (PN achilles))))
Definite Clause Grammars
Non-terminals may have arguments
S             --> NP(N), VP(N).
NP(N)         --> Art(N), N(N).
VP(N)         --> VI(N).
Art(singular) --> [a].
Art(singular) --> [the].
Art(plural)   --> [the].
N(singular)   --> [turtle].
N(plural)     --> [turtles].
VI(singular)  --> [sleeps].
VI(plural)    --> [sleep].
Number Agreement
DCGs
Non-terminals may have arguments
• Variables (start with capital)
E.g. Number, Any
• Constants (start with lower case)
E.g. singular, plural
• Structured terms (start with lower case, and take
arguments themselves)
E.g. vp(V,NP)
Parsing needs to be adapted
• Using unification
Unification in a nutshell
(cf. AI course)
Substitutions
E.g. {Num / singular }
{T / vp(V,NP)}
Applying substitution
• Simultaneously replace variables by
corresponding terms
• S(Num) {Num / singular } = S(singular)
Unification
Take two non-terminals with arguments and
compute (most general) substitution that
makes them identical, e.g.,
• Art(singular) and Art(Num)
Gives { Num / singular }
• Art(singular) and Art(plural)
Fails
• Art(Num1) and Art(Num2)
{Num1 / Num2}
• PN(Num, accusative) and PN(singular, Case)
{Num/singular, Case/accusative}
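A minimal sketch of unification for the flat terms used above (an assumption: terms are written as tuples, and nonterminal names are lower-cased so that, as in Prolog, only capitalized strings are variables):

def is_var(x):
    return isinstance(x, str) and x[0].isupper()

def unify(a, b, subst=None):
    # return a substitution (dict) making a and b identical, or None on failure
    if subst is None:
        subst = {}
    a, b = subst.get(a, a), subst.get(b, b)         # apply bindings found so far
    if a == b:
        return subst
    if is_var(a):
        return {**subst, a: b}
    if is_var(b):
        return {**subst, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            subst = unify(x, y, subst)
            if subst is None:
                return None
        return subst
    return None                                     # e.g. singular vs. plural

print(unify(("art", "singular"), ("art", "Num")))     # {'Num': 'singular'}
print(unify(("art", "singular"), ("art", "plural")))  # None (fails)
print(unify(("pn", "Num", "accusative"), ("pn", "singular", "Case")))
# {'Num': 'singular', 'Case': 'accusative'}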
Parsing with DCGs
Now require successful unification at each
step
S -> NP(N), VP(N)
S -> Art(N), N(N), VP(N) {N/singular}
S -> a N(singular), VP(singular)
S -> a turtle VP(singular)
S -> a turtle sleeps
S -> a turtle sleep (fails)
Case Marking
PN(singular, nominative) --> [he]; [she]
PN(singular, accusative) --> [him]; [her]
PN(plural, nominative)   --> [they]
PN(plural, accusative)   --> [them]
S                --> NP(Number, nominative), VP(Number)
VP(Number)       --> V(Number), NP(Any, accusative)
NP(Number, Case) --> PN(Number, Case)
NP(Number, Any)  --> Det, N(Number)
He sees her. She sees him. They see her.
But not Them see he.
DCGs
Are strictly more expressive than CFGs
Can represent, for instance, the language a^n b^n c^n (which is not context-free):
S(N)    -> A(N), B(N), C(N)
A(0)    -> []
B(0)    -> []
C(0)    -> []
A(s(N)) -> A(N), [A]
B(s(N)) -> B(N), [B]
C(s(N)) -> C(N), [C]
Probabilistic Models
Traditional grammar models are very rigid,
• essentially a yes / no decision
Probabilistic grammars
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative
Illustrate on
• Shannon Game
• Spelling correction
• Parsing
Illustration
Wall Street Journal Corpus
3 000 000 words
Correct parse trees for the sentences are known
• Constructed by hand
• Can be used to derive stochastic context free
grammars
• SCFGs assign probabilities to parse trees
Compute the most probable parse tree
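As a preview, here is a minimal sketch (an assumption, with made-up rule probabilities) of how a stochastic CFG scores a parse tree: the probability of a tree is the product of the probabilities of the rules used in it.

RULE_PROBS = {
    ("S",  ("NP", "VP")):      1.0,
    ("NP", ("D", "N")):        0.7,
    ("NP", ("PN",)):           0.3,
    ("VP", ("V", "NP")):       0.6,
    ("VP", ("V", "NP", "PP")): 0.4,
}

def tree_probability(tree):
    # tree = (label, [children]); a leaf child is just a word (a string)
    label, children = tree
    child_labels = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROBS.get((label, child_labels), 1.0)  # lexical rules set to 1.0 here
    for child in children:
        if not isinstance(child, str):
            p *= tree_probability(child)
    return p

tree = ("S", [("NP", [("D", ["the"]), ("N", ["dog"])]),
              ("VP", [("V", ["chased"]), ("NP", [("D", ["a"]), ("N", ["cat"])])])])
print(tree_probability(tree))   # ~0.294 = 1.0 * 0.7 * 0.6 * 0.7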
Sequences are omnipresent
Therefore the techniques we will see
also apply to
• Bioinformatics
DNA, proteins, mRNA, … can all be
represented as strings
• Robotics
Sequences of actions, states, …
• …
Rest of the Course
Limitations of traditional grammar models motivate probabilistic extensions
• Regular grammars and Finite State Automata
  Markov Models using n-grams
  (Hidden) Markov Models
  Conditional Random Fields
  • As an example of using undirected graphical models
• Probabilistic Context Free Grammars
• Probabilistic Definite Clause Grammars
All use principles of Part I on Graphical Models