PPT - Research Group on the Foundations of Artificial Intelligence

Statistical Natural Language Processing
Advanced AI - Part II
Luc De Raedt
University of Freiburg
WS 2005/2006
Many slides taken from Helmut Schmid
Topic
• Statistical Natural Language Processing applies Machine Learning / Statistics to Natural Language Processing
  • Learning: the ability to improve one’s behaviour at a specific task over time; it involves the analysis of data (statistics)
Following parts of the book
• Foundations of Statistical Natural Language Processing (Manning and Schuetze), MIT Press, 1999.
Rationalism versus Empiricism
• Rationalist
  • Noam Chomsky: innate language structures
  • AI: hand coding NLP
  • Dominant view 1960-1985
  • Cf. e.g. Steven Pinker’s The Language Instinct (popular science book)
• Empiricist
  • Ability to learn is innate
  • AI: language is learned from corpora
  • Dominant 1920-1960 and becoming increasingly important
Rationalism versus Empiricism

• Noam Chomsky:
  • But it must be recognized that the notion of “probability of a sentence” is an entirely useless one, under any known interpretation of this term.
• Fred Jelinek (IBM 1988):
  • Every time a linguist leaves the room the recognition rate goes up.
  • (Alternative: Every time I fire a linguist the recognizer improves.)
This course
• Empiricist approach
  • Focus will be on probabilistic models for learning of natural language
• No time to treat natural language in depth!
  • (though this would be quite useful and interesting)
  • Deserves a full course by itself
  • Covered in more depth in Logic, Language and Learning (SS 05, prob. SS 06)
Ambiguity
NLP and Statistics
Statistical Disambiguation
• Define a probability model for the data
• Compute the probability of each alternative
• Choose the most likely alternative
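The three steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the ambiguous word "bank", its two senses, and all counts are invented toy values, and `sense_probabilities` and `disambiguate` are hypothetical helper names. The probability model is a plain relative-frequency estimate of P(sense | context word).

```python
from collections import Counter

# Hypothetical toy data (invented counts): how often each sense of the
# ambiguous word "bank" was observed next to a given context word.
SENSE_COUNTS = {
    "finance": Counter({"money": 30, "water": 2}),
    "river": Counter({"money": 1, "water": 12}),
}


def sense_probabilities(context_word):
    """Steps 1 and 2: relative-frequency model P(sense | context_word)."""
    counts = {sense: c[context_word] for sense, c in SENSE_COUNTS.items()}
    total = sum(counts.values())
    if total == 0:
        raise ValueError(f"context word {context_word!r} was never observed")
    return {sense: n / total for sense, n in counts.items()}


def disambiguate(context_word):
    """Step 3: choose the most likely alternative."""
    probs = sense_probabilities(context_word)
    return max(probs, key=probs.get)


for word in ("money", "water"):
    print(word, "->", disambiguate(word), sense_probabilities(word))
```

A realistic model would condition on more context and smooth the estimates; the argmax in the last step stays the same.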
NLP and Statistics
• Statistical methods deal with uncertainty. They predict the future behaviour of a system based on the behaviour observed in the past.
→ Statistical methods require training data.
• The data in Statistical NLP are the corpora.
Corpora
• Corpus: text collection for linguistic purposes
• Tokens: How many words are contained in Tom Sawyer?
  → 71,370
• Types: How many different words are contained in Tom Sawyer?
  → 8,018
• Hapax legomena: words appearing only once
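The three notions (tokens, types, hapax legomena) can be computed with a short Python sketch. The tokenizer is an assumption: it lowercases the text and treats runs of letters and apostrophes as words, so it would not necessarily reproduce the exact Tom Sawyer figures quoted on the slide. `corpus_statistics` is a hypothetical helper name.

```python
import re
from collections import Counter


def corpus_statistics(text):
    """Return (number of tokens, number of types, sorted hapax legomena)."""
    # Assumption: a token is a lowercase run of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    hapaxes = sorted(word for word, n in counts.items() if n == 1)
    return len(tokens), len(counts), hapaxes


sample = "the cat sat on the mat and the dog sat too"
n_tokens, n_types, hapaxes = corpus_statistics(sample)
print(n_tokens, n_types, hapaxes)
```

On the sample sentence there are 11 tokens but only 8 types, because "the" and "sat" are repeated; the six remaining words are hapax legomena.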
Word Counts
word   freq     word   freq
the    3332     in     906
and    2972     that   877
a      1775     he     877
to     1725     I      783
of     1440     his    772
was    1161     you    686
it     1027     Tom    679
→ The most frequent words are function words
Word Counts
How many words appear f times?
f        n_f
1        3993
2        1292
3        664
4        410
5        243
6        199
7        172
8        131
9        82
10       91
11-50    540
51-100   99
> 100    102
• About half of the words occur just once
• About half of the text consists of the 100 most common words
….
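A small Python sketch of how such a frequency-of-frequencies table can be built, and a check of the "about half" claim against the numbers on the slide. The function names are hypothetical; `SLIDE_NF` simply restates the slide's table.

```python
from collections import Counter

# Frequency-of-frequencies table as given on the slide for Tom Sawyer:
# (label for f, n_f = number of word types that occur f times).
SLIDE_NF = [
    ("1", 3993), ("2", 1292), ("3", 664), ("4", 410), ("5", 243),
    ("6", 199), ("7", 172), ("8", 131), ("9", 82), ("10", 91),
    ("11-50", 540), ("51-100", 99), ("> 100", 102),
]


def frequency_of_frequencies(tokens):
    """Map each frequency f to n_f, the number of types occurring f times."""
    word_counts = Counter(tokens)
    return Counter(word_counts.values())


def top_k_coverage(tokens, k):
    """Fraction of all tokens accounted for by the k most common types."""
    if not tokens:
        return 0.0
    word_counts = Counter(tokens)
    covered = sum(n for _, n in word_counts.most_common(k))
    return covered / len(tokens)


total_types = sum(n for _, n in SLIDE_NF)
hapax_share = SLIDE_NF[0][1] / total_types
print(total_types, round(hapax_share, 3))  # prints: 8018 0.498
```

The slide's n_f values add up to the 8,018 types reported for Tom Sawyer, and 3993 of them (just under half) occur exactly once.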
Word Counts (Brown corpus)
Zipf's Law
word      f      r     f*r       word         f    r      f*r
the       3332   1     3332      turned       51   200    10200
and       2972   2     5944      you'll       30   300    9000
a         1775   3     5235      name         21   400    8400
he        877    10    8770      comes        16   500    8000
but       410    20    8400      group        13   600    7800
be        294    30    8820      lead         11   700    7700
there     222    40    8880      friends      10   800    8000
one       172    50    8600      begin        9    900    8100
about     158    60    9480      family       8    1000   8000
more      138    70    9660      brushed      4    2000   8000
never     124    80    9920      sins         2    3000   6000
Oh        116    90    10440     Could        2    4000   8000
two       104    100   10400     Applausive   1    8000   8000
Zipf's Law: f ~ 1/r (f*r = const)
Minimize effort
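The relation f*r = const can be checked on any token list with a sketch like the following. `zipf_table` is a hypothetical helper; breaking frequency ties alphabetically is an arbitrary choice made only so that the ranking is deterministic. The `SLIDE_ROWS` values are copied from the table on the slide.

```python
from collections import Counter


def zipf_table(tokens, ranks):
    """Return (rank, word, frequency, frequency * rank) for selected ranks."""
    # Sort by descending frequency; ties are broken alphabetically
    # (an assumption of this sketch, not part of Zipf's law).
    ranked = sorted(Counter(tokens).items(), key=lambda kv: (-kv[1], kv[0]))
    rows = []
    for r in ranks:
        if 1 <= r <= len(ranked):
            word, f = ranked[r - 1]
            rows.append((r, word, f, f * r))
    return rows


# A few (word, f, r) rows taken from the slide.
SLIDE_ROWS = [("the", 3332, 1), ("he", 877, 10), ("turned", 51, 200), ("family", 8, 1000)]
for word, f, r in SLIDE_ROWS:
    print(f"{word:8s} f={f:5d} r={r:5d} f*r={f * r}")
```

Across three orders of magnitude in rank the product stays within roughly a factor of three, which is the sense in which f*r is "constant".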
Some probabilistic models
• N-grams
  • Predicting the next word
    • Artificial intelligence and machine ….
    • Statistical natural language ….
• Probabilistic
  • Regular (Markov Models)
  • Hidden Markov Models
  • Conditional Random Fields
  • Context-free grammars
  • (Stochastic) Definite Clause Grammars
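To make the n-gram idea of predicting the next word concrete, here is a minimal bigram sketch. It assumes whitespace tokenization and unsmoothed maximum-likelihood estimates; the training sentences are invented toy data and all function names are hypothetical.

```python
from collections import Counter, defaultdict


def train_bigrams(sentences):
    """Count, for every word, which words follow it in the training data."""
    followers = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            followers[prev][nxt] += 1
    return followers


def next_word_distribution(followers, prev):
    """Maximum-likelihood estimate of P(next | prev); empty if prev is unseen."""
    counts = followers.get(prev, Counter())
    total = sum(counts.values())
    return {w: n / total for w, n in counts.items()} if total else {}


def predict_next(followers, prev):
    """Most probable next word after prev, or None if prev was never seen."""
    dist = next_word_distribution(followers, prev)
    return max(dist, key=dist.get) if dist else None


# Invented toy corpus, for illustration only.
corpus = [
    "statistical natural language processing",
    "natural language processing",
    "natural language learning",
    "machine learning",
]
model = train_bigrams(corpus)
print(predict_next(model, "language"), next_word_distribution(model, "language"))
```

With so little data most histories are unseen, which is why practical n-gram models add smoothing; that step is left out here.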
Illustration
• Wall Street Journal Corpus
  • 3,000,000 words
  • Correct parse tree for sentences known
    • Constructed by hand
  • Can be used to derive stochastic context-free grammars
• SCFGs assign probabilities to parse trees
  • Compute the most probable parse tree
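The following sketch illustrates both ideas on a toy scale: deriving rule probabilities from hand-built parse trees by relative frequency, and scoring a parse tree as the product of its rule probabilities. Everything in it is an assumption for illustration: trees are nested tuples `(label, child, ...)` with words as plain strings, the three-tree "treebank" is invented, and the most probable parse is picked from an explicit candidate list rather than by searching all parses (which a real parser would do, for example with a CKY-style algorithm).

```python
from collections import Counter
from math import prod


def rules(tree):
    """Yield every rule (lhs, rhs) used in a tree given as nested tuples."""
    label, *children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    yield (label, rhs)
    for child in children:
        if not isinstance(child, str):
            yield from rules(child)


def estimate_scfg(treebank):
    """Relative-frequency estimate: P(lhs -> rhs) = count(rule) / count(lhs)."""
    rule_counts = Counter(r for tree in treebank for r in rules(tree))
    lhs_counts = Counter()
    for (lhs, _), n in rule_counts.items():
        lhs_counts[lhs] += n
    return {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}


def tree_probability(grammar, tree):
    """Probability of a parse tree: product of its rule probabilities."""
    return prod(grammar.get(rule, 0.0) for rule in rules(tree))


def most_probable_parse(grammar, candidates):
    """Pick the candidate tree with the highest probability."""
    return max(candidates, key=lambda t: tree_probability(grammar, t))


# Two parses of "she saw stars with telescopes" (invented toy example).
PP = ("PP", ("P", "with"), ("NP", "telescopes"))
VP_ATTACH = ("S", ("NP", "she"), ("VP", ("VP", ("V", "saw"), ("NP", "stars")), PP))
NP_ATTACH = ("S", ("NP", "she"), ("VP", ("V", "saw"), ("NP", ("NP", "stars"), PP)))

grammar = estimate_scfg([VP_ATTACH, VP_ATTACH, NP_ATTACH])
best = most_probable_parse(grammar, [NP_ATTACH, VP_ATTACH])
print("VP attachment" if best == VP_ATTACH else "NP attachment")
```

Because the toy treebank contains the verb-attachment reading twice and the noun-attachment reading once, the estimated grammar prefers verb attachment for this sentence.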
Conclusions
• Overview of some probabilistic and machine learning methods for NLP
• Also very relevant to bioinformatics!
  • Analogy between parsing
    • A sentence
    • A biological string (DNA, protein, mRNA, …)