Natural Language Processing

Spring 2007
V. “Juggy” Jagannathan
Course Book: Foundations of Statistical Natural Language Processing
By Christopher Manning & Hinrich Schütze
Chapter 1
Introduction
January 8, 2007
Linguistic vs Statistical
Rationale for a statistical approach
• Linguistic approaches that attempt to
parse language based on grammar have
failed
• Edward Sapir’s famous quote: “All
grammars leak”
• Statistical approaches have proven
practical: they ask “What are the
common patterns that occur in language
use?”
Rationalist vs Empiricist
• Sort of the difference between “nature” and
“nurture”
• Rationalist: Human intelligence is largely
innate, and hence a computational system must
be loaded with prior knowledge to be effective
• Empiricist: A lot can be learned by
examining actual use of language, and hence
statistical approaches that learn from a “corpus”
are germane.
• Corpus: a body of text
• Corpora: the plural of corpus, i.e. a collection of such bodies of text
Scientific content: Questions that
linguistics should answer
• What kinds of things do people say?
• What do these things say/ask/request
about the world?
• Traditional linguistic approach
– Competence grammar and grammaticality
determination
– But this is hard… trying to determine whether
sentences are grammatical or not.
– Some examples from page 10 of the course book are on the next slide
Some examples of sentences
Non-categorical phenomena in
language
• Language usage changes with time
• Some words defy categorization into rigid
linguistic boundaries
• Example of “near”, which can be an
adjective, a preposition, or both simultaneously
• Example of change: “kind of” and “sort of”
• Language usage change can be better
tracked using statistical NLP approaches
Language and cognition as
probabilistic phenomena
• One view of the world, the Chomskyan line
of thinking, is that probability and statistics
are inappropriate for determining
“grammaticality” and understanding the
“meaning” of sentences.
• The viewpoint of statistical NLP is that
“grammar” is not necessarily needed to
understand language use and develop practical solutions
Some parses of the sentence: “Our company is
training workers”
The ambiguity of language: why
NLP is hard
• Linguists like to parse sentences to determine
things like: who did what to whom
• Parsing sentences is hard
• There are 455 parses for the sentence:
– “List the sales of the products produced in 1973 with
the products produced in 1972”
• AI approaches to understanding meaning have
failed and have been shown to be brittle and
non-scalable
Dirty Hands
• A variety of corpora are available for statistical
NLP research
• Tom Sawyer example
Common Words in Tom Sawyer
[Table: word counts for the most common words; entries not preserved]
Some statistics from Tom Sawyer
# of word tokens: 71,370
# of word types (unique words): 8,018
Average frequency: 71,370/8,018 = 8.9
Some words are very common!
12 words appear more than 700 times each
The 100 most common words account for 50.9% of the text
49.8% of “word types” appear only once in the corpus!
These are “hapax legomena”, Greek for “read only once”
How can statistics help us understand the meaning of
sentences if half the word types appear only once?
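The statistics on this slide (tokens, types, average frequency, share of hapax legomena) can be computed with a few lines of Python. This is only a minimal sketch: the function name corpus_stats, the crude regex tokenizer, and the toy sentence are assumptions for illustration, so running it on the novel will not necessarily reproduce the exact figures above.

```python
import re
from collections import Counter

def corpus_stats(text):
    """Return token count, type count, average frequency, and hapax share."""
    # Crude tokenizer (an assumption): lowercase runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    n_tokens = len(tokens)   # word tokens: every running word
    n_types = len(counts)    # word types: distinct words
    hapaxes = [w for w, c in counts.items() if c == 1]
    return {
        "tokens": n_tokens,
        "types": n_types,
        "avg_frequency": n_tokens / n_types,
        "hapax_share": len(hapaxes) / n_types,
    }

# Toy input, not a quote from Tom Sawyer.
sample = "the cat sat on the mat and the dog sat on the log"
stats = corpus_stats(sample)
print(stats)
```

To try it on a real corpus, read a plain-text file into a string and pass it to corpus_stats in place of the toy sentence.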
Frequency of frequencies
Total # of word types: 8,018
Zipf’s Law
Empirical evaluation of Zipf’s law for Tom Sawyer
Basic Insight from Power Laws
• What makes frequency-based approaches
to language hard is that almost all words are
rare.
• Zipf’s law is a good way to encapsulate
this insight.
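Zipf’s law says that a word’s frequency is roughly inversely proportional to its rank in the frequency list, so rank × frequency stays roughly constant. A rough way to check this empirically is to rank words by frequency and look at that product. The sketch below assumes a hypothetical helper rank_frequency_table and uses a toy token list built to fit the law exactly; real text only follows it approximately.

```python
from collections import Counter

def rank_frequency_table(tokens, top=10):
    """List (rank, word, frequency, rank * frequency) for the most common words."""
    counts = Counter(tokens)
    rows = []
    for rank, (word, freq) in enumerate(counts.most_common(top), start=1):
        rows.append((rank, word, freq, rank * freq))
    return rows

# Toy token list constructed so that frequency is exactly 12 / rank.
toy_tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3
for row in rank_frequency_table(toy_tokens):
    print(row)
```

On a real corpus the last column drifts rather than staying fixed, and how far it drifts is what an empirical evaluation of the law looks at.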
Collocations
Collocations in New York Times corpus with and without filtering
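A first cut at finding collocations is to count adjacent word pairs (bigrams) and then filter out pairs that are unlikely to be interesting. The slide does not say which filter was used, so the function-word list below is a hypothetical stand-in, and the token list is an invented example rather than New York Times text.

```python
from collections import Counter

# Hypothetical stand-in for the filter: a tiny list of function words.
FUNCTION_WORDS = {"of", "the", "in", "to", "a", "and", "is"}

def bigram_counts(tokens, drop_function_words=False):
    """Count adjacent word pairs, optionally skipping pairs with a function word."""
    pairs = zip(tokens, tokens[1:])
    if drop_function_words:
        pairs = (p for p in pairs if not (set(p) & FUNCTION_WORDS))
    return Counter(pairs)

# Invented example text.
tokens = "the mayor of new york spoke in new york to the new york times".split()
raw = bigram_counts(tokens)
filtered = bigram_counts(tokens, drop_function_words=True)
print(raw.most_common(3))
print(filtered.most_common(3))
```

Without filtering, pairs such as “of new” and “to the” sit alongside “new york”; with the filter only pairs of content words remain.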
Concordances
Key Word In Context (KWIC)
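A Key Word In Context display lists every occurrence of a word with a few words of context on each side, lined up on the keyword. A minimal sketch follows; the function name kwic, the context width, and the sample sentence (invented, not a quote from the novel) are all assumptions.

```python
def kwic(tokens, keyword, width=3):
    """Return one line per occurrence of keyword, with `width` tokens of context."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>25} [{tok}] {right}")
    return lines

# Invented sample sentence.
text = "Tom showed the fence to Ben and Ben showed interest in the fence"
kwic_lines = kwic(text.split(), "showed")
for line in kwic_lines:
    print(line)
```

Reading down the keyword column makes it easy to see which kinds of words tend to precede and follow the keyword.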