
WORDS
The Building Blocks of Language
Sudeshna Sarkar
Professor
Computer Science & Engineering Department
Indian Institute of Technology Kharagpur
Language can be divided into pieces of varying sizes, ranging from morphemes to paragraphs.
Words are the most fundamental level for NLP.
Tokens, Types and Texts
The process of segmenting a string of characters into words is known as tokenization.
A word token is an individual occurrence of a word in a concrete context.
A word type is what we’re talking about when we say that the three occurrences of “the” in a sentence are “the same word.”
What’s a word?
I have a can opener; but I can’t open these cans.
How many words?
Word form
inflected form as it appears in the text
can and cans ... different word forms
Lemma
a set of lexical forms having the same stem, same POS and same meaning
can and cans … same lemma
Word token:
an occurrence of a word
I have a can opener; but I can’t open these cans. → 11 word tokens (not counting punctuation)
Word type:
a distinct word, counted once however many times it occurs
I have a can opener; but I can’t open these cans. → 10 word types (not counting punctuation)
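A minimal sketch of these counts in Python, assuming a deliberately simple regex tokenizer that keeps can’t as one token and drops punctuation (a real tokenizer makes subtler choices):

```python
import re

sentence = "I have a can opener; but I can't open these cans."

# Very simple tokenizer: runs of letters, optionally with an internal apostrophe.
tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", sentence)
types = set(tokens)  # each distinct word form counted once ("I" occurs twice)

print(len(tokens))               # 11 word tokens
print(len(types))                # 10 word types
print(len(tokens) / len(types))  # tokens/type ratio, as used on the next slide
```

The same tokens/type ratio, computed over whole books, gives the 8.9 and 30.4 figures on the next slide.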
Another example
Mark Twain’s Tom Sawyer
71,370 word tokens
8,018 word types
tokens/type ratio = 8.9 (indication of text complexity)
The complete works of Shakespeare
884,647 word tokens
29,066 word types
tokens/type ratio = 30.4
Some Useful Empirical Observations
A small number of events occur with high frequency
A large number of events occur with low frequency
You can quickly collect statistics on the high frequency
events
You might have to wait an arbitrarily long time to get valid
statistics on low frequency events
Some of the zeroes in the table are real zeros, but others are simply low-frequency events you haven’t seen yet. How do we address this?
Common words in Tom Sawyer
[Table: the most common words in Tom Sawyer and their frequencies]
But words in natural language have an uneven distribution…
Text properties (formalized)
[Table: sample word frequency data]
Frequency of frequencies
Most words are rare:
3,993 (50%) of word types appear only once
these are called hapax legomena (“read only once”)
But common words are very common:
100 words account for 51% of all tokens (of all text)
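These frequency-of-frequency counts are easy to reproduce; a minimal sketch with collections.Counter, assuming tokens is the token list for the whole text (e.g. from the tokenizer sketched earlier):

```python
from collections import Counter

# tokens: list of word tokens for the whole text (assumed available).
freq = Counter(tokens)                 # word type -> frequency
freq_of_freq = Counter(freq.values())  # frequency -> number of word types

hapax = freq_of_freq[1]                # types appearing exactly once
print(f"hapax legomena: {hapax} of {len(freq)} types")

# Coverage of the 100 most frequent words.
top100 = sum(c for _, c in freq.most_common(100))
print(f"top 100 words cover {top100 / sum(freq.values()):.0%} of all tokens")
```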
Zipf’s Law
1. Count the frequency of each word type in a large corpus.
2. List the word types in decreasing order of frequency.
Let:
f = frequency of a word type
r = its rank in the list
Zipf’s Law says: f ∝ 1/r
In other words, there exists a constant k such that: f × r = k
For example, the 50th most common word should occur with 3 times the frequency of the 150th most common word.
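One way to eyeball the law on real data is to walk down the ranked vocabulary and print f × r, which should hover around a single constant k; a minimal sketch, assuming the freq Counter built in the earlier example:

```python
# Rank word types by decreasing frequency and inspect f * r at a few ranks.
ranked = freq.most_common()  # [(word, f), ...] sorted by decreasing f
for r, (word, f) in enumerate(ranked, start=1):
    if r in (1, 10, 50, 100, 500, 1000):
        print(f"rank {r:>5}  freq {f:>6}  f*r = {f * r}")
```

The 50th/150th example above is just this constancy restated: f(50) × 50 = f(150) × 150 implies f(50) = 3 × f(150).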
Zipf’s Law
If the probability of the word of rank r is p_r and N is the total number of word occurrences, then:
p_r = f / N ≈ A / r, for a corpus-independent constant A ≈ 0.1
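As a small sanity check (assuming the Tom Sawyer token count from earlier, N = 71,370, and A ≈ 0.1), the formula predicts the frequency of, say, the rank-100 word:

```python
# Predicted frequency of the word at rank r: f = p_r * N = (A / r) * N.
N, A, r = 71_370, 0.1, 100
print(A * N / r)  # ≈ 71 occurrences predicted for the rank-100 word
```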
Zipf curve
[Figure: Zipf curve of word frequency versus rank]
Predicting Occurrence Frequencies
By Zipf’s Law, a word appearing n times has rank r_n = AN/n (since f × r = k = AN, so r = AN/f).
If several words occur n times, assume the rank r_n applies to the last of these.
Therefore, r_n words occur n or more times and r_{n+1} words occur n+1 or more times.
So the number of words I_n appearing exactly n times is:
I_n = r_n − r_{n+1} = AN/n − AN/(n+1) = AN / (n(n+1))
The fraction of words with frequency n is:
I_n / D = 1 / (n(n+1))
where D = AN is the total number of word types.
The fraction of words appearing only once is therefore 1/(1 × 2) = 1/2.
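A quick sketch of what this predicts, assuming the Tom Sawyer vocabulary size quoted earlier (D = 8,018 word types):

```python
# Predicted number of word types occurring exactly n times: I_n = D / (n * (n + 1)).
D = 8_018  # word types in Tom Sawyer (from the earlier slide)
for n in (1, 2, 3, 10):
    predicted = D / (n * (n + 1))
    print(f"n = {n:>2}: {predicted:7.0f} types ({1 / (n * (n + 1)):.1%} of vocabulary)")
```

For n = 1 this predicts about 4,009 hapax legomena, reasonably close to the 3,993 actually observed in Tom Sawyer.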
Explanations for Zipf’s Law
Zipf’s explanation was his “principle of least effort”: a balance between the speaker’s desire for a small vocabulary and the hearer’s desire for a large one.
Zipf’s First Law
1. f ∝ 1/r, where:
f = word frequency
r = word-frequency rank
m = number of meanings per word (used in the Second Law below)
2. There exists a k such that f × r = k.
3. Alternatively, log f = log k − log r.
4. Observed for English literature, the Johns Hopkins Autopsy Resource, German, and Chinese.
5. The most famous of Zipf’s laws.
Zipf’s Second Law
1. Meanings: m ∝ √f
2. There exists a k such that k × f = m².
3. Corollary: m ∝ 1/√r (since f ∝ 1/r).
Zipf’s Third Law
1. Frequency ∝ 1/word length
2. There exists a k such that f × word length = k.
3. Zipf stated many other minor laws.
Zipf’s Law Impact on Language Analysis
Good News: stopwords account for a large fraction of the text, so eliminating them greatly reduces the number of tokens to be processed.
Bad News: for most words, gathering sufficient data for meaningful statistical analysis (e.g. correlation analysis for query expansion) is difficult, since they are extremely rare.
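A minimal sketch of the good-news side, using a deliberately tiny, hypothetical stopword list (real systems use curated lists of a few hundred entries) and the tokens list from earlier:

```python
# A deliberately tiny illustrative stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "i", "it", "he", "but"}

content = [t for t in tokens if t.lower() not in STOPWORDS]
removed = 1 - len(content) / len(tokens)
print(f"stopword filtering removed {removed:.0%} of tokens")
```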
Vocabulary Growth
How does the size of the overall vocabulary (number
of unique words) grow with the size of the corpus?
This determines how the size of the inverted index will
scale with the size of the corpus.
The vocabulary is not really upper-bounded, due to proper names, typos, etc.
Heaps’ Law
If V is the size of the vocabulary and n is the length of the corpus in words:
V = K · n^β, with constants K and 0 < β < 1
Typical constants:
K ≈ 10–100
β ≈ 0.4–0.6 (approximately square-root growth)
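A small sketch of what these constants imply, using illustrative mid-range values (K = 50 and β = 0.5 are assumptions, not fitted to any corpus):

```python
# Heaps' Law: predicted vocabulary size V = K * n**beta.
K, beta = 50, 0.5  # illustrative values within the typical ranges above
for n in (10_000, 100_000, 1_000_000):
    print(f"n = {n:>9,}: predicted vocabulary ~ {K * n**beta:,.0f} types")
```

With β ≈ 0.5, quadrupling the corpus only doubles the vocabulary, which is what keeps inverted-index growth manageable.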
Heaps’ Law Data
[Figure: Heaps’ Law data, vocabulary size versus corpus size]
Word counts are interesting...
As an indication of a text’s style
As an indication of a text’s author
But, because most words appear very infrequently,
it is hard to predict much about the behavior of words (if they do
not occur often in a corpus)
--> Zipf’s Law
Zipf’s Law on Tom Sawyer
f × r = k holds fairly well, with k ≈ 8000–9000
except for:
the 3 most frequent words
words of frequency ≈ 100
Plot of Zipf’s Law
On chapters 1–3 of Tom Sawyer (numbers differ from pp. 25–26)
f × r = k
[Figure: “Zipf” plot of Freq (0–350) versus Rank (0–2000)]
Plot of Zipf’s Law (cont.)
On chapters 1–3 of Tom Sawyer
f × r = k ==> log(f × r) = log(k) ==> log(f) + log(r) = log(k)
[Figure: “Zipf’s Law” plot of log(freq) (0–6) versus log(rank) (0–8)]
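Both plots can be reproduced in a few lines; a sketch with matplotlib, assuming the freq Counter built earlier:

```python
import math
import matplotlib.pyplot as plt

counts = sorted(freq.values(), reverse=True)  # frequencies ordered by rank
ranks = range(1, len(counts) + 1)

# Linear axes (first plot) and log-log axes (second plot).
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.plot(ranks, counts)
ax1.set(xlabel="Rank", ylabel="Freq", title="Zipf")
ax2.plot([math.log(r) for r in ranks], [math.log(f) for f in counts])
ax2.set(xlabel="log(rank)", ylabel="log(freq)", title="Zipf's Law")
plt.show()
```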
Zipf’s Law, so what?
There are:
A few very common words
A medium number of medium frequency words
A large number of infrequent words
Principle of Least Effort: a tradeoff between the speaker’s and the hearer’s effort
The speaker communicates with a small vocabulary of common words (less effort)
The hearer disambiguates messages through a large vocabulary of rare words (less effort)
Significance of Zipf’s Law for us:
For most words, our data about their use will be very sparse
Only for a few words will we have a lot of examples