1.4.3: Zipf's Law


Ch-1: Introduction (1.3 & 1.4 & 1.5)
Prepared by Qaiser Abbas (07-0906)
1.3: Ambiguity of Language
• An NLP system must determine something of the structure of text, e.g., "who did what to whom?"
• Conventional parsing systems answer this question only syntactically, which is limiting. For example, "Our company is training workers" has the three different parses shown in (1.11):
[Parse trees for (1.11): (a) "is training" forms a verb group; in the other two parses "is" is the main verb: (b) "training" is an adjectival participle modifying "workers"; (c) "training" is a present participle followed by a noun, i.e., the gerund phrase "training workers".]
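A toy grammar can reproduce this kind of ambiguity. The sketch below is my own construction (not the book's grammar), using NLTK's chart parser; it yields three parses of the sentence, corresponding roughly to the verb-group, adjectival, and gerund readings:

```python
import nltk

# A toy grammar (illustrative, not from the book) that makes
# "our company is training workers" three-ways ambiguous.
grammar = nltk.CFG.fromstring("""
S   -> NP VP
NP  -> Det N | N | Adj N | N N
VP  -> Aux V NP | Aux NP
Det -> 'our'
N   -> 'company' | 'workers' | 'training'
Adj -> 'training'
Aux -> 'is'
V   -> 'training'
""")

parser = nltk.ChartParser(grammar)
for tree in parser.parse("our company is training workers".split()):
    tree.pretty_print()   # one tree per reading: verb group, adjective, gerund
```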
• The last two parses, (b) and (c), are anomalous. The point is that as sentences get longer and grammars get more comprehensive, such ambiguities lead to a terrible multiplication of parses. Martin (1987) reports 455 parses for the following sentence (1.12):
"List the sales of the products produced in 1973 with the products produced in 1972."
• A practical NLP system must be good at making disambiguation decisions about word sense, word category, syntactic structure, and semantic scope.
• The goal is to maximize coverage while minimizing ambiguity, but increasing coverage also increases the number of undesired parses, and vice versa.
• AI approaches to parsing and disambiguation have shown that hand-coded syntactic constraints and preference rules are time-consuming to build, do not scale up well, and are brittle and easily broken (Lakoff 1987).
• An example of a selectional restriction: the verb "swallow" requires an animate being as its subject and a physical object as its object. Such restrictions disallow common and simple extensions of the usage of "swallow", as in (1.13):
a. I swallowed (believed) his story, hook, line, and sinker.
b. The supernova swallowed (engulfed) the planet.
• Statistical NLP approaches solve these problems automatically by learning lexical and structural preferences from corpora, exploiting the information that lies in the relationships between words.
• Statistical models are robust, generalize well, and behave gracefully in the presence of errors and new data. Moreover, the parameters of statistical NLP models can often be estimated automatically from corpora.
• Automatic learning reduces human effort and raises interesting scientific issues.
1.4: Dirty Hands
1.4.1: Lexical Resources
• Work with machine-readable text, dictionaries, and thesauri, and the tools for processing them.
• Brown Corpus (compiled 1960s-70s): the most widely known corpus; a million words of American English; available for a fee; includes press reportage, fiction, scientific text, legal text, and many other genres.
• The Lancaster-Oslo-Bergen (LOB) corpus is a British English replication of the Brown Corpus.
• Susanne Corpus: 130,000 words; freely available; a subset of the Brown Corpus; contains information on the syntactic structure of sentences.
• Penn Treebank: text from the Wall Street Journal; widely used; not free.
• Canadian Hansards: proceedings of the Canadian parliament; a bilingual corpus; not freely available. Such parallel text corpora are important for statistical machine translation and other cross-lingual NLP work.
• WordNet electronic dictionary: hierarchical; includes synsets (sets of words with identical meaning) and meronym (part-whole) relations between words; free and downloadable from the internet.
• Further details in Ch-4.
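As a concrete starting point, the NLTK toolkit distributes a version of the Brown Corpus. A minimal sketch, assuming NLTK is installed:

```python
import nltk

nltk.download("brown")          # one-time fetch of NLTK's Brown Corpus data
from nltk.corpus import brown

print(brown.categories()[:5])   # genres, e.g. editorial, fiction, ...
print(brown.words()[:10])       # the first few word tokens
print(len(brown.words()))       # on the order of a million tokens
```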
1.4.2: Word Counts
• Question: how many words are there in the text? The corpus of Mark Twain's Tom Sawyer contains 71,370 word tokens and 8,018 word types (different words). This is a very small corpus: less than half a megabyte of online text.
• The ratio of tokens to types, 71370/8018 ≈ 8.9, is the average frequency with which each type is used: words in the corpus occur "on average" about 9 times each.
• Question: what are the most common words in the text? Table 1.1 lists them. They are mostly function words, e.g., determiners, prepositions, and complementizers. Some common words occur over 700 times and individually account for over 1% of the words in the text (e.g., "the": 3332×100/71370 ≈ 4.67%; a word occurring 772 times accounts for 772×100/71370 ≈ 1.08%).
• The frequency of "Tom" reflects the material from which the corpus was constructed.
• Table 1.2 shows how many word types occur with a certain frequency. The vast majority of word types occur extremely infrequently: over 90% occur 10 times or less (91+82+131+…+3993 = 7277 out of 8018 word types), and almost half (49.8%, 3993 out of 8018) occur only once in the text. Such words are known as hapax legomena ("read only once").
• Rare words make up a considerable proportion of the text: 12% of the text consists of words that occur 3 times or less (3993+2584+1992 = 8569 tokens out of 71370).
• By contrast, a sample of newswire of the same size contains about 11,000 word types.
• At the other extreme, the most common 100 words account for over half (50.9%) of the word tokens in the text.
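A minimal sketch of computing these counts with Python's standard library; the file path and the crude tokenizer are my own assumptions, so exact numbers will differ from Tables 1.1 and 1.2:

```python
from collections import Counter
import re

# Hypothetical path to a plain-text copy of the novel.
text = open("tom_sawyer.txt", encoding="utf-8").read()

# Crude tokenization: lowercase alphabetic strings (incl. apostrophes).
tokens = re.findall(r"[a-z']+", text.lower())
counts = Counter(tokens)

n_tokens = len(tokens)   # word tokens (71,370 in the book's figures)
n_types = len(counts)    # word types  (8,018 in the book's figures)
hapax = sum(1 for f in counts.values() if f == 1)

print("tokens:", n_tokens)
print("types:", n_types)
print("tokens/types:", round(n_tokens / n_types, 1))   # avg frequency per type
print("hapax legomena:", hapax, "(%.1f%% of types)" % (100 * hapax / n_types))
```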
1.4.3: Zipf's Law
• The Principle of Least Effort: people will act so as to minimize their probable average rate of work. Zipf uncovered this principle through certain empirical laws.
• Count how often each word type occurs in a large corpus, and then list the words in order of their frequency. We can then explore the relationship between the frequency of a word f and its position in the list, known as its rank r. The law states f ∝ 1/r (1.14), or in other words f · r = k, where k is a constant.
• This equation says, e.g., that the 50th most common word should occur with three times the frequency of the 150th most common word: since f · r = k, f(50) · 50 = f(150) · 150, so f(50) = 3 · f(150).
• This concept was first introduced by Estoup (1916), but was widely publicized by Zipf.
• Zipf's law holds approximately for Table 1.3, except for the three highest-frequency words, and the product f · r bulges (forms a curve) for words of rank around 100.
• Still, the curve gives information about the frequency distribution in human languages: there are a few very common words, a middling number of medium-frequency words, and many low-frequency words.
• The validity of Zipf's law, and possible derivations of it, were studied by Mandelbrot (1954), who found that Zipf's law sometimes shows a close match with large corpora and captures the general shape of the curve, but is poor at reflecting the details.
• Figure 1.1 is a rank-frequency plot (on logarithmic scales). Zipf's law predicts that this graph should be a straight line with slope -1, but Mandelbrot showed that it is a bad fit, especially at low ranks, where the slope -1 line is too low, and at high ranks (greater than 10,000), where the line is too high.
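A quick empirical check is easy to script: rank the words of a corpus by frequency and see whether f · r stays roughly constant. A minimal sketch (hypothetical file path, crude tokenizer):

```python
from collections import Counter
import re

text = open("corpus.txt", encoding="utf-8").read()  # hypothetical path
tokens = re.findall(r"[a-z']+", text.lower())
ranked = Counter(tokens).most_common()              # rank 1 = most frequent

# Zipf's law predicts f * r is roughly constant across ranks.
for r in (1, 10, 50, 100, 150, 1000):
    if r <= len(ranked):
        word, f = ranked[r - 1]
        print(f"rank {r:>5}  {word:<12} f={f:>6}  f*r={f * r}")
```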
• Mandelbrot derived the following relationship to achieve a closer fit:
f = P(r + ρ)^(-B), or equivalently log f = log P - B log(r + ρ),
where P, B, and ρ (rho) are parameters of a text that collectively measure the richness of the text's use of words.
• The distribution is still hyperbolic, as in the case of Zipf's law, but for large values of r it closely approximates a straight line descending with slope -B, just as Zipf's law descends with slope -1.
• By appropriate settings of the parameters, one can model a curve where the frequency of the most common words is lower.
• The graph in fig 1.2 shows the Mandelbrot formula, which is a better fit than Zipf's law for the given corpus: the slight bulge in the upper left corner and the larger slope model the lowest and highest ranks better than Zipf's law, and the plot ends in a straight line.
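Fitting the Mandelbrot formula is a small exercise in linear regression: for a fixed ρ, log f = log P - B log(r + ρ) is linear in log(r + ρ), so one can grid-search over ρ and solve for P and B by least squares. A sketch under those assumptions (hypothetical corpus path; the grid bounds are arbitrary):

```python
import re
from collections import Counter

import numpy as np

text = open("corpus.txt", encoding="utf-8").read()   # hypothetical path
freqs = sorted(Counter(re.findall(r"[a-z']+", text.lower())).values(),
               reverse=True)
r = np.arange(1, len(freqs) + 1, dtype=float)
log_f = np.log(np.asarray(freqs, dtype=float))

best = None
for rho in np.arange(0.0, 50.0, 0.5):               # grid search over rho
    x = np.log(r + rho)
    coef = np.polyfit(x, log_f, 1)                  # [slope, intercept]
    sse = float(np.sum((log_f - np.polyval(coef, x)) ** 2))
    if best is None or sse < best[0]:
        best = (sse, rho, -coef[0], float(np.exp(coef[1])))

sse, rho, B, P = best
print(f"rho={rho:.1f}  B={B:.2f}  P={P:.1f}  (sse={sse:.1f})")
```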
Other Laws
• Zipf proposed a number of other empirical laws relating to language. Two of them that are important concerns for SNLP are as follows:
• "The number of meanings of a word is correlated with its frequency": m ∝ √f, where m is the number of meanings and f the frequency. Since f ∝ 1/r, this is equivalent to m ∝ 1/√r.
• Zipf gave empirical support for this in his study: words of frequency rank about 10,000 averaged about 2.1 meanings, words of rank about 5,000 averaged about 3 meanings, and words of rank about 2,000 averaged about 4.6 meanings.
• One can also measure the number of lines or pages between each occurrence of a word in a text, and then calculate the frequency F of different interval sizes I. For words of frequency at most 24 in a 260,000-word corpus, Zipf found F ∝ I^(-p), where p varied between 1 and 1.3 in Zipf's studies. In short, most of the time content words occur near another occurrence of the same word. (Details in ch-7 and 15.3.)
• Another of Zipf's laws expresses an inverse relationship between the frequency of words and their length.
The significance of power laws (read yourself)
1.4.4: Collocations
• Collocations include compound words (disk drive), phrasal verbs (make up), and other stock phrases (bacon and eggs).
• Collocations often have a specialized meaning or are idiomatic (part of the natural style of speech and writing), but they need not be, e.g., "international best practice".
• The frequent use of a fixed expression makes it a candidate collocation. Collocations are important in many areas of SNLP, e.g., machine translation (ch-13) and information retrieval (ch-15).
• Lexicographers are also interested in collocations, so as to put them in dictionaries, because they are frequent ways of using words and multiword units with an independent existence.
• The practice of studying collocations de-emphasizes the Chomskyan focus on the creativity of language use, and fits the Hallidayan idea that language is inseparable from its pragmatic and social context (words take on special meanings with respect to their use).
• Collocations may be several words long or discontinuous (make [something] up).
• Common bigram collocations from the New York Times are given in Table 1.4. Problem: these raw counts are not normalized for the frequency of the individual words; "of the" and "in the" top the list, which only shows that a determiner commonly follows a preposition, and these are not collocations. Solution: take the frequency of each individual word into account.
• Another approach first gathers candidate collocations and then removes those whose part-of-speech or syntactic patterns are rarely associated with collocations. The two most frequent patterns are adjective-noun and noun-noun, as shown in Table 1.5.
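A minimal sketch of both steps: count raw bigrams, then apply a crude filter. A small stopword list stands in for the part-of-speech patterns of Table 1.5 (a real filter would POS-tag and keep adjective-noun and noun-noun pairs); the corpus path and the stopword list are illustrative assumptions:

```python
from collections import Counter
import re

text = open("corpus.txt", encoding="utf-8").read()   # hypothetical path
tokens = re.findall(r"[a-z']+", text.lower())

# Raw bigram counts: dominated by "of the", "in the", etc.
bigrams = Counter(zip(tokens, tokens[1:]))
print("raw:", bigrams.most_common(5))

# Crude filter: drop pairs containing frequent function words.
stop = {"the", "of", "in", "and", "a", "to", "is", "that", "for", "on", "with"}
filtered = [(pair, n) for pair, n in bigrams.most_common()
            if pair[0] not in stop and pair[1] not in stop]
print("filtered:", filtered[:5])
```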
1.4.5: Concordances
• A Key Word In Context (KWIC) concordancing program produces displays of data as in fig 1.3.
• Five of the uses of "showed off" appear in double quotes, either because it was a neologism (a new word) or slang at the time. All of these uses are intransitive (they have a subject but no object), although some take prepositional phrase modifiers, e.g., with "in" and "with".
• Lines (6, 8, 12, 15) use the verb transitively (an object is required).
• Line (16) uses the verb ditransitively (with direct and indirect objects).
• In (13, 15) the object is an NP and a that-clause; (7) has a non-finite and (10) a finite question-form complement clause.
• Lines (9, 14) have an NP object followed by a PP, but are quite idiomatic. In both cases the object noun is modified to make a more complex NP. We could systematize the pattern as in fig 1.4.
• Collecting information about the patterns of occurrence of verbs like this is useful for building dictionaries for foreign-language learners and for guiding statistical parsers.
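A KWIC display is simple to produce with regular expressions. A minimal sketch (the function name and sample text are my own):

```python
import re

def kwic(text, keyword, width=30):
    """Print a simple Key Word In Context display for `keyword`."""
    flat = text.replace("\n", " ")
    for m in re.finditer(r"\b%s\b" % re.escape(keyword), flat, re.IGNORECASE):
        left = flat[max(0, m.start() - width):m.start()]
        right = flat[m.end():m.end() + width]
        # Right-align the left context so the keywords line up in a column.
        print("%*s[%s]%s" % (width, left, m.group(0), right))

sample = ("He showed off his new bike. She was not impressed, "
          "but he showed off again in front of the class.")
kwic(sample, "showed off")
```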
1.5 Further Readings
• See the references in the textbook.

Questions, discussion, and comments are welcome.