LIN 3098 – Corpus Linguistics
Lecture 5
Albert Gatt
In this lecture…
Corpora and the Lexicon
uses of corpora in lexicography
Counting words
lemmatisation and other issues
types versus tokens
word frequency distributions in corpora
Part 1
Corpora and lexicography
Why corpora are useful
Lexicographic work has long relied on
contextual cues to identify meanings.
e.g. Samuel Johnson used examples from
literature to exemplify uses of a word.
Corpora make this procedure much easier
not only to provide examples but:
to actually identify meanings of a word given its
context
definitions of word meanings should therefore be
more precise, if based on large amounts of data
Specific applications
Grammatical alternations of words
E.g. Verb diathesis alternations:
Atkins and Levin (1995) found that verbs such
as quiver and quake have both intransitive and
transitive uses. (see Lecture 1)
E.g. uses of prepositions such as on, with…
Regional variations in word use
relying on corpora which include
gender/region/dialect/date information
Specific applications - II
Identification of occurrences of a specific
homograph, e.g. house (Verb)
examination of the contexts in which it occurs
relies on POS tagging
Keeping track of changes in a language
through a monitor corpus
Identifying how common a word is, through
frequency counts.
many dictionaries include such information now
this shall be our starting point
Part 2
Counting words in corpora: types
versus tokens
Running example
Throughout this lecture, reference is
made to data from a corpus of
Maltese texts:
ca. 51,000 words
all from Maltese-language newspapers
various topics and article types
How to count words: types versus
tokens
token = any occurrence of a word in the corpus
(repeated occurrences of the same word are counted separately)
type = all the individual, different words in
the corpus
(grouping occurrences of a word together as
representatives of a single type)
Example:
I spoke to the chap who spoke to the child
10 tokens
7 types (I, spoke, to, the, chap, who, child)
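The token and type counts above can be reproduced with a few lines of Python; a sketch using naive whitespace tokenisation:

```python
from collections import Counter

# Naive whitespace tokenisation -- a sketch; real corpora need a
# proper tokeniser (punctuation, case, clitics, ...).
sentence = "I spoke to the chap who spoke to the child"
tokens = sentence.split()
types = Counter(tokens)  # maps each type to its frequency

print(len(tokens))  # 10 tokens
print(len(types))   # 7 types
```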
More on types and tokens
The number of tokens in the corpus is
an estimate of overall corpus size
Maltese corpus: 51,000 tokens
The number of types is an estimate of
vocabulary size
gives an idea of the lexical richness of
the corpus
Maltese corpus: 8193 types
Type/token ratio
A (rough!) way of measuring the
amount of variation in the vocabulary
in the corpus.
TTR = (no. of types) / (no. of tokens)
Roughly, can be interpreted as the
“rate at which new types are
introduced, as a function of number
of tokens”
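As a sketch, the ratio can be computed directly from a token list; here using the example sentence from earlier (7 types over 10 tokens):

```python
def type_token_ratio(tokens):
    """Type/token ratio: distinct types divided by total tokens."""
    return len(set(tokens)) / len(tokens)

tokens = "I spoke to the chap who spoke to the child".split()
print(type_token_ratio(tokens))  # 7 / 10 = 0.7
```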
Difficult decisions - I
Do we distinguish upper- and lowercase words?
is New in New York the same as new in
new car?
but what of New in New cars are
expensive? (sentence-initial caps)
in practice, it's not straightforward to
distinguish the two accurately, but it can
be done
Difficult decisions - II
What about morphological variants?
man – men one type or two?
go – went one type or two?
If we map all morphological (inflectional) variants to a
single type, our counts will be cleaner
(lemmatisation).
depends on availability of automated methods to do
this
Maltese also presents problems with variants of the
definite article (ir-, is-, ix- etc)
ir-raġel (DEF-man): one token or two?
Difficult decisions - III
Do numbers count?
e.g. is 1,500 a word?
may artificially inflate frequency counts
one approach is to treat all numbers as tokens of a
single type “NUMBER” or “###”
Punctuation
can compromise frequency counts
computer will treat “woman!” as different from
“woman”
needs to be stripped
problematic for languages that rely on non-alphabetic
symbols: Maltese ‘l (“to”) vs l- (“the”)
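These decisions can be bundled into a small normalisation step. The sketch below makes illustrative choices (not a standard recipe): lowercase, strip surrounding punctuation, collapse all numbers into one placeholder type; apostrophes and hyphens are deliberately left alone, since naive stripping would conflate cases like Maltese 'l and l-:

```python
import re

def normalise(token):
    """Illustrative normalisation: lowercase, strip surrounding
    punctuation, collapse numbers to one placeholder type.
    Apostrophes and hyphens are left untouched, since stripping
    them would conflate e.g. Maltese 'l ("to") with l- ("the")."""
    token = token.lower().strip(".,;:!?()\u201c\u201d")
    if re.fullmatch(r"[\d,.]+", token):
        return "###"  # all numbers become tokens of a single type
    return token

print(normalise("Woman!"))  # woman
print(normalise("1,500"))   # ###
```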
Part 2
Representing word frequencies
Raw frequency lists (data from
Maltese)
A simple list, pairing each word with
its frequency
word                      frequency
aħħar ("last")            97
jkun ("be.IMPERF.3SG")    96
ukoll ("also")            93
bħala ("as")              91
dak ("that.SGM")          86
tat- ("of.DEF")           86
Frequency ranks
Word counts can get very big.
most frequent word in the Maltese corpus occurs
2195 times (and the corpus is small)
Raw frequency lists can be hard to process.
Useful to represent words in terms of rank:
count the words
sort by frequency (most frequent first)
assign a rank to the words:
rank 1 = most frequent
rank 2 = next most frequent
…
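The steps above can be sketched in a few lines, here on a toy sentence (types tied in frequency receive arbitrary adjacent ranks):

```python
from collections import Counter

tokens = "the cat sat on the mat and the dog sat too".split()
freqs = Counter(tokens)

# Sort types by descending frequency, then assign ranks 1, 2, 3, ...
ranked = sorted(freqs.items(), key=lambda wf: wf[1], reverse=True)
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq)  # rank 1 = "the" (3), rank 2 = "sat" (2), ...
```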
Rank-frequency list example (data
from Maltese)
rank    frequency
1       2195
2       2080
3       1277
4       1264

(rank = rank of the type, according to frequency;
frequency = number of times the type occurs)
Frequency spectrum (data from
Maltese)
A representation
that shows, for
each frequency
value, the number
of different types
that occur with
that frequency.
frequency    types
1            4382
2            1253
3            661
4            356
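A frequency spectrum can be derived by counting twice: once over the tokens, then once over the resulting frequency values. A sketch:

```python
from collections import Counter

tokens = "a a a b b c c d e f".split()
freqs = Counter(tokens)              # type -> frequency
spectrum = Counter(freqs.values())   # frequency -> number of types

print(sorted(spectrum.items()))  # [(1, 3), (2, 2), (3, 1)]
```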
Normalised frequency counts
A raw frequency for a word isn’t
necessarily informative.
E.g. difficult to compare the frequency of
the word in corpora of different sizes.
We often take a "normalised" count instead.
typically, the frequency is divided by the corpus
size and multiplied by a constant such as
10,000 or 1,000,000
this gives a "frequency per million words"
rather than a raw count.
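A per-million count can be computed as sketched below; the example figures (97 occurrences in a ca. 51,000-token corpus) are taken from the Maltese data above:

```python
def per_million(freq, corpus_size):
    """Normalised frequency: occurrences per million tokens."""
    return freq / corpus_size * 1_000_000

# e.g. a word occurring 97 times in a 51,000-token corpus:
print(round(per_million(97, 51_000)))  # roughly 1902 per million
```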
Type/token ratio revisited
(no. of types)/(no. of tokens)
Another way of estimating “vocabulary
richness” of a corpus, instead of just
looking at vocabulary size.
E.g. if a corpus consists of 1000 words, and
there are 400 types, then the TTR is 40%
Type/token ratio
Ratio varies enormously depending on
corpus size!
If the corpus is 1000 words, it’s easy to see
a TTR of, say, 40%.
With 4 million words, it’s more likely to be
in the region of 2%.
Reasons:
vocab size grows with corpus size but
large corpora will contain a lot of tokens that
occur many times
Standardised type/token ratio
One way to account for TTR variations
due to corpus size is to compute an
average TTR for chunks of a constant
size. Example:
compute the TTR for every 1000 words
of running text
then, take an average over all the 1000-word chunks
This is the approach used, for
example, in WordSmith.
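The chunked averaging can be sketched as follows (a trailing partial chunk is discarded here, which is one of several possible choices):

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive fixed-size chunks of running text.
    A trailing partial chunk is discarded (one possible choice)."""
    starts = range(0, len(tokens) - chunk_size + 1, chunk_size)
    chunks = [tokens[i:i + chunk_size] for i in starts]
    ratios = [len(set(chunk)) / len(chunk) for chunk in chunks]
    return sum(ratios) / len(ratios)

# 4000 tokens alternating between two types: every chunk has TTR 0.002
print(standardised_ttr(["a", "b"] * 2000))
```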
Part 3
Frequency distributions, or
“few giants, many midgets”
Non-linguistic case study
Suppose we are interested in measuring
people’s height.
population = adult, male/female, European
sample: N people from the relevant population
measure height of each person in the sample
Results:
person 1: 1.6 m
person 2: 1.5 m
…
Measures of central tendency
Given the height of individuals in our
sample, we can calculate some summary
statistics:
mean (“average”): sum of all heights in sample,
divided by N
mode: most frequent value
Median: the middle value
What are your expectations?
The data (example)
person    height (cm)
1         135
2         159
3         160
4         160
5         180
Mean: 158.8cm
This is the expected value in
the long run.
If our sample is good, we
would expect that most people
would have a height at or
around the mean.
Mode: 160cm
Median: 160cm
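The three summary statistics for this sample can be checked with Python's statistics module:

```python
import statistics

heights = [135, 159, 160, 160, 180]  # sample heights in cm

print(statistics.mean(heights))    # 158.8
print(statistics.mode(heights))    # 160
print(statistics.median(heights))  # 160
```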
Plotting height/frequency
Observations:
1. Extreme values
are less
frequent.
2. Most people fall
on the mean
3. Mode is
approximately
same as mean
4. Bell-shaped
curve
(“normal”
distribution)
Plotting height/frequency
This shape characterises the Normal
Distribution: a "bell curve".
Quite typical for a lot of data sampled
from humans (but not all data).
What about language?
Typical observations about word
frequencies in corpora:
1. there are a few words with extremely
high frequency
2. there are many more words with
extremely low frequency
3. the mean is not a good indicator: most
words will have an actual value that is
very far above or below the mean
A closer look at the Maltese data
Out of 51,000 tokens:
8016 tokens belong to just the 5 most frequent types
(the types at ranks 1–5)
ca. 15% of our corpus size is made up of only 5
different words!
Out of 8193 types:
4382 are hapax legomena, occurring only once
(bottom ranks)
1253 occur only twice
…
In this data, the mean won’t tell us very much.
it hides huge variations!
Ranks and frequencies (Maltese)
1. 2195
2. 2080
3. 1277
…
2298. 1
2299. 1
…
Among top ranks, frequency drops
very dramatically
Among bottom ranks, frequency drops very
gradually
General observations
In corpora:
there are always a few very high-frequency words, and many low-frequency words
among the top ranks, frequency
differences are big
among bottom ranks, frequency
differences are very small
So what are the high-frequency
words?
Top-ranked words in the Maltese data:
li (“that”), l- (DEF), il- (DEF), u (“and”), ta’
(“of”), tal- (“of the”)
Bottom ranked words:
żona (“zone”) f = 1
yankee f = 1
żwieten (“Zejtun residents”) f = 1
xortih (“luck.POSS-3SGM”) f = 1
widnejhom (“ear.POSS-3PL”) f = 1
Zipf’s law
George K. Zipf (1902 – 1950) established a
mathematical model for describing
frequency data:
Frequency decreases with rank. More
precisely, frequency is inversely
proportional to rank.
We can plot this in a chart:
Y-axis = frequency
X-axis = rank
each dot on the chart represents the lexical item
(type) at a given rank
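One way to see what "inversely proportional" means: under an ideal Zipfian distribution, frequency times rank stays roughly constant. A sketch with idealised frequencies (illustrative numbers, not the Maltese counts):

```python
# Idealised Zipfian frequencies: f(rank) = C / rank, with C = 1200.
freqs = [1200, 600, 400, 300, 240, 200]

for rank, f in enumerate(freqs, start=1):
    print(rank, f, rank * f)  # the product rank * f stays at 1200
```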
How Zipf’s law pans out (Maltese
data)
A few high-frequency, low-rank words;
hundreds of low-frequency, high-rank words.
[Chart: frequency (y-axis, 0–2500) plotted against
rank (x-axis, 0–9000) for the Maltese corpus]
Zipf’s law cross-linguistically
Empirical work has shown that the Zipfian
distribution is observable:
independent of the language
irrespective of corpus size (for reasonably large
corpora)
The bigger your corpus:
the bigger your vocabulary size (no. types)
the more words of frequency 1 (hapax
legomena)
Why?
Some reasons
If words were completely random, every
word would be equally likely.
Our plot would be completely flat: all words at
all ranks have same frequency.
Language is decidedly non-random:
occurrence of words governed by:
syntax
author/speaker intentions
...
Some words are the basic “skeleton” for
our sentences. They are the most frequent.
Implications
Traditional measures of central
tendency (mean etc) not very useful.
No two corpora can be directly
compared if they are of different size:
vocab size increases with corpus size
most of the vocab made up of hapax
legomena
most of the corpus size (no. tokens)
made up of a few, very frequent types,
typically function words.
Summary
We’ve introduced some of the uses of
corpora for lexicography.
Focused today on word frequencies,
especially Zipf’s law
looked at some of the implications
Next up:
collocations and why they’re useful
References
Baroni, M. (2007). Distributions in
text. In A. Lüdeling and M. Kytö
(eds.), Corpus linguistics: An
international handbook. Berlin:
Mouton de Gruyter.