Lin3098-zipf


LIN 3098 – Corpus Linguistics
Lecture 5
Albert Gatt
In this lecture…
 Corpora and the Lexicon
 uses of corpora in lexicography
 Counting words
 lemmatisation and other issues
 types versus tokens
 word frequency distributions in corpora
Part 1
Corpora and lexicography
Why corpora are useful
 Lexicographic work has long relied on
contextual cues to identify meanings.
 e.g. Samuel Johnson used examples from
literature to exemplify uses of a word.
 Corpora make this procedure much easier
 not only to provide examples but:
 to actually identify meanings of a word given its
context
 definitions of word meanings should therefore be
more precise, if based on large amounts of data
Specific applications
 Grammatical alternations of words
 E.g. Verb diathesis alternations:
 Atkins and Levin (1995) found that verbs such
as quiver and quake have both intransitive and
transitive uses. (see Lecture 1)
 E.g. uses of prepositions such as on, with…
 Regional variations in word use
 relying on corpora which include
gender/region/dialect/date information
Specific applications - II
 Identification of occurrences of a specific
homograph, e.g. house (Verb)
 examination of the contexts in which it occurs
 relies on POS tagging
 Keeping track of changes in a language
through a monitor corpus
 Identifying how common a word is, through
frequency counts.
 many dictionaries include such information now
 this shall be our starting point
Part 2
Counting words in corpora: types
versus tokens
Running example
 Throughout this lecture, reference is
made to data from a corpus of
Maltese texts:
 ca. 51,000 words
 all from Maltese-language newspapers
 various topics and article types
How to count words: types versus
tokens
 token = any word in the corpus
 (also counting words that occur more than once)
 type = each individual, distinct word in
the corpus
 (grouping occurrences of a word together as
representatives of a single type)
 Example:
 I spoke to the chap who spoke to the child
 10 tokens
 7 types (I, spoke, to, the, chap, who, child)
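The example above can be checked with a few lines of Python. This is only a minimal sketch: it assumes that splitting on whitespace is an adequate way to find the tokens, which holds for this sentence but not for real corpus text.

```python
# Count tokens and types in the example sentence from the slide.
sentence = "I spoke to the chap who spoke to the child"

tokens = sentence.split()   # every running word, repeats included
types = set(tokens)         # each distinct word form counted once

print(len(tokens))  # 10
print(len(types))   # 7
```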
More on types and tokens
 The number of tokens in the corpus is
an estimate of overall corpus size
 Maltese corpus: 51,000 tokens
 The number of types is an estimate of
vocabulary size
 gives an idea of the lexical richness of
the corpus
 Maltese corpus: 8193 types
Type/token ratio
 A (rough!) way of measuring the
amount of variation in the vocabulary
in the corpus.
TTR = (no. of types) / (no. of tokens)
 Roughly, can be interpreted as the
“rate at which new types are
introduced, as a function of number
of tokens”
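As a rough illustration, the ratio can be computed as below. The function name `type_token_ratio` is made up for this sketch, and the Maltese figure uses the approximate token count quoted on the slides, so it is only indicative.

```python
def type_token_ratio(tokens):
    """Return no. of types divided by no. of tokens (0.0 for empty input)."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

example = "I spoke to the chap who spoke to the child".split()
print(type_token_ratio(example))  # 0.7

# Figures quoted for the Maltese corpus: 8193 types, ca. 51,000 tokens
print(round(8193 / 51000, 3))  # 0.161
```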
Difficult decisions - I
 Do we distinguish upper- and lowercase words?
 is New in New York the same as new in
new car?
 but what of New in New cars are
expensive? (sentence-initial caps)
 in practice, it’s not straightforward to
distinguish the two accurately, but it can
be done
Difficult decisions - II
 What about morphological variants?
 man – men: one type or two?
 go – went: one type or two?
 If we map all morphological (inflectional) variants to a
single type, our counts will be cleaner
(lemmatisation).
 depends on availability of automated methods to do
this
 Maltese also presents problems with variants of the
definite article (ir-, is-, ix- etc.)
 ir-raġel (DEF-man): one token or two?
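The effect of lemmatisation on type counts can be sketched with a toy lookup table. The table `LEMMAS` and the function `lemmatise` are hypothetical, written only for this illustration; a real lemmatiser needs morphological analysis or a far larger lexicon.

```python
# Hypothetical, hand-made lemma table; real lemmatisers are far larger.
LEMMAS = {"men": "man", "went": "go", "goes": "go"}

def lemmatise(tokens):
    """Replace each token by its lemma when the table lists one."""
    return [LEMMAS.get(tok, tok) for tok in tokens]

tokens = ["man", "men", "go", "went"]
print(len(set(tokens)))             # 4 types before lemmatisation
print(len(set(lemmatise(tokens))))  # 2 types after
```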
Difficult decisions - III
 Do numbers count?
 e.g. is 1,500 a word?
 may artificially inflate frequency counts
 one approach is to treat all numbers as tokens of a
single type “NUMBER” or “###”
 Punctuation
 can compromise frequency counts
 computer will treat “woman!” as different from
“woman”
 needs to be stripped
 problematic for languages that rely on non-alphabetic
symbols: Maltese ‘l (“to”) vs l- (“the”)
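Some of these decisions can be sketched as a simple clean-up step. Everything here is an assumption made for illustration: the function name `normalise`, the set of punctuation characters, and the choice to keep apostrophes and hyphens so that forms like 'l and l- are not destroyed.

```python
import re

def normalise(raw_tokens, lowercase=True):
    """Illustrative clean-up: strip edge punctuation, fold case, merge numbers."""
    cleaned = []
    for tok in raw_tokens:
        # Strip punctuation at the edges, but keep apostrophes and hyphens,
        # since forms such as Maltese 'l and l- depend on them.
        tok = tok.strip(".,;:!?\"()[]")
        if not tok:
            continue                # token was pure punctuation
        if re.fullmatch(r"\d+([.,]\d+)*", tok):
            tok = "NUMBER"          # all numbers become tokens of one type
        elif lowercase:
            tok = tok.lower()       # note: also merges New (York) with new
        cleaned.append(tok)
    return cleaned

print(normalise("The woman! paid 1,500 for New cars .".split()))
# ['the', 'woman', 'paid', 'NUMBER', 'for', 'new', 'cars']
```

Lowercasing everything is the blunt option: it handles sentence-initial capitals, but it also loses the New York / new car distinction discussed earlier.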
Part 2
Representing word frequencies
Raw frequency lists (data from
Maltese)
 A simple list, pairing each word with
its frequency
word                      frequency
aħħar (“last”)            97
jkun (“be.IMPERF.3SG”)    96
ukoll (“also”)            93
bħala (“as”)              91
dak (“that.SGM”)          86
tat- (“of.DEF”)           86
Frequency ranks
 Word counts can get very big.
 most frequent word in the Maltese corpus occurs
2195 times (and the corpus is small)
 Raw frequency lists can be hard to process.
 Useful to represent words in terms of rank:
 count the words
 sort by frequency (most frequent first)
 assign a rank to the words:
 rank 1 = most frequent
 rank 2 = next most frequent
 …
Rank-frequency list example (data
from Maltese)
rank    frequency
1       2195
2       2080
3       1277
4       1264
(rank = rank of the type, according to frequency; frequency = number of times the type occurs)
Frequency spectrum (data from
Maltese)
 A representation that shows, for each frequency value, the number of different types that occur with that frequency.
frequency   types
1           4382
2           1253
3           661
4           356
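A frequency spectrum can be derived by counting the counts. The sketch below uses an invented toy sentence; `frequency_spectrum` is a name chosen for the example.

```python
from collections import Counter

def frequency_spectrum(tokens):
    """Map each frequency value to the number of types with that frequency."""
    word_counts = Counter(tokens)               # type -> frequency
    spectrum = Counter(word_counts.values())    # frequency -> no. of types
    return dict(sorted(spectrum.items()))

toy = "the cat saw the dog and the dog saw the cat".split()
print(frequency_spectrum(toy))  # {1: 1, 2: 3, 4: 1}
```

Here one type occurs once (and), three types occur twice (cat, saw, dog) and one type occurs four times (the).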
Normalised frequency counts
 A raw frequency for a word isn’t
necessarily informative.
 E.g. it is difficult to compare the frequency of
a word in corpora of different sizes.
 We often take a “normalised” count.
 typically, divide the frequency by the corpus size (no. of tokens)
and multiply by some constant, such as 10,000 or 1,000,000
 this gives “frequency of word per million” (or per 10,000)
rather than a raw count.
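A normalised count is usually the raw count divided by the corpus size, then scaled to a fixed base. The sketch below applies this to one word from the Maltese list. The function name `per_million` is an assumption of the example, and the corpus size is the approximate figure from the slides.

```python
def per_million(raw_count, corpus_size, base=1_000_000):
    """Relative frequency scaled to occurrences per `base` tokens."""
    return raw_count / corpus_size * base

# aħħar occurs 97 times in the ca. 51,000-token Maltese corpus
print(round(per_million(97, 51_000)))               # 1902 per million
print(round(per_million(97, 51_000, base=10_000)))  # 19 per ten thousand
```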
Type/token ratio revisited
 (no. of types)/(no. of tokens)
 Another way of estimating “vocabulary
richness” of a corpus, instead of just
looking at vocabulary size.
 E.g. if a corpus consists of 1000 words, and
there are 400 types, then the TTR is 40%
Type/token ratio
 Ratio varies enormously depending on
corpus size!
 If the corpus is 1000 words, it’s easy to see
a TTR of, say, 40%.
 With 4 million words, it’s more likely to be
in the region of 2%.
 Reasons:
 vocab size grows with corpus size, but
 large corpora will contain a lot of types that
occur many times, so tokens accumulate faster than types
Standardised type/token ratio
 One way to account for TTR variations
due to corpus size is to compute an
average TTR for chunks of a constant
size. Example:
 compute the TTR for every 1000 words
of running text
 then, take an average over all the 1000-word chunks
 This is the approach used, for
example, in WordSmith.
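The chunk-and-average idea can be sketched as below. This is not WordSmith's implementation: the function name `standardised_ttr` and the decision to ignore an incomplete final chunk are assumptions made for the illustration.

```python
def standardised_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive, non-overlapping chunks of chunk_size tokens."""
    ratios = []
    # Only full chunks are used; a shorter final remainder is ignored.
    for start in range(0, len(tokens) - chunk_size + 1, chunk_size):
        chunk = tokens[start:start + chunk_size]
        ratios.append(len(set(chunk)) / chunk_size)
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)

toy = "a b a b c d c d e f".split()
print(standardised_ttr(toy, chunk_size=4))  # 0.5
print(len(set(toy)) / len(toy))             # 0.6 (plain TTR, for comparison)
```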
Part 3
Frequency distributions, or
“few giants, many midgets”
Non-linguistic case study
 Suppose we are interested in measuring
people’s height.
 population = adult, male/female, European
 sample: N people from the relevant population
 measure height of each person in the sample
 Results:
 person 1: 1.6 m
 person 2: 1.5 m
 …
Measures of central tendency
 Given the height of individuals in our
sample, we can calculate some summary
statistics:
 mean (“average”): sum of all heights in sample,
divided by N
 mode: most frequent value
 median: the middle value when the values are sorted
 What are your expectations?
The data (example)
person  height (cm)
1       135
2       159
3       160
4       160
5       180
 Mean: 158.8cm
 This is the expected value in
the long run.
 If our sample is good, we
would expect that most people
would have a height at or
around the mean.
 Mode: 160cm
 Median: 160cm
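These three summary statistics can be reproduced with Python's standard `statistics` module, using the five heights from the example data.

```python
import statistics

heights_cm = [135, 159, 160, 160, 180]

print(statistics.mean(heights_cm))    # 158.8
print(statistics.median(heights_cm))  # 160
print(statistics.mode(heights_cm))    # 160
```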
Plotting height/frequency
Observations:
1. Extreme values are less frequent.
2. Most people fall at or around the mean.
3. Mode is approximately the same as the mean.
4. Bell-shaped curve (“normal” distribution).
Plotting height/frequency
• This shape characterises the Normal Distribution.
• A “bell curve”
• Quite typical for a lot of data sampled from humans (but not all data)
What about language?
 Typical observations about word
frequencies in corpora:
1. there are a few words with extremely
high frequency
2. there are many more words with
extremely low frequency
3. the mean is not a good indicator: most
words will have an actual value that is
very far above or below the mean
A closer look at the Maltese data
 Out of 51,000 tokens:
 8016 tokens belong to just the 5 most frequent types
(the types at ranks 1 -- 5)
 ca. 15% of our corpus size is made up of only 5
different words!
 Out of 8193 types:
 4382 are hapax legomena, occurring only once
(bottom ranks)
 1253 occur only twice
 …
 In this data, the mean won’t tell us very much.
 it hides huge variations!
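A quick calculation with the figures above shows why the mean is uninformative here. The corpus size is the approximate 51,000 quoted on the slides, so the percentages are indicative only.

```python
tokens_total = 51_000   # approximate corpus size given on the slides
types_total = 8193

top5_tokens = 8016      # tokens belonging to the five most frequent types
hapaxes = 4382          # types occurring exactly once

print(f"{top5_tokens / tokens_total:.1%}")   # 15.7% of tokens from 5 types
print(f"{hapaxes / types_total:.1%}")        # 53.5% of types are hapaxes
print(round(tokens_total / types_total, 1))  # 6.2 = mean frequency per type
```

The mean frequency is about 6 occurrences per type, yet more than half of the types occur once and the top-ranked type occurs 2195 times.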
Ranks and frequencies (Maltese)
1. 2195
2. 2080
3. 1277
…
2298. 1
2299. 1
…
Among top ranks, frequency drops
very dramatically
Among bottom ranks, frequency drops very
gradually
General observations
 In corpora:
 there are always a few very high-frequency words, and many low-frequency words
 among the top ranks, frequency
differences are big
 among bottom ranks, frequency
differences are very small
So what are the high-frequency
words?
 Top-ranked words in the Maltese data:
 li (“that”), l- (DEF), il- (DEF), u (“and”), ta’
(“of”), tal- (“of the”)
 Bottom ranked words:
 żona (“zone”) f = 1
 yankee f = 1
 żwieten (“Zejtun residents”) f = 1
 xortih (“luck.POSS-3SGM”) f = 1
 widnejhom (“ear.POSS-3PL”) f = 1
Zipf’s law
 George K. Zipf (1902 – 1950) established a
mathematical model for describing
frequency data:
Frequency decreases with rank. More
precisely, frequency is inversely
proportional to rank.
 We can plot this in a chart:
 Y-axis = frequency
 X-axis = rank
 each dot on the chart represents the lexical item
(type) at a given rank
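"Inversely proportional to rank" can be written as f(r) = C / r for some constant C. The sketch below sets C to the frequency of the rank-1 word, which is a simplifying assumption of this example, and compares the idealised values with the first four Maltese ranks. The function name `zipf_expected` is made up for the illustration.

```python
def zipf_expected(top_frequency, rank):
    """Idealised Zipfian frequency at `rank`: f(r) = f(1) / r."""
    return top_frequency / rank

observed = {1: 2195, 2: 2080, 3: 1277, 4: 1264}  # Maltese data from the slides

for rank, freq in observed.items():
    # rank, observed frequency, idealised frequency under f(r) = f(1) / r
    print(rank, freq, round(zipf_expected(observed[1], rank)))
```

The observed frequencies fall off less steeply than the idealised ones at these ranks, which is a reminder that the law describes the overall shape of the distribution and is not an exact fit at every rank.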
How Zipf’s law pans out (Maltese
data)
[Chart: frequency (y-axis, 0 to 2500) plotted against rank (x-axis, 0 to 9000) for the Maltese data. Annotations: a few high-frequency, low-rank words; hundreds of low-frequency, high-rank words.]
Zipf’s law cross-linguistically
 Empirical work has shown that the Zipfian
distribution is observable:
 independent of the language
 irrespective of corpus size (for reasonably large
corpora)
 The bigger your corpus:
 the bigger your vocabulary size (no. types)
 the more words of frequency 1 (hapax
legomena)
 Why?
Some reasons
 If words were completely random, every
word would be equally likely.
 Our plot would be completely flat: all words at
all ranks have same frequency.
 Language is absolutely non-random:
 occurrence of words governed by:
 syntax
 author/speaker intentions
 ...
 Some words are the basic “skeleton” for
our sentences. They are the most frequent.
Implications
 Traditional measures of central
tendency (mean etc) not very useful.
 No two corpora can be directly
compared if they are of different size:
 vocab size increases with corpus size
 most of the vocab made up of hapax
legomena
 most of the corpus size (no. tokens)
made up of a few, very frequent types,
typically function words.
Summary
 We’ve introduced some of the uses of
corpora for lexicography.
 Focused today on word frequencies,
especially Zipf’s law
 looked at some of the implications
 Next up:
 collocations and why they’re useful
References
 Baroni, M. (2007). Distributions in
text. In A. Lüdeling and M. Kytö
(eds.), Corpus linguistics: An
international handbook. Berlin:
Mouton de Gruyter.