How many words are in the English Language?


Borgmann Project
List all the words in the English language
Chris Cole
Ur Studios, Inc.
Dmitri Alfred Borgmann
“Father of logology”
• Recreational linguistics
• Systematic wordplay
Author of two seminal books
• Language on Vacation: An Olio of Orthographical Oddities (Scribner's, 1965)
• Beyond Language (Adventures in Words and Thought) (Scribner's, 1967)
Founder of Word Ways: The Journal of Recreational Linguistics (1968)
How many words are in the English Language?
“The English language has a complement of somewhere between two million and three million "short" words...”
-Dmitri Borgmann, Beyond Language, p. 226
How many words are in the largest unabridged dictionaries?
Philip Babcock Gove, Preface to Webster's Third New International Dictionary of the
English Language, Unabridged (G. & C. Merriam, 1961), p. 7a:
“This dictionary has a vocabulary of over 450,000 words. It would have been easy to
make the vocabulary larger although the book, in the format of the preceding edition,
could hardly hold any more pages or be any thicker. By itself, the number of entries is,
however, not of first importance. The number of words available is always far in
excess of and for a one volume dictionary many times the number that can possibly
be included.”
How many words are in the largest unabridged dictionaries?
John Simpson, Chief Editor, Oxford English Dictionary, Preface to the Third Edition, March
2000:
“There are a number of myths about the Oxford English Dictionary, one of the most
prevalent of which is that it includes every word, and every meaning of every word,
which has ever formed part of the English language. Such an objective could never be
fully achieved. […] It is also often claimed that a ‘word’ is not a ‘word’ (or is not
‘English’) unless it is in ‘the dictionary’. This may be acceptable logic for the purposes
of word games, but not outside those limits. […] It may be added here that the
question ‘How many words are there in the English language?’ cannot be answered by
recourse to a dictionary.”
How many words are in the largest unabridged dictionaries?
Victoria Neufeldt, editor of the Webster's New World family of dictionaries, quoted in
Kenneth F. Kister, Kister's Best Dictionaries for Adults and Young People, A
Comparative Guide, The Oryx Press, Phoenix, Arizona, 1992, p. 79:
“I hate the word "unabridged." It's stupid and misleading, since it is used for all large
dictionaries, regardless of whether an abridged edition of a given dictionary exists;
and also, because the word sort of implies the idea of completeness, it encourages
the buyer to believe that the dictionary so described contains all the words of the
language. No dictionary comes anywhere near doing that.”
What is a word?
• A word is the smallest unit of meaning.
• Analogous to:
  • A letter is the smallest unit of spelling.
  • A phoneme is the smallest unit of pronunciation.
How many words are in the English language?
• Unabridged dictionaries contain about 500,000 words.
• If “many times” (Gove) implies a multiple of 4 to 6, then 2 to 3 million (Borgmann) is a reasonable estimate.
• How to find these words?
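The arithmetic behind this estimate can be checked directly. The following is a minimal sketch, assuming (as the slide does) that Gove's “many times” means a multiple of 4 to 6 and that an unabridged dictionary holds about 500,000 words; the function and variable names are illustrative only.

```python
# Approximate size of an unabridged dictionary, as stated on the slide.
DICTIONARY_WORDS = 500_000


def estimate_range(base, low_multiple=4, high_multiple=6):
    """Return (low, high) estimates of the total word count.

    The multiples 4 and 6 are the slide's reading of "many times".
    """
    return base * low_multiple, base * high_multiple


low, high = estimate_range(DICTIONARY_WORDS)
print(f"{low:,} to {high:,} words")
```

The result brackets Borgmann's figure of two to three million.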
The problem of names
• A name is a word that designates an individual or a class of individuals.
• The number of names is unlimited.
The problem of prefixes and suffixes
• “countermeasures,” “countercountermeasures,” “countercountercountermeasures,” etc. are all understandable and distinct, hence words.
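Prefix stacking can be made concrete with a few lines of code. This is a small sketch, not part of the original talk; the helper name stack_prefix is hypothetical. It shows that the series has no natural end, so no finite list can contain every member.

```python
def stack_prefix(prefix, base, depth):
    """Build a word by repeating `prefix` `depth` times in front of `base`."""
    return prefix * depth + base


# Each additional level of stacking yields a distinct string.
for depth in range(1, 4):
    print(stack_prefix("counter", "measures", depth))
```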
The problem of compounds
• English is loose about closing open compounds, e.g., “airvent,” “air vent,” “air-vent.”
The problem of derived forms
• Is "shanghaiings" a word?
  shanghai (verb) →
  shanghaiing (participle) →
  shanghaiing (noun) →
  shanghaiings (plural)
• If so, it is interesting because each letter in it occurs exactly twice.
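The letter-count property is easy to verify mechanically. This is a minimal sketch using Python's standard collections.Counter; the function name is illustrative.

```python
from collections import Counter


def each_letter_occurs_n_times(word, n=2):
    """Return True if every distinct letter in `word` occurs exactly `n` times."""
    counts = Counter(word.lower())
    return all(count == n for count in counts.values())


print(each_letter_occurs_n_times("shanghaiings"))
print(sorted(Counter("shanghaiings").items()))
```

The same check fails for the singular “shanghaiing,” where the letter “s” occurs only once.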
The problem of rare words
• English comprises American, Canadian, British, Australian, and other dialects.
• Web3 lists words printed since 1752. The OED lists many older forms. What is the cutoff?
• In addition, there are jargon, technical terms, slang, loan words, etc.
• What is the English language?
Example: “amitular”
• Incorrectly formed by analogy with “avuncular”: the ending “-ular” is appended to “amit-,” from the Latin “amita” (“aunt”), whereas the most appropriate adjective is probably “amital.”
• Independently coined in 1982, 2003, 2004, and 2007.
• Listed in several reference books.
Solving these problems
• What does “word” mean?
• Paradox of the heap: if you remove one grain of sand from a heap, it is still a heap; hence, logically, even one grain is a heap.
• The word “heap” is less likely to apply to smaller heaps.
• “Word” is a vague term like “heap.”
Probability is the key
• Wittgenstein: no private language.
  • To be in a language, a word must be understood by multiple speakers of that language.
• The probability that a string is a word is just the probability that it will be understood by a speaker.
Solves previous problems
• Names: specific names are understood by only a few speakers
• Prefixes and suffixes: highly stacked words are difficult to understand
• Compounds: most are in fact words
• Rare: understood by few speakers
• Derived forms: unusual derivations are difficult to understand
Using dictionaries to determine probability
• Words are included because they are likely to be useful to customers.
• Example: Early dictionaries did not include common words.
• Example: “airvent” is not in any dictionary because its meaning is obvious.
• Printed dictionaries are limited in size, but that limit does not apply to electronic dictionaries.
Not going to be fixed by electronic dictionaries
• It costs money to define a word.
• Word inclusion requires cost/benefit analysis.
• Faulty assumption: Words that are easily understood will be in the dictionary.
Using corpora to determine probability
Large corpora are available:
• USENET: over one million distinct strings in one billion instances
• Google: over ten million distinct strings occurring over 200 times in one trillion instances
Problem with using corpus to determine probability
• Example: “countercountermeasure”
• Defined in a college-level dictionary (11th Collegiate).
• Google initially reports 1000 hits; the real figure is about 100.
• Why is it not in the Google corpus?
• It does not have enough occurrences (200 are required).
How many dictionary words are not in the corpus?

                 In corpus    Not in corpus
College-level    109796       9882
Unabridged       241213       200523

• Examples of college-level words not in corpus: airmanships, airposts, airpowers
• >40% of dictionary words not in Google corpus.
• The problem is the corpus cutoff requiring at least 200 occurrences.
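The share of dictionary words missing from the corpus can be recomputed from the four counts. This is a sketch under one assumption: the figures pair up as College-level 109,796 in corpus and 9,882 not, and Unabridged 241,213 in corpus and 200,523 not. That pairing is the reading that keeps the college-level total near the usual size of such a dictionary and matches the ">40%" claim for the unabridged list.

```python
# Counts as read from the slide (pairing is an assumption, see above).
counts = {
    "College-level": {"in_corpus": 109_796, "not_in_corpus": 9_882},
    "Unabridged": {"in_corpus": 241_213, "not_in_corpus": 200_523},
}


def missing_fraction(row):
    """Fraction of a dictionary's words that do not appear in the corpus."""
    total = row["in_corpus"] + row["not_in_corpus"]
    return row["not_in_corpus"] / total


for name, row in counts.items():
    print(f"{name}: {missing_fraction(row):.1%} not in corpus")
```

Under this reading, the ">40%" figure applies to the unabridged list; the college-level list is missing a much smaller share.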
Signal versus noise
Why is the Google corpus cutoff 200 occurrences?
Frequency        Count          Ratio
2,000,000,000    44             -
200,000,000      325            7.38636
20,000,000       3,844          11.8277
2,000,000        20,403         5.30775
200,000          83,972         4.11567
20,000           387,649        4.61641
2,000            2,134,600      5.50653
200              10,957,554     5.13331
20               55,000,000     5
2                275,000,000    5
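The Ratio column is each Count divided by the Count in the row above it, which can be confirmed with a short script. This is a sketch only; the list names are illustrative. The last two rows (frequencies 20 and 2) are round numbers with a ratio of exactly 5, so they appear to be extrapolations below the corpus cutoff rather than measured counts; that reading is an inference, not something the slide states.

```python
# Measured rows from the table: occurrence threshold and number of strings.
thresholds = [2_000_000_000, 200_000_000, 20_000_000, 2_000_000,
              200_000, 20_000, 2_000, 200]
counts = [44, 325, 3_844, 20_403, 83_972, 387_649, 2_134_600, 10_957_554]


def successive_ratios(values):
    """Ratio of each value to the one before it."""
    return [later / earlier for earlier, later in zip(values, values[1:])]


ratios = successive_ratios(counts)
for threshold, ratio in zip(thresholds[1:], ratios):
    print(f"{threshold:>13,}  {ratio:.5f}")

# Assumed extrapolation: each tenfold drop in threshold multiplies the count by ~5.
ASSUMED_RATIO = 5
estimate_at_20 = counts[-1] * ASSUMED_RATIO
print(f"Estimated strings with at least 20 occurrences: {estimate_at_20:,}")
```

The estimate at a threshold of 20 comes out near the 55,000,000 shown in the table.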
Signal versus noise

Too many noise words below 200 hits.
Hits          College-level    Unabridged    Words    Non-words
100           5                31            766      1,178
1,000         21               31            249      258
10,000        29               21            64       44
100,000       25               8             4        0
1,000,000     10               0             0        0
10,000,000    3                0             0        0
Example: 2747 strings that start with “air”
Samples of non-words: aircraaft aircracft aircract aircradft aircradt aircraf aircraf5 aircraf5t aircraf6 aircraf6t aircrafc aircrafct
aircrafdt aircraff aircrafft aircrafg aircrafgt aircrafi aircrafr aircrafrt
Samples of words: airbagged airbalancer airball airballed airballoon airballs airband airbands airbanks airbath airbaths
airbats airbattle airbeam airbeams airbear airbearing airbed airbeds airbell airbelt
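The hit-count table can be tallied to see how quickly noise grows as hits fall. This is a sketch assuming each row lists, for one hit bucket, the number of “air” strings in four categories (College-level, Unabridged, Words, Non-words); the dictionary name rows is illustrative.

```python
# hits bucket -> (college-level, unabridged, words, non-words)
rows = {
    100: (5, 31, 766, 1_178),
    1_000: (21, 31, 249, 258),
    10_000: (29, 21, 64, 44),
    100_000: (25, 8, 4, 0),
    1_000_000: (10, 0, 0, 0),
    10_000_000: (3, 0, 0, 0),
}


def non_word_share(row):
    """Fraction of the strings in a bucket that are non-words."""
    return row[3] / sum(row)


total = sum(sum(row) for row in rows.values())
print(f"Total strings: {total}")
for hits, row in rows.items():
    print(f"{hits:>10,} hits: {non_word_share(row):.1%} non-words")
```

The six rows sum to 2747, the number of “air” strings quoted in the example, and non-words make up more than half of the lowest bucket.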
Corpora are not the solution by themselves
• Use versus mention, names, spam, etc.
• Faulty assumption: A word that is easily understood will be used.
Modeling human understanding
• Bayesian model of word understanding
  • Neuroscience results give us reasons to believe that understanding can be modeled by Bayesian inference.
• Generative model of word formation
  • Linguistics gives us reasons to believe that word formation follows a predictable historical process.
Bayesian model of word understanding
• Example:
  shanghaiings (plural of noun, p1) →
  shanghaiing (noun from participle, p2) →
  shanghaiing (participle from verb, p3) →
  shanghai (in dictionary, p4)
• Probability = p1 * p2 * p3 * p4
• Each pi is determined via Bayes’ Law from observed ratios of occurrences (of similar cases)
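The chain product can be sketched in a few lines. The step probabilities below are hypothetical placeholders chosen only to show the computation; the talk does not give values, and in the proposed model each one would be estimated from observed ratios of similar cases.

```python
from math import prod

# (form, derivation step, probability) -- all probabilities are HYPOTHETICAL.
steps = [
    ("shanghaiings", "plural of noun", 0.9),         # p1
    ("shanghaiing", "noun from participle", 0.5),    # p2
    ("shanghaiing", "participle from verb", 0.8),    # p3
    ("shanghai", "in dictionary", 1.0),              # p4
]


def chain_probability(derivation):
    """Multiply the step probabilities along a derivation chain."""
    return prod(p for _form, _step, p in derivation)


print(f"P(shanghaiings is a word) = {chain_probability(steps):.2f}")
```

Because the probabilities multiply, every extra derivation step can only lower the result, which matches the claim that unusual or heavily stacked derivations are harder to understand.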
Generative model of word formation
• Rules of word formation (etymology, parallelism, sound change, spelling change, etc.)
• Example:
  avunculus (Latin, “uncle”) → avuncular
  ║ amita (Latin, “aunt”) → amitular
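The parallelism in this example can be mimicked with a toy rule. This is a sketch only: the rule (strip a Latin ending, then append a suffix) and the function name are assumptions made for illustration, not the rules used in the project.

```python
def adjective_by_analogy(latin_noun, endings=("ulus", "a"), suffix="ular"):
    """Toy rule: drop the first matching ending, then append `suffix`."""
    for ending in endings:
        if latin_noun.endswith(ending):
            return latin_noun[: -len(ending)] + suffix
    return latin_noun + suffix


print(adjective_by_analogy("avunculus"))          # the attested model
print(adjective_by_analogy("amita"))              # formed by analogy
print(adjective_by_analogy("amita", suffix="al"))  # the more regular form
```

A generative model would need many such rules, each weighted by how often it has actually produced accepted words.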
Work in progress
• Iterative approach
  • Work “outward” from dictionary using linguistic rules of word formation.
  • Work “inward” from corpus using Bayesian inference on grammar rules.
• Goal is a process instead of a list
Work in progress
Words starting with “pro.” Results from parsing 10 million sentences collected
from USENET 1992.
[Chart: vertical axis from 0 to 6000; percentage marks at 25%, 50%, 75%, and 100%; categories labeled Compound and Dictionary 500K, 250K, 125K, and 60K.]