The Ins and Outs of making a Wordlist

Rob Waring
Notre Dame Seishin University
www.robwaring.org/presentations/
Overview
Purpose - What kind of list?
List structure
Selection factors
Definitions or translations?
Mechanics
Validating
Android too
Black – in level
Red – out of list
Red underline – out of level
Green – ignored words
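The colour legend above implies a four-way classification of each token in a text. A minimal sketch of that classification, assuming a hypothetical `levels` dictionary that maps each lowercase word to a level number (the function name, labels and data format are illustrative, not taken from any particular tool):

```python
import re

def profile_text(text, levels, target_level, ignored=frozenset()):
    """Label each token as in_level, out_of_level, out_of_list or ignored.

    levels: dict mapping a lowercase word to its level number (assumed format).
    ignored: lowercase words to skip, e.g. character names in a graded reader.
    """
    labels = []
    # Very simple tokenizer: runs of letters, with an optional apostrophe part.
    for token in re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text):
        word = token.lower()
        if word in ignored:
            label = "ignored"        # shown in green in the legend
        elif word not in levels:
            label = "out_of_list"    # shown in red
        elif levels[word] <= target_level:
            label = "in_level"       # shown in black
        else:
            label = "out_of_level"   # shown with a red underline
        labels.append((token, label))
    return labels

levels = {"the": 1, "cat": 1, "sat": 1, "on": 1, "mat": 2}
for token, label in profile_text("The cat sat on the mat", levels, target_level=1):
    print(token, label)
```

A real profiler would also need to decide how inflected forms and multi-word units are matched against the list.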
What kind of list - Purpose?
• To give to students to learn from (paper or digital)
• To analyze texts against e.g. a graded reader
• To cover the majority of words in a given field (e.g. top 1000 business words)
• Master list to source sublists from?
• Multiple level lists, or one list?
• For a single class – or general (e.g. all natives, all intermediates)
• Spoken, written, mixed?
• For a specific audience?
– TOEIC, business, academic
– A certain age
– A certain level (intermediates)
What kind of list - Starting point?
• Use existing wordlists – GSL, Nation’s BNC lists, NGSL, NAWL ...
• Use existing corpus (e.g. BNC, COCA) and dig out what you want
• Create your own corpus (business, TOEIC, nursing)
• Does it suit your purpose? Will BNC give you an academic list?
• Is it structured the way you want? Headwords only? Lemmas? Mixture?
BNC raw (by type)
BNC Nation Family list
List structure I - List with Levels
• How many levels? Why?
• What are the breaks between levels? Will learners get from one level to another with ease?
• Will the breaks be even (say 560 words each) or vary?
• Level by frequency? utility? range? intuition? Learnability?
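As a rough illustration of even versus varying breaks, the sketch below cuts a frequency-ranked list into levels. It assumes the list is already sorted from most to least frequent; both function names are hypothetical.

```python
def split_into_even_levels(ranked_words, band_size):
    """Cut a ranked list into consecutive levels of band_size words each."""
    if band_size <= 0:
        raise ValueError("band_size must be positive")
    return [ranked_words[i:i + band_size]
            for i in range(0, len(ranked_words), band_size)]

def split_by_sizes(ranked_words, sizes):
    """Cut a ranked list into levels whose sizes vary, e.g. [300, 500, 800]."""
    levels, start = [], 0
    for size in sizes:
        levels.append(ranked_words[start:start + size])
        start += size
    return levels

ranked = ["the", "of", "and", "a", "in", "to", "have"]
print(split_into_even_levels(ranked, 3))
print(split_by_sizes(ranked, [2, 5]))
```

Levelling by utility, range or learnability would need a different sort key before the cut is made.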
Selection Criteria I
Representativeness: The list should adequately represent the wide range of uses of language
Frequency and range: A word should occur frequently across a wide range of texts.
Word families: Sensible set of criteria regarding what forms and uses are counted as being members of the same family
Utility: how useful will the words be to the target learners
Idioms and set expressions: Some items larger than a word behave like high frequency words
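Frequency and range can be computed side by side. The sketch below is a minimal version, assuming each text is a plain string and using a deliberately crude tokenizer; the function name is hypothetical.

```python
import re
from collections import Counter

def frequency_and_range(texts):
    """Return (frequency, range) counters for a collection of texts.

    frequency[word] = total occurrences across all texts
    range[word]     = number of different texts the word occurs in
    """
    frequency = Counter()
    text_range = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        frequency.update(tokens)
        text_range.update(set(tokens))  # count each word once per text
    return frequency, text_range

texts = ["the cat and the dog", "the bird", "a cat"]
frequency, text_range = frequency_and_range(texts)
print(frequency["the"], text_range["the"])  # 3 occurrences, in 2 texts
```

A word with high frequency but low range (concentrated in one or two texts) is usually a weaker candidate than its raw count suggests.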
Selection Criteria II
Learnability: how easy is the word to learn? Related words may be easier
Regularity: regular forms are easier than irregular forms, but some derivatives operate differently within a family (e.g. excuse, inexcusable)
Coverage: it is not efficient to be able to express the same idea in different ways; it is more efficient to learn a word that covers a quite different idea
Stylistic level and emotional words: West saw second language learners as initially needing neutral vocabulary
Intuition: how well does it match the teacher’s sense of what to include?
Which of these would you put in your list?
out of
per cent
such as
of course
for example
in front of
all right
as soon as
in general
in addition to
next to
on top of
instead of
in charge of
just about
provided that
as good as
with a view to
in between
by and large
at random
per se
old fashioned
grown up
matter of fact
sq m
fait accompli
straight forward
habeas corpus
self-same
haute cuisine
a good deal
laissez faire
persona non grata
How frequently do lexical phrases occur (BNC)?
Raw rank   Word                 Per million words
177        out of               490
222        per cent             382
272        such as              321
285        of course            309
378        for example          238
1538       in front of          65
1725       all right            58
2159       as soon as           47
2491       in general           41
2970       in addition to       34
3307       next to              30
3755       on top of            26
4378       instead of           21
5409       in charge of         17
5987       just about           15
7396       provided that        11
7885       as good as           10
9125       with a view to       8
11459      in between           6
13507      by and large         5
14369      at random            4
16684      per se               4
19505      old fashioned        3
22060      grown up             2
28441      matter of fact       2
43572      sq m                 1
48241      fait accompli        1
51717      straight forward     1
58511      habeas corpus        1
74321      self-same            0
76170      haute cuisine        0
82928      a good deal          0
83882      laissez faire        0
89371      persona non grata    0
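A per-million figure like those above is the number of occurrences divided by the corpus size in tokens, multiplied by 1,000,000. A minimal sketch for multi-word phrases, assuming the corpus is one plain string and a simple letters-only tokenizer (the function name is hypothetical, and this is not how the BNC figures themselves were produced):

```python
import re

def per_million(phrase, corpus_text):
    """Occurrences of a phrase per million tokens of corpus_text."""
    tokens = re.findall(r"[a-z]+", corpus_text.lower())
    # Tokenize the phrase the same way so "self-same" matches "self same".
    target = re.findall(r"[a-z]+", phrase.lower())
    if not tokens or not target:
        return 0.0
    n = len(target)
    hits = sum(1 for i in range(len(tokens) - n + 1)
               if tokens[i:i + n] == target)
    return hits * 1_000_000 / len(tokens)

corpus = "of course he ran out of the room of course"
print(per_million("of course", corpus))
print(per_million("out of", corpus))
```

On a ten-token toy corpus the numbers are meaningless in themselves; the point is only the calculation, which scales to a real corpus unchanged.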
Selection criteria – a new headword or in the family?
• Only mega-headwords (that cover all meaning senses)
• Inflections only? – Plurals, verb forms, -er/-est adjectives. Keep them all together? If not, where do low frequency derivatives go?
– USE uses using used user users useful useless usefulness usefully usable misused misuse misusing misuses misuser misusers uselessness uselessly unused usability reuse reuses reused reusing unusable
• Derivatives in the family or as a new headword?
– interest, interesting, interested, disinterested, interestingly
• Homographs with different meaning senses – book, bank, bat, bill
• Nuances – a brain, to brain someone
• Phrasal verbs – bring down, bring back, bring up, bring over
• Compound words – handbag, policeman, airflow, birdwatching
• Multi-word units? – traffic light, lunch box, all right, by and large
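Whichever way these decisions go, they end up encoded as a mapping from each word form to the headword it counts under. A minimal sketch, assuming a hypothetical `families` dictionary of headword to member forms (here a shortened version of the USE family):

```python
from collections import Counter

def build_family_index(families):
    """Map every member form (and the headword itself) to its headword."""
    index = {}
    for headword, members in families.items():
        index[headword] = headword
        for member in members:
            index[member] = headword
    return index

def count_by_family(tokens, index):
    """Count tokens under their headword; unknown forms count as themselves."""
    return Counter(index.get(t.lower(), t.lower()) for t in tokens)

families = {"use": ["uses", "using", "used", "user", "useful", "misuse", "reuse"]}
index = build_family_index(families)
print(count_by_family(["Use", "used", "useful", "reuse", "bank"], index))
```

Moving a derivative such as "useful" out of the family and into its own headword is then a one-line change to the data, which makes it easy to compare the two policies on the same corpus.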
Selection – where to put derivatives?
Level 1: A different form is a different word. Capitalization is ignored.
Level 2: Regularly inflected words are part of the same family. The inflectional categories are: plural; third person singular present tense; past tense; past participle; -ing; comparative; superlative; possessive.
Level 3: -able, -er, -ish, -less, -ly, -ness, -th, -y, non-, un-, all with restricted uses.
Level 4: -al, -ation, -ess, -ful, -ism, -ist, -ity, -ize, -ment, -ous, in-, all with restricted uses.
Level 5: -age (leakage), -al (arrival), -ally (idiotically), -an (American), -ance (clearance), -ant (consultant), -ary (revolutionary), -atory (confirmatory), -dom (kingdom; officialdom), -eer (black marketeer), -en (wooden), -en (widen), -ence (emergence), -ent (absorbent), -ery (bakery; trickery), -ese (Japanese; officialese), -esque (picturesque), -ette (usherette; roomette), -hood (childhood), -i (Israeli), -ian (phonetician; Johnsonian), -ite (Paisleyite; also chemical meaning), -let (coverlet), -ling (duckling), -ly (leisurely), -most (topmost), -ory (contradictory), -ship (studentship), -ward (homeward), -ways (crossways), -wise (endwise; discussion-wise), ante- (anteroom), anti- (anti-inflation), arch- (archbishop), bi- (biplane), circum- (circumnavigate), counter- (counter-attack), en- (encage; enslave), ex- (ex-president), fore- (forename), hyper- (hyperactive), inter- (inter-African, interweave), mid- (mid-week), mis- (misfit), neo- (neo-colonialism), post- (post-date), pro- (pro-British), semi- (semi-automatic), sub- (subclassify; subterranean), un- (untie; unburden).
Level 6: -able, -ee, -ic, -ify, -ion, -ist, -ition, -ive, -th, -y, pre-, re-.
Level 7: Classical roots and affixes.
Selection Criteria - How will you deal with … I
• Proper nouns: SONY, Dave, Jackson, Thomson, Paris, London
• Proper nouns that are words – Bell, Sue, Jack, Nation, Mark
• Numbers: 1, one, thirty, twenty-seven, thousand, billion
• Acronyms – NATO, DNA, UN, NSA, DARPA
• Dialectal differences (e.g. US vs UK spelling)
• Multi-word units – post office, train station, city hall
• Closed lexical sets such as days of the week, months etc.
• Typos – mispelings, heros, amatur, arguement, bellweather
• Incomplete words – travelin’, roarin’, ‘cept
• Slang forms – gonna, wanna, nuffink, wassup
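Some of these categories can be flagged mechanically before a human makes the final call. The sketch below is a rough first pass only; the category names, the `NUMBER_WORDS` subset and the rules are all illustrative assumptions.

```python
import re

# Illustrative subset only; a real list would cover all number words.
NUMBER_WORDS = {"one", "seven", "twenty", "thirty", "thousand", "billion"}

def classify_token(token):
    """Return a rough category for one token, for later manual review."""
    parts = token.lower().split("-")
    if re.fullmatch(r"\d+(?:[.,]\d+)*", token) or all(p in NUMBER_WORDS for p in parts):
        return "number"
    if re.fullmatch(r"[A-Z]{2,}", token):
        return "acronym"          # also catches all-caps names such as SONY
    if token.startswith(("'", "‘", "’")) or token.endswith(("'", "’")):
        return "incomplete"       # clipped forms such as travelin’ or ‘cept
    if token[:1].isupper():
        return "capitalized"      # proper noun OR a sentence-initial word
    return "word"

for t in ["1", "twenty-seven", "NATO", "travelin’", "Paris", "gonna"]:
    print(t, classify_token(t))
```

Rules like these cannot separate "Bell" the name from "bell" the word, or slang and typos from ordinary words; those still need a reference list or a human decision.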
Selection Criteria - How will you deal with … II
• Offensive words – pooh, shit, crap, bugger, bastard, fart
• Culturally loaded words – temple vs. church, hijab, sporran
• Non-pc words – stewardess, waitress, negro, retarded, stupid
• NCLB words – beer, alcohol, drugs, tobacco, smoking
• Archaic words – thou, thee, thine, groovy, gay
• Prototypical sets – words often taught in sets
– foods – pizza, apple, cake, bread, salt, tomato, zucchini, eggplant, capsicum
– drinks – coffee, tea, juice, water, cola, mojito, screwdriver, bloody Mary
– buildings – office, station, hotel, city hall, auditorium, ice rink
– shops – supermarket, mall, barber, stationer, grocer
– colors – red, blue, green, yellow, pink, violet, scarlet, puce
Definitions - What aspects of word knowledge to include?
• Definition
• POS – how detailed do you want to be?
• Translations – how will you deal with translators who disagree?
• Example sentence – authentic, contrived?
• Usage notes – which ones?
• Synonyms
• Antonyms
• Distractors? (for online test auto-create software)
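The aspects listed above map naturally onto one record per headword. A minimal sketch of such a record, with hypothetical field names and an assumed choice of which fields count as required:

```python
from dataclasses import dataclass, field

@dataclass
class Entry:
    """One wordlist entry; the field names are illustrative assumptions."""
    headword: str
    pos: str = ""
    definition: str = ""
    example: str = ""
    translations: dict = field(default_factory=dict)  # language code -> translation
    usage_notes: list = field(default_factory=list)
    synonyms: list = field(default_factory=list)
    antonyms: list = field(default_factory=list)
    distractors: list = field(default_factory=list)   # for auto-generated tests

    def missing_fields(self):
        """Names of the (assumed) required text fields that are still empty."""
        return [name for name in ("pos", "definition", "example")
                if not getattr(self, name)]

apple = Entry("apple", pos="noun", definition="hard red or green fruit")
print(apple.missing_fields())
```

Deciding the fields up front matters because adding one later means revisiting every entry already written.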
Definitions - style
What style? e.g. Apple
synonym: fruit
short definition: hard red or green fruit
long definition: the fleshy usually rounded red, yellow or green edible fruit of a usually cultivated tree (genus Malus) of the rose family
Use of a defining vocabulary list? Which one? Which words?
Mechanics
• Word? Excel?
• Specialized database software such as Access or Filemaker?
• Versions. Is it important to know which version of your
wordlist was given to which users?
• Do you have the time and patience?
• SERIOUSLY. Do you have the time and patience?
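On the versions question, one lightweight option is to stamp each released list with a fingerprint of its contents, so any copy can later be matched to a release. A minimal sketch using a SHA-256 hash of the normalized words; the function name and the 12-character truncation are arbitrary choices made here:

```python
import hashlib

def wordlist_fingerprint(words):
    """Short, order-independent fingerprint of a list's contents."""
    # Normalize so that case, stray spaces and ordering do not change the stamp.
    normalized = "\n".join(sorted(w.strip().lower() for w in words))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()[:12]

v1 = wordlist_fingerprint(["apple", "bank"])
v2 = wordlist_fingerprint(["apple", "bank", "bat"])
print(v1 != v2)  # any added or removed word changes the fingerprint
```

This identifies a version but does not record who received it; that still needs a log kept alongside the list.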
Validating your wordlist
• How will you evaluate the list’s integrity?
• How will you check if you missed words?
• How will you check mis-levelled words?
• How will you check consistency of definitions, examples, translations?
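A few of these checks can be automated. The sketch below assumes each entry is a hypothetical `(word, level, definition)` tuple and reports three kinds of problem: a missing definition, a word listed at more than one level, and a word duplicated within a level.

```python
from collections import defaultdict

def validate(entries):
    """Return a list of problem descriptions for (word, level, definition) tuples."""
    levels_seen = defaultdict(list)
    problems = []
    for word, level, definition in entries:
        levels_seen[word.lower()].append(level)
        if not definition.strip():
            problems.append(f"{word}: missing definition")
    for word, levels in levels_seen.items():
        if len(set(levels)) > 1:
            problems.append(f"{word}: appears at levels {sorted(set(levels))}")
        elif len(levels) > 1:
            problems.append(f"{word}: duplicated at level {levels[0]}")
    return problems

entries = [("apple", 1, "fruit"), ("Apple", 2, "fruit"),
           ("bank", 1, ""), ("bat", 1, "animal"), ("bat", 1, "animal")]
for problem in validate(entries):
    print(problem)
```

Checks for missed words or mis-levelled words need outside evidence, such as running the list against a fresh corpus, and are not covered by this kind of internal consistency pass.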
And soooooo much more!
Questions?
If you want help [email protected]