Resources for Using Corpus
Linguistics in ELT
Kenji Kitao
Doshisha University
Kyoto, Japan
S. Kathleen Kitao
Doshisha Women’s College
Kyoto, Japan
I. Presentation
A. Corpus linguistics and corpus-related
resources
B. Online resources for corpus linguistics
1. Types of resources
2. Examples of resources
C. Using corpus-related resources for
language teaching
II. Application
A. Assigned tasks
B. Free exploration
Presentation
Definitions
Corpus (Latin for “body”)
A text or collection of texts
Now generally used to refer to machine-readable texts
Corpus linguistics
the use of the empirical data from a corpus to
study language usage and to find patterns of
language usage by analyzing actual language
use
Requirements
A corpus
Can be a single text or a large collection of
texts
Larger corpora provide more reliable results, if
the purpose is making generalizations about
language use
Balanced corpora
A variety of genres, including academic writing,
newspapers, fiction, and spoken language
Specialized corpora
Examples
Academic writing
Texts by learners of English, sometimes with a
specific native language
Teachers can develop their own corpora
Newspaper articles
Learners’ texts
Corpus analysis tool(s)
Types
General
Tools with specific corpora
Tools that can be used with any text or collection of texts
Word, Excel, etc.
Specialized
Count words
Find examples of specific words or parts of speech
Analyze word frequencies
Evaluate readability
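As a rough illustration of what counting words and analyzing word frequencies involves, here is a minimal Python sketch. The tokenizer (lowercased runs of letters and apostrophes) and the sample sentence are assumptions made for this example; real tools handle hyphens, numbers, and other languages more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and return its word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Return a Counter mapping each word to its number of occurrences."""
    return Counter(tokenize(text))

# Illustrative sample text, not taken from any corpus.
sample = "The cat sat on the mat. The dog sat too."
freqs = word_frequencies(sample)
total = sum(freqs.values())

# Print the three most frequent words with raw counts and percentages.
for word, count in freqs.most_common(3):
    print(f"{word}\t{count}\t{100 * count / total:.1f}%")
```

The same function can be run on any text a teacher pastes in, which is roughly what the general-purpose tools described here do on a larger scale.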
Online Corpora
Free to all users
Available for a fee or for purchase
Available only to restricted users
In this presentation, we will only
introduce resources that are free.
Using Corpus Linguistics for Language
Teaching
Technology has become widespread and
accessible
Larger, more powerful computers that can analyze
large amounts of data quickly are available
Many corpus-related resources have become
available
Language teachers and learners can use corpora
Corpus-related Internet resources
1. General resources on corpus linguistics
2. Vocabulary frequency lists and
frequency level checkers
3. Online corpora, concordancers and other
text-analysis software
4. E-texts
5. Information about using corpus
linguistics for language teaching
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
1. General resources on corpus
linguistics
Web sites that help orient users to corpora
and to what is available online for teachers
to use in the classroom or in preparing
material
The Compleat Lexical Tutor
http://www.lextutor.ca/
Resources for data-driven learning, including
concordancers for various corpora and concordancers
into which users can enter their own texts
Tutorials, resources for teachers, resources for
research
Bookmarks for Corpus Linguists
http://devoted.to/corpora/
Extensive annotated list of links related to corpus
linguistics, including
software
tools
frequency lists
papers and articles
English and non-English corpora
2. Vocabulary frequency lists, frequency level
checkers, and n-gram extractors
Frequency lists
Words used most frequently in English and thus words
that are most useful for students to know
Often divided into sublists
Specialized word lists
Academic Word List
http://www.nottingham.ac.uk/~alzsh3/acvocab/index.htm
List includes 570 headwords with their word families
Site includes an explanation of the word lists, the
words in each sublist, suggestions for using the list,
and a gapmaker that can be used to produce gap-filling exercises
5000 Vocabulary List for Visiting Scholars
in the USA
http://www.paulnoll.com/Books/5000Words/index.html
This is a list of 5,000 words selected by
the Chinese Academy of Sciences for scholars
who need to go abroad for research or
advanced studies in the USA. The words are listed in
alphabetical order with sample sentences
and examples. There is also a list of an additional three
thousand words.
Frequency-level checkers
Produce a list of the words in a text at each level of
difficulty
Help a teacher understand how difficult the
vocabulary in a reading passage is and which
words students at different levels of proficiency
might need to learn
N-gram finders
Find groups of n words
JACET 8000 Word List
http://www01.tcp-ip.or.jp/~shin/j8web/j8web.cgi
On this web page, you can enter a text and get a list
of the words that appear in the text at each of the
eight levels of the JACET list. You also get statistics
about what percentage of the words (both types and
tokens) occur at each of the eight levels.
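The kind of type and token statistics described above can be sketched in a few lines of Python. This is only an illustration: the two tiny word lists in `LEVELS` are hypothetical stand-ins, not the JACET lists, and the function name `level_profile` is invented for this example.

```python
import re
from collections import Counter

# Hypothetical miniature level lists; a real checker would load the published lists.
LEVELS = {
    1: {"the", "a", "is", "on", "cat", "sat"},
    2: {"mat", "quiet"},
}

def level_profile(text, levels=LEVELS):
    """Return, for each level, the percentage of tokens and of types found there."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {}
    types = set(tokens)

    def level_of(word):
        # Check the lowest level first; words on no list are "off-list".
        for level, words in sorted(levels.items()):
            if word in words:
                return level
        return "off-list"

    token_counts = Counter(level_of(t) for t in tokens)
    type_counts = Counter(level_of(t) for t in types)
    return {
        level: {
            "token_pct": 100 * token_counts[level] / len(tokens),
            "type_pct": 100 * type_counts[level] / len(types),
        }
        for level in token_counts
    }

profile = level_profile("The cat sat on the quiet mat with a zebra")
for level, stats in profile.items():
    print(level, f"{stats['token_pct']:.1f}% of tokens", f"{stats['type_pct']:.1f}% of types")
```

Tokens count every occurrence of a word, while types count each distinct word once, which is why the two percentages can differ for the same level.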
N-gram finders
Online text analysis tool
http://www.online-utility.org/text/analyzer.jsp
Finds the most frequent groups of 2 and 3 words,
and produces a list of all the words, their
occurrences, and their percentages
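A minimal sketch of how groups of 2 and 3 words might be counted, assuming a simple lowercase tokenizer; the function name `ngram_counts` and the sample text are made up for this example and do not reflect how the site above is implemented.

```python
import re
from collections import Counter

def ngram_counts(text, n):
    """Count every run of n consecutive words in the text."""
    tokens = re.findall(r"[a-z']+", text.lower())
    # zip over n staggered copies of the token list to form the n-grams.
    # Note: punctuation is ignored, so n-grams may span sentence boundaries.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

text = "on the other hand, on the one hand, on the whole"
bigrams = ngram_counts(text, 2)
trigrams = ngram_counts(text, 3)

print(bigrams.most_common(2))
print(trigrams.most_common(1))
```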
Advanced Search – Explore N-grams from
the BNC
http://pie.usna.edu/explore.html
Produces lists of n-grams, based on the
number of words and occurrences you specify
N-gram phrase extractor
http://www.er.uqam.ca/nobel/r21270/cgibin/tuples/u_extract.html
Produces a KWIC list of n-grams
3. Online corpora, concordancers, and other
text-analysis software
Concordancers
A type of software for searching corpora
Produces a list of key words in context (KWIC), that is,
search terms with the words that come before and after
them.
May be able to search for parts of speech, e.g., take,
followed by a preposition
May be able to search for two words that are not next to
each other
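To make the idea of a KWIC list concrete, here is a small Python sketch of a concordancer for plain text. It is an illustration under simple assumptions (whole-word, case-insensitive matching and a fixed number of context characters); the function name `kwic` and the sample sentence are invented for this example.

```python
import re

def kwic(text, keyword, width=30):
    """Return key-word-in-context lines: left context, [keyword], right context."""
    text = " ".join(text.split())  # collapse line breaks and repeated spaces
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    lines = []
    for match in pattern.finditer(text):
        left = text[max(0, match.start() - width):match.start()]
        right = text[match.end():match.end() + width]
        # Pad both sides so that the keywords line up in a column.
        lines.append(f"{left:>{width}} [{match.group()}] {right:<{width}}")
    return lines

sample = "We took a trip to Kyoto. The trip was long, but the trip home was short."
for line in kwic(sample, "trip", width=20):
    print(line)
```

Aligning the search term in a central column is what lets learners scan the words that come before and after it.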
Corpora (or parts of corpora) may have spoken
language, written language, American English,
British English, academic English, and so on.
Specialized corpora include:
parallel corpora, which have the same texts in different
languages (to compare the same passages in different
languages)
learner corpora, which have students’ writing/
speaking (to help identify learners’ problems or to
study characteristics of their writing)
Examples of concordancers
Turbo Lingo
http://www.staff.amu.edu.pl/~sipkadan/lingo.htm
Can enter a text or URL and get a
KWIC list, average sentence length, a word
frequency list, and other analyses
VIEW (Variation in English Words and
Phrases)
http://view.byu.edu/
Concordancing tool for the British National
Corpus, the Corpus of Contemporary
American English, and a Time magazine
corpus, plus non-English corpora
A powerful concordancing tool
Has a useful tutorial
Click on what you want to do to see samples of
searches
For example, if you want to learn to use wildcards,
click on that word, and you will see several examples.
You choose the type of search you want to do, and
the search is automatically filled in. You can revise it
based on what you want to do.
Types of searches
Search by exact word, exact phrase,
wildcard, or part of speech
Use ? or * as a wildcard
For example, mysterious
For example, * point *
Search for an exact word plus a part of
speech
For example, white [n*]
Compare usage of semantically related
words
{sheer/total} [n*]
Search for surrounding words
Nouns that follow the verb “wrap”
Limit the search to one register
Adjectives in tabloid newspapers
Compare usage between registers, e.g.,
news and speaking
we [verb] that: ACAD vs SPOKEN
Find words with similar, more general,
or more specific meanings
Similar words to “small”
More general than “shriek”
More specific than “woman”
BNCweb
To log in, go to:
http://bncweb.lancs.ac.uk/bncwebSignup/
For information, go to:
http://bncweb.info
On BNCweb, you can do simple searches, or
you can restrict your search to written or
spoken texts or to a particular type of text.
Form your own subcorpora.
Make frequency lists based on criteria you
specify
For example, make a frequency list of all
adverbs that end in –ly in spoken texts.
Look at your query history and save
queries to use again.
See your results in a sentence view or a
KWIC view.
Get a list of collocates, with statistics about
their frequency.
Get information about what type of texts
the search term was found in.
Online concordancer
http://www.lextutor.ca/concordancers/concord_e.html
Can search a variety of corpora, including
the Brown Corpus, the British National
Corpus (written and spoken), a learner
corpus, etc.
Produces a KWIC list for a given word and
a list of collocates and their frequency
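A list of collocates and their frequencies can be approximated by counting the words that occur within a few words of the search term. The Python sketch below assumes a simple tokenizer and a fixed window; the function name `collocates` and the sample text are hypothetical, and real concordancers typically report statistical association measures in addition to raw counts.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count the words that occur within `span` words of each occurrence of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Words to the left and right of the node word, excluding the node itself.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = "Heavy rain fell. Light rain fell later. Heavy rain is common."
result = collocates(text, "rain", span=1)
print(result.most_common())
```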
WebCorp
http://www.webcorp.org.uk/
Uses the Internet as a corpus and
produces KWIC as well as providing other
information
Comparing two texts
Text Lex Compare
http://www.lextutor.ca/text_lex_compare/
Allows users to enter two texts and get lists of:
Words unique to the first text
Words shared by the two texts
Words unique to the second text
Useful for helping teachers find the new words in a new text
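The three lists produced by this kind of comparison correspond to simple set operations on the word types of each text. A minimal Python sketch, with an assumed tokenizer and invented sample sentences (this is not the code behind Text Lex Compare):

```python
import re

def word_types(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z']+", text.lower()))

def compare_texts(first, second):
    """List the words unique to each text and the words they share."""
    a, b = word_types(first), word_types(second)
    return {
        "unique_to_first": sorted(a - b),
        "shared": sorted(a & b),
        "unique_to_second": sorted(b - a),
    }

result = compare_texts("The cat sat on the mat", "The dog sat on the rug")
for label, words in result.items():
    print(label, words)
```

If the first text is one the class has already read, the words unique to the second text are candidates for pre-teaching.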
Specialized corpora (a few examples)
Spoken English
Corpus swb (American English telephone
conversations)
http://www.ldc.upenn.edu/cgibin/lol/swb/speechcorpus?&corpus=swb
Technical English
e-Xplore Technical English
https://learn.sz.htwk-leipzig.de/wc/main.php
Parallel corpora
CRATER Multilingual Aligned Annotated Corpus
http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
Academic English
Michigan Corpus of Academic Spoken English
http://quod.lib.umich.edu/m/micase/
Some large corpora also have sub-corpora of
academic English
Online software to assess readability
Tests of document readability and
suggestions how to improve readability
http://www.online-utility.org/english/readability_test_and_improve.jsp
Can analyze texts of any length (some
online text analysis programs have limits)
Can enter the text directly or enter a URL
e.g.,
http://www.cis.doshisha.ac.jp/kkitao/Japan/shimoda/s1.htm
Provides statistics:
Number of characters
Number of words
Number of sentences
Number of syllables/word
Number of words/sentence
Calculates readability indexes,
including
Gunning Fog Index
Coleman-Liau Index
Flesch-Kincaid Grade Level
Flesch Reading Ease
Lists sentences that might be rewritten
to improve readability.
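The readability indexes listed above are computed from the same basic counts. The Python sketch below uses the commonly published formulas and takes the counts as arguments, so it makes no claim about how the site above counts syllables or sentences; the function names and the sample counts are illustrative only.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Higher scores indicate easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words, sentences, syllables):
    """Approximate U.S. school grade level."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

def gunning_fog(words, sentences, complex_words):
    """complex_words: number of words with three or more syllables."""
    return 0.4 * ((words / sentences) + 100 * (complex_words / words))

def coleman_liau(letters, words, sentences):
    """Uses letters and sentences per 100 words instead of syllables."""
    letters_per_100 = 100 * letters / words
    sentences_per_100 = 100 * sentences / words
    return 0.0588 * letters_per_100 - 0.296 * sentences_per_100 - 15.8

# Made-up counts for a 100-word passage, for illustration only.
counts = {"words": 100, "sentences": 5, "syllables": 150}
ease = flesch_reading_ease(**counts)
grade = flesch_kincaid_grade(**counts)
fog = gunning_fog(words=100, sentences=5, complex_words=10)
cli = coleman_liau(letters=450, words=100, sentences=5)
print(f"Reading ease {ease:.1f}, grade {grade:.1f}, fog {fog:.1f}, Coleman-Liau {cli:.1f}")
```

All four indexes reward shorter sentences and shorter words, which is why the tool flags long sentences as candidates for rewriting.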
4. E-texts
In some cases, teachers or students may
want to develop their own corpora. There
are large numbers of e-texts available.
Project Gutenberg
http://www.gutenberg.org/wiki/Main_Page
Large collection of downloadable fiction and nonfiction
Internet Public Library: Online Texts
http://www.ipl.org/div/subject/browse/hum60.60.00/
A large number of online texts on a wide variety of
subjects
Drew’s Script-o-Rama
http://www.script-o-rama.com/oldindex.shtml
A website with a large number of scripts of movies
and TV programs
American Rhetoric Online Speech Bank
http://www.americanrhetoric.com/speechbank.htm
A website with a large collection of speeches
5. Information about using corpus
linguistics for language teaching
Corpus-related websites specifically for
language teachers
Learner corpora and SLA Research
http://leo.meikai.ac.jp/%7Etono/
Links to learner corpora made up of language
produced by speakers of various languages,
links to useful tools, a bibliography, and so on
Corpus linguistics: What it is and how
it can be applied to teaching
http://iteslj.org/Articles/Krieger-Corpus.html
An article about corpus linguistics and how it
can be used in the language classroom
Classroom Application
Two types of uses of corpus-related
resources
“Low contact” uses – teacher uses resources to help in
teaching, e.g., to find the difficult words in a reading
passage; students do not actually see the corpus
“High contact” uses – students use the corpora themselves
to learn about language, e.g., to find out which adjectives
collocate with “rain”
“Data-driven learning” is a high contact
use of corpus-related resources.
Using corpora to deduce rules of grammar or
usage, e.g., to determine if a word’s
connotation is positive or negative
Advantages of data-driven learning
Focus on authentic language
Encouragement of students to deduce rules themselves
Real, exploratory activities rather than drills
A learner-centered activity
Web sites with suggestions for data-driven learning activities
How to use concordances in teaching
English: Some suggestions
http://www.nsknet.or.jp/%7Epeterrs/concordancing/usingconcs.html
Data-Driven Learning (DDL): the idea
http://www.ecml.at/projects/voll/rationale_and_help/booklets/resources/menu_booklet_ddl.htm
An explanation of DDL, with examples
Activities
Use a corpus to check grammar
http://www.lextutor.ca/grammar_tester/
Use the concordancer in the bottom frame
to check the grammar of the sample
sentences in the top frame
Use a concordancer to make a gap-filler
or a quiz
http://www.lextutor.ca/multi_conc/
http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm
Find examples of a word and group
them according to meaning
Examples
(http://www.lextutor.ca/concordancers/concord_e.html)
party
run
Use the results of a KWIC search to
determine how synonyms are used
differently
Examples
http://www.lextutor.ca/concordancers/concord_e.html
travel, journey, trip, voyage, tour
confident, fearless, pushy, upbeat, self-reliant
Use the academic word list web page
and enter a text and make a gap-filling
activity
http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm
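The basic idea behind a gap-filling activity can be sketched in Python: replace each target word in a text with a blank and keep the removed words as an answer key. The function name `make_gap_fill`, the target words, and the sample sentence are assumptions for this example; the sketch matches exact word forms only and does not handle word families as a full word-list tool would.

```python
import re

def make_gap_fill(text, targets, blank="_____"):
    """Replace target words with blanks; return the gapped text and the answer key."""
    targets = {t.lower() for t in targets}
    answers = []

    def replace(match):
        word = match.group()
        if word.lower() in targets:
            answers.append(word)  # keep the answers in order of appearance
            return blank
        return word

    gapped = re.sub(r"[A-Za-z']+", replace, text)
    return gapped, answers

# Illustrative target list chosen for this example.
gapped, answers = make_gap_fill(
    "Researchers analyze data to establish a theory.",
    ["analyze", "data", "theory"],
)
print(gapped)
print("Answers:", ", ".join(answers))
```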
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
Thank you