Resources for Using Corpus
Linguistics in ELT
Kenji Kitao
Doshisha University
Kyoto, Japan
S. Kathleen Kitao
Doshisha Women’s College
Kyoto, Japan

I. Presentation


A. Corpus linguistics and corpus-related
resources
B. Online resources for corpus linguistics



1. Types of resources
2. Examples of resources
C. Using corpus-related resources for
language teaching

II. Application


A. Assigned tasks
B. Free exploration
Presentation

Definitions

Corpus (Latin for “body”)


A text or collection of texts
Now generally used to refer to machine-readable texts

Corpus linguistics

the use of the empirical data from a corpus to
study language usage and to find patterns of
language usage by analyzing actual language
use

Requirements

A corpus


Can be a single text or a large collection of
texts
Larger corpora provide more reliable results, if
the purpose is making generalizations about
language use

Balanced corpora

A variety of genres, including academic writing,
newspapers, fiction, and spoken language

Specialized corpora


Examples
 Academic writing
 Texts by learners of English, sometimes with a
specific native language
Teachers can develop their own corpora
 Newspaper articles
 Learners’ texts

Corpus analysis tool(s)

Types



General


Tools with specific corpora
Tools that can be used with any text or collection of texts
Word, Excel, etc.
Specialized




Count words
Find examples of specific words or parts of speech
Analyze word frequencies
Evaluate readability
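
As an illustration of the "count words" and "analyze word frequencies" functions listed above, here is a minimal sketch in Python. The tokenizer (lowercase words, optional apostrophe) and the names `tokenize` and `frequency_list` are assumptions made for this example, not the behavior of any tool named in this presentation.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def frequency_list(text):
    """Return (word, count, percentage) rows, most frequent first."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens)
    return [(w, c, round(100 * c / total, 1)) for w, c in counts.most_common()]

sample = "The cat sat on the mat. The dog sat too."
rows = frequency_list(sample)
print(len(tokenize(sample)), "tokens,", len(rows), "types")
for word, count, pct in rows[:3]:
    print(word, count, pct)
```

Tokens are running words and types are distinct words, so a text always has at least as many tokens as types.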

Online Corpora




Free to all users
Available for a fee or for purchase
Available only to restricted users
In this presentation, we will only
introduce resources that are free.

Using Corpus Linguistics for Language
Teaching




Technology has become widespread and
accessible
Larger, more powerful computers that can analyze
large amounts of data quickly are available
Many corpus-related resources have become
available
Language teachers and learners can use corpora

Corpus-related Internet resources





1. General resources on corpus linguistics
2. Vocabulary frequency lists and
frequency level checkers
3. Online corpora, concordancers and other
text-analysis software
4. E-texts
5. Information about using corpus
linguistics for language teaching
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm

1. General resources on corpus
linguistics

Web sites that help orient users to corpora
and to what is available online for teachers
to use in the classroom or in preparing
material

The Compleat Lexical Tutor

http://www.lextutor.ca/


Resources for data-driven learning, including
concordancers for various corpora and concordancers
into which users can enter their own texts
Tutorials, resources for teachers, resources for
research

Bookmarks for Corpus Linguists

http://devoted.to/corpora/

extensive annotated list of links related to corpus
linguistics, including
 software
 tools
 frequency lists
 papers and articles
 English and non-English corpora

2. Vocabulary frequency lists, frequency level
checkers, and n-gram extractors

Frequency lists


Words used most frequently in English and thus words
that are most useful for students to know
Often divided into sublists

Specialized word lists

Academic Word List



http://www.nottingham.ac.uk/~alzsh3/acvocab/index.htm
List includes 570 headwords with their word families
Site includes an explanation of the word lists, the
words in each sublist, suggestions for using the list,
and a gapmaker that can be used to produce gap-filling exercises

5000 Vocabulary List for Visiting Scholars
in the USA


http://www.paulnoll.com/Books/5000Words/index.html
This is a list of 5,000 words selected by
the Chinese Academy of Sciences for scholars
who need to go abroad for research or
advanced studies in the USA. The words are listed in
alphabetical order and have sample sentences
and examples. There are an additional three
thousand words.

Frequency-level checkers



Produces a list of words at each level of
difficulty
Helps a teacher understand how difficult the
vocabulary in the reading passage is and which
words students at different levels of proficiency
might need to learn
N-gram finders

Finds groups of n words
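
To make "groups of n words" concrete, here is a minimal sketch of n-gram counting in Python. The function name `ngrams` and the simple tokenizer are assumptions for this example. Note that this version lets n-grams run across punctuation and sentence boundaries, which a real tool may or may not do.

```python
import re
from collections import Counter

def ngrams(text, n):
    """Count every run of n consecutive words in the text."""
    tokens = re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())
    return Counter(
        " ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )

text = "on the other hand, on the one hand, on the whole"
bigrams = ngrams(text, 2)
trigrams = ngrams(text, 3)
print(bigrams.most_common(2))
print(trigrams["on the other"])
```

Sorting the counts with `most_common()` gives the kind of "most frequent groups of 2 and 3 words" list that n-gram finders produce.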

JACET 8000 Word List


http://www01.tcp-ip.or.jp/~shin/j8web/j8web.cgi
On this web page, you can enter a text and get a list
of the words that appear in the text at each of the
eight levels of the JACET list. You also get statistics
about what percentage of the words (both types and
tokens) occur at each of the eight levels.
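
The idea behind a frequency-level checker can be sketched as follows. The word list `LEVELS` below is invented for this example (a real checker would load a full list such as JACET 8000 from a file), and `level_profile` is a hypothetical name, not part of the web page above.

```python
import re
from collections import defaultdict

# Hypothetical miniature word list: word -> level.
# These entries are invented for illustration only.
LEVELS = {"the": 1, "a": 1, "is": 1, "of": 1, "river": 2, "analysis": 3}

def level_profile(text, levels=LEVELS):
    """Group the words of a text by level and report type and token percentages."""
    tokens = re.findall(r"[a-z]+", text.lower())
    by_level = defaultdict(list)
    for tok in tokens:
        # Words missing from the list are collected under "off-list".
        by_level[levels.get(tok, "off-list")].append(tok)
    all_types = set(tokens)
    profile = {}
    for level, words in by_level.items():
        types = set(words)
        profile[level] = {
            "words": sorted(types),
            "token_pct": round(100 * len(words) / len(tokens), 1),
            "type_pct": round(100 * len(types) / len(all_types), 1),
        }
    return profile

profile = level_profile("The analysis of the river is a puzzle")
for level in sorted(profile, key=str):
    print(level, profile[level])
```

The "off-list" group is often the most useful output for a teacher, since it shows the words that fall outside every level of the list.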

N-gram finders

Online text analysis tool


http://www.online-utility.org/text/analyzer.jsp
Finds the most frequent groups of 2 and 3 words,
and produces a list of all the words, their
occurrences, and their percentages

Advanced Search – Explore N-grams from
the BNC



http://pie.usna.edu/explore.html
Produces lists of n-grams, based on the
number of words and occurrences you specify
N-gram phrase extractor


http://www.er.uqam.ca/nobel/r21270/cgibin/tuples/u_extract.html
Produces KWIC list of n-grams

3. Online corpora, concordancers, and other
text-analysis software

Concordancers




A type of software for searching corpora
Produces a list of key words in context (KWIC), that is,
search terms with the words that come before and after
them.
May be able to search for parts of speech, e.g., take,
followed by a preposition
May be able to search for two words that are not next to
each other
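
A key-word-in-context list can be sketched in a few lines of Python. This is only an illustration of the idea; the function name `kwic` and the `width` parameter (number of context words on each side) are assumptions for this example, and real concordancers offer far more options.

```python
import re

def kwic(text, keyword, width=3):
    """Return (left context, key word, right context) for each hit."""
    tokens = re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

text = ("We had heavy rain on Monday. The rain stopped by noon, "
        "but light rain returned.")
hits = kwic(text, "rain")
for left, key, right in hits:
    # Right-align the left context so the key words line up in a column.
    print(f"{left:>20}  {key}  {right}")
```

Aligning the key word in a single column is what makes patterns to its left and right easy to spot.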


Corpora (or parts of corpora) may have spoken
language, written language, American English,
British English, academic English, and so on.
Specialized corpora include:


parallel corpora, which have the same texts in different
languages (to compare the same passages in different
languages)
learner corpora, which have students’ writing/
speaking (to help identify learners’ problems or to
study characteristics of their writing)


Examples of concordancers
Turbo Lingo


http://www.staff.amu.edu.pl/~sipkadan/lingo.htm
Can enter a text or URL and get a list of
KWIC, average sentence length, word
frequency list, and other analyses

VIEW (Variation in English Words and
Phrases)


http://view.byu.edu/
Concordancing tool for the British National
Corpus, the Corpus of Contemporary
American English, and a Time magazine
corpus, plus non-English corpora


A powerful concordancing tool
Has a useful tutorial

Click on what you want to do to see samples of
searches

For example, if you want to learn to use wildcards,
click on that word, and you will see several examples.
You choose the type of search you want to do, and
the search is automatically filled in. You can revise it
based on what you want to do.

Types of searches

Search by exact word, exact phrase,
wildcard, or part of speech


Use ? or * as a wildcard


For example, mysterious
For example, * point *
Search for an exact word plus a part of
speech

For example, white [n*]

Compare usage of semantically related
words

{sheer/total} [n*]

Search for surrounding words

Nouns that follow the verb “wrap”
Limit the search to one register

Adjectives in tabloid newspapers

Compare usage between registers, e.g.,
news and speaking


we [verb] that: ACAD vs SPOKEN
Find words with similar, more general,
or more specific meanings



Similar words to “small”
More general than “shriek”
More specific than “woman”
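
The wildcard searches described above can be imitated on a small scale with Python's standard `fnmatch` module. This sketch assumes that `*` stands for any number of characters and `?` for exactly one character; the function name `wildcard_search` and the sample sentence are invented for the example, and this is not how the VIEW interface itself is implemented.

```python
import re
from fnmatch import fnmatchcase

def wildcard_search(text, pattern):
    """Return the distinct words in the text that match a wildcard pattern."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sorted({t for t in tokens if fnmatchcase(t, pattern.lower())})

text = ("The mysterious man mysteriously unlocked a mystery box "
        "and then sat, set, and sit were compared.")
star_hits = wildcard_search(text, "myster*")
question_hits = wildcard_search(text, "s?t")
print(star_hits)
print(question_hits)
```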

BNCweb

To log in, go to:


http://bncweb.lancs.ac.uk/bncwebSignup/
For information, go to:

http://bncweb.info


On BNCweb, you can do simple searches and
restrict your search to written or spoken
texts or to a particular type of text.
You can form your own subcorpora.

Make frequency lists based on criteria you
specify


For example, make a frequency list of all
adverbs that end in –ly in spoken texts.
Look at your query history and save
queries to use again.



See your results in a sentence view or a
KWIC view.
Get a list of collocates, with statistics about
their frequency.
Get information about what type of texts
the search term was found in.

Online concordancer



http://www.lextutor.ca/concordancers/concord_e.html
Can search a variety of corpora, including
the Brown Corpus, the British National
Corpus (written and spoken), a learner
corpus, etc.
Produces a KWIC list for a given word and
a list of collocates and their frequency
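
A list of collocates and their frequency can be approximated by counting the words that occur within a few positions of the search word. The sketch below is an assumption-laden illustration: `collocates` and its `span` parameter are hypothetical names, the window ignores sentence boundaries, and raw counts are used rather than the association statistics that some tools report.

```python
import re
from collections import Counter

def collocates(text, node, span=2):
    """Count words occurring within `span` words of each occurrence of `node`."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

text = ("Heavy rain fell all night. Light rain fell at dawn. "
        "Heavy rain is common here.")
top = collocates(text, "rain", span=1)
print(top.most_common())
```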

WebCorp


http://www.webcorp.org.uk/
Uses the Internet as a corpus and
produces KWIC as well as providing other
information

Comparing two texts

Text Lex Compare



http://www.lextutor.ca/text_lex_compare/
Allows users to enter two texts and get lists of:
 Words unique to the first text
 Words shared by the two texts
 Words unique to the second text
Useful for helping a teacher find the new words in a new text
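
The three lists above correspond to simple set operations on the word types of each text. A minimal sketch, with `word_types` and `compare_texts` as hypothetical names chosen for this example:

```python
import re

def word_types(text):
    """Return the set of distinct lowercase words in a text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def compare_texts(first, second):
    """List words unique to each text and words shared by both."""
    a, b = word_types(first), word_types(second)
    return {
        "only_first": sorted(a - b),
        "shared": sorted(a & b),
        "only_second": sorted(b - a),
    }

old = "The fox runs in the forest"
new = "The fox sleeps in a den"
result = compare_texts(old, new)
for name, words in result.items():
    print(name, words)
```

If the first text is material the students have already read, `only_second` is the list of words that are new to them in the second text.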

Specialized corpora (a few examples)

Spoken English


Corpus swb (American English telephone
conversations)
 http://www.ldc.upenn.edu/cgibin/lol/swb/speechcorpus?&corpus=swb
Technical English

e-Xplore Technical English
 https://learn.sz.htwk-leipzig.de/wc/main.php

Parallel corpora


CRATER Multilingual Aligned Annotated Corpus
 http://www.comp.lancs.ac.uk/linguistics/crater/corpus.html
Academic English

Michigan Corpus of American Spoken English
 http://quod.lib.umich.edu/m/micase/
 Some large corpora also have sub-corpora of
academic English

Online software to assess readability

Tests of document readability and
suggestions how to improve readability


http://www.online-utility.org/english/readability_test_and_improve.jsp
Can analyze texts of any length (some
online text analysis programs have limits)

Can enter the text directly or enter a URL


e.g.,
http://www.cis.doshisha.ac.jp/kkitao/Japan/shimoda/s1.htm
Provides statistics:





Number of characters
Number of words
Number of sentences
Number of syllables/word
Number of words/sentence

Calculates readability indexes,
including





Gunning Fog Index
Coleman-Liau Index
Flesch-Kincaid Grade Level
Flesch Reading Ease
Lists sentences that might be rewritten
to improve readability.
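
Two of the indexes above, Flesch Reading Ease and Flesch-Kincaid Grade Level, are computed from words per sentence and syllables per word. The sketch below uses the standard published formulas, but the syllable counter is a rough vowel-group heuristic assumed for this example; the online tool may count syllables and sentences differently and so give somewhat different scores.

```python
import re

def count_syllables(word):
    """Rough heuristic: one syllable per run of vowels (not a dictionary lookup)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text):
    """Return basic counts plus two Flesch readability scores."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    return {
        "words": len(words),
        "sentences": len(sentences),
        "flesch_reading_ease": 206.835 - 1.015 * wps - 84.6 * spw,
        "flesch_kincaid_grade": 0.39 * wps + 11.8 * spw - 15.59,
    }

stats = readability("The cat sat on the mat. It was a good day.")
for name, value in stats.items():
    print(name, value)
```

A higher Reading Ease score means an easier text, while the Grade Level score is meant to approximate a US school grade, so very simple texts can score below zero.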

4. E-texts

In some cases, teachers or students may
want to develop their own corpora. There
are large numbers of e-texts available.

Project Gutenberg


http://www.gutenberg.org/wiki/Main_Page
Large collection of downloadable fiction and nonfiction

Internet Public Library: Online Texts

http://www.ipl.org/div/subject/browse/hum60.60.00/
A large number of online texts on a wide variety of
subjects

Drew’s Script-o-Rama

http://www.script-o-rama.com/oldindex.shtml
A website with a large number of scripts of movies
and TV programs

American Rhetoric Online Speech Bank


http://www.americanrhetoric.com/speechbank.htm
A website with a large collection of speeches

5. Information about using corpus
linguistics for language teaching


Corpus-related websites specifically for
language teachers
Learner corpora and SLA Research


http://leo.meikai.ac.jp/%7Etono/
Links to learner corpora made up of language
produced by speakers of various languages,
links to useful tools, a bibliography, and so on

Corpus linguistics: What it is and how
it can be applied to teaching


http://iteslj.org/Articles/Krieger-Corpus.html
An article about corpus linguistics and how it
can be used in the language classroom
Classroom Application

Two types of uses of corpus-related
resources


“Low contact” uses – teacher uses resources to help in
teaching, e.g., to find the difficult words in a reading
passage; students do not actually see the corpus
“High contact” uses – students use the corpora themselves
to learn about language, e.g., to find out which adjectives
collocate with “rain”

“Data-driven learning” is a high contact
use of corpus-related resources.


Using corpora to deduce rules of grammar or
usage, e.g., to determine if a word’s
connotation is positive or negative
Advantages of data-driven learning




Focus on authentic language
Encouragement of students to deduce
Real, exploratory activities rather than drills
A learner-centered activity

Web sites with suggestions for data-driven learning activities

How to use concordances in teaching
English: Some suggestions

http://www.nsknet.or.jp/%7Epeterrs/concordancing/usingconcs.html

Data-Driven Learning (DDL): the idea


http://www.ecml.at/projects/voll/rationale_and_help/booklets/resources/menu_booklet_ddl.htm
An explanation of DDL, with examples
Activities

Use a corpus to check grammar


http://www.lextutor.ca/grammar_tester/
Use the concordancer in the bottom frame
to check the grammar of the sample
sentences in the top half

Use a concordancer to make a gap-filler
or a quiz


http://www.lextutor.ca/multi_conc/
http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm

Find examples of a word and group
them according to meaning

Examples
(http://www.lextutor.ca/concordancers/concord_e.html)


party
run

Use the results of a KWIC search to
determine how synonyms are used
differently

Examples
http://www.lextutor.ca/concordancers/concord_e.html


travel, journey, trip, voyage, tour
confident, fearless, pushy, upbeat, self-reliant

Use the academic word list web page
and enter a text and make a gap-filling
activity

http://www.nottingham.ac.uk/~alzsh3/acvocab/awlgapmaker.htm
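
The gap-filling activity above can also be sketched offline. In this illustration, every word of a text that appears in a target list is replaced with a numbered blank. The function name `make_gap_fill` and the three target words are assumptions for the example; a teacher would supply a real list, such as a sublist of the Academic Word List.

```python
import re

def make_gap_fill(text, targets):
    """Replace each target word with a numbered blank; return text and answers."""
    targets = {t.lower() for t in targets}
    answers = []

    def blank(match):
        word = match.group(0)
        if word.lower() in targets:
            answers.append(word)
            return f"({len(answers)}) ______"
        return word

    gapped = re.sub(r"[A-Za-z]+", blank, text)
    return gapped, answers

text = "Researchers analyze the data and then interpret the results."
gapped, answers = make_gap_fill(text, ["analyze", "data", "interpret"])
print(gapped)
print("Answers:", ", ".join(answers))
```

This version matches exact word forms only, so inflected forms such as "analyzed" would need to be added to the target list separately.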
Resources for Corpus Linguistics
http://www.cis.doshisha.ac.jp/kkitao/library/resource/corpus/corpus.htm
Thank you