Show Me the Corpus - Western Oregon University
Habeas Corpus
in Your Classroom
An InterACTIVE Workshop
Dr. Rob Troyer
Western Oregon University
IALLT
June 11, 2013
Outline
• Introduction
• What’s a Corpus?
• What have we learned from corpus studies?
• What can I do to and with a corpus?
• Google me n-grams
• Learning to play and playing to learn
• COCA me crazy
• Using COCA
• Designing corpus-based lessons
• Activity
• getting on MICASE (and MICUSP)
• Activity
• Frankencorpus
• AntConc and free PoS tagging
• Activity
• Conclusion
What’s a corpus?
• a body of text
– referring to “a corpus” is like referring to “a dictionary”
• SEU (Survey of English Usage), University College London
– began 1955 on note-cards, completed by 1985 on computer
1 million words, 200 texts, 5,000 words each,
written + spoken (mono and dialogue)
Comprehensive Grammar of the English Language (1985)
Svartvik, Crystal, Greenbaum, Leech, and Quirk.
Types of Corpora
• University produced, for purchase only
• University produced, freely available via www
• Corporate produced for their own use only
• The www is a corpus (& Google n-grams)
• Self-compiled corpora
What can I do to a Corpus?
• annotation (auto, manual, or both)
– metadata
– textual markup
– linguistic annotation
• POS, parsing, semantic
• annotation software (free online)
• CLAWS
• Stanford Parser
– (for purchase) WordSmith, WMATRIX, etc.
What can I do with a Corpus?
• Descriptive statistics (software/interface)
– frequency
– type-token ratio
– keywords
– collocations
– (factor and cluster analysis)
• Concordancing (software/interface)
– KWIC lines
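The descriptive statistics listed above can be tried on any plain-text sample. The following is a minimal sketch, not how any particular concordancer computes them: the tokenizer is a simple assumption (lowercase, letters and apostrophes only), the sample sentence is made up, and the collocation count is a raw window count with no significance test.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes)."""
    return re.findall(r"[a-z']+", text.lower())

def type_token_ratio(tokens):
    """Number of distinct word types divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def collocates(tokens, node, span=1):
    """Count the words found within `span` tokens of each occurrence of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts

# Made-up sample text, for illustration only
sample = "The cat sat on the mat. The dog sat on the cat."
tokens = tokenize(sample)
freq = Counter(tokens)

print(freq.most_common(3))            # raw frequency list
print(type_token_ratio(tokens))       # 6 types / 12 tokens = 0.5
print(collocates(tokens, "sat"))      # neighbours of "sat"
```

Type-token ratio depends heavily on text length, so it is only comparable across samples of similar size.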
What have we learned from Corpus
Studies (a few examples)?
• emergent modals/grammaticalization
– going to, need to, had better (will, must, should)
Leech et al., from 1971 to 2004, continually rev.
• changes to the perfect aspect construction
– ‘be’ as auxiliary with certain verbs where today we
have only ‘have’
What have we learned from Corpus
Studies?
• preferences for grammatical structures
– participle vs. infinitive
• ELL teaching
– recall Doug Biber’s Tri-TESOL presentation
• progressive aspect in conversation
Google me n-grams
Google n-gram viewer
http://books.google.com/ngrams
• Highlights
– Google Books, 20 million books as of Oct 2012
• Includes English, Spanish, French, German, Russian,
Italian, Chinese, and Hebrew
• In 2010, Google estimated that 130 million books have been
printed since Gutenberg (all languages); thus, GB ≈ 15% (20 of 130 million)
• English portion is 500 billion words (500,000,000,000)
– n-gram viewer searches subset of 8 million books
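For readers who want to see what an n-gram count is before using the viewer, here is a minimal sketch. It assumes whitespace tokenization and expresses frequency as a share of all n-grams of the same length in the text; the sample sentence is made up, and this is not a description of Google's actual pipeline.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined into a string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def relative_frequency(tokens, phrase):
    """Share of all same-length n-grams in `tokens` that equal `phrase`."""
    n = len(phrase.split())
    grams = ngrams(tokens, n)
    return Counter(grams)[phrase] / len(grams) if grams else 0.0

# Made-up sample text, for illustration only
tokens = "cold turkey is not the whole hog but cold turkey is cold".split()

print(Counter(ngrams(tokens, 2)).most_common(2))
print(relative_frequency(tokens, "cold turkey"))    # 2 of 11 bigrams
print(relative_frequency(tokens, "the whole hog"))  # 1 of 10 trigrams
```

Running the same count separately on texts from each year, and plotting the results, gives the kind of curve the viewer draws.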
Google me n-grams
Perform the following searches
• flapper, hippie, yuppie
• vodka,whiskey,gin,rum
• Plato,Aristotle
• cocaine,heroin,amphetamine,LSD,marijuana
• werewolf,zombie
– add vampire
– add pedant
• the whole hog,cold turkey
Google me n-grams
Perform the following searches
The latest release (Oct 2012) added some options
• telephone_VERB, phone_VERB
• call_VERB, call_NOUN
• contact_VERB,impact_VERB,access_VERB
• telephone,radio,television
– add Internet
– add computer
– change the last two to (Internet+computer)
Google me n-grams
Activity
• You have 5 minutes to come up with your coolest Google
n-gram search and graph.
I will circulate to pick a winner.
Concordance: a list of every use of a certain word
in a corpus
KWIC lines: Key Word In Context
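A concordancer's core job can be sketched in a few lines. The function below is a minimal illustration, assuming the corpus is one plain-text string without line breaks; the name `kwic`, the `width` parameter, and the sample sentence are all hypothetical, not taken from any real tool.

```python
import re

def kwic(text, phrase, width=30):
    """Return one aligned line per occurrence of `phrase`, with `width`
    characters of context on each side (case-insensitive, whole words)."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    lines = []
    for m in pattern.finditer(text):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        # Right-align the left context so every key word starts in the same column
        lines.append(f"{left:>{width}} {m.group(0)} {right:<{width}}")
    return lines

# Made-up sample text, for illustration only
text = "The hour is arrived at last. Spring has arrived, and the ship is arrived in port."
for line in kwic(text, "is arrived", width=20):
    print(line)
```

Because the left context is padded to a fixed width, the key word lines up vertically, which is what makes patterns on either side easy to scan.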
Activity
• Analyze the context of “is arrived” on the handout.
• Note the date of publication and the genre
– (Magazine, Non-Fiction, Newspaper, Fiction)
• FYI: COHA = Corpus of Historical American English
• At what point in history does the pattern change?
• What happened?
– hint: present tense passive voice of transitive ‘arrive’ vs.
present perfect of intransitive ‘arrive’
• Where did I get this data?
COCA me crazy
http://corpus.byu.edu
The BYU Suite of Corpora and Tools
• Highlights
– Corpora
• Corpus of Contemporary American English (COCA) 450 mil
• Corpus of Historical American English (COHA) 400 mil
• TIME magazine Corpus of American English (Time) 100 mil
• Corpus of American Soap Operas (SOAP) 100 mil
• BYU-BNC: British National Corpus (BNC) 100 mil
• 5 of the Google Book Corpora 34 – 155 bil
– Tools
• online concordance interface
• WordAndPhrase.info
COCA me crazy
The BYU Suite of Corpora and Tools http://corpus.byu.edu
• Go to the site (hint: just Google “coca”)
• register
• use the “start” navigator to go to COHA
• type
[np*] is arrived
in the search box
• make a note of the number of occurrences in different years
• type
[np*] has arrived
in the search box
• What can we conclude about changes to auxiliary use with
the present perfect of some intransitive verbs (arrive)?
COCA me crazy
Vocabulary instruction 1
• Goal: teach students to correctly use evident/evidence
• use the “start” navigator to go to Word and Phrase
• type
evident
in the search box
• examine the information
• Go to COCA and type
evident [i*]
• choose the “Academic” corpus and click “search”
• scan down the lines and look for patterns
• click on “evident” in a KWIC line
• let’s make a class handout
COCA me crazy
Vocabulary instruction 2
• select 10 KWIC lines for “evident” and copy
• open MS Word and change orientation and margins
• paste the lines, select and delete unnecessary columns
• change font to courier, 8 pt
• delete characters so “evident” is in the middle
• rearrange lines so that patterns are more obvious
• delete difficult lines and/or redundant patterns
• follow the same steps for “evidence”
• delete the key words to make a gap-fill
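The last step above, deleting the key words to make a gap-fill, can also be done automatically once the lines are in a text file. This is a minimal sketch: the function name `gap_fill` and the blank string are hypothetical, and the two example lines are invented for illustration rather than copied from COCA.

```python
import re

def gap_fill(lines, keywords, blank="________"):
    """Replace every whole-word occurrence of any keyword with a blank."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in keywords) + r")\b",
        re.IGNORECASE,
    )
    return [pattern.sub(blank, line) for line in lines]

# Invented example lines, not real corpus data
lines = [
    "It was evident that the plan had failed.",
    "There is no evidence for this claim.",
]
for line in gap_fill(lines, ["evident", "evidence"]):
    print(line)
```

Keeping the original lines in a separate file gives you the answer key for free.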
COCA me crazy
Vocabulary instruction 3
• lesson design
• on board, write a frame for each word from your examples
• ask students to fill in words that could fit
• teach the new items: evident/evidence
• give students the handout to fill the gaps
• have them identify patterns of use
• engage in authentic writing in which the new words can be
used
COCA me crazy
Vocabulary instruction 4
• many potential variations
• multiple words
• single word—multiple patterns
• transition words instead of content words
• different register (spoken, fiction, news, academic)
• comparison of use in different registers
• teach students to search for a word in passive vocabulary—
examples will help them use it confidently
• try WordAndPhrase.Info
http://www.wordandphrase.info
COCA me crazy
Teaching Reporting Clauses and Phrases
• Example summaries from last year (pre-corpus-based lesson)
• Example summaries from this term (post-corpus-based lesson)
• Activity: do the student handout
– Answers and “What did you learn?”
COCA me crazy
Teaching Reporting Clauses and Phrases
• How did I make this handout using COCA?
– identified target phrases, searched academic register,
– clicked selected lines for context, copied entire sentence
– pasted to Word, arranged so that target moves away from
sentence initial position
– identified target verbs, searched with [np1] before verbs,
selected for context, copied, pasted
– Selected additional lines for page 2 gap-fill and arranged
– wrote questions that emphasize form
Getting on MICASE and MICUSP
MICASE: http://micase.elicorpora.info
MICUSP: http://micusp.elicorpora.info
• Highlights
– Corpora
• Michigan Corpus of Academic Spoken English (MICASE)
– 152 transcripts; 1,848,364 words
• Michigan Corpus of Upper-level Student Papers (MICUSP)
– 830 “A-grade” papers; 2,600,000 words
– variety of disciplines
– Tools
• online concordance interfaces
Getting on MICASE and MICUSP
• Goal: raise awareness of what’s in lectures
– Students are probably familiar with what’s + NP, verb, or adjective in a question
• so with that in mind what's the next word after, avaritiae?
• now what's physically going on here?
• but what's wrong? why did… isn't this useful?
– but they are less aware of its frequent use in complement clauses and
wh-clefts
• there is never any sense of what's going on behind
• but what's usually happening is that some victim, of raw
political oppression, is unjustly imprisoned.
Getting on MICASE and MICUSP
• Go to MICASE
– In the “Transcript Attributes” menu, select “Lecture-small”
– In the search box type what’s and click “submit”
– Sort results by “2 right”
– Which frequent followers typically lead to questions?
» a, an, the, this, wrong
– Which frequent followers typically lead to statements?
» called, going on, gonna, important
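If you save the concordance lines to a file, the "frequent followers" can be tallied offline. The sketch below counts the word immediately after the search term; it is an illustration under simple assumptions (lowercasing, letters and apostrophes as tokens) and is not a description of how the MICASE interface sorts results. The example lines are the ones quoted on the earlier slide.

```python
import re
from collections import Counter

def followers(lines, node="what's"):
    """Count the word that immediately follows each occurrence of `node`."""
    counts = Counter()
    for line in lines:
        # Normalise curly apostrophes so "what’s" and "what's" match
        words = re.findall(r"[a-z']+", line.lower().replace("’", "'"))
        for i, word in enumerate(words[:-1]):
            if word == node:
                counts[words[i + 1]] += 1
    return counts

lines = [
    "so with that in mind what's the next word after, avaritiae?",
    "now what's physically going on here?",
    "but what's wrong? why did… isn't this useful?",
    "there is never any sense of what's going on behind",
    "but what's usually happening is that some victim",
]
print(followers(lines).most_common())
```

With a few hundred lines instead of five, the most common followers give a quick list of patterns worth putting on a handout.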
Getting on MICASE and MICUSP
• Make an awareness raising handout
– what’s in questions
» copy 2 lines each of what’s + a, an, the, this, wrong
» paste in a pre-formatted Word doc
» edit lines to bring key word to the center
– what’s in statements
» copy 4 lines of what’s + called, important, going on, and
gonna
» paste and edit
Getting on MICASE and MICUSP
• Make an awareness raising handout
– Students analyze first set of 10
» What is the function of what’s?
– Students analyze what’s + called
» can you remove “what’s called”?
» what is the purpose of adding “what’s called?”
– Students analyze what’s + going on
» read what’s before the key words—what do most lines
have in common?
Getting on MICASE and MICUSP
• Make an awareness raising handout
– For each pattern, allow students to form generalizations
about meanings associated with the pattern.
• Copy and Paste additional lines to make a gap-fill with random
patterns.
• Follow-up with authentic listening/speaking practice that uses
at least some of the what’s patterns.
Frankencorpus: Build your own body
• Research Questions
• prepare texts
• PoS tagging
• Download Concordancer
• Basic Analysis
Frankencorpus:
Build your own body
• Choose a news topic: find at least three articles in
different online news sources
• copy and paste the text of the articles into one notepad
text file
• copy all of the text and tag it (CLAWS, C7 tag set)
• copy tagged text and paste into notepad and save
• download AntConc
• open AntConc
Frankencorpus:
Build your own body
• AntConc global settings: hide tags but allow search
• AntConc tool settings: word count and concordancer;
select “treat all text as lowercase”
• load text file
• perform word count
• go to concordancer tab
• search for a keyword or phrase
• search for a part of speech or combination
• sort lines alphabetically by keyword and/or left-right
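Searching for a part of speech or a combination can also be sketched outside AntConc. The code below assumes the tagged file uses a `word_TAG` layout with tokens separated by spaces; check your own tagger output, since the layout depends on the output option chosen. The sample line was tagged by hand with a few C7-style tags (AT, JJ, NN1, VVD, II) for illustration and is not real CLAWS output; the function names are hypothetical.

```python
def parse_tagged(text):
    """Split 'word_TAG' tokens into (word, tag) pairs; untagged tokens are skipped."""
    pairs = []
    for token in text.split():
        word, sep, tag = token.rpartition("_")
        if sep and word:
            pairs.append((word, tag))
    return pairs

def find_tag_sequence(pairs, tag_prefixes):
    """Return the word sequences whose tags start with the given prefixes, in order."""
    n = len(tag_prefixes)
    hits = []
    for i in range(len(pairs) - n + 1):
        window = pairs[i:i + n]
        if all(tag.startswith(p) for (_, tag), p in zip(window, tag_prefixes)):
            hits.append(" ".join(word for word, _ in window))
    return hits

# Hand-tagged sample (an assumption about the format), for illustration only
tagged = "The_AT new_JJ law_NN1 passed_VVD in_II the_AT state_NN1 senate_NN1"
pairs = parse_tagged(tagged)

print(find_tag_sequence(pairs, ["JJ", "NN"]))  # adjective + noun
print(find_tag_sequence(pairs, ["NN", "NN"]))  # noun + noun
```

Matching on tag prefixes means `NN` catches both singular and plural common nouns, which is usually what a quick classroom search needs.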
Conclusions
Corpus tools and corpus-based
materials are not magic.
Playing with the tools in your free time
will help you build skills used for
efficient materials production.
Authenticity is authentic.
Editing requires conscious attention to
form.