CORPUS-INFORMED
TEACHING AND RESEARCH 1
Ken Lau
Warm-Up Discussion
1. Work in pairs. Which of the following word combinations
does not make a natural partnership in English?
How can you find out the answer?
- situations arise
- difficulties arise
- problems arise
- suggestions arise
- disputes arise
- questions arise
Arise
Suggest vs Claim
have fallen and the marking was far too lenient. The Tories desperately want to claim their befuddled education policy is working while schools
If there's more than one winner, they EACH get a ring. To claim your Diamond Line prize, ring the Bingo Hotline in your card between 10.30am and
North Wales; Wonderwest World, Ayr, Scotland. IT'S so easy to claim one of these amazing star breaks, exclusively with the Daily Mirror. Just cut
The diver who found her body -- John Farrar -- broke a 23-year silence to claim that Mary Jo Kopechne was kept alive in an air pocket inside the
damages in the High Court yesterday. Magazine Star Kicks admitted they were wrong to claim he had' a reputation for wild drug orgies.' # LIBEL
this: Drugs trade is worth 2.74 billion. Arson, by people wanting to claim their own insurance -- 407 million. VAT fraud -- 38 million. Common
he had cancer and could not face an agonising death. She went on to claim that Elvis wanted to join his beloved mother Gladys, whose death had
on two terrified teenagers. It's ridiculous for owners of these dangerous beasts to claim they are as gentle as lambs. They aren't, and I wonder
details privately but I was left with no alternative.' He is expected to claim Mia is an unfit mother. The lawsuit was a double betrayal for Mia,
the bomb and the bullet. Yesterday, the unyielding men of terror tried to claim the province's 3,001st victim. Today, the Mirror looks back to the
worker John Beach. Now Mr Beach, 34, is taking private action to claim compensation for injury and loss of earnings. # PALS: Gardner and Gazza
Suggest vs Claim
and said No. He has told me:' It is completely absurd to suggest there is anything unprofessional in my friendship with the duchess. I am acting in
have complained to British Steel they have said that there is no medical evidence to suggest emissions from the steel plant will damage health. They won't
that one now.' ICI said yesterday:' We have no evidence to suggest our emissions are causing ill health in the local communities.' A spokesman for
make manufacturers put energy efficiency labels on their products and they are asking ministers to suggest such labels when they meet their European
abnormality used by Easton's section police in this way should not be taken to suggest that there is universal agreement on the abnormality of each
for helping them decide how to vote. Wober (1989a) presents evidence to suggest that electors find mid-term PPBs, especially opposition PPBs, much
more sceptical). Our content analysis of television during the election campaign seems to suggest that television was biased towards the right wing and,
a Family Policy Group, a committee of Cabinet ministers, with a remit to suggest ways for strengthening family life and promoting a sense of individual
of local government finance and a reformed education system. The experience is one to suggest an affirmative response to the question,' Do Parties Make
gives meaning and coherence to them. One object of this essay will be to suggest such a theoretical framework. The framework aims to provide a tool for
achieve an optimal allocation of society's resources. It is not enough simply to suggest justifications for the existence of private ownership. If private
, ostensible neutrality, and rules. One response to this choice might be to suggest that it depends on the type of dispute in question. Formal justice and
are deciding the case after hearing five days of evidence and it is impossible to suggest, in my judgment, that they were wrong in coming to the conclusion
doing. It would not, however, be possible for the third defendant to suggest that the third party was in any way guilty of any illegal conduct. A
What is a corpus?

Simply put, a corpus is a collection of texts in an
electronic database. Several characteristics of corpora are
worth thinking about:
- Not all corpora that can be used for linguistic analysis or research were originally built for those purposes.
- Electronic corpora can consist of whole texts or collections of whole texts.
- Texts in a corpus are (now) in a computer-readable format.
- Corpora are often assembled to be representative of some language or text type; authentic texts are thus collected.
- Corpora may be compiled for specific purposes, which in turn affect the design, size, and nature of the individual corpus. In this case, the texts are NOT collected randomly but in a principled way.
Intuition vs evidence/corpus-based approach

As L2 speakers, we may come across situations in which
we have to decide which form or usage of a grammatical
construction is more idiomatic in the target language.
In the past, if we needed to determine whether
“suggestions arise” is correct in the warm-up task, we
might have relied on our intuition. With the use of
corpora of authentic texts, however, our decision becomes
evidence-based and more accurately reflects actual
language use.
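The evidence-based check described above can be sketched in a few lines of Python. This is only a minimal illustration: the function name `words_before` is hypothetical, and the four sentences are a toy sample invented for the example. A real check would query an authentic corpus such as the BNC.

```python
import re
from collections import Counter

def words_before(target, texts):
    """Count the word that immediately precedes `target` in each text."""
    counts = Counter()
    pattern = re.compile(r"\b(\w+)\s+" + re.escape(target) + r"\b", re.IGNORECASE)
    for text in texts:
        for match in pattern.finditer(text):
            counts[match.group(1).lower()] += 1
    return counts

# Toy sample invented for illustration; NOT taken from a real corpus.
sample = [
    "Problems arise when the data are incomplete.",
    "Difficulties arise in almost every project.",
    "When problems arise, contact your supervisor.",
    "Questions arise about the cost of the scheme.",
]

counts = words_before("arise", sample)
for word, freq in counts.most_common():
    print(word, freq)
```

A word that never precedes "arise" in a large, representative corpus (for example "suggestions" in this toy sample) is unlikely to form a natural partnership with it.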
Key Terms in CL
- Representativeness
- Mean and Standard Deviation (S.D.)
- Raw Frequencies
- Normalising Frequencies
- Mutual Information
- Other measures of collocation
- Keyword
Representativeness
- A key issue in any statistical analysis is whether a sample, or subset, of any population, or larger group, will accurately represent the variables or characteristic features associated with the population as a whole.
- To apply this to linguistics, if we are going to make claims that a linguistic feature (the variable) is or is not characteristic of the language as a whole (the population), then we need to be convinced that its incidence in the texts that make up our corpus (the sample) accords with its incidence in the language more broadly. In short, the sample we have needs to be representative of the population as a whole.
Representativeness
- The larger the better/more reliable (if statistical analyses are the major part of your research, >1M words are needed).
- Try to mirror the range and proportion of texts produced in everyday life.
- The challenge: is it possible to achieve this ideal goal? (Consider, for example, what kinds of texts are needed if you want spoken data of daily conversation. Any foreseeable problems in data collection?)
Representativeness
Balance
- The British National Corpus (BNC) is considered a balanced corpus
  - ~100 million words; 90% written, 10% spoken
- Written texts
  - Selected using three criteria: domain, time and medium
    - Domain: content type (subject field)
    - Time: period of the text production
    - Medium: type of text publication, e.g. books, periodicals, etc.
- Spoken texts
  - Selected using two criteria: demographic and context-governed
    - Demographic: informal encounters recorded by 124 volunteer respondents selected by age group, sex, social class and geographical region
    - Context-governed: formal encounters such as meetings, lectures and radio broadcasts recorded in four broad context categories (education, business, institution, leisure)
Mean and Standard Deviation
Mean
- The total number of occurrences of the feature in question divided by the number of texts in the corpus. (Dividing by the total number of words in the corpus instead gives the average rate of the feature per word.)
Standard Deviation (S.D.)
- The actual number of occurrences of the feature in any given text might vary considerably from the mean. Consider, for example, three texts containing 70, 120 and 200 hedging devices (e.g. seems, appears, may, could): the mean is 130, but only the second text is close to the mean. It is therefore useful to have a measure of how far a variable is likely to deviate from the mean, i.e. the S.D.
- A small S.D. tells us that on average the variation from the mean is quite low, although there might of course be a few exceptional examples that vary quite widely from the mean. In the above example, the S.D. is about 53.5, which shows quite a high degree of variation from the mean in the individual texts.
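The hedging example can be checked with Python's standard `statistics` module. This is a minimal sketch: the figure of about 53.5 corresponds to the population standard deviation (dividing by the number of texts), which is an inference from the numbers rather than something the slides state. The sample formula (dividing by n - 1) gives a larger value.

```python
import statistics

# Number of hedging devices in the three texts from the example.
hedges = [70, 120, 200]

mean = statistics.mean(hedges)
population_sd = statistics.pstdev(hedges)  # divides by n
sample_sd = statistics.stdev(hedges)       # divides by n - 1

print(f"mean = {mean}")
print(f"population S.D. = {population_sd:.1f}")
print(f"sample S.D. = {sample_sd:.1f}")
```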
Mean and Standard Deviation
[Table: means and standard deviations of linguistic features, per 1,000 words]
Mean and Standard Deviation:
Some Observations
- We expect around 137.4 nouns to occur per 1,000 words in conversation. If an individual conversational text varies by up to one S.D. (that is, +/- 15.6 occurrences from the mean), then that is very much expected. If, however, an individual conversation deviates greatly from this band of frequencies (e.g. by 6 or 7 times the S.D.), then we can be relatively assured in our claim that it is unlike other texts in terms of the number of nouns.
- The figures for nouns show that the stylistic range of writing is greater than that of speech, accounting for the higher degree of variation in the number of nouns found in the written registers.
- Academic prose has a mean of 2.1 and an S.D. of 2.1 for conditional clauses, indicating that it would be entirely reasonable to find a stretch of 1,000 words containing no conditional clauses at all.
- There are a lot more passives in academic prose, which highlights the impersonal nature of the texts.
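The first observation amounts to asking how many standard deviations a text lies from the mean. A minimal sketch follows, using the mean (137.4) and S.D. (15.6) for nouns in conversation quoted above. The function name `sd_units` and the two noun counts are hypothetical, chosen only for illustration.

```python
def sd_units(rate_per_1000, mean, sd):
    """Return how many standard deviations a text's rate lies from the mean."""
    return (rate_per_1000 - mean) / sd

NOUN_MEAN = 137.4  # nouns per 1,000 words in conversation (from the slides)
NOUN_SD = 15.6     # standard deviation (from the slides)

# Hypothetical texts: one typical, one far outside the expected band.
typical = sd_units(150.0, NOUN_MEAN, NOUN_SD)
unusual = sd_units(240.0, NOUN_MEAN, NOUN_SD)

print(f"typical text: {typical:.2f} S.D. from the mean")
print(f"unusual text: {unusual:.2f} S.D. from the mean")
```

A value within about one S.D. of the mean is unremarkable, whereas a value six or more S.D. away marks the text as unlike other conversations on this feature.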
Raw Frequency
The number of times a word actually occurs in a corpus, before any adjustment for corpus size.
Raw Frequency: Some Observations
- Personal nature (shown by the high number of occurrences of I)
- Related to presentation
- Related to cognitive activities (think) and physical activities (make)
- Adherence to certain rules/patterns (should)
Normalising Frequencies
- They are used when comparing two data sets of unequal size.
- They tell us the number of occurrences that we can expect per thousand, or sometimes per million, words.
Normalising Frequencies
Rank  Word          Frequency  Normalised frequency (per 10,000 words)
1     The           56,939     651
2     I             35,998     412
3     To            30,628     350
4     And           24,318     278
5     Of            18,374     210
6     In            17,804     204
7     Presentation  15,074     172
8     A             13,789     158
9     My            12,628     144
10    Is            11,082     127
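The normalised figures in the table follow from a simple calculation: raw frequency divided by corpus size, multiplied by the chosen base (here 10,000 words). The sketch below illustrates it. The corpus size is not stated in the slides, so the value of 874,600 words is an assumption back-calculated from the first row of the table. The function name `normalise` is likewise hypothetical.

```python
def normalise(raw_frequency, corpus_size, base=10_000):
    """Return the expected number of occurrences per `base` words."""
    return raw_frequency / corpus_size * base

# ASSUMPTION: corpus size is not given in the source; this figure is
# back-calculated from the table (56,939 occurrences = 651 per 10,000 words).
ASSUMED_CORPUS_SIZE = 874_600

for word, raw in [("The", 56_939), ("I", 35_998), ("Presentation", 15_074)]:
    print(word, round(normalise(raw, ASSUMED_CORPUS_SIZE)))
```

Because the result is expressed per 10,000 words, it can be compared directly with a figure from a corpus of a different size.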
Mutual Information (MI)
- Provides information on how strongly individual words collocate with others, i.e. how much more often they occur together than chance would predict.
- It is generally accepted that an MI score higher than 3 suggests a strong bond between the search term and its collocate.
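For readers who want to see where an MI score comes from, here is a minimal sketch of one common formulation: the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. The function name, the `span` parameter and all the numbers are assumptions for illustration. Online corpus interfaces may compute the score slightly differently.

```python
import math

def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size, span=1):
    """MI = log2(observed / expected), where expected is the number of
    co-occurrences predicted by chance within `span` word positions."""
    expected = node_freq * collocate_freq * span / corpus_size
    return math.log2(pair_freq / expected)

# Hypothetical frequencies, chosen only to illustrate the calculation.
score = mutual_information(
    pair_freq=4,          # times node and collocate occur together
    node_freq=100,        # occurrences of the search term
    collocate_freq=40,    # occurrences of the collocate
    corpus_size=1_024_000,
)
print(f"MI = {score:.2f}")
print("strong bond" if score > 3 else "weak bond")
```

A rare word that occurs almost only next to the search term gets a high score even when the pair itself is infrequent, which is why MI tends to favour low-frequency collocates.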
Mutual Information
What can you tell from the MI scores of the collocates of
“reinforced” and “strengthened”? Check the MI scores by
following this procedure:
1. Go to http://corpus.byu.edu/bnc
2. Select “List”
3. Type “reinforced” in the Search box
4. Leave the Collocates box blank (with *) [keep the span at 4 words on each side]
5. In the sorting field, choose “relevance”
6. Click Search
7. Repeat the same steps with the word “strengthened”
Mutual Information: Reinforced
Collocates with “reinforced”
Rank  Collocate    Total  All    %      MI
1     Fram         7      34     20.59  10.9
2     Glass-fibre  5      25     20.00  10.86
3     Concrete     53     2,585  2.05   7.58
4     Beams        5      625    0.80   6.22
5     Tendencies   5      685    0.73   6.09
Mutual Information: Strengthened
Collocates with “strengthened”
Rank  Collocate     Total  All    %     MI
1     Weakened      14     740    1.89  7.73
2     Greatly       30     3,267  0.92  6.69
3     Resolve       13     1,714  0.76  6.41
4     Enormously    6      804    0.75  6.39
5     Considerably  18     2,857  0.63  6.15
Mutual Information
You may also use the “Compare” function to
obtain information about collocation. Follow the
steps below and compare the collocates of “Male”
and “Female”:
1. Go to http://corpus.byu.edu/bnc
2. Select “Compare”
3. In the search box, input “Male” and “Female”
4. Leave the Collocates box blank (with *) [keep the span at 4 words on each side]
5. Click Search
Mutual Information: Male and Female
Rank  Male          Female
1     Chauvinism    Eagle
2     Gay           Lays
3     Supremacy     Terminal
4     Heir          Detective
5     Testosterone  Emancipation
6     Heterosexual  Passenger
7     Breadwinner   Representation
8     Lover         Blonde
9     Swindon       CM.
10    Chauvinist    Impersonator
“Feminist vs Chauvinist” over time

Now use the Time Magazine Corpus (1923-2006)
(http://corpus.byu.edu/time/). Search for the terms
“feminist*” and “chauvinist*”. What can you say
about how the frequencies of these terms have
changed since the 1920s?
Keyword
- Keywords are those expressions that have a significantly higher or lower frequency of occurrence in a text or set of texts than we should expect, given their frequency of occurrence in a larger corpus used as a point of reference.
- To determine whether a word counts as a keyword, the concept of log-likelihood is important. You do not need to worry about the calculations behind it; instead, simply use the calculator created by Paul Rayson of Lancaster University:
  http://ucrel.lancs.ac.uk/llwizard.html
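For those curious about what such a calculator does, the sketch below implements the standard two-corpus log-likelihood calculation: observed frequencies are compared with the frequencies expected if the word were equally common in both corpora. The function name and the input figures are hypothetical and serve only as an illustration. The online calculator remains the tool to use for the exercises.

```python
import math

def log_likelihood(o1, n1, o2, n2):
    """Return (LL, overused), where o1/o2 are the word's frequencies and
    n1/n2 the sizes of the study corpus and the reference corpus.
    `overused` is True when the word is relatively more frequent in corpus 1."""
    total = o1 + o2
    e1 = n1 * total / (n1 + n2)  # expected frequency in corpus 1
    e2 = n2 * total / (n1 + n2)  # expected frequency in corpus 2
    ll = 0.0
    for observed, expected in ((o1, e1), (o2, e2)):
        if observed > 0:  # a zero count contributes nothing to the sum
            ll += observed * math.log(observed / expected)
    ll *= 2
    overused = (o1 / n1) > (o2 / n2)
    return ll, overused

# Hypothetical figures: 20 hits in 1,000 words vs 10 hits in 1,000 words.
ll, overused = log_likelihood(20, 1_000, 10, 1_000)
print(f"LL = {ll:.2f}, {'overused' if overused else 'underused'} in corpus 1")
```

The larger the LL value, the less likely it is that the difference between the two corpora arose by chance. The plus or minus sign reported by the calculator corresponds to the overused/underused flag here.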
Keyword: Mortgage

Now try to see if the term mortgage is overused or
underused in the Hong Kong Financial Services
Corpus compiled by the Hong Kong Polytechnic
University (reference corpus: newspaper subcorpora of the BNC).
Follow this procedure:
1. Go to http://rcpce.engl.polyu.edu.hk/HKFSC/
2. Enter the word “mortgage” in the search box
3. Note the size of the corpus and then click Search
4. Record the number of instances of “mortgage”
5. Go to http://corpus.byu.edu/bnc
6. Select “Chart”
7. Enter the word “mortgage” in the search box and click Search
8. Record the number of instances of “mortgage” in the newspaper subcorpora and the size of the subcorpora
9. Enter all the information collected here: http://ucrel.lancs.ac.uk/llwizard.html
10. Write down the results below
Keyword: Mortgage
Word      O1     %1    O2   %2  LL
Mortgage  2,550  0.03  695  0   +2,973.9
Useful Online Corpora
- Professional Specific Corpora
  http://rcpce.engl.polyu.edu.hk/
- British Academic Written English Corpus
  http://www.coventry.ac.uk/research/research-directory/art-design/britishacademic-written-english-corpus-bawe/
- British Academic Spoken English Corpus
  http://www.coventry.ac.uk/research/research-directory/art-design/britishacademic-spoken-english-corpus-base/
- Michigan Corpus of Academic Spoken English (MICASE)
  http://quod.lib.umich.edu/m/micase/?type=revise