Corpus Linguistics
Lecture 1
Albert Gatt
Contact details
My email: [email protected]
Drop me a line with queries etc, and
to arrange meetings.
Course web page
Course web page:
http://staff.um.edu.mt/albert.gatt/home/teaching/corpusLing.html
Details of tutorials, lectures etc will
always be on the web page.
Readings for the lecture
Downloadable lecture notes (available after
the lecture)
Suggested text
T. McEnery and A. Wilson. (2001).
Corpus Linguistics. Edinburgh
University Press
NB: Over the course of these
lectures, other readings will also be
proposed and made available, usually
online.
Lectures and assessment
Structure of lectures:
all lectures will take place in the lab
usually, about half the lecture (1hr) will
be devoted to practical work
Course assessment: assignment
Final essay (ca. 1500-2000 words)
Essay topics will involve research on
corpora!
Questions…
?
What is corpus linguistics?
A new theory of language?
No. In principle, any theory of language is compatible
with corpus-based research.
A separate branch of linguistics (in addition to syntax,
semantics…)?
No. Most aspects of language can be studied using a
corpus (in principle).
A methodology to study language in all its aspects?
Yes! The most important principle is that aspects of
language are studied empirically by analysing
natural data using a corpus.
A corpus is an electronic, machine-readable
collection of texts that represent “real life”
language use.
Goals of this lecture
To define the terms:
corpus linguistics
corpus
To give an overview of the history of corpus
linguistics
To contrast the corpus-based approach with
other methodologies used in the study of
language
An initial example
Suppose you’re a linguist interested in the
syntax of verb phrases.
Some verbs are transitive, some intransitive
I ate the meat pie (transitive)
I swam (intransitive)
What about:
quiver
quake
Most traditional grammars characterise
these as intransitive
Are these really intransitive?
One possible methodology…
The standard method relies on the linguist’s
intuition:
I never use quiver/quake with a direct object.
I am a native speaker of this language.
All native speakers have a common mental grammar
or competence (Chomsky).
Therefore, my mental grammar is the same as
everyone else’s.
Therefore, my intuition accurately reflects English
speakers’ competence.
Therefore, quiver/quake are intransitive.
NB: The above is a gross simplification! E.g. linguists
often rely on judgements elicited from other native
speakers.
Another possible methodology…
This one relies on data:
I may never use quiver/quake with a
direct object, but…
…other people might
Therefore, I’ll get my hands on a large
sample of written and/or spoken English
and check.
Quiver/quake: the corpus linguist’s answer
A study by Atkins and Levin (1995) found
that quiver and quake do occur in transitive
constructions:
the insect quivered its wings
it quaked his bowels (with fear)
Used a corpus of 50 million words to find
examples of the verbs.
With sufficient data, you can find examples
that your own intuition won’t give you…
Example II: lexical semantics
Quasi-synonymous lexical items
exhibit subtle differences in context.
strong
powerful
A fine-grained theory of lexical
semantics would benefit from data
about these contextual cues to
meaning.
Example II continued
Some differences between strong and
powerful (source: British National Corpus):
strong: wind, feeling, accent, flavour
powerful: tool, weapon, punch, engine
The differences are subtle, but examining
their collocates helps.
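To make "examining collocates" concrete, here is a minimal sketch of counting the words that directly follow a node word. It assumes a tiny invented token list, not BNC data, and the function name `right_collocates` is purely illustrative; real collocation studies use much larger samples and usually a statistical association measure on top of raw counts.

```python
from collections import Counter

def right_collocates(tokens, node, window=1):
    """Count the words found up to `window` positions after `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            counts.update(tokens[i + 1 : i + 1 + window])
    return counts

# Toy data, invented for illustration (not taken from any corpus).
tokens = ("a strong wind and a powerful engine with a strong accent "
          "and a powerful tool in a strong wind").split()

strong = right_collocates(tokens, "strong")
powerful = right_collocates(tokens, "powerful")

print(strong.most_common())
print(powerful.most_common())
```

Even on toy data the two adjectives end up with disjoint sets of right-hand neighbours, which is the kind of contrast the slide describes.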
Some preliminary definitions
The second approach is typical of the
corpus-based methodology:
Corpus: A large, machine-readable
collection of texts.
Often, in addition to the texts themselves,
a corpus is annotated with relevant
linguistic information.
Corpus-based methodology: An approach
to Natural Language analysis that relies
on generalisations made from data.
Example (British National Corpus)
British National Corpus (BNC):
100 million words of English
90% written, 10% spoken
Designed to be representative and
balanced.
Texts from different genres (literature,
news, academic writing…)
Annotated: Every single word is
accompanied by part-of-speech
information.
Example (continued)
A sentence in the BNC:
Explosives found on Hampstead Heath.
<s>
<w NN2>Explosives
<w VVD>found
<w PRP>on
<w NP0>Hampstead
<w NP0>Heath
<PUN>.
Example (continued)
<s>                   new sentence
<w NN2>Explosives     plural noun
<w VVD>found          past tense verb
<w PRP>on             preposition
<w NP0>Hampstead      proper noun
<w NP0>Heath          proper noun
<PUN>.                punctuation
Explosives found on Hampstead Heath
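As a rough illustration of why this markup is machine-readable, the sketch below turns the annotated sentence into (word, tag) pairs. It assumes only the simplified notation shown on the slide; the regular expression and the name `parse_tagged` are hypothetical, and the actual BNC distribution uses a richer markup scheme that this does not attempt to handle.

```python
import re

# Matches the simplified slide notation: <w TAG>word or <PUN>mark.
# The sentence marker <s> is lowercase and is deliberately not matched.
TOKEN = re.compile(r"<(?:w\s+)?([A-Z0-9]+)>([^<\s]+)")

def parse_tagged(text):
    """Return a list of (word, tag) pairs from slide-style markup."""
    return [(word, tag) for tag, word in TOKEN.findall(text)]

sample = ("<s> <w NN2>Explosives <w VVD>found <w PRP>on "
          "<w NP0>Hampstead <w NP0>Heath <PUN>.")

pairs = parse_tagged(sample)
for word, tag in pairs:
    print(f"{word}\t{tag}")
```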
Important to note
This is not “raw” text.
Annotation means we can search for particular
patterns.
E.g. for the quiver/quake study: “find all
occurrences of quiver which are verbs, followed
by a determiner and a noun”
The collection is very large
Only in very large collections are we likely to
find rare occurrences.
Corpus search is done by computer. You
can’t trawl through 100 million words
manually!
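The quiver/quake query can be sketched as a scan over (word, tag) pairs. This is an illustration under stated assumptions, not the method used in the original study: the sentence is hand-tagged for the example, the choice of which tags count as "determiner" (including possessives such as its) is an assumption, and `find_transitive` is a hypothetical helper.

```python
QUIVER_FORMS = {"quiver", "quivers", "quivered", "quivering"}
# Assumed set of determiner-like tags: article, determiner, possessive.
DETERMINER_TAGS = {"AT0", "DT0", "DPS"}

def find_transitive(tagged, forms):
    """Find verb + determiner + noun sequences for the given verb forms."""
    hits = []
    for i in range(len(tagged) - 2):
        (w1, t1), (w2, t2), (w3, t3) = tagged[i : i + 3]
        if (w1.lower() in forms and t1.startswith("VV")
                and t2 in DETERMINER_TAGS and t3.startswith("NN")):
            hits.append((w1, w2, w3))
    return hits

# Hand-tagged toy input, invented for illustration.
tagged = [("the", "AT0"), ("insect", "NN1"), ("quivered", "VVD"),
          ("its", "DPS"), ("wings", "NN2"), ("and", "CJC"),
          ("the", "AT0"), ("leaves", "NN2"), ("quivered", "VVD"),
          (".", "PUN")]

hits = find_transitive(tagged, QUIVER_FORMS)
print(hits)
```

Only the first quivered is reported, because only it is followed by a determiner and a noun. In practice a corpus search tool runs this kind of pattern for you.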
The practical objections…
But we’re linguists, not computer
scientists! Do we have to write
programs?
No, there are literally dozens of available
tools to search in a corpus.
Are all corpora good for all purposes?
No. Some are “general-purpose”, like the
BNC. Others are designed to address
specific issues.
The theoretical objections…
What guarantee do we have that the texts in our
corpus are “good data”, quality texts, written by
people we can trust?
How do I know that what I find isn’t just a small,
exceptional case? E.g. quiver in a transitive
construction could really be a one-off!
Just because there are a few examples of something,
doesn’t mean that all native speakers use a certain
construction!
Do we throw intuition out of the window?
Part 2
A brief history of corpus linguistics
Language and the cognitive revolution
Before the 1950s, the linguist’s task was:
to collect data about a language;
to make generalisations from the data (e.g. “In
Maltese, the verb always agrees in number and
gender with the subject NP”)
The basic idea: language is “out there”, the sum total
of things people say and write.
After the 1950s:
the so-called “cognitive revolution”
language treated as a mental phenomenon
no longer about collecting data, but explaining what
mental capabilities speakers have
The 19th & early 20th Century
Many early studies relied on corpora.
Language acquisition research was based on
collections of child data.
Anthropologists collected samples of unknown
languages.
Comparative linguists used large samples from
different languages.
A lot of work done on frequencies:
frequency of words…
frequency of grammatical patterns…
frequency of different spellings…
All of this was interrupted around 1955.
Chomsky and the cognitive turn
Chomsky (1957) was primarily responsible for the
new, cognitive view of language.
He distinguished (1965):
Descriptive adequacy: describing language, making
generalisations such as “X occurs more often than Y”
Explanatory adequacy: explaining why some things
are found in a language, but not others, by appealing
to speakers’ competence, their mental grammar
He made several criticisms of corpus-based
approaches.
Criticisms of corpora (I)
Competence vs. performance:
To explain language, we need to focus on
competence of an idealised speaker-hearer.
Competence = internalised, tacit knowledge of
language
Performance – the language we speak/write – is
not a good mirror of our knowledge
it depends on situations
it can be degraded
it can be influenced by other cognitive factors
beyond linguistic knowledge
Criticisms of corpora (II)
Early work using corpora assumed that:
the number of sentences of a language is finite (so
we can get to know everything about language if the
sample is large enough)
But actually, it is impossible to count the number of
sentences in a language.
Syntactic rules make the possibilities literally infinite:
the man in the house (NP -> NP + PP)
the man in the house on the beach (PP -> PREP +
NP)
the man in the house on the beach by the lake
…
So what use is a corpus? We’re never going to have
an infinite corpus.
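The recursion argument can be illustrated with a generator that keeps applying NP -> NP + PP. This is only a sketch: the three prepositional phrases come from the examples above, and cycling through them in a fixed order is an arbitrary choice made for the demonstration.

```python
import itertools

PPS = ["in the house", "on the beach", "by the lake"]

def noun_phrases(head="the man"):
    """Yield ever-longer NPs by repeatedly applying NP -> NP + PP."""
    np = head
    for pp in itertools.cycle(PPS):
        yield np
        np = f"{np} {pp}"

# The generator never runs out, so we only ever look at a finite slice.
first_four = list(itertools.islice(noun_phrases(), 4))
for phrase in first_four:
    print(phrase)
```

However many phrases we take, there is always a longer one still to come, which is the point of the criticism: no finite sample can list them all.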
Criticisms of corpora (III)
A corpus is always skewed, i.e. biased in
favour of certain things.
Certain obvious things are simply never said.
E.g. We probably won’t find a dog is a dog in our
corpus.
A corpus is always partial: We will only find
things in a corpus if they are frequent
enough.
A corpus is necessarily only a sample.
Rare things are likely to be omitted from a
sample.
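A back-of-the-envelope sketch of why rare things tend to be missing from a sample. It assumes, unrealistically, that every word position is an independent draw, and the rate of one occurrence per 100 million words is a made-up figure; the numbers are meant only to show the trend as the corpus grows.

```python
def chance_of_seeing(rate_per_million, corpus_size):
    """P(at least one occurrence), assuming independent word positions."""
    p = rate_per_million / 1_000_000
    return 1 - (1 - p) ** corpus_size

# Hypothetical item occurring about once per 100 million words.
small = chance_of_seeing(0.01, 1_000_000)      # 1-million-word corpus
large = chance_of_seeing(0.01, 100_000_000)    # 100-million-word corpus

print(f"1M words:   {small:.3f}")
print(f"100M words: {large:.3f}")
```

Under these assumptions the item has roughly a 1% chance of appearing at all in the small corpus and roughly a 63% chance in the large one.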
Criticisms of corpora (IV)
Why use a corpus if we already know things
by introspection?
How can a corpus tell us what is
ungrammatical?
Corpora won’t contain “disallowed” structures,
because these are by definition not part of the
language.
So a corpus contains exclusively positive
evidence: you only get the “allowed” things
But if X is not in the corpus, this doesn’t mean
it’s not allowed.
It might just be rare, and your corpus isn’t big
enough. (Skewness)
Refutations
Corpora can be better than introspective
evidence because:
They are public; other people can verify and
replicate your results (the essence of scientific
method).
Some kinds of data are simply not available to
introspection. E.g. people aren’t good at
estimating the frequency of words or structures.
Skewness can itself be informative: If X occurs
more frequently than Y in a corpus, that in itself
is an interesting fact.
Refutations (II)
By the way, nobody’s saying “throw
introspection out the window”…
There is no reason not to combine the corpus-based
and the introspection-based method.
Many other objections can be overcome by
using large enough corpora.
Pre-1950, most corpus work was done manually,
so it was error-prone.
Machine-readable corpora mean we have a
great new tool to analyse language very
efficiently!
Corpora in the late 20th Century
Corpus linguistics enjoyed a revival
with the advent of the digital
computer.
Kucera and Francis: the Brown Corpus,
one of the first
Svartvik: the London-Lund Corpus of spoken
English, which grew out of the Survey of
English Usage
These were rapidly followed by
others… Today, corpora are firmly
back on the linguistic landscape.
Summary
Introduced the notion of corpus and
corpus-based research
Gave a quick overview of the history
of this methodology
Looked at some possible objections to
corpus-based methods, and some
possible counter-arguments
Next lecture
We look more closely at some
important properties of a corpus:
Machine-readability
Balance
Representativeness
…