Transcript Corpus 3

Corpus 3
Corpus-based Description
Aspects of corpus-based
studies
• lexis, morphology, syntax and discourse.
• fig. 3.1 A classification of corpus-based
research on English
lexical description
• The most obvious use of corpora for lexical
description is in lexicography.
• Not only to identify the set of different words and
show when new types enter the language, but to
identify the various senses or uses of particular
types and their relative frequencies.
• e.g. London-Lund Corpus: polysemous word good
• Table 3.1
• Identify neologisms
Pre-Electronic Lexical Description
for Pedagogical Purposes
• Thorndike (1921): word frequency counts on the
basis of a 4.5-million-word corpus of literary
works and books read by younger children.
• The principle of vocabulary control in the
design and editing of reading materials
owes much to Thorndike's pioneering work.
• Michael West: General Service List of
English Words (1953)
Pre-Electronic Lexical Description
for Pedagogical Purposes
• Description of the most frequent 2,000 words in
the written English of the time, supplemented by
information on the frequency of the meanings or
uses of these words, based on the work of Lorge.
• Fig. 3.2
• The Thorndike-Lorge corpus was biased towards more
literary and formal styles of writing, and did not
include speech at all.
Computer-based studies of
lexicon
• With a computerized corpus and appropriate
software, both significant and more trivial
but interesting facts about the lexicon of a
language can be uncovered.
• Table 3.2 The rank ordering of the 50 most
frequent words in various corpora shows
remarkable consistency and systematic
differences.
Computer-based studies of
lexicon
Consistency: all the words except said are
function words.
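A rank ordering like the one in Table 3.2 can be approximated with a simple frequency count. The following is a minimal sketch, not the method used for the corpora in the table: the tokenizer (lowercasing, letters and apostrophes only) and the names top_words and sample are assumptions made for illustration.

```python
import re
from collections import Counter

def top_words(text, n=50):
    """Return the n most frequent word forms with their counts."""
    # Assumed tokenizer: lowercase, keep runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens).most_common(n)

# Hypothetical sample text; a real study would read a corpus file.
sample = "The cat sat on the mat and the dog said the cat was on the mat"
for rank, (word, count) in enumerate(top_words(sample, 3), start=1):
    print(rank, word, count)
```

Even in this tiny sample the top rank goes to a function word (the), which is the pattern the slide describes for the large corpora.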
Word Occurrence
That 40% of the words in a corpus of over
five million words occur only once
shows that a corpus of even that
size is not a sound basis for
lexicographical studies of
low-frequency words.
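The proportion of words that occur only once can be checked on any tokenized text. This sketch assumes the 40% figure refers to word types (distinct word forms) rather than running words; the function name hapax_ratio is made up for illustration.

```python
from collections import Counter

def hapax_ratio(tokens):
    """Share of distinct word forms (types) that occur exactly once."""
    counts = Counter(tokens)
    if not counts:
        return 0.0
    # Assumption: the ratio is taken over types, not over running words.
    hapaxes = sum(1 for c in counts.values() if c == 1)
    return hapaxes / len(counts)

# Hypothetical toy input: four types, two of which occur once.
print(hapax_ratio("a b b c c c d".split()))
```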
Word Occurrence
Sharman found that there was an almost
linear relationship between vocabulary
size and corpus size. A new word
appeared in the text approximately every
30 words on average.
The more narrowly focused the corpus, the
more content words find their way into
the higher frequency levels.
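The relationship between vocabulary size and corpus size can be traced by counting distinct words as the text is read. This is a rough sketch under assumed names (vocabulary_growth, tokens_per_new_type); it does not reproduce Sharman's procedure, only the kind of measurement involved.

```python
def vocabulary_growth(tokens, step=1000):
    """Return (tokens read, distinct words seen) pairs every `step` tokens."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok)
        if i % step == 0:
            curve.append((i, len(seen)))
    return curve

def tokens_per_new_type(tokens):
    """Average number of running words per distinct word in the text."""
    return len(tokens) / len(set(tokens))

# Hypothetical toy input; a real corpus would use a much larger step.
toy = ["a", "b", "a", "c", "b", "d"]
print(vocabulary_growth(toy, step=2))
print(tokens_per_new_type(toy))
```

On a large corpus, a value of tokens_per_new_type near 30 would correspond to the rate quoted on the slide.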
Word Classes
• Table 3.5 (written English): Relative
proportions of major word classes in the
Brown and LOB corpora
• As shown in Table 3.6 (spoken English),
fewer nouns and a considerable proportion
of discourse items characteristic of spoken
English are noteworthy.
Word Classes
• Table 3.7 shows that some sequences such as
adjective + noun or noun + noun are very frequent
indeed.
• Johansson and Hofland: occurrence of the 40 most
frequent sequences of word-class tags at the
beginnings and ends of sentences. Findings: the
ends of sentences may be more predictable
grammatically than the beginnings.
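Counts of word-class sequences of this kind can be sketched from a tagged corpus. The tag names (DET, ADJ, NOUN, VERB), the input layout (a list of sentences, each a list of word/tag pairs) and the function names are assumptions for illustration, not the tagsets or tools used for Brown or LOB.

```python
from collections import Counter

def tag_bigrams(tagged_sentences):
    """Count adjacent pairs of word-class tags within each sentence."""
    counts = Counter()
    for sent in tagged_sentences:
        tags = [tag for _, tag in sent]
        counts.update(zip(tags, tags[1:]))
    return counts

def edge_sequences(tagged_sentences, k=2):
    """Count the first k and last k tags of every sentence with >= k words."""
    starts, ends = Counter(), Counter()
    for sent in tagged_sentences:
        tags = tuple(tag for _, tag in sent)
        if len(tags) >= k:
            starts[tags[:k]] += 1
            ends[tags[-k:]] += 1
    return starts, ends

# Hypothetical tagged sentences.
sents = [
    [("the", "DET"), ("old", "ADJ"), ("man", "NOUN")],
    [("a", "DET"), ("big", "ADJ"), ("dog", "NOUN"), ("barked", "VERB")],
]
print(tag_bigrams(sents).most_common(2))
```

Comparing how concentrated the starts and ends counters are gives a rough measure of how predictable each sentence edge is.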
Register studies
• Table 3.8
• There are certain characteristics of the vocabulary
of scientific English. Certain relational words are
disproportionately more frequent in scientific
English. Comparative adjectives and adverbs are
similarly disproportionately frequent, whereas
locative adverbs of space or time are
disproportionately less frequent in scientific texts
than in general written American English.
• Items which occur in one variety but are highly
unlikely to occur in the other.
Semantic information
• Longman Dictionary of Contemporary
English
• noun entries 23,800
• 67% one sense 15946
• 20% two senses 4760
• 6.5% three senses 1547
• 2.5% four senses 595
Semantic information
• verb entries 7,921
• 55% one sense 4357
• 23.8% two senses 1885
• 10% three senses 792
• 4.4% four senses 348
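The percentages on these two slides follow directly from the entry counts. The short check below recomputes them; the function name sense_percentages is an assumption, and the counts are the ones listed on the slides.

```python
def sense_percentages(total_entries, counts_by_senses):
    """Map each number of senses to its share of all entries, in percent."""
    return {
        senses: round(100 * count / total_entries, 1)
        for senses, count in counts_by_senses.items()
    }

# Counts as given on the slides (noun entries 23,800; verb entries 7,921).
nouns = sense_percentages(23800, {1: 15946, 2: 4760, 3: 1547, 4: 595})
verbs = sense_percentages(7921, {1: 4357, 2: 1885, 3: 792, 4: 348})
print(nouns)
print(verbs)
```

The listed categories do not add up to 100%; the remainder is not broken down on the slides.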
Collocation
• Some words can have a tendency to occur
in the company of other words in certain
contexts, e.g. pouring rain, statistically
significant, intrinsic value, strong tendency
• Lexicalized unit: set phrase, idiomatic usage,
cliché
Collocation
• Interest in recurring word combinations:
• Wong-Fillmore (1976): The strategy of
acquiring formulaic speech is central to the
learning of language.
Collocation
• Peters (1980): unanalyzed sequences of words had
a significant role among the units of language
acquisition and proposed ways for identifying
such unanalyzed sequences.
• Nattinger and DeCarrico (1992): since first
language learners can be seen to use varying,
apparently unanalyzed, prefabricated chunks of
speech, second language teaching might
similarly concentrate on establishing
what they call lexical phrases.
Collocation
• Different characteristics of the sequence:
• Allow for no alteration: it's as easy as
falling off a log.
• Allow certain changes (at the moment/at
certain moments)
• Relatively free within a framework (too ... to,
n ... of)
Collocation
• Problems in the definition of collocation:
• How often does a combination have to recur to be
habitual?
• Who decides what sounds natural?
• Does a combination have to be well-formed or
canonical to be a collocation?
• Do collocations have to be syntactic or are they
primarily semantic?
• Do collocations have to consist of adjacent words
or can they be discontinuous?
Collocation
• Can a sequence which occurs only once in a
particular corpus but which is intuitively
recognized by native speakers as a sequence they
have heard before be listed as a collocation?
• How big does a corpus have to be in order to
establish that a collocation does exist?
• Are there degrees of collocationality based on the
flexibility of the bonding between words?
Collocation
• Can we lemmatize collocations so that
similar or inflectionally related sequences
are counted as a single collocation type?
• Can degrees of collocationality be
established on the basis of the number of
tokens of a type in a particular corpus?
Collocation
• Sinclair (1991) suggested that a span of up
to four words on each side of a word is the
environment in which collocation is most
likely to occur, although computer software
makes it possible to explore much larger
spans, up to the size of a whole text.
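The span idea can be sketched as a window count around a node word. This is an illustrative sketch only: the function name collocates and the toy sentence are assumptions, and it reports raw co-occurrence counts, not a statistical measure of association.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` words on each side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Hypothetical toy text.
words = "the pouring rain fell and more pouring rain came".split()
print(collocates(words, "rain", span=1).most_common())
```

Setting span to the length of the text turns the window into the whole-text environment mentioned on the slide.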
Tense and aspect of verbs
• Table 3.14 Rank order of the most frequent
simple and complex finite verb forms
• Table 3.15 Relative frequencies of use of
finite verb forms
• Table 3.16 Perfect and progressive verb
forms in the Brown Corpus
• Table 3.17 Finite and non-finite verb forms
• Table 3.18 past participle
Modals
• Table 3.19 frequency of nine modals
• Table 3.20 use of modals
• Table 3.21 use of modals in verb-phrase
structures
Voice
• Table 3.22 active and passive predications
• Table 3.23 use of passives in different
registers
• Table 3.24 verb-phrase structure of
agentive passives
Verb and particle use
• Subjunctive
• Prepositions
• Conjunctions
Grammatical studies
• Corpus-based grammatical studies revealed
considerable genre differences in the use of
syntactic patterns and in sentence length.
• Syntactic constructions are not in free variation.
• Grammatical study is more of a challenge than
lexical study because the tagging, parsing and
software needed for the automatic analysis of
texts have not been widely available or
user-friendly.
Sentence length
• Sentence length is related to genre. The mean
number of words per sentence in informative
categories is much greater than in imaginative prose.
• There is much closer consistency in the number of
predications per sentence regardless of genre.
• Table 3.41 Sentence length and predications
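Mean sentence length can be estimated with a few lines of code. The sketch below assumes that sentences end at ., ! or ? and that words are separated by whitespace, which is cruder than the segmentation used in published corpus studies; the function name is made up for illustration.

```python
import re

def mean_sentence_length(text):
    """Average number of whitespace-separated words per sentence."""
    # Assumed sentence boundary: any run of ., ! or ?.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    lengths = [len(s.split()) for s in sentences]
    return sum(lengths) / len(lengths) if lengths else 0.0

# Hypothetical two-sentence sample.
print(mean_sentence_length("One two three. Four five."))
```

Running the same function over texts from informative and imaginative categories would allow the genre comparison described above.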
Syntactic processes
• Clause patterning
• Table 3.42 Distribution of recurrent verb-complement patterns
• SVC (adj.) 45%
• SVO 20.9%
Syntactic processes
• About half of the clauses are matrix clauses and
half are embedded. Of the matrix clauses, 97.8%
are finite, 1.5% are nonfinite, and 0.7% are
elliptical.
• The vast majority of all informational subject
clauses are extraposed (it is necessary that),
reflecting a principle of end-focus from a
functional sentence perspective or preferences in
sentence organization for processing purposes.
Syntactic processes
• In informative prose the verb which
precedes a finite that clause is more likely
to be a communication verb such as say,
state, whereas in spoken conversation
affective or cooperative verbs such as think,
feel, hope, tend to predominate.
Noun modification
• 98% of postmodifying clauses had one or other of
the simpler clause patterns SVO (37%), SVO
(38%), SVC (38%), suggesting that embedding
tends to favor less complex sentence patterns.
• 70% of noun phrases function as subjects or
prepositional complements and noun phrases with
postmodifying clauses tend to be disfavoured in
subject functions.
Noun modification
• Postmodification is less frequent in nonfinal
positions of sentences. This is because the
subject or topic is familiar enough not to
need identification or elaboration through
postmodification, or because brief subjects
are easier to process.
Causation
• Causation can be marked lexically
(because, cause), by syntactic structure
(because of) or by implicature.
• The choice of how to express causation is seldom
free, but is influenced by various semantic,
pragmatic, stylistic, cognitive and textual
variables.
Pragmatics
• Table 3.5 Distribution of discourse items
• Comparisons of spoken and written English
• Table 3.58 pretty
• Table 3.59 really, just, right