Using corpus-based methods with young learners


An investigation into Corpus-based learning about language in the primary-school: CLLIP
Corpus evidence of the features of children’s literature
The CLLIP Project: Background

CLLIP: Corpus-based Learning about Language In the Primary-school
ESRC-funded project
Exploring the potential for using corpus evidence with primary school children (9-11 year olds) for learning about language (L1)
Linguistic analysis of CLLIP corpus

The CLLIP corpus is a collection of the texts in the British National Corpus (BNC) that were written for a child audience
The corpus contains imaginative fiction, factual prose and other texts
Linguistic analysis was conducted on the imaginative fiction texts only
Project research question 1

1. Does linguistic analysis of the corpus data confirm, extend or challenge the descriptions of English lexis and syntax which are identified as teaching targets in the National Curriculum and the National Literacy Strategy?
1a. Does any such analysis suggest a need for further research on the basis of a larger dedicated corpus of writing for children?
Corpora: CLLIP and comparison

CLLIP corpus: imaginative fiction written for a child audience, from the BNC (31 texts)
Comparison corpus (hereafter ‘Comp’): imaginative fiction written for an adult audience, from the BNC (315 texts)
Newspaper corpus: newspaper texts from the BNC (114 texts)
Purpose of the linguistic analysis

To determine the characteristic features of the language of imaginative fiction written for children
To compare and contrast the language of these texts with the language of imaginative fiction written for adults, and also the language of newspapers
Questions

What is distinctive about the discourse of the CLLIP corpus?
What similarities and differences are there in the overall frequencies of words and of POSgrams in the three corpora?
Is there a difference in the uses of certain lexical items between the child and adult fiction corpora?
A POSgram is a sequence of parts of speech, such as an article followed by an adjective followed by another adjective then a noun (e.g. a bright red car; the last chocolate biscuit). In this study, we look at 6-grams (sequences of six parts of speech).
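The definition of a POSgram can be illustrated with a short sketch. It assumes the text has already been tagged and is available as (word, tag) pairs; the tag labels, the toy sentence and the function name `posgrams` are hypothetical and are not the BNC tagset or the project's own tooling.

```python
from collections import Counter

def posgrams(tagged_tokens, n=6):
    """Yield every run of n consecutive POS tags as a tuple."""
    tags = [tag for _word, tag in tagged_tokens]
    for i in range(len(tags) - n + 1):
        yield tuple(tags[i:i + n])

# Hypothetical hand-tagged sentence; labels are illustrative only.
sentence = [
    ("at", "PREP"), ("the", "ART"), ("back", "NOUN"),
    ("of", "OF"), ("the", "ART"), ("house", "NOUN"),
    ("stood", "VERB"), ("a", "ART"), ("bright", "ADJ"),
    ("red", "ADJ"), ("car", "NOUN"),
]

# Count 6-grams (as in the study) and, for comparison, 4-grams.
six_gram_counts = Counter(posgrams(sentence, n=6))
four_gram_counts = Counter(posgrams(sentence, n=4))

for gram, count in six_gram_counts.most_common(3):
    print(count, "+".join(gram))
```

An 11-token sentence yields six overlapping 6-grams. A fuller analysis would probably process one sentence at a time so that no POSgram spans a sentence boundary; that choice is an assumption here, not something the slides state.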
Comparison of POS categories for 3 corpora (expressed in percentages)
For each part of speech you can see 3 columns. The first two columns (left and middle) are for the CLLIP and Comp corpora respectively. What is remarkable is the similarity between the two for most parts of speech. There are proportionally many more nouns in the Newspaper corpus, while there are more lexical verbs in the fiction corpora.
Frequency of Parts of Speech (%)

Part of speech    CLLIP   Comparison   Newspaper
Article            7.53         7.82        9.73
Adjective          5.29         5.91        7.96
Adverb             7.95         7.68        4.45
Conjunction        5.62         5.72        4.83
Possessive         2.29         2.69        1.28
Determiner         2.63         2.74        2.28
Noun              15.29        16.60       23.15
Proper noun        4.53         3.89        7.57
Preposition        6.95         7.51        8.92
Pronoun            4.53         3.89        3.49
Infinitive to      1.59         1.68        1.81
Verb 'be'          4.23         4.23        3.64
Verb 'do'          1.03         0.88        0.27
Verb 'have'        1.56         1.87        1.32
Modal verb         1.74         1.65        1.35
Lexical verb      14.51        13.74       10.44
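Percentages of this kind can be derived from a tagged corpus by counting tags and dividing by the total. The sketch below is a minimal illustration on invented tag sequences; the function name `pos_percentages`, the tag labels and the toy data are assumptions and do not reproduce the project's figures.

```python
from collections import Counter

def pos_percentages(tags):
    """Return each tag's share of all tags, as a percentage (2 d.p.)."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: round(100 * c / total, 2) for tag, c in counts.items()}

# Hypothetical tag sequences standing in for two tagged corpora.
toy_fiction = ["ART", "NOUN", "LEXVERB", "ART", "ADJ", "NOUN", "LEXVERB", "ADV"]
toy_news = ["ART", "NOUN", "PREP", "NOUN", "LEXVERB", "ART", "NOUN", "NOUN"]

fiction_pct = pos_percentages(toy_fiction)
news_pct = pos_percentages(toy_news)

# Print the categories side by side, as in the comparison chart.
for tag in sorted(set(fiction_pct) | set(news_pct)):
    print(f"{tag:8} {fiction_pct.get(tag, 0.0):6.2f} {news_pct.get(tag, 0.0):6.2f}")
```

Working in percentages rather than raw counts is what makes corpora of very different sizes comparable.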
Frequency data
CLLIP – 22.0%; Comp – 22.4%; News – 23.5%
The top ten most frequent tokens for the CLLIP and Comp corpora are remarkably similar, particularly the top 4. Note the greater frequency of ‘of’ in the News corpus, which is related to the higher number of nouns, in expressions such as ‘the resignation of’. The figures at the top show the percentage of the overall frequency that the top ten account for in each corpus.
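The share of a corpus accounted for by its most frequent tokens can be computed as sketched below. This is a minimal illustration on an invented sentence; `top_n_share` is a hypothetical helper, and tokenisation by whitespace is a simplifying assumption.

```python
from collections import Counter

def top_n_share(tokens, n=10):
    """Return the n most frequent tokens and their combined share (%)."""
    counts = Counter(tokens)
    total = sum(counts.values())
    top = counts.most_common(n)
    share = 100 * sum(c for _word, c in top) / total
    return top, share

# Hypothetical toy text; a real run would use the whole corpus.
toy_tokens = "the cat sat on the mat and the dog sat on the rug".split()

top, share = top_n_share(toy_tokens, n=2)
print(top, f"{share:.1f}%")
```

The same function applied to a list containing only adjectives, or only nouns, would give the equivalent figure for a single part of speech.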
Frequency - adjectives
CLLIP – 14.6%; Comp – 11.3%; News – 11.9%
Once again, a remarkable similarity exists between the top 11 adjectives for the fiction corpora, while the Newspaper corpus contains many adjectives that refer to social attributes. The figures at the top indicate that the top 11 adjectives in the CLLIP corpus do a larger amount of ‘work’ than those for the other two corpora.
Frequency - nouns
CLLIP – 8.3%; Comp – 7.8%; News – 6.7%
POSgram information
This table shows the most frequent 6-POSgrams for each corpus. For each corpus, the sequence preposition + article + noun + of + article + noun is most common, followed by preposition + article + noun + preposition [not ‘of’] + article + noun in the two fiction corpora.
Prep+art+[ ]+of+art+noun
This slide shows the nouns that most frequently fill the third slot in the preposition + article + noun + of + article + noun sequence. This shows that the sequence most commonly indicates spatial or temporal relations in the fiction corpora, while in the newspaper corpus it can also express causal relations. The top six nouns in the CLLIP corpus account for 51% of the 6-POSgrams of this sequence.
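Finding the nouns that fill the third slot, and the share of the pattern that the most frequent fillers cover, could be sketched as follows. The tag labels, the helper names and the example phrases are all hypothetical; the phrases are invented for illustration and are not drawn from the CLLIP corpus.

```python
from collections import Counter

PATTERN = ("PREP", "ART", "NOUN", "OF", "ART", "NOUN")

def third_slot_fillers(tagged_tokens, pattern=PATTERN, slot=2):
    """Count the words occupying `slot` in every match of `pattern`."""
    fillers = Counter()
    n = len(pattern)
    for i in range(len(tagged_tokens) - n + 1):
        window = tagged_tokens[i:i + n]
        if tuple(tag for _word, tag in window) == pattern:
            fillers[window[slot][0].lower()] += 1
    return fillers

def coverage(fillers, k):
    """Percentage of all matches covered by the k most frequent fillers."""
    total = sum(fillers.values())
    if total == 0:
        return 0.0
    return 100 * sum(c for _word, c in fillers.most_common(k)) / total

def tag_phrase(phrase):
    """Toy tagger: pairs a six-word phrase with PATTERN (illustration only)."""
    return list(zip(phrase.split(), PATTERN))

toy = (
    tag_phrase("in the middle of the night")
    + [("she", "PRON"), ("ran", "VERB")]
    + tag_phrase("at the end of the day")
    + [("and", "CONJ")]
    + tag_phrase("in the middle of the room")
)

fillers = third_slot_fillers(toy)
print(fillers.most_common(), f"{coverage(fillers, 1):.1f}%")
```

With real data, `coverage(fillers, 6)` would correspond to the kind of figure reported for the top six nouns.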
Body parts: NECK
Do nouns in the CLLIP corpus more typically refer to physical entities in the world than the equivalent nouns in the Comp corpus? The two right-hand columns show the percentage of uses of the word ‘neck’ that refer to part of a piece of clothing or are used in an idiomatic sense. The adult corpus contains only a marginally higher percentage of idiomatic uses.
Neck

CLLIP:
‘stick your neck out’
Little physical contact
Intimacy with animals
Neck as site of pain

Comp:
‘breathing down your neck’
Lots of physical contact
Intimacy between humans
Neck as site of desire, tenderness, place for ornamentation
Finger

CLLIP:
Figurative – 13%
Jab, prod, lay, run, put
Accusing, admonishing
Used for drawing, for indicating the need for silence and for pulling triggers

Comp:
Figurative – 19%
Put, raise, point, run, jab, wag
Furtive, tentative, negligent
Used for communicating, for feeling [contours & textures], for wearing rings
in time – CLLIP
We looked at uses of ‘in time’ in the CLLIP corpus. The dominant meaning is immediate, and characters are concerned to accomplish something before the expiry of an implied deadline, externally imposed. A childly perspective seems often to imply staying on the right side of trouble or sanction.
in time – Comp
‘In time’ in the Comp corpus is used in several senses.
i: ‘in the fullness of time’, time on a large scale, which the speaker can perceive from a distance
ii: ‘within an appropriate period of time’
iii: others, as in the last line, where ‘in’ and ‘time’ have more separate meanings than is usual in the phrase
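Sense distinctions like these are usually made by reading each occurrence of the phrase in its context. One common way to gather those occurrences is a keyword-in-context (KWIC) listing; the sketch below is a minimal version, assuming whitespace-tokenised text. The function name `kwic` and the example sentences are hypothetical and are not taken from either corpus.

```python
def kwic(tokens, phrase, width=4):
    """Return (left context, node, right context) for each match of phrase."""
    target = [w.lower() for w in phrase.split()]
    lowered = [t.lower() for t in tokens]
    n = len(target)
    lines = []
    for i in range(len(tokens) - n + 1):
        if lowered[i:i + n] == target:
            left = " ".join(tokens[max(0, i - width):i])
            node = " ".join(tokens[i:i + n])
            right = " ".join(tokens[i + n:i + n + width])
            lines.append((left, node, right))
    return lines

# Invented text containing one 'deadline' use and one 'eventually' use.
toy = "We got home just in time for tea . In time she would forget".split()

hits = kwic(toy, "in time", width=3)
for left, node, right in hits:
    print(f"{left:>20} | {node} | {right}")
```

The listing only collects candidates; deciding which sense each line carries remains a matter of human judgement.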