t-score

ENG 626
CORPUS APPROACHES TO LANGUAGE STUDIES
word-based methodology
Bambang Kaswanti Purwo
» the concordance output
▪ a concordance program arranges all instances of a particular
search item in the center of the page
▪ the search item = the node
▪ the items to the left and to the right of the node = the span
▪ the span can be specified: a span of four or five items to
the left (L) and to the right (R)
▪ four or five items to L and to R – a commonly used range
▪ a span to L “N-1, N-2, etc.” and to R “N+1, N+2, etc.”
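A minimal sketch of the concordance layout described above, assuming a plain-text corpus held in a string. The function names `concordance` and `show` and the sample sentences are invented for illustration; this is not the output of any particular concordance program.

```python
import re

def concordance(text, node, left=4, right=4):
    """Return (left context, node, right context) for every hit of `node`."""
    # Keep alphabetic tokens only, lowercased; punctuation is dropped.
    tokens = re.findall(r"[a-z]+", text.lower())
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            l_ctx = " ".join(tokens[max(0, i - left):i])   # N-4 ... N-1
            r_ctx = " ".join(tokens[i + 1:i + 1 + right])  # N+1 ... N+4
            lines.append((l_ctx, tok, r_ctx))
    return lines

def show(lines, width=30):
    """Print the lines with the node aligned in one column."""
    for l_ctx, node, r_ctx in lines:
        print(f"{l_ctx:>{width}}  {node}  {r_ctx}")

# Invented sample text, for illustration only.
sample = ("She met his gaze without flinching. "
          "He fixed his gaze on the far wall, and the public gaze moved on.")
kwic = concordance(sample, "gaze")
show(kwic)
```

The `left` and `right` parameters correspond to the span to L and to R; setting both to 4 gives the commonly used 4:4 span.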
collocation (Hunston 2005, 68 ff.)
• the tendency of words to be biased in the way they co-occur
▪ motivated: there is a logical explanation for collocation
e.g. toys co-occurs with children more frequently than with women or men
▪ unmotivated: no logical explanation
e.g. strong tea vs. powerful car
• collocation may be observed informally in any instance
of language
• it is more reliable to measure collocation statistically
→ corpus study
▪ a tendency of two words to co-occur
▪ a tendency of one word to attract another
measurements of collocation
• any program which calculates collocation
▪ takes a node word
▪ counts the instances of all words occurring within a
particular span (four words to the L or the R of the node)
(see a 4:4 span of the node word gaze)
▪ punctuation marks are ignored
▪ items marked with 's are counted as separate words
▪ sentence boundaries are ignored
▪ some of the words co-occurring with gaze presumably do so
by chance (e.g. wait … gaze, life … gaze)
▪ others might be said to be “meaningful”
(e.g. penetrating gaze, my/her/his gaze)
▪ [when large quantities of data are calculated]
“chance” collocation becomes insignificant when compared with
“meaningful” collocation
▪ the lines include both gaze [n] and gaze [v]
▪ the search could be restricted:
▫ to gaze [n] only or to gaze [v] only
▫ to the word-form gaze only, or to the whole lemma GAZE
▪ see all instances of gaze [n] from Bank of English
▫ a list of fifteen most frequent collocates (out of 2,864 lines)
▫ the plural is not found in the corpus
▪ the words at the top – all grammatical words
▪ the high frequency: determiners (the, his, her, my, their)
prepositions (from, with, under)
▪ two lexical words in the list:
▫ public in the phrase public gaze
▫ fixed in expressions such as fixed his gaze on, her gaze
was fixed on
▫ a list of raw frequencies:
→ how can a precise degree of importance be attached to any of
the figures?
→ his occurs near the top of the list: is it significant?
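The counting procedure described above can be sketched as follows. This is an illustration under stated assumptions, not the Bank of English software: the tokenizer, the function names `tokenize` and `collocates`, and the sample text are all invented here. It follows the conventions listed above: punctuation is ignored, 's is counted as a separate word, and the span runs across sentence boundaries.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase; drop punctuation; keep the clitic 's as its own token."""
    text = text.lower().replace("\u2019", "'")  # normalise curly apostrophes
    return re.findall(r"'s|[a-z]+", text)

def collocates(tokens, node, span=4):
    """Raw frequency of every word within `span` tokens L or R of the node."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        # The window ignores sentence boundaries, as described above.
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        counts.update(window)
    return counts

# Invented sample text, for illustration only.
sample = "The child's gaze was fixed on the toy. His gaze never moved."
counts = collocates(tokenize(sample), "gaze")
for word, freq in counts.most_common(5):
    print(word, freq)
```

As in the list of fifteen collocates, the grammatical word the comes out on top by raw frequency, which is why a raw frequency list alone cannot show which collocates are significant.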
» how to calculate the significance of each co-occurrence?
• three most common measures of significance are
▪ Mutual Information (MI) score
▪ t-score
▪ z-score
• MI-score and t-score depend on two calculations:
▪ how many instances of the co-occurring word are
found in the designated span?
(the Observed)
▪ how many instances might be expected in that span,
given the frequency of the co-occurring word in the corpus
as a whole? (the Expected)
▪ t-score uses a calculation of standard deviation;
it takes into account
▫ the probability of co-occurrence of the node and its collocates
▫ the number of tokens in the designated span in all lines
▪ t-score is calculated by subtracting Expected from Observed
and dividing the result by the standard deviation
▪ MI-score is the Observed divided by the Expected, converted
to a base-2 logarithm
▪ [the “Lookup” package available with Bank of English corpus]
▫ MI-score indicates the strength of a collocation: it compares
the actual co-occurrence of the two items with their
expected co-occurrence if the words in the corpus
were to occur in totally random order
▪ the MI-score measures the amount of non-randomness
present when two words co-occur
an MI-score of 3 or higher → significant
e.g. ballpoint + pen; distinctly + unenthusiastic, hardly + surprising
▪ words such as baleful and unwavering are more generally
associated with gaze, although they are not particularly
frequent words
▪ if a word occurs rarely, but in most of its few co-occurrences
appears in the proximity of another word, the collocation
between those words will obtain a high MI-score
→ MI is a measure of how strongly two words seem to
associate in a corpus, based on the independent relative
frequency of the two words
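The Observed/Expected calculations and the two scores described above can be sketched as below. Several things here are assumptions, not taken from the slides: the Expected is computed as node frequency × span size × collocate frequency ÷ corpus size; the standard deviation in the t-score is approximated by the square root of the Observed (a common simplification); and all counts are invented, not Bank of English figures. The function names are hypothetical.

```python
import math

def expected(node_freq, colloc_freq, corpus_size, span=8):
    """Co-occurrences expected by chance (assumed formula).

    span = total window size, e.g. 4 to L + 4 to R = 8 slots per node.
    """
    return node_freq * span * colloc_freq / corpus_size

def mi_score(observed, exp):
    """Observed divided by Expected, converted to a base-2 logarithm."""
    return math.log2(observed / exp)

def t_score(observed, exp):
    """(Observed - Expected) / standard deviation.

    Assumption: the standard deviation is approximated by sqrt(Observed).
    """
    return (observed - exp) / math.sqrt(observed)

# Invented counts: a 1,000,000-token corpus, node word occurring 1,000 times.
CORPUS, NODE = 1_000_000, 1_000
# A frequent grammatical collocate vs. a rare lexical collocate (hypothetical).
frequent = {"freq": 10_000, "observed": 400}
rare = {"freq": 20, "observed": 16}

for name, c in (("frequent", frequent), ("rare", rare)):
    e = expected(NODE, c["freq"], CORPUS)
    print(name, "E =", e,
          "MI =", round(mi_score(c["observed"], e), 2),
          "t =", round(t_score(c["observed"], e), 2))
```

With these invented counts the rare collocate gets the higher MI-score while the frequent collocate gets the higher t-score, which is the kind of contrast the notes draw between a rare word such as baleful and a frequent word such as his.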
» the case of baleful and gaze:
knowing the strength of the collocation is not always a reliable
indication of meaningful association
» how certain can we be that the collocation is the result of
more than the vagaries of a particular corpus?
[vagaries = ‘unexpected changes that you cannot control’]
» the calculation of t-score (which takes the amount of
evidence into account) can be used
▪ a t-score of 2 or higher → significant
▪ high t-score in Bank of English corpus: e.g.
things + considered, could + hardly, argument + that,
children + toys
(see the list of 15 collocates of gaze with the highest t-score)
▪ the t-score list is different from
▫ the MI-score list (collocates depending on particular parts
of the corpus)
▫ the raw frequency list
▪ the + gaze does not have a high t-score:
the co-occurs with gaze because it is a frequent word,
not because of its association with gaze
▪ the t-score confirms that the collocation between his + gaze is
not due purely to the high frequency of his but to the lexical
preferences of gaze
▪ his + gaze do not have a high MI-score, because
his also collocates with so many other things,
but there is a lot of evidence of their co-occurrence
(hence they have a high t-score)
» important difference between MI-score and t-score
• MI-score is a measure of strength of collocation
t-score is a measure of certainty of collocation
this is because:
▪ the value of an MI-score is not particularly dependent on
the size of the corpus
▪ for t-score, corpus size is important
(the amount of evidence is being taken into account)
• MI-scores can be compared across corpora, even if the corpora
are of different sizes
• absolute t-scores cannot be compared across corpora
(though it is reasonable to compare t-score ranking)
• in the Bank of English:
the Times corpus is twice the size of the Economist corpus
▪ decided occurs with roughly the same frequency in each
▪ the collocation of decided with to is very comparable in each
corpus in terms of MI-score,
but the t-score in the Times corpus is much higher
→ this does not mean that the collocation decided to is stronger
in the Times corpus:
the Times figure is higher not because the collocation is
stronger, but because the corpus is bigger
(both in Times and Economist, to is the most frequent
word to the right of decided)
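The effect of corpus size on the two scores can be checked with a small calculation. This is a sketch under assumptions not in the slides: the Expected formula, the approximation of the standard deviation by the square root of the Observed, and all counts are invented here (they are not the Times or Economist figures). The second corpus is simply the first with every count doubled, so the relative frequencies are identical.

```python
import math

def scores(observed, node_freq, colloc_freq, corpus_size, span=1):
    """Return (MI-score, t-score) under the assumed formulas.

    span=1 models a single position, e.g. the word immediately to the R.
    """
    exp = node_freq * span * colloc_freq / corpus_size
    mi = math.log2(observed / exp)
    t = (observed - exp) / math.sqrt(observed)  # sd approximated by sqrt(O)
    return mi, t

# Hypothetical smaller corpus.
mi_small, t_small = scores(observed=150, node_freq=200,
                           colloc_freq=25_000, corpus_size=1_000_000)
# Hypothetical corpus twice the size, same relative frequencies.
mi_big, t_big = scores(observed=300, node_freq=400,
                       colloc_freq=50_000, corpus_size=2_000_000)

print("MI:", round(mi_small, 2), "vs", round(mi_big, 2))
print("t: ", round(t_small, 2), "vs", round(t_big, 2))
```

Under these assumptions the MI-score is unchanged when the corpus doubles, while the t-score grows by a factor of the square root of 2. This is consistent with the point that MI-scores can be compared across corpora of different sizes but absolute t-scores cannot.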
• looking at the top collocates
from the viewpoint of t-score
→ info about the grammatical behavior of a word
from the viewpoint of MI-score
→ info about the lexical behavior (particularly, more idiomatic
(“fixed”) co-occurrences)
Hunston Ch. 4
word-based methodology vs. category-based methodology
• to answer different sets of questions
• a useful synergy between the two methodologies
(see Hunston, p. 86, BEGIN vs. START)
• word-based methodology
frequency and key-word lists
collocation
• the use of collocational information (Hunston, p. 76)
LEAK [v]
▪ associated with the physical meaning
▪ metaphoric sense
▪ prepositions and adverbs of direction – important
→ give a semantic profile of LEAK
cf. CAUSE vs. PROVIDE: CAUSE is used with nouns denoting “sth bad”
▪ LEAK in the physical sense can have either
▫ the substance as subject (the oil leaked out)
▫ the container as subject (the tank leaked out)
▪ the metaphoric sense behaves in a similar way,
with an additional pattern: + to
(this is not the case with the physical sense)
→ This can be seen only from the concordance lines
• shoulder
examine the words that occur one, two, and three places
to the left
▪ his – the most significant word immediately to the left
▪ the most significant word occurring two places to L: over
over is used following a range of verbs: LOOK over his
shoulder, GLANCE over his shoulder
▪ shoulder often follows a possessive (R or L)
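Examining the words one, two, and three places to the left of a node, as described above for shoulder, can be sketched as a count per position. The function name `left_position_counts` and the sample sentences are invented for illustration. The sketch ranks by raw frequency only, whereas the notes speak of the most significant word at each position, which would need a t-score or MI-score on top of these counts.

```python
import re
from collections import Counter

def left_position_counts(tokens, node, max_dist=3):
    """Map each distance d (1..max_dist) to a Counter of words at N-d."""
    by_pos = {d: Counter() for d in range(1, max_dist + 1)}
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        for d in range(1, max_dist + 1):
            if i - d >= 0:  # skip positions before the start of the text
                by_pos[d][tokens[i - d]] += 1
    return by_pos

# Invented sample text, for illustration only.
sample = ("He looked over his shoulder. She glanced over his shoulder again. "
          "A hand fell on her shoulder.")
tokens = re.findall(r"[a-z]+", sample.lower())
by_pos = left_position_counts(tokens, "shoulder")

for d in sorted(by_pos):
    print(f"N-{d}:", by_pos[d].most_common(2))
```

In this toy sample his is the most frequent word at N-1 and over at N-2, with verbs such as looked and glanced at N-3, mirroring the pattern LOOK/GLANCE over his shoulder.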