
Measures of association: chi square test, mutual information,
binomial distribution and log likelihood ratio
Lecture 8
1
Experiments in Multidocument
summarization (SNM’02)
Summarization system based on a range of features
Raises issues we have not discussed up to now
Non-extractive techniques
Ordering of information
2
Lead values feature
Lead sentences of news articles can often make
excellent brief summaries
But for multi-document summaries there are
several first sentences, so difficult to choose!
They are information dense
Can we find very informative words based on this
observation
Used a binomial test to decide whether
$P(W \text{ in lead}) > P(W \text{ anywhere})$
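A minimal sketch of how such a test could be run, assuming a one-sided binomial test in which a word's rate anywhere in the corpus is the null probability. The counts and the 0.001 threshold are hypothetical illustrations, not values from the SNM'02 system.

```python
from math import comb

def binomial_upper_tail(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Hypothetical counts for one word (assumed for illustration only).
lead_occurrences = 12   # times the word appears in lead sentences
lead_tokens = 200       # total word tokens in lead sentences
p_anywhere = 0.01       # the word's relative frequency anywhere in the corpus

# Probability of seeing at least this many lead occurrences if the word
# were no more likely in the lead than anywhere else.
p_value = binomial_upper_tail(lead_occurrences, lead_tokens, p_anywhere)
is_lead_word = p_value < 0.001  # assumed significance threshold
print(p_value, is_lead_word)
```

A small p-value suggests the word is over-represented in lead sentences and is a candidate "lead word".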
3
Sample lead words
4
Verb specificity
Compare “arrest” with “do” or “be”
Often given subjects are very strongly associated
with a verb
Actors → appear in movies
Singers → release an album
Compute associations between subject nouns and
verbs
Use mutual association measure
$I(x, y) = \log \frac{P(x, y)}{P(x)\,P(y)} = \log \frac{P(x \mid y)}{P(x)}$
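The measure can be estimated from co-occurrence counts. The sketch below assumes maximum-likelihood probability estimates and a base-2 logarithm (the slide does not fix the base). The subject-verb counts are hypothetical.

```python
from math import log2

def mutual_information(count_xy: int, count_x: int, count_y: int, total: int) -> float:
    """I(x, y) = log2( P(x, y) / (P(x) P(y)) ), estimated from raw counts."""
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return log2(p_xy / (p_x * p_y))

# Hypothetical counts over 1000 subject-verb pairs (assumed for illustration).
total_pairs = 1000
singer = 10            # pairs whose subject is "singer"
release = 20           # pairs whose verb is "release"
singer_release = 8     # pairs with both

score = mutual_information(singer_release, singer, release, total_pairs)

# The second form on the slide, log( P(x | y) / P(x) ), gives the same value.
p_x_given_y = singer_release / release
score_conditional = log2(p_x_given_y / (singer / total_pairs))
print(score, score_conditional)
```

A score of 0 means the subject and verb co-occur exactly as often as independence predicts. Larger positive values indicate a stronger association.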
5
Concept sets
Word frequencies are not that reliable, even when
stemming is used
Synonyms, hypernyms and hyponyms from WordNet
6
Other features
Location: A negative value that penalizes sentences that appear late in the document.
Publication Date: Additional value to the most recent documents, on the assumption that users will want the most up-to-date information.
Target: Indicates the presence of the central personage in the document cluster, if one exists.
Length: A penalty for sentences that are below a minimum (15 words) or above a maximum (30 words). Short sentences often require some introduction or reference resolution, or else are a kind of interjection. Long sentences can cover multiple thoughts that are often found elsewhere in the document cluster in single sentences.
Others: Indicates the presence of any named entity, weighted by the frequency of that entity across all documents.
Pronoun: A negative value on sentences that have pronouns at the beginning of the sentence.
7
Other issues
Sentence ordering
How to present the selected information?
Even good choices might be hard to understand if
they are presented in the wrong order
Imagine a newspaper article with all sentences
randomly permuted
Noun phrases
Depend on the context
8
Extractive summary
9
Partly modified summary
10
Measures of associations
For supervised learning, they can help us determine
which features are predictive of the distinctions we
want to make
Chi square test from last lecture
Words that are likely to appear in the first sentence
rather than anywhere else
Verbs that are strongly associated with a given
subject
A variety of measures are defined in the Chapter 5
reading
11
χ² statistic (CHI)
χ² statistic (pronounced “kai square”)
A commonly used method of comparing proportions.
Measures the lack of independence between a term and
a category
12
χ² statistic (CHI)
Is “jaguar” a good predictor for the “auto” class?
                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500
We want to compare:
the observed distribution above; and
null hypothesis: that jaguar and auto are independent
13
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
If independent: Pr(j,a) = Pr(j) × Pr(a)
So, there would be N × Pr(j,a), i.e. N × Pr(j) × Pr(a)
co-occurrences of “jaguar” and “auto”
Pr(j) = (2+3)/N;
Pr(a) = (2+500)/N;
N = 2+3+500+9500 = 10005
N × (5/N) × (502/N) = 2510/N = 2510/10005 ≈ 0.25
                Term = jaguar   Term ≠ jaguar
Class = auto          2              500
Class ≠ auto          3             9500
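The same calculation extends to every cell: under independence, the expected count is (row total × column total) / N. A short sketch, with the table laid out as rows = class and columns = term; the function name is illustrative.

```python
def expected_counts(table):
    """Expected cell counts under independence for a contingency table."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    return [[r * c / n for c in col_totals] for r in row_totals]

# Rows: class = auto, class != auto. Columns: term = jaguar, term != jaguar.
observed = [[2, 500],
            [3, 9500]]

expected = expected_counts(observed)
for row in expected:
    print([round(e, 2) for e in row])
```

Rounded to two decimals, the four cells come out as 0.25, 501.75, 4.75 and 9498.25.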
14
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
Observed f_o, with expected f_e in parentheses:
                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)           500
Class ≠ auto      3                 9500
15
χ² statistic (CHI)
Under the null hypothesis (jaguar and auto independent):
How many co-occurrences of jaguar and auto do we expect?
Observed f_o, with expected f_e in parentheses:
                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)         500 (502)
Class ≠ auto      3 (4.75)        9500 (9498)
16
χ² statistic (CHI)
χ² sums (f_o − f_e)²/f_e over all table entries:
$\chi^2(j, a) = \sum \frac{(O - E)^2}{E} = \frac{(2 - 0.25)^2}{0.25} + \frac{(3 - 4.75)^2}{4.75} + \frac{(500 - 502)^2}{502} + \frac{(9500 - 9498)^2}{9498} \approx 12.9 \quad (p < .001)$
The null hypothesis is rejected with confidence .999,
since 12.9 > 10.83 (the value for .999 confidence).
Observed f_o, with expected f_e in parentheses:
                Term = jaguar   Term ≠ jaguar
Class = auto      2 (0.25)         500 (502)
Class ≠ auto      3 (4.75)        9500 (9498)
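The full computation can be sketched as follows. Expected counts are derived from the row and column totals without rounding, so the statistic comes out near 12.85 rather than the 12.9 obtained with the rounded expected values. The critical value 10.83 is the one quoted on the slide; the function and constant names are illustrative.

```python
def chi_square(table):
    """Pearson chi-square statistic: sum of (O - E)^2 / E over all cells."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Rows: class = auto, class != auto. Columns: term = jaguar, term != jaguar.
jaguar_auto = [[2, 500],
               [3, 9500]]

CRITICAL_VALUE_999 = 10.83  # from the slide (.999 confidence)

stat = chi_square(jaguar_auto)
reject_independence = stat > CRITICAL_VALUE_999
print(round(stat, 2), reject_independence)
```

Either way, the statistic exceeds 10.83, so the conclusion is unchanged.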
17
χ² statistic (CHI)
There is a simpler formula for χ²:
A = #(t,c)
B = #(t,¬c)
C = #(¬t,c)
D = #(¬t,¬c)
N = A+B+C+D
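With these counts, a standard closed form for a 2×2 table is χ² = N(AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). This is assumed to be the formula the slide refers to. The sketch below evaluates it on the jaguar/auto counts; the function name is illustrative.

```python
def chi_square_2x2(a: int, b: int, c: int, d: int) -> float:
    """Closed-form chi-square for a 2x2 table.

    a = #(t, c), b = #(t, not c), c = #(not t, c), d = #(not t, not c).
    """
    n = a + b + c + d
    return n * (a * d - c * b) ** 2 / ((a + c) * (b + d) * (a + b) * (c + d))

# jaguar/auto example: A = 2, B = 3, C = 500, D = 9500
stat = chi_square_2x2(2, 3, 500, 9500)
print(round(stat, 2))
```

This gives the same value as summing (O − E)²/E over the four cells with unrounded expected counts, without computing the expected counts explicitly.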
18
Finding translation equivalents
19
Binomial distribution
k: number of “successes”
n: number of trials
x: probability of success
$B(k, n, x) = \frac{n!}{k!\,(n - k)!} \, x^k (1 - x)^{n - k}$
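A direct transcription of the formula, using `math.comb` for the binomial coefficient n! / (k!(n − k)!). The function name is illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, x: float) -> float:
    """B(k, n, x): probability of exactly k successes in n trials."""
    return comb(n, k) * x**k * (1 - x) ** (n - k)

# Probability of exactly 2 heads in 4 fair coin flips.
print(binomial_pmf(2, 4, 0.5))

# The probabilities over k = 0..n sum to 1 (up to floating-point error).
total = sum(binomial_pmf(k, 10, 0.3) for k in range(11))
print(total)
```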
20
Log likelihood ratio test
21
Log likelihood ratio test
22