Word Association Norms,
Mutual Information,
and Lexicography
Kenneth Ward Church, Patrick Hanks
Computational Linguistics
March 1990
Abstract
• Word association : (in psycholinguistics)
- doctor … nurse (quicker response !)
• A statistical description of linguistic phenomena
- semantic relations : doctor and nurse
- lexico-syntactic constraints : verbs and prepositions
• Mutual information, Association ratio
- an objective measure obtained from large corpora
Meaning and Association
• Classification of words (in linguistics) : based on
- meanings
- co-occurrence with other words
ex) bank ----  money, notes, loan, account, … , of England
 river, swim, boat , … , of the Rhine
• A search for delicate word classes
- dates back to 1948 : verb patterns (Hornby’s Advanced Learner’s Dictionary)
• Recently available facilities
- computational storage, NLP for large scale text
- possible to see what company our words do keep !
Practical Applications
• Constraining the language model
- speech recognition or OCR
ex) OCR assigned equal prob. to farm and form
… federal ( farm, form ) credit ...
… some ( farm, form ) of ...
• Providing disambiguation cues for parsing
• Retrieving texts from large database : IR
• Enhancing the productivity of lexicographers
Word association and Psycholinguistics
• Reaction-time experiment (in psycholinguistics, 1975)
- classify successive strings of letters as words or not
- pronounce the strings of letters
ex) BREAD --- BUTTER (faster)
NURSE --- BUTTER
• Subjective method (1964)
- empirical estimates for word association norms
- asking a few thousand subjects to write down the next word for each of 200 words
ex) doctor : nurse, sick, health, ...
(70 words)
An Information theoretic measure (1)
• Mutual Information : [Fano1961]
$$ MI(x, y) = \log_2 \frac{p(x, y)}{p(x)\, p(y)} $$
p(x) = f(x)/N , p(x,y) = f_w(x,y)/N
N : size of corpus
(15mill ‘87 AP, 36mill ‘88 AP, 8.6mill tagged corpus)
w : window size(5) : [Table 1]
• MI meaning between two words (x, y)
- genuine association : MI(x,y) >> 0
- no relationship : MI(x,y) ≈ 0
- complementary distribution : MI(x,y) << 0
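A minimal sketch of how the estimate on this slide could be computed from a tokenized corpus. The function names, the toy sentence, and the window convention (y counted when it appears among the w-1 tokens after x) are illustrative assumptions, not details confirmed by the slides.

```python
import math
from collections import Counter

def cooccurrence_counts(tokens, w=5):
    """Return (f, f_w): unigram counts and ordered pair counts.
    f_w[(x, y)] counts y appearing within the w-1 tokens after x (assumed convention)."""
    f = Counter(tokens)
    f_w = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + w]:
            f_w[(x, y)] += 1
    return f, f_w

def mutual_information(x, y, f, f_w, n):
    """MI(x, y) = log2( p(x, y) / (p(x) p(y)) ); None if the pair was never observed."""
    if f_w[(x, y)] == 0:
        return None  # log of zero is undefined, so report "no evidence"
    p_xy = f_w[(x, y)] / n
    return math.log2(p_xy / ((f[x] / n) * (f[y] / n)))

# Toy corpus, purely hypothetical.
tokens = "the doctor and the nurse saw the doctor and the nurse".split()
f, f_w = cooccurrence_counts(tokens, w=5)
n = len(tokens)
mi = mutual_information("doctor", "nurse", f, f_w, n)
```

On a corpus this small the value is not meaningful; the sketch only shows where f(x), f_w(x,y), and N enter the formula.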
An Information theoretic measure (2)
• Association Ratio (AR)
- alternative measure of word association norms
- based on MI
- more objective, less costly than subjective method
- easy to scale up for a large portion of language
ex) doctor -- dentists, nurses, treating, treat, hospitals, ...
• Association Ratio vs. MI
- joint prob. is not symmetric, f(x,y) ≠ f(y,x) : encodes linear precedence
ex) [Table 2] : asymmetries reveal biases ranging from sexism to syntax
- frequency counting method : considering window size
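The asymmetry can be illustrated with ordered window counts. This is a sketch under the assumption that f(x,y) counts y in the w-1 tokens following x; the helper name and the toy text are hypothetical.

```python
from collections import Counter

def ordered_pair_counts(tokens, w=5):
    """Count ordered pairs (x, y) where y occurs within the w-1 tokens after x."""
    counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + w]:
            counts[(x, y)] += 1
    return counts

# Toy text, purely hypothetical.
tokens = "doctors and nurses met ; doctors and nurses agreed".split()
counts = ordered_pair_counts(tokens, w=5)
forward = counts[("doctors", "nurses")]   # "doctors" followed by "nurses"
backward = counts[("nurses", "doctors")]  # "nurses" followed by "doctors"
```

Because the two counts differ, a score built on f(x,y) distinguishes "x before y" from "y before x", which a symmetric joint probability could not.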
Characteristics of association ratio
• Association Ratio
- when large : agrees with the subjective method [Table 3]
- AR threshold = 3.0 : in this paper
- rarely able to observe MI << 0
ex) p(x) = p(y) = 10^-5 , p(x)p(y) = 10^-10
MI << 0 <------- p(x,y) << 10^-10
in fact, can’t observe a prob. less than 10^-7
so, compensate for the window size ! [Table 3]
divide f(x,y) by the window size
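The arithmetic behind this point can be checked directly. The sketch below assumes the 10^-7 figure is roughly 1/N, the relative frequency of a single co-occurrence; the variable names are hypothetical.

```python
import math

p_x = p_y = 1e-5                 # marginal probabilities from the example
chance = p_x * p_y               # expected joint probability under independence: 1e-10
smallest_observable = 1e-7       # assumed to be about 1/N (one co-occurrence in the corpus)

# MI of a pair that was observed exactly once
mi_floor = math.log2(smallest_observable / chance)
```

A single observed co-occurrence already gives about 10 bits, far above zero. A pair that never co-occurs has a count of zero, so its MI cannot be estimated, which is why MI << 0 is rarely observed for words this infrequent.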
Lexico-syntactic regularities
• Identifying phrasal verbs
ex) set up , set off , adhere to , … [Table 4]
• Phrasal verbs involving “to”
- preposition “to” is easily confused with the infinitive marker “to”
- preprocess tag associations using POS-tagged corpus ( 8.6 mill )
ex) prep. “to” -- 768 verbs ( alluding, amounted, relating, … )
infin. “to” -- 551 verbs ( obligated, trying, compelled, … )
• Associations between verbs and arguments
- preprocess the corpus ( 44 mill ) with a parser [Table 5,6]
- collect SVO (subject-verb-object) triples ( 4 mill )
- measure with association ratio
Applications in lexicography (1)
• Large machine-readable corpora
- just recently available
• Computational tools
- still rather primitive : concordancing programs [Fig 1]
• Lexicographers in 80’s
- given the concordances of a word
- mark up senses with colored pens
- write syntactic descriptions and definitions
ex) take, save, from : thousands of concordance lines !!!
save : 666 lines from ‘88 AP corpus
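For readers unfamiliar with concordancing programs, a minimal keyword-in-context sketch follows. It is an illustration only, not the tool shown in [Fig 1]; the function name, the `width` parameter, and the sample text are assumptions.

```python
def concordance(tokens, keyword, width=4):
    """Return keyword-in-context lines: up to `width` tokens on each side of every hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            # Right-align the left context so the keywords line up in a column.
            lines.append(f"{left:>30} [{tok}] {right}")
    return lines

# Toy text, purely hypothetical.
text = "they hope to save the forests and to save the lake from pollution"
lines = concordance(text.split(), "save")
for line in lines:
    print(line)
```

With hundreds of lines per word, output like this is what a lexicographer would have to read and sort by hand.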
Applications in lexicography (2)
• Association between content/function words
- save ~ from : [Table 7]
• Help categorize concordance lines : [Fig 2]
- save ~ from pattern : 65 of the 666 lines
- how well do the 65 lines fit in with all uses of save ?
- Invented semantic tags can be suggested from AR
ex) save the forests [ENV]
save the lake [ENV]
save the planet [ENV]
- help to choose a set of semantic tags
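One way to picture the categorization step is to split concordance lines by whether "from" follows "save" within the window. This is a sketch with a hypothetical helper name and made-up sentences, not the procedure behind [Fig 2].

```python
def follows_within(tokens, first, second, w=5):
    """True if `second` occurs within the w-1 tokens after some occurrence of `first`."""
    return any(
        tok == first and second in tokens[i + 1 : i + w]
        for i, tok in enumerate(tokens)
    )

# Made-up sentences for illustration.
sentences = [
    "volunteers worked to save the lake from pollution",
    "campaigners want to save the forests",
    "a plan to save the planet from warming",
]
with_from = [s for s in sentences if follows_within(s.split(), "save", "from")]
without_from = [s for s in sentences if s not in with_from]
```

The lines in `with_from` would form the save ~ from group; assigning semantic tags such as [ENV] to the objects remains a judgment for the lexicographer.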
Conclusions
• Psycholinguistic notion of word association
• MI ---> association ratio (AR)
• AR encodes very interesting patterns
- semantic relations : doctor ~ nurse
- lexico-syntactic constraints : save ~ from
• AR helps a lexicographer organize concordance lines
• Weak points of Association Ratio
- only distributional evidence : semantics is compositional !
ex) AR favors set ~ for over set ~ down
- extremely superficial
 misses natural similarities : picture ~ photograph
 cannot cluster words into syntactic classes without a tagger or parser