Interpreting noun compounds using paraphrases
András Dobó
University of Oxford
Stephen G. Pulman
University of Oxford
1. Motivation
2. Related work
3. Method
4. Results
5. Summary
6. Future work
Motivation
English is full of noun compounds, which are
sequences of nouns acting as a single noun
Their interpretation is crucial for many NLP
tasks
Using dictionaries is unfeasible
Automated methods
Related work
Statistical approaches
Web queries or large corpora
Two main categories of methods
Inventory based approaches
Small number of abstract relational categories
Criticized for numerous reasons
Paraphrasing approaches
Verbs and prepositions as paraphrases
Water bottle = bottle that is for water
be for
Method
Paraphrasing method
Ranked list of paraphrases for each NC
Uses large corpora to search for paraphrases
Second noun is the head noun
subject = second noun, object = first noun
Validates paraphrases using web queries
Two main approaches in the search of
paraphrases
Subject-paraphrase-object triples
Counts the frequency of all (subject,
paraphrase, object) triples in the corpus
Then for each NC it searches for those
triples, where subject = second noun, object
= first noun
List of suitable paraphrases for each NC
Ranks paraphrases for each NC using a
scoring method based on their frequency
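The triples approach above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names and the toy triples are made up, and a real run would take its triples from a parsed corpus.

```python
from collections import Counter

def count_triples(triples):
    """Count how often each (subject, paraphrase, object) triple occurs."""
    return Counter(triples)

def rank_paraphrases(triple_counts, noun_compound):
    """Return (paraphrase, frequency) pairs for an NC, most frequent first.

    The NC is given as (first_noun, second_noun). The second noun is the
    head and is matched as subject; the first noun is matched as object.
    """
    first_noun, second_noun = noun_compound
    scores = Counter()
    for (subj, para, obj), freq in triple_counts.items():
        if subj == second_noun and obj == first_noun:
            scores[para] += freq
    return scores.most_common()

# Toy, invented triples standing in for the output of a parsed corpus.
toy_triples = [
    ("bottle", "be for", "water"),
    ("bottle", "be for", "water"),
    ("bottle", "contain", "water"),
    ("museum", "be devoted to", "arts"),
]
counts = count_triples(toy_triples)
ranked = rank_paraphrases(counts, ("water", "bottle"))
print(ranked)
```

Here the score is simply the raw triple frequency, which matches the scoring described for this version.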
Subject-paraphrase and paraphrase-object pairs
Counts the frequency of all (subject,
paraphrase) and (paraphrase, object) pairs in
the corpus
Then for each NC it searches for those pairs,
where subject = second noun, object = first
noun
Two lists of paraphrases for each NC
Rank paraphrases for each NC using a
scoring method based on their frequency
Scoring methods
Subject-paraphrase-object-triples version:
Simply the frequency of the relevant (subject,
paraphrase, object) triple
Subject-paraphrase and paraphrase-object pairs version:
Using frequencies is not suitable
The product of the pointwise mutual information of
the relevant (subject, paraphrase) and
(paraphrase, object) pairs
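A rough sketch of the pairs scoring, assuming PMI is estimated from raw pair counts as log2 of p(x, y) / (p(x) p(y)). The function names and toy counts are hypothetical. One assumption to flag: candidates with a non-positive PMI on either side are skipped here, because the product of two negative values would otherwise score highly; the slides do not say how such cases are handled.

```python
import math
from collections import Counter

def pmi_table(pair_counts):
    """Pointwise mutual information (base 2) for every observed (x, y) pair."""
    total = sum(pair_counts.values())
    left, right = Counter(), Counter()
    for (x, y), c in pair_counts.items():
        left[x] += c
        right[y] += c
    return {
        (x, y): math.log2(c * total / (left[x] * right[y]))
        for (x, y), c in pair_counts.items()
    }

def score_paraphrases(sp_counts, po_counts, noun_compound):
    """Score paraphrases by PMI(subject, p) * PMI(p, object), best first."""
    first_noun, second_noun = noun_compound
    sp_pmi = pmi_table(sp_counts)
    po_pmi = pmi_table(po_counts)
    scores = {}
    for (subj, para), pmi_sp in sp_pmi.items():
        if subj != second_noun:
            continue
        pmi_po = po_pmi.get((para, first_noun))
        # Assumption of this sketch: skip non-positive PMI values.
        if pmi_po is None or pmi_sp <= 0 or pmi_po <= 0:
            continue
        scores[para] = pmi_sp * pmi_po
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Toy, invented counts.
sp_counts = Counter({
    ("bottle", "be for"): 6, ("bottle", "contain"): 3, ("bottle", "break"): 1,
    ("window", "break"): 6, ("window", "be for"): 1,
})
po_counts = Counter({
    ("be for", "water"): 5, ("be for", "glass"): 1,
    ("contain", "water"): 3, ("contain", "glass"): 1,
    ("break", "glass"): 5, ("break", "water"): 1,
})
scored = score_paraphrases(sp_counts, po_counts, ("water", "bottle"))
for paraphrase, score in scored:
    print(f"{paraphrase}: {score:.3f}")
```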
Corpora used and their preprocessing
Search for paraphrases:
British National Corpus
100 million words
Grammatical relations from parser
Web 1T 5-gram Corpus
Generated from 1 trillion words of web page text
Grammatical relations from POS patterns
Noun verb determiner noun
Validation of paraphrases:
The Web through Google and Yahoo!
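The POS-pattern step for the n-gram corpus might look like the sketch below. It assumes each n-gram has already been tagged with Penn Treebank-style tags and comes with its corpus frequency; the tag mapping, function names and sample n-grams are all illustrative, and lemmatisation is left out.

```python
from collections import Counter

# Coarse pattern from the slide: noun, verb, determiner, noun.
PATTERN = ("N", "V", "D", "N")

def coarse(tag):
    """Map a Penn Treebank-style tag to a coarse class (assumed tag set)."""
    if tag.startswith("NN"):
        return "N"
    if tag.startswith("VB"):
        return "V"
    if tag == "DT":
        return "D"
    return "O"

def triples_from_ngram(tagged_tokens):
    """Return (subject, verb, object) for each window matching PATTERN."""
    classes = [coarse(tag) for _, tag in tagged_tokens]
    found = []
    for i in range(len(tagged_tokens) - len(PATTERN) + 1):
        if tuple(classes[i:i + len(PATTERN)]) == PATTERN:
            words = [w.lower() for w, _ in tagged_tokens[i:i + len(PATTERN)]]
            found.append((words[0], words[1], words[3]))
    return found

# Invented tagged 5-grams with invented frequencies.
ngrams = [
    ([("bottle", "NN"), ("holds", "VBZ"), ("the", "DT"),
      ("water", "NN"), (".", ".")], 120),
    ([("the", "DT"), ("museum", "NN"), ("displays", "VBZ"),
      ("the", "DT"), ("arts", "NNS")], 45),
    ([("a", "DT"), ("very", "RB"), ("old", "JJ"),
      ("water", "NN"), ("bottle", "NN")], 300),
]
triple_counts = Counter()
for tokens, freq in ngrams:
    for triple in triples_from_ngram(tokens):
        triple_counts[triple] += freq  # weight by the n-gram's frequency
print(triple_counts)
```

Surface patterns like this are cheaper than parsing but noisier, which is in line with the later remark that parsing the corpus should lower the error rate.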
Passive paraphrases
Their surface subject is actually their object
(subject, paraphrase) = (paraphrase2, object)
paraphrase: passive, without preposition
paraphrase2: active version of paraphrase
subject = object
Their frequencies are counted together
Passive paraphrases
(subject, paraphrase, object) =
(subject2, paraphrase2, object2)
paraphrase: passive, with by preposition
paraphrase2: active version of paraphrase,
without preposition
object2 = subject
subject2 = object
Their frequencies are counted together
Such (paraphrase, object) and (subject2,
paraphrase2) pairs are treated the same way
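One way to count passive and active variants together is to rewrite every passive-with-by triple into its active form before counting. The sketch below assumes a hypothetical `Paraphrase` record that already carries the verb lemma, voice and preposition; producing that record (and choosing the active form as the canonical one) is an assumption of this example, not something the slides specify.

```python
from collections import Counter, namedtuple

# Hypothetical representation of a paraphrase: lemma, voice, preposition.
Paraphrase = namedtuple("Paraphrase", ["lemma", "voice", "prep"])

def normalise_triple(subj, para, obj):
    """Rewrite a passive-with-by triple as its active equivalent.

    (subject, passive + by, object) becomes (object, active, subject);
    every other triple is returned unchanged.
    """
    if para.voice == "passive" and para.prep == "by":
        return obj, Paraphrase(para.lemma, "active", None), subj
    return subj, para, obj

def merge_counts(triple_counts):
    """Add up frequencies of triples that normalise to the same form."""
    merged = Counter()
    for (subj, para, obj), freq in triple_counts.items():
        merged[normalise_triple(subj, para, obj)] += freq
    return merged

# Invented counts: "disease is caused by virus" and "virus causes disease".
raw = Counter({
    ("disease", Paraphrase("cause", "passive", "by"), "virus"): 3,
    ("virus", Paraphrase("cause", "active", None), "disease"): 5,
    ("bottle", Paraphrase("be", "active", "for"), "water"): 2,
})
merged = merge_counts(raw)
print(merged[("virus", Paraphrase("cause", "active", None), "disease")])
```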
Patientive ambitransitive verbs
Three main groups of verbs: strictly transitive,
strictly intransitive, ambitransitive
Strictly intransitive verbs have two
subclasses: unergative and unaccusative
Ambitransitive verbs have two subclasses
too: agentive and patientive
Patientive ambitransitive verbs in intransitive
use behave in the same way as passive
verbs, so they are treated the same way
Using synonyms, hypernyms,
sister words etc.
No paraphrases are found for several NCs
Hypothesis: NCs comprising semantically
similar words are interpreted the same way
Using semantically similar words in the
search for paraphrases
Synonyms, hypernyms, sister words from
WordNet
Semantically similar words that are automatically
found with a method proposed by Dekang Lin
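The back-off to similar words could be sketched like this. A small hand-written dictionary stands in for WordNet or an automatically built thesaurus, and `lookup` is a hypothetical callable returning a paraphrase-to-frequency mapping for an NC. Substituting both nouns at once and pooling the counts are choices made for this illustration only.

```python
from collections import Counter

def substitute_compounds(noun_compound, similar_words):
    """All NCs obtained by swapping either noun for a similar word."""
    first, second = noun_compound
    firsts = [first] + similar_words.get(first, [])
    seconds = [second] + similar_words.get(second, [])
    return [(f, s) for f in firsts for s in seconds if (f, s) != noun_compound]

def paraphrases_with_backoff(noun_compound, lookup, similar_words):
    """Use the NC's own paraphrases if any; otherwise pool its substitutes'."""
    found = lookup(noun_compound)
    if found:
        return dict(found)
    pooled = Counter()
    for substitute in substitute_compounds(noun_compound, similar_words):
        pooled.update(lookup(substitute))
    return dict(pooled)

# Invented data: nothing is known about "water bottle" itself.
table = {
    ("juice", "bottle"): {"contain": 3, "be for": 2},
    ("water", "flask"): {"be for": 1},
}
similar = {"water": ["juice"], "bottle": ["flask"]}
result = paraphrases_with_backoff(
    ("water", "bottle"), lambda nc: table.get(nc, {}), similar
)
print(result)
```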
Validation of paraphrases
Some paraphrases are incorrect
Validation is needed
Hypothesis: If a paraphrase is suitable for an
NC, then there should exist at least some
web pages containing the NC paraphrased
by that paraphrase
Validation of paraphrases
Google and Yahoo! queries
Simple queries: “n2Infl THAT p n1Infl”
Extended queries:
Multiple verb tenses
Wildcard characters (up to 9)
Score for each paraphrase is recalculated
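A possible way to generate the query strings is shown below. Several details are assumptions of this sketch: THAT is expanded to "that" and "which", noun inflection is a naive "-s" plural, the caller supplies the already inflected paraphrase forms, and wildcards are placed between the paraphrase and the first noun. Sending the queries to a search engine and reading hit counts is left out.

```python
from itertools import product

def naive_inflections(noun):
    """Singular plus a naive '-s' plural (a real system needs morphology)."""
    return [noun, noun + "s"]

def build_queries(first_noun, second_noun, paraphrase_forms,
                  max_wildcards=0, relativizers=("that", "which")):
    """Build quoted phrase queries of the form "n2 THAT p [* ...] n1"."""
    queries = []
    combos = product(naive_inflections(second_noun), relativizers,
                     paraphrase_forms, naive_inflections(first_noun))
    for n2, rel, para, n1 in combos:
        for k in range(max_wildcards + 1):
            parts = [n2, rel, para] + ["*"] * k + [n1]
            queries.append('"' + " ".join(parts) + '"')
    return queries

queries = build_queries("water", "bottle", ["is for", "are for"],
                        max_wildcards=1)
print(len(queries))
print(queries[0])
print(queries[1])
```

The number of queries grows multiplicatively with each added tense, inflection and wildcard count, which is one practical reason to cap the wildcards.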
Testing and evaluation
Tested on the first 50 NCs of the SemEval-2
Task #9
3 best paraphrases for each NC
5 native speakers recruited for evaluation
They score each paraphrase from 1 to 5
Their agreement was checked using
Krippendorff’s alpha, and it was too low
The (noun compound, paraphrase) pairs with
highest disagreement were omitted
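The agreement check can be illustrated with a small Krippendorff's alpha computation. This sketch uses the interval distance (squared difference) on the 1 to 5 scores; which distance metric the authors used and how they measured per-item disagreement are not stated, so both are assumptions here, as is the toy data.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with the interval (squared difference) metric.

    units: one list of numeric ratings per rated item; items with fewer
    than two ratings are ignored, so missing ratings are allowed.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: squared differences within each item.
    d_o = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1)
        for u in units
    ) / n
    # Expected disagreement: squared differences over all ratings.
    d_e = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if d_e == 0:
        raise ValueError("all ratings are identical; alpha is undefined")
    return 1.0 - d_o / d_e

def drop_most_disputed(units, k):
    """Remove the k items whose ratings differ the most (assumed criterion)."""
    def spread(u):
        return sum((a - b) ** 2 for a in u for b in u)
    order = sorted(range(len(units)), key=lambda i: spread(units[i]),
                   reverse=True)
    worst = set(order[:k])
    return [u for i, u in enumerate(units) if i not in worst]

# Invented ratings: three (NC, paraphrase) pairs, three raters each.
ratings = [[4, 4, 5], [1, 5, 3], [2, 2, 2]]
kept = drop_most_disputed(ratings, 1)
print(round(krippendorff_alpha_interval(ratings), 3))
print(round(krippendorff_alpha_interval(kept), 3))
```

Dropping the most disputed items raises alpha by construction, so the filtered scores describe only the pairs on which the raters broadly agreed.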
Best version
Subject-paraphrase-object-triples version
Web 1T 5-gram Corpus
Combination of two basic versions:
No substitute words
Sister words
Scores are recalculated in a way that favors
paraphrases returned by the first version
Validation: Google, present simple, up to 1
wildcard
Results
Mixed performance
Noun compound    1st rank   2nd rank        3rd rank
arts museum      be of      be devoted to   be for
bird droppings   be in      be for          be
Average scores
Rank of paraphrase   Average score
1st rank             3.1842
2nd rank             2.7687
3rd rank             2.5583
Promising results given the difficulty of the task
Results
Best scoring NCs                      Worst scoring NCs
Noun compound           Avg. score    Noun compound             Avg. score
broadway youngster      4.7500        championship bout         2.0000
cell membrane           4.6000        buddhist philosophy       1.8000
cattle population       4.4000        cell block                1.7500
arts museum             4.3333        banana industry           1.7333
business sector         4.2000        ancestor spirits          1.6000
arts colleges           4.0000        anode loss                1.5000
backwoods protagonist   3.8750        bird droppings            1.2667
antibiotic regimen      3.8667        bow scrape                1.2500
census population       3.8667        activity spectrum         1.0000
business applications   3.7000        altitude reconnaissance   1.0000
Future work
Parsing the Web 1T 5-gram Corpus
Much lower error rate in obtaining the
grammatical relations
Extended validation part
Employing synonyms, hypernyms, sister words or
semantically similar words
Combining the different extensions
Summary
Interpreting noun compounds is crucial for
many NLP tasks
We presented a method for noun compound
interpretation that searches for paraphrases
in large corpora and issues web queries to
validate the results
The results are promising, and could be
further improved
Acknowledgements
Attendance at this workshop was partly
supported by the Hungarian National Office
for Research and Technology within the
framework of the R&D project MASZEKER
(Modell-Alapú Szemantikus Kereső Rendszer
– Model Based Semantic Search System).
Thank you!