Preventing Sexual Unprotected Intercourse
Prenominal Adjective Ordering
Ben Newman, Chris Collette
Motivation
• Leather Old Green Chair
• Ninja Mutant Teenage Turtles
• Moral Irish High Standards
• Sleeping Green Bag
• Green Sleeping Bag
• …and Sexual Unprotected Intercourse
Outline
• Context
• Method
• Considerations
• Memory-based Learning
• Features
• Results
Context
• Prenominal Adjective Ordering
• Statistics based on establishment of semantic classes
• Building off of work by Robert Malouf (2000)
• Ordering on a bigram level
• Sparsity
• Simplicity
• Generally established approximation:
• Size/length/shape < old/new/young < color < nationality < style < gerund < denominal
• A < B means a class A adjective should precede a class B adjective
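The class ordering above can be read as a sort key. The following is a minimal sketch of that idea, not the method used in this work: the class names are shorthand for the groups on the slide ("age" stands for old/new/young), and `LEXICON` is a hypothetical hand-written word list used only for illustration.

```python
# Rank of each semantic class, following the approximation on the slide.
# "age" is shorthand for the old/new/young class.
CLASS_ORDER = ["size", "age", "color", "nationality", "style", "gerund", "denominal"]
RANK = {name: i for i, name in enumerate(CLASS_ORDER)}

# Hypothetical toy lexicon; the word-to-class assignments are illustrative only.
LEXICON = {
    "big": "size",
    "old": "age",
    "green": "color",
    "irish": "nationality",
    "sleeping": "gerund",
    "leather": "denominal",
}

def order_adjectives(adjectives):
    """Sort adjectives by class rank; unknown words go last, keeping their relative order."""
    unknown_rank = len(CLASS_ORDER)
    return sorted(
        adjectives,
        key=lambda adj: RANK.get(LEXICON.get(adj.lower()), unknown_rank),
    )

print(order_adjectives(["leather", "old", "green"]))
print(order_adjectives(["sleeping", "green"]))
```

A lookup table like this only covers words that someone has listed, which is one reason to learn the ordering from corpus statistics instead.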
Method
• Considerations
• Capitalization
• Turned into a feature
• Non-Alphabetic Characters (&eacute; for é)
• Left them in as extra information
• Artificial Frequency of Rare Sequences
• e.g. <Nationality> <adjective> in specific articles
• Removed matching adjacent adjective sequences
• Multi-word adjectives
• Used POS tags as delimiters
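The extraction and cleaning steps above might look roughly like the sketch below. It assumes the corpus is available as `(word, tag)` tuples, that adjectives carry the BNC adjective tags `AJ0`, `AJC` and `AJS`, and that "matching adjacent sequences" means the same pair occurring several times in a row in the extracted list. The last point is one reading of the slide, and both function names are hypothetical.

```python
from itertools import groupby

# BNC adjective tags; treating exactly these three as adjectives is an assumption.
ADJ_TAGS = {"AJ0", "AJC", "AJS"}

def adjective_pairs(tagged_tokens):
    """Return (adj1, adj2) for every two adjacent tokens that are both tagged as adjectives."""
    pairs = []
    for (w1, t1), (w2, t2) in zip(tagged_tokens, tagged_tokens[1:]):
        if t1 in ADJ_TAGS and t2 in ADJ_TAGS:
            pairs.append((w1, w2))
    return pairs

def collapse_adjacent_duplicates(pairs):
    """Keep one copy of each run of identical consecutive pairs."""
    return [pair for pair, _run in groupby(pairs)]

tokens = [("the", "AT0"), ("old", "AJ0"), ("green", "AJ0"), ("chair", "NN1")]
print(adjective_pairs(tokens))
```

Collapsing only consecutive repeats keeps genuine repeated usage elsewhere in the corpus while damping a single article that uses the same rare pair many times.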
Method
• Corpus: British National Corpus
• 100 million words
• 415,731 Adj Adj sequences
• 404,686 sequences after adjacent duplicate removal
• Memory-based Learning
• Tilburg Memory-Based Learner
• Order adjective bigrams based on an array of features
• Everything is either ordered correctly or not
• No precision versus recall
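Memory-based learning stores the training instances and classifies a new instance by its nearest stored neighbours. The sketch below is a plain k-nearest-neighbour classifier with an overlap distance, written to illustrate the idea. It is not the Tilburg Memory-Based Learner's interface, and the labels `"keep"` and `"swap"` are hypothetical names for "ordered correctly" and "not".

```python
from collections import Counter

def overlap_distance(x, y):
    """Number of feature positions at which two instances differ."""
    return sum(1 for a, b in zip(x, y) if a != b)

def knn_classify(memory, query, k=1):
    """memory is a list of (features, label); return the majority label of the k nearest."""
    nearest = sorted(memory, key=lambda item: overlap_distance(item[0], query))[:k]
    votes = Counter(label for _features, label in nearest)
    return votes.most_common(1)[0][0]

# Toy memory: each instance is a tuple of symbolic features for one adjective pair.
memory = [
    (("o", "l", "d", "r", "e", "d"), "keep"),
    (("r", "e", "d", "o", "l", "d"), "swap"),
]
print(knn_classify(memory, ("o", "l", "d", "r", "e", "x")))
```

Because every test pair receives exactly one of two labels, a single accuracy figure summarizes performance, which matches the "no precision versus recall" point above.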
Method
• Features
• Morphological
• Last 8 characters of each Adj. as 16 individual features
• First letter capitalization as well
• Nationality and short word extra information
• Improved test set accuracy by 0.14%
• Brute Force
• Lists of words for semantic classes
• Lowered Accuracy
• Positional Probabilities
• Probability that a word is first in any pair given corpus
• Combination
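The two main feature families can be sketched as follows. The padding symbol, the lowercasing of suffix characters, and the function names are assumptions made for illustration. The slides specify only the last 8 characters of each adjective, a first-letter capitalization feature, and the probability that a word comes first in a pair.

```python
from collections import Counter

PAD = "_"  # assumed padding symbol for adjectives shorter than 8 characters

def suffix_features(adj, width=8):
    """Last `width` characters, one feature per character, left-padded with PAD."""
    tail = adj.lower()[-width:]  # lowercasing is an assumption; case is kept as its own feature
    return list(tail.rjust(width, PAD))

def morphological_features(adj1, adj2):
    """16 character features plus one capitalization flag per adjective."""
    return (
        suffix_features(adj1)
        + suffix_features(adj2)
        + [adj1[:1].isupper(), adj2[:1].isupper()]
    )

def positional_probabilities(pairs):
    """Estimate P(word is first | word occurs in a pair) by relative frequency."""
    first = Counter(a for a, _b in pairs)
    total = Counter(word for pair in pairs for word in pair)
    return {word: first[word] / total[word] for word in total}

print(morphological_features("Irish", "old"))
print(positional_probabilities([("old", "green"), ("big", "old")]))
```

A combined instance could simply concatenate the morphological features with the two positional probabilities, though the slides do not say how the combination was built.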
Results
• Accuracies:
• Morphological: 89.47%
• Positional Probabilities: 89.02%
• Combined: 90.17%
• Analysis
• Accurate
• Exact effects of individual features and considerations difficult to extract
• Less than Malouf’s 91.85%
• Likely due to data cleaning (adjacent sequence removal)
• Data sparsity is a continual problem