Transcript Slide 1
Week 1 Natural Language Processing
Work in partners on lab with NLTK
Brainstorm and start projects using NLP, speech recognition, or both
Week 2 Speech Recognition
Speech lab
Finish projects and short critical reading
Week 3
Present projects
Discuss reading
What is “Natural Language”?
Phonetics – the sounds that make up a word
e.g. "cat" – k a t
Morphology – the rules by which words are composed
e.g. run + ing
Syntax – rules for the formation of grammatical sentences
e.g. "Colorless green ideas sleep furiously."
Not "Colorless ideas green sleep furiously."
Semantics – meaning
e.g. "rose"
Pragmatics – the relationship of meaning to the context, goals, and intent of the speaker
e.g. "Duck!"
Discourse – 'beyond the sentence boundary'
Truly interdisciplinary
Probabilistic methods
APIs
Natural Language Toolkit for Python
Text not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…
import nltk
from nltk.corpus import gutenberg

# assuming "book" is an NLTK corpus reader, e.g. Moby Dick from the Gutenberg corpus
words = gutenberg.words('melville-moby_dick.txt')
bigrams = nltk.bigrams(words)
cfd = nltk.ConditionalFreqDist(bigrams)
pos = nltk.pos_tag(words)  # needs a tagger model, e.g. nltk.download('averaged_perceptron_tagger')
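For instance, the conditional frequency distribution above can be queried for the words that most often follow a given word (a minimal sketch, assuming the Moby-Dick example above and an NLTK version whose FreqDist supports most_common):

cfd['whale'].most_common(5)  # five most frequent words following "whale" in this text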
Token – an instance of a symbol, commonly a word; a linguistic unit
Tokenize – to break a sequence of characters into constituent parts
Often uses a delimiter like whitespace, special characters, or newlines
"The quick brown fox jumped over the log."
"Mr. Brown, we're confused by your article in the newspaper regarding widely-used words."
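A minimal sketch of why the second sentence is harder (assuming NLTK's word_tokenize with the 'punkt' models installed): splitting on whitespace leaves punctuation attached to words, while the NLTK tokenizer separates it.

import nltk  # may require nltk.download('punkt')

sent = ("Mr. Brown, we're confused by your article in the "
        "newspaper regarding widely-used words.")

print(sent.split())              # whitespace split: 'Brown,' and 'words.' keep their punctuation
print(nltk.word_tokenize(sent))  # e.g. ['Mr.', 'Brown', ',', 'we', "'re", 'confused', ...]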
Lexeme – The set of forms taken by a single
word; main entries in a dictionary
ex: run [ruhn] – verb: ran, run, runs, running; noun: run; adjective: runny
Morpheme - the smallest meaningful unit in
the grammar of a language
Unladylike
Dogs
Technique
Sememe – a unit of meaning attached to a
morpheme
Dog - A domesticated carnivorous mammal
-s – A plural marker on nouns
Phoneme - the smallest contrastive unit in
the sound system of a language
/k/ sound in the words kit and skill
/e/ in peg and bread
International Phonetic Alphabet (IPA)
Lexicon - A Vocabulary, a set of a language’s
lexemes
Lexical Ambiguity – multiple alternative linguistic structures can be built for the input
e.g. "I made her duck"
We use POS tagging and word sense disambiguation to ATTEMPT to resolve these issues
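A quick illustration (a sketch only; the exact tags depend on the tagger model and NLTK version): an off-the-shelf tagger commits to a single reading of "duck", even though both the noun and verb readings are possible.

import nltk  # may require nltk.download('averaged_perceptron_tagger')

tokens = nltk.word_tokenize("I made her duck")
print(nltk.pos_tag(tokens))
# one possible output: [('I', 'PRP'), ('made', 'VBD'), ('her', 'PRP$'), ('duck', 'NN')]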
Part of Speech - how a word is used in a
sentence
Grammar – the syntax and morphology of a
natural language
Corpus/Corpora - a body of text which may or
may not include meta-information such as
POS, syntactic structure, and semantics
Concordance – list of the usages of a word in
its immediate context from a specific text
>>> text1.concordance("monstrous")
Collocation – a sequence of words that occur
together unusually often
e.g. red wine
>>> text4.collocations()
Hapax – a word that appears once in a corpus
>>> fdist.hapaxes()
Bigram – sequential pair of words
From the sentence fragment "The quick brown fox…"
("The", "quick"), ("quick", "brown"), ("brown", "fox"), …
Frequency Distribution – tabulation of values
according to how often a value occurs in a
sample
e.g. word frequency in a corpus
word length in a corpus
>>> fdist = FreqDist(samples)
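A minimal sketch of both uses (assuming the Brown corpus is installed via nltk.download('brown')):

import nltk
from nltk.corpus import brown

words = brown.words(categories='news')
word_freq = nltk.FreqDist(w.lower() for w in words)  # how often each word occurs
length_freq = nltk.FreqDist(len(w) for w in words)   # how often each word length occurs

print(word_freq.most_common(5))
print(length_freq.most_common(5))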
Conditional Frequency Distribution –
tabulation of values according to how often a
value occurs in a sample given a condition
e.g. how often is a word tagged as a noun compared to a verb
>>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
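A minimal sketch using the tagged Brown corpus (an assumption; any list of (word, tag) pairs works as tagged_corpus), counting how often each word carries each tag:

import nltk
from nltk.corpus import brown

tagged_words = brown.tagged_words(categories='news')  # (word, tag) pairs
cfd = nltk.ConditionalFreqDist(tagged_words)

print(cfd['run'].most_common())  # tags observed for "run", most frequent first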
POS tagging
Default – tags everything as a noun
Accuracy = 0.13
Regular Expression – Uses a set of regexes to
tag based on word patterns
Accuracy = 0.2
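A minimal sketch of these two baselines (regex patterns adapted from the NLTK book; exact accuracies depend on the corpus and split, and newer NLTK versions rename evaluate() to accuracy()):

import nltk
from nltk.corpus import brown

news_sents = brown.tagged_sents(categories='news')

default_tagger = nltk.DefaultTagger('NN')  # label every token as a noun

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd person singular present
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # everything else is a noun
]
regexp_tagger = nltk.RegexpTagger(patterns)

print(default_tagger.evaluate(news_sents))  # roughly 0.13
print(regexp_tagger.evaluate(news_sents))   # roughly 0.2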
Unigram – learns the best possible tag for an
individual word regardless of context
e.g. a lookup table
NLTK example accuracy = 0.46
Supervised learning
Based on conditional frequency analysis of a
corpus
P (word | tag)
e.g. the probability of observing the word "run" given the tag "verb"
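A minimal sketch of training and scoring a unigram tagger on the Brown corpus (the 90/10 split here is an assumption that mirrors the evaluation slides later; a tagger trained on the full training vocabulary typically scores well above the small lookup-table example quoted above):

import nltk
from nltk.corpus import brown

tagged_sents = brown.tagged_sents(categories='news')
size = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:size], tagged_sents[size:]

# learns the single most likely tag for each word seen in training
unigram_tagger = nltk.UnigramTagger(train_sents)
print(unigram_tagger.evaluate(test_sents))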
N-gram tagger – expands the unigram tagger concept to include the context of the previous N-1 tokens
Including 1 previous token is a bigram tagger
Including 2 previous tokens is a trigram tagger
N-gram taggers use Hidden Markov Models
P (word | tag) * P (tag | previous n tags)
e.g. the probability of the word "run" given the tag "verb", times the probability of the tag "verb" given that the previous tag was "noun"
Tradeoff between coverage and accuracy
Ex. If we train on
('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we',
'PPSS'), ('need', 'VB'), ('1,000', 'CD'),
('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'),
(',', ','), ('where', 'WRB'), ('we', 'PPSS'), ('have',
'HV'), ('200', 'CD'), (';', '.')
Bigrams for "light" are
(('this', 'DT'), ('light', 'NN'))
Trigrams for "light" are
(('In', 'IN'), ('this', 'DT'), ('light', 'NN'))
Try to tag: "Turn on this light"
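A minimal sketch of the coverage problem (reusing the train_sents split from the unigram example above): once the bigram tagger hits a word/context pair it never saw in training, it assigns None, and every later token in the sentence fails too.

import nltk

bigram_tagger = nltk.BigramTagger(train_sents)
print(bigram_tagger.tag("Turn on this light".split()))
# likely output: [('Turn', None), ('on', None), ('this', None), ('light', None)]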
The higher the value of N:
More accurate tagging
Less coverage of unseen phrases
Sparse data problem
Backoff
Primary – Trigram
Secondary – Bigram
Tertiary – Unigram or default
>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
Accuracy = 0.84491179108940495
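That figure comes from scoring the combined tagger on held-out data; a one-line sketch, assuming the train_sents/test_sents split from the unigram example:

>>> t2.evaluate(test_sents)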
Brill
Inductive transformation-based learning
"painting with progressively finer brush strokes"
Supervised learning using a tagged corpus as training data
1. Every word is labeled based on the most
likely tag
ex. Sentence “She is expected to race
tomorrow”
PRO/She VBZ/is VBN/expected TO/to
NN/race NN/tomorrow.
2. Candidate transformation rules are
proposed based on errors made.
Ex. Change NN to VB when the previous tag is
TO
3. The rule that results in the most improved
tagging is chosen and the training data is retagged
PRO/She VBZ/is VBN/expected TO/to VB/race
NN/tomorrow.
nltk.tag.brill.demo()
For efficiency Brill uses templates for rule
creation
e.g. the previous word is tagged Z,
or one of the two preceding words is tagged Z, etc.
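A minimal sketch of training a Brill tagger in current NLTK releases, where the trainer and a standard template set live in nltk.tag.brill_trainer and nltk.tag.brill (this reuses the unigram tagger and corpus split from earlier; the rule count is illustrative):

from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

# start from the unigram tagger's guesses, then learn correction rules from its errors
trainer = BrillTaggerTrainer(unigram_tagger, brill24(), trace=1)
brill_tagger = trainer.train(train_sents, max_rules=10)
print(brill_tagger.evaluate(test_sents))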
More info on POS tagging in Jurafsky ch. 5
available at Bobst
Critical!
Human annotators create a "gold standard"
Inter-annotator agreement
Separating training and test data
90% train 10% test
Confusion Matrix
Predicted / Gold    NN    VB    ADJ
NN                 103    10      7
VB                   3   117      0
ADJ                  9    13     98
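NLTK can build a table like this directly from gold and predicted tag sequences; a minimal sketch, assuming the t2 backoff tagger and test_sents split from earlier:

from nltk.metrics import ConfusionMatrix

# flatten gold tags and the tagger's predictions into parallel lists
gold = [tag for sent in test_sents for (word, tag) in sent]
predicted = [tag for sent in test_sents
             for (word, tag) in t2.tag([w for (w, _) in sent])]

print(ConfusionMatrix(gold, predicted))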
Remember: performance is always limited by
ambiguity in training set and agreement
between human annotators
Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing
IM interface help desk
Break!
And pair up for the lab