

Week 1 Natural Language Processing
 Work in partners on a lab with NLTK
 Brainstorm and start projects using NLP, speech recognition, or both

Week 2 Speech Recognition
 Speech lab
 Finish projects and short critical reading

Week 3
 Present projects
 Discuss reading

What is “Natural Language”?

Phonetics – the sounds which make up a word
e.g. “cat” – k a t

Morphology – the rules by which words are composed
e.g. run + ing

Syntax – rules for the formation of grammatical sentences
e.g. "Colorless green ideas sleep furiously."
Not "Colorless ideas green sleep furiously."

Semantics – meaning
e.g. “rose”

Pragmatics – the relationship of meaning to the context, goals, and intent of the speaker
e.g. “Duck!”

Discourse – 'beyond the sentence boundary'

Truly interdisciplinary
Probabilistic methods
APIs

Natural Language Toolkit (NLTK) for Python
Text, not speech
Corpora, tokenizers, stemmers, taggers, chunkers, parsers, classifiers, clusterers…
words = book.words()                     # all tokens from a corpus reader
bigrams = nltk.bigrams(words)            # sequential word pairs
cfd = nltk.ConditionalFreqDist(bigrams)  # counts of each word's followers
pos = nltk.pos_tag(words)                # part-of-speech tags

Token – an instance of a symbol, commonly a word; a linguistic unit

Tokenize – to break a sequence of characters into constituent parts
Often uses delimiters like whitespace, special characters, and newlines
“The quick brown fox jumped over the log.”
“Mr. Brown, we’re confused by your article in the newspaper regarding widely-used words.”
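The second sentence shows why naive whitespace splitting falls short; a quick comparison (a minimal sketch, assuming NLTK and its tokenizer models are installed):
>>> import nltk
>>> s = "Mr. Brown, we're confused by your article."
>>> s.split()                 # whitespace only: punctuation sticks to words
['Mr.', 'Brown,', "we're", 'confused', 'by', 'your', 'article.']
>>> nltk.word_tokenize(s)     # handles abbreviations, commas, contractions
['Mr.', 'Brown', ',', 'we', "'re", 'confused', 'by', 'your', 'article', '.']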


Lexeme – the set of forms taken by a single word; the main entries in a dictionary
ex: run [ruhn] – verb: ran, run, runs, running; noun: run; adjective: runny




Morpheme - the smallest meaningful unit in
the grammar of a language
Unladylike (un + lady + like)
Dogs (dog + s)
Technique (a single morpheme)



Sememe – a unit of meaning attached to a morpheme
“dog” – a domesticated carnivorous mammal
“-s” – a plural marker on nouns




Phoneme - the smallest contrastive unit in
the sound system of a language
/k/ sound in the words kit and skill
/e/ in peg and bread
International Phonetic Alphabet (IPA)

Lexicon – a vocabulary; the set of a language’s lexemes


Lexical Ambiguity – multiple alternative linguistic structures can be built for the input
e.g. “I made her duck”
We use POS tagging and word sense disambiguation to attempt to resolve these issues
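A small illustration of both tools on the ambiguous sentence (a sketch; pos_tag needs NLTK’s tagger models downloaded, and the tagger must commit to one reading):
>>> import nltk
>>> tokens = nltk.word_tokenize("I made her duck")
>>> nltk.pos_tag(tokens)    # the tagger picks one reading of "duck" (NN vs. VB)
>>> from nltk.wsd import lesk
>>> lesk(tokens, 'duck')    # Lesk picks a WordNet sense by context overlap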

Part of Speech - how a word is used in a
sentence

Grammar – the syntax and morphology of a
natural language


Corpus/Corpora - a body of text which may or
may not include meta-information such as
POS, syntactic structure, and semantics


Concordance – a list of the usages of a word in its immediate context, from a specific text
>>> text1.concordance("monstrous")



Collocation – a sequence of words that occur together unusually often
e.g. red wine
>>> text4.collocations()


Hapax – a word that appears only once in a corpus
>>> fdist.hapaxes()



Bigram – a sequential pair of words
From the sentence fragment “The quick brown fox…”
(“The”, “quick”), (“quick”, “brown”), (“brown”, “fox”)
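Checking this in the interpreter (nltk.bigrams returns a generator, hence the list()):
>>> list(nltk.bigrams("The quick brown fox".split()))
[('The', 'quick'), ('quick', 'brown'), ('brown', 'fox')]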




Frequency Distribution – a tabulation of values according to how often each value occurs in a sample
e.g. word frequency in a corpus
word length in a corpus
>>> fdist = nltk.FreqDist(samples)
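For example, the most frequent words in a corpus (a sketch using the Brown corpus, assuming it has been downloaded; most_common is the NLTK 3 spelling):
>>> from nltk.corpus import brown
>>> fdist = nltk.FreqDist(w.lower() for w in brown.words())
>>> fdist.most_common(3)    # the three most frequent words and their counts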



Conditional Frequency Distribution – a tabulation of values according to how often each value occurs in a sample, given a condition
e.g. how often is a word tagged as a noun compared to a verb
>>> cfd = nltk.ConditionalFreqDist(tagged_corpus)
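Assuming tagged_corpus is a flat list of (word, tag) pairs, each word becomes a condition, so the noun-vs-verb question is a single lookup:
>>> cfd['run']              # a FreqDist of the tags observed for "run"
>>> cfd['run'].freq('VB')   # fraction of the time "run" is tagged VB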

POS tagging


Default – tags everything as a noun
Accuracy ≈ 0.13

Regular Expression – uses a set of regexes to tag based on word patterns
Accuracy ≈ 0.20
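A minimal sketch of both taggers (the patterns follow the NLTK book’s regexp-tagger example; accuracy figures depend on the evaluation corpus):
>>> t_default = nltk.DefaultTagger('NN')
>>> patterns = [
...     (r'.*ing$', 'VBG'),   # gerunds
...     (r'.*ed$', 'VBD'),    # simple past
...     (r'.*s$', 'NNS'),     # plural nouns
...     (r'.*', 'NN'),        # everything else: noun
... ]
>>> t_regexp = nltk.RegexpTagger(patterns)
>>> t_regexp.tag("The horses raced past".split())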




Unigram – learns the best possible tag for an individual word, regardless of context
e.g. a lookup table
NLTK example accuracy = 0.46
Supervised learning



Based on a conditional frequency analysis of a corpus
P(word | tag)
e.g. the probability of the word “run” given the tag “verb”
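A sketch of training and using one, with the Brown news section as the tagged training data:
>>> from nltk.corpus import brown
>>> train_sents = brown.tagged_sents(categories='news')
>>> uni = nltk.UnigramTagger(train_sents)
>>> uni.tag("The race is on".split())   # words never seen in training get tag None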



Ngram tagger – expands the unigram tagger concept to include the context of previous tokens
Including 1 previous token gives a bigram tagger
Including 2 previous tokens gives a trigram tagger

N-gram taggers use Hidden Markov Models

P(word | tag) × P(tag | previous n tags)

e.g. the probability of the word “run” given the tag “verb”, times the probability of the tag “verb” given that the previous tag was “noun”
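Both factors can be estimated from a tagged corpus with conditional frequency distributions (a sketch over the Brown news section; tag names follow the Brown tagset):
>>> from nltk.corpus import brown
>>> tagged = brown.tagged_words(categories='news')
>>> emit = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged)
>>> emit['VB'].freq('run')    # P(word="run" | tag="VB")
>>> trans = nltk.ConditionalFreqDist(nltk.bigrams(tag for (word, tag) in tagged))
>>> trans['NN'].freq('VB')    # P(tag="VB" | previous tag="NN")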

Tradeoff between coverage and accuracy


Ex. If we train on:
('In', 'IN'), ('this', 'DT'), ('light', 'NN'), ('we', 'PPSS'), ('need', 'VB'),
('1,000', 'CD'), ('churches', 'NNS'), ('in', 'IN'), ('Illinois', 'NP'), (',', ','),
('where', 'WRB'), ('we', 'PPSS'), ('have', 'HV'), ('200', 'CD'), (';', '.')




Bigrams for “light” are
(('this', 'DT'), ('light', 'NN'))
Trigrams for “light” are
(('In', 'IN'), ('this', 'DT'), ('light', 'NN'))
Try to tag: “Turn on this light”



The higher the value of N:
More accurate tagging
Less coverage of unseen phrases
Sparse data problem

Backoff
Primary – Trigram
Secondary – Bigram
Tertiary – Unigram or default

>>> t0 = nltk.DefaultTagger('NN')                       # everything is a noun
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)    # fall back to t0 for unseen words
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)     # fall back to t1 for unseen contexts

Accuracy = 0.84491179108940495
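That figure is what an evaluation against held-out sentences reports, roughly as follows (recent NLTK releases rename evaluate() to accuracy()):
>>> t2.evaluate(test_sents)
0.844...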


Brill
Inductive transformation-based learning
“painting with progressively finer brush strokes”
Supervised learning using a tagged corpus as training data



1. Every word is labeled based on the most
likely tag
ex. Sentence “She is expected to race
tomorrow”
PRO/She VBZ/is VBN/expected TO/to
NN/race NN/tomorrow.


2. Candidate transformation rules are
proposed based on errors made.
Ex. Change NN to VB when the previous tag is
TO


3. The rule that results in the most improved
tagging is chosen and the training data is retagged
PRO/She VBZ/is VBN/expected TO/to VB/race
NN/tomorrow.

nltk.tag.brill.demo()
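In NLTK 3 the Brill components live under nltk.tag.brill and nltk.tag.brill_trainer; a minimal sketch of training one on top of a unigram baseline (the fntbl37 template set and the rule count are illustrative choices):
>>> from nltk.tag import brill, brill_trainer
>>> baseline = nltk.UnigramTagger(train_sents, backoff=nltk.DefaultTagger('NN'))
>>> trainer = brill_trainer.BrillTaggerTrainer(baseline, brill.fntbl37())
>>> brill_tagger = trainer.train(train_sents, max_rules=10)
>>> brill_tagger.rules()   # the learned transformations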



For efficiency, Brill uses templates for rule creation
e.g. “the previous word is tagged Z”,
“one of the two preceding words is tagged Z”, etc.

More info on POS tagging in Jurafsky & Martin, ch. 5,
available at Bobst

Critical!
Human annotators create a “gold standard”
Inter-annotator agreement
Separating training and test data
90% train, 10% test
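Following the NLTK book’s convention, the 90/10 split is a simple slice (sketched on the Brown news section):
>>> from nltk.corpus import brown
>>> tagged_sents = brown.tagged_sents(categories='news')
>>> cut = int(len(tagged_sents) * 0.9)
>>> train_sents, test_sents = tagged_sents[:cut], tagged_sents[cut:]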

Confusion Matrix

Predicted \ Gold     NN     VB    ADJ
NN                  103     10      7
VB                    3    117      0
ADJ                   9     13     98
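NLTK can print such a matrix from two flat lists of tags (a sketch reusing the t2 tagger and test_sents from earlier; tag_sents re-tags the held-out words):
>>> untagged = [[w for (w, t) in sent] for sent in test_sents]
>>> gold = [t for sent in test_sents for (w, t) in sent]
>>> pred = [t for sent in t2.tag_sents(untagged) for (w, t) in sent]
>>> print(nltk.ConfusionMatrix(gold, pred))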

Remember: performance is always limited by ambiguity in the training set and by the level of agreement between human annotators

Translation – Google Translate
Spelling and grammar check – Microsoft Word
Conversational interfaces – Wolfram Alpha
Text analysis of online material for marketing
IM-interface help desks


Break!
And pair up for the lab