771Lec07-NLTK
CSCE 771
Natural Language Processing
Lecture 7
NLTK POS Tagging
Topics
Taggers
Rule Based Taggers
Probabilistic Taggers
Transformation Based Taggers - Brill
Supervised learning
Readings: Chapter 5.4-?
February 3, 2011
NLTK tagging
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'),
('completely', 'RB'), ('different', 'JJ')]
CSCE 771 Spring 2011
>>> text = nltk.word_tokenize("They refuse to permit us to obtain the refuse permit")
>>> nltk.pos_tag(text)
[('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit',
'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the',
'DT'), ('refuse', 'NN'), ('permit', 'NN')]
>>> text = nltk.Text(word.lower() for word in nltk.corpus.brown.words())
>>> text.similar('woman')
Building word-context index...
man time day year car moment world family house country child boy
state job way war girl place room word
>>> text.similar('bought')
made said put done seen had found left given heard brought got been
was set told took in felt that
>>> text.similar('over')
in on to of and for with from at by that into as up out down through is
all about
>>> text.similar('the')
a his this their its her an that our any all one these my in your no some
other and
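The similar() calls above rank words by the contexts they share with the target word. A rough Python 3 sketch of that idea follows, using only the standard library. It assumes a context is the (previous word, next word) pair; the function name similar_words and the toy sentence are hypothetical, and this is not NLTK's implementation.

```python
from collections import Counter, defaultdict

def similar_words(tokens, target, num=5):
    """Rank words by how many (left, right) contexts they share with target."""
    contexts = defaultdict(set)          # word -> set of (previous, next) pairs
    for prev, word, nxt in zip(tokens, tokens[1:], tokens[2:]):
        contexts[word].add((prev, nxt))
    target_contexts = contexts[target]
    scores = Counter()
    for word, ctxs in contexts.items():
        shared = len(ctxs & target_contexts)
        if word != target and shared:
            scores[word] = shared
    return [word for word, _ in scores.most_common(num)]

# Toy data, invented for illustration only.
tokens = "the woman sat down . the man sat down . the dog ran off .".split()
print(similar_words(tokens, "woman"))
```

Words of the same part of speech tend to share contexts, which is why the lists above group nouns with nouns and prepositions with prepositions.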
Tagged Corpora
By convention in NLTK, a tagged token is represented as a tuple of the token and its tag.
The function str2tuple() builds such a tuple from the string form 'word/TAG'.
>>> tagged_token = nltk.tag.str2tuple('fly/NN')
>>> tagged_token
('fly', 'NN')
>>> tagged_token[0]
'fly'
>>> tagged_token[1]
'NN'
Specifying Tags with Strings
>>> sent = '''
... The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN of/IN
... other/AP topics/NNS ,/, AMONG/IN them/PPO the/AT Atlanta/NP and/CC
...
... accepted/VBN practices/NNS which/WDT inure/VB to/IN the/AT best/JJT
... interest/NN of/IN both/ABX governments/NNS ''/'' ./.
... '''
>>> [nltk.tag.str2tuple(t) for t in sent.split()]
[('The', 'AT'), ('grand', 'JJ'), ('jury', 'NN'), ('commented',
'VBD'), ('on', 'IN'), ('a', 'AT'), ('number', 'NN'), ... ('.', '.')]
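The parsing step above can be sketched without NLTK. The helper below, split_tagged_token, is a hypothetical name; it splits at the last separator and upper-cases the tag, which is an assumption modeled on how str2tuple() behaves rather than a copy of its code. The sample string reuses a few tokens from the slide.

```python
def split_tagged_token(s, sep="/"):
    """Split 'word/TAG' at the last separator; return (word, None) if there is none."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag.upper())

sent = "The/AT grand/JJ jury/NN commented/VBD on/IN a/AT number/NN ./."
tagged = [split_tagged_token(t) for t in sent.split()]
print(tagged[:3])
```

Splitting at the last separator keeps tokens that themselves contain a slash intact, with only the final segment treated as the tag.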
Reading Tagged Corpora
>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]
tagged_words() method
>>> print nltk.corpus.nps_chat.tagged_words()
[('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
>>> nltk.corpus.conll2000.tagged_words()
[('Confidence', 'NN'), ('in', 'IN'), ('the', 'DT'), ...]
>>> nltk.corpus.treebank.tagged_words()
[('Pierre', 'NNP'), ('Vinken', 'NNP'), (',', ','), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'NP'), ('County', 'N'), ...]
>>> nltk.corpus.treebank.tagged_words(simplify_tags=True)
[('Pierre', 'NP'), ('Vinken', 'NP'), (',', ','), ...]
readme() methods
Table 5.1: Simplified Part-of-Speech Tagset
Tag   Meaning              Examples
ADJ   adjective            new, good, high, special, big, local
ADV   adverb               really, already, still, early, now
CNJ   conjunction          and, or, but, if, while, although
DET   determiner           the, a, some, most, every, no
EX    existential          there, there's
FW    foreign word         dolce, ersatz, esprit, quo, maitre
MOD   modal verb           will, can, would, may, must, should
N     noun                 year, home, costs, time, education
NP    proper noun          Alison, Africa, April, Washington
NUM   number               twenty-four, fourth, 1991, 14:24
PRO   pronoun              he, their, her, its, my, I, us
P     preposition          on, of, at, with, by, into, under
TO    the word to          to
UH    interjection         ah, bang, ha, whee, hmpf, oops
V     verb                 is, has, get, do, make, see, run
VD    past tense           said, took, told, made, asked
VG    present participle   making, going, playing, working
VN    past participle      given, taken, begun, sung
WH    wh determiner        who, which, when, what, where, how
>>> from nltk.corpus import brown
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
>>> tag_fd.keys()
['N', 'P', 'DET', 'NP', 'V', 'ADJ', ',', '.', 'CNJ', 'PRO', 'ADV', 'VD', ...]
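The tag ranking above relies on FreqDist.keys() returning samples in decreasing order of frequency, as it did in the NLTK version used here. In NLTK 3, FreqDist is a Counter subclass and most_common() is the call that gives that ordering. A small Python 3 sketch with collections.Counter, on toy data invented for illustration:

```python
from collections import Counter

# Toy tagged text, invented for illustration (simplified tagset from Table 5.1).
tagged = [("the", "DET"), ("grand", "ADJ"), ("jury", "N"), ("said", "VD"),
          ("the", "DET"), ("county", "N"), ("election", "N")]

tag_counts = Counter(tag for (word, tag) in tagged)
# most_common() orders tags from most to least frequent.
ranked_tags = [tag for tag, count in tag_counts.most_common()]
print(ranked_tags)
```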
Nouns
>>> word_tag_pairs = nltk.bigrams(brown_news_tagged)
>>> list(nltk.FreqDist(a[1] for (a, b) in word_tag_pairs if b[1] == 'N'))
['DET', 'ADJ', 'N', 'P', 'NP', 'NUM', 'V', 'PRO', 'CNJ', '.', ',', 'VG', 'VN', ...]
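The question answered above is "which tags come immediately before a noun?". A minimal Python 3 sketch of the same bigram counting, assuming zip() over the list and its one-step shift as a stand-in for nltk.bigrams; the tagged tokens are toy data:

```python
from collections import Counter

# Toy tagged text, invented for illustration.
tagged = [("the", "DET"), ("grand", "ADJ"), ("jury", "N"), ("praised", "VD"),
          ("the", "DET"), ("city", "N"), ("election", "N"), (".", ".")]

# Pair each token with its successor, like nltk.bigrams.
pairs = zip(tagged, tagged[1:])
# Count the tag of the first item whenever the second item is a noun.
before_noun = Counter(a[1] for (a, b) in pairs if b[1] == "N")
print(before_noun.most_common())
```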
Verbs
>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)
>>> word_tag_fd = nltk.FreqDist(wsj)
>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]
['is/V', 'said/VD', 'was/VD', 'are/V', 'be/V', 'has/V', 'have/V',
'says/V', 'were/VD', 'had/VD', 'been/VN', "'s/V", 'do/V',
'say/V', 'make/V', 'did/VD', 'rose/VD', 'does/V',
'expected/VN', 'buy/V', 'take/V', 'get/V', 'sell/V', 'help/V',
'added/VD', 'including/VG', 'according/VG', 'made/VN',
'pay/V', ...]
>>> cfd1 = nltk.ConditionalFreqDist(wsj)
>>> cfd1['yield'].keys()
['V', 'N']
>>> cfd1['cut'].keys()
['V', 'VD', 'N', 'VN']
>>> cfd2 = nltk.ConditionalFreqDist((tag, word) for (word, tag) in wsj)
>>> cfd2['VN'].keys()
['been', 'expected', 'made', 'compared', 'based', 'priced',
'used', 'sold', 'named', 'designed', 'held', 'fined',
'taken', 'paid', 'traded', 'said', ...]
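A conditional frequency distribution is, in essence, one counter per condition. The Python 3 sketch below builds both directions used above (word to tags, and tag to words) from the same pairs. The helper name conditional_counts and the toy data are assumptions for illustration; this is not NLTK's ConditionalFreqDist.

```python
from collections import Counter, defaultdict

def conditional_counts(pairs):
    """Map each condition to a Counter of the samples seen with it."""
    cfd = defaultdict(Counter)
    for condition, sample in pairs:
        cfd[condition][sample] += 1
    return cfd

# Toy tagged text, invented for illustration.
tagged = [("cut", "V"), ("cut", "N"), ("cut", "VD"), ("cut", "V"),
          ("yield", "N"), ("yield", "V"), ("yield", "V")]

by_word = conditional_counts(tagged)                                # word -> tags
by_tag = conditional_counts((tag, word) for (word, tag) in tagged)  # tag -> words
print(sorted(by_word["cut"]))
print(by_tag["V"].most_common())
```

Swapping the order of each pair is all it takes to turn the word-indexed view into the tag-indexed one.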
>>> [w for w in cfd1.conditions() if 'VD' in cfd1[w] and 'VN' in cfd1[w]]
['Asked', 'accelerated', 'accepted', 'accused', 'acquired',
'added', 'adopted', ...]
>>> idx1 = wsj.index(('kicked', 'VD'))
>>> wsj[idx1-4:idx1+1]
[('While', 'P'), ('program', 'N'), ('trades', 'N'), ('swiftly',
'ADV'), ('kicked', 'VD')]
>>> idx2 = wsj.index(('kicked', 'VN'))
>>> wsj[idx2-4:idx2+1]
[('head', 'N'), ('of', 'P'), ('state', 'N'), ('has', 'V'), ('kicked',
'VN')]
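The context lookup above can be wrapped in a small helper. The Python 3 sketch below uses a hypothetical function context_before and toy data; it also clamps the slice start at zero, since a negative start index would otherwise silently return the wrong window when the match is within the first few tokens.

```python
# Toy tagged text, invented for illustration.
tagged = [("they", "PRO"), ("swiftly", "ADV"), ("kicked", "VD"), ("it", "PRO"), (".", "."),
          ("he", "PRO"), ("has", "V"), ("kicked", "VN"), ("it", "PRO"), (".", ".")]

def context_before(tagged_text, target, width=4):
    """Return the first occurrence of target plus up to `width` preceding tokens."""
    idx = tagged_text.index(target)      # raises ValueError if target is absent
    return tagged_text[max(0, idx - width):idx + 1]

print(context_before(tagged, ("kicked", "VD")))
print(context_before(tagged, ("kicked", "VN")))
```

In the toy data, as in the slide, the past participle (VN) follows a form of "have" while the simple past (VD) does not.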
def findtags(tag_prefix, tagged_text):
    cfd = nltk.ConditionalFreqDist((tag, word) for (word, tag) in tagged_text
                                   if tag.startswith(tag_prefix))
    return dict((tag, cfd[tag].keys()[:5]) for tag in cfd.conditions())
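The slice keys()[:5] in findtags depends on keys() being a frequency-sorted list, which does not hold for Python 3 dictionaries or for NLTK 3's FreqDist. A possible Python 3 equivalent, sketched with the standard library and toy data (the num parameter is an addition for illustration):

```python
from collections import Counter, defaultdict

def findtags(tag_prefix, tagged_text, num=5):
    """Map each tag starting with tag_prefix to its `num` most frequent words."""
    cfd = defaultdict(Counter)
    for word, tag in tagged_text:
        if tag.startswith(tag_prefix):
            cfd[tag][word] += 1
    return {tag: [w for w, _ in counts.most_common(num)] for tag, counts in cfd.items()}

# Toy tagged text, invented for illustration.
tagged = [("year", "NN"), ("year", "NN"), ("time", "NN"), ("years", "NNS"), ("said", "VBD")]
for tag, words in sorted(findtags("NN", tagged).items()):
    print(tag, words)
```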
Reading URLs
NLTK book 3.1
>>> from urllib import urlopen
>>> url = "http://www.gutenberg.org/files/2554/2554.txt"
>>> raw = urlopen(url).read()
>>> type(raw)
<type 'str'>
>>> len(raw)
1176831
>>> raw[:75]
http://docs.python.org/2/library/urllib2.html
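The session above is Python 2. In Python 3, urlopen lives in urllib.request and read() returns bytes, so the text has to be decoded before tokenizing. A minimal sketch follows; the helper name fetch_text and the UTF-8 default are assumptions, and the right encoding depends on the file being fetched.

```python
from urllib.request import urlopen

def fetch_text(url, encoding="utf-8"):
    """Download a URL and decode the response bytes into a str."""
    with urlopen(url) as response:
        return response.read().decode(encoding)

# Usage (requires network access), with the URL from the slide:
# raw = fetch_text("http://www.gutenberg.org/files/2554/2554.txt")
```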
>>> tokens = nltk.word_tokenize(raw)
>>> type(tokens)
<type 'list'>
>>> len(tokens)
255809
>>> tokens[:10]
['The', 'Project', 'Gutenberg', 'EBook', 'of', 'Crime', 'and', 'Punishment', ',', 'by']
Dealing with HTML
>>> url = "http://news.bbc.co.uk/2/hi/health/2284783.stm"
>>> html = urlopen(url).read()
>>> html[:60]
'<!doctype html public "-//W3C//DTD HTML 4.0 Transitional//EN'
>>> raw = nltk.clean_html(html)
>>> tokens = nltk.word_tokenize(raw)
>>> tokens
['BBC', 'NEWS', '|', 'Health', '|', 'Blondes', "'", 'to', 'die', 'out', ...]
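nltk.clean_html() belongs to the NLTK version used in this lecture; NLTK 3 no longer implements it and directs users to an HTML library such as BeautifulSoup. As a rough stand-in, the Python 3 sketch below strips tags with the standard library's html.parser. The class name TextExtractor, the helper strip_html, and the sample markup are hypothetical, and real pages usually need more careful handling.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)

def strip_html(html):
    """Return the visible text of an HTML string with whitespace normalized."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())

# Sample markup, invented for illustration.
html = ("<html><head><title>BBC NEWS</title><style>p {color: red}</style></head>"
        "<body><p>Blondes 'to die out'</p></body></html>")
print(strip_html(html))
```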
Chap 2 Brown corpus
>>> from nltk.corpus import brown
>>> brown.categories()
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor',
'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
>>> brown.words(categories='news')
['The', 'Fulton', 'County', 'Grand', 'Jury', 'said', ...]
>>> brown.words(fileids=['cg22'])
['Does', 'our', 'society', 'have', 'a', 'runaway', ',', ...]
>>> brown.sents(categories=['news', 'editorial', 'reviews'])
[['The', 'Fulton', 'County'...], ['The', 'jury', 'further'...], ...]
Freq Dist
>>> from nltk.corpus import brown
>>> news_text = brown.words(categories='news')
>>> fdist = nltk.FreqDist([w.lower() for w in news_text])
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> for m in modals:
...     print m + ':', fdist[m],
...
can: 94 could: 87 may: 93 might: 38 must: 53 will: 389
>>> fdist1 = FreqDist(text1)
>>> fdist1
<FreqDist with 260819 outcomes>
>>> vocabulary1 = fdist1.keys()
>>> vocabulary1[:50]
[',', 'the', '.', 'of', 'and', 'a', 'to', ';', 'in', 'that', "'", '-', 'his',
'it', 'I', 's', 'is', 'he', 'with', 'was', 'as', '"', 'all', 'for', 'this',
'!', 'at', 'by', 'but', 'not', '--', 'him', 'from', 'be', 'on', 'so',
'whale', 'one', 'you', 'had', 'have', 'there', 'But', 'or',
'were', 'now', 'which', '?', 'me', 'like']
>>> fdist1['whale']
906
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
>>> genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
>>> modals = ['can', 'could', 'may', 'might', 'must', 'will']
>>> cfd.tabulate(conditions=genres, samples=modals)
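The tabulate() call prints one row per condition (genre) and one column per sample (modal). A minimal Python 3 sketch of that layout is given below. The function returns a string rather than printing, the column widths are a simplification, and the (genre, word) pairs are toy data, not Brown corpus counts.

```python
from collections import Counter, defaultdict

def tabulate(cfd, conditions, samples):
    """Return a plain-text table: one row per condition, one column per sample."""
    width = max(len(s) for s in samples) + 1
    label_width = max(len(c) for c in conditions)
    lines = [" " * label_width + "".join(s.rjust(width) for s in samples)]
    for c in conditions:
        row = "".join(str(cfd[c][s]).rjust(width) for s in samples)
        lines.append(c.rjust(label_width) + row)
    return "\n".join(lines)

# Toy (genre, word) pairs, invented for illustration.
pairs = [("news", "will"), ("news", "will"), ("news", "can"),
         ("romance", "could"), ("romance", "will")]
cfd = defaultdict(Counter)
for genre, word in pairs:
    cfd[genre][word] += 1

print(tabulate(cfd, conditions=["news", "romance"], samples=["can", "could", "will"]))
```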
Table 2.2:
Some of the Corpora and Corpus
Samples Distributed with NLTK
Table 2.3 Basic Corpus Functionality
Example                     Description
fileids()                   the files of the corpus
fileids([categories])       the files of the corpus corresponding to these categories
categories()                the categories of the corpus
categories([fileids])       the categories of the corpus corresponding to these files
raw()                       the raw content of the corpus
raw(fileids=[f1,f2,f3])     the raw content of the specified files
raw(categories=[c1,c2])     the raw content of the specified categories
words()                     the words of the whole corpus
words(fileids=[f1,f2,f3])   the words of the specified fileids
words(categories=[c1,c2])   the words of the specified categories
sents()                     the sentences of the whole corpus
sents(fileids=[f1,f2,f3])   the sentences of the specified fileids
sents(categories=[c1,c2])   the sentences of the specified categories
abspath(fileid)             the location of the given file on disk
...
def generate_model(cfdist, word, num=15):
    for i in range(num):
        print word,
        word = cfdist[word].max()

text = nltk.corpus.genesis.words('english-kjv.txt')
bigrams = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams)
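generate_model repeatedly replaces the current word with its most frequent successor. A Python 3 sketch of the same loop using only the standard library is shown below. It returns the words instead of printing them and stops early when a word has no recorded successor; both are choices made for this illustration, and the text is a toy phrase rather than the Genesis corpus.

```python
from collections import Counter, defaultdict

def generate_model(cfd, word, num=15):
    """Follow the most likely next word `num` times, starting from `word`."""
    output = []
    for _ in range(num):
        output.append(word)
        if not cfd[word]:                # no observed successor: stop early
            break
        word = cfd[word].most_common(1)[0][0]
    return output

# Toy text, invented for illustration.
text = "in the beginning god created the heaven and the earth and the earth".split()
cfd = defaultdict(Counter)
for first, second in zip(text, text[1:]):
    cfd[first][second] += 1

print(" ".join(generate_model(cfd, "and", num=6)))
```

Because the choice is always the single most likely successor, the output quickly falls into a loop, as it does on this toy text.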
Example 2.5 (code_random_text.py)
Table 2.4
Example                                Description
cfdist = ConditionalFreqDist(pairs)    create a conditional frequency distribution from a list of pairs
cfdist.conditions()                    alphabetically sorted list of conditions
cfdist[condition]                      the frequency distribution for this condition
cfdist[condition][sample]              frequency for the given sample for this condition
cfdist.tabulate()                      tabulate the conditional frequency distribution
cfdist.tabulate(samples, conditions)   tabulation limited to the specified samples and conditions
cfdist.plot()                          graphical plot of the conditional frequency distribution
cfdist.plot(samples, ...               graphical plot limited to the ...
Example 5.2 (code_findtags.py)
highly ambiguous words
>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)
>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)
>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)
...
best ADJ ADV NP V
better ADJ ADV V DET
….
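The loop above lists words that occur with more than three distinct tags. A Python 3 sketch of the same filter with the standard library follows; the helper name ambiguous_words and its more_than parameter are assumptions for illustration, and the tagged tokens are toy data.

```python
from collections import defaultdict

def ambiguous_words(tagged_text, more_than=3):
    """Return {word: sorted tags} for lower-cased words seen with more than `more_than` tags."""
    tags_by_word = defaultdict(set)
    for word, tag in tagged_text:
        tags_by_word[word.lower()].add(tag)
    return {w: sorted(t) for w, t in tags_by_word.items() if len(t) > more_than}

# Toy tagged text, invented for illustration.
tagged = [("Best", "ADJ"), ("best", "ADV"), ("best", "NP"), ("best", "V"),
          ("better", "ADJ"), ("better", "ADV"), ("the", "DET")]
for word, tags in sorted(ambiguous_words(tagged).items()):
    print(word, " ".join(tags))
```

Lower-casing before counting merges sentence-initial and mid-sentence occurrences, as word.lower() does in the session above.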