CSA3050: Natural Language
Processing
Tagging 1
• Tagging
• POS and Tagsets
• Ambiguities
• NLTK
February 2007
CSA3050: Tagging I
1
Tagging 1 Lecture
• Slides based on Mike Rosner and Marti
Hearst notes
• Diane Litman’s version of Steven Bird’s
notes
• Additions from NLTK tutorials
Tagging
Mr. Sherlock Holmes, who was
usually very X, …
What is the part of speech of X ?
Tagging
Mr. Sherlock Holmes, who was
usually very late/ADJ in the
mornings, save upon those not
infrequent occasions when he
was up all night, was Y
What is the part of speech of Y ?
Tagging
Mr. Sherlock Holmes, who was
usually very late in the mornings,
save upon those not infrequent
occasions when he was up all
night, was seated/VBN at the
breakfast table
Tagging Terminology
• Tagging
– The process of associating labels with
each token in a text
• Tags
– The labels
• Tag Set
– The collection of tags used for a
particular task
Tagging Example
Typically a tagged text is a sequence of whitespace-separated base/tag tokens:
The/at Pantheon’s/np interior/nn ,/, still/rb in/in
its/pp original/jj form/nn ,/, is/bez truly/ql
majestic/jj and/cc an/at architectural/jj triumph/nn
./. Its/pp rotunda/nn forms/vbz a/at perfect/jj
circle/nn whose/wp diameter/nn is/bez equal/jj
to/in the/at height/nn from/in the/at floor/nn to/in
the/at ceiling/nn ./.
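A minimal sketch of reading this format in plain Python, assuming each token is written as word/tag and that an untagged token should be kept with a tag of None. The function name parse_tagged is illustrative and does not come from NLTK.

```python
def parse_tagged(text):
    """Split whitespace-separated word/tag tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST slash so a word that itself contains "/" stays intact.
        word, sep, tag = token.rpartition("/")
        if not sep:
            # No slash at all: treat the token as untagged (an assumption).
            word, tag = token, None
        pairs.append((word, tag))
    return pairs


sample = "The/at Pantheon's/np interior/nn ,/, still/rb in/in its/pp original/jj form/nn ./."
tagged = parse_tagged(sample)
print(tagged[:3])
```

Splitting on the last slash matters for punctuation tokens such as ,/, and ./. where the word and the tag are the same character.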
What does tagging do?
1. Collapses Some Distinctions
• Lexical identity may be discarded
• e.g. all personal pronouns tagged with PRP
2. …But Introduces Others
• Ambiguities may be removed
• e.g. deal tagged with NN or VB
• e.g. deal tagged with DEAL1 or DEAL2
3. Helps classification and prediction
Parts of Speech (POS)
• A word’s POS tells us a lot about the word
and its neighbors:
– Limits the range of meanings (deal), pronunciation
(OBject vs obJECT) or both (wind)
– Helps in stemming
– Limits the range of following words for Speech
Recognition
– Can help select nouns from a document for IR
– Basis for partial parsing (chunked parsing)
– Parsers can build trees directly on the POS tags
instead of maintaining a lexicon
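As an illustration of the IR point above, the sketch below keeps only the nouns from a tagged sentence. It assumes Penn Treebank tags, where every noun tag begins with NN; the helper name select_nouns is hypothetical.

```python
# A Penn Treebank-tagged sentence as (word, tag) pairs.
tagged_sentence = [
    ("The", "DT"), ("grand", "JJ"), ("jury", "NN"), ("commented", "VBD"),
    ("on", "IN"), ("a", "DT"), ("number", "NN"), ("of", "IN"),
    ("other", "JJ"), ("topics", "NNS"), (".", "."),
]


def select_nouns(tagged_tokens):
    """Return the words whose tag starts with NN (NN, NNS, NNP, NNPS)."""
    return [word for word, tag in tagged_tokens if tag.startswith("NN")]


nouns = select_nouns(tagged_sentence)
print(nouns)
```

The selected nouns could then serve as candidate index terms for a document.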
POS and Tagsets
• The choice of tagset greatly affects the
difficulty of the problem
• Need to strike a balance between
– Getting better information about context
(best: introduce more distinctions)
– Making it possible for classifiers to do their job
(need to minimize distinctions)
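The trade-off can be made concrete by collapsing a fine-grained tagset into a coarser one. The sketch below maps some Penn Treebank noun, verb, and adjective tags onto coarse labels; the coarse labels N, V, and ADJ and the COARSE table are assumptions made for illustration, not a standard tagset.

```python
# Hypothetical mapping from fine Penn Treebank tags to coarse labels.
COARSE = {
    "NN": "N", "NNS": "N", "NNP": "N", "NNPS": "N",
    "VB": "V", "VBD": "V", "VBG": "V", "VBN": "V", "VBP": "V", "VBZ": "V",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
}


def coarsen(tag):
    """Map a fine tag to its coarse label; unknown tags pass through unchanged."""
    return COARSE.get(tag, tag)


fine_tags = ["DT", "JJ", "NN", "VBD", "IN", "DT", "NN", "IN", "JJ", "NNS", "."]
coarse_tags = [coarsen(tag) for tag in fine_tags]

# Fewer distinct labels means an easier classification problem,
# but the singular/plural and tense distinctions are lost.
print(len(set(fine_tags)), len(set(coarse_tags)))
```

Going the other way, from coarse to fine, is not possible without looking at the words again, which is why the choice of tagset has to be made before annotation.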
Common Tagsets
• Brown corpus: 87 tags
• Penn Treebank: 45 tags
• Lancaster UCREL C5 (used to tag the
British National Corpus - BNC): 61 tags
• Lancaster C7: 145 tags
Brown Corpus
• The first digital corpus (1961)
– Francis and Kucera, Brown University
• Contents: 500 texts, each 2000 words long
– From American books, newspapers,
magazines
– Representing genres:
• Science fiction, romance fiction, press
reportage, scientific writing, popular lore
Penn Treebank
• First syntactically annotated
corpus
• 1 million words from Wall Street
Journal
• Part of speech tags and syntax
trees
Penn Treebank
The/DT grand/JJ jury/NN commented/VBD
on/IN a/DT number/NN of/IN other/JJ
topics/NNS ./.
VB    DT    NN      .
Book  that  flight  .
VBZ   DT    NN      VB     NN      ?
Does  that  flight  serve  dinner  ?
Penn Treebank – Important Tags
Penn Treebank – Verb Tags
Penn Treebank Example
(S (NP-SBJ-1 (DT The)
             (NNP Senate))
   (VP (VBZ plans)
       (S (NP-SBJ (-NONE- *-1))
          (VP (TO to)
              (VP (VB take)
                  (PRT (RP up))
                  (NP (DT the)
                      (NN measure))
                  (ADVP-TMP (RB quickly))))))
   (. .))
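A bracketed tree like the one above can be read with a few lines of plain Python. This is a minimal sketch that assumes a well-formed bracketing and represents each constituent as a nested list [label, child, ...]; the function names are illustrative and are not the Treebank's or NLTK's own readers.

```python
def tokenize(s):
    """Separate parentheses from labels and words."""
    return s.replace("(", " ( ").replace(")", " ) ").split()


def parse(tokens):
    """Parse one bracketed constituent into nested lists: [label, child, ...]."""
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = [tokens[pos]]          # the constituent label, e.g. NP-SBJ-1
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())   # nested constituent
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                      # consume the closing ")"
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing material after the tree")
    return tree


def words(node):
    """Collect the words left to right, skipping empty -NONE- elements."""
    if len(node) == 2 and isinstance(node[1], str):
        return [] if node[0] == "-NONE-" else [node[1]]
    result = []
    for child in node[1:]:
        if isinstance(child, list):
            result.extend(words(child))
    return result


TREE = ("(S (NP-SBJ-1 (DT The) (NNP Senate)) (VP (VBZ plans) "
        "(S (NP-SBJ (-NONE- *-1)) (VP (TO to) (VP (VB take) (PRT (RP up)) "
        "(NP (DT the) (NN measure)) (ADVP-TMP (RB quickly)))))) (. .))")

tree = parse(tokenize(TREE))
sentence = words(tree)
print(" ".join(sentence))
```

The -NONE- node is a trace (*-1) that marks the understood subject of "to take up"; it has no surface word, so the sketch drops it when recovering the sentence.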
Tagging
• Typically the set of tags is larger
than basic parts of speech
• Tags often contain some
morphological information
• Often referred to as
“morphosyntactic labels”
Tagging Ambiguities
N      N-V    V-IN  DT  N
FRUIT  FLIES  LIKE  A   BANANA
Interpretation 1
(S (NP (N FRUIT) (N FLIES))
   (VP (V LIKE)
       (NP (DT A) (N BANANA))))
Interpretation 2
(S (NP (N FRUIT))
   (VP (V FLIES)
       (PP (IN LIKE)
           (NP (DT A) (N BANANA)))))
Lots of ambiguities…
1. He can can a can.
2. I can light a fire and you can
open a can of beans. Now the
can is open, and we can eat in
the light of the fire.
Lots of ambiguities…
• In the Brown Corpus
– 11.5% of word types are ambiguous
– 40% of word tokens are ambiguous
• Most words in English are unambiguous.
• Many of the most common words are
ambiguous.
• Typically ambiguous tags are not equally
probable.
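The difference between type ambiguity and token ambiguity can be seen on a toy corpus. The sketch below uses two short sentences that were hand-tagged with Penn Treebank tags for this illustration (the tagging is an assumption, not corpus data) and counts how many word types, and how many word tokens, occur with more than one tag.

```python
from collections import Counter, defaultdict

# Hand-tagged toy corpus (illustrative only).
corpus = [
    ("He", "PRP"), ("can", "MD"), ("can", "VB"), ("a", "DT"), ("can", "NN"), (".", "."),
    ("I", "PRP"), ("can", "MD"), ("light", "VB"), ("a", "DT"), ("fire", "NN"), (".", "."),
]


def ambiguity(tagged_tokens):
    """Return (type ambiguity rate, token ambiguity rate, tag counts per type)."""
    tags_by_type = defaultdict(Counter)
    for word, tag in tagged_tokens:
        tags_by_type[word.lower()][tag] += 1
    # A type is ambiguous if it was seen with more than one tag.
    ambiguous = {w for w, counts in tags_by_type.items() if len(counts) > 1}
    n_tokens = sum(sum(counts.values()) for counts in tags_by_type.values())
    n_ambiguous_tokens = sum(sum(tags_by_type[w].values()) for w in ambiguous)
    return len(ambiguous) / len(tags_by_type), n_ambiguous_tokens / n_tokens, tags_by_type


type_rate, token_rate, tags_by_type = ambiguity(corpus)
print(f"ambiguous types:  {type_rate:.1%}")
print(f"ambiguous tokens: {token_rate:.1%}")
print("tags seen for 'can':", tags_by_type["can"].most_common())
```

Only one of the seven types ("can") is ambiguous, yet it accounts for a third of the tokens, which mirrors the pattern in the Brown Corpus figures: few types are ambiguous, but they are frequent. The per-type tag counts also show that the candidate tags are not equally likely.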
Lots of ambiguities…
Brown Corpus (Table: DeRose, 1988)
Unambiguous (1 tag):    35,340 types
Ambiguous (2-7 tags):    4,100 types
    2 tags               3,760
    3 tags                 264
    4 tags                  61
    5 tags                  12
    6 tags                   2
    7 tags                   1
Approaches to Tagging
1. Rule-Based Tagger: ENGTWOL Tagger
(Voutilainen 1995)
2. Stochastic Tagger: HMM-based Tagger
3. Transformation-Based Tagger: Brill
Tagger
(Brill 1995)
NLTK
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• Please download and install!
• Runs on Python
NLTK Introduction
• The Natural Language Toolkit (NLTK)
provides:
– Basic classes for representing data relevant
to natural language processing.
– Standard interfaces for performing tasks, such
as tokenization, tagging, and parsing.
– Standard implementations of each task, which
can be combined to solve complex problems.
• Two versions: NLTK and NLTK-Lite
NLTK Modules
• nltk.token: processing individual elements of text,
such as words or sentences.
• nltk.probability: modeling frequency distributions
and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental
information, such as parts of speech or wordnet sense
tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of
the parser interface.
• nltk.chunkparser: a regular-expression based
surface parser.
Python for NLP
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure
• Modules
• Object oriented programming
– Powerful string manipulation
Python Modules and Packages
• Python modules “package program code
and data for reuse.” (Lutz)
– Similar to library in C, package in Java.
• Python packages are hierarchical modules
(i.e., modules that contain other modules).
• Three commands for accessing modules:
1. import
2. from…import
3. reload
Import Command
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted
names:
# Use the search function from the re module
>>> re.search(r'\w+', 'Sherlock Holmes')
• To list the contents of a module, use dir:
>>> dir(re)
['DOTALL', 'I', 'IGNORECASE', …]
from...import
• The from…import command loads
individual functions and objects from a
module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is
loaded with from…import, it can be
used directly:
# Use the search function directly, without the re. prefix
>>> search(r'\w+', 'Sherlock Holmes')
Import vs. from...import
import
• Keeps module functions separate from user functions.
• Requires the use of dotted names.
• Works with reload.
from…import
• Puts module functions and user functions together.
• More convenient names.
• Does not work with reload.
Reload
• If you edit a module, you must use the reload
command before the changes become visible in
Python:
>>> import mymodule
...
>>> reload(mymodule)
• The reload command only affects modules that
have been loaded with import; it does not
update individual functions and objects loaded
with from...import.
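The behaviour described above can be checked with a small experiment. The sketch below is written for Python 3, where the built-in reload has moved to importlib.reload; it writes a throwaway module to a temporary directory, edits it, and reloads it. The module name mymodule and the function greet are placeholders.

```python
import importlib
import pathlib
import sys
import tempfile


def demo_reload():
    """Show that reload updates module.attr but not names bound by from...import."""
    with tempfile.TemporaryDirectory() as tmp:
        path = pathlib.Path(tmp) / "mymodule.py"
        path.write_text("def greet():\n    return 'v1'\n")
        sys.path.insert(0, tmp)
        importlib.invalidate_caches()
        try:
            import mymodule
            from mymodule import greet

            before = mymodule.greet()
            # Edit the module on disk. The new source has a different length,
            # so a cached bytecode file cannot be mistaken for current.
            path.write_text("def greet():\n    return 'v2 edited'\n")
            importlib.reload(mymodule)

            via_module = mymodule.greet()   # sees the edited code
            via_from_import = greet()       # still the old function object
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("mymodule", None)
    return before, via_module, via_from_import


before, via_module, via_from_import = demo_reload()
print(before, "|", via_module, "|", via_from_import)
```

The name bound by from...import keeps pointing at the original function object, which is why it does not pick up the edit; re-running the from...import statement after the reload would rebind it.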
Next Sessions…
• Rule-Based Tagging
• Stochastic Tagging
• Hidden Markov Models (HMMs)
• N-Grams
• Read Jurafsky and Martin Chapter 4
(PDF)
• Install NLTK