PPT - Search

Download Report

Transcript PPT - Search

CSA3180: Natural Language
Processing
Text Processing 2
• Shallow Parsing and Chunking
• Python and NLTK
• NLTK Exercises
October 2005
CSA3180: Text Processing II
1
Parsing Problem
• Given grammar G and sentence A discover all
valid parse trees for G that exactly cover A
S
VP
NP
V
book
Det
that
October 2006
csa3180: Parsing Algorithms 1
Nom
N
flight
2
The elephant is in the garden
S
VP
NP
NP
NP
I
shot
October 2006
an
PP
elephant
in
csa3180: Parsing Algorithms 1
my
garden
3
I own the garden
S
VP
NP
NP
I
shot
October 2006
an
PP
elephant
in
csa3180: Parsing Algorithms 1
my
garden
4
Parsing as Search
• Search within a space defined by
– Start State
– Goal State
– State to state transformations
• Two distinct parsing strategies:
– Top down
– Bottom up
• Different parsing strategy, different state space,
different problem.
• N.B. Parsing strategy ≠ search strategy
October 2006
csa3180: Parsing Algorithms 1
5
Shallow/Chunk Parsing
Goal: divide a sentence into a sequence of
chunks.
• Chunks are non-overlapping regions of a
text
[I] saw [a tall man] in [the park].
• Chunks are non-recursive
– A chunk can not contain other chunks
• Chunks are non-exhaustive
– Not all words are included in chunks
October 2005
CSA3180: Text Processing II
6
Chunk Parsing Examples
• Noun-phrase chunking:
[I] saw [a tall man] in [the park].
• Verb-phrase chunking:
The man who [was in the park] [saw me].
• Prosodic chunking:
[I saw] [a tall man] [in the park].
• Question answering:
– What [Spanish explorer] discovered [the
Mississippi River]?
October 2005
CSA3180: Text Processing II
7
Motivation
• Locating information
– e.g., text retrieval
• Index a document collection on its noun phrases
• Ignoring information
– Generalize in order to study higher-level patterns
• e.g. phrases involving “gave” in Penn treebank:
– gave NP; gave up NP in NP; gave NP up; gave NP help; gave
NP to NP
– Sometimes a full parse has too much structure
• Too nested
• Chunks usually are not recursive
October 2005
CSA3180: Text Processing II
8
Representation
• BIO (or IOB)
Trees
October 2005
CSA3180: Text Processing II
9
Comparison with Full Parsing
• Parsing is usually an intermediate stage
– Builds structures that are used by later stages of processing
• Full parsing is a sufficient but not necessary intermediate
stage for many NLP tasks
– Parsing often provides more information than we need
• Shallow parsing is an easier problem
– Less word-order flexibility within chunks than between chunks
– More locality:
• Fewer long-range dependencies
• Less context-dependence
• Less ambiguity
October 2005
CSA3180: Text Processing II
10
Chunks and Constituency
Constituents: [[a tall man] [ in [the park]]].
Chunks:
[a tall man] in [the park].
• A constituent is part of some higher unit in the
hierarchical syntactic parse
• Chunks are not constituents
– Constituents are recursive
• But, chunks are typically subsequences of
constituents
– Chunks do not cross major constituent boundaries
October 2005
CSA3180: Text Processing II
11
Unchunking
• Remove any chunk with a given pattern
– e.g., unChunkRule(‘<NN|DT>+’, ‘Unchunk NNDT’)
– Combine with Chunk Rule <NN|DT|JJ>+
• Chunk all matching subsequences:
– Input:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
– Apply chunk rule
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
– Apply unchunk rule
[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
October 2005
CSA3180: Text Processing II
12
Chinking
• A chink is a subsequence of the text that is not a chunk.
• Define a regular expression that matches the sequences
of tags in a chink
A simple chink regexp for finding NP chunks:
(<VB.?>|<IN>)+
• First apply chunk rule to chunk everything
– Input:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
– ChunkRule('<.*>+', ‘Chunk everything’)
[the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN]
– Apply Chink rule above:
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
October 2005
CSA3180: Text Processing II
13
Merging
• Combine adjacent chunks into a single chunk
– Define a regular expression that matches the sequences of tags
on both sides of the point to be merged
• Example:
– Merge a chunk ending in JJ with a chunk starting with NN
MergeRule(‘<JJ>’, ‘<NN>’, ‘Merge adjs and nouns’)
[the/DT little/JJ] [cat/NN] sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN the/DT mat/NN
• Splitting is the opposite of merging
October 2005
CSA3180: Text Processing II
14
Python and NLTK
• Natural Language Toolkit (NLTK)
• http://nltk.sourceforge.net/
• NLTK Slides partly based on Diane Litman
Lectures
• Chunk parsing slides partly based on Marti
Hearst Lectures
October 2005
CSA3180: Text Processing II
15
Python for NLP
• Python is a great language for NLP:
– Simple
– Easy to debug:
• Exceptions
• Interpreted language
– Easy to structure
• Modules
• Object oriented programming
– Powerful string manipulation
October 2005
CSA3180: Text Processing II
16
Python Modules and Packages
• Python modules “package program code
and data for reuse.” (Lutz)
– Similar to library in C, package in Java.
• Python packages are hierarchical modules
(i.e., modules that contain other modules).
• Three commands for accessing modules:
1.import
2.from…import
3.reload
October 2005
CSA3180: Text Processing II
17
Import Command
• The import command loads a module:
# Load the regular expression module
>>> import re
• To access the contents of a module, use dotted
names:
# Use the search method from the re module
>>> re.search(‘\w+’, str)
• To list the contents of a module, use dir:
>>> dir(re)
[‘DOTALL’, ‘I’, ‘IGNORECASE’,…]
October 2005
CSA3180: Text Processing II
18
from...import
• The from…import command loads
individual functions and objects from a
module:
# Load the search function from the re module
>>> from re import search
• Once an individual function or object is
loaded with from…import, it can be
used directly:
# Use the search method from the re module
>>> search (‘\w+’, str)
October 2005
CSA3180: Text Processing II
19
Import vs. from...import
Import
from…import
• Keeps module
functions separate
from user functions.
• Requires the use of
dotted names.
• Works with reload.
• Puts module functions
and user functions
together.
• More convenient
names.
• Does not work with
reload.
October 2005
CSA3180: Text Processing II
20
Reload
• If you edit a module, you must use the reload
command before the changes become visible in
Python:
>>> import mymodule
...
>>> reload (mymodule)
• The reload command only affects modules that
have been loaded with import; it does not
update individual functions and objects loaded
with from...import.
October 2005
CSA3180: Text Processing II
21
NLTK Introduction
• The Natural Language Toolkit (NLTK) provides:
– Basic classes for representing data relevant to natural
language processing.
– Standard interfaces for performing tasks, such as
tokenization, tagging, and parsing.
– Standard implementations of each task, which can be
combined to solve complex problems.
October 2005
CSA3180: Text Processing II
22
NLTK Example Modules
• nltk.token: processing individual elements of text,
such as words or sentences.
• nltk.probability: modeling frequency distributions
and probabilistic systems.
• nltk.tagger: tagging tokens with supplemental
information, such as parts of speech or wordnet sense
tags.
• nltk.parser: high-level interface for parsing texts.
• nltk.chartparser: a chart-based implementation of
the parser interface.
• nltk.chunkparser: a regular-expression based
surface parser.
October 2005
CSA3180: Text Processing II
23
Chunk Parsing in NLTK
• Chunk parsers usually ignore lexical content
– Only need to look at part-of-speech tags
• Possible steps in chunk parsing
– Chunking, unchunking
– Chinking
– Merging, splitting
• Evaluation
– Compare to a Baseline
– Evaluate in terms of
• Precision, Recall, F-Measure
• Missed (False Negative), Incorrect (False Positive)
October 2005
CSA3180: Text Processing II
24
Chunk Parsing in NLTK
• Define a regular expression that matches the sequences
of tags in a chunk
A simple noun phrase chunk regexp:
(Note that <NN.*> matches any tag starting with NN)
<DT>? <JJ>* <NN.?>
• Chunk all matching subsequences:
the/DT little/JJ cat/NN sat/VBD on/IN the/DT mat/NN
[the/DT little/JJ cat/NN] sat/VBD on/IN [the/DT mat/NN]
• If matching subsequences overlap, first 1 gets priority
October 2005
CSA3180: Text Processing II
25
NLTK Exercises for Next Week
• Series of tutorials by Steven Bird, Edward Klein
and Edward Loper
– http://nltk.sourceforge.net/
– University of Pennsylvania
• By next lecture please read and do exercises
in:
–
–
–
–
Introduction
Programming
Tokenize
Tag
October 2005
CSA3180: Text Processing II
26
Next Sessions…
•
•
•
•
•
•
•
•
•
Natural Language Toolkit (NLTK) Exercises
http://nltk.sourceforge.net/
Discovery of Word Associations
Text Classification
Clustering/Data Mining
TF.IDF
Linear and Non-Linear Classification
Binary Classification
Multi-Class Classification
October 2005
CSA3180: Text Processing II
27