Transcript NLTK1

CSA2050
NLTK
NLTK
• A software package for manipulating
linguistic data and performing NLP tasks
• Advanced tasks are possible from an early
stage
• Permits projects at various levels
• Consistent interfaces
• Facilitates reusability of modules
• Implemented in Python
Chart Parsing with NLTK
Why Python
• Popular languages for NLP courses
– Prolog (clean, but a steep learning curve and slow)
– Perl (quick, but obscure syntax)
• Why Python is better suited
– Easy to learn, clean syntax
– Interpreted, supporting rapid prototyping
– Object oriented
– Powerful
NLTK Structure
• NLTK is implemented as a set of minimally
interdependent modules.
• Core modules
– Basic data types
• Task Modules
– Tokenising
– Parsing
– Other NLP tasks
Token Class
• The Token class is used to encode information
about natural language texts.
• Each Token instance represents a unit of
text such as a word, a sentence, or a document.
• A given instance is defined by a partial
mapping from property names to property
values.
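The idea of a token as a partial mapping can be sketched with a plain Python dictionary. This is only an illustration: SimpleToken and its has method are hypothetical names, not NLTK's Token class, and the sketch assumes nothing about how NLTK stores properties internally.

```python
# Illustrative only: SimpleToken is a hypothetical stand-in, not NLTK's Token class.
class SimpleToken:
    """A unit of text described by a partial mapping of property names to values."""

    def __init__(self, **properties):
        # Only the properties supplied are stored, so the mapping is partial.
        self.properties = dict(properties)

    def __getitem__(self, name):
        return self.properties[name]

    def __setitem__(self, name, value):
        self.properties[name] = value

    def has(self, name):
        """Return True if the property has been defined for this token."""
        return name in self.properties


word = SimpleToken(TEXT="python")
print(word.has("TEXT"))   # TEXT was supplied when the token was created
print(word.has("TAG"))    # TAG has not been set yet
word["TAG"] = "NN"        # a later processing step can add it
print(word["TAG"])
```

A property that was never set is simply absent from the mapping, which is what makes it partial.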
The TEXT Property
• The TEXT property is used to encode a
token’s text content.
>>> from nltk.token import *
>>> Token(TEXT="Hello World!")
<Hello World!>
TAG
• The TAG property is used to encode a
token’s part of speech tag:
>>> Token(TEXT="python",TAG="NN")
<python/NN>
SUBTOKENS
• The SUBTOKENS property is used to
store a tokenized text:
>>> from nltk.tokenizer import *
>>> tok = Token(TEXT="Hello World!")
>>> WhitespaceTokenizer().tokenize(tok)
>>> print tok['SUBTOKENS']
[<Hello>, <World!>]
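What the tokenizer does to the token can be approximated in plain Python. In this sketch tokens are ordinary dictionaries and whitespace_tokenize is a hypothetical helper, so it mirrors the behaviour shown above without using the nltk.tokenizer module.

```python
# Illustrative only: tokens are plain dicts here, not NLTK Token objects.
def whitespace_tokenize(token):
    """Split token['TEXT'] on whitespace and store the pieces under 'SUBTOKENS'."""
    token["SUBTOKENS"] = [{"TEXT": piece} for piece in token["TEXT"].split()]


tok = {"TEXT": "Hello World!"}
whitespace_tokenize(tok)  # annotates tok in place and returns None
print([sub["TEXT"] for sub in tok["SUBTOKENS"]])  # ['Hello', 'World!']
```

The function returns nothing: the result is attached to the token it was given, and the original TEXT property is left untouched.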
Augmenting the Token
with Information
• Language processing tasks are formulated
as annotations and transformations
involving tokens which add properties to
the Token data structure.
– word-sense disambiguation
– chunking
– parsing
Blackboard Architecture
• Typically these modifications are monotonic –
they add information but do not delete it.
• Tokens serve as a blackboard where information
about a piece of text is collated.
• This architecture contrasts with the more typical
pipeline architecture where each stage
destructively modifies the input information.
• This approach was chosen because it gives
greater flexibility when combining tasks into a
single system.
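A minimal sketch of the blackboard idea, assuming dict-based tokens and two hypothetical stages (tokenize and tag, with a deliberately naive tagging rule). Each stage only adds properties, so everything written by earlier stages is still available afterwards.

```python
# Illustrative only: hypothetical stages operating on a dict-based token.
def tokenize(token):
    """Add a SUBTOKENS property; the TEXT property is left as it was."""
    token["SUBTOKENS"] = [{"TEXT": w} for w in token["TEXT"].split()]


def tag(token):
    """Add a TAG property to every subtoken."""
    # Toy rule (an assumption for this sketch): capitalised words are NP, the rest NN.
    for sub in token["SUBTOKENS"]:
        sub["TAG"] = "NP" if sub["TEXT"][0].isupper() else "NN"


sentence = {"TEXT": "Hinkle identified python"}
for stage in (tokenize, tag):
    stage(sentence)  # each stage writes onto the same token

print(sentence["TEXT"])  # the original text is still there
print([(s["TEXT"], s["TAG"]) for s in sentence["SUBTOKENS"]])
```

In a pipeline, by contrast, tokenize would return a list of words and tag would return a new list of pairs, so a later stage could no longer see the original text unless it was passed along separately.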
Other Core Modules
• The probability module defines classes for
probability distributions and statistical
smoothing techniques.
• The cfg module defines classes for encoding
context-free grammars (normal and
probabilistic).
• The corpus module defines classes for
reading and processing different corpora.
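To give a feel for what a smoothed probability distribution involves, here is a small sketch using only the standard library. The slides do not name a particular technique, so Laplace (add-one) smoothing is chosen purely for illustration; LaplaceDist is a hypothetical class and does not reproduce the classes of the probability module.

```python
from collections import Counter


# Illustrative only: LaplaceDist is a hypothetical class, not part of NLTK.
class LaplaceDist:
    """Add-one smoothed probability estimates over a fixed number of outcomes."""

    def __init__(self, samples, bins):
        self.counts = Counter(samples)
        self.total = sum(self.counts.values())
        self.bins = bins  # number of possible outcomes, including unseen ones

    def prob(self, sample):
        # Every outcome gets one extra count, so unseen samples keep a non-zero probability.
        return (self.counts[sample] + 1) / (self.total + self.bins)


tags = ["NN", "NN", "VB", "AT"]
dist = LaplaceDist(tags, bins=4)  # assume 4 possible tags, one of them (JJ) unseen
print(dist.prob("NN"))  # (2 + 1) / (4 + 4)
print(dist.prob("JJ"))  # (0 + 1) / (4 + 4)
```

Without smoothing the unseen tag JJ would get probability zero, which is usually undesirable when the estimates feed into later processing.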
Using Brown Corpus
>>> from nltk.corpus import brown
>>> brown.groups()
['skill and hobbies', 'popular lore',
'humor', 'fiction: mystery', ...]
>>> brown.items('humor')
('cr01', 'cr02', 'cr03', 'cr04', 'cr05',
'cr06', 'cr07', 'cr08', 'cr09')
>>> brown.tokenize('cr01')
<[<It/pps>, <was/bedz>, <among/in>,
<these/dts>, <that/cs>, <Hinkle/np>,
<identified/vbd>, <a/at>, ...]>
Penn Treebank
>>> from nltk.corpus import treebank
>>> treebank.groups()
('raw', 'tagged', 'parsed', 'merged')
>>> treebank.items('parsed')
['wsj_0001.prd', 'wsj_0002.prd', ...]
>>> item = 'parsed/wsj_0001.prd'
>>> sentences = treebank.tokenize(item)
>>> for sent in sentences['SUBTOKENS']:
...     print sent.pp() # pretty-print
(S:
  (NP-SBJ:
    (NP: <Pierre> <Vinken>)
    (ADJP:
      (NP: <61> <years>)
      <old>
    ) ...
Processing Modules
• Each language processing algorithm is
implemented as a class.
• For example, the ChartParser and
RecursiveDescentParser classes each
define a single algorithm for parsing a text.
• Each processing module defines an interface.
• Interface classes are named with a trailing
capital I, e.g. ParserI.
• Such interface classes define one or more action
methods that perform the task the module is
supposed to perform.
– parse method
– parse_n method
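The relationship between an interface class and the algorithm classes that implement it can be sketched with Python's abc module. ParserInterface, FlatParser and the method signatures below are assumptions made for this example, loosely following the ParserI naming convention; they are not NLTK's actual definitions.

```python
from abc import ABC, abstractmethod


# Illustrative only: the names and signatures here are assumptions, not NLTK's ParserI.
class ParserInterface(ABC):
    """Interface: declares the action methods every parser must provide."""

    @abstractmethod
    def parse_n(self, words, n=None):
        """Return up to n analyses of the word list (all of them if n is None)."""

    def parse(self, words):
        """Return a single analysis, or None if there is none."""
        results = self.parse_n(words, n=1)
        return results[0] if results else None


class FlatParser(ParserInterface):
    """Toy algorithm: groups all words under a single S node."""

    def parse_n(self, words, n=None):
        trees = [("S", tuple(words))]
        return trees if n is None else trees[:n]


parser = FlatParser()
print(parser.parse(["Hello", "World!"]))
```

Code written against the interface can swap one parsing class for another without changes, which is the point of giving each task a common set of action methods.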
What is Python
• Python is an interpreted, object-oriented programming
language with dynamic semantics.
• Attractive for Rapid Application Development
• Easy-to-learn syntax emphasizes readability and
therefore reduces the cost of program maintenance.
• Python supports modules and packages, which
encourages program modularity and code reuse.
• Developed by Guido van Rossum in the early 1990s
• Named after Monty Python
• Open Source and free.
• Download from www.python.org
Why Python
• Prolog: clean, learning curve, slow
• Lisp: old, syntax, big
• Perl: quick,
• C#