Transcript NLTK

CSA2050: Introduction
to Computational
Linguistics
NLTK
April 2005
CSA2050:NLTK
1
NLTK






A software package for manipulating linguistic
data and performing NLP tasks
Advanced tasks are possible from an early
stage
Permits projects at various levels
Consistent interfaces
Facilitates reusability of modules
Implemented in Python
Chart Parsing with NLTK
Why Python

Popular languages for NLP courses



Prolog (clean, learning curve, slow)
Perl (quick, syntax).
Why Python is better suited




Easy to learn, clean syntax
Interpreted, supporting rapid prototyping
Object oriented
Powerful
NLTK Structure


NLTK is implemented as a set of minimally
interdependent modules.
Core modules


Basic data types
Task Modules



Tokenising
Parsing
Other NLP tasks
Token Class



The Token class is used to encode information
about natural language texts.
Each token instance represents a unit of text
such as a word, a text, or a document.
A given instance is defined by a partial
mapping from property names to property
values.
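The idea of a token as a partial mapping can be sketched in plain Python 3. This is only an illustration under assumptions: `SketchToken` is a hypothetical name, not NLTK's own Token class, and the display format merely imitates the `<text/TAG>` style shown on these slides.

```python
# Hypothetical stand-in for illustration; this is not NLTK's Token class.
class SketchToken(dict):
    """A token as a partial mapping from property names to property values."""

    def __repr__(self):
        text = self.get("TEXT", "")
        tag = self.get("TAG")
        # Imitate the <text/TAG> display style used on the slides.
        return "<%s/%s>" % (text, tag) if tag is not None else "<%s>" % text


word = SketchToken(TEXT="python")
print(word)                  # only TEXT is defined so far
word["TAG"] = "NN"           # a second property is added later
print(word)
print("SUBTOKENS" in word)   # properties that were never set are simply absent
```

The mapping is "partial" in the sense that a token need not define every property: a lookup for a property that was never set finds nothing.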
The TEXT Property

The TEXT property is used to encode a
token’s text content.
>>> from nltk.token import *
>>> Token(TEXT="Hello World!")
<Hello World!>
TAG

The TAG property is used to encode a
token’s part of speech tag:
>>> Token(TEXT="python",TAG="NN")
<python/NN>
SUBTOKENS

The SUBTOKENS property is used to store a
tokenized text:
>>> from nltk.tokenizer import *
>>> tok = Token(TEXT="Hello World!")
>>> WhitespaceTokenizer().tokenize(tok)
>>> print tok['SUBTOKENS']
[<Hello>, <World!>]
Augmenting the Token with Information

Language processing tasks are formulated as
annotations and transformations involving
tokens which add properties to the Token
data structure.



word-sense disambiguation
chunking
parsing
Blackboard Architecture




Typically these modifications are monotonic – they
add information but do not delete it.
Tokens serve as a blackboard where information
about a piece of text is collated.
This architecture contrasts with the more typical
pipeline architecture where each stage destructively
modifies the input information.
This approach was chosen because it gives greater
flexibility when combining tasks into a single system.
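A minimal sketch of the blackboard idea, in plain Python 3 with dictionaries standing in for tokens. The function names and the toy lexicon are assumptions made up for this illustration and are not part of NLTK. Each annotator adds a property and leaves the existing ones untouched.

```python
def tokenize_whitespace(token):
    """Annotator: add SUBTOKENS, leaving TEXT untouched."""
    token["SUBTOKENS"] = [{"TEXT": w} for w in token["TEXT"].split()]


def tag_with_lexicon(token, lexicon, default="NN"):
    """Annotator: add a TAG to every subtoken, keeping its TEXT."""
    for sub in token["SUBTOKENS"]:
        sub["TAG"] = lexicon.get(sub["TEXT"].lower(), default)


toy_lexicon = {"the": "AT", "dog": "NN", "barks": "VBZ"}  # made-up entries
sentence = {"TEXT": "The dog barks"}
tokenize_whitespace(sentence)
tag_with_lexicon(sentence, toy_lexicon)

print(sorted(sentence))  # property names now present on the token
print([(s["TEXT"], s["TAG"]) for s in sentence["SUBTOKENS"]])
```

Because the original TEXT is still available after tagging, a later stage (for example a chunker) can read any earlier annotation without the stages having to agree on a single intermediate format.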
Other Core Modules



The probability module defines classes for
probability distributions and statistical
smoothing techniques.
The cfg module defines classes for encoding
context-free grammars (normal and
probabilistic).
The corpus module defines classes for
reading and processing different corpora.
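As an example of the kind of statistical smoothing the probability module is concerned with, here is a sketch of add-one (Laplace) smoothing in plain Python 3. The class name `LaplaceEstimate` and the tag counts are assumptions chosen for illustration. The code does not use or reproduce NLTK's own classes.

```python
from collections import Counter


class LaplaceEstimate:
    """Add-one smoothed probability estimate over a fixed number of bins."""

    def __init__(self, samples, bins):
        self.counts = Counter(samples)
        self.total = sum(self.counts.values())
        self.bins = bins  # number of possible outcomes, seen or unseen

    def prob(self, sample):
        # Every outcome gets one extra imaginary observation.
        return (self.counts[sample] + 1) / (self.total + self.bins)


tags = ["NN", "NN", "VB", "AT"]        # made-up observations
dist = LaplaceEstimate(tags, bins=5)   # assume a tagset of 5 possible tags
print(dist.prob("NN"))   # (2 + 1) / (4 + 5)
print(dist.prob("JJ"))   # an unseen tag still gets (0 + 1) / (4 + 5)
```

Without smoothing, the unseen tag would receive probability zero, which is usually undesirable when the estimates are multiplied together later.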
Using Brown Corpus
>>> from nltk.corpus import brown
>>> brown.groups()
['skill and hobbies', 'popular lore',
'humor', 'fiction: mystery', ...]
>>> brown.items('humor')
('cr01', 'cr02', 'cr03', 'cr04', 'cr05',
'cr06', 'cr07', 'cr08', 'cr09')
>>> brown.tokenize('cr01')
<[<It/pps>, <was/bedz>, <among/in>,
<these/dts>, <that/cs>, <Hinkle/np>,
<identified/vbd>, <a/at>, ...]>
Penn Treebank
>>> from nltk.corpus import treebank
>>> treebank.groups()
('raw', 'tagged', 'parsed', 'merged')
>>> treebank.items('parsed')
['wsj_0001.prd', 'wsj_0002.prd', ...]
>>> item = 'parsed/wsj_0001.prd'
>>> sentences = treebank.tokenize(item)
>>> for sent in sentences['SUBTOKENS']:
...     print sent.pp() # pretty-print
(S:
  (NP-SBJ:
    (NP: <Pierre> <Vinken>)
    (ADJP:
      (NP: <61> <years>)
      <old>
    ) ...
Processing Modules





Each language processing algorithm is implemented
as a class.
For example, the ChartParser and
RecursiveDescentParser classes each define a single
algorithm for parsing a text.
Each processing module defines an interface.
Interface classes are named with a trailing capital I,
e.g. ParserI.
Such interface classes define one or more action
methods that perform the task the module is
supposed to perform.
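The interface convention can be sketched in plain Python 3 as follows. `SketchParserI`, `FlatParser` and `parse_all` are hypothetical names for this illustration, and the use of the standard `abc` module is an assumption of the sketch, not a statement about how NLTK defines ParserI.

```python
from abc import ABC, abstractmethod


class SketchParserI(ABC):
    """Interface: any parser must provide the parse() action method."""

    @abstractmethod
    def parse(self, words):
        """Return a parse (here a nested tuple) for a list of words."""


class FlatParser(SketchParserI):
    """Trivial implementation: put every word directly under S."""

    def parse(self, words):
        return ("S",) + tuple(words)


def parse_all(parser, sentences):
    # Depends only on the interface, so any implementation can be swapped in.
    return [parser.parse(s) for s in sentences]


print(parse_all(FlatParser(), [["Hello", "World"]]))
```

Code written against the interface, such as `parse_all` here, does not need to change when a different parsing algorithm is substituted.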
parse method
parse_n method
What is Python








Python is an interpreted, object-oriented programming language
with dynamic semantics.
Attractive for Rapid Application Development
Its easy-to-learn syntax emphasizes readability and therefore
reduces the cost of program maintenance.
Python supports modules and packages, which encourages
program modularity and code reuse.
Developed by Guido van Rossum in the early 1990s
Named after Monty Python
Open Source and free.
Download from www.python.org
Why Python







Prolog
clean, learning curve, slow
Lisp
old, syntax, big
Perl
quick,
C#