LING 681 Intro to Comp Ling


NLTK & Python
Day 7
LING 681.02
Computational Linguistics
Harry Howard
Tulane University
Course organization
• I have requested that NLTK be installed on the computers in this room.
09-Sept-2009
LING 681.02, Prof. Howard, Tulane University
NLPP §2
Accessing text corpora and lexical resources
§2.1 Accessing text corpora
What's that word?
• What is a corpus (plural: corpora)?
  • "large bodies of linguistic data"
Some corpora in NLTK
• The Project Gutenberg electronic text archive
  • 25k free electronic books at http://www.gutenberg.org/
• Web and chat text
• The Brown corpus
  • First 1M-word e-corpus, from 500 sources
• The Reuters corpus
• The Inaugural Address corpus
• Annotated text corpora
• Corpora in other languages
Using corpora in NLTK
• Only the texts loaded by nltk.book are already nltk.Text objects, so only they can be used directly with Text methods such as collocations().
• To wrap the word list of another corpus as a Text, use:
your_text_name = nltk.Text(corpus_name.words(fileid))
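To see what the wrapping step buys you, here is a minimal sketch. It uses a made-up token list in place of corpus_name.words(fileid), so it needs only NLTK itself and no downloaded corpus data; the variable names are illustrative assumptions.

```python
import nltk

# Toy token list standing in for corpus_name.words(fileid); not real corpus data.
tokens = ["the", "cat", "sat", "on", "the", "mat", "."]

# A plain list only has list methods; wrapping it gives access to Text methods.
your_text_name = nltk.Text(tokens)

print(len(your_text_name))            # number of tokens
print(your_text_name.count("the"))    # occurrences of one word
```

With a real corpus, the same call would take the result of a words() call instead of the toy list.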
Basic corpus functions
Table 2.3
Example                      Description
fileids()                    the files of the corpus
categories()                 the categories of the corpus
fileids([categories])        the files of the corpus corresponding to these categories
categories([fileids])        the categories of the corpus corresponding to these files
raw()                        the raw content of the corpus
raw(fileids=[f1,f2,f3])      the raw content of the specified files
raw(categories=[c1,c2])      the raw content of the specified categories
Basic corpus functions (continued)
Table 2.3
Example                      Description
words()                      the words of the whole corpus
words(fileids=[f1,f2,f3])    the words of the specified fileids
words(categories=[c1,c2])    the words of the specified categories
sents()                      the sentences of the whole corpus
sents(fileids=[f1,f2,f3])    the sentences of the specified fileids
sents(categories=[c1,c2])    the sentences of the specified categories
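The access patterns in Table 2.3 can be tried on a throwaway corpus. The sketch below builds two made-up files in a temporary directory and reads them with NLTK's CategorizedPlaintextCorpusReader, so no corpus download is needed. The file names, their contents, and the convention of taking the category from the file-name prefix are all assumptions for illustration. sents() is left out because it relies on a separately installed sentence tokenizer model.

```python
import os
import tempfile
from nltk.corpus.reader import CategorizedPlaintextCorpusReader

# Hypothetical two-file corpus; names and contents are made up.
root_dir = tempfile.mkdtemp()
samples = {
    "news_a.txt": "Stocks rose today.",
    "fiction_b.txt": "The dragon slept.",
}
for name, content in samples.items():
    with open(os.path.join(root_dir, name), "w", encoding="utf8") as handle:
        handle.write(content)

# The category of each file is taken from its prefix (assumed naming scheme).
corpus = CategorizedPlaintextCorpusReader(
    root_dir, r".*\.txt", cat_pattern=r"([a-z]+)_.*", encoding="utf8"
)

print(corpus.fileids())                               # all files
print(corpus.categories())                            # all categories
print(corpus.fileids(categories=["news"]))            # files in a category
print(corpus.categories(fileids=["fiction_b.txt"]))   # categories of a file
print(corpus.raw(fileids=["news_a.txt"]))             # raw text of one file
print(list(corpus.words(categories=["fiction"])))     # tokens of a category
```

The built-in corpora such as brown answer the same calls, only with much larger results.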
Code to get started
>>> import nltk
>>> from nltk.corpus import gutenberg
>>> emma = gutenberg.words('austen-emma.txt')
>>> emma = nltk.Text(emma)
>>> emma.collocations()
Frank Churchill; Miss Woodhouse; Miss Bates; Jane Fairfax; Miss
Fairfax; young man; great deal; John Knightley; Maple Grove; Miss
Smith; Miss Taylor; Robert Martin; Colonel Campbell; Box Hill; Harriet
Smith; William Larkins; Brunswick Square; young lady; young woman;
Miss Hawkins
Loading your own corpus
Table 2.3
Example                      Description
abspath(fileid)              the location of the file on disk
encoding(fileid)             the encoding of the file (if known)
open(fileid)                 open a stream for reading the given corpus file
root()                       the path to the root of the locally installed corpus
readme()                     the contents of the README file of the corpus
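As a rough illustration of loading your own corpus, the sketch below writes a made-up text file and README into a temporary directory and opens them with PlaintextCorpusReader. The directory, file names, and contents are assumptions, not part of the course materials. Note that in recent NLTK releases root is read as a property (no parentheses), which is how it is used here.

```python
import os
import tempfile
from nltk.corpus.reader import PlaintextCorpusReader

# Hypothetical corpus directory with one text file and a README (made-up content).
root_dir = tempfile.mkdtemp()
files = {"README": "A tiny demo corpus.", "notes.txt": "Hello corpus world."}
for name, content in files.items():
    with open(os.path.join(root_dir, name), "w", encoding="utf8") as handle:
        handle.write(content)

# Only files matching the pattern become fileids of the corpus.
my_corpus = PlaintextCorpusReader(root_dir, r".*\.txt", encoding="utf8")

print(my_corpus.fileids())               # files picked up by the pattern
print(my_corpus.abspath("notes.txt"))    # location of the file on disk
print(my_corpus.encoding("notes.txt"))   # encoding given to the reader
print(my_corpus.root)                    # corpus root (a property in recent NLTK)
print(my_corpus.readme())                # contents of the README file

# open() returns a stream; read it and close it explicitly.
stream = my_corpus.open("notes.txt")
text = stream.read()
stream.close()
print(text)
```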
NLPP §2
Accessing text corpora and lexical resources
§2.2 Conditional frequency distributions
Back to frequency
• FreqDist(mylist) counts the occurrences of each item in 'mylist'.
• ConditionalFreqDist(mypairs) counts the occurrences of each pair of items in 'mypairs', keeping a separate count for each condition,
  • where the pairing might be of author & word, genre & word, topic & word, etc.; in general, condition & word.
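The difference between the two can be sketched on toy data. The word list and the (genre, word) pairs below are made up for illustration; only FreqDist and ConditionalFreqDist themselves come from NLTK.

```python
import nltk

# Flat counts: one tally over the whole list.
mylist = ["the", "cat", "the", "dog"]
fd = nltk.FreqDist(mylist)
print(fd["the"])

# Conditional counts: one tally per condition (here, a made-up genre label).
mypairs = [
    ("news", "the"), ("news", "market"),
    ("fiction", "the"), ("fiction", "dragon"), ("fiction", "the"),
]
cfd = nltk.ConditionalFreqDist(mypairs)
print(sorted(cfd.conditions()))   # the conditions seen in the pairs
print(cfd["news"]["the"])         # count of "the" under "news"
print(cfd["fiction"]["the"])      # count of "the" under "fiction"
```

Indexing a ConditionalFreqDist by a condition gives back an ordinary FreqDist for that condition.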
An example
>>> import nltk
>>> from nltk.corpus import brown
>>> cfd = nltk.ConditionalFreqDist(
...     (genre, word)
...     for genre in brown.categories()
...     for word in brown.words(categories=genre))
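To make the nested generator expression less opaque, here is a plain-Python model of what it produces and how the counts are grouped. It is not NLTK's implementation: a made-up dictionary stands in for the Brown corpus, and collections.Counter stands in for the per-condition frequency distributions.

```python
from collections import Counter, defaultdict

# Hypothetical stand-in for the Brown corpus: genre -> word list (made-up data).
toy_corpus = {
    "news": ["the", "vote", "the", "law"],
    "romance": ["the", "kiss", "her", "her"],
}

# Same shape as the expression above: outer loop over genres, inner loop over words.
pairs = (
    (genre, word)
    for genre in toy_corpus
    for word in toy_corpus[genre]
)

# One counter per condition, filled pair by pair.
counts = defaultdict(Counter)
for genre, word in pairs:
    counts[genre][word] += 1

print(counts["news"]["the"])
print(counts["romance"].most_common(1))
```

Each (genre, word) pair adds one to the tally for that word within that genre, which is the bookkeeping that ConditionalFreqDist does for the real corpus.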
Next time
NLPP: §2.3ff
Do "Your Turn" up to p. 55
Exercises 2.8.2-4, 2.8.8