Programming for Linguists
An Introduction to Python
13/12/2012
Dictionaries
Like a list, but more general
In a list the index has to be an integer,
e.g. words[4]
In a dictionary the index (the key) can be
almost any immutable type
A dictionary is like a mapping between 2
sets: keys and values
To create an empty list:
list = [ ]
To create an empty dictionary:
dictionary = { }
e.g. a dictionary containing English and
Spanish words:
>>>eng2sp = { }
>>>eng2sp['one'] = 'uno'
>>>print eng2sp
{'one': 'uno'}
In this case both the keys and the values
are of the string type
As with lists, you can create dictionaries
yourself, e.g.
eng2sp = {'one': 'uno', 'two': 'dos', 'three':
'tres'}
print eng2sp
Note: in general, the order of items in a
dictionary is unpredictable
You can use the keys to look up the
corresponding values, e.g.
>>>print eng2sp['two']
The key ‘two’ always maps to the value
‘dos’ so the order of the items does not
matter
If the key is not in the dictionary you get an
error message, e.g.
>>>print eng2sp['ten']
KeyError: 'ten'
The len( ) function returns the number of
key-value pairs
len(eng2sp)
The in operator tells you whether
something appears as a key in the
dictionary
>>>'one' in eng2sp
True
BUT
>>>'uno' in eng2sp
False
To see whether something appears as a
value in a dictionary, you can use the
values( ) function, which returns the values
as a list, and then use the in operator, e.g.
>>>'uno' in eng2sp.values( )
True
Lists can be values, but never keys!
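For example, a list can be stored as a value, but using a list as a key raises an error (a small sketch; the dictionary below is made up for illustration):
>>>numbers = {'one': ['uno', 'een'], 'two': ['dos', 'twee']}
>>>numbers['one']
['uno', 'een']
>>>numbers[['one', 'two']] = 'oops'
TypeError: unhashable type: 'list'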
Default dictionary
Try this:
words = ['een', 'twee', 'drie']
frequencyDict = { }
for w in words:
    frequencyDict[w] += 1
This raises a KeyError: the key w is not in the dictionary yet, so there is no value to add 1 to
Possible solution:
for w in words:
    if w in frequencyDict:
        frequencyDict[w] += 1
    else:
        frequencyDict[w] = 1
The easy solution:
>>>from collections import defaultdict
>>>frequencyDict = defaultdict(int)
>>>for w in words:
       frequencyDict[w] += 1
You can use int, float, str,… as the default factory in defaultdict
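For example, defaultdict(list) starts every new key with an empty list, which is handy for grouping (a small sketch; the word list is made up):
>>>from collections import defaultdict
>>>by_length = defaultdict(list)
>>>for w in ['een', 'twee', 'drie', 'vier']:
       by_length[len(w)].append(w)
>>>by_length[4]
['twee', 'drie', 'vier']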
A Dictionary as a Set of Counters
Suppose you want to count the number of
times each letter occurs in a string, you
could:
create 26 variables, traverse the string and,
for each letter, add 1 to the corresponding
counter
create a dictionary with letters as keys and
counters as the corresponding values
from collections import defaultdict

def frequencies(sent):
    freq_dict = defaultdict(int)
    for let in sent:
        freq_dict[let] += 1
    return freq_dict

dictA = frequencies("abracadabra")
list_keys = dictA.keys( )
list_values = dictA.values( )
z_value = dictA['z']   # 'z' does not occur: a defaultdict returns 0 (and adds the key)
The first line of the function creates an
empty default dictionary
The for loop traverses the string
Each time through the loop, if the letter is
not yet in the dictionary, the defaultdict
gives it the default value 0 and we add 1
If the letter is already in the dictionary, we
add 1 to its corresponding value
Write a function that counts the word
frequencies in a sentence instead of the
letter frequencies using a dictionary
def words(sent):
    word_freq = defaultdict(int)
    wordlist = sent.split( )
    for word in wordlist:
        word_freq[word] += 1
    return word_freq

words("this is is a a test sentence")
Dictionary Lookup
Given a dictionary “word_freq” and a key
“is”, finding the corresponding value:
word_freq['is']
This operation is called a lookup
What if you know the value and want to
look up the corresponding key?
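There is no built-in reverse lookup; one option is to loop over the items and collect every key with that value (a sketch, assuming word_freq was built with the words( ) function above; the order of the result may vary):
>>>def reverse_lookup(d, target):
       return [key for key in d if d[key] == target]
>>>reverse_lookup(word_freq, 2)
['is', 'a']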
Sorting a Dictionary According to its
Values
First you need to import itemgetter:
from operator import itemgetter
To go over each item in a dictionary you
can use .iteritems( )
To sort the dictionary according to the
values, you need to use
key = itemgetter(1)
To sort in decreasing order: reverse = True
>>>from operator import itemgetter
>>>from collections import defaultdict
>>>def getValues(sent):
       w_fr = defaultdict(int)
       wordlist = sent.split( )
       for word in wordlist:
           w_fr[word] += 1
       byVals = sorted(w_fr.iteritems( ),
                       key=itemgetter(1),
                       reverse=True)
       return byVals
>>>getValues('this is a a a sentence')
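On this sentence it returns something like the following (words with the same frequency can appear in any order):
[('a', 3), ('this', 1), ('is', 1), ('sentence', 1)]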
Write a function that takes a sentence as
an argument and returns all words that
occur only once in the sentence.
def getHapax(sent):
    words = sent.split( )
    freqs = defaultdict(int)
    for w in words:
        freqs[w] += 1
    hapaxlist = [ ]
    for item in freqs:
        value = freqs[item]
        if value == 1:
            hapaxlist.append(item)
    return hapaxlist
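A quick check with the earlier example sentence (the order of the words in the result may vary):
>>>getHapax('this is is a a test sentence')
['this', 'test', 'sentence']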
Getting Started with NLTK
In IDLE:
import nltk
nltk.download()
Searching Texts
Start your script with importing all texts in
NLTK:
from nltk.book import *
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908
Any time you want to find out about these
texts, just enter their names at the Python
prompt:
>>> text1
<Text: Moby Dick by Herman Melville
1851>
A concordance view shows every
occurrence of a given word, together with
some context:
e.g. “monstrous” in Moby Dick
text1.concordance('monstrous')
Try looking up the context of “lol” in the
chat corpus (text 5)
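A possible answer (assuming the texts from nltk.book have been imported as above):
text5.concordance('lol')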
If you have a corpus that contains texts
that are spread over time, you can look up
how some words are used differently over
time:
e.g. the Inaugural Address Corpus (dates
back to 1789): words like “nation”, “terror”,
“God”…
You can also examine what other words
appear in a similar context, e.g.
text1.similar('monstrous')
common_contexts( ) allows you to
examine the contexts that are shared by
two or more words, e.g.
text1.common_contexts(['very', 'monstrous'])
You can also determine the location of a
word in the text
This positional information can be
displayed using a dispersion plot
Each stripe represents an instance of a
word, and each row represents the entire
text, e.g.
text4.dispersion_plot(["citizens",
"democracy", "freedom", "duties",
"America"])
Counting Tokens
To count the number of tokens (words +
punctuation marks), just use the len( )
function, e.g. len(text5)
To count the number of unique tokens,
you have to make a set, e.g.
set(text5)
If you want them sorted alphabetically,
try this:
sorted(set(text5))
Note: when sorting, Python puts all
capitalized words before lowercase words
(you can apply .lower( ) to the tokens first to avoid this)
Now you can calculate the lexical diversity
of a text, e.g.
the chat corpus (text5):
45010 tokens
6066 unique tokens or types
The lexical diversity =
nr of types/nr of tokens
Use the Python functions to calculate the
lexical diversity of text 5
len(set(text5))/float(len(text5))
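You can wrap this in a small helper function so it works for any text (a minimal sketch; the function name is made up):
def lexical_diversity(text):
    return len(set(text)) / float(len(text))
lexical_diversity(text5)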
Frequency Distributions
To find n most frequent tokens: FreqDist( ),
e.g.
fdist = FreqDist(text1)
fdist['have']
760
all_tokens = fdist.keys( )
all_tokens[:50]
The function .keys( ) combined with the
FreqDist( ) also gives you a list of all the
unique tokens in the text
Frequency distributions can be
informative, BUT the most frequent words
usually are function words (the, of, and,
…)
What proportion of the text is taken up with
such words?
Cumulative frequency plot
fdist.plot(50, cumulative=True)
If frequent tokens do not give enough
information, what about infrequent tokens?
Hapaxes = tokens which occur only once
fdist.hapaxes( )
Without their context, you do not get much
information either
Fine-grained Selection of Tokens
Extract tokens of a certain minimum
length:
tokens = set(text1)
long_tokens = [ ]
for token in tokens:
    if len(token) >= 15:
        long_tokens.append(token)
#OR shorter:
long_tokens = [token for token in tokens if len(token) >= 15]
BUT: very long words are often hapaxes
You can also extract frequently occurring
long words of a certain length:
words = set(text1)
fdist = FreqDist(text1)
#short version
freq_long_words = [word for word in words if len(word) >= 7 and fdist[word] >= 7]
Collocations and Bigrams
A collocation is a sequence of words that
occur together unusually often, e.g. "red
wine" is a collocation, "yellow wine" is
not
Collocations are essentially just frequent
bigrams (word pairs), but you can find
bigrams that occur more often than is to
be expected based on the frequency of the
individual words:
text8.collocations( )
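To see the word pairs themselves you can use nltk.bigrams( ), e.g. (a small sketch; in newer NLTK versions you may have to wrap the call in list( )):
>>>nltk.bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]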
Some Functions for NLTK's Frequency
Distributions
fdist = FreqDist(samples)        create a frequency distribution of the samples
fdist['word']                    count of 'word'
fdist.freq('word')               relative frequency (proportion) of 'word'
fdist.N( )                       total number of samples
fdist.keys( )                    the samples sorted in order of decreasing frequency
for sample in fdist:             iterates over the samples in order of decreasing frequency
fdist.max( )                     sample with the greatest count
fdist.plot( )                    graphical plot of the frequency distribution
fdist.plot(cumulative=True)      cumulative plot of the frequency distribution
fdist1 < fdist2                  tests if the samples in fdist1 occur less frequently than in fdist2
Accessing Corpora
NLTK also contains entire corpora, e.g.:
Brown Corpus
NPS Chat
Gutenberg Corpus
…
A complete list can be found at
http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
Each of these corpora contains dozens of
individual texts
To see which files are e.g. in the
Gutenberg corpus in NLTK:
nltk.corpus.gutenberg.fileids()
Do not forget the dot notation nltk.corpus.
This tells Python the location of the corpus
You can use the dot notation to work with
a corpus from NLTK or you can import a
corpus at the beginning of your script:
from nltk.corpus import gutenberg
After that you just have to use the name of
the corpus and the dot notation before a
function
gutenberg.fileids( )
If you want to examine a particular text, e.g.
Shakespeare’s Hamlet, you can use the
.words( ) function
hamlet = gutenberg.words("shakespeare-hamlet.txt")
Note that “shakespeare-hamlet.txt” is the file
name that is to be found using the previous
.fileids( ) function
You can use some of the previously
mentioned functions (corpus methods) on this
text, e.g.
fdist_hamlet = FreqDist(hamlet)
Some Corpus Methods in NLTK
brown.raw( )                 raw data from the corpus file(s)
brown.categories( )          fileids( ) grouped per predefined category
brown.words( )               a list of words and punctuation tokens
brown.sents( )               words( ) grouped into sentences
brown.tagged_words( )        a list of (word, tag) pairs
brown.tagged_sents( )        tagged_words( ) grouped into sentences
treebank.parsed_sents( )     a list of parse trees
def statistics(corpus):
    for fileid in corpus.fileids( ):
        nr_chars = len(corpus.raw(fileid))
        nr_words = len(corpus.words(fileid))
        nr_sents = len(corpus.sents(fileid))
        nr_vocab = len(set([word.lower() for word in corpus.words(fileid)]))
        print fileid, "average word length:", nr_chars/nr_words, "average sentence length:", nr_words/nr_sents, "lexical diversity:", nr_words/nr_vocab
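For example, calling it on the Gutenberg corpus (imported as above) prints one line of statistics per file; note that with Python 2 integer division the results are whole numbers:
statistics(gutenberg)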
Some corpora contain several
subcategories, e.g. the Brown Corpus
contains “news”, “religion”,…
You can optionally specify these particular
categories or files from a corpus, e.g.:
from nltk.corpus import brown
brown.categories( )
brown.words(categories='news')
brown.words(fileids=['cg22'])
brown.sents(categories=['news',
'editorial', 'reviews'])
Some linguistic research: comparing
genres in the Brown corpus in their usage
of modal verbs
import nltk
from nltk.corpus import brown
cfd = nltk.ConditionalFreqDist((genre, word)
          for genre in brown.categories( )
          for word in brown.words(categories=genre))
#Type this as one statement: do not press enter in the middle of the for expressions!
genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']
modal_verbs = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=genres, samples=modal_verbs)
                  can could  may might must will
news               93    86   66    38   50  389
religion           82    59   78    12   54   71
hobbies           268    58  131    22   83  264
science_fiction    16    49    4    12    8   16
romance            74   193   11    51   45   43
humor              16    30    8     8    9   13
A conditional frequency distribution is a
collection of frequency distributions, each one
for a different "condition”
The condition is usually the category of the text
(news, religion,…)
Loading Your Own Text or Corpus
Make sure that the texts/files of your
corpus are in plaintext format (convert
them, do not just change the file
extensions from e.g. .docx to .txt)
Make a folder with the name of your corpus
that contains all the text files
A text in Python:
open your file
f = open("/Users/claudia/text1.txt", "r")
read in the text
text1 = f.read( ) reads the text entirely
text1 = f.readlines( ) reads in all lines that
end with \n and makes a list
text1 = f.readline( ) reads in one line
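For example, to read a text file and count its word frequencies with the defaultdict approach from earlier (a sketch; the file path is just an example):
from collections import defaultdict
f = open("/Users/claudia/text1.txt", "r")
text = f.read( )
f.close( )
freqs = defaultdict(int)
for word in text.split( ):
    freqs[word] += 1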
Loading your own corpus in NLTK with no
subcategories:
import nltk
from nltk.corpus import PlaintextCorpusReader
loc = "/Users/claudia/my_corpus" #Mac
loc = r"C:\Users\claudia\my_corpus" #Windows (raw string, so the backslashes are kept literally)
my_corpus = PlaintextCorpusReader(loc, ".*")
Now you can use the corpus methods of
NLTK on your own corpus, e.g.
my_corpus.words( )
my_corpus.sents( )
…
Loading your own corpus in NLTK with
subcategories:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
loc = "/Users/claudia/my_corpus" #Mac
loc = r"C:\Users\claudia\my_corpus" #Windows 7
my_corpus = CategorizedPlaintextCorpusReader(loc, r'(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')
If your corpus is loaded correctly, you
should get a list of all files in your corpus
by using:
my_corpus.fileids( )
For a corpus with subcategories, you can
access the files in the subcategories by
taking the name of the subcategory as an
argument:
my_corpus.fileids(categories = 'cat1')
my_corpus.words(categories = 'cat2')
Writing Results to a File
It is often useful to write output to files
First you have to open/create a file for your
output
output_file = open('(path)/output.txt', 'w') #'w' overwrites an existing file
output_file = open('(path)/output.txt', 'a') #'a' appends to an existing file
output_file.write('hello')
output_file.close()
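For example, to write the word frequencies from earlier to a file, one word and its count per line (a sketch; the file name and the freqs dictionary are assumptions):
output_file = open('output.txt', 'w')
for word in freqs:
    output_file.write(word + '\t' + str(freqs[word]) + '\n')
output_file.close()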
To download and install NLTK:
http://www.nltk.org/download
Note: you need to have Python's NumPy
and Matplotlib packages installed in order
to produce the graphical plots
See http://www.nltk.org/ for installation
instructions
Thank you