Programming for Linguists
An Introduction to Python
22/12/2011
Feedback
• Ex. 1)
Read in the texts of the State of the Union addresses, using the state_union corpus reader. Count occurrences of “men”, “women”, and “people” in each document. What has happened to the usage of these words over time?
import nltk
from nltk.corpus import state_union

# count, per address, how often each word occurs
cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in state_union.fileids()
    for word in state_union.words(fileids=fileid))
fileids = state_union.fileids()
search_words = ["men", "women", "people"]
cfd.tabulate(conditions=fileids, samples=search_words)
• Ex 2)
According to Strunk and White's Elements of Style, the word “however”, used at the start of a sentence, means "in whatever way" or "to whatever extent", and not "nevertheless". They give this example of correct usage: However you advise him, he will probably do as he thinks best. Use the concordance tool to study actual usage of this word in 5 NLTK texts.
import nltk
from nltk.book import *

texts = [text1, text2, text3, text4, text5]
for text in texts:
    # concordance() prints its matches itself, so no print is needed
    text.concordance("however")
• Ex 3)
Create a corpus of your own containing at least 10 files of text fragments. You can use your own texts, texts from the internet, etc. Write a program that investigates the usage of modal verbs in this corpus using the frequency distribution tool and plots the 10 most frequent words.
import nltk
import re
from nltk.corpus import PlaintextCorpusReader

corpus_root = "/Users/claudia/my_corpus"   # Mac
#corpus_root = "C:\Users\..."              # Windows
my_corpus = PlaintextCorpusReader(corpus_root, '.*')
words = my_corpus.words()

# tabulate the modal verbs per file
cfd = nltk.ConditionalFreqDist(
    (fileid, word)
    for fileid in my_corpus.fileids()
    for word in my_corpus.words(fileid))
fileids = my_corpus.fileids()
modals = ['can', 'could', 'may', 'might', 'must', 'will']
cfd.tabulate(conditions=fileids, samples=modals)

# keep only tokens that do not start with non-alphanumeric characters
# (filtering a copy rather than removing from a list while iterating
# over it), then plot the 10 most frequent words
clean_words = [w for w in words if not re.match(r'[^a-zA-Z0-9]+', w)]
fd = nltk.FreqDist(clean_words)
fd.plot(10)
• Ex 1)
Choose a website. Read it into Python using the urlopen function, remove all HTML mark-up and tokenize it. Make a frequency dictionary of all words ending in ‘ing’ and sort it on its values (decreasingly).
• Ex 2)
Write the raw text of the text in the previous exercise to an output file.
import nltk
import re
from urllib import urlopen
from operator import itemgetter

url = "website"   # fill in the URL of the chosen website
htmltext = urlopen(url).read()
rawtext = nltk.clean_html(htmltext)   # strip the HTML mark-up
rawtext2 = rawtext.lower()
tokens = nltk.wordpunct_tokenize(rawtext2)
my_text = nltk.Text(tokens)

# count all tokens ending in "ing"
wordlist_ing = [w for w in tokens if re.search(r'^.*ing$', w)]
freq_dict = {}
for word in wordlist_ing:
    if word not in freq_dict:
        freq_dict[word] = 1
    else:
        freq_dict[word] = freq_dict[word] + 1

# sort the (word, count) pairs on their counts, highest first
sorted_wordlist_ing = sorted(freq_dict.iteritems(),
                             key=itemgetter(1), reverse=True)
Ex 2)
output_file = open("dir/output.txt", "w")
output_file.write(str(rawtext2) + "\n")
output_file.close()
• Ex 3)
Write a script that performs the same classification task as we saw today using word bigrams as features instead of single words.
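The slides give no model answer for this exercise. As a minimal sketch, assuming the classification task seen in the session was document classification with NLTK's Naive Bayes classifier on the movie_reviews corpus (as in the NLTK book), only the feature extractor changes, from single words to word bigrams:
import nltk
import random
from nltk.corpus import movie_reviews

def bigram_features(words):
    # use each word bigram occurring in the document as a binary feature
    features = {}
    for bigram in nltk.bigrams(words):
        features[bigram] = True
    return features

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]
random.shuffle(documents)
featuresets = [(bigram_features(words), category)
               for (words, category) in documents]
train_set, test_set = featuresets[100:], featuresets[:100]
classifier = nltk.NaiveBayesClassifier.train(train_set)
print nltk.classify.accuracy(classifier, test_set)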
Some Mentioned Issues
• Loading your own corpus in NLTK with no subcategories:
import nltk
from nltk.corpus import PlaintextCorpusReader
loc = "/Users/claudia/my_corpus"        # Mac
loc = "C:\Users\claudia\my_corpus"      # Windows 7
my_corpus = PlaintextCorpusReader(loc, ".*")
• Loading your own corpus in NLTK with subcategories:
import nltk
from nltk.corpus import CategorizedPlaintextCorpusReader
loc = "/Users/claudia/my_corpus"        # Mac
loc = "C:\Users\claudia\my_corpus"      # Windows 7
# the file pattern skips .svn folders; cat_pattern derives each
# file's category from the name of its subfolder
my_corpus = CategorizedPlaintextCorpusReader(loc,
    r'(?!\.svn).*\.txt', cat_pattern=r'(cat1|cat2)/.*')
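As a small usage note (assuming the two subfolders cat1 and cat2 from the pattern above), the categories can then be used to slice the corpus:
print my_corpus.categories()                    # ['cat1', 'cat2']
print my_corpus.words(categories='cat1')[:10]   # first 10 words of cat1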
Dispersion Plot
• determines the location of a word in the text: how many words from the beginning it appears
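The slides show no example here; a minimal sketch using the nltk.book texts (text4 is the Inaugural Address Corpus; the word list is illustrative):
import nltk
from nltk.book import *

# show where in text4 each word occurs, measured in words
# from the beginning of the text
text4.dispersion_plot(["citizens", "democracy", "freedom", "America"])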
Exercises
• Write a program that reads a file, breaks each line into words, strips whitespace and punctuation from the text, and converts the words to lowercase. You can get a list of all punctuation marks with:
import string
print string.punctuation
import nltk, string

def strip(filepath):
    f = open(filepath, 'r')
    text = f.read()
    f.close()
    tokens = nltk.wordpunct_tokenize(text)
    # build a new list rather than removing items from the list
    # we are iterating over, which would skip tokens
    clean_tokens = []
    for token in tokens:
        token = token.lower()
        if token not in string.punctuation:
            clean_tokens.append(token)
    return clean_tokens
• If you want to analyse a text, but filter out a stop list first (e.g. containing “the”, “and”, …), you need to make two dictionaries: one with all words from your text and one with all words from the stop list. Then you subtract the second from the first. Write a function subtract(d1, d2) which takes dictionaries d1 and d2 and returns a new dictionary that contains all the keys from d1 that are not in d2. You can set the values to None.
def subtract(d1, d2):
    d3 = {}
    for key in d1.keys():
        if key not in d2:
            d3[key] = None
    return d3
• Let’s try it out:
import nltk
from nltk.book import *
from nltk.corpus import stopwords

d1 = {}
for word in text7:
    d1[word] = None
wordlist = stopwords.words("english")
d2 = {}
for word in wordlist:
    d2[word] = None
rest_dict = subtract(d1, d2)
wordlist_min_stopwords = rest_dict.keys()
Questions?
Evaluation Assignment
• Deadline = 23/01/2012
• Conversation in the week of 23/01/12
• If you need any explanation about the content of the assignment, feel free to email me
Further Reading
• Since this was only a short introduction to programming in Python, if you want to expand your programming skills further: see chapters 15–18 about object-oriented programming
• Think Python: How to Think Like a Computer Scientist
• NLTK book
• Official Python documentation: http://www.python.org/doc/
• There is a newer version of Python available (Python 3), but it is not (yet) compatible with NLTK
• Our research group:
CLiPS: Computational Linguistics and Psycholinguistics Research Center
http://www.clips.ua.ac.be/
• Our projects:
http://www.clips.ua.ac.be/projects
Happy holidays and success with your exams