LING 681 Intro to Comp Ling

ON-LINE DOCUMENTS
DAY 20 - 10/13/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
 3.7.
How to deal with non-English characters
 4.5. How to create a pattern with Unicode characters
 6. Control
NLP, Prof. Howard, Tulane University
13-Oct-2014
The NLTK archive
Basic text analysis
Open Spyder
7. Corpora of digital texts
Now that you have gotten a taste of Python, let
us turn to the main course, textual computing or
the computational analysis of text. But we do not
have a text to work with yet, so let’s go and find
one.
7.1. How to get a text from an on-line archive
The first step is to figure out where to put the
file.
7.1.1. How to navigate folders with os
>>> import os
>>> os.getcwd()
'/Applications/IDEs/Spyder.app/Contents/Resources'
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts/')
>>> os.getcwd()
'/Users/{your_user_name}/Documents/pyScripts'
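The same navigation can be wrapped in a small helper for use in a script. This is only a sketch, assuming Python 3; the function name change_to and the temporary folder standing in for pyScripts are illustrative and do not come from the slides.

```python
import os
import tempfile

def change_to(folder):
    """Make `folder` the working directory and return the previous one."""
    previous = os.getcwd()
    os.chdir(folder)
    return previous

# A temporary folder stands in for your pyScripts folder (an assumption).
scratch = tempfile.mkdtemp()
start = change_to(scratch)
here = os.getcwd()   # now points at the scratch folder
os.chdir(start)      # go back so the rest of the session is unaffected
```

Returning the previous folder makes it easy to undo the change once the file work is done.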
7.1.2. Project Gutenberg
http://www.gutenberg.org/ebooks/28554
7.1.3. How to download a file with urllib and convert it to a string with read()
>>> from urllib import urlopen # Python 2; in Python 3: from urllib.request import urlopen
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = urlopen(url)
>>> downloadString = download.read()
>>> type(downloadString)
>>> len(downloadString) # 35739?
>>> downloadString[:50]
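Under Python 3, read() returns bytes rather than a text string, so the download has to be decoded before it can be treated as text. The sketch below assumes Python 3 and UTF-8 content. The names fetch_text and fake_opener are hypothetical, and the fake opener only stands in for the network so that the example runs offline.

```python
import io
from urllib.request import urlopen  # Python 3 location of urlopen

def fetch_text(url, opener=urlopen, encoding='utf-8'):
    """Download `url` and return its contents as a text string."""
    response = opener(url)
    try:
        raw = response.read()   # bytes in Python 3
    finally:
        response.close()
    return raw.decode(encoding)

def fake_opener(url):
    """Offline stand-in for urlopen (an assumption for this demo)."""
    return io.BytesIO('caf\u00e9 Wub'.encode('utf-8'))

sample = fetch_text('http://example.invalid/wub.txt', opener=fake_opener)
```

With a live connection, calling fetch_text(url) with the default opener would perform the real download.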
7.1.4. How to save a file to your drive with open(), write(), and close()
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
# Python 2 note: read() returned a byte string, so encode('utf8') succeeds only if the text is pure ASCII
>>> tempFile.write(downloadString.encode('utf8'))
>>> tempFile.close()
# import os if you haven't already done so
>>> os.listdir('.')
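As a hedged alternative, the open(), write(), close() sequence can be written with a with-block, which closes the file even if the write fails. The sketch assumes Python 3 and a text string rather than bytes. The helper name save_text and the temporary folder are illustrative only.

```python
import os
import tempfile

def save_text(text, path):
    """Write `text` to `path` as UTF-8; the with-block closes the file."""
    with open(path, 'w', encoding='utf-8') as handle:
        handle.write(text)

folder = tempfile.mkdtemp()               # stand-in for pyScripts
target = os.path.join(folder, 'Wub.txt')
save_text('sample text', target)
listing = os.listdir(folder)              # confirm that the file exists
```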
7.1.5. How to look at a file with open() and read()
>>> tempFile = open('Wub.txt','r')
>>> text = tempFile.read()
>>> type(text)
>>> len(text)
>>> text[:50]
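Reading can be sketched the same way. Assuming Python 3, passing an explicit encoding to open() makes read() return a decoded text string. The helper name load_text and the sample file written first are assumptions made so that the example is self-contained.

```python
import os
import tempfile

def load_text(path):
    """Return the whole file at `path` as one text string."""
    with open(path, 'r', encoding='utf-8') as handle:
        return handle.read()

# Write a small sample file first so there is something to read.
path = os.path.join(tempfile.mkdtemp(), 'Wub.txt')
with open(path, 'w', encoding='utf-8') as handle:
    handle.write('na\u00efve sample\n')

text = load_text(path)
first_five = text[:5]   # slicing works on characters, not bytes
```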
7.1.6. How to slice away what you don’t need
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
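The slicing steps above can be gathered into one function. This is a minimal sketch, assuming Python 3 and the marker strings shown on the slide, which may differ in a given Project Gutenberg file. The function name strip_gutenberg, the final strip() call, and the sample text are additions for illustration.

```python
START = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def strip_gutenberg(text, start_marker=START, end_marker=END):
    """Return the text between the header and footer marker lines."""
    line_index = text.index(start_marker)        # ValueError if absent
    start_index = text.index('\n', line_index)   # end of the marker line
    end_index = text.index(end_marker, start_index)
    # strip() trims the blank space left around the story (an addition)
    return text[start_index:end_index].strip()

sample = ('Header\n'
          '*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'The story.\n'
          '*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n'
          'Footer\n')
story = strip_gutenberg(sample)
```

Because index() raises ValueError when a marker is missing, a file with different markers fails loudly instead of being silently truncated.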
Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()
Homework
Turn the commands reviewed above into a function
in a script that takes a URL and the name of a text
file as arguments and saves the Project Gutenberg
file to your pyScripts folder without the
Project Gutenberg header and footer.
Next time
How to use PDF files