Powerpoint of Flat text 1

FLAT TEXT
DAY 6 - 9/12/16
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/NLP/
1.1.7. Schedule of assignments
Is there anyone here that wasn't here last week?
NLP, Prof. Howard, Tulane University
12-Sep-2016
Review
The quiz was the review.
5. Flat text
Now that you have gotten a taste of Python, let
us turn to the main course, textual computing or
the computational analysis of text. But we do not
have a text to work with yet, so let’s go and find
one.
7.1. How to get a text from an on-line archive
The first step is to figure out where to put the
file.
How to navigate folders with os
# check your current working directory in Python
>>> import os
>>> os.getcwd()
'/Users/harryhow/Documents/pyScripts'
>>> os.listdir('.')
# if the path is not to your pyScripts folder, then change it:
>>> os.chdir('/Users/{your_user_name}/Documents/pyScripts')
>>> os.getcwd()
>>> os.listdir('.')
# if you have no pyScripts folder
>>> os.chdir('/Users/{your_user_name}/Documents/')
>>> os.makedirs('pyScripts')
>>> os.listdir('.')
>>> os.path.exists('/Users/{your_user_name}/Documents/pyScripts')
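The folder check and the folder creation can also be folded into one step. The following is a minimal sketch for Python 3, not the course's own method: the helper name ensure_folder is hypothetical, and it relies on the exist_ok argument of os.makedirs, which creates the folder only when it is missing.

```python
import os

def ensure_folder(parent, name='pyScripts'):
    """Create parent/name if it is missing and return its full path.

    ensure_folder is a hypothetical helper, not part of the os module.
    """
    path = os.path.join(parent, name)
    # exist_ok=True: no error is raised if the folder is already there
    os.makedirs(path, exist_ok=True)
    return path

# Usage, assuming your Documents folder sits in your home folder:
# documents = os.path.join(os.path.expanduser('~'), 'Documents')
# os.chdir(ensure_folder(documents))
# os.getcwd()
```

Calling it a second time is harmless, so it can be run at the top of every session.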
Project Gutenberg
http://www.gutenberg.org/ebooks/28554
How to download a file with requests and
convert it to a string with .text
>>> import requests
>>> url = 'http://www.gutenberg.org/cache/epub/28554/pg28554.txt'
>>> download = requests.get(url).text
# find out about it
>>> type(download)
>>> len(download) # 35739?
>>> download[:150]
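requests is a third-party package and may not be installed everywhere. As a hedged alternative, the standard library's urllib.request (Python 3) can fetch the same file: urlopen returns a response whose read() yields bytes, and decode() turns those bytes into a string. The function name fetch_text, the timeout value, and the assumption that the file is UTF-8 are choices made for this sketch, not taken from the slides.

```python
from urllib.request import urlopen

def fetch_text(url, encoding='utf-8', timeout=30):
    """Download url and return its contents as a string.

    fetch_text is a hypothetical helper; the utf-8 default is an assumption.
    """
    with urlopen(url, timeout=timeout) as response:
        raw = response.read()      # bytes, exactly as they were sent
    return raw.decode(encoding)    # bytes -> str

# Usage (needs a network connection):
# download = fetch_text('http://www.gutenberg.org/cache/epub/28554/pg28554.txt')
# type(download)
# len(download)
```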
How to save a file to your hard drive
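# it is assumed that Python is looking at your pyScripts folder
# Python 2: the downloaded string is encoded to UTF-8 bytes before writing
# (in Python 3, use open('Wub.txt','w',encoding='utf8') and write(download))
>>> tempF = open('Wub.txt','w')
>>> tempF.write(download.encode('utf8'))
>>> tempF.close()
>>> tempF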
How to read a file from your hard drive
>>> tempF = open('Wub.txt','r')
>>> doc = tempF.read()
>>> tempF.close()
# these can be combined:
>>> doc = open('Wub.txt', 'r').read()
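In Python 3, open() accepts an encoding argument, and a with block closes the file automatically, so saving a string and reading it back can be sketched as follows. The helper names save_text and load_text are hypothetical, and UTF-8 is assumed as the encoding.

```python
def save_text(text, filename, encoding='utf-8'):
    """Write the string text to filename, encoding it on the way out."""
    with open(filename, 'w', encoding=encoding) as f:
        f.write(text)   # pass the str itself; open() handles the encoding

def load_text(filename, encoding='utf-8'):
    """Read filename and return its contents as a string."""
    with open(filename, 'r', encoding=encoding) as f:
        return f.read()

# Usage, assuming download holds the text fetched earlier:
# save_text(download, 'Wub.txt')
# doc = load_text('Wub.txt')
```

The with block matters most when something goes wrong in the middle: the file is still closed even if write() or read() raises an error.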
Find out about it
>>> type(doc)
>>> len(doc)
>>> import chardet
>>> chardet.detect(doc)
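In Python 3, chardet.detect() expects bytes rather than a string, so the file would have to be opened in 'rb' mode first. As a rough sketch that avoids third-party packages, the following checks whether a file's raw bytes decode cleanly as UTF-8. The helper name is_utf8 is hypothetical, and this is only a yes/no test for one encoding, not a substitute for chardet's guessing.

```python
def is_utf8(filename):
    """Return True if the raw bytes of filename decode as UTF-8."""
    with open(filename, 'rb') as f:   # 'rb' reads bytes, with no decoding
        raw = f.read()
    try:
        raw.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

# Usage, assuming Wub.txt is in the current working directory:
# is_utf8('Wub.txt')
```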
How to slice away what you don’t need
# assumption: text holds the full downloaded string, e.g. text = download
>>> text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
499
>>> lineIndex = text.index('*** START OF THIS PROJECT GUTENBERG EBOOK')
>>> startIndex = text.index('\n',lineIndex)
>>> text[:startIndex]
>>> text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> endIndex = text.index('*** END OF THIS PROJECT GUTENBERG EBOOK')
>>> story = text[startIndex:endIndex]
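The slicing steps can be gathered into one function. This is a minimal sketch under stated assumptions: the name strip_gutenberg is hypothetical, and the two marker strings are the ones used above, passed as parameters because Project Gutenberg files do not all word these lines identically.

```python
START_MARKER = '*** START OF THIS PROJECT GUTENBERG EBOOK'
END_MARKER = '*** END OF THIS PROJECT GUTENBERG EBOOK'

def strip_gutenberg(text, start_marker=START_MARKER, end_marker=END_MARKER):
    """Return the part of text between the Gutenberg start and end lines.

    strip_gutenberg is a hypothetical helper. str.index raises ValueError
    if a marker is missing, which signals that the markers need adjusting.
    """
    line_index = text.index(start_marker)
    # move to the end of the marker line so the rest of that line is dropped
    start_index = text.index('\n', line_index)
    # look for the end marker only after the start of the story
    end_index = text.index(end_marker, start_index)
    return text[start_index:end_index]

# Usage, assuming text holds the full downloaded string:
# story = strip_gutenberg(text)
```

The result still begins with the newline that ended the start line; story.strip() removes it along with any trailing blank lines.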
Now save it as “Wub.txt”
# it is assumed that Python is looking at your pyScripts folder
>>> tempFile = open('Wub.txt','w')
>>> tempFile.write(story.encode('utf8'))
>>> tempFile.close()
Homework
1. Get another text from Project Gutenberg onto
your computer.
2. (NOT YET) Turn the commands reviewed above into
a function in a script that takes a url and the name
of a text file as arguments and results in a Project
Gutenberg file being saved to your pyScripts folder
without the Project Gutenberg header & footer.
Next time
Other sources of flat text