LING 681 Intro to Comp Ling
WEB TEXT
DAY 34 - 11/14/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
3.7. How to deal with non-English characters
4.5. How to create a pattern with Unicode characters
6. Control
NLP, Prof. Howard, Tulane University
14-Nov-2014
Open Spyder
Review
Twitter
Finding text on the web
http://sethgodin.typepad.com/
Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu in menu bar, then Develop > Show Page Source
<div class="entry-body"> <p>If someone asked
you how to do something …. By all means, you still
need pictures, even video. But there's nothing
to replace the specificity that comes from the
alphabet. Use labels. Use words.</p> </div><!-- .entry-body -->
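The markup above shows that the text of a post sits inside a div whose class is entry-body. As a minimal offline sketch, here is one way to pull that text out in Python 3 with the standard-library html.parser module, which stands in for BeautifulSoup so that nothing has to be installed or downloaded. SAMPLE_HTML is a copy of the slide's markup (with the slide's own elision), and the name EntryBodyExtractor is made up for this illustration.

```python
from html.parser import HTMLParser

# Copy of the markup shown on the slide; the "…" is the slide's own elision.
SAMPLE_HTML = (
    '<div class="entry-body"> <p>If someone asked you how to do something …. '
    "By all means, you still need pictures, even video. But there's nothing "
    "to replace the specificity that comes from the alphabet. "
    "Use labels. Use words.</p> </div><!-- .entry-body -->"
)


class EntryBodyExtractor(HTMLParser):
    """Collect the text found inside <div class="entry-body"> elements."""

    def __init__(self):
        super().__init__()
        self.depth = 0      # how many <div>s deep we are inside the target div
        self.chunks = []    # text fragments collected so far

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        classes = (dict(attrs).get("class") or "").split()
        if self.depth > 0:
            self.depth += 1          # a nested div inside the target
        elif "entry-body" in classes:
            self.depth = 1           # entering the target div

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1          # leaving a div inside (or equal to) the target

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


parser = EntryBodyExtractor()
parser.feed(SAMPLE_HTML)
parser.close()
entry_text = " ".join("".join(parser.chunks).split())  # collapse runs of whitespace
print(entry_text)
```

The class name must match the page source exactly, hyphen included; a search for a misspelled class finds nothing.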
We need
requests
% pip install feedparser
% pip install BeautifulSoup4
Get the text
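import requests
from bs4 import BeautifulSoup

url = 'http://sethgodin.typepad.com/'
# Download the page and keep its HTML as text.
html = requests.get(url).text
# Name the parser explicitly so BeautifulSoup does not have to guess.
soup = BeautifulSoup(html, 'html.parser')
# The class is "entry-body", with a hyphen, as in the page source.
print(soup.find("div", {"class": "entry-body"}).text)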
Install feedparser by hand
https://pypi.python.org/pypi/feedparser
click on Downloads button
choose .zip file
$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install
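After installing, it can help to confirm that Python can actually find the packages. The following is a small sketch for Python 3, not part of the original slides; the helper name find_missing is made up for illustration. Note that the BeautifulSoup4 package is imported under the name bs4.

```python
import importlib.util

# Import names for the three packages this lesson needs; the
# BeautifulSoup4 distribution is imported as "bs4".
REQUIRED_MODULES = ["requests", "feedparser", "bs4"]


def find_missing(module_names):
    """Return the names in module_names that cannot be imported here."""
    return [name for name in module_names
            if importlib.util.find_spec(name) is None]


missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Still missing: " + ", ".join(missing))
else:
    print("All required modules are importable.")
```

Run it in the same interpreter that Spyder uses, since a package installed for one Python is not visible to another.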
Get the RSS feed
from bs4 import BeautifulSoup
import feedparser

url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)
print("Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title))

# Keep the title, plain-text content, and link of every entry.
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text(),
                       'link': e.links[0].href})

print(blog_posts[0]['content'])
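To see what the feed code is doing without going online, here is a rough Python 3 sketch that builds the same list of title/content/link dictionaries from a feed stored in a string. It uses only the standard library (xml.etree.ElementTree and html.parser) in place of feedparser and BeautifulSoup. SAMPLE_RSS is a made-up two-item feed, and strip_tags is a hypothetical helper; a real feed may use different element names (Atom feeds, for example, are laid out differently).

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Made-up two-item RSS 2.0 feed, used only so the sketch runs offline.
SAMPLE_RSS = """<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>&lt;p&gt;Use &lt;b&gt;labels&lt;/b&gt;.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>&lt;p&gt;Use words.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>
"""


class TextOnly(HTMLParser):
    """Keep the text between tags and drop the tags themselves."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_tags(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks)


channel = ET.fromstring(SAMPLE_RSS).find("channel")
items = channel.findall("item")
print("Fetched %s entries from '%s'" % (len(items), channel.findtext("title")))

# Same shape as the list built from the live feed: one dict per entry.
blog_posts = []
for item in items:
    blog_posts.append({
        "title": item.findtext("title"),
        "content": strip_tags(item.findtext("description")),
        "link": item.findtext("link"),
    })

print(blog_posts[0]["content"])
```

Each entry's content arrives as HTML inside the feed, which is why the tags are stripped before the text is stored.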
Next time
something else
maybe a quiz