LING 681 Intro to Comp Ling

WEB TEXT
DAY 34 - 11/14/14
LING 3820 & 6820
Natural Language Processing
Harry Howard
Tulane University
Course organization
http://www.tulane.edu/~howard/LING3820/
The syllabus is under construction.
http://www.tulane.edu/~howard/CompCultEN/
Chapter numbering
 3.7.
How to deal with non-English characters
 4.5. How to create a pattern with Unicode characters
 6. Control
NLP, Prof. Howard, Tulane University
14-Nov-2014
Open Spyder
Review
Twitter
Finding text on the web
http://sethgodin.typepad.com/
Firefox: Tools > Web Developer > Page Source
Safari: Preferences > Advanced > Show Develop menu, then Develop > Show Page Source
<div class="entry-body"> <p>If someone asked
you how to do something …. By all means, you still
need pictures, even video. But there&#39;s nothing
to replace the specificity that comes from the
alphabet. Use labels. Use words.</p> </div><!-- .entry-body -->
We need
requests

% pip install feedparser

% pip install BeautifulSoup4
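Before moving on, it can help to confirm that the three packages are actually importable. The following is a minimal sketch, assuming Python 3.4 or later (the slides themselves use Python 2); the names REQUIRED and missing_packages are made up for this illustration. Note that BeautifulSoup4 is imported under the module name bs4.

```python
import importlib.util

# Map each import name to the name used with pip.
# BeautifulSoup4 is imported as "bs4".
REQUIRED = {
    "requests": "requests",
    "feedparser": "feedparser",
    "bs4": "BeautifulSoup4",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose import name cannot be found."""
    return [pip_name for module, pip_name in required.items()
            if importlib.util.find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Still to install: " + ", ".join(missing))
    else:
        print("All three packages are available.")
```

Anything listed as still to install can then be added with pip as shown above.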
Get the text
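import requests
from bs4 import BeautifulSoup
url = 'http://sethgodin.typepad.com/'
# Download the page and keep the response body as text.
html = requests.get(url).text
# Name the parser explicitly so the result does not depend on what is installed.
soup = BeautifulSoup(html, 'html.parser')
# The class in the page source is "entry-body", with a hyphen.
print soup.find("div", {"class": "entry-body"}).text.encode('utf8')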
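The lookup step can also be tried offline. What follows is a minimal sketch, not the method used in the slides: it swaps BeautifulSoup for html.parser from the Python 3 standard library, and it runs on a made-up sample string modelled on the page source rather than on the live page. The names EntryBodyText, entry_body_text and sample are hypothetical.

```python
from html.parser import HTMLParser


class EntryBodyText(HTMLParser):
    """Collect the text inside the first <div class="entry-body">."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &#39; into characters.
        super().__init__(convert_charrefs=True)
        self.depth = 0      # how many <div>s deep we are inside the target
        self.done = False   # set once the target div has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag != "div" or self.done:
            return
        if self.depth > 0:
            self.depth += 1
        elif "entry-body" in (dict(attrs).get("class") or "").split():
            self.depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


def entry_body_text(html):
    """Return the stripped text of the first entry-body div, or ''."""
    parser = EntryBodyText()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


# Made-up sample markup; not the actual blog post.
sample = ('<div class="entry-body"> <p>There&#39;s nothing to replace '
          'words.</p> </div><div class="entry-footer">ignored</div>')
print(entry_body_text(sample))
```

The hyphen in "entry-body" matters: a lookup for a class that is not in the page finds nothing, which in the BeautifulSoup version means find() returns None and the .text access fails.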
Install feedparser by hand
https://pypi.python.org/pypi/feedparser
click on Downloads button
choose .zip file
$ cd /Users/harryhow/Downloads/feedparser-5.1.3
$ python setup.py install
Get the RSS feed
from bs4 import BeautifulSoup
import feedparser
url = 'feed://feeds.feedblitz.com/sethsblog'
fp = feedparser.parse(url)
print "Fetched %s entries from '%s'" % (len(fp.entries), fp.feed.title)
# Collect the title, plain-text content and link of every entry.
blog_posts = []
for e in fp.entries:
    blog_posts.append({'title': e.title,
                       'content': BeautifulSoup(e.content[0].value, 'html.parser').get_text().encode('utf8'),
                       'link': e.links[0].href})
print blog_posts[0]['content']
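To see the shape of the blog_posts list without fetching anything, here is a minimal sketch that builds the same kind of list from a made-up RSS 2.0 string. It is an assumption-laden stand-in, not the slides' method: it uses xml.etree.ElementTree from the Python 3 standard library instead of feedparser, and SAMPLE_RSS, read_posts and the example.com links are all invented for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up two-item feed; titles, links and text are placeholders.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example blog</title>
    <item>
      <title>First post</title>
      <link>http://example.com/first</link>
      <description>&lt;p&gt;Use labels. Use words.&lt;/p&gt;</description>
    </item>
    <item>
      <title>Second post</title>
      <link>http://example.com/second</link>
      <description>&lt;p&gt;Another entry.&lt;/p&gt;</description>
    </item>
  </channel>
</rss>"""


def read_posts(rss_text):
    """Return the feed title and a list of dicts, one per <item>."""
    channel = ET.fromstring(rss_text).find("channel")
    posts = [{"title": item.findtext("title"),
              "content": item.findtext("description"),
              "link": item.findtext("link")}
             for item in channel.findall("item")]
    return channel.findtext("title"), posts


feed_title, blog_posts = read_posts(SAMPLE_RSS)
print("Fetched %s entries from '%s'" % (len(blog_posts), feed_title))
print(blog_posts[0]["content"])
```

In this sketch the content field still contains HTML tags; the feedparser version above removes them by passing each entry through BeautifulSoup's get_text().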
Next time
something else
maybe a quiz