Web Scraping
with Python and Selenium
What is Web Scraping?

A software technique for extracting information from websites
Get information programmatically that would otherwise be inaccessible (e.g. there is no “Download data” button)

Many modules in many languages

We’ll be using Selenium (a module for Python)
More info:
https://en.wikipedia.org/wiki/Web_scraping
Things to Consider

• What question do you want to ask?
• What data is needed?
• Where can we find this data?
• There’s no button to download a CSV file…
• What do we do now?
• This is where Web Scraping comes in.
Tools we will need: a Browser
• We’re using Google Chrome
https://www.google.com/chrome/browser/desktop/index.html
Tools we will need: Programming Text Editor
• Pick your favorite programming text editor (NOT Notepad, Word, etc.)
• Sublime Text 2: http://www.sublimetext.com/2
  • Select your operating system.
  • Open the dmg file
  • Drag it to the Applications folder (Mac)
Tools we will need: Text Editors (other editors)
• Gedit: https://sourceforge.net/projects/gedit/?source=typ_redirect
• Xcode
Tools we will need: Python
• Install (or update) Python: https://www.python.org/downloads/
  • Click “Download Python 2.7.11”
  • Once the download finishes, click the package and run the installer
Tools we will need: a Web driver

We’re using chromedriver
• Found here: http://chromedriver.storage.googleapis.com/index.html?path=2.21/
• Select the one for your OS
• Take note of where you save the driver on your file system
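The saved driver location is what Selenium needs later. A minimal sketch of using it, assuming Selenium 4's `Service` class (older releases took the path directly as the first argument to `webdriver.Chrome`). The path `/path/to/chromedriver` and the helper names `driver_path_ok` and `open_chrome` are placeholders for illustration, not part of the slides:

```python
import os

# Placeholder: replace with the location where you saved chromedriver.
DRIVER_PATH = "/path/to/chromedriver"


def driver_path_ok(path):
    """Return True if the path points to an existing, executable file."""
    return os.path.isfile(path) and os.access(path, os.X_OK)


def open_chrome(driver_path):
    """Start Chrome through the chromedriver at driver_path."""
    if not driver_path_ok(driver_path):
        raise IOError("chromedriver not found or not executable: %s" % driver_path)
    # Imported here so the path check above works even before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    return webdriver.Chrome(service=Service(executable_path=driver_path))


print("Driver found: %s" % driver_path_ok(DRIVER_PATH))
```

Calling `open_chrome(DRIVER_PATH)` should open a Chrome window once the path is correct; call `quit()` on the returned driver to close it.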
Last -- Install Selenium
• Open terminal (Mac) or Command Line (PC)
• Run the following command:
pip install selenium
• Errors? Try sudo -H pip install selenium
NOTE: if your version of Python is older than 2.7.9, you may not have pip. Upgrade Python!
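A quick way to check the setup so far is to ask Python itself what it can import. The sketch below uses only the standard library; the helper name `has_module` is made up for this example:

```python
import importlib
import sys


def has_module(name):
    """Return True if the named module can be imported."""
    try:
        importlib.import_module(name)
        return True
    except ImportError:
        return False


print("Python version: %d.%d.%d" % sys.version_info[:3])
print("pip available: %s" % has_module("pip"))
print("selenium available: %s" % has_module("selenium"))
```

If the last line prints `False`, the `pip install selenium` step probably did not finish, or it ran under a different Python installation.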
Before we get started, let’s talk about…
• HTML Structure
  • Just a giant tree of tags.
  • Key is to focus in on the right tags.
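To see the "tree of tags" idea concretely, here is a small sketch that prints the nesting of a made-up HTML snippet using Python's standard-library parser (no Selenium needed). The sample HTML and the class name `TagTreePrinter` are invented for illustration:

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2


class TagTreePrinter(HTMLParser):
    """Record each opening tag, indented by how deeply it is nested."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        # Assumes every tag is closed, which holds for the sample below.
        self.depth -= 1


SAMPLE = "<html><body><table><tr><td>42</td></tr></table></body></html>"

printer = TagTreePrinter()
printer.feed(SAMPLE)
print("\n".join(printer.lines))
```

The value we want (42) sits inside the `td` tag at the bottom of the tree, so `td` is the "right tag" to focus on in this sample.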
Let’s get started!
• Think about your question!
  • We will skip this step because we already know what page we want.
• Go to the webpage
• Find the data you want to scrape!
• Inspect element…
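The steps above can be sketched in code roughly as follows. This is an illustration under assumptions, not the code from the session: the URL `https://example.com`, the selector `td`, and the function names `scrape_texts` and `main` are placeholders, and `main` assumes chromedriver can be found on your PATH.

```python
# Same string value as selenium.webdriver.common.by.By.CSS_SELECTOR.
CSS_SELECTOR = "css selector"


def scrape_texts(driver, url, css_selector):
    """Go to the page, find the matching tags, and return their text."""
    driver.get(url)                                          # go to the webpage
    elements = driver.find_elements(CSS_SELECTOR, css_selector)  # find the data
    return [element.text for element in elements]


def main():
    # Imported here so the function above can be read and tested on its own.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumes chromedriver is on your PATH
    try:
        # Placeholder URL and selector; use the tag you found with
        # "Inspect element" on your own page.
        for text in scrape_texts(driver, "https://example.com", "td"):
            print(text)
    finally:
        driver.quit()  # always close the browser window
```

Run `main()` from a Python prompt to try it. The selector is the part you change most often: it is whatever tag or class "Inspect element" shows around the data you want.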
Questions?
Start Coding
References
Selenium API