Web Scraping
with Python and Selenium
What is Web Scraping?
Software technique for extracting info from websites
Get information programmatically that would otherwise be inaccessible (e.g. there is no “Download data” button)
Many modules in many languages
We’ll be using Selenium (a module for Python)
More info:
https://en.wikipedia.org/wiki/Web_scraping
Things to Consider
What question do you want to ask?
What data is needed?
Where can we find this data?
There’s no button to download a csv file…
What do we do now?
This is where Web Scraping comes in.
Tools we will need: a Browser
We’re using Google Chrome
https://www.google.com/chrome/browser/desktop/index.html
Tools we will need: Programming Text Editor
Pick your favorite programming text editor (NOT Notepad, Word, etc.)
Sublime Text 2: http://www.sublimetext.com/2
Select your operating system.
Open the dmg file
Drag it to the Applications folder (Mac)
Tools we will need: Text Editors (other editors)
Gedit: https://sourceforge.net/projects/gedit/?source=typ_redirect
Xcode
Tools we will need: Python
Install (or update) Python: https://www.python.org/downloads/
Click “Download Python 2.7.11”
Once the download finishes, click the package and run the installer
Tools we will need: a Web driver
We’re using chromedriver
Found here: http://chromedriver.storage.googleapis.com/index.html?path=2.21/
Select the one for your OS
Take note of where you save the driver on your file system
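Why the saved location matters: Selenium has to be told where chromedriver lives. A minimal sketch, assuming Python 3 and Selenium 4 (both newer than the versions named in these slides); the path below is hypothetical, so swap in the location you noted.

```python
import os

# Hypothetical location; replace with wherever you saved chromedriver.
CHROMEDRIVER_PATH = os.path.expanduser("~/Downloads/chromedriver")


def driver_file_exists(path):
    """Return True if there is a file at the path you noted down."""
    return os.path.isfile(path)


def start_chrome(path):
    """Start Chrome through the chromedriver saved at `path`."""
    if not driver_file_exists(path):
        raise FileNotFoundError("chromedriver not found at: " + path)
    # Imported here so the path check works even before Selenium is installed.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    # Selenium 4 style; older releases took the path as a direct argument.
    return webdriver.Chrome(service=Service(executable_path=path))
```

Calling start_chrome(CHROMEDRIVER_PATH) should open a Chrome window; call quit() on the returned driver when you are done.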
Last -- Install Selenium
Open terminal (Mac) or Command Line (PC)
Run the following command:
pip install selenium
Errors? Try sudo -H pip install selenium
NOTE: if your version of Python is older than 2.7.9, you may not have pip. Upgrade Python!
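One way to confirm the install worked is to ask Python whether it can find the modules. This is only a sketch and assumes Python 3; the helper name is_installed is made up for illustration.

```python
import importlib.util
import sys


def is_installed(name):
    """Return True if a top-level module called `name` can be found."""
    return importlib.util.find_spec(name) is not None


print("Python version:", sys.version.split()[0])
print("pip available:", is_installed("pip"))
print("selenium installed:", is_installed("selenium"))
```

If the last line reports False, rerun the pip command above and check for errors.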
Before we get started, let’s talk about…
HTML
Structure
Just a giant tree of tags.
Key is to focus in on the right tags.
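To make the "tree of tags" idea concrete, here is a small sketch that prints each tag indented by how deeply it is nested. It assumes Python 3 and uses only the standard library; the sample page is made up for illustration.

```python
from html.parser import HTMLParser

# Made-up page, for illustration only.
SAMPLE_HTML = """
<html>
  <body>
    <h1>Prices</h1>
    <table id="prices">
      <tr><td>Apples</td><td>1.20</td></tr>
      <tr><td>Pears</td><td>0.95</td></tr>
    </table>
  </body>
</html>
"""


class TagTree(HTMLParser):
    """Record each opening tag, indented by its depth in the tree."""

    def __init__(self):
        super().__init__()
        self.depth = 0
        self.lines = []

    def handle_starttag(self, tag, attrs):
        self.lines.append("  " * self.depth + tag)
        self.depth += 1

    def handle_endtag(self, tag):
        self.depth -= 1


tree = TagTree()
tree.feed(SAMPLE_HTML)
print("\n".join(tree.lines))
```

In this sample the data sits in the td tags nested under table, which is the kind of "right tag" you would focus in on.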
Let’s get started!
Think about your question!
We will skip this step because we already know what page we want.
Go to the webpage
Find the data you want to scrape!
Inspect element…
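The steps above can be sketched in code roughly as follows. This is an illustration under assumptions, not the deck's own script: the function name scrape_text is made up, and the URL and CSS selector you pass in are whatever you found with Inspect element.

```python
def scrape_text(driver, url, css_selector):
    """Load `url` and return the visible text of every matching element."""
    # "Go to the webpage"
    driver.get(url)
    # "Find the data you want to scrape": match the tags seen in Inspect element.
    # "css selector" is the string value of Selenium's By.CSS_SELECTOR.
    elements = driver.find_elements("css selector", css_selector)
    return [element.text for element in elements]
```

With a running Selenium driver, a call such as scrape_text(driver, "http://example.com", "td") would return the text of every td cell on that page; call driver.quit() afterwards to close the browser.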
Questions?
Start Coding
References
Selenium API