Python In The Cloud

PyHou MeetUp, Dec 17th 2013
Chris McCafferty, SunGard Consulting Services
Overview
• What is the Cloud?
• What is Big Data?
• Big Data Sources
• Python and Amazon Web Services
• Python and Hadoop
• Other Pythonic Cloud providers
• Wrap-up
What Is The Cloud
• I want 40 servers and I want them NOW
• I want to store 100 TB of data cheaply and reliably
• We can do this with Cloud technologies
What is Big Data
• “Three Vs”
– Volume
– Variety
– Velocity
• Genome: sequencing machines throw off
several TB per day. Each.
• Hard drive performance is often the killer
bottleneck, both reading and writing
What is NOT Big Data
• Anything where the whole data set can be
held in memory on a single standard instance
• Data that can be held straightforwardly in a
traditional relational database
• Problems where most of the data can be
trivially excluded
• There are many challenging problems in the
world – but not all need Cloud or Big Data
tools to solve them
To The Cloud!
• Amazon Web Services is the 800lb gorilla in this
space
– Start here if in doubt
• Other options are RackSpace, Microsoft Azure,
(PiCloud/Multyvac?)
• You can also spin up some big iron very cheaply
– Current AWS big memory spec is cr1.8xlarge
– 244GB RAM, 32 Xeon-E5 cores, 10 Gigabit network
– $3.50 per hour
Geo Big Data Sources
• NASA SRTM data is on the large side
• NASA recently released a huge set of data directly into
the cloud: NEX
– Earth Sciences data sets
• Made available on Amazon Web Services public
datasets
• Available on S3 at:
– s3://nasanex/NEX-DCP30
– s3://nasanex/MODIS
– s3://nasanex/Landsat
• There are many, many geo data sets available now
(NOAA Lidar, etc)
Time for some code
• Example - Use S3 browser to look at new
NASA NEX data
• Let’s download some with the boto package
• Quickest to do this from an Amazon data
centre
• See DemoDownloadNasaNEX.py
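
DemoDownloadNasaNEX.py itself isn't reproduced in this transcript; the following is a minimal sketch of the same idea using the boto 2 API: list a few objects under the s3://nasanex/NEX-DCP30 prefix from the slide above and pull one down. It assumes AWS credentials are already configured (environment variables or ~/.boto).

import boto

# Connect using credentials from the environment or ~/.boto
conn = boto.connect_s3()
bucket = conn.get_bucket('nasanex', validate=False)

# Peek at the first few objects under the NEX-DCP30 prefix
keys = []
for i, key in enumerate(bucket.list(prefix='NEX-DCP30/')):
    print('%s  (%d bytes)' % (key.name, key.size))
    keys.append(key)
    if i >= 4:
        break

# Download the first listed object to the current directory
# (quickest when run from an EC2 instance in the same region)
if keys:
    keys[0].get_contents_to_filename(keys[0].name.split('/')[-1])
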
Weather & Big Data Sources
• Good public weather and energy data
• It's hard to move data around for free: just try!
• Power grids shed many GB of public data a day
– Historical data sets form many Terabytes
• Weather data available from NOAA
– QCLCD: Hourly, daily, and monthly summaries for approximately 1,600 U.S. locations.
– ASOS data contains sensor data at one-minute
intervals. 5 min intervals available too.
• 900 stations, 3-4MB per day, 12 years of data = 11-15TB data
set.
Why go to the cloud
• Cheap - see AWS pricing here
– spot pricing of m1.medium normally ~1c/hr
• The cloud is increasingly where the (public) data
will reside
• Pay as you go, less bureaucracy
• Support for Big Data technologies out of the box
– Amazon Elastic MapReduce (EMR) gives you a Hadoop cluster with minimal setup
• Host a big web server farm or video streaming
cluster
Python on AWS EC2
• AWS = Amazon Web Services. The Big Cloud
• EC2 = Elastic Compute Cloud
• Let’s spin up an instance and see what we have available (boto sketch below)
• See this script as one way to upgrade to Python
2.7
• Note absence of high-level packages like NumPy,
matplotlib and Pandas
• It would be very useful to have a very high-level
Python environment…
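
As a companion to the instance demo, here is a minimal sketch of spinning up an instance programmatically with boto rather than from the console. The AMI ID, key pair and security group names are placeholders, not values from the talk.

# Sketch: launch a single EC2 instance with boto and wait for it to come up.
import time
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')

reservation = conn.run_instances(
    'ami-xxxxxxxx',              # placeholder AMI ID
    key_name='my-keypair',       # hypothetical key pair name
    instance_type='m1.medium',
    security_groups=['default'],
)
instance = reservation.instances[0]

# Poll until the instance is running, then print its public DNS name
while instance.state != 'running':
    time.sleep(10)
    instance.update()
print('SSH to: %s' % instance.public_dns_name)

# Remember to terminate when done, or the meter keeps running:
# conn.terminate_instances(instance_ids=[instance.id])
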
StarCluster
• Cluster management in AWS, written by a group at MIT
• Convenient package to spin up clusters (Hadoop or other)
and copy across files
• Machine images (AMIs) for high-level Python environments
(NumPy, matplotlib, Pandas, etc)
• Not every high-level library is there
– No sklearn (SciKit-Learn, machine learning)
– But easier to pip-install with most pre-requisites already there
• Sun Grid Engine: Job Management
• Hadoop
• Boto plugin
• dumbo… and much more
Python's Support for AWS
• boto - interface to AWS (Amazon Web Services)
• Hadoop Streaming - use Python in MapReduce
tasks
• mrjob - Framework that wraps Hadoop Streaming and uses boto (sketched after this list)
• pydoop - wraps Hadoop Pipes, which is a C++ API into Hadoop MapReduce
• Write Python in User-Defined Functions in Pig,
Hive
– Essentially wraps MapReduce and Hadoop Streaming
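
To make the mrjob bullet concrete, here is the classic word-count job as it looks in mrjob. This is the standard introductory example for the framework, not code from the talk: each mapper receives rows via Hadoop Streaming, and the reducer sums the counts per word.

# wordcount_mrjob.py - word count expressed with mrjob.
# Run locally:  python wordcount_mrjob.py input.txt
# Run on EMR:   python wordcount_mrjob.py -r emr s3://my-bucket/input/   (bucket is hypothetical)
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # Hadoop Streaming hands each input row to the mapper
        for word in line.split():
            yield word.lower(), 1

    def reducer(self, word, counts):
        # All counts for a given word arrive at the same reducer
        yield word, sum(counts)

if __name__ == '__main__':
    MRWordCount.run()
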
Boto - Python Interface to AWS
• Support for HDFS
• Upload/download from Amazon S3 and Glacier
• Start/stop EC2 instances
• Manage users through IAM
• Virtually every API available from AWS is supported
• django-storages uses boto to present an S3 storage option (settings sketch below)
• See http://docs.pythonboto.org/en/latest/
• Make sure you keep your AWS key-pair secure
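
For the django-storages bullet above, a minimal settings.py fragment might look like the following. The backend path matches the django-storages s3boto backend of the time; the bucket name is hypothetical, and the keys are read from the environment rather than hard-coded, in line with keeping your key pair secure.

# settings.py fragment: store Django media files on S3 via django-storages + boto.
# Bucket name is hypothetical; keep credentials out of source control.
import os

DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
AWS_STORAGE_BUCKET_NAME = 'my-django-media'   # hypothetical bucket
AWS_ACCESS_KEY_ID = os.environ.get('AWS_ACCESS_KEY_ID')
AWS_SECRET_ACCESS_KEY = os.environ.get('AWS_SECRET_ACCESS_KEY')
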
Another Code Example – upload
• Example where we merge many files together
and upload to S3
• Merge files to avoid the Small Files Problem
• Note use of a retry decorator (exponential backoff) – sketched below
• See CopyToCloud.py and
MergeAndUploadTxOutages.py
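
CopyToCloud.py and MergeAndUploadTxOutages.py aren't included in this transcript; the sketch below shows the general pattern the slide describes: concatenate many small files into one (avoiding the Small Files Problem), then upload to S3 behind a retry decorator with exponential backoff. Bucket and file names are illustrative.

import glob
import time
import functools
import boto

def retry(tries=5, delay=1, backoff=2):
    """Retry the wrapped call, doubling the wait after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == tries - 1:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator

@retry()
def upload(bucket_name, key_name, local_path):
    bucket = boto.connect_s3().get_bucket(bucket_name)
    key = bucket.new_key(key_name)
    key.set_contents_from_filename(local_path)

# Merge many small daily files into one before uploading (hypothetical paths)
with open('outages_merged.txt', 'w') as out:
    for path in sorted(glob.glob('outages/*.txt')):
        with open(path) as f:
            out.write(f.read())

upload('my-bucket', 'outages/outages_merged.txt', 'outages_merged.txt')
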
What is Hadoop?
• A scalable data and job manager suitable for
MapReduce jobs
• Core technologies date from early 2000s at
Google
• Retries failed tasks, redundant data, good for
commodity hardware
• Rich ecosystem of tools including NoSQL
databases, good Python support
• Example, let’s spin up a cluster of 30 machines
with StarCluster
Hadoop Scales Massively
Hadoop Streaming
• Hadoop passes incoming data in rows on stdin
• Any program (including Python) can process
the rows and emit to stdout
• Logging and errors go to stderr
Hadoop Streaming - Echo
• Useful example that can be used for
debugging
• Tells you what Hadoop is actually passing your
task
• See echo.py
• Similar example firstten.py peeks at the first
ten lines then stops
• Useful for debugging
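
echo.py and firstten.py aren't reproduced in the transcript; a minimal version of the idea looks like this: write every row Hadoop passes on stdin straight back to stdout, and send diagnostics to stderr so they show up in the task logs.

# echo.py (sketch): emit every row Hadoop supplies on stdin unchanged to stdout.
import sys

for i, line in enumerate(sys.stdin):
    sys.stdout.write(line)
    sys.stderr.write('row %d: %d bytes\n' % (i, len(line)))
    # A firstten.py-style variant would stop peeking here:
    # if i >= 9:
    #     break
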
Hadoop Parsing Example
• Python's regex support makes it very good for parsing
unstructured data
• One of the keys in working with Hadoop and Big Data is
getting it into a clean row-based format
• Apply 'schema on read'
• Transmission Data from PJM is updated here every 5
mins: https://edart.pjm.com/reports/linesout.txt
• Needs cleaning up before we can use it for detailed analysis
- note multi-line format
• Script split_transmission.py
• Watch out for Hadoop splitting input blocks in the middle
of a file
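
split_transmission.py isn't included here, so the sketch below only illustrates the 'schema on read' idea: a streaming mapper that uses a regex to pull fields out of semi-structured rows and re-emit them as clean, tab-separated records. The record layout in the pattern is hypothetical; the real PJM file has the multi-line format noted above and needs extra handling.

# Sketch of 'schema on read' in a streaming mapper.
import re
import sys

# Hypothetical layout: "<date>  <facility name>  <status>"
RECORD = re.compile(r'^(\d{2}/\d{2}/\d{4})\s+(.+?)\s{2,}(\w+)\s*$')

for line in sys.stdin:
    match = RECORD.match(line)
    if not match:
        sys.stderr.write('skipping unparsed row: %r\n' % line)
        continue
    date, facility, status = match.groups()
    print('\t'.join([date, facility, status]))
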
Alternatives to AWS
• PiCloud offers open source software enabling you to run large computational clusters
– Just acquired by Dropbox
– Pay for what you use: 1 core and 300 MB of RAM costs $0.05/hr
– Doesn't offer many of the things Amazon does (AMIs, SMS) but great for computation or a private cloud
• Disco is MapReduce implemented in Python
– Started life at Nokia
– Has its own Distributed Filesystem (like HDFS)
• Or roll your own cluster in-house with pp (parallel python) – sketched below
• StarCluster / Sun Grid Engine on another vendor or in-house
• Google App Engine…?
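
For the pp (parallel python) option, a minimal sketch of farming out CPU-bound work looks like the following; with ppserver.py running on other machines you can list them in ppservers to build a small in-house cluster. Everything here is illustrative rather than taken from the talk.

import pp

def sum_of_squares(lo, hi):
    return sum(x * x for x in range(lo, hi))

# ppservers=() uses local cores only; add ('node1', 'node2') for a home-made cluster
job_server = pp.Server(ppservers=())

chunks = [(i, i + 1000000) for i in range(0, 10000000, 1000000)]
jobs = [job_server.submit(sum_of_squares, chunk) for chunk in chunks]

# Each job object blocks until its result is ready when called
total = sum(job() for job in jobs)
print('Total: %d' % total)
job_server.print_stats()
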
PiCloud
• Acquired by Dropbox in Nov 2013
• Dropbox will probably come out with its own cloud compute offering in 2014
• As of Dec 2013, no new sign-ups
• Existing customers encouraged to migrate
to Multyvac
• Feb 25th 2014 PiCloud will switch off
• The underlying PiCloud software is still open
source
Conclusions
• For cheap compute power and cheap storage, look to
the cloud
• Python is well-supported in this space
• Consider being close to your data: in the same cloud
– Moving data is expensive and slow
• Leverage AWS with tools like boto, StarCluster
• Beware setting up complex environments: installing
packages takes time and effort
• Ideally, think Pythonically – use the best tools to get the job done
Links
• Good rundown on the Python ecosystem around
Hadoop from Jan 2013:
– http://blog.cloudera.com/blog/2013/01/a-guide-to-python-frameworks-for-hadoop/
• Early vision for PiCloud (YouTube Mar 2012)
– http://www.youtube.com/watch?v=47NSfuuuMfs
• Disco MapReduce Framework from PyData
– http://www.youtube.com/watch?v=YuLBsdvCDo8
– PuTTY tool for Windows
• Some AWS & Python war stories:
– http://nz.pycon.org/schedule/presentation/12
Thank you
• Chris McCafferty
• http://christophermccafferty.com/blog
• Slides will be at:
• http://christophermccafferty.com/slides
• Contact me at:
• [email protected]