Mapreduce, Hadoop and Big Data

Software Installation Deck
Big Data Workshop
Saturday March 10th, 2012
Outline
• Local Installation
– Python
– Word Count Code and Files
– R and R-Studio
– Hadoop Local Installation
• Cloud Access
– Amazon Web Services Account
– Cloud-Based Software Demos
– R and R-Studio in the Cloud
– Cloudera Virtual Manager
– Virtualization Software
– R and Hadoop: ‘rmr’
Local
Python Installation
• Mac/Linux: Python comes preinstalled, so it should run as is.
• Windows: download and install Python from the following website:
– http://www.python.org/getit/windows
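To confirm that the installation works, a short script like the one below can be run. This is only a sketch: the function name describe_python is made up for illustration, and the script assumes a Python 3 interpreter, which may differ from the version used at the workshop.

```python
import platform


def describe_python():
    """Return a one-line description of the running interpreter."""
    # platform.python_version() returns a 'major.minor.patch' string.
    return "Python " + platform.python_version()


# If this line prints a version number, Python is installed and runnable.
print(describe_python())
```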
Python Wikipedia Word Count Files
Vipin created four input files of different sizes to test how long
each one takes to run locally.
• Python Word Count Script
– https://s3.amazonaws.com/com.hadoopinboston.scripts/seq.py
• Very Small File: 10 lines, 251 words
– https://s3.amazonaws.com/com.hadoopinboston.inputdata/input-lines
• Small: 64188 lines, 1.65M words (10 MB)
– https://s3.amazonaws.com/com.hadoopinboston.inputdata/input.txt
• Large: 500000 lines, 12M words (76 MB)
– https://s3.amazonaws.com/com.hadoopinboston.inputdata/input2.txt
• Very Large: 85 million lines (8 GB)
– https://s3.amazonaws.com/com.hadoopinboston.inputdata/all.txt
• Mapper.py (mapper in Python)
– https://s3.amazonaws.com/com.hadoopinboston.scripts/mapper.py
• Reducer.py (reducer in Python)
– https://s3.amazonaws.com/com.hadoopinboston.scripts/reducer-all.py
• Mapper in R
– https://s3.amazonaws.com/com.hadoopinboston.scripts/mapper.R
• Reducer in R
– https://s3.amazonaws.com/com.hadoopinboston.scripts/reducer.R
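The contents of the workshop scripts are not reproduced in this deck, so the following is only a minimal sketch of what a word count in the map/reduce style might look like in Python. The function names (map_words, reduce_counts, word_count, time_word_count) are hypothetical, and the tokenization (split on whitespace, lowercase) is an assumption; the actual scripts may count differently.

```python
import time
from itertools import groupby


def map_words(lines):
    """Map step: emit a (word, 1) pair for every word in every line."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reduce_counts(pairs):
    """Reduce step: sort pairs by word, then sum the counts per word."""
    # Sorting stands in for the shuffle/sort phase between map and reduce.
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    """Run map then reduce over an iterable of text lines."""
    return dict(reduce_counts(map_words(lines)))


def time_word_count(path):
    """Count words in the file at `path`; return (counts, seconds elapsed)."""
    start = time.perf_counter()
    with open(path, encoding="utf-8", errors="replace") as handle:
        counts = word_count(handle)
    return counts, time.perf_counter() - start
```

Calling time_word_count on a downloaded copy of one of the input files gives a rough local timing. Because this sketch sorts every pair in memory, it is only suitable for the smaller files.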
R and R-Studio Local Installation
LOCAL INSTALLATION:
R
http://lib.stat.cmu.edu/R/CRAN/
R-Studio
http://rstudio.org/
Hadoop Installation Mac/Linux
Please note that the local installation is for testing and debugging,
and that ‘production’ jobs will be run in the cloud.
• MacBook
– Install MacPorts (www.macports.org), then use it to get Hadoop:
sudo port install hadoop (DONE!)
• Linux
– Use your package manager (yum or apt-get) to get Hadoop:
sudo yum install hadoop (your mirror should have the Hadoop binaries)
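After installing, it can be useful to confirm that the hadoop command is on the PATH. The sketch below is one way to check from Python. The helper names are hypothetical, the script assumes Python 3.7 or later, and it relies only on the standard `hadoop version` subcommand; the exact banner text depends on the installed release.

```python
import shutil
import subprocess


def find_executable(name="hadoop"):
    """Return the full path of `name` if it is on the PATH, else None."""
    return shutil.which(name)


def hadoop_version_banner(name="hadoop"):
    """Return the first line printed by `<name> version`, or None."""
    path = find_executable(name)
    if path is None:
        # The command is not installed or not on the PATH.
        return None
    result = subprocess.run([path, "version"], capture_output=True, text=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None
```

If hadoop_version_banner() returns None, the package either did not install or its bin directory still needs to be added to the PATH.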
Hadoop Installation Windows
Please note that the local installation is for testing and debugging,
and that ‘production’ jobs will be run in the cloud.
• Microsoft is working with Hortonworks on contributing to the
Apache Hadoop project for Windows. Microsoft is also working on a
Community Technology Preview for Hadoop on Windows Azure
(http://hadooponazure.com), and the release for on-premises
installation is forthcoming. Those interested in running Hadoop on
their own Windows hardware can sign up for the preview, when it is available, at
http://www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data-solution.aspx
• TODAY, it is possible to install Hadoop on Windows, but the current
distributions require Cygwin, whereas the upcoming release will
not. There are some instructions for Windows that people can try (see for instance
http://blog.sqltrainer.com/2012/01/installing-and-configuring-apache.html).
Cloud
Cloud Account
• http://aws.amazon.com/
• The first example will run on Amazon's Elastic
MapReduce. It is similar in nature to this video:
– http://www.youtube.com/watch?v=kNsS9aDf6uE
Cloud-Based Software Packages (Demos)
Cloud Numerics
• http://blogs.msdn.com/b/cloudnumerics/archive/2012/02/07/cloud-numerics-example-analyzing-demographics-data-from-windows-azure-marketplace.aspx
MortarData
• http://mortardata.com/
R and R-Studio Cloud Access (No VM)
R-Studio in the Cloud:
• http://www.r-bloggers.com/rstudio-in-the-cloud-for-dummies/
R or R-Studio in the Cloud:
• http://toreopsahl.com/2011/10/17/securely-using-r-and-rstudio-on-amazons-ec2/
Virtual Manager with Hadoop
Please note that these are 64-bit versions, and that the virtualization software requires a
laptop that supports virtualization. If you are unsure, one way to check is to look in
your BIOS and see whether virtualization is enabled. Most chips support virtualization; however, a
handful of manufacturer-installed BIOSes do not enable it.
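On Linux, CPU support can also be checked from the operating system by looking for the vmx (Intel VT-x) or svm (AMD-V) flags in /proc/cpuinfo. The sketch below does that in Python. It is Linux-only, the helper names are hypothetical, and a flag being present only shows that the CPU has the capability; the BIOS setting may still need to be checked as described above.

```python
def virtualization_flags(cpuinfo_text):
    """Return the hardware virtualization flags found in cpuinfo text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            # Each 'flags' line looks like: "flags : fpu vme ... vmx ..."
            _, _, flags = line.partition(":")
            found.update(f for f in flags.split() if f in ("vmx", "svm"))
    return found


def read_cpuinfo(path="/proc/cpuinfo"):
    """Return the contents of /proc/cpuinfo, or '' if it cannot be read."""
    try:
        with open(path) as handle:
            return handle.read()
    except OSError:
        # Not Linux, or the file is not readable.
        return ""


flags = virtualization_flags(read_cpuinfo())
print("Virtualization flags found:", sorted(flags) if flags else "none")
```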
Cloudera Hadoop Package
• https://ccp.cloudera.com/display/SUPPORT/Cloudera's+Hadoop+Demo+VM
• There are 3 download options, one for each supported virtualization
package; one of those packages also needs to be installed (next slide).
• SSH Software (Windows)
– http://www.chiark.greenend.org.uk/~sgtatham/putty/download.html
Virtual Manager with Hadoop
Jeffrey will be walking through this process.
• VMware Player: Jeffrey uses this one in his session.
– http://downloads.vmware.com/d/info/desktop_end_user_computing/vmware_player/4_0
• KVM:
– http://www.linux-kvm.org/page/Main_Page
• VirtualBox: Jim uses this one.
– https://www.virtualbox.org/
Session 6: R and Hadoop: rmr
Jeffrey will be walking through this process.
• https://github.com/RevolutionAnalytics/RHadoop/wiki/rmr
We realize the VM and R and Hadoop parts are very detailed,
and that there may be questions on other workshop parts.
Following the last session we will try to have a post-workshop
help session.