Transcript Hadoop

Team: 3
Md Liakat Ali
Abdulaziz Altowayan
Andreea Cotoranu
Stephanie Haughton
Gene Locklear
Leslie Meadows
 Hadoop
 MapReduce
 Download Links
 Install and Tutorials
• Hadoop is a software platform that lets one easily write and run applications that
process vast amounts of data. It includes:
o MapReduce – offline computing engine
o HDFS – Hadoop distributed file system
o HBase (pre-alpha) – online data access
• Here's what makes it especially useful:
o Scalable: It can reliably store and process petabytes.
o Economical: It distributes the data and processing across clusters
of commonly available computers (numbering in the thousands).
o Efficient: By distributing the data, it can process it in parallel on
the nodes where the data is located.
o Reliable: It automatically maintains multiple copies of data and
automatically redeploys computing tasks based on failures.
• Hadoop implements Google’s MapReduce, using HDFS
• MapReduce divides applications into many small blocks of work.
• HDFS creates multiple replicas of data blocks for reliability,
placing them on compute nodes around the cluster.
• MapReduce can then process the data where it is located.
• A9.com – Amazon: To build Amazon's product search indices; process millions of
sessions daily for analytics, using both the Java and streaming APIs; clusters vary
from 1 to 100 nodes.
• Yahoo! : More than 100,000 CPUs in ~20,000 computers running Hadoop; biggest
cluster: 2000 nodes (2*4cpu boxes with 4TB disk each); used to support research for
Ad Systems and Web Search
• AOL: Used for a variety of things ranging from statistics generation to running
advanced algorithms for behavioral analysis and targeting; cluster size is 50
machines (Intel Xeon, dual-processor, dual-core, each with 16 GB RAM and an 800 GB
hard disk), for a total of about 37 TB of HDFS capacity.
• Facebook: To store copies of internal log and dimension data sources and use it as a
source for reporting/analytics and machine learning; 320 machine cluster with 2,560
cores and about 1.3 PB raw storage;
• FOX Interactive Media : 3 X 20 machine cluster (8 cores/machine, 2TB/machine
storage) ; 10 machine cluster (8 cores/machine, 1TB/machine storage); Used for log
analysis, data mining and machine learning
• University of Nebraska Lincoln: one medium-sized Hadoop cluster (200TB) to store
and serve physics data;
• Adknowledge - To build the recommender system for behavioral targeting, plus
other clickstream analytics; clusters vary from 50 to 200 nodes, mostly on EC2.
• Contextweb - To store ad-serving logs and use them as a source for ad optimization,
analytics, reporting and machine learning; 23 machine cluster with 184 cores and about
35 TB raw storage. Each (commodity) node has 8 cores, 8 GB RAM and 1.7 TB of
storage.
• Cornell University Web Lab: Generating web graphs on 100 nodes (dual 2.4 GHz
Xeon processors, 2 GB RAM, 72 GB hard drive).
• NetSeer - Up to 1000 instances on Amazon EC2; data storage in Amazon S3;
used for crawling, processing, serving and log analysis.
• The New York Times: Large-scale image conversions; EC2 to run Hadoop on a
large virtual cluster.
• Powerset / Microsoft - Natural language search; up to 400 instances on Amazon
EC2; data storage in Amazon S3.
• MapReduce is a programming model and an
associated implementation for processing and
generating large data sets with a parallel, distributed
algorithm on a cluster.
• A MapReduce program is composed of:
 a Map() procedure (method) that performs filtering and sorting, and
 a Reduce() method that performs a summary operation (such as counting
the number of students in each queue, yielding name frequencies).
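For concreteness, here is a minimal word-count sketch of these two pieces using Hadoop's Java MapReduce API; the class names are our own and the input is assumed to be plain text lines:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map(): emits (word, 1) for every word in an input line.
public class TokenizerMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);
    }
  }
}

// Reduce(): sums the counts for each word (the "summary operation").
class IntSumReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) {
      sum += v.get();
    }
    context.write(key, new IntWritable(sum));
  }
}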
• Pioneered by Google
– Processes 20 PB of data per day
• Sort/merge based distributed computing
• Initially intended for Google's internal
search/indexing application, but now used extensively by
many other organizations
• Popularized by open-source Hadoop project
– Used by Yahoo!, Facebook, Amazon, …
• At Google:
– Index building for Google Search
– Article clustering for Google News
– Statistical machine translation
• At Yahoo!:
– Index building for Yahoo! Search
– Spam detection for Yahoo! Mail
• At Facebook:
– Data mining
– Ad optimization
– Spam detection
• In research:
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Bioinformatics (Maryland)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)
1. Scalability to large data volumes:
– Scan 100 TB on 1 node @ 50 MB/s = 24 days
– Scan on 1000-node cluster = 35 minutes (worked arithmetic after this list)
2. Cost-efficiency:
– Commodity nodes (cheap, but unreliable)
– Commodity network
– Automatic fault-tolerance (fewer admins)
– Easy to use (fewer programmers)
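For reference, the scan figures in point 1 can be checked directly (assuming decimal units, a sustained 50 MB/s per node, and near-perfect parallelism across the 1000 nodes):

\[
\frac{100\ \mathrm{TB}}{50\ \mathrm{MB/s}} = \frac{10^{14}\ \mathrm{B}}{5\times 10^{7}\ \mathrm{B/s}} = 2\times 10^{6}\ \mathrm{s} \approx 23\ \mathrm{days},
\qquad
\frac{2\times 10^{6}\ \mathrm{s}}{1000} = 2000\ \mathrm{s} \approx 33\ \mathrm{min},
\]

which is in the same ballpark as the ~24 days and ~35 minutes quoted above (the exact figures depend on the TB/MB conventions used).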
• Distributed file system (DFS)
The Hadoop Distributed File System (HDFS) is a
distributed file system designed to run on
commodity hardware. It has many similarities with
existing distributed file systems. However, the
differences from other distributed file systems are
significant.
• Highly fault-tolerant and designed to be deployed on
low-cost hardware.
• Provides high-throughput access to application data
and is suitable for applications that have large data sets.
• Part of the Apache Hadoop Core project. The project
URL is http://hadoop.apache.org/core/.
• Files split into 128 MB blocks
• Blocks replicated across several datanodes (usually 3)
• Namenode stores metadata (file names, locations, etc.)
• Optimized for large files, sequential reads
• Files are append-only
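As a small illustration of these points, here is a Java sketch using the HDFS FileSystem API to copy a file into HDFS and list where its block replicas landed; the paths are placeholders, and the cluster address is assumed to come from the usual core-site.xml:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsBlocksDemo {
  public static void main(String[] args) throws Exception {
    // fs.defaultFS (e.g. the namenode address) normally comes from core-site.xml.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Copy a local file into HDFS; it is split into blocks and replicated.
    Path src = new Path("/tmp/example.log");        // local path (placeholder)
    Path dst = new Path("/user/demo/example.log");  // HDFS path (placeholder)
    fs.copyFromLocalFile(src, dst);

    // Ask the namenode for the metadata it keeps about the file.
    FileStatus status = fs.getFileStatus(dst);
    System.out.println("Block size:  " + status.getBlockSize());
    System.out.println("Replication: " + status.getReplication());

    // Each block reports the datanodes holding a replica.
    for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
      System.out.println("Block at offset " + loc.getOffset()
          + " on hosts " + String.join(", ", loc.getHosts()));
    }
    fs.close();
  }
}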
• MapReduce framework
– Executes user jobs specified as “map”
and “reduce” functions
– Manages work distribution & fault-tolerance
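A minimal driver that hands such a job to the framework might look like the following sketch, wiring in the Mapper and Reducer classes from the earlier word-count example; the input and output paths are taken from the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCountDriver.class);

    // Wire in the user-supplied map and reduce functions.
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // Input and output locations in HDFS (command-line placeholders).
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // The framework handles splitting, scheduling and fault-tolerance.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}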
HBase is an open source, non-relational, distributed
database modeled after Google's BigTable and written
in Java.
o It is developed as part of the Apache Software
Foundation's Apache Hadoop project and runs on top
of HDFS.
o It provides a fault-tolerant way of storing large
quantities of sparse data (small amounts of information
caught within a large collection of empty or unimportant
data).
o HBase now serves several data-driven websites,
including Facebook's Messaging Platform.
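As a brief sketch of the HBase Java client API, storing and reading back a single sparse cell; the table name, column family and row key here are made up, and the table is assumed to already exist:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseSparseDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();  // reads hbase-site.xml
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("messages"))) {  // assumed table

      // Write one sparse cell: row "user42", column family "d", qualifier "subject".
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("subject"), Bytes.toBytes("hello"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user42")));
      String subject = Bytes.toString(
          result.getValue(Bytes.toBytes("d"), Bytes.toBytes("subject")));
      System.out.println("subject = " + subject);
    }
  }
}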
 Download links
 Sandbox
 How to Download and Install
 Tutorials
http://hortonworks.com/
 Running Hadoop on Ubuntu Linux (Single-Node Cluster)
http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-single-node-cluster/
 Writing an Hadoop MapReduce Program in Python
http://www.michael-noll.com/tutorials/writing-an-hadoop-mapreduce-program-in-python/
 Sandbox
http://hortonworks.com/products/hortonworks-sandbox/#install
 Some other links:
http://hadoop-skills.com/setup-hadoop/
https://hadoop.apache.org/
• The easiest way to get started with Enterprise
Hadoop
• Sandbox is a personal, portable Hadoop
environment that comes with a dozen interactive
Hadoop tutorials to guide you through the
basics of Hadoop.
• It includes the Hortonworks Data Platform in an
easy-to-use form. We can add our own datasets
and connect it to our existing tools and
applications.
• We can test new functionality with the Sandbox
before putting it into production, simply, easily
and safely.
o Install any one of the following on your host machine:
1. VirtualBox
2. VMware Fusion
3. Hyper-V
o Oracle VirtualBox
Version 4.2 or later
(https://www.virtualbox.org/wiki/Downloads)
o Hortonworks Sandbox virtual appliance for VirtualBox
Download the correct virtual appliance file for your
environment
http://hortonworks.com/products/hortonworks-sandbox/#install
o CPU - A 64-bit machine with a multi-core CPU
that supports virtualization.
o BIOS - Has been enabled for virtualization
support
o RAM - At least 4 GB of RAM
o Browsers - Chrome 25+, IE 9+ (Sandbox will not
run on IE 10), Safari 6+
o Complete installation guide can be found at:
http://hortonworks.com/wp-content/uploads/2015/07/Import_on_Vbox_7_20_2015.pdf