C++11 in Parallel

Joe Hummel, PhD
Visiting Researcher: U. of California, Irvine
Adjunct Professor: U. of Illinois, Chicago &
Loyola U., Chicago
Materials: http://www.joehummel.net/downloads.html

A little history…

Why Hadoop?

How it works

Demos

Summary
Hadoop on Azure

Map-Reduce is from functional programming
// function returns 1 if i is prime, 0 if not:
let isPrime(i) = ...

// sums 2 numbers:
let sum(x, y) = return x + y

// count the number of primes in 1..N:
let countPrimes(N) =
    let L = [ 1 .. N ]           // [ 1, 2, 3, 4, 5, 6, ... ]
    let T = map isPrime L        // [ 0, 1, 1, 0, 1, 0, ... ]
    let count = reduce sum T     // 42
    return count
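The pseudocode above maps closely onto JavaScript's built-in array methods. What follows is a minimal sketch, not the presenter's code: the slide leaves the body of isPrime out, so the trial-division version here is only an assumed stand-in.

```javascript
// Returns 1 if i is prime, 0 if not. Trial division is an assumed
// implementation, since the slide elides the body.
function isPrime(i) {
  if (i < 2) return 0;
  for (let d = 2; d * d <= i; d++) {
    if (i % d === 0) return 0;
  }
  return 1;
}

// Sums 2 numbers.
function sum(x, y) {
  return x + y;
}

// Counts the primes in 1..N using map followed by reduce.
function countPrimes(N) {
  const L = Array.from({ length: N }, (_, k) => k + 1); // [1, 2, ..., N]
  const T = L.map(isPrime);                             // [0, 1, 1, 0, ...]
  return T.reduce(sum, 0);
}

console.log(countPrimes(10)); // 2, 3, 5 and 7 are prime, so this prints 4
```

The map step has no dependencies between elements, which is what lets a framework spread it across many machines.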

Created by Google to drive internet search
◦ BIG data ― scalable to TBs and beyond
◦ Parallelism: to get the performance
◦ Data partitioning: to drive the parallelism
◦ Fault tolerance: at this scale, machines are going to crash, a lot…
(Diagram labels: BIG Data, page hits)

Search engines: Google, Yahoo, Bing

Facebook

Twitter

Financials

Health industry

Insurance

Credit card companies

Just about any company collecting user data…

Freely-available framework for big data
◦ http://hadoop.apache.org/

Based on concept of Map-Reduce:
◦ map a function across the data
◦ reduce the intermediate results
(Diagram: BIG data is split across many Map tasks, and a Reduce step combines their outputs into the result R)
(Diagram: six groups, each with three Mappers and one Reducer)
Data
Map (parallel tasks):  [ <key1,value>, <key4,value>, <key2,value>, … ]
Sort (parallel tasks): [ <key1,value>, <key1,value>, … ]
Merge:                 [ <key1, [value,value,…]>, <key2, [value,value,…]>, … ]
Reduce:                [ <key1, value>, <key2, value>, … ]
Result: R
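The Sort and Merge steps between Map and Reduce can be imitated on a single machine. This is only a sketch of the idea: sortAndMerge and the sample pairs are hypothetical names and data, and Hadoop's real shuffle runs across many nodes and is far more involved.

```javascript
// Hypothetical helper: sorts <key, value> pairs by key, then merges
// the values of equal keys into one <key, [values]> entry.
function sortAndMerge(pairs) {
  const sorted = [...pairs].sort((a, b) =>
    a.key < b.key ? -1 : a.key > b.key ? 1 : 0
  );
  const merged = [];
  for (const { key, value } of sorted) {
    const last = merged[merged.length - 1];
    if (last && last.key === key) {
      last.values.push(value); // same key as the previous pair: extend its list
    } else {
      merged.push({ key, values: [value] }); // new key: start a new list
    }
  }
  return merged;
}

// Made-up output of the Map step.
const mapped = [
  { key: "key1", value: 1 },
  { key: "key4", value: 2 },
  { key: "key2", value: 3 },
  { key: "key1", value: 4 },
];
const grouped = sortAndMerge(mapped);
console.log(JSON.stringify(grouped));
```

Each merged entry is what a reduce function would receive: one key together with every value emitted for it.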

Netflix data-mining…
(Diagram: Netflix Movie Reviews (.txt) → Netflix Data Mining App → Average rating…)
movieid,userid,rating,date
1,2390087,3,2005-09-06
217,5567801,5,2006-01-03
42,1121098,3,2006-03-25
1,8972234,5,2003-12-02
.
.
.
Data
Map (parallel tasks):  [ <1,3>, <217,5>, <42,3>, <1,5>, <134,2>, <42,1>, … ]
Sort (parallel tasks): [ <1,3>, <1,5>, <42,3>, <42,1>, <134,2>, <217,5>, … ]
Merge:                 [ <1, [3,5]>, <42, [3,1]>, <134, [2, …]>, <217, [5, …]>, … ]
Reduce:                [ <1, 4>, <42, 2>, <134, ?>, … ]
Result: R

To compute average rating for every movie:
// JavaScript version:
var map = function (key, value, context)
{
    var values = value.split(",");
    // field 0 contains movieid, field 2 the rating:
    context.write(values[0], values[2]);
};

var reduce = function (key, values, context)
{
    var sum = 0;
    var count = 0;
    while (values.hasNext())
    {
        count++;
        // ratings arrive as strings; parse them as base-10 integers:
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum / count);
};
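To try the two functions without a cluster, a small local driver is enough. The sketch below is an assumption-laden stand-in, not part of Hadoop: runLocally, the two context objects and the hasNext/next iterator are hypothetical and only imitate the shapes the slide's code calls, and the line index is passed as the map key for lack of a real one.

```javascript
// map and reduce as on the slide.
var map = function (key, value, context) {
  var values = value.split(",");
  context.write(values[0], values[2]);
};
var reduce = function (key, values, context) {
  var sum = 0;
  var count = 0;
  while (values.hasNext()) {
    count++;
    sum += parseInt(values.next(), 10);
  }
  context.write(key, sum / count);
};

// Hypothetical local driver: groups mapped values by key in memory,
// then calls reduce once per key.
function runLocally(lines) {
  const groups = new Map(); // key -> array of mapped values
  const mapContext = {
    write(key, value) {
      if (!groups.has(key)) groups.set(key, []);
      groups.get(key).push(value);
    },
  };
  lines.forEach((line, index) => map(index, line, mapContext));

  const results = {};
  const reduceContext = {
    write(key, value) { results[key] = value; },
  };
  for (const [key, values] of groups) {
    let i = 0;
    const iterator = {
      hasNext: () => i < values.length,
      next: () => values[i++],
    };
    reduce(key, iterator, reduceContext);
  }
  return results;
}

// The four sample rows from the Netflix slide.
const averages = runLocally([
  "1,2390087,3,2005-09-06",
  "217,5567801,5,2006-01-03",
  "42,1121098,3,2006-03-25",
  "1,8972234,5,2003-12-02",
]);
console.log(averages); // movie 1 averages (3 + 5) / 2 = 4
```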

Upload data to HDFS
◦ Hadoop file system

Write map / reduce functions
◦ default is to use Java
◦ most languages supported: C, C++, C#, JavaScript, Python, …

Compile and upload code
◦ For Java, you upload .jar file
◦ For others, .exe or script

Submit MapReduce job

Wait for job to complete

Queries against big datasets

Embarrassingly-parallel problems
◦ Solution must fit into map-reduce framework

Non-real-time demands

Hadoop is not for:
◦ Small datasets (< 1GB?)
◦ Sub-second / real-time needs (though clearly Google makes it work)

We’ll be working with Chicago crime data…
◦ https://data.cityofchicago.org/Public-Safety/Crimes-2001-to-present/ijzp-q8t2
◦ http://www.cityofchicago.org/city/en/narr/foia/CityData.html
◦ 1 GB, 5M rows

Compute top-10 crimes…
IUCR    Count
0486    366903
0820    308074
.
.
.
0890    166916
IUCR = Illinois Uniform Crime Reporting codes
https://data.cityofchicago.org/Public-Safety/Chicago-Police-Department-Illinois-Uniform-Crime-R/c7ck-438e

Hadoop on Azure…

Supports traditional Hadoop usage
◦ Upload data
◦ Write MapReduce program
◦ Submit job

Additional features:
◦ Allows access to persistent data from Azure Storage Vault
◦ Provides interactive JavaScript console
◦ Built-in higher-level query languages (PIG, HIVE)
// JavaScript version:
var map = function (key, value, context)
{
    var values = value.split(",");
    // field 4 contains the IUCR code; emit a count of 1 for it:
    context.write(values[4], 1);
};

var reduce = function (key, values, context)
{
    var sum = 0;
    while (values.hasNext())
    {
        sum += parseInt(values.next(), 10);
    }
    context.write(key, sum);
};
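For intuition, the whole job amounts to counting how often each value of one field occurs. A single-machine sketch of that idea follows; countByField and the sample rows are hypothetical, and a Map held in memory would not scale to the full dataset the way the Hadoop job does.

```javascript
// Hypothetical local equivalent of the job: count how often each
// value of one comma-separated field occurs.
function countByField(lines, fieldIndex) {
  const counts = new Map();
  for (const line of lines) {
    const key = line.split(",")[fieldIndex];
    counts.set(key, (counts.get(key) || 0) + 1);
  }
  return counts;
}

// Made-up rows; only field 4 (the IUCR code) matters here.
const rows = [
  "a,b,c,d,0486,x",
  "a,b,c,d,0820,x",
  "a,b,c,d,0486,x",
];
const counts = countByField(rows, 4);
console.log(counts.get("0486")); // 2
```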
Output:
0486    366903
0820    308074
.
.
.
// interactive PIG with explicit Map-Reduce functions:
pig.from("asv://datafiles/CC-from-2001.txt").
    mapReduce("scripts/IUCR-Count.js", "IUCR, Count:long").
    orderBy("Count DESC").
    take(10).
    to("output-from-2001")

// visualize the results:
file = fs.read("output-from-2001/part-r-00000")
data = parse(file.data, "IUCR, Count:long")
graph.bar(data)
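The orderBy("Count DESC") and take(10) steps have a simple plain-JavaScript analogue. This sketch does not use the PIG console API: topN is a hypothetical helper, and the rows reuse the three counts shown on the top-10 slide.

```javascript
// Hypothetical helper: sort rows by Count, largest first, and keep
// the first n. The input array is copied, not modified.
function topN(rows, n) {
  return [...rows].sort((a, b) => b.Count - a.Count).slice(0, n);
}

const rows = [
  { IUCR: "0890", Count: 166916 },
  { IUCR: "0486", Count: 366903 },
  { IUCR: "0820", Count: 308074 },
];
const top2 = topN(rows, 2);
console.log(top2.map((r) => r.IUCR)); // [ '0486', '0820' ]
```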

Microsoft is offering free access to Hadoop
◦ Request invitation @ http://www.hadooponazure.com/

Hadoop connector for Excel
◦ Process data using Hadoop, analyze/visualize using Excel

Hadoop is all about big data processing
◦ Scalable, parallel, fault-tolerant

Easy to understand programming model
◦ Map-Reduce
◦ But then solution must fit into this framework…

Rich ecosystem developing around Hadoop
◦ Technologies: PIG, HIVE, HBase, …
◦ Companies: Cloudera, Hortonworks, MapR, …

Presenter: Joe Hummel
◦ Materials: http://www.joehummel.net/downloads.html

For more info:
◦ http://www.hadooponazure.com/
◦ http://msdn.microsoft.com/en-us/magazine/jj190805.aspx
◦ Overview, including how to access via .NET API:
  http://www.simple-talk.com/cloud/data-science/analyze-big-data-with-apache-hadoop-on-windows-azure-preview-service-update-3/