10/01/2008 - Wenbin $HOME

Download Report

Transcript 10/01/2008 - Wenbin $HOME

Mars: A MapReduce Framework
on Graphics Processors
Bingsheng He1, Wenbin Fang, Qiong Luo
Hong Kong Univ. of Sci. and Tech.
Naga K. Govindaraju
Microsoft Corp.
1, Currently in Microsoft Research Asia
Tuyong Wang
Sina Corp.
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
2/41
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
3/41
Graphics Processing Units
• Massively multi-threaded co-processors
– 240 streaming processors on NV GTX 280
– ~1 TFLOPS of peak performance
• High bandwidth memory
– 10+x more than peak bandwidth of the main
memory
– 142 GB/s, 1 GB GDDR3 memory on GTX280
4/41
Graphics Processing Units (Cont.)
• High latency GDDR memory
– 200 clock cycles of latency
– Latency hiding using large number of concurrent
threads (>8K)
– Low context-switch overhead
• Better architectural support for memory
– Inter-processor communication using a local
memory
– Coalesced access
• High speed bus with the main memory
– Current: PCI-E express (4GB/sec)
5/41
GPGPU
• Linear algebra [Larsen 01, Fatahalian 04,
Galoppo 05]
• FFT [Moreland 03, Horn 06]
• Matrix operations [Jiang 05]
• Folding@home, Seti@home
• Database applications
– Basic Operators [Naga 04]
– Sorting [Govindaraju 06]
– Join [He 08]
6/41
GPGPU Programming
• “Assembly languages”
– DirectX, OpenGL
• Graphics rendering pipelines
• “C/C++”
– NVIDIA CUDA, ATI CAL or Brook+
• Different programming models
• Low portability among different hardware vendors
– NV GPU code cannot run on AMD GPU
• “Functional language”?
7/41
Without worrying about hardware details—
• Make GPGPU programming much easier.
• Well harness high parallelism and high
computational capability of GPUs.
MapReduce
8/41
MapReduce Functions
• Process lots of data to produce other data
• Input & Output: a set of records in the form
of key/value pair
• Programmer specifies two functions
– map (in_key, in_value) -> emit
list(intermediate_key, intermediate_value)
– reduce (out_key, list(intermediate_value)) ->
emit list(out_key, out_value)
9/41
MapReduce Workflow
From http://labs.google.com/papers/mapreduce.html
10/41
MapReduce outside google
• Hadoop [Apache project]
• MapReduce on multicore CPUs -- Phoenix
[HPCA'07, Ranger et al.]
• MapReduce on Cell [07, Kruijf et al.]
• Merge [ASPLOS '08, Linderman et al.]
• MapReduce on GPUs [stmcs'08,
Catanzaro et al.)]
11/41
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
12/41
MapReduce on GPU
Web
Analysis
Data Mining
MapReduce (Mars)
Rendering APIs
(DirectX)
GPGPU languages
(CUDA, Brook+)
GPU Drivers
13/41
MapReduce on Multi-core CPU
(Phoenix [HPCA'07])
Input
Split
Map
Partition
Reduce
Merge
Output
14/41
Limitations on GPUs
• Rely on the CPU to allocate memory
– How to support variant length data?
– How to allocate output buffer on GPUs?
• Lack of lock support
– How to synchronize to avoid write conflict?
15/41
Data Structure for Mars
Support variant length record!
A Record = <Key, Value, Index entry>
Key1
Key2
Key3
Value1
Value2
Value3
Index entry1
Index entry2
Index entry3
…
…
…
An index entry = <key size, key offset, val size, val offset>
16/41
Lock-free scheme for result output
Basic idea:
Calculate the offset for each thread on the output buffer.
17/41
Lock-free scheme example
Pick up odd numbers from the
array [1, 3, 2, 3, 4, 6, 9, 8].
map function as a filter –
filter all odd numbers
18/41
Lock-free scheme example
T1
[ 1,
T1
3,
T2
T3
2,
T3
3,
T4
1
4
Step1:
Histogram
Step2:
Prefix sum
T2
T4
4,
T1
3
7
7,
T2
2
9
9,
T3
8 ]
T4
3
8
1
2
1
1
0
1
3
4
(5)
19/41
Lock-free scheme example
T1
[ 1,
T1
Histogram
Step3:
Allocate
3,
T2
T2
2,
T3
1
T1
T3
3,
T4
2
T2
T4
4,
T1
1
T2
7,
T2
T3
T4
(5)
1
T3
9 ]
T4
20/41
Lock-free scheme example
T1
[ 1,
T1
T2
3,
T3
2,
T4
3,
4,
7,
T2
T3
T4
T1
T2
1
3
7
9
3
9,
T3
8
]
T4
Step4:
Computation
Prefix sum
0
1
3
4
21/41
Lock-free scheme
1. Histogram on key size, value size, and record count.
2. Prefix sum on key size, value size, and record count.
3. Allocate output buffer on GPU memory.
4. Perform computing.
Avoid write conflict.
Allocate output buffer exactly once.
22/41
Mars Workflow
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Sort and Group
ReduceCount
Prefixsum
Allocate output buffer on GPU
Reduce
Output
23/41
Mars Workflow– Map Only
Input
MapCount
Prefixsum
Allocate intermediate buffer on GPU
Map
Output
Map only, without grouping and reduce
24/41
Mars Workflow – Without Reduce
Input
MapCount
Prefix Sum
Allocate intermediate buffer on GPU
Map
Sort and Group
Output
Map and grouping, without reduce
25/41
APIs of Mars
Runtime Provided:
User-defined:
•AddMapInput
•MapReduce
•EmitInterCount
•EmitIntermediate
•EmitCount (optional)
•Emit (optional)
MapCount
Map
Compare (optional)
ReduceCount (optional)
Reduce (optional)
26/41
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
27/41
Mars-GPU
• NVIDIA CUDA
• Each map instance or reduce instance is a
GPU thread.
Mars-CPU
• Operating system’s thread APIs
• Each map instance or reduce instance is a
CPU thread.
28/41
Optimization According to
CUDA features
• Coalesced Access
– Multiple accesses to consecutive memory
addresses are combined into one transfer.
• Build-in vector type (int4, char4 etc)
– Multiple small data items are fetched in one
memory request.
29/41
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
30/41
Experimental Setup
• Comparison
– CPU: Phoenix, Mars-CPU
– GPU: Mars-GPU
GPU (NV GTX8800)
Processors (HZ)
CPU (P4 Quad)
2.66G*4
Cache size
8MB
256KB
Bandwidth
(GB/sec)
10.4
86.4
OS
1.35G*128
Fedora Core 7.0
31/41
Applications
• String Match (SM): Find the position of a string
in a file.
[S: 32MB, M: 64MB, L: 128MB]
• Inverted Index (II): Build inverted index for links
in HTML files.
[S: 16MB, M: 32MB, L: 64MB]
• Similarity Score (SS): Compute the pair-wise
similarity score for a set of documents.
[S: 512x128, M: 1024x128, L: 2048x128]
32/41
Applications (Cont.)
• Matrix Multiplication (MM): Multiply two matrices.
[S: 512x512, M: 1024x10242, L: 2048x2048]
• Page View Rank (PVR): Count the number of
distinct page views from web logs.
[S: 32MB, M: 64MB, L: 96MB]
• Page View Count (PVC): Find the top-10 hot
pages in the web log.
[S: 32MB, M: 64MB, L: 96MB]
33/41
Effect of Coalessed Access
Coalessed access achieves a speedup of 1.2-2X
34/41
Effect of Built-In Data Types
Built-in data types achieve a speedup up to 2 times
35/41
Time Breakdown of Mars-GPU
GPU accelerates computation in MapReduce
36/41
Mars-GPU vs. Phoenix on
Quadcore CPU
The speedup is 1.5-16 times with various data sizes
37/41
Mars-GPU vs. Mars-CPU
The GPU accelerates MapReduce up to 7 times
38/41
Mars-CPU vs. Phoenix
Mars-CPU is 1-5 times as fast as Phoenix
39/41
Overview
•
•
•
•
•
Motivation
Design
Implementation
Evaluation
Conclusion
40/41
Conclusion
• MapReduce framework on GPUs
–Ease of GPU application
development
–Performance acceleration
• Want a Copy of Mars?
http://www.cse.ust.hk/gpuqp/Mars.html
41/41
Discussion
• A uniform co-processing framework
between the CPU and the GPU
• High performance computation routines
– Index serving
– Data mining (on-going)
• Power consumption benchmarking of the
GPU
– The GPU is a test bed for the future CPU.
• …
42/41