MapReduce in Action
Data Mining Group @ Xiamen University
College of Information Science and Technology
Team 306, led by Chen Lin
Contents
1. Basic MapReduce Programs
2. Advanced MapReduce
3. Beyond the horizon
4. Discussion
Basic MapReduce Programs
[Diagram: Job, Configuration, Master (JobTracker)]
Basic MapReduce Programs
Job Configuration?
• Java class
• Implement interface
• Environment configuration
Interface: Mapper, Reducer, Partitioner, Combiner, InputFormat, OutputFormat
• How many Map/Reduce tasks?
• InputPath
• OutputPath
• Configure JVM: mapred.child.java.opts
• {mapred.local.dir}
Basic MapReduce Program
[Diagram: Text → InputFormat → InputSplit → <K1,V1> → Map → List<K1,V1> → K1,List<V1> → Reduce → OutputFormat → Text]
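The data flow on this slide can be imitated in a few lines of plain Python. This is only a minimal sketch, not Hadoop code: the names `map_fn`, `reduce_fn`, and `run_job` are hypothetical, word counting is an assumed example task, and the line index stands in for the key that a real InputFormat would supply.

```python
from collections import defaultdict


def map_fn(key, line):
    """Map: one input record -> a list of intermediate (key, value) pairs."""
    return [(word, 1) for word in line.split()]


def reduce_fn(word, counts):
    """Reduce: one key plus all of its values -> a list of output pairs."""
    return [(word, sum(counts))]


def run_job(lines):
    # InputFormat stage (simplified): each line becomes one record.
    # The key is the line index here, an assumption made for brevity.
    records = list(enumerate(lines))

    # Map stage: apply map_fn to every record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle and sort: group values by key, then visit keys in sorted order.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce stage: apply reduce_fn to every (key, list of values) group.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output
```

Calling `run_job(["a b a", "b c"])` walks two input lines through the map, shuffle and sort, and reduce stages in turn.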
Basic MapReduce
PARTITIONERS AND COMBINERS
Combiners
An optimization in MapReduce that allows for local
aggregation before the shuffle and sort phase.
Partitioner
Determines which reducer will be responsible for processing
a particular key; the execution framework uses this
information to copy the data to the right location during the
shuffle and sort phase.
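To make the partitioner's role concrete, here is a small sketch in plain Python. Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the sketch follows the same "hash modulo reducer count" idea but uses `zlib.crc32` as a stand-in hash, which is an assumption of this example. The function names `partition` and `route` are hypothetical.

```python
import zlib


def partition(key, num_reducers):
    """Pick a reducer index for a key: stable hash modulo reducer count."""
    # zlib.crc32 is used because Python's built-in hash() for str is
    # randomized per process; it stands in for Java's hashCode() here.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def route(pairs, num_reducers):
    """Group intermediate (key, value) pairs by the reducer that receives them."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets
```

Because the reducer index depends only on the key, every pair with the same key lands in the same bucket, which is what lets a single reducer see all values for that key.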
Basic MapReduce Program
InputFormat
• TextInputFormat
• KeyValueTextInputFormat
• SequenceFileInputFormat
• NLineInputFormat
• Creating a custom InputFormat
InputFormat
• TextInputFormat
  - Each line in the text files is a record. Key is the byte
    offset of the line, and value is the content of the line.
• KeyValueTextInputFormat
  - Each line in the text files is a record. The first separator
    character divides each line. Everything before the
    separator is the key, and everything after is the value.
    The separator is set by the key.value.separator.in.input.line property, and
    the default is the tab (\t) character.
• NLineInputFormat
  - Same as TextInputFormat, but each split is guaranteed
    to have exactly N lines. The mapred.line.input.format.linespermap
    property, which defaults to one, sets N.
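The record shapes described above can be sketched in plain Python. This is an illustration of the behaviour as the slide states it, not Hadoop's implementation; the function names are hypothetical, and in this sketch the last split simply holds whatever lines remain when the line count is not a multiple of N.

```python
def text_input_records(data):
    """TextInputFormat-style records: (byte offset of the line, line content)."""
    records = []
    offset = 0
    for raw in data.splitlines(keepends=True):
        records.append((offset, raw.rstrip(b"\r\n").decode("utf-8")))
        offset += len(raw)  # the next line starts right after this one
    return records


def key_value_records(data, separator="\t"):
    """KeyValueTextInputFormat-style records: split at the first separator."""
    records = []
    for _, line in text_input_records(data):
        # In this sketch a line without a separator becomes the key,
        # with an empty value.
        key, _, value = line.partition(separator)
        records.append((key, value))
    return records


def n_line_splits(data, n=1):
    """NLineInputFormat-style splits: consecutive groups of n records."""
    records = text_input_records(data)
    return [records[i:i + n] for i in range(0, len(records), n)]
```

For the input `b"k1\tv1\nk2\tv2 more\n"`, the first function keys each line by its byte offset, while the second splits each line at the first tab.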
Basic MapReduce Program
types for the key/value pairs
Summary for basic Program
What's a complete MapReduce job?
Code for the mapper, reducer, combiner, and partitioner,
along with job configuration parameters.
The execution framework handles everything else.
Advanced MapReduce
Chaining MapReduce jobs
LOCAL AGGREGATION
SECONDARY SORTING
Work on Hadoop Files
Chaining MapReduce jobs
So far you've been doing data processing tasks that a
single MapReduce job can accomplish.
But as you get more comfortable writing MapReduce
programs and take on more ambitious data
processing tasks, you'll find that many complex tasks
need to be broken down into simpler subtasks, each
accomplished by an individual MapReduce job.
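A chain of jobs can be pictured as feeding one job's output to the next job as input. The sketch below simulates this in plain Python under stated assumptions: `run_job` is a hypothetical in-memory stand-in for a MapReduce job, and the two-step task (count words, then group words by their count) is an example chosen for illustration, not one taken from the slides.

```python
from collections import defaultdict


def run_job(records, map_fn, reduce_fn):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in map_fn(key, value):
            groups[out_key].append(out_value)
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


# Job 1: count how often each word occurs.
def count_map(_, line):
    return [(word, 1) for word in line.split()]


def count_reduce(word, ones):
    return [(word, sum(ones))]


# Job 2: group words by their count, using job 1's output as input.
def invert_map(word, count):
    return [(count, word)]


def invert_reduce(count, words):
    return [(count, sorted(words))]


def word_count_then_group(lines):
    counts = run_job(enumerate(lines), count_map, count_reduce)
    return run_job(counts, invert_map, invert_reduce)
```

Neither subtask needs to know about the other; the only coupling is that the first job's output records are valid input records for the second.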
LOCAL AGGREGATION
In Hadoop, intermediate results are written to local
disk before being sent over the network.
Reductions in the amount of intermediate data
translate into increases in algorithmic efficiency.
With a combiner, it is possible to substantially
reduce both the number and size of key-value pairs
that need to be shuffled from the mappers to the
reducers.
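The effect on shuffled data can be seen in a small simulation. This sketch assumes a word count task and treats each input line as the input of one map task; all function names are hypothetical.

```python
from collections import Counter


def map_words(line):
    """Mapper: emit (word, 1) for every word in the line."""
    return [(word, 1) for word in line.split()]


def combine(pairs):
    """Combiner: local aggregation over one mapper's output."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())


def shuffled_pairs(lines, use_combiner):
    """Collect the pairs that would cross the network to the reducers."""
    shuffled = []
    for line in lines:  # each line plays the role of one map task's input
        pairs = map_words(line)
        shuffled.extend(combine(pairs) if use_combiner else pairs)
    return shuffled


def reduce_counts(pairs):
    """Reducer: sum all values per key."""
    totals = Counter()
    for key, value in pairs:
        totals[key] += value
    return dict(totals)
```

For the two lines `"a a a b"` and `"a b b"`, seven pairs are shuffled without the combiner and four with it, while the reduced counts are identical in both cases.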
Pseudo-code for computing the mean of values associated with the same string.
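The slide's original pseudo-code figure did not survive extraction. As a hedged stand-in, a baseline version without a combiner might look like the following plain-Python sketch, where `mean_mapper`, `mean_reducer`, and `compute_means` are hypothetical names and the grouping step imitates shuffle and sort.

```python
from collections import defaultdict


def mean_mapper(record):
    """Mapper: pass each (string, number) record through unchanged."""
    key, value = record
    return [(key, value)]


def mean_reducer(key, values):
    """Reducer: the mean of all values that share the same string key."""
    return (key, sum(values) / len(values))


def compute_means(records):
    # Shuffle and sort, simulated: collect every value under its key.
    groups = defaultdict(list)
    for record in records:
        for key, value in mean_mapper(record):
            groups[key].append(value)
    return dict(mean_reducer(key, values) for key, values in sorted(groups.items()))
```

Every value travels from mapper to reducer individually here, which is exactly the traffic that local aggregation tries to cut down.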
LOCAL AGGREGATION: is it right?
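The figure for this slide is not preserved, so the following is an assumption about the kind of pitfall being asked about: reusing the mean computation as a combiner, so that the reducer ends up averaging local means. The function names are hypothetical, and each inner list stands for the values one mapper sees for a single key.

```python
def mean(values):
    return sum(values) / len(values)


def naive_combined_mean(per_mapper_values):
    # WRONG on purpose: each mapper-side combiner emits a local mean,
    # and the reducer then averages those local means.
    local_means = [mean(values) for values in per_mapper_values]
    return mean(local_means)


def true_mean(per_mapper_values):
    # Reference answer: the mean over all values, regardless of mapper.
    flat = [value for values in per_mapper_values for value in values]
    return mean(flat)
```

With one mapper holding `[1, 2, 3]` and another holding `[10]`, the naive version gives 6.0 while the true mean is 4.0, because a mean of means ignores how many values each mapper contributed.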
LOCAL AGGREGATION
1. Combiners must have the same input and output
key-value types.
2. Combiners are optimizations that cannot change
the correctness of the algorithm.
Hadoop makes no guarantees on how many times
combiners are called; it could be zero, one, or multiple
times.
LOCAL AGGREGATION: right usage!
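The slide's figure is missing, so this is a sketch of one well-known way to aggregate means safely, not necessarily the exact code the slide showed: the mapper emits (sum, count) pairs, the combiner takes and returns the same pair type, and only the reducer divides. The names `mapper`, `combiner`, `reducer`, and `run` are hypothetical, and `combiner_passes` simulates the framework calling the combiner zero, one, or several times.

```python
from collections import defaultdict


def mapper(key, value):
    """Emit a partial aggregate (sum, count) instead of the raw value."""
    return (key, (value, 1))


def combiner(key, partials):
    """Same input and output type: merge (sum, count) pairs into one pair."""
    return (key, (sum(s for s, _ in partials), sum(c for _, c in partials)))


def reducer(key, partials):
    """Only the reducer turns the merged (sum, count) into a mean."""
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return (key, total / count)


def run(per_mapper_records, combiner_passes):
    shuffled = defaultdict(list)
    for records in per_mapper_records:
        local = defaultdict(list)
        for key, value in records:
            out_key, out_value = mapper(key, value)
            local[out_key].append(out_value)
        # The framework may call the combiner any number of times.
        for _ in range(combiner_passes):
            local = {k: [combiner(k, vs)[1]] for k, vs in local.items()}
        for key, values in local.items():
            shuffled[key].extend(values)
    return {key: reducer(key, values)[1] for key, values in shuffled.items()}
```

Running this with zero, one, or two combiner passes over the same records yields the same means, which is the property the two rules above require.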
SECONDARY SORTING
Sometimes we also need to sort by value.
(k1; m1; v8)
(k1; m2; v1)
(k1; m3; v7)
...
(k2; m1; v2)
(k2; m2; v6)
(k2; m3; v9)
k1 → (m1; v8)
(k1; m1) → (v8)
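The last two lines of the slide move part of the value into the key, so that the framework's own sort orders the values. A plain-Python sketch of that idea follows, under stated assumptions: the function names are hypothetical, `zlib.crc32` stands in for a real hash function, and partitioning uses only the natural key `k` so that all composite keys for one `k` reach the same reducer.

```python
import zlib
from itertools import groupby


def partition(natural_key, num_reducers):
    """Partition on the natural key only, never on the composite key."""
    return zlib.crc32(natural_key.encode("utf-8")) % num_reducers


def secondary_sort(records, num_reducers=2):
    """records are (k, m, v) triples; returns {k: [(m, v), ...]} ordered by m."""
    # Value-to-key conversion: (k, m, v) becomes key (k, m) with value v.
    partitions = [[] for _ in range(num_reducers)]
    for k, m, v in records:
        partitions[partition(k, num_reducers)].append(((k, m), v))

    result = {}
    for part in partitions:
        # The framework's sort, simulated: order by the composite key (k, m).
        part.sort(key=lambda pair: pair[0])
        # Group by the natural key so each reducer call sees one k.
        for k, group in groupby(part, key=lambda pair: pair[0][0]):
            result[k] = [(m, v) for (_, m), v in group]
    return result
```

The reducer never has to buffer and sort values itself; they already arrive in order of `m` for each `k`.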
Beyond the horizon
It's a shame:
the remaining topics play an important role in
MapReduce, but they are beyond my horizon.
So I need all your help to master them together.
Joining data from different sources
Customers:
Joey Leung,555-555-5555
Edward,123-456-7890
Jose Madriz,281-330-8004
David Stork,408-555-0000
…
Orders:
A,12.95,02-Jun-2008
B,88.25,20-May-2008
C,32.00,30-Nov-2007
D,25.02,22-Jan-2009
Joined output:
Joey Leung,555-555-5555,B,88.25,20-May-2008
Edward,123-456-7890,C,32.00,30-Nov-2007
Jose Madriz,281-330-8004,A,12.95,02-Jun-2008
Jose Madriz,281-330-8004,D,25.02,22-Jan-2009
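One common way to produce such a joined output is a reduce-side join: mappers tag each record with its source and emit it under the join key, and the reducer pairs up records that share a key. The sketch below is plain Python with hypothetical function names. The rows as shown on the slide carry no join key, so the leading customer ID column in the sample data is an assumption added purely for illustration.

```python
from collections import defaultdict

# The leading ID column is an assumption; the slide's rows do not show it.
CUSTOMERS = [
    "1,Joey Leung,555-555-5555",
    "2,Edward,123-456-7890",
    "3,Jose Madriz,281-330-8004",
    "4,David Stork,408-555-0000",
]
ORDERS = [
    "3,A,12.95,02-Jun-2008",
    "1,B,88.25,20-May-2008",
    "2,C,32.00,30-Nov-2007",
    "3,D,25.02,22-Jan-2009",
]


def join_mapper(source, line):
    """Emit the join key with the rest of the record, tagged by its source."""
    join_key, rest = line.split(",", 1)
    return (join_key, (source, rest))


def join_reducer(join_key, tagged):
    """Pair every customer record with every order that shares the key."""
    customers = [rest for source, rest in tagged if source == "customers"]
    orders = [rest for source, rest in tagged if source == "orders"]
    return [f"{c},{o}" for c in customers for o in orders]


def reduce_side_join(customers, orders):
    groups = defaultdict(list)
    for source, lines in (("customers", customers), ("orders", orders)):
        for line in lines:
            key, value = join_mapper(source, line)
            groups[key].append(value)
    output = []
    for key in sorted(groups):
        output.extend(join_reducer(key, groups[key]))
    return output
```

This behaves like an inner join: a customer with no orders, such as the fourth one in the sample data, produces no output row.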
Thank you!