Map-Reduce-Merge: Simplified Relational Data
Processing on Large Clusters
Hung-chih Yang(Yahoo!), Ali Dasdan(Yahoo!),
Ruey-Lung Hsiao(UCLA), D. Stott Parker(UCLA)
SIGMOD 2007 (Industrial)
Presented by Kisung Kim
2010. 7. 14
Contents
 Introduction
 Map-Reduce
 Map-Reduce-Merge
 Applications to Relational Data Processing
 Case Study
 Conclusion
Introduction
 New challenges of data processing
– A vast amount of data collected from the entire WWW
 Solutions from search engine companies
– Customized parallel data-processing systems
– Use large clusters of shared-nothing commodity nodes
– Ex) Google's GFS, BigTable, and MapReduce; Ask.com's Neptune; Microsoft's Dryad; Yahoo!'s Hadoop
Introduction
 Properties of data-intensive systems
– Simple
 Adopt only a selected subset of database principles
– Sufficiently generic and effective
– Parallel data-processing systems deployed on large clusters of shared-nothing commodity nodes
– Refactor data processing into two primitives:
 Map function
 Reduce function
 Map-Reduce allows users not to worry about the nuisance details of:
– Coordinating parallel sub-tasks
– Maintaining distributed file storage
 This abstraction can greatly increase user productivity
Introduction
 The Map-Reduce framework is best at handling homogeneous datasets
– Ex) Joining multiple heterogeneous datasets does not quite fit into the Map-Reduce framework
 Extending Map-Reduce to process heterogeneous datasets simultaneously
– Processing data relationships is ubiquitous
– A join-enabled Map-Reduce system can provide a highly parallel yet cost-effective alternative
– Includes relational algebra in the subset of database principles
 Relational operators can be modeled using various combinations of the three primitives: Map, Reduce, and Merge
Map-Reduce
 The input dataset is stored in GFS
 Mapper
– Reads splits of the input dataset
– Applies the map function to the input records
– Produces intermediate key/value sets
– Partitions the intermediate sets into as many sets as there are reducers
 Reducer
– Reads its part of the intermediate sets from the mappers
– Applies the reduce function to the values of a same key
– Outputs the final results
 Signatures of the Map and Reduce functions
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]
[Diagram: splits → mappers → intermediate sets → reducers → final results]
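The two signatures above can be illustrated with a minimal, single-process sketch (the function and variable names are illustrative, not from the paper; the shuffle step that the framework performs is simulated with a dictionary):

```python
from collections import defaultdict

def map_fn(key, value):
    # map: (k1, v1) -> [(k2, v2)] -- here: word count over a line of text
    return [(word, 1) for word in value.split()]

def reduce_fn(key, values):
    # reduce: (k2, [v2]) -> [v3] -- sum the counts for one word
    return [sum(values)]

def run_map_reduce(records, map_fn, reduce_fn):
    # Shuffle: group intermediate pairs by key (done by the framework in practice).
    groups = defaultdict(list)
    for k1, v1 in records:
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    return {k2: reduce_fn(k2, vs) for k2, vs in sorted(groups.items())}

result = run_map_reduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn)
# result == {"a": [2], "b": [2], "c": [1]}
```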
Join using Map-Reduce
 Use a homogenization procedure
– Apply one map/reduce task on each dataset
– Insert a data-source tag into every value
– Extract a key attribute common to all heterogeneous datasets
– Transformed datasets now have two common attributes
 Key and data-source
 Problems
– Takes lots of extra disk space and incurs excessive map-reduce communications
– Limited only to queries that can be rendered as equi-joins
Join using Map-Reduce: Homogenization
[Diagram: map tags each record with its data-source; reduce collects records with the same key]

Dataset 1 (after map)      Dataset 2 (after map)
Key | Value                Key | Value
10  | 1, "Value1"          10  | 2, "Value4"
85  | 1, "Value2"          54  | 2, "Value5"
320 | 1, "Value3"          320 | 2, "Value6"
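The homogenization steps above can be sketched as follows (a minimal single-process sketch; the dataset contents and function names are illustrative, and the shuffle is simulated with a dictionary):

```python
from collections import defaultdict

# Two heterogeneous datasets sharing a common key attribute.
employees = [(10, "Alice"), (85, "Bob")]        # (dept_id, name)
departments = [(10, "Sales"), (320, "R&D")]     # (dept_id, dept_name)

def homogenize(dataset, tag):
    # Map pass: insert a data-source tag into every value.
    return [(key, (tag, value)) for key, value in dataset]

def reduce_join(tagged_records):
    # Shuffle groups records with the same key; reduce pairs the two sources.
    groups = defaultdict(list)
    for key, tagged_value in tagged_records:
        groups[key].append(tagged_value)
    joined = []
    for key, tagged_values in groups.items():
        left = [v for t, v in tagged_values if t == 1]
        right = [v for t, v in tagged_values if t == 2]
        joined += [(key, l, r) for l in left for r in right]
    return sorted(joined)

rows = reduce_join(homogenize(employees, 1) + homogenize(departments, 2))
# rows == [(10, "Alice", "Sales")]
```

Note how the procedure only works for equi-joins: the shuffle can only bring together records whose keys are equal.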
Map-Reduce-Merge
 Signatures
Map-Reduce:
map: (k1, v1) → [(k2, v2)]
reduce: (k2, [v2]) → [v3]
Map-Reduce-Merge:
map: (k1, v1)α → [(k2, v2)]α
reduce: (k2, [v2])α → (k2, [v3])α
merge: ((k2, [v3])α, (k3, [v4])β) → [(k4, v5)]γ
– α, β, γ represent dataset lineages
– The reduce function produces a key/value list instead of just values
– The merge function reads data from both lineages
 These three primitives can be used to implement parallel versions of several join algorithms
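A minimal sketch of the new merge primitive, assuming an equi-join condition on the keys (the function name and condition are illustrative; the paper allows arbitrary user-defined merge logic):

```python
def merge_fn(left, right):
    # merge: ((k2, [v3])_alpha, (k3, [v4])_beta) -> [(k4, v5)]_gamma
    k2, values_a = left     # one reduced key/value-list pair from lineage alpha
    k3, values_b = right    # one reduced key/value-list pair from lineage beta
    if k2 == k3:            # equi-join condition chosen for this sketch
        return [(k2, (a, b)) for a in values_a for b in values_b]
    return []

out = merge_fn((10, ["x"]), (10, ["y", "z"]))
# out == [(10, ("x", "y")), (10, ("x", "z"))]
```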
Merge Modules
 Merge function
– Processes two pairs of key/value lists, one from each source
 Processor function
– Processes data from one source only
– Users can define two processor functions, one per source
 Partition selector
– Determines from which reducers this merger retrieves its input data, based on the merger number
 Configurable iterator
– A merger has two logical iterators
– Controls their relative movement against each other
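The relative movement of the two logical iterators can be sketched with a sort-merge pattern (an illustrative sketch, assuming both reducer outputs arrive sorted by key; names are not from the paper):

```python
def sort_merge(left, right):
    # Two logical iterators over sorted (key, value) reducer outputs;
    # their relative movement is controlled by comparing the current keys.
    i = j = 0
    out = []
    while i < len(left) and j < len(right):
        left_key, left_val = left[i]
        right_key, right_val = right[j]
        if left_key < right_key:
            i += 1                      # advance the left iterator only
        elif left_key > right_key:
            j += 1                      # advance the right iterator only
        else:
            out.append((left_key, left_val, right_val))
            i += 1                      # matched: advance both iterators
            j += 1
    return out

pairs = sort_merge([(1, "a"), (3, "b")], [(2, "x"), (3, "y")])
# pairs == [(3, "b", "y")]
```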
Merge Modules
[Diagram: a merger consists of a partition selector, two processors, a configurable iterator, and the merge function; it reads from the reducer outputs of the 1st and 2nd datasets]
Applications to Relational Data Processing
 Map-Reduce-Merge can be used to implement primitive and some derived relational operators
– Projection
– Aggregation
– Generalized selection
– Joins
– Set union
– Set intersection
– Set difference
– Cartesian product
– Rename
 Map-Reduce-Merge is relationally complete, while being load-balanced, scalable, and parallel
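As one example of the operators listed above, set difference can be expressed with the same two-iterator movement inside a merger (an illustrative sketch assuming sorted, duplicate-free key lists; not the paper's code):

```python
def set_difference(left, right):
    # R - S on sorted, duplicate-free key lists, in the style of a merger:
    # emit keys from the left lineage that are absent from the right one.
    i = j = 0
    out = []
    while i < len(left):
        if j >= len(right) or left[i] < right[j]:
            out.append(left[i])   # no right key can match: keep left key
            i += 1
        elif left[i] > right[j]:
            j += 1                # right key is too small: advance right
        else:
            i += 1                # keys match: drop the left key
            j += 1
    return out

diff = set_difference([1, 2, 4, 7], [2, 3, 7])
# diff == [1, 4]
```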
Example: Hash Join
 Mappers use a hash partitioner
 Reducers read from every mapper for one designated partition
 Mergers read from the two sets of reducer outputs that share the same hashing buckets
– One set is used as the build set and the other as the probe set
[Diagram: splits → mappers (hash partitioner) → reducers for each dataset → mergers]
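The build/probe step inside one merger can be sketched as follows (an illustrative sketch; in the real system each merger only sees the reducer outputs for its own hash bucket):

```python
def hash_join(build, probe):
    # A merger reads two reducer outputs sharing the same hash bucket:
    # one side is built into an in-memory table, the other probes it.
    table = {}
    for key, value in build:
        table.setdefault(key, []).append(value)
    return [(key, build_val, probe_val)
            for key, probe_val in probe
            for build_val in table.get(key, [])]

joined = hash_join([(10, "A"), (54, "B")], [(10, "x"), (320, "y")])
# joined == [(10, "A", "x")]
```

Choosing the smaller reducer output as the build set keeps the in-memory table small.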
Case Study: TPC-H Query 2
 Involves 5 tables, 1 nested query, 1 aggregate with a group-by clause, and 1 order-by clause
Case Study: TPC-H Query 2
 Map-Reduce-Merge workflow
– Before combining phases: 13 passes of Map-Reduce-Merge, with 10 mappers, 10 reducers, and 4 mergers
– After combining phases: 6 passes of Map-Reduce-Merge, with 5 mappers, 4 reduce-merge-mappers, 1 reduce-mapper, and 1 reducer
Conclusion
 Map-Reduce-Merge programming model
– Retains many of Map-Reduce's great features
– Adds relational algebra to the list of database principles it upholds
– Contains several configurable components that enable many data-processing patterns
 Next step
– Develop an SQL-like interface and an optimizer to simplify the process of developing a Map-Reduce-Merge workflow
– This work can readily reuse well-studied RDBMS techniques