Unit 09 : NoSQL II
COMP 5323
Web Database Technologies and
Applications 2014
Doctrine of Fair Use
• This PowerPoint is prepared for educational purposes and is strictly for use in classroom lecturing.
• We have adopted the “Fair Use” doctrine in this PowerPoint, which allows limited copying of copyrighted works for educational and research purposes.
Outline
1. Aggregate Data Models
2. Distribution Models
3. Introduction to MapReduce
4. The design and how it works
5. Fault Tolerance
6. Applications
1 Aggregate Data Models
RDBMS Data Model (diagram: typical data spread across relational tables)
Aggregate Data Model (diagram: the same data grouped into aggregates)
2 Distribution Models
Sharding
Sharding puts different data on separate nodes,
each of which does its own reads and writes
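A minimal sketch of hash-based shard routing in Python (the node names and hashing scheme are illustrative assumptions, not tied to any particular NoSQL product):

import hashlib

# Illustrative shard routing: each key is mapped to one node,
# and that node alone serves reads and writes for the key.
NODES = ["node-a", "node-b", "node-c"]  # hypothetical node names

def shard_for(key: str) -> str:
    # Stable hash so the same key always routes to the same node
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return NODES[int(digest, 16) % len(NODES)]

print(shard_for("order:1001"))  # e.g. node-b
print(shard_for("order:1002"))  # possibly a different node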
Master-Slave
Data is replicated from master to slaves. The
master services all writes; reads may come
from either master or slaves
Peer-to-Peer
Peer-to-peer replication has all nodes applying
reads and writes to all the data
Version Stamp
The most common approach used by peer-to-peer NoSQL systems is a special form of
version stamp which we call a vector stamp.
A vector stamp for three nodes (blue, green,
black) would look something like [blue: 43,
green: 54, black: 12].
Each time a node has an internal update, it
updates its own counter, so an update in the
green node would change the vector to [blue:
43, green: 55, black: 12].
Version Stamp
By using this scheme you can tell if one version
stamp is newer than another because the newer
stamp will have all its counters greater than or
equal to those in the older stamp.
[blue: 1, green: 2, black: 5] is newer than
[blue: 1, green: 1, black: 5], since every counter is
greater than or equal and the green counter is
strictly greater.
If each stamp has some counter greater than the
other’s, e.g. [blue: 1, green: 2, black: 5] and
[blue: 2, green: 1, black: 5], then you have a
write-write conflict.
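The comparison rule above can be written down directly; a minimal Python sketch, representing a vector stamp as a dict of per-node counters:

def compare_stamps(a, b):
    """Return 'newer', 'older', 'equal', or 'conflict' for vector stamps a vs b."""
    nodes = set(a) | set(b)
    a_ge = all(a.get(n, 0) >= b.get(n, 0) for n in nodes)
    b_ge = all(b.get(n, 0) >= a.get(n, 0) for n in nodes)
    if a_ge and b_ge:
        return "equal"
    if a_ge:
        return "newer"      # a descends from b
    if b_ge:
        return "older"      # b descends from a
    return "conflict"       # write-write conflict

print(compare_stamps({"blue": 1, "green": 2, "black": 5},
                     {"blue": 1, "green": 1, "black": 5}))  # newer
print(compare_stamps({"blue": 1, "green": 2, "black": 5},
                     {"blue": 2, "green": 1, "black": 5}))  # conflict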
3 Introduction to MapReduce
What is MapReduce
• Originated at Google
– MapReduce: Simplified Data Processing on Large
Clusters, Jeffrey Dean and Sanjay Ghemawat, OSDI ’04:
6th Symposium on Operating Systems Design and
Implementation
• A simple programming model
• Functional model
• For large-scale data processing
– Exploits large set of commodity computers
– Executes process in distributed manner
– Offers high availability
Motivation
• Lots of demand for very large-scale data
processing
• Certain common themes across these
demands
– Lots of machines needed (scaling)
– Two basic operations on the input
• Map
• Reduce
Motivation
• Large-Scale Data Processing
– Want to use 1000s of CPUs
• But don't want the hassle of managing them
• MapReduce provides
– Automatic parallelization & distribution
– Fault tolerance
– I/O scheduling
– Monitoring & status updates
• Two basic operations on the input
• Map
• Reduce
Concept
Lisp
• The terms Map and Reduce come from
Lisp and functional programming
• Installation (see reference)
Lisp
• (defun square (n) (* n n))
• (square 2) returns 4
• However, we get an error for (square '(1 2)) because square expects a number, not a list
• We need to call the function map to apply square to each element of the list
• The reduce function then consolidates the results
http://www.cs.cmu.edu/Groups/AI/html/cltl/clm/node143.html
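For comparison, the same idea as a minimal Python sketch using the built-in map and functools.reduce:

from functools import reduce

def square(n):
    return n * n

squared = list(map(square, [1, 2, 3]))        # [1, 4, 9] -- map applies square to each element
total = reduce(lambda a, b: a + b, squared)   # 14 -- reduce consolidates the results
print(squared, total)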
MapReduce -- R
Distributed Grep
Diagram: very big data is split into pieces; grep runs on each split and produces matches; cat concatenates them into all matches.
Distributed Word Count
Diagram: very big data is split into pieces; count runs on each split and produces partial counts; merge combines them into the merged count.
Question: which step is the “map” and which is the “reduce”?
Map+Reduce
Diagram: very big data → MAP → partitioning function → REDUCE → result
• Map:
– Accepts an input key/value pair
– Emits intermediate key/value pairs
– map(key, val)
• Reduce:
– Accepts an intermediate key with its list of values (key/value* pair)
– Emits output key/value pairs
– reduce(key, vals)
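A sketch of the two signatures in Python (type hints and function names are illustrative, not a real framework API):

from typing import Iterable, Iterator, Tuple

# map: one input (key, value) pair -> zero or more intermediate (key, value) pairs
def map_fn(key: str, value: str) -> Iterator[Tuple[str, str]]:
    yield key, value            # placeholder body; a real job emits intermediate pairs

# reduce: one intermediate key plus all of its values -> zero or more output values
def reduce_fn(key: str, values: Iterable[str]) -> Iterator[str]:
    yield ",".join(values)      # placeholder body; a real job merges the values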
4 The design and how it works
Architecture overview
Diagram: the user submits a job to the job tracker on the master node; each slave node (1..N) runs a task tracker, which manages the workers on that node.
GFS: underlying storage system
• Goal
– global view
– make huge files available in the face of node failures
• Master Node (meta server)
– Centralized, index all chunks on data servers
• Chunk server (data server)
– File is split into contiguous chunks, typically 16-64 MB.
– Each chunk is replicated (usually 2x or 3x).
• Try to keep replicas in different racks.
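A toy sketch of the chunking and rack-aware replica placement described above (chunk size per the slide; server and rack names are made up):

CHUNK_SIZE = 64 * 1024 * 1024   # 64 MB upper bound, per the slide above

def split_into_chunks(file_size: int):
    """Return (offset, length) pairs covering the file in contiguous chunks."""
    return [(off, min(CHUNK_SIZE, file_size - off))
            for off in range(0, file_size, CHUNK_SIZE)]

# Hypothetical chunkservers grouped by rack; replicas go to different racks
RACKS = [["cs1", "cs2"], ["cs3", "cs4"]]

def place_replicas(chunk_index: int, copies: int = 3):
    """Pick `copies` chunkservers, alternating racks so one rack failure
    cannot lose every replica."""
    interleaved = [srv for pair in zip(*RACKS) for srv in pair]  # cs1, cs3, cs2, cs4
    start = chunk_index % len(interleaved)
    return [interleaved[(start + i) % len(interleaved)] for i in range(copies)]

print(split_into_chunks(150 * 1024 * 1024))  # three chunks: 64 MB, 64 MB, 22 MB
print(place_replicas(0))                     # e.g. ['cs1', 'cs3', 'cs2']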
GFS architecture
Diagram: the client asks the GFS master where chunks live, then reads and writes chunks (C0, C1, C2, C3, C5, ...) directly on chunkservers 1..N; each chunk is replicated on several chunkservers.
Functions in the Model
• Map
– Process a key/value pair to generate intermediate
key/value pairs
• Reduce
– Merge all intermediate values associated with the
same key
• Partition
– By default : hash(key) mod R
– Well balanced
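The default partitioner is a one-liner; a Python sketch (R = number of reduce tasks):

def partition(key: str, R: int) -> int:
    # Default partitioner: hash(key) mod R routes each intermediate key
    # to one of the R reduce tasks; a good hash keeps them well balanced.
    return hash(key) % R

print(partition("apple", 4))  # always the same reduce task for "apple" within one run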
Diagram (1)
Diagram (2)
Counting Words
• Counting words in a large set of documents (URLs)
map(string key, string value)
  // key: document name
  // value: document contents
  for each word w in value
    EmitIntermediate(w, "1");

reduce(string key, iterator values)
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values
    result += ParseInt(v);
  Emit(AsString(result));
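A runnable single-process Python sketch of the same word count; the grouping step below stands in for the shuffle that the real framework performs:

from collections import defaultdict

def map_fn(doc_name, contents):
    # Emit (word, 1) for every word in the document
    for word in contents.split():
        yield word, 1

def reduce_fn(word, counts):
    # Sum all counts emitted for this word
    yield word, sum(counts)

def run(documents):
    # Shuffle: group intermediate values by key (done by the framework in real MapReduce)
    groups = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            groups[key].append(value)
    return dict(pair for key, values in groups.items()
                for pair in reduce_fn(key, values))

print(run({"d1": "the quick brown fox", "d2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}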
Count Words
Workflow
Search: Inputs
• A set of files containing lines of text and a search pattern
to find
– Mapper key is file name, line number
– Mapper value is the contents of the line
– Search pattern sent as special parameter
• Mapper:
– Given (filename, some text) and “pattern”, if “text”
matches “pattern” output (filename, _)
• Reducer:
– Identity function
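A sketch of that search job in the same style as the word-count example (PATTERN and the function names are assumptions):

import re

PATTERN = re.compile(r"error")    # the search pattern, passed to the job as a parameter

def grep_map(key, line):
    # key is (filename, line number); line is the text of that line
    filename, _lineno = key
    if PATTERN.search(line):
        yield filename, None      # emit (filename, _); the value carries no information

def grep_reduce(filename, values):
    # Identity reducer: pass the matching filename straight through
    yield filename, None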
Search: An Optimization
• Once a file is found to be interesting, we only need to
mark it that way once
• Use Combiner function to fold redundant (filename, _)
pairs into a single one
– Reduces network I/O
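A combiner for this job is trivial; a sketch (function name assumed):

def grep_combine(filename, values):
    # Map-side combiner: fold however many (filename, _) pairs the mapper emitted
    # for one file into a single pair, so only one record per file crosses the
    # network to the reducers.
    yield filename, None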
Locality issue
• Master scheduling policy
– Asks GFS for locations of replicas of input file blocks
– Map task inputs are typically split into 64 MB pieces (== GFS block size)
– Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack
• Effect
– Thousands of machines read input at local disk speed
– Without this, rack switches limit read rate
5 Fault Tolerance
Fault Tolerance
• Reactive way
– Worker failure
• Heartbeat: workers are periodically pinged by the master
– No response = failed worker
• If a worker's process fails, the tasks of that worker are reassigned to another worker
– Master failure
• The master writes periodic checkpoints
• Another master can be started from the last checkpointed state
• If the master eventually dies, the job is aborted
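A schematic of the master-side heartbeat bookkeeping (timeout value and data structures are illustrative assumptions):

import time

HEARTBEAT_TIMEOUT = 10.0   # seconds without a reply before a worker is declared dead

last_seen = {}             # worker id -> time of last successful ping reply
assignments = {}           # worker id -> list of task ids currently assigned

def record_heartbeat(worker_id):
    last_seen[worker_id] = time.time()

def detect_failures(reassign):
    now = time.time()
    for worker_id, seen in list(last_seen.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            # No response: treat the worker as failed and hand its tasks
            # to other workers for re-execution.
            for task in assignments.pop(worker_id, []):
                reassign(task)
            del last_seen[worker_id]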
Fault Tolerance
• Proactive way (Redundant Execution)
– The problem of “stragglers” (slow workers)
• Other jobs consuming resources on machine
• Bad disks with soft errors transfer data very slowly
• Weird things: processor caches disabled (!!)
– When the computation is almost done, reschedule in-progress tasks
– Whenever either the primary or the backup execution finishes, mark the task as completed
Fault Tolerance
• Input error: bad records
– Map/Reduce functions sometimes fail for particular inputs
– Best solution is to debug & fix, but not always possible
– On a segmentation fault
• Send UDP packet to master from signal handler
• Include sequence number of record being processed
– Skip bad records
• If master sees two failures for same record, next worker is told to
skip the record
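A sketch of the master-side bookkeeping for skipping bad records (two failures as the threshold, per the slide; data structures are illustrative):

from collections import defaultdict

failures = defaultdict(int)   # record sequence number -> observed failure count

def report_crash(record_seq):
    # Called when a worker's signal handler reports the record it was processing
    failures[record_seq] += 1

def should_skip(record_seq):
    # After two failures on the same record, tell the next worker to skip it
    return failures[record_seq] >= 2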
Refinements
• Task Granularity
– Minimizes time for fault recovery
– load balancing
• Local execution for debugging/testing
• Compression of intermediate data
Points to be emphasized
• No reduce can begin until map is complete
• Master must communicate locations of
intermediate files
• Tasks scheduled based on location of data
• If map worker fails any time before reduce
finishes, task must be completely rerun
• MapReduce library does most of the hard
work for us!
Model is Widely Applicable
• MapReduce programs in the Google source tree
• Examples:
distributed grep
distributed sort
web link-graph reversal
term-vector / host
web access log stats
inverted index construction
document clustering
machine learning
statistical machine translation
...
How to use it
• User to-do list:
– Indicate:
• Input/output files
• M: number of map tasks
• R: number of reduce tasks
• W: number of machines
– Write the map and reduce functions
– Submit the job (see the sketch below)
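As a sketch, the user-visible part of a job can be thought of as a small specification like the one below (field names and paths are hypothetical, not an actual MapReduce or Hadoop API):

job = {
    "input": "gfs://logs/2014-03-*",       # input files
    "output": "gfs://results/wordcount/",  # output files
    "M": 200,                              # number of map tasks
    "R": 10,                               # number of reduce tasks
    "W": 50,                               # number of machines
    "map": lambda key, value: ((w, 1) for w in value.split()),     # user-written map
    "reduce": lambda key, values: ((key, sum(values)),),           # user-written reduce
}
# submit(job)  -- hypothetical call; the framework handles splitting, scheduling,
# shuffling, fault tolerance, and re-execution.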
6 Applications
MapReduce Implementations
• Cluster: (1) Google MapReduce, (2) Apache Hadoop
• Multicore CPU: Phoenix @ Stanford
• GPU: Mars @ HKUST
Hadoop
Google      Yahoo
MapReduce   Hadoop
GFS         HDFS
Bigtable    HBase
Chubby      (nothing yet… but planned)
Recent news about Hadoop
• Apache Hadoop Wins Terabyte Sort
Benchmark
• The sort used 1800 maps and 1800 reduces and allocated enough memory to the buffers to hold the intermediate data in memory.
MapReduce on GPU
• PACT’08
GeForce 8800 GTX, PS3, Xbox360
Conclusions
• Provide a general-purpose model to simplify
large-scale computation
• Allow users to focus on the problem without
worrying about details
Lisp Installation
www.clisp.org
Reference
• Original paper
(http://labs.google.com/papers/mapreduce.html)
• Hadoop – MapReduce in Java
(http://lucene.apache.org/hadoop/)