Transcript 1 - cse, hkust
Learning with Hadoop
– A case study on MapReduce-based
Data Mining
Evan Xiang, HKUST
1
Outline
Hadoop Basics
Case Study
Word Count
Pairwise Similarity
PageRank
K-Means Clustering
Matrix Factorization
Cluster Coefficient
Resource Entries to ML labs
Advanced Topics
Q&A
2
Introduction to Hadoop
Hadoop Map/Reduce is
a Java-based software framework for easily writing
applications
which process vast amounts of data (multi-terabyte
data-sets) in parallel on large clusters (thousands of
nodes) of commodity hardware
in a reliable, fault-tolerant manner.
3
Hadoop Cluster Architecture
[Figure: a client submits jobs to the job submission node, which runs the JobTracker; the HDFS master runs the NameNode; each slave node runs a TaskTracker and a DataNode.]
From Jimmy Lin’s slides
4
Hadoop HDFS
5
Hadoop Cluster Rack Awareness
6
Hadoop Development Cycle
1. scp data to cluster
2. Move data into HDFS
3. Develop code locally
4. Submit MapReduce job
4a. Go back to Step 3
5. Move data out of HDFS
6. scp data from cluster
From Jimmy Lin’s slides
7
Divide and Conquer
[Figure: the “Work” is partitioned into w1, w2, w3; each part is processed by a “worker”, producing r1, r2, r3, which are combined into the “Result”.]
From Jimmy Lin’s slides
8
High-level MapReduce pipeline
9
Detailed Hadoop MapReduce data flow
10
Word Count with MapReduce
[Figure: Map reads Doc 1 “one fish, two fish”, Doc 2 “red fish, blue fish”, and Doc 3 “cat in the hat” and emits (term, count) pairs per document, e.g. (fish, 2) from Doc 1 and (fish, 2) from Doc 2. Shuffle and Sort aggregates values by key; Reduce sums the counts for each term, e.g. fish → 4.]
From Jimmy Lin’s slides
12
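The word-count flow above can be sketched in plain Python. This is an illustrative simulation of the map, shuffle-and-sort, and reduce phases only, not actual Hadoop API code; a real job would implement Mapper and Reducer classes in Java.

```python
from collections import defaultdict

def map_phase(docs):
    # Map: emit (term, 1) for every token in every document
    for doc_id, text in docs.items():
        for term in text.replace(",", "").split():
            yield term, 1

def shuffle(pairs):
    # Shuffle and Sort: aggregate values by key
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # Reduce: sum the counts for each term
    return {term: sum(counts) for term, counts in grouped.items()}

docs = {1: "one fish, two fish", 2: "red fish, blue fish", 3: "cat in the hat"}
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts["fish"])  # -> 4, fish occurs four times across the collection
```

Note that the framework, not the user, performs the shuffle; the mapper and reducer are the only two functions a word-count job supplies.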
Calculating document pairwise similarity
Trivial Solution
load each vector O(N) times
load each term O(df_t²) times
Goal
a scalable and efficient solution
for large collections
From Jimmy Lin’s slides
14
Better Solution
Each term contributes only if it appears in both documents
Load weights for each term only once
Each term contributes O(df_t²) partial scores
From Jimmy Lin’s slides
15
Decomposition
Each term contributes only if it appears in both documents → reduce
Load weights for each term once → map
Each term contributes O(df_t²) partial scores
From Jimmy Lin’s slides
16
Standard Indexing
[Figure: (a) Map — tokenize each doc; (b) Shuffle — group values by terms; (c) Reduce — combine into posting lists.]
From Jimmy Lin’s slides
17
Inverted Indexing with MapReduce
[Figure: Map reads Doc 1 “one fish, two fish”, Doc 2 “red fish, blue fish”, and Doc 3 “cat in the hat” and emits (term, (doc_id, tf)) postings. Shuffle and Sort aggregates postings by term; Reduce merges them into posting lists, e.g. fish → [(1, 2), (2, 2)].]
From Jimmy Lin’s slides
18
Indexing (3-doc toy collection)
[Figure: indexing a 3-doc toy collection of Clinton/Obama/Cheney/Barack documents into per-term posting lists with per-document term frequencies.]
From Jimmy Lin’s slides
19
Pairwise Similarity
[Figure: (a) Generate pairs — each term’s posting list emits partial scores for the document pairs it appears in; (b) Group pairs; (c) Sum pairs into the final similarity scores.]
How to deal with the long list?
From Jimmy Lin’s slides
20
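The generate-pairs / group-pairs / sum-pairs decomposition can be sketched in plain Python. The posting lists below echo the Clinton/Obama toy collection, but the weights are hypothetical term frequencies chosen for illustration; a real system would use tf-idf or similar weights.

```python
from collections import defaultdict
from itertools import combinations

# term -> list of (doc_id, weight); toy postings with made-up weights
postings = {
    "clinton": [(1, 2), (2, 1), (3, 1)],
    "obama":   [(1, 2), (3, 1)],
    "cheney":  [(2, 1)],
    "barack":  [(3, 1)],
}

def generate_pairs(postings):
    # Map: each term emits a partial score for every pair of docs
    # in its posting list (a term contributes only if it appears
    # in both documents)
    for term, plist in postings.items():
        for (d1, w1), (d2, w2) in combinations(plist, 2):
            yield (d1, d2), w1 * w2

def sum_pairs(pairs):
    # Reduce: group by document pair and sum the partial scores
    scores = defaultdict(int)
    for pair, partial in pairs:
        scores[pair] += partial
    return dict(scores)

similarity = sum_pairs(generate_pairs(postings))
print(similarity[(1, 3)])  # -> 4: clinton contributes 2*1, obama 2*1
```

The O(df_t²) cost is visible in `combinations(plist, 2)`: a frequent term’s long posting list generates quadratically many pairs, which is exactly the “long list” problem the slide asks about.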
PageRank
PageRank – an information propagation model
Intensive access to the neighborhood list
22
PageRank with MapReduce
[Figure: Map reads the adjacency lists n1 [n2, n4], n2 [n3, n5], n3 [n4], n4 [n5], n5 [n1, n2, n3] and distributes each node’s rank to its out-neighbors; Reduce sums the partial ranks arriving at each node and reassembles the adjacency lists.]
How to maintain the graph structure?
From Jimmy Lin’s slides
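One PageRank iteration in this style can be sketched in plain Python. This is a simplified illustration (damping is omitted); the mapper re-emits each node’s adjacency list alongside the rank mass, which is one common answer to the “how to maintain the graph structure?” question.

```python
from collections import defaultdict

graph = {
    "n1": ["n2", "n4"],
    "n2": ["n3", "n5"],
    "n3": ["n4"],
    "n4": ["n5"],
    "n5": ["n1", "n2", "n3"],
}

def pagerank_iteration(graph, ranks):
    # Map: emit (neighbor, partial rank) for each out-edge, plus
    # (node, adjacency list) so the structure survives the shuffle
    emitted = []
    for node, neighbors in graph.items():
        emitted.append((node, neighbors))  # preserve graph structure
        for neighbor in neighbors:
            emitted.append((neighbor, ranks[node] / len(neighbors)))
    # Shuffle + Reduce: sum partial ranks, rebuild adjacency lists
    new_ranks, new_graph = defaultdict(float), {}
    for key, value in emitted:
        if isinstance(value, list):
            new_graph[key] = value
        else:
            new_ranks[key] += value
    return new_graph, dict(new_ranks)

ranks = {node: 1.0 / len(graph) for node in graph}
graph, ranks = pagerank_iteration(graph, ranks)
print(ranks["n4"])  # -> 0.3: 0.2/2 from n1 plus 0.2 from n3
```

The total rank mass is conserved across the iteration, which is a useful sanity check when debugging a real job.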
K-Means Clustering
25
K-Means Clustering with MapReduce
Each Mapper loads a set of data samples, and assigns each sample to the nearest centroid. Each Mapper needs to keep a copy of the centroids.
[Figure: Mapper_i-1, Mapper_i, and Mapper_i+1 assign their samples to centroids 1–4; Reducer_i-1, Reducer_i, and Reducer_i+1 each recompute one centroid.]
How to set the initial centroids is very important!
Usually we set the centroids using Canopy Clustering.
[McCallum, Nigam and Ungar: "Efficient Clustering of High Dimensional
Data Sets with Application to Reference Matching", SIGKDD 2000]
26
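One K-Means iteration in this style can be sketched in plain Python. The map step (assign each sample to the nearest centroid) and reduce step (recompute each centroid as the mean of its assigned samples) are collapsed into a single function; 1-D points keep the toy example small.

```python
from collections import defaultdict

def kmeans_iteration(samples, centroids):
    # Map: assign every sample to the nearest centroid
    # (every mapper holds a copy of the current centroids)
    assignments = defaultdict(list)
    for x in samples:
        nearest = min(range(len(centroids)),
                      key=lambda c: abs(x - centroids[c]))
        assignments[nearest].append(x)
    # Reduce: recompute each centroid as the mean of its samples;
    # a centroid with no samples is left unchanged
    return [sum(pts) / len(pts) if (pts := assignments[c]) else centroids[c]
            for c in range(len(centroids))]

samples = [1.0, 2.0, 10.0, 11.0, 12.0]
centroids = kmeans_iteration(samples, [0.0, 9.0])
print(centroids)  # -> [1.5, 11.0]
```

In a real job each iteration is a full MapReduce pass, and the updated centroids are redistributed to every mapper before the next pass, which is why a good initialization (e.g. Canopy Clustering) matters so much.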
Matrix Factorization for Link Prediction
In this task, we observe a sparse matrix X ∈ R^{m×n} with entries x_ij. Let R = {(i, j, r): r = x_ij, where x_ij ≠ 0} denote the set of observed links in the system. In order to predict the unobserved links in X, we model the users and the items by a user factor matrix U ∈ R^{k×m} and an item factor matrix V ∈ R^{k×n}. The goal is to approximate the link matrix X by the product UᵀV, which can be learned by minimizing:

min_{U,V} Σ_{(i,j)∈R} (x_ij − u_iᵀ v_j)² + λ(‖U‖²_F + ‖V‖²_F)
28
Solving Matrix Factorization via Alternating Least Squares
Given X and V, updating U: each u_i ∈ R^k is obtained by solving the k×k least-squares system A u_i = b, where A is accumulated from the factors v_j of the items user i has rated and b from the corresponding ratings x_ij.
[Figure: the m×n matrix X, the k×n factor matrix V, and the k×k system (A, b) for one u_i.]
Similarly, given X and U, we can alternately update V
29
MapReduce for ALS
Stage 1: Mapper_i groups the rating data in X by item j, and groups the features in V by item j; Reducer_i aligns the ratings and features for item j, and makes a copy of Vj for each observed xij.
Stage 2: Mapper_i groups the rating data in X (with the attached copies of Vj) by user i; Reducer_i runs standard ALS: calculate A and b, and update Ui.
30
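The “standard ALS” step the Stage-2 reducer performs can be sketched with NumPy. The regularization weight λ is an assumption of this sketch (the slides’ objective omits the exact value), and the one-hot item factors in the usage example are a toy choice that makes the update easy to verify by hand.

```python
import numpy as np

def als_update_user(ratings_i, V, lam=0.1):
    """Solve A u_i = b for one user.

    ratings_i: dict mapping item j -> observed rating x_ij
    V: k x n item factor matrix
    lam: assumed regularization weight (hypothetical value)
    """
    k = V.shape[0]
    A = lam * np.eye(k)       # A = lam*I + sum_j v_j v_j^T
    b = np.zeros(k)           # b = sum_j x_ij v_j
    for j, x_ij in ratings_i.items():
        v_j = V[:, j]
        A += np.outer(v_j, v_j)
        b += x_ij * v_j
    return np.linalg.solve(A, b)

# Toy check: with one-hot item factors, the update shrinks each
# rating by the regularization: u_i[0] = 5/(1+lam), u_i[2] = 3/(1+lam)
V = np.eye(4, 6)  # k=4 factors, n=6 items
u_i = als_update_user({0: 5.0, 2: 3.0}, V)
```

The same function, with roles swapped, updates each item factor v_j from U, which is the alternation that gives ALS its name.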
Cluster Coefficient
In graph mining, a clustering coefficient is a measure of the
degree to which nodes in a graph tend to cluster
together. The local clustering coefficient of a vertex in a
graph quantifies how close its neighbors are to being a
clique (complete graph), and is used to determine
whether a graph is a small-world network.
[D. J. Watts and Steven
Strogatz (June 1998).
"Collective dynamics of
'small-world' networks".
Nature 393 (6684): 440–442]
How to maintain the
Tier-2 neighbors?
32
Cluster Coefficient with MapReduce
Stage 1: Mapper_i → Reducer_i; Stage 2: Mapper_i → Reducer_i: calculate the cluster coefficient.
A BFS-based method needs three stages, but actually we only need two!
33
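The quantity the two stages assemble can be computed directly in plain Python for a small graph: count the edges among a vertex’s neighbors (the Tier-2 information) and divide by the number of edges a clique on those neighbors would have. This is a single-machine sketch of the measure itself, not the two-stage MapReduce job.

```python
def local_clustering_coefficient(graph, v):
    # graph: undirected graph as dict of adjacency sets
    neighbors = graph[v]
    d = len(neighbors)
    if d < 2:
        return 0.0
    # Count edges among v's neighbors (each unordered pair once)
    links = sum(1 for a in neighbors for b in neighbors
                if a < b and b in graph[a])
    # A clique on d neighbors would have d*(d-1)/2 edges
    return links / (d * (d - 1) / 2)

graph = {
    "a": {"b", "c", "d"},
    "b": {"a", "c"},
    "c": {"a", "b"},
    "d": {"a"},
}
cc = local_clustering_coefficient(graph, "a")
print(cc)  # -> 0.333...: only (b, c) of the 3 possible edges exists
```

In the MapReduce version, Stage 1 ships each vertex’s neighbor list to its neighbors so that Stage 2 can perform this membership test without random access to the whole graph.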
Resource Entries to ML labs
Mahout
Apache’s scalable machine learning libraries
Jimmy Lin’s Lab
iSchool at the University of Maryland
Jimeng Sun & Yan Rong’s Collections
IBM TJ Watson Research Center
Edward Chang & Yi Wang
Google Beijing
34
Advanced Topics in Machine
Learning with MapReduce
Probabilistic Graphical models
Gradient based optimization methods
Graph Mining
Others…
35
Some Advanced Tips
Design your algorithm in a divide-and-conquer
manner
Make your functional units loosely dependent
Carefully manage your memory and disk
storage
Discussions…
36
Q&A
Why not MPI?
Hadoop is cheap in everything…D.P.T.H…
What are the advantages of Hadoop?
Scalability!
How do you guarantee model equivalence?
Guarantee equivalent/comparable functional logic
How can you beat a “large memory” solution?
Clever use of sequential disk access
38