Machine Learning on Spark

Download Report

Transcript Machine Learning on Spark

Machine Learning on
Spark
Shivaram Venkataraman
UC Berkeley
Machine learning
Computer Science
Statistics
Spam filters
Click prediction
Machine learning
Recommendations
Search ranking
Classification
Clustering
Machine learning
techniques
Regression
Active learning
Collaborative filtering
Implementing Machine Learning
 Machine learning algorithms are
- Complex, multi-stage
- Iterative
 MapReduce/Hadoop unsuitable
 Need efficient primitives for data sharing
Machine Learning using Spark
 Spark RDDs  efficient data sharing
 In-memory caching accelerates performance
- Up to 20x faster than Hadoop
 Easy to use high-level programming interface
- Express complex algorithms ~100 lines.
Classification
Clustering
Machine learning
techniques
Regression
Active learning
Collaborative filtering
K-Means Clustering using Spark
Focus: Implementation and Performance
Grouping data according to
similarity
Distance North
Clustering
E.g. archaeological dig
Distance East
Grouping data according to
similarity
Distance North
Clustering
E.g. archaeological dig
Distance East
Benefits
• Popular
• Fast
• Conceptually straightforward
Distance North
K-Means Algorithm
E.g. archaeological dig
Distance East
K-Means: preliminaries
data = lines.map(line=>
parseVector(line))
Feature 2
Data: Collection of values
Feature 1
Dissimilarity:
Squared Euclidean distance
dist = p.squaredDist(q)
Feature 2
K-Means: preliminaries
Feature 1
K-Means: preliminaries
Data assignments to clusters
S1, S2,. . ., SK
Feature 2
K = Number of clusters
Feature 1
K-Means: preliminaries
Data assignments to clusters
S1, S2,. . ., SK
Feature 2
K = Number of clusters
Feature 1
• Initialize K cluster centers
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 2
K-Means Algorithm
Feature 1
• Initialize K cluster centers
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 2
K-Means Algorithm
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 2
• Initialize K cluster centers
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 2
• Initialize K cluster centers
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
Assign each data point to
the cluster with the closest
center.
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 2
• Initialize K cluster centers
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
Assign each cluster center
to be the mean of its
cluster’s data points.
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
newCenters = pointsGroup.mapValues(
ps => average(ps))
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
while (dist(centers, newCenters) > ɛ)
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
newCenters =pointsGroup.mapValues(
ps => average(ps))
Feature 1
K-Means Algorithm
centers = data.takeSample(
false, K, seed)
• Repeat until convergence:
while (dist(centers, newCenters) > ɛ)
closest = data.map(p =>
(closestPoint(p,centers),p))
Feature 2
• Initialize K cluster centers
pointsGroup =
closest.groupByKey()
newCenters =pointsGroup.mapValues(
ps => average(ps))
Feature 1
centers = data.takeSample(
false, K, seed)
while (d > ɛ)
{
closest = data.map(p =>
(closestPoint(p,centers),p))
pointsGroup =
closest.groupByKey()
Feature 2
K-Means Source
newCenters =pointsGroup.mapValues(
ps => average(ps))
d = distance(centers, newCenters)
centers = newCenters.map(_)
}
Feature 1
Ease of use
 Interactive shell:
Useful for featurization, pre-processing data
 Lines of code for K-Means
- Spark ~ 90 lines – (Part of hands-on tutorial !)
- Hadoop/Mahout ~ 4 files, > 300 lines
Performance
Logistic Regression
111
116
76
62
80
100
0
0
25
50
100
Number of machines
[Zaharia et. al, NSDI’12]
25
50
100
Number of machines
3
50
6
50
150
15
33
61
100
200
184
Iteration time (s)
87
106
121
150
157
200
143
250
Hadoop
HadoopBinMem
Spark
250
Hadoop
HadoopBinMem
Spark
197
Iteration time (s)
300
274
K-Means
Conclusion
 Spark: Framework for cluster computing
 Fast and easy machine learning programs
 K means clustering using Spark
 Hands-on exercise this afternoon !
Examples and more: www.spark-project.org