Building a Recommendation Engine with HDInsight and Hadoop in Microsoft Azure
App3008
Sebastian Brandes
Technology Evangelist, Microsoft
November 26, 2014
#CampusDays
Agenda
1. Introduction to Personalization
2. How to Build a Recommendation Engine
3. Overview of Technologies (Hadoop 2.4.0 and Mahout 0.9)
4. Moving Data into Microsoft Azure
5. Analyzing Data in Azure with HDInsight 3.1
6. Introduction to Azure Machine Learning
7. Future of Apache Mahout and Azure ML – the rise of Apache Spark?
8. Demo of Spark 1.0.2 on Customized Hadoop Cluster in Azure
9. Getting Started on Your Own
10. Q&A
Introduction to Personalization
• 27% of customers have seen Personalization online
• 86% of those say Personalization influenced what they purchased to some extent
• 31% want a more Personalized experience
• 59% of customers who have experienced Personalization believe it has a noticeable influence on purchasing
58% prefer product recommendations from previous purchases over other forms of personalization
How to Build a Recommendation Engine
Overview of Technologies
Hadoop Timeline
• October 2003: Google publishes GFS paper
  • http://research.google.com/archive/gfs.html
• December 2004: Google publishes MapReduce paper
  • http://research.google.com/archive/mapreduce.html
• 2005: Doug Cutting (Yahoo!) and Mike Cafarella (U Washington) create Hadoop
• [Insert many years here.]
• November 2012: HDInsight is released in Technical Preview (as PaaS and IaaS)
• August 2013: Hadoop 1.X is stable
• October 2013: HDInsight is generally available
• October 2013: Hadoop 2.X is generally available
• June 2014: HDInsight 3.1 (Hadoop 2.4.0) is released
Microsoft Azure is a
• cloud computing platform and infrastructure,
• created by Microsoft,
• for building, deploying and managing applications and services
• through a global network of Microsoft-managed datacenters.
Apache Hadoop is an
• open-source
• software framework for
• distributed storage and
• distributed processing of
• Big Data on
• clusters of commodity hardware.
A mahout is a person who rides an elephant.
Errh...
Apache Mahout is a project of the Apache Software Foundation to produce
• free implementations of
• distributed or otherwise scalable
• machine learning algorithms focused primarily in the areas of
• collaborative filtering, clustering and classification.
MapReduce is
• a programming model and
• an associated implementation for processing and generating large data sets with a
• parallel,
• distributed algorithm
• on a cluster.
Traditional Hadoop Cluster
[Diagram: Client(s) talk to the Masters (Primary Name Node, Secondary Name Node, Job Tracker). The Slaves hold the Data; each slave runs a Data Node and a Task Tracker.]
HDInsight Cluster and Storage
MapReduce in C#
using System.Collections.Generic;
using System.Linq;

public class NaiveMapReduceProgram<K1, V1, K2, V2, V3>
{
    public delegate IEnumerable<KeyValuePair<K2, V2>> MapFunction(K1 key, V1 value);
    public delegate IEnumerable<V3> ReduceFunction(K2 key, IEnumerable<V2> values);

    private readonly MapFunction _map;
    private readonly ReduceFunction _reduce;

    public NaiveMapReduceProgram(MapFunction mapFunction, ReduceFunction reduceFunction)
    {
        _map = mapFunction;
        _reduce = reduceFunction;
    }

    // Map phase: apply the map function to every input pair and flatten the results.
    public IEnumerable<KeyValuePair<K2, V2>> Map(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        var q = from pair in input
                from mapped in _map(pair.Key, pair.Value)
                select mapped;
        return q;
    }

    // Reduce phase: group the intermediate values by key, then reduce each group.
    public IEnumerable<KeyValuePair<K2, V3>> Reduce(IEnumerable<KeyValuePair<K2, V2>> intermediateValues)
    {
        var groups = from pair in intermediateValues
                     group pair.Value by pair.Key into g
                     select g;
        var reduced = from g in groups
                      let k2 = g.Key
                      from reducedValue in _reduce(k2, g)
                      select new KeyValuePair<K2, V3>(k2, reducedValue);
        return reduced;
    }

    // Run the map phase and feed its output into the reduce phase.
    public IEnumerable<KeyValuePair<K2, V3>> Execute(IEnumerable<KeyValuePair<K1, V1>> input)
    {
        return Reduce(Map(input));
    }
}
MapReduce in C#: Word Count
// Map: emit (word, 1) for every word in the input value.
public IList<KeyValuePair<string, int>> MapFromMem(string key, string value)
{
    var result = new List<KeyValuePair<string, int>>();
    foreach (var word in value.Split(' '))
    {
        result.Add(new KeyValuePair<string, int>(word, 1));
    }
    return result;
}

// Reduce: sum all counts emitted for one word.
public IEnumerable<int> Reduce(string key, IEnumerable<int> values)
{
    int sum = 0;
    foreach (int value in values)
    {
        sum += value;
    }
    return new[] { sum };
}

// Wire the two functions together and run the job on the in-memory input.
var master = new NaiveMapReduceProgram<string, string, string, int, int>(MapFromMem, Reduce);
var result = master.Execute(inputData).ToDictionary(pair => pair.Key, pair => pair.Value);
How to Build a Recommendation Engine
Broadly speaking, we can build a recommender either by
• finding questions that a user may be interested in answering, based on the questions answered by similar users, or
• finding other questions that are similar to the questions the user has already answered.
The first technique is known as user-based recommendation, and the second technique is known as item-based recommendation.
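As a rough illustration of the item-based idea, the sketch below counts how often two items occur together in the same user's history (a co-occurrence count). The user names and histories are hypothetical; they are chosen only so that the resulting counts match the similarity matrix shown on the following slide. This is a minimal in-memory sketch, not how Mahout computes similarities on a cluster.

```python
from itertools import product

# Hypothetical histories: which items (S1..S4) each user has interacted with.
histories = {
    "user_a": {"S1", "S2", "S3"},
    "user_b": {"S1", "S2", "S3"},
    "user_c": {"S2", "S3", "S4"},
    "user_d": {"S3"},
}
items = ["S1", "S2", "S3", "S4"]


def cooccurrence_matrix(histories, items):
    """Count, for every pair of items, how many users have both of them."""
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]
    for seen in histories.values():
        # Every ordered pair (including an item with itself) gets one count per user.
        for a, b in product(seen, repeat=2):
            matrix[index[a]][index[b]] += 1
    return matrix


similarity = cooccurrence_matrix(histories, items)
for item, row in zip(items, similarity):
    print(item, row)
```

The diagonal holds the number of users per item, and the matrix is symmetric because co-occurrence does not depend on order.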
Similarity Matrix:
      S1   S2   S3   S4
S1     2    2    2    0
S2     2    3    3    1
S3     2    3    4    1
S4     0    1    1    1
Excel formula:
{=MMULT(A2:E5,F2:F5)}
Similarity Matrix, with Tina Stenderup-Larsen's Preference (User 3) and the result (R3):
      S1   S2   S3   S4   User 3   R3
S1     2    2    2    0        0    4
S2     2    3    3    1        1    6
S3     2    3    4    1        1    7
S4     0    1    1    1        0    2
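The R3 column is the product of the similarity matrix and the preference vector of User 3. A minimal sketch of that multiplication follows, using the numbers from the slide. The helper names `score` and `recommend` are made up for this example, and filtering out items the user already has is an assumption about how the scores would be used.

```python
items = ["S1", "S2", "S3", "S4"]
similarity = [
    [2, 2, 2, 0],
    [2, 3, 3, 1],
    [2, 3, 4, 1],
    [0, 1, 1, 1],
]
preference = [0, 1, 1, 0]  # User 3 has interacted with S2 and S3


def score(similarity, preference):
    """Matrix-vector product: one score per item."""
    return [sum(s * p for s, p in zip(row, preference)) for row in similarity]


def recommend(items, scores, preference, top_n=1):
    """Rank the items the user has not interacted with yet by score."""
    candidates = [(sc, it) for it, sc, p in zip(items, scores, preference) if p == 0]
    candidates.sort(key=lambda pair: pair[0], reverse=True)
    return [it for _, it in candidates[:top_n]]


r3 = score(similarity, preference)
print(r3)                                # [4, 6, 7, 2]
print(recommend(items, r3, preference))  # ['S1']
```

S2 and S3 score highest, but User 3 already has them, so S1 is the first new item to suggest.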
In Mahout 0.9 we will make use of:
• Item-Based Collaborative Filtering
More specifically, this class:
• org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
Documentation: https://builds.apache.org/job/MahoutQuality/javadoc/org/apache/mahout/cf/taste/hadoop/item/RecommenderJob.html
Overview of Technology
• Apache Hadoop 2.4.0.2.1.8.0-2176
• Apache Mahout 0.9.0.2.1.8.0-2176
• Microsoft Azure HDInsight 3.1
Can’t Use mahout-distribution-0.9
You need the Mahout distribution built specifically for HDInsight 3.1.
It’s included in any new HDInsight cluster by default.
This error is a sign of a package made for Hadoop 1.x, which is incompatible with Hadoop 2.x.
Moving Data into Azure
• A very common question among customers: How to do it?
• Different approaches to the task:
  • Manual Movement
  • Automated Movement (on some type of schedule)
• Initial data load vs. incremental data load
• Challenges with private networks (corporate networks)
Moving Data into Azure
UI-based Explorers
• Azure Storage Explorer
• CloudXplorer
• Drag and Drop
• Easy to use for manual movement
• Movement of complete folders
Moving Data into Azure
AzCopy
• Blob Loading Tool
  • Upload
  • Download
  • Command Line Utility
  • Logging
  • Recovery
  • Naming Conflicts
• http://azure.microsoft.com/en-us/documentation/articles/storage-use-azcopy/
Moving Data into Azure
PowerShell
• Set-AzureStorageBlobContent
  • Uploads blob synchronously
  • Can use “foreach” to iterate through a directory
• New-AzureStorageContext
• Start-AzureStorageBlobCopy
  • Asynchronously move between storage accounts
Moving Data into Azure
C#
// Filename, blobname and the ProcessComplete callback are defined elsewhere.
// Variables for the cloud storage objects.
var StorageAccount = CloudStorageAccount.Parse(
    "DefaultEndpointsProtocol=https;AccountName=" + ConfigurationManager.AppSettings["AzureAccountName"]
    + ";AccountKey=" + ConfigurationManager.AppSettings["AzureStorageKey"]);
// Create the blob client.
var blobClient = StorageAccount.CreateCloudBlobClient();
// Get the container reference.
var blobContainer = blobClient.GetContainerReference("data");
// Create the container if it does not exist.
blobContainer.CreateIfNotExists();
// Get the blob reference.
var blob = blobContainer.GetBlockBlobReference(blobname);
// Upload blob asynchronously; ProcessComplete is called when the upload finishes.
blob.BeginUploadFromFile(Filename, FileMode.Open, ProcessComplete, blobname);
Demo of HDInsight
The Echo Nest Taste Profile Subset
The dataset contains
• real user play counts from undisclosed partners, with all songs already matched to the Million Song Dataset.
Link: http://labrosa.ee.columbia.edu/millionsong/tasteprofile
• 1,019,318 unique users
• 384,546 unique MSD songs
• 48,373,586 user - song - play count triplets
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAKIMP12A8C130995  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOAPDEY12A81C210A9  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBBMDR12A8C13253B  2
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFNSP12AF72A0E22  1
b80344d063b5ccb3212f76538f3d9e43d87dca9e  SOBFOVM12A58A7D494  1
...
The Echo Nest Taste Profile Subset
userId,songId,timesPlayed
1,1,1
1,2,1
1,3,2
1,4,1
1,5,1
1,6,1
1,7,2
1,8,1
1,9,1
1,10,1
...
Transforming the Data
Using C# on my Surface Pro 3 with:
• Intel Core i5-4300U @ 1.90 GHz
• 8 GB RAM
• 256 GB SSD
Console app written for .NET Framework 4.5 (x64), 100 lines of optimized code.
Input file = 3 GB. More than 48 million rows...
Took more than 30 minutes to complete with 10 million rows...
A Hadoop cluster could have done that in a matter of seconds.
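The transformation itself is not shown on the slides. Judging from the two samples above, it replaces the hash-like user and song IDs with sequential integers. The sketch below is one way that could look, and it is not the speaker's C# console app. It assumes the input triplets are tab-separated and that IDs are numbered in order of first appearance; the function names are made up for this example.

```python
import csv
import io


def transform(lines):
    """Map hash-like user and song IDs to sequential integers (first-seen order)."""
    user_ids, song_ids = {}, {}
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        user, song, plays = line.split("\t")  # assumption: tab-separated triplets
        user_id = user_ids.setdefault(user, len(user_ids) + 1)
        song_id = song_ids.setdefault(song, len(song_ids) + 1)
        rows.append((user_id, song_id, int(plays)))
    return rows


def to_csv(rows):
    """Render the rows in the userId,songId,timesPlayed layout."""
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["userId", "songId", "timesPlayed"])
    writer.writerows(rows)
    return out.getvalue()


sample = [
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAKIMP12A8C130995\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOAPDEY12A81C210A9\t1",
    "b80344d063b5ccb3212f76538f3d9e43d87dca9e\tSOBBMDR12A8C13253B\t2",
]
rows = transform(sample)
print(to_csv(rows))
```

Both lookups are dictionary-based, so the work per row stays constant as the number of distinct users and songs grows.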
Analyzing Data in Azure with HDInsight 3.1
hadoop fs -rm -r input
hadoop fs -rm -r temp
hadoop fs -mkdir input
hadoop fs -copyFromLocal mInput.txt /user/writer/input/
hadoop fs -copyFromLocal users.txt /user/writer/input/
hadoop jar C:\apps\dist\mahout-0.9.0.2.1.8.0-2176\core\target\mahout-core-0.9.0.2.1.8.0-2176-job.jar org.apache.mahout.cf.taste.hadoop.item.RecommenderJob -s SIMILARITY_COOCCURRENCE --input=/user/writer/input/mInput.txt --output=/user/writer/output --usersFile=/user/writer/input/users.txt
hadoop fs -copyToLocal output/part-r-00000 c:\output.txt
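Once the result file has been copied locally, it needs to be parsed before it can be used in an application. The sketch below assumes each output line has the form `userId<TAB>[itemId:score,itemId:score,...]`. That layout is an assumption about the RecommenderJob text output and should be checked against the actual file. The sample line is invented for illustration.

```python
def parse_recommendations(line):
    """Parse one assumed output line: 'userId<TAB>[itemId:score,...]'."""
    user, _, items = line.strip().partition("\t")
    recommendations = []
    for entry in items.strip("[]").split(","):
        if not entry:
            continue  # user with an empty recommendation list
        item_id, _, score = entry.partition(":")
        recommendations.append((int(item_id), float(score)))
    return int(user), recommendations


sample_line = "3\t[1:4.0,4:2.0]"  # hypothetical line, not real job output
print(parse_recommendations(sample_line))
```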
Demo of Azure Machine Learning
Azure Machine Learning
https://social.msdn.microsoft.com/Forums/azure/en-US/24a06c13-bde0-4955-86f3-85a2f2b37153/testdataset-contains-invalid-data
Future of Apache Mahout and Azure Machine Learning
Future of Apache Mahout and Azure ML
Apache Mahout
• Latest stable release: v0.9
• Under development: v1.0
• Replacing MapReduce with Apache Spark
• Compatibility with Hadoop 2
  • https://issues.apache.org/jira/browse/MAHOUT-1512?jql=project%20%3D%20MAHOUT%20AND%20resolution%20%3D%20Unresolved%20AND%20fixVersion%20%3D%201.0%20ORDER%20BY%20priority%20DESC
• Rewriting algorithms for Apache Spark, H2O and Flink
Azure Machine Learning
• Made available in public preview in July 2014
• Half price until general availability (GA)
• GA: Mid-February 2015
Apache Spark is
• an open-source
• in-memory
• data analytics cluster computing framework
• originally developed in the AMPLab at UC Berkeley.
• In contrast to Hadoop's two-stage disk-based MapReduce paradigm,
• Spark's in-memory primitives
• provide performance up to 100 times faster for certain applications.
MLlib is Spark’s
• scalable machine learning library
• consisting of common learning algorithms
• and utilities,
• including classification,
• regression,
• clustering,
• collaborative filtering,
• dimensionality reduction, as well as
• underlying optimization primitives.
Apache Mahout vs. Apache Spark?
Apache Mahout:
• MapReduce
• Introduced in Nov. 2011
• Very mature
• Community support
• Proven in the field
• Tested stability
Apache Spark:
• MLlib – much faster!
• Not as mature
• Hot in the community
• Became an Apache top-level project in February 2014
Apache Spark 1.0 is a preview feature in Microsoft Azure.
Guide to set it up here: http://azure.microsoft.com/en-us/documentation/articles/hdinsight-hadoop-spark-install/
HDInsight 3.1 is required. Spark is compatible with HDFS and WASB, so existing data can be used for Apache Spark jobs.
A sample script for installation is available via the link above.
To run Spark queries, you can use the Scala shell or write standalone programs.
Setting Up Hadoop with Spark 1.0.2 in Azure using PowerShell
Import-Module Azure
$subscriptionName = "Platform – Internt forbrug"   # Name of the Azure subscription
$clusterName = "cd2014spark"                       # The HDInsight cluster name
$storageAccountName = "cd2014spark"                # Azure storage account that hosts the default container
$storageAccountKey = "<YOURKEY>"
$containerName = $clusterName
$location = "West Europe"                          # The location of the HDInsight cluster
$clusterNodes = 4                                  # The number of nodes in the HDInsight cluster
$version = "3.1"                                   # For example "3.1"

Select-AzureSubscription $subscriptionName
$config = New-AzureHDInsightClusterConfig -ClusterSizeInNodes $clusterNodes
$config.DefaultStorageAccount.StorageAccountName = "$storageAccountName.blob.core.windows.net"
$config.DefaultStorageAccount.StorageAccountKey = $storageAccountKey
$config.DefaultStorageAccount.StorageContainerName = $containerName
$config = Add-AzureHDInsightScriptAction -Config $config -Name "Install Spark" -ClusterRoleCollection HeadNode -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv01/spark-installer-v01.ps1
New-AzureHDInsightCluster -Config $config -Name $clusterName -Location $location -Version $version
Valid as of November 26, 2014.
Word Count Example in Spark 1.0.2
Using the Scala shell:
val f = sc.textFile("/example/data/gutenberg/davinci.txt")
val counts = f.flatMap(line => line.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)
counts.collect().foreach(println)
Slides and demos are available at:
sebastianbrandes.com
Q&A
Ask me about anything!
Join me at the Microsoft Booth for the next 30 minutes @Meet The Experts
Don't forget to evaluate this session!