Apache Mahout


Guided By
Ms. Shikha Pachouly
Assistant Professor
Computer Engineering Department
4/10/2016
Machine Learning
• Machine learning is programming computers to optimize a performance criterion using example data or past experience.
• Machine learning strategies
1) Supervised
2) Unsupervised
Common Use Cases
• Recommend friends/dates/products
• Classify content into predefined groups
• Find similar content based on object properties
• Find associations/patterns in actions/behaviors
• Identify key topics in a large collection of text
• Detect anomalies in output
• Rank search results
Apache Mahout Introduction
• Machine learning library for scalable applications
• Includes core algorithms for recommendation, clustering and classification, implemented on top of the Hadoop MapReduce model.
• The core libraries are also highly optimized, so non-distributed algorithms perform well too.
• Mahout is distributed under the commercially friendly Apache Software License.
• The goal of Mahout is to build a vibrant, responsive, diverse community that facilitates discussion not only of the project itself but also of potential use cases.
• Mahout currently supports three main use cases:
1) Recommendation mining
2) Clustering
3) Classification
Why Mahout
• Many open source ML libraries (PyBrain, Shark, etc.) either
1) lack community
2) lack scalability
3) lack documentation and examples
• Most Mahout implementations are MapReduce-enabled
• The main goal of Apache Mahout is to be useful to practitioners.
- Implementations should be easy to use from within Java applications.
- Deploying trained models should be close to trivial.
- Scaling to include more and more diverse data should be simple.
Recommendations
• Extensive framework for collaborative filtering
• Recommenders
1) User-based
2) Item-based
• Many different similarity measures, e.g. cosine, log-likelihood ratio (LLR), Tanimoto, Pearson
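To make the similarity measures concrete, here is a minimal sketch of three of them in plain Python. It is not Mahout's API (Mahout's implementations are in Java); the function names are illustrative, and the vectors are assumed to be equal-length lists of ratings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

def pearson_correlation(a, b):
    """Pearson correlation: cosine similarity of the mean-centred vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # undefined for constant vectors; 0.0 is a convention
    return cov / math.sqrt(var_a * var_b)

def tanimoto_coefficient(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items / all items, ratings ignored."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)

# Two users who each liked three items, two of them in common.
print(tanimoto_coefficient({"a", "b", "c"}, {"b", "c", "d"}))  # 0.5
```

Tanimoto only looks at which items were rated, so it suits data without rating values; cosine and Pearson use the rating values themselves.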
Algorithms for Recommendation
• User-Based Collaborative Filtering - single machine
• Item-Based Collaborative Filtering - single machine / MapReduce
• Matrix Factorization with Alternating Least Squares - single machine / MapReduce
• Matrix Factorization with Alternating Least Squares on Implicit Feedback - single machine / MapReduce
• Weighted Matrix Factorization, SVD++, Parallel SGD - single machine
User-Based Recommender
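The idea behind a user-based recommender can be sketched in three steps: score other users by similarity to the target user, keep the nearest neighbours, and rank unseen items by similarity-weighted ratings. The sketch below is plain Python, not Mahout's Java API; the data layout (a dict of user to item ratings), the function names and the sample ratings are assumptions made for illustration.

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated."""
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return 0.0
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def recommend(ratings, user, neighbours=2, how_many=3):
    """Return up to how_many (item, estimated rating) pairs for user."""
    # 1. Score every other user by similarity to the target user.
    sims = [(pearson(ratings[user], ratings[other]), other)
            for other in ratings if other != user]
    # 2. Keep the most similar users that correlate positively.
    nearest = sorted((s for s in sims if s[0] > 0), reverse=True)[:neighbours]
    # 3. Similarity-weighted average rating for each item the user has not rated.
    totals, weights = {}, {}
    for sim, other in nearest:
        for item, rating in ratings[other].items():
            if item in ratings[user]:
                continue
            totals[item] = totals.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    scored = sorted(((totals[i] / weights[i], i) for i in totals), reverse=True)
    return [(item, score) for score, item in scored[:how_many]]

# Hypothetical sample data: bob and dave rate like alice, carol does not.
sample_ratings = {
    "alice": {"A": 5, "B": 3, "C": 4},
    "bob":   {"A": 5, "B": 3, "C": 4, "D": 5},
    "carol": {"A": 1, "B": 5, "C": 2, "D": 1},
    "dave":  {"A": 4, "B": 2, "C": 3, "E": 2},
}
print(recommend(sample_ratings, "alice"))
```

Here carol is ignored because her ratings correlate negatively with alice's, so the recommendations come only from bob and dave.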
Clustering
Algorithms for Clustering
• K-Means Clustering
• Fuzzy K-Means
• Mean Shift Clustering
• Dirichlet Process Clustering (for topic modelling)
• Instead of calling the clustering algorithms from code, we can run them as commands on Hadoop infrastructure.
• Canopy Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
• k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
• Fuzzy k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
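To show what a k-means job computes, here is a minimal single-machine sketch of the standard iteration (assign each point to its nearest centroid, then move each centroid to the mean of its points). It is plain Python rather than Mahout's MapReduce implementation, and seeding the centroids with the first k points is an assumption made to keep the example deterministic.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iterations=100):
    """Return (centroids, assignments) after iterating until assignments settle."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: seed with the first k points (real jobs often seed differently).
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for every point.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # converged: no point changed cluster
        assignments = new_assignments
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, assignments

points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (2.0, 1.0), (8.0, 9.0), (9.0, 8.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 0, 1, 1]
```

In the distributed setting the assignment step maps naturally onto mappers and the centroid update onto reducers, which is why the algorithm fits the MapReduce model.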
Classification
Algorithms implemented in Mahout for classification:
• Logistic Regression - trained via SGD - single machine
• Naive Bayes / Complementary Naive Bayes - MapReduce
• Random Forest - MapReduce
• Hidden Markov Models - single machine
• Multilayer Perceptron - single machine
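As an illustration of the first entry, logistic regression trained via stochastic gradient descent (SGD) updates the weights after every single example instead of after a full pass over the data. The sketch below is plain Python and not Mahout's implementation; the learning rate, epoch count and toy data are assumptions.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def train_logistic_sgd(samples, labels, learning_rate=0.1, epochs=200):
    """Fit weights and bias with one gradient step per training example."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            prediction = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
            error = prediction - y  # gradient of the log loss w.r.t. the logit
            weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias

def predict(weights, bias, x):
    """Class 1 if the estimated probability is at least 0.5, else class 0."""
    return 1 if sigmoid(bias + sum(w * xi for w, xi in zip(weights, x))) >= 0.5 else 0

# Toy, linearly separable data (hypothetical).
samples = [[-2.0, -1.0], [-1.0, -2.0], [1.0, 2.0], [2.0, 1.0]]
labels = [0, 0, 1, 1]
weights, bias = train_logistic_sgd(samples, labels)
print([predict(weights, bias, x) for x in samples])
```

Because each update touches only one example, the model can be trained on a stream of data on a single machine without holding the whole data set in memory.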
Running Naive Bayes from the Command Line
• Three commands
1) mahout seq2sparse
performs the TF-IDF transformation (text to sparse vectors)
2) mahout trainnb
trains the Naive Bayes model
3) mahout testnb
classifies the test data and evaluates the model
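The three steps (vectorize with TF-IDF, train, test) can be sketched on a single machine as follows. This is plain Python showing the general technique, not what the Mahout commands execute internally; the IDF formula, the smoothing parameter `alpha` and the toy documents are assumptions.

```python
import math
from collections import Counter, defaultdict

def fit_idf(tokenized_docs):
    """Inverse document frequency per term (log(N/df) + 1, an assumed variant)."""
    n_docs = len(tokenized_docs)
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    return {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}

def vectorize(tokens, idf):
    """Sparse TF-IDF vector as a dict; terms unseen during fitting are dropped."""
    return {t: count * idf[t] for t, count in Counter(tokens).items() if t in idf}

def train_naive_bayes(vectors, labels, alpha=1.0):
    """Multinomial Naive Bayes over TF-IDF weights with additive smoothing."""
    vocabulary = {t for vec in vectors for t in vec}
    weight_sums = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            weight_sums[label][term] += weight
    model = {}
    for label, n_label in Counter(labels).items():
        total = sum(weight_sums[label].values()) + alpha * len(vocabulary)
        log_likelihood = {t: math.log((weight_sums[label][t] + alpha) / total)
                          for t in vocabulary}
        model[label] = (math.log(n_label / len(labels)), log_likelihood)
    return model

def classify(model, vector):
    """Label with the highest log prior plus weighted log likelihood."""
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(w * log_likelihood[t]
                               for t, w in vector.items() if t in log_likelihood)
    return max(model, key=score)

def accuracy(model, vectors, labels):
    """Fraction of test vectors classified correctly."""
    hits = sum(classify(model, v) == y for v, y in zip(vectors, labels))
    return hits / len(labels)

# Step 1: vectorize. Step 2: train. Step 3: test. All data is hypothetical.
train_docs = [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"],
              ["meeting", "agenda", "attached"], ["project", "meeting", "tomorrow"]]
train_labels = ["spam", "spam", "ham", "ham"]
test_docs = [["cheap", "watches", "now"], ["agenda", "for", "project", "meeting"]]
test_labels = ["spam", "ham"]

idf = fit_idf(train_docs)
model = train_naive_bayes([vectorize(d, idf) for d in train_docs], train_labels)
print(accuracy(model, [vectorize(d, idf) for d in test_docs], test_labels))
```

Note that the IDF values are fitted on the training documents only and then reused for the test documents, so the test set does not leak into the model.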
Installation of Mahout
• Download the tar files of both the apache-mahout and apache-maven projects
• Extract the tar files into a directory
• Add Maven to the PATH environment variable
• Change the working directory to Mahout's core folder
• Compile the project with 'mvn compile'
• Build and install the project with 'mvn install'
Mahout vs. WEKA
Criterion      Mahout    WEKA
Scalability    More      Less
Algorithms     Fewer     More
GUI            No        Yes
License        Apache    GPL
MAHOUT COMMERCIAL USERS
• Adobe: Uses clustering algorithms to increase video consumption by better user targeting.
• Amazon: For its personalization platform.
• AOL: For shopping recommendations.
• Twitter: Uses Mahout’s LDA implementation for user interest modeling.
• Yahoo! Mail: Uses Mahout’s Frequent Pattern Set Mining.
• Drupal: Uses Mahout to provide open source content recommendation solutions.
• Evolv: Uses Mahout for its Workforce Predictive Analytics platform.
• Foursquare: Uses Mahout for its recommendation engine.
• Idealo: Uses Mahout’s recommendation engine.
THANK YOU