Apache Mahout
Guided By
Ms. Shikha Pachouly
Assistant Professor
Computer Engineering
Department
4/10/2016
Machine Learning
Machine learning is programming computers to
optimize a performance criterion using example data
or past experience.
Machine Learning Strategies
1) Supervised
2) Unsupervised
Common Use Cases
Recommend friends/dates/products
Classify content into predefined groups
Find similar content based on object properties
Find associations/patterns in action/behaviors
Identify key topics in large collection of text
Detect anomalies in output
Rank search results
Apache Mahout Introduction
A machine learning library for scalable applications.
Includes core algorithms for recommendation,
clustering and classification, implemented on
top of the Hadoop MapReduce model.
The core libraries are also highly optimized, so
non-distributed algorithms perform well too.
• Mahout is distributed under the commercially friendly
Apache Software License.
• The goal of Mahout is to build a vibrant, responsive,
diverse community to facilitate discussions not only on
the project itself but also on potential use cases.
• Currently Mahout supports mainly three use cases:
1) Recommendation mining
2) Clustering
3) Classification
Why Mahout
Many open source ML libraries (PyBrain, Shark, etc.)
either
1) lack community,
2) lack scalability, or
3) lack documentation and examples
Most Mahout implementations are MapReduce-enabled
The main goal of Apache Mahout is to be useful to
practitioners.
- Implementations should be easy to
use from within Java applications.
- Deploying the trained models should be
close to trivial.
- Scaling to include more and more diverse data
should be simple.
Recommendations
Extensive Framework for collaborative filtering
Recommenders
1) user based
2) item based
Many different similarity measures,
e.g. cosine, log-likelihood ratio (LLR), Tanimoto, Pearson
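As a rough illustration of three of the measures named above, here is a minimal Python sketch. It is not Mahout's Java API: the function names and the plain-list and set inputs are assumptions chosen for brevity, and LLR is left out to keep the example short.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length rating vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Define the similarity as 0.0 when either vector is all zeros.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pearson(a, b):
    """Pearson correlation: cosine similarity of the mean-centered vectors."""
    mean_a = sum(a) / len(a)
    mean_b = sum(b) / len(b)
    return cosine([x - mean_a for x in a], [y - mean_b for y in b])


def tanimoto(items_a, items_b):
    """Tanimoto (Jaccard) coefficient: shared items over all items seen."""
    a, b = set(items_a), set(items_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0
```

Cosine and Pearson use the rating values, while Tanimoto only looks at which items two users have in common, so it also works when there are no ratings at all. For example, `tanimoto({"a", "b", "c"}, {"b", "c", "d"})` gives 0.5 (two shared items out of four).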
Algorithms for Recommendation
User-Based Collaborative Filtering - single machine
Item-Based Collaborative Filtering - single machine / MapReduce
Matrix Factorization with Alternating Least Squares - single machine / MapReduce
Matrix Factorization with Alternating Least Squares on Implicit Feedback - single machine / MapReduce
Weighted Matrix Factorization, SVD++, Parallel SGD - single machine
User-Based Recommender
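The idea behind a user-based recommender can be sketched in a few lines of Python: find the users most similar to the target user, then predict ratings for unseen items as a similarity-weighted average of the neighbours' ratings. This is only an illustration under stated assumptions, not Mahout's implementation. The `ratings` data, the function names and the parameters `k` and `how_many` are hypothetical.

```python
import math

# Hypothetical toy data: user -> {item: rating}.
ratings = {
    "alice": {"item1": 5.0, "item2": 3.0, "item3": 4.0},
    "bob": {"item1": 4.0, "item2": 3.0, "item3": 5.0, "item4": 5.0},
    "carol": {"item1": 1.0, "item2": 5.0, "item4": 2.0},
}


def cosine_similarity(u, v):
    """Cosine similarity computed over the items both users have rated."""
    common = set(u) & set(v)
    if not common:
        return 0.0
    dot = sum(u[i] * v[i] for i in common)
    norm_u = math.sqrt(sum(u[i] ** 2 for i in common))
    norm_v = math.sqrt(sum(v[i] ** 2 for i in common))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, k=2, how_many=3):
    """Recommend unseen items using the k most similar users."""
    target = ratings[user]
    # Neighbourhood: the k users most similar to the target user.
    neighbours = sorted(
        ((cosine_similarity(target, other), name)
         for name, other in ratings.items() if name != user),
        reverse=True,
    )[:k]
    scores, weights = {}, {}
    for similarity, name in neighbours:
        if similarity <= 0:
            continue
        for item, rating in ratings[name].items():
            if item in target:
                continue  # only recommend items the user has not rated
            scores[item] = scores.get(item, 0.0) + similarity * rating
            weights[item] = weights.get(item, 0.0) + similarity
    # Predicted rating: similarity-weighted average of neighbour ratings.
    predictions = {item: scores[item] / weights[item] for item in scores}
    return sorted(predictions.items(), key=lambda p: p[1], reverse=True)[:how_many]
```

Calling `recommend("alice", ratings)` returns a list of `(item, predicted_rating)` pairs. With the toy data above, the only item alice has not rated is `item4`.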
Clustering
Algorithms for Clustering
K-Means Clustering
Fuzzy K-Means
Mean Shift Clustering
Dirichlet Process Clustering (For Topic Modelling)
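To make the first algorithm in this list concrete, the following is a minimal single-process sketch of k-means in Python. It is not Mahout's MapReduce implementation. The function name, the random initialisation from the input points and the parameters `iterations` and `seed` are assumptions made for the example.

```python
import random


def kmeans(points, k, iterations=20, seed=0):
    """Plain k-means on a list of equal-length numeric tuples."""
    rng = random.Random(seed)
    # Assumed initialisation: pick k distinct input points as centroids.
    centroids = rng.sample(points, k)
    assignments = []
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        assignments = [
            min(range(k),
                key=lambda c: sum((p - q) ** 2
                                  for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(
                    tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: no centroid moved
        centroids = new_centroids
    return centroids, assignments
```

The function returns the final centroids and, for each input point, the index of the cluster it was assigned to. Fuzzy k-means differs in that each point gets a degree of membership in every cluster instead of a single index.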
The clustering algorithms can also be run from the command
line on Hadoop infrastructure,
e.g. the Canopy Clustering command is
bin/mahout org.apache.mahout.clustering.syntheticcontrol.canopy.Job
k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.kmeans.Job
Fuzzy k-Means Clustering
bin/mahout org.apache.mahout.clustering.syntheticcontrol.fuzzykmeans.Job
Classification
Algorithms implemented in Mahout for classification
Logistic Regression - trained via SGD - single machine
Naive Bayes / Complementary Naive Bayes - MapReduce
Random Forest - MapReduce
Hidden Markov Models - single machine
Multilayer Perceptron - single machine
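As an illustration of the first entry, logistic regression trained via stochastic gradient descent (SGD) can be sketched in plain Python as below. This is a sketch under stated assumptions rather than Mahout's classifier. The function names and the hyperparameters `epochs`, `learning_rate` and `seed` are hypothetical.

```python
import math
import random


def sigmoid(z):
    """Logistic function mapping a score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_sgd(samples, labels, epochs=200, learning_rate=0.1, seed=0):
    """Fit weights and bias with one gradient step per training example."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0])
    bias = 0.0
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new random order each epoch
        for i in order:
            x, y = samples[i], labels[i]
            prediction = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            error = prediction - y  # gradient of the log loss w.r.t. the score
            weights = [w - learning_rate * error * v for w, v in zip(weights, x)]
            bias -= learning_rate * error
    return weights, bias


def predict(weights, bias, x):
    """Return class 1 if the predicted probability is at least 0.5."""
    score = bias + sum(w * v for w, v in zip(weights, x))
    return 1 if sigmoid(score) >= 0.5 else 0
```

Because each update looks at a single example, the model can be trained in one pass over a stream of data, which is why this style of training suits a single machine.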
Running Naïve Bayes from the
Command Line
Three commands:
1) mahout seq2sparse
converts the input text into sparse TF-IDF vectors
2) mahout trainnb
trains the Naïve Bayes model
3) mahout testnb
performs classification and tests the model.
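The three steps above (vectorize, train, test) can be mirrored in a small, self-contained Python sketch. It does not call Mahout. The function names, the smoothing parameter `alpha` and the particular IDF formula are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def fit_idf(documents):
    """Inverse document frequency for every term in the training documents."""
    doc_freq = Counter(term for doc in documents for term in set(doc))
    return {t: math.log(len(documents) / df) + 1.0 for t, df in doc_freq.items()}


def vectorize(doc, idf):
    """Step 1: turn a token list into a sparse {term: tf * idf} vector."""
    return {t: c * idf[t] for t, c in Counter(doc).items() if t in idf}


def train_nb(vectors, labels, alpha=1.0):
    """Step 2: per-class term weights with additive smoothing, in log space."""
    vocabulary = {t for v in vectors for t in v}
    class_counts = Counter(labels)
    term_weights = defaultdict(Counter)
    for vector, label in zip(vectors, labels):
        term_weights[label].update(vector)
    model = {}
    for label, count in class_counts.items():
        total = sum(term_weights[label].values()) + alpha * len(vocabulary)
        model[label] = {
            "prior": math.log(count / len(labels)),
            "likelihood": {t: math.log((term_weights[label][t] + alpha) / total)
                           for t in vocabulary},
        }
    return model


def classify(model, vector):
    """Step 3: choose the class with the highest log-posterior score."""
    def score(label):
        likelihood = model[label]["likelihood"]
        return model[label]["prior"] + sum(
            w * likelihood[t] for t, w in vector.items() if t in likelihood)
    return max(model, key=score)
```

A usage example with hypothetical toy documents:

```python
train_docs = [["cheap", "pills", "buy", "now"], ["buy", "cheap", "watches"],
              ["meeting", "agenda", "monday"], ["project", "meeting", "notes"]]
train_labels = ["spam", "spam", "ham", "ham"]

idf = fit_idf(train_docs)
model = train_nb([vectorize(d, idf) for d in train_docs], train_labels)
print(classify(model, vectorize(["buy", "cheap", "now"], idf)))
```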
Installation of Mahout
Download the tar files of both the apache-mahout and
apache-maven projects
Extract the tar files into a directory
Set the PATH variable for Maven
Change the working directory to Mahout's core
folder
Compile the project with 'mvn compile'
Build and install the project with 'mvn install'
Mahout vs. WEKA
Basis          Mahout    WEKA
Scalability    More      Less
Algorithms     Fewer     More
GUI            No        Yes
License        Apache    GPL
MAHOUT COMMERCIAL USERS
Adobe: Uses clustering algorithms to increase video
consumption by better user targeting.
Amazon: For Personalization platform.
AOL: For shopping recommendations.
Twitter: Uses Mahout’s LDA implementation for user interest
modeling.
Yahoo! Mail: Uses Mahout’s Frequent Pattern Set Mining.
Drupal: Uses Mahout to provide open source content
recommendation solutions.
Evolv: Uses Mahout for its Workforce Predictive Analytics
platform.
Foursquare: Uses Mahout for its recommendation engine.
Idealo: Uses Mahout’s recommendation engine.
References
Bingwei Liu, Erik Blasch, Yu Chen, Dan Shen and Genshe Chen, “Scalable
Sentiment Classification for Big Data Analysis Using Naïve Bayes Classifier”,
2013 IEEE International Conference on Big Data.
Rui Máximo Esteves, Chunming Rong, “Using Mahout for clustering
Wikipedia’s latest Articles”, 2011 Third IEEE International Conference on Cloud
Computing Technology and Science.
Kathleen Ericson and Shrideep Pallickara, “On the Performance of Distributed
Data Clustering Algorithms in File and Streaming Processing Systems”, 2011
Fourth IEEE International Conference on Utility and Cloud Computing.
https://mahout.apache.org/
Sean Owen, Robin Anil, “Mahout in Action”, Manning Publications.
THANK YOU