Online Machine Learning with
Distributed In-memory Clusters
Arshak Navruzyan, VP Product
Argyle Data
Acknowledgements
Nina Zumel, PhD – Win-Vector LLC
Contents
• Business use cases
- DevOps / Continuous Deployment
- Quality of Service (QoS)
- Quality of Experience (QoE)
• Machine learning methods for anomaly detection
- Nearest Neighbor (NN)
- Isolation Forest (iForest)
- Random Forests (RF)
Ecommerce operator use case
• Large distributed search farm logs usage data into an in-memory grid
- Mixed measurements
- Continuous: query latency, resource utilization, etc.
- Categorical: client IP, server IP, code version, etc.
- ~1-3 TB of search logs every 12 hours
• Find the anomalies
- The data isn't easy to characterize because of its size
- Anomalies span multiple variables (combinations of server, code version, latency)
- No labeled data is available
- At this scale, even a modest false-positive rate produces a flood of alerts
- Very few ML methods operate at this scale
In-memory grid provides live view of the data
• Distributed SQL store
- Read / write optimized
- Automatic placement (sharding)
- Fast ingest (~1 million inserts per second)
- Fast aggregations (~1 billion rows per second)
- Holds ~30 days of data online
- Each insert also sets a time-to-live (TTL) – see the sketch below
• Simple monitoring tool
- D3.js based horizon graphs
- Socket.io / Node.js
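To make the ingest pattern concrete, here is a hypothetical sketch of a TTL-bearing insert and a cross-shard aggregation. The grid client object, table schema, column names, and "WITH TTL" syntax are assumptions made for illustration, not the actual product API.

# Hypothetical sketch of the ingest/aggregation pattern described above.
# "grid" stands in for a distributed in-memory SQL store client; the table,
# TTL syntax, and column names are assumptions, not a real API.
THIRTY_DAYS_SECONDS = 30 * 24 * 3600

def ingest(grid, record):
    # Each insert carries a time-to-live so only ~30 days of data stays online.
    grid.execute(
        "INSERT INTO search_logs (ts, client_ip, server_ip, code_version, latency_ms) "
        "VALUES (?, ?, ?, ?, ?) WITH TTL ?",
        (record["ts"], record["client_ip"], record["server_ip"],
         record["code_version"], record["latency_ms"], THIRTY_DAYS_SECONDS),
    )

def latency_by_server(grid, since_ts):
    # Fast aggregation across shards: per-server average latency over a recent window.
    return grid.execute(
        "SELECT server_ip, AVG(latency_ms) FROM search_logs "
        "WHERE ts > ? GROUP BY server_ip",
        (since_ts,),
    ).fetchall()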
Isn’t there an open source project that does this?
• Etsy's continuous deployment problem
- 1.5B page views, 1M users, $117M in goods sold
- 250+ committers, everyone deploys!
- 30+ deploys to production a day
- ~8 commits per deploy
- How do you NOT break production?
How do you monitor an environment like Etsy's?
• Usual suspects: IT monitoring tools like Ganglia, Nagios
- But … "Not all things that break throw errors"
• Etsy's Kale
- StatsD – metrics collection, e.g. StatsD::increment("foo.bar") (see the sketch below)
- Skyline – Real-time anomaly detection system
- Oculus – Anomaly correlation system using dynamic time warping
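For the metrics side, here is a minimal sketch using the Python statsd client rather than the PHP call shown above; the daemon address and the metric names are illustrative assumptions.

# Minimal sketch of emitting metrics to a StatsD daemon with the Python "statsd" package.
# Assumes a StatsD daemon on localhost:8125; metric names are illustrative.
from statsd import StatsClient

statsd = StatsClient(host="localhost", port=8125, prefix="search")

def record_query(latency_ms, server):
    statsd.incr("queries")                        # counter, like StatsD::increment("foo.bar")
    statsd.timing("query.latency", latency_ms)    # timer: Skyline consumes series built from these
    statsd.incr("queries.by_server." + server)    # per-server counter for later correlation

record_query(42.5, "web01")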
Skyline & Oculus
Did they solve the problem?
• Skyline's Anomaly Detection
- Basic principle: "A metric is anomalous if its latest datapoint is over three standard deviations above the moving average" (see the sketch below)
- Ensemble of methods: tail average, median absolute deviation, Grubbs' test, standard deviation from the moving average, least squares (3 sigma), Kolmogorov-Smirnov test, etc.
- Results improve with the ensemble technique but are still very noisy:
- Non-normal distributions
- Spike influence
- Periodicity / seasonality
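A minimal sketch of that basic principle in plain Python (not Skyline's code; the window length and the example series are illustrative):

# Minimal sketch of the 3-sigma / moving-average rule quoted above.
# Not Skyline's implementation; window size and series are illustrative.
from statistics import mean, stdev

def is_anomalous(series, window=30, n_sigmas=3.0):
    """Flag the latest point if it is > n_sigmas std devs above the trailing moving average."""
    if len(series) < window + 1:
        return False                      # not enough history yet
    tail = series[-(window + 1):-1]       # trailing window, excluding the latest point
    mu, sigma = mean(tail), stdev(tail)
    return series[-1] > mu + n_sigmas * sigma

latencies = [20, 22, 19, 21, 23, 20, 22, 21, 20, 24] * 3 + [95]   # sudden spike at the end
print(is_anomalous(latencies))   # True

The slide's caveats apply directly to this rule: a non-normal series inflates sigma, a past spike drags the threshold up, and daily seasonality looks like an anomaly every morning.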
Machine learning based anomaly detection
• "Benchmarking algorithms for detecting anomalies in large datasets" – Uriel Carrasquilla (2010)
Nearest-neighbor approach
• N1 and N2 are regions of normal behaviour
• Points o1 and o2 are anomalies
• Points in region O3 are anomalies
• Advantages
- No need for assumptions about the data distribution
- No need for pre-labelled anomalies
- Supports categorical as well as continuous variables
“Anomaly Detection: A Tutorial” - A. Banerjee, V. Chandola, V. Kumar, J. Srivastava
(SIAM International Conference on Data Mining 2008)
• Drawbacks
- Computationally expensive – quadratic in the number of points (every point has to be compared to every other point; see the sketch below)
- Existing implementations are batch-oriented
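A minimal sketch of the distance-based scoring and its quadratic cost, using NumPy. Scoring each point by its average distance to its k nearest neighbours is one common choice, not necessarily the exact formulation in the cited tutorial.

# Minimal k-NN distance anomaly score: O(n^2) pairwise distances, as noted above.
# The scoring rule (mean distance to the k nearest neighbours) is one common choice.
import numpy as np

def knn_outlier_scores(X, k=5):
    """Score each row of X by its average distance to its k nearest neighbours."""
    diffs = X[:, None, :] - X[None, :, :]          # (n, n, d) pairwise differences
    dists = np.sqrt((diffs ** 2).sum(axis=-1))     # (n, n) Euclidean distances
    np.fill_diagonal(dists, np.inf)                # ignore self-distance
    knn = np.sort(dists, axis=1)[:, :k]            # k smallest distances per point
    return knn.mean(axis=1)                        # large score => likely anomaly

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),         # normal region N1
               rng.normal(8, 1, (200, 2)),         # normal region N2
               [[4.0, 15.0]]])                      # isolated point, like o1/o2
scores = knn_outlier_scores(X)
print(scores.argmax())                              # index 400: the isolated point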
Avoiding the quadratic computation time of NN
• Bay & Schwabacher, SIGKDD 2003
- "Outliers in Near Linear Time with Randomization and a Simple Pruning Rule"
• Only the anomalies end up being compared to every point in the data set (see the pruning sketch below)
- If anomalies are rare, most points only get compared to a small constant number of points
• Challenges that remain
- A batch learner won't work at our scale
- How would this work in a sharded environment?
- Linear time is still long for "Big Data"
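A minimal sketch of the randomization-plus-pruning idea (my reading of Bay & Schwabacher, not their ORCA code; k, the number of reported outliers, and the simple full-scan inner loop are illustrative):

# Sketch of ORCA-style pruning: a point is dropped as soon as its running k-NN
# distance falls below the weakest score among the current top-n outliers.
import heapq
import numpy as np

def top_outliers(X, k=5, n_outliers=10):
    """Return the n_outliers points with the largest k-NN distance, using pruning."""
    order = np.random.permutation(len(X))   # randomised order so close neighbours are found early
    top = []                                # min-heap of (score, index): current top-n outliers
    cutoff = 0.0                            # weakest score among the current top-n
    for i in order:
        neighbours = []                     # negated distances: tracks the k smallest seen so far
        pruned = False
        for j in order:
            if i == j:
                continue
            d = float(np.linalg.norm(X[i] - X[j]))
            if len(neighbours) < k:
                heapq.heappush(neighbours, -d)
            elif d < -neighbours[0]:
                heapq.heapreplace(neighbours, -d)
            # Pruning rule: once the running k-NN distance drops below the cutoff,
            # this point can no longer be a top outlier, so stop comparing it.
            if len(neighbours) == k and -neighbours[0] < cutoff:
                pruned = True
                break
        if pruned:
            continue
        score = -neighbours[0]              # k-th nearest-neighbour distance = outlier score
        if len(top) < n_outliers:
            heapq.heappush(top, (score, int(i)))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, int(i)))
        if len(top) == n_outliers:
            cutoff = top[0][0]              # raise the bar as better outliers are found
    return sorted(top, reverse=True)        # (score, index) pairs, most anomalous first

Most normal points get pruned after a handful of comparisons, which is where the near-linear average-case behaviour comes from.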
Nearest-neighbor as an online learner
• ORCA concepts can be extended to online learning (see the sketch after this slide)
- Randomly pull a small sample of the data that is representative of a time period
- Compare the streaming data (latest observations) to the representative set
- Representative set moves with time (sliding window) so that noon is compared to
typical noon (this addresses periodicity)
• Other enhancements
- Every node needs the full reference set
- Distributed cache moves reference set between nodes
- Locality sensitive hashing
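A minimal sketch of that online variant (my interpretation of the slide, not the actual system): keep a randomly sampled, bounded reference set and score each arriving observation against it by k-NN distance. The window size, sampling rate, k, and threshold are illustrative assumptions.

# Sketch of online NN anomaly detection against a sliding reference sample.
# Window length, sample probability, k, and the threshold are illustrative.
import random
from collections import deque

import numpy as np

class OnlineNNDetector:
    def __init__(self, window_size=10_000, sample_prob=0.01, k=5, threshold=3.0):
        self.reference = deque(maxlen=window_size)   # sliding reference set (old points fall off)
        self.sample_prob = sample_prob               # keep ~1% of the stream as the reference sample
        self.k = k
        self.threshold = threshold

    def score(self, x):
        """Average distance from x to its k nearest reference points."""
        if len(self.reference) < self.k:
            return 0.0                               # not enough reference history yet
        ref = np.array(self.reference)
        d = np.linalg.norm(ref - x, axis=1)
        return float(np.sort(d)[: self.k].mean())

    def observe(self, x):
        """Score the new observation, then (maybe) add it to the reference sample."""
        x = np.asarray(x, dtype=float)
        s = self.score(x)
        if random.random() < self.sample_prob:
            self.reference.append(x)
        return s > self.threshold, s

The deque here slides by count; the slide's version slides by time so that noon is compared to typical noon, and a distributed cache would replicate the same reference set to every node in the grid.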
Random partitioning approach
• "Isolation Forest" – Liu et al. (IEEE, 2008)
• When building a tree, anomalies are likely to be isolated closer to the root of the tree, whereas normal points appear deeper in the tree structure (see the sketch below)
• No need to profile normal data points
• No distance or density measures
• Gaps
- No support for categorical attributes
- Lacks explanatory power
- No obvious way to turn it into an online method
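For reference, a minimal sketch using scikit-learn's IsolationForest – an off-the-shelf batch implementation, not the online variant this talk is after; the parameters and synthetic data are illustrative.

# Minimal Isolation Forest sketch with scikit-learn (batch, numeric features only,
# which is exactly the gap called out above). Parameters are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X_train = rng.normal(loc=[50.0, 0.3], scale=[5.0, 0.05], size=(5000, 2))   # e.g. latency ms, CPU util
X_new = np.array([[52.0, 0.31],      # looks normal
                  [400.0, 0.95]])    # slow query on a saturated server

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest.fit(X_train)

print(forest.predict(X_new))           # [ 1 -1]: -1 marks the anomaly
print(forest.score_samples(X_new))     # lower score = more anomalous (isolated near the root)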
Ensemble of trees as an online learner
• Random Forest (RF) basics
- Build many trees, each on a random selection (with replacement) of observations (in-bag / out-of-bag)
- At each split, randomly select a subset of variables (commonly the square root of the number of attributes)
• "Online Random Forests" (ORF) algorithm of Saffari et al., ICCV-OLCV 2009
- How do you perform bagging online? (Oza, "Online Bagging and Boosting")
- In batch bagging, the number of times an individual tree sees an observation is approximately Poisson (λ = 1), so online bagging draws k ~ Poisson(1) for each arriving observation (sketched below)
- How do you grow random trees on-the-fly?
- Continuously measure the information gain (or Gini impurity) of potential splits
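A minimal sketch of the online-bagging step, following Oza's Poisson(1) trick as described above. The incremental tree itself is stubbed out, since growing trees on the fly is the harder part the ORF paper addresses.

# Online bagging (Oza): each tree sees each arriving observation k times, with k ~ Poisson(1),
# which matches how often a bootstrap sample would repeat that observation in batch bagging.
# The "tree" here is a stub; a real ORF grows splits incrementally from info-gain / Gini statistics.
import numpy as np

class OnlineTreeStub:
    """Placeholder for an incremental tree that can absorb one (x, y) at a time."""
    def update(self, x, y, weight=1):
        pass   # a real implementation would route (x, y) to a leaf and update split statistics

class OnlineBaggingEnsemble:
    def __init__(self, n_trees=25, seed=0):
        self.trees = [OnlineTreeStub() for _ in range(n_trees)]
        self.rng = np.random.default_rng(seed)

    def partial_fit(self, x, y):
        for tree in self.trees:
            k = self.rng.poisson(1.0)    # how many times this tree "sees" the observation
            if k > 0:
                tree.update(x, y, weight=k)

ensemble = OnlineBaggingEnsemble()
ensemble.partial_fit(x=np.array([120.0, 0.7]), y=0)   # stream observations one at a time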
Offline vs. Online Training
To summarize
• Machine learning can produce more accurate anomaly detection than the purely statistical approach
• Nearest Neighbor and Random Forests can be adapted into online learners
• In-memory distributed processing can improve the performance of such algorithms
• Supervised methods for classification and regression will work similarly (we think …)
Thank you!
www.linkedin.com/in/arshak