NEQCA Provider Database Interview Findings
Predictive Analytics Solutions: Leveraging HDInsight and RapidMiner
November 2013
Agenda
Background
Database Marketing
VLDB Marketing systems for the Cable Industry
Dashboards and Modeling led to Recommender Engines (nnus)
Web 2.0 Analytics
Migrated legacy processing from RDBMS to Hadoop (Hive)
Social Graph Processing in Hadoop (MR/Hive)
Healthcare
Document management system in Hadoop (HBase)
High volume, low-latency processing
Synopsis
Predictive Analytics is all the rage lately. The use of
predictive techniques has expanded beyond simple
recommender engines (Netflix and Amazon, for example)
and is now becoming a key strategic tool for competitive
advantage. Organizations looking for competitive
advantage are yearning to better understand their
customers and the industry they compete in. Part of the
challenge is understanding internal operational
data; the other part is identifying and analyzing
external data that provides the additional insight needed to truly
gain competitive advantage. In this talk, we will
present a pattern for leveraging Hadoop (through
HDInsight) to obtain, integrate, conform and aggregate
data (internal and external) for consumption/presentation
by RapidMiner (a leading predictive analytics tool).
Approach: CRISP-DM
Business-focused
Results-Oriented
Recognizes that data preparation is a significant portion
of the overall effort
Studies say 90%
Recognizes the iterative nature of the process
Tenured
Conceived in 1996
Most common practice by 2002
Cross Industry Standard Process for Data Mining (CRISP-DM)
CRISP-DM 1.0
Summary of a customer case
Customer’s Core Business and Goals
Customer provides roadside assistance service for major auto
manufacturers
Current business is reactive
call a tow truck when a car breaks down
Opportunities (to save costs, increase customer satisfaction, and
generate new revenue) depend on predicting when cars will break
down
Project Goals
Prove HDInsight as a processing platform for handling large data
Prepare (obtain, understand, massage) existing customer data
Incorporate external data that could be useful (e.g. weather data)
Produce a preliminary model
Business Understanding:
Initial hypotheses
1. It might be possible to predict breakdowns (to some
degree and under some conditions) so that we can
know when to “pre-position” trucks.
2. Weather should affect breakdowns, but we don’t know how
much, under what conditions, or what type of breakdowns
might be predictable.
3. Commercially available weather data for all of the US
for the last 7 years is prohibitively expensive, but free
data might be “good enough” to validate a weather-based model.
Data Preparation
Client (DB extracts/email): Roadside Events, Provider Rosters, Roadside Lookups, Contract Performance
NOAA (ftp): Station List, ISD/ISH Files
Zip Code Masterfile
Azure Windows VM
Split Events into (64) .gz files
Local Workstation
find_nearest_n_stations.py
Produced Lookup CSV files
get_noaa_ish_files.py
normalize_noaa_data.py
Azure Blob Storage
Roadside Events, Roadside Lookups, Zip Code Masterfile, Station List, ISD/ISH Files
roadside_weather export.csv
Azure HDInsight/Hadoop
create_hive_tables.py
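The deck names a script, find_nearest_n_stations.py, but does not show its contents. The following is only a minimal sketch of what such a lookup step could look like, assuming each zip code has a centroid latitude/longitude and each weather station has coordinates. The function names, the tuple layout, and the station IDs and coordinates are all hypothetical and are not taken from the original project.

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points given in degrees."""
    r = 6371.0  # mean Earth radius in km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def nearest_n_stations(zip_lat, zip_lon, stations, n=3):
    """Return the n stations closest to a zip code centroid, nearest first.

    stations: iterable of (station_id, lat, lon) tuples (assumed layout).
    """
    ranked = sorted(stations, key=lambda s: haversine_km(zip_lat, zip_lon, s[1], s[2]))
    return ranked[:n]

# Made-up station IDs and approximate coordinates, for illustration only.
stations = [
    ("STN_MINNEAPOLIS", 44.88, -93.22),
    ("STN_DULUTH", 46.84, -92.19),
    ("STN_FARGO", 46.92, -96.81),
    ("STN_ANCHORAGE", 61.17, -150.00),
]

# Hypothetical zip code centroid near Minneapolis.
nearest = nearest_n_stations(44.98, -93.27, stations, n=2)
nearest_ids = [s[0] for s in nearest]
print(nearest_ids)
```

Keeping more than one station per zip code is a reasonable design choice here because a single station may have gaps in its hourly record; a fallback station can fill them.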
Modeling:
Prototyping a preliminary model
1. Industry-standard data mining methodology
1. Sample, Explore, Modify, Model, Assess (SEMMA)
2. Randomly sample prototype-scale dataset
3. Perform exploratory data analysis
4. Construct several models
5. Visualize model results
Randomly sample dataset
2007-01 to 2013-07
Select 10 coldest states:
AK, ME, MN (majority), MT, ND, NH, SD, VT, WI, WY
Random sample of 20k events
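The sampling criteria above can be sketched in a few lines of Python. This is an illustration under assumptions, not the code used in the project: the field names state and event_date are hypothetical, the date range is assumed to run inclusively through the end of July 2013, and the demo records are synthetic.

```python
import random
from datetime import date

# The ten states listed on the slide.
COLDEST_STATES = {"AK", "ME", "MN", "MT", "ND", "NH", "SD", "VT", "WI", "WY"}
# Assumption: "2007-01 to 2013-07" is read as inclusive of all of July 2013.
START, END = date(2007, 1, 1), date(2013, 7, 31)

def sample_events(events, k=20000, seed=42):
    """Filter events to the selected states and date range, then draw k at random.

    events: list of dicts with 'state' and 'event_date' keys (assumed layout).
    Returns every eligible event if fewer than k qualify.
    """
    eligible = [
        e for e in events
        if e["state"] in COLDEST_STATES and START <= e["event_date"] <= END
    ]
    if len(eligible) <= k:
        return eligible
    # A seeded generator keeps the prototype sample reproducible.
    return random.Random(seed).sample(eligible, k)

# Synthetic records, for illustration only.
events = [
    {"state": "MN", "event_date": date(2010, 1, 15)},
    {"state": "TX", "event_date": date(2010, 1, 15)},   # excluded: not a selected state
    {"state": "AK", "event_date": date(2006, 12, 31)},  # excluded: before the range
    {"state": "WI", "event_date": date(2013, 7, 31)},
    {"state": "VT", "event_date": date(2012, 2, 1)},
]
sample = sample_events(events, k=2, seed=0)
print(len(sample))
```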
Perform exploratory analysis
Rough analysis can reveal preparation errors
Total events / hour
Winch calls relate to temperature
Winch - temperature effect
Proportion of Winch events
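The chart for this slide did not survive in the transcript. As a rough sketch of how a proportion-of-Winch summary might be computed, the snippet below bins events by temperature and divides Winch events by all events in each bin. The field names temp_f and service_type, the Fahrenheit unit, the 10-degree bin width, and the demo records are all assumptions made for illustration; none of the numbers come from the project.

```python
from collections import defaultdict

def winch_proportion_by_temp(events, bin_width=10):
    """Return {bin_lower_edge: share of events in that bin that are Winch calls}.

    events: iterable of dicts with 'temp_f' and 'service_type' keys (assumed layout).
    """
    totals = defaultdict(int)
    winch = defaultdict(int)
    for e in events:
        # Floor division also handles sub-zero temperatures: -5 falls in the -10 bin.
        lower = (e["temp_f"] // bin_width) * bin_width
        totals[lower] += 1
        if e["service_type"] == "Winch":
            winch[lower] += 1
    return {lower: winch[lower] / totals[lower] for lower in sorted(totals)}

# Synthetic records, for illustration only.
events = [
    {"temp_f": -5, "service_type": "Winch"},
    {"temp_f": -2, "service_type": "Tow"},
    {"temp_f": 25, "service_type": "Winch"},
    {"temp_f": 28, "service_type": "Tow"},
    {"temp_f": 21, "service_type": "Tow"},
    {"temp_f": 22, "service_type": "Tow"},
]
proportions = winch_proportion_by_temp(events)
print(proportions)
```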
Construct RapidMiner models
Linear regression on Winch
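The deck built this model in RapidMiner and does not list the inputs it used. As a plain-Python stand-in (not RapidMiner's operator), the sketch below fits a one-variable least-squares line, assuming temperature is the predictor and the share of Winch events is the response. The data points are made up purely to show the mechanics and say nothing about the project's actual results.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept with one predictor."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Synthetic, perfectly linear data for illustration only (not project results).
temps_f = [0, 10, 20, 30]
winch_share = [0.40, 0.30, 0.20, 0.10]

slope, intercept = fit_line(temps_f, winch_share)
print(round(slope, 4), round(intercept, 4))
```

A negative slope in such a fit would mean the Winch share falls as temperature rises; whether the real data shows that, and how strongly, is a question for the actual model output.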
Decision tree on winch ratio
Naïve Bayes probabilities
Naïve Bayes results
Naïve Bayes temperature
Naïve Bayes latitude
© 2013 Slalom, LLC. All rights reserved. The information herein is for informational purposes only and represents the current view of Slalom, LLC. as of the date of this presentation.
SLALOM MAKES NO WARRANTIES, EXPRESS, IMPLIED, OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.