Succeeding with Predictive Analytics
and unlocking the power of Data
Science with R on Netezza
Abhik Roy
Experian
Session Code: E03
May 23, 2016 (03:45 PM – 04:45 PM) | Platform: Cross Platform
Photo by Steve from Austin, TX, USA
Succeeding with Predictive Analytics and unlocking
the power of Data Science with R on Netezza
• Introduction to Machine Learning
• IBM Netezza and OpenR as a Data Processing Platform
for Predictive Analytics
• Case Study of Linear Regression
• A quick introduction to some
other popular Machine Learning
Algorithms
• The future of Machine Learning
and how it can help build
Cognitive Apps
2
Machine Learning
• Machine learning explores the
study and construction of
algorithms that can learn from
and make predictions on data
• Deep capability of understanding
patterns in data
• Uses principles of mathematics,
computational statistics and
computer processing to develop
predictive data models
3
Types of Machine Learning
Supervised Learning
• The computer is provided with training data that teaches it the relationships between predictor and target variables
• The computer is then presented with a test data set consisting of the predictor variables and asked to predict the values of the target variables
Unsupervised Learning
• No training sets are provided, leaving the computer on its own to find structures and patterns in data
• Examples include clustering data based on similar attributes
Reinforcement Learning
• A computer program interacts with a dynamic environment to achieve a certain goal and has to be intelligent enough to understand how it is progressing toward that goal
• Examples include automatically driving a car
4
Technology platform for Machine learning
• Machine learning involves very complex
computations, and CPU and memory
intensive number crunching
• Development of machine learning
algorithms is a very expensive process,
hence the processing platform must be
able to scale horizontally as the data
processing size grows
• Fault tolerant and redundant systems to
ensure minimal impact during hardware
component failures
• Ideally support open source
computational and statistical processing
languages like R and Python
5
IBM Netezza, with its massive
data processing capability and
built-in redundancy / fault
tolerance of the disks and
Snippet Processing Units (SPUs), forms
an ideal platform on which to build
Predictive Data Analytics
Platforms.
6
IBM Netezza and OpenR Reference Architecture
[Architecture diagram. Labels shown: R Studio clients, RODBC, NZ Host (R, INZA), SPU1, SPU2 ... SPUn (each with R and INZA), IBM Pure Data for Analytics – Netezza Appliance, and the R packages NZR, NZA, NZMATRIX.]
7
Data Processing with R on IBM Netezza
[Diagram: ways of processing data with R on IBM Netezza]
• NZ Data Frames
• R code and functions / ANSI SQL
• NZ Matrix and Analytics packages
• Pack R code in SQL-based clients
8
Linear Regression (Multi Variate)
What is Linear regression?
It is a method of investigating the functional
relationship between variables. It tries to
estimate the value of a dependent variable
from the values of independent variables
using a linear equation.
Regression analysis is typically used when the
dependent and independent variables are
continuous and have some correlation.
9
Example of a Simple Linear Equation
$Y = \alpha X + \beta$
The plot shows a simple linear equation with a single variable X, which we use to find the value of Y.
$\alpha$ is the slope, the change in Y per unit change in X ($\Delta Y / \Delta X$).
$\beta$ is the intercept, the value of Y when X = 0.
10
Multi Variate Linear Regression
We have multiple independent variables $x_1, x_2, \dots, x_n$,
which we use to calculate the value of the variable Y.
It can be expressed in the form
$Y = \alpha_1 x_1 + \alpha_2 x_2 + \alpha_3 x_3 + \dots + \alpha_n x_n + \beta$
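As a minimal local sketch of the multivariate equation, the R code below fits a two-predictor model and evaluates the linear equation by hand. It assumes R's built-in mtcars data set purely for illustration; it is not the Netezza table used in the case study.

```r
# Fit Y = a1*x1 + a2*x2 + b on the built-in mtcars data (illustrative only).
fit <- lm(mpg ~ wt + hp, data = mtcars)
coefs <- coef(fit)   # named vector: (Intercept), wt, hp

# Evaluate the linear equation by hand for every row.
manual <- coefs["wt"] * mtcars$wt + coefs["hp"] * mtcars$hp + coefs["(Intercept)"]

# The hand-computed values agree with what predict() returns.
max_diff <- max(abs(manual - predict(fit, mtcars)))
print(max_diff)
```

The prediction is nothing more than the weighted sum of the predictors plus the intercept.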
11
Model Accuracy of a Linear Regression
Fitting a line in Linear regression
A linear regression algorithm tries to fit the line that gives the smallest residual sum of squares. A residual is the vertical distance between an observed point and the fitted line; the algorithm minimizes the sum of the squares of these distances.
Goodness of fit
R-squared is a measure that tells us how close the data are to the fitted line. It ranges from 0 to 1; the higher the value, the better the fit.
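The residual sum of squares and R-squared can be computed by hand to see what they measure. This is a small sketch on R's built-in mtcars data (an assumption for illustration, not the case-study data).

```r
# Simple linear regression on the built-in mtcars data (illustrative only).
fit <- lm(mpg ~ wt, data = mtcars)

res <- residuals(fit)                           # observed value minus fitted value
rss <- sum(res^2)                               # residual sum of squares
tss <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total sum of squares

r_squared <- 1 - rss / tss                      # share of variance explained
print(c(RSS = rss, R2 = r_squared))
```

The hand-computed value matches `summary(fit)$r.squared`.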
12
Using Linear Regression for Machine Learning
[Diagram: machine learning workflow]
Training data (predictors + target variables) → IBM Netezza and Open R → development of the linear equation
Test data (predictors) → linear equation → predicted target variables
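The same workflow can be sketched locally in plain R. The example assumes the built-in mtcars data set and a 70/30 split, both chosen only for illustration; the case study itself runs inside Netezza.

```r
set.seed(42)                                  # reproducible split
n <- nrow(mtcars)
train_idx <- sample(n, size = round(0.7 * n))
train <- mtcars[train_idx, ]
test  <- mtcars[-train_idx, ]

# Training data: predictors + target variable -> linear equation
fit <- lm(mpg ~ cyl + disp + hp + wt, data = train)

# Test data: predictors only -> predicted target variable
test_predictors <- test[, c("cyl", "disp", "hp", "wt")]
pred <- predict(fit, newdata = test_predictors)
print(head(pred))
```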
13
Example of Linear Regression with Open R on
IBM Netezza
Problem statement:
The input data set contains details of
various car models. Based on this
information, the goal is to build a model
that predicts the miles per gallon (MPG) of a
given car model.
14
Example of Linear Regression with Open R on
Netezza
Techniques used:
• Linear Regression – Multi Variate
• Data Imputation
15
Example of Linear Regression with Open R on
IBM Netezza
> setwd("C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study")
> getwd()
[1] "C:/Users/abhik/Documents/OpenR on Netezza/Netezza R case study"
> library(nzr)
> library(nza)
> nzConnectDSN('DBATEST', force = TRUE, verbose = TRUE)
16
Example of Linear Regression with Open R on
IBM Netezza
DBATEST.ADMIN(ADMIN)=> select * from auto_miles_per_gallon_id where horsepower is NULL;
 ID  | MGP  | CYLINDERS | DISPLACEMENT | HORSEPOWER | WEIGHT | ACCELERATION | MODELYEAR | NAME
-----+------+-----------+--------------+------------+--------+--------------+-----------+--------------------
 127 | 21   | 6         | 200          |            | 2875   | 17           | 74        | ford maverick
 337 | 23.6 | 4         | 140          |            | 2905   | 14.3         | 80        | ford mustang cobra
 33  | 25   | 4         | 98           |            | 2046   | 19           | 71        | ford pinto
17
Example of Linear Regression with Open R on
IBM Netezza
> nz_auto_miles_per_gallon = nz.data.frame("auto_miles_per_gallon")
> reg_df <- as.data.frame(nz_auto_miles_per_gallon)
> nz_auto_miles_per_gallon
SELECT "ID","MGP","CYLINDERS","DISPLACEMENT","HORSEPOWER","WEIGHT","ACCELERATION","MODELYEAR","NAME" FROM AUTO_MILES_PER_GALLON_ID
18
Example of Linear Regression with Open R on
IBM Netezza
> summary(reg_df)
       ID             MGP          CYLINDERS      DISPLACEMENT     HORSEPOWER
 Min.   :  1.0   Min.   : 9.00   Min.   :3.000   Min.   : 68.0   Min.   : 46.0
 1st Qu.:100.2   1st Qu.:17.50   1st Qu.:4.000   1st Qu.:104.2   1st Qu.: 75.0
 Median :199.5   Median :23.00   Median :4.000   Median :148.5   Median : 93.5
 Mean   :199.5   Mean   :23.51   Mean   :5.455   Mean   :193.4   Mean   :104.5
 3rd Qu.:298.8   3rd Qu.:29.00   3rd Qu.:8.000   3rd Qu.:262.0   3rd Qu.:126.0
 Max.   :398.0   Max.   :46.60   Max.   :8.000   Max.   :455.0   Max.   :230.0
19
Example of Linear Regression with Open R on
IBM Netezza
> str(reg_df)
'data.frame': 398 obs. of 9 variables:
 $ ID          : int 1 2 3 4 5 6 7 8 9 10 ...
 $ MGP         : num 18 15 18 16 17 15 14 14 14 15 ...
 $ CYLINDERS   : num 8 8 8 8 8 8 8 8 8 8 ...
 $ DISPLACEMENT: num 307 350 318 304 302 429 454 440 455 390 ...
 $ HORSEPOWER  : num 130 165 150 150 140 198 220 215 225 190 ...
 $ WEIGHT      : num 3504 3693 3436 3433 3449 ...
 $ ACCELERATION: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
 $ MODELYEAR   : num 70 70 70 70 70 70 70 70 70 70 ...
 $ NAME        : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
20
Example of Linear Regression with Open R on
IBM Netezza
> unlist(lapply(reg_df, function(x) any(is.na(x))))
          ID          MGP    CYLINDERS DISPLACEMENT   HORSEPOWER       WEIGHT
       FALSE        FALSE        FALSE        FALSE         TRUE        FALSE
ACCELERATION    MODELYEAR         NAME
       FALSE        FALSE        FALSE
21
Example of Linear Regression with Open R on
IBM Netezza
> t=nzQuery("EXECUTE NZA..IMPUTE_DATA('intable=auto_miles_per_gallon, method=mean, outtable=auto_miles_per_gallon_2, inColumn=horsepower')")
> head(t)
  IMPUTE_DATA
1           1
> unlist(lapply(reg_df_2, function(x) any(is.na(x))))
          ID          MGP    CYLINDERS DISPLACEMENT   HORSEPOWER       WEIGHT ACCELERATION
       FALSE        FALSE        FALSE        FALSE        FALSE        FALSE        FALSE
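Conceptually, mean imputation replaces each missing value with the mean of the non-missing values in the same column. The sketch below shows that idea in plain R on a small made-up data frame; the values are hypothetical, and it does not call the Netezza IMPUTE_DATA procedure.

```r
# Hypothetical toy data with two missing HORSEPOWER values.
df <- data.frame(ID = 1:5, HORSEPOWER = c(130, NA, 150, NA, 140))

# Mean of the observed values only.
col_mean <- mean(df$HORSEPOWER, na.rm = TRUE)

# Replace the missing entries with that mean.
df$HORSEPOWER[is.na(df$HORSEPOWER)] <- col_mean

print(df)
print(anyNA(df))
```

Running the imputation in the database instead, as the slide does, avoids pulling the full table into the R client.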
22
Example of Linear Regression with Open R on
IBM Netezza
library(ggplot2)
ggplot(reg_df_2, aes(factor(CYLINDERS), MGP)) +geom_boxplot( aes(fill=factor(CYLINDERS)))
23
Example of Linear Regression with Open R on
IBM Netezza
> t=nzQuery("SELECT nza..CORR_AGG(cylinders,weight) from auto_miles_per_gallon_2")
> t
   CORR_AGG
1 0.8960168
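For comparison, a correlation coefficient between two columns can be computed locally in R. The sketch assumes that CORR_AGG returns a Pearson coefficient and uses the built-in mtcars data (cylinders and weight) only as a stand-in for the Netezza table.

```r
x <- mtcars$cyl
y <- mtcars$wt

# Built-in Pearson correlation.
r_builtin <- cor(x, y, method = "pearson")

# The same coefficient from its definition:
# covariance of x and y divided by the product of their spreads.
r_manual <- sum((x - mean(x)) * (y - mean(y))) /
  sqrt(sum((x - mean(x))^2) * sum((y - mean(y))^2))

print(c(builtin = r_builtin, manual = r_manual))
```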
24
Example of Linear Regression with Open R on
IBM Netezza
Using the R psych package to find Pearson's correlation coefficients
library(psych)
pairs.panels(reg_df_2)
25
Example of Linear Regression with Open R on
IBM Netezza
Modeling and Prediction:
glmfit <- nzGlm(MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT + ACCELERATION + MODELYEAR,
                nz_auto_miles_per_gallon_id_2, id = "ID",
                family = "gaussian", link = "identity", method = "irls")
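A generalized linear model with a Gaussian family and an identity link is ordinary linear regression. The local sketch below illustrates this with base R's `glm` and `lm` on the built-in mtcars data, which is an assumption for illustration; it does not reproduce the in-database `nzGlm` fit.

```r
# Same formula fitted two ways on the built-in mtcars data (illustrative only).
form <- mpg ~ cyl + disp + hp + wt + qsec

glm_fit <- glm(form, data = mtcars, family = gaussian(link = "identity"))
lm_fit  <- lm(form, data = mtcars)

# Both approaches give the same coefficients up to numerical precision.
max_coef_diff <- max(abs(coef(glm_fit) - coef(lm_fit)))
print(max_coef_diff)
```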
26
Example of Linear Regression with Open R on
IBM Netezza
> print(glmfit)
Model Name
AUTO_MILES_PER_GALLON_ID_2_MODEL92860
Call: nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT + ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2, id = "ID", family = "gaussian", link = "identity", method = "irls")
Coefficients:
   INTERCEPT ACCELERATION    CYLINDERS DISPLACEMENT   HORSEPOWER    MODELYEAR       WEIGHT
 0.005629164 -0.050536985  0.014586362  0.002332460 -0.017306055  0.597021079 -0.006660499
Residuals Summary:
Pearson:
RSS: 4786.2168380755
df: 391
p-value: 1
Deviance:
RSS: 4786.2168380755
df: 391
p-value: 1
27
Example of Linear Regression with Open R on
IBM Netezza
> summary(glmfit)
Call: nzGlm(form = MGP ~ CYLINDERS + DISPLACEMENT + HORSEPOWER + WEIGHT + ACCELERATION + MODELYEAR, data = nz_auto_miles_per_gallon_id_2, id = "ID", family = "gaussian", link = "identity", method = "irls")
GLM coefficients for model: "AUTO_MILES_PER_GALLON_ID_2_MODEL92860"
| Parameter    | Beta      | Std Error | Test       | p-value  |
| INTERCEPT    | 0.005629  | 0.000669  | 8.417371   | 0        |
| ACCELERATION | -0.050537 | 0.090939  | -0.555723  | 0.5784   |
| CYLINDERS    | 0.014586  | 0.000797  | 18.297005  | 0        |
| DISPLACEMENT | 0.002332  | 0.005524  | 0.42227    | 0.672828 |
| HORSEPOWER   | -0.017306 | 0.011938  | -1.449698  | 0.147143 |
| MODELYEAR    | 0.597021  | 0.022447  | 26.597039  | 0        |
| WEIGHT       | -0.00666  | 0.000657  | -10.138912 | 0        |
Residuals Summary:
| Residual Type | RSS             | df  | p-value |
| Pearson       | 4786.2168380755 | 391 | 1       |
| Deviance      | 4786.2168380755 | 391 | 1       |
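In a coefficient table like this one, the test statistic is the estimate divided by its standard error, and a small p-value suggests the predictor contributes to the model. The sketch below builds a comparable table locally with base R's `lm` on the built-in mtcars data. The data set, the renamed columns and the 0.05 cutoff are illustrative assumptions.

```r
fit <- lm(mpg ~ cyl + disp + hp + wt + qsec, data = mtcars)

# Coefficient matrix: estimate, standard error, t value, p-value.
coef_table <- as.data.frame(summary(fit)$coefficients)
names(coef_table) <- c("Beta", "StdError", "Test", "p_value")

# Flag predictors whose p-value falls below the chosen cutoff.
coef_table$significant <- coef_table$p_value < 0.05
print(coef_table)
```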
28
Example of Linear Regression with Open R on
IBM Netezza
Note:
• The metadata of the models generated by calling NZ Analytics packages is stored in the Netezza database
• There are various utilities for extensive model management, such as copying a model to a file or running a prediction from a saved model
• The details of these are beyond the scope of this presentation. However, being able to store the models in the Netezza database provides a centralized repository, which is an added bonus
29
Example of Linear Regression with Open R on
IBM Netezza
Prediction:
Now that we have developed the regression model, we will predict the values of
MPG using the same sample data.
> pred = predict(glmfit, nz_auto_miles_per_gallon_id_2, "ID")
30
Example of Linear Regression with Open R on
IBM Netezza
> head(pred)
  ID     PRED
1 45 31.70386
2 38 34.19114
3 18 16.77949
4 27 30.45490
5 44 27.06798
6 38 34.22652
31
Example of Linear Regression with Open R on
IBM Netezza
> head(pred)
  ID     PRED
1 45 31.70386
2 38 34.19114
3 18 16.77949
4 32 30.45490
5 44 27.06798
6 38 38.22652

45 | 31.7  | 4 | 76  | 52  | 1649 | 16.5 | 74 | toyota corona
38 | 34.19 | 4 | 91  | 67  | 1965 | 15.7 | 82 | honda civic (auto)
18 | 16.7  | 8 | 304 | 150 | 3672 | 11.5 | 73 | amc matador
32 | 30.45 | 4 | 90  | 48  | 2335 | 23.7 | 80 | vw dasher (diesel)
44 | 26    | 4 | 97  | 78  | 2300 | 14.5 | 74 | opel manta
38 | 38    | 4 | 91  | 67  | 1965 | 15   | 82 | honda civic
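One way to judge predictions like these is to join them back to the observed values by ID and compute an error measure such as the root mean squared error (RMSE). The sketch below does this locally on the built-in mtcars data. The data set and the `ID`, `PRED` and `MPG` column names are assumptions for illustration.

```r
fit <- lm(mpg ~ cyl + disp + hp + wt + qsec, data = mtcars)

# Predictions and observed values, each keyed by a row ID.
ids    <- seq_len(nrow(mtcars))
pred   <- data.frame(ID = ids, PRED = unname(predict(fit, mtcars)))
actual <- data.frame(ID = ids, MPG = mtcars$mpg)

# Join on ID and compute the root mean squared error.
cmp  <- merge(pred, actual, by = "ID")
rmse <- sqrt(mean((cmp$PRED - cmp$MPG)^2))

print(head(cmp))
print(rmse)
```

Because the model is scored on the same data it was trained on, the error is optimistic; a held-out test set gives a fairer estimate.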
32
Example of Linear Regression with Open R on
IBM Netezza
Conclusions:
• The NZ Analytics built-in regression algorithm provides model accuracy comparable to that of other popular R packages
• We were able to push data imputation (a part of data engineering), model creation and prediction down to the Netezza database, making this a truly scalable process
33
Some other popular Machine Learning
Algorithms that could be implemented in IBM
Netezza and OpenR
Decision Tree
Uses a decision tree as a predictive model that maps observations about an item to conclusions about the item's target value. It is one of the predictive modeling approaches used in statistics, data mining and machine learning.
For case study and implementation details, please visit:
http://www.theanalyticsuniverse.com/predictive-analytic-s-decisiontrees-using-openr-ibm-netezza
34
Some other popular Machine Learning
Algorithms that could be implemented in IBM
Netezza and OpenR
Association Rule Mining
(Market Basket Analysis)
Association rule mining is a method for discovering hidden relationships between variables in large data sets. It is intended to identify strong rules discovered in databases using measures of interestingness.
Example
{milk, bread} => {butter}
This rule says that customers who buy milk and bread together are also likely to buy butter.
For case study and implementation details,
please visit
http://www.theanalyticsuniverse.com/predictiveanalytic-s-market-basket-analysis-using-openrand-netezza
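Two common measures of interestingness are support (how often the full rule occurs) and confidence (how often the right-hand side occurs when the left-hand side does). The base R sketch below computes both for the {milk, bread} => {butter} rule on five made-up baskets; the transactions are hypothetical.

```r
# Hypothetical transactions, one character vector per basket.
baskets <- list(
  c("milk", "bread", "butter"),
  c("milk", "bread"),
  c("milk", "bread", "butter", "eggs"),
  c("bread", "butter"),
  c("milk", "eggs")
)

# TRUE when a basket contains every item in `items`.
contains <- function(basket, items) all(items %in% basket)

lhs <- c("milk", "bread")
rhs <- "butter"

n_lhs  <- sum(sapply(baskets, contains, items = lhs))          # baskets with milk and bread
n_both <- sum(sapply(baskets, contains, items = c(lhs, rhs)))  # ... that also contain butter

support    <- n_both / length(baskets)   # share of all baskets matching the full rule
confidence <- n_both / n_lhs             # share of milk-and-bread baskets with butter
print(c(support = support, confidence = confidence))
```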
35
Some other popular Machine Learning
Algorithms that could be implemented in
Netezza and OpenR
K-Means Clustering
K-means clustering is an algorithm that groups data based on similar attributes.
For case study and implementation details, please visit
http://www.theanalyticsuniverse.com/predictive-analytic-s-k-means-clustering-using-openrand-ibm-netezza
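As a minimal local illustration, base R's `kmeans` can group the built-in iris measurements into three clusters. The data set and the choice of three clusters are assumptions made for this sketch; it does not use the Netezza implementation.

```r
set.seed(1)   # k-means starts from random centers

# Use only the four numeric measurements; the species label is not given to the algorithm.
features <- iris[, c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")]
km <- kmeans(features, centers = 3, nstart = 10)

# Compare the discovered clusters with the known species.
print(table(cluster = km$cluster, species = iris$Species))
```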
36
Some other popular Machine Learning
Algorithms that could be implemented in
Netezza and OpenR
Naïve Bayes
In machine learning, naive Bayes classifiers are a family of simple probabilistic
classifiers based on applying Bayes' theorem with strong (naive) independence
assumptions between the features.
Confusion matrix (Actual vs. Predicted):
                | Iris-setosa | Iris-versicolor | Iris-virginica
Iris-setosa     | 22          | 0               | 0
Iris-versicolor | 0           | 20              | 0
Iris-virginica  | 0           | 2               | 14
For case study and implementation details, please visit
http://www.theanalyticsuniverse.com/predictive-analytic-s-naive-bayesusing-openr-ibm-netezza
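The overall accuracy implied by the confusion matrix on this slide is the sum of the diagonal (correct predictions) divided by the total count. The sketch below re-enters the slide's counts by hand. Accuracy does not depend on whether rows hold actual or predicted classes.

```r
labels <- c("Iris-setosa", "Iris-versicolor", "Iris-virginica")

# Counts copied from the slide, filled row by row.
cm <- matrix(c(22,  0,  0,
                0, 20,  0,
                0,  2, 14),
             nrow = 3, byrow = TRUE, dimnames = list(labels, labels))

accuracy <- sum(diag(cm)) / sum(cm)   # correct predictions / all predictions
print(round(accuracy, 4))
```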
37
Where can Machine Learning take us?
38
Thank You
Abhik Roy, Database Technologies, Experian
Dong Yan, DA Analytics, Experian
Josh Evanoff, Internal Communications, Experian
Check out E09 – Taking Netezza on a test drive, May 25, 08:00 AM – 09:00 AM, Room San Antonio. The abstract will cover creating a Netezza and R analytical platform on your PC!
39
Abhik Roy
Experian
LinkedIn : https://www.linkedin.com/in/abhik-roy-98620412
Technical Blogs: www.theanalyticsuniverse.com
Please fill out your session
evaluation before leaving!
Session : E03
Title : Succeeding with Predictive Analytics and
unlocking the power of Data Science with R on
Netezza