Transcript Slide Deck

More value from data using Data
Mining
Allan Mitchell
SQL Server MVP
Who am I
•
•
•
•
•
•
SQL Server MVP
SQL Server Consultant
Joint author on Wrox Professional SSIS book
Worked with SQL Server since version 6.5
www.SQLDTS.com and www.SQLIS.com
Partner of SQL Know How
Today’s Schedule
•
•
•
•
what is data mining (Overview)
data mining terminology
myths around data mining
excel AddIn to Office2007
–
–
–
–
–
Demo Setup
Demo Key Influencers
Demo Categories
Demo Make a Prediction
Demo “Other stuff” – if time
• Questions and answers
What is Data Mining
• The process of using statistical techniques to discover
subtle relationships between data items, and the
construction of predictive models based on them. The
process is not the same as just using an OLAP tool to
find exceptional items. Generally, data mining is a very
different and more specialist application than OLAP,
and uses different tools from different vendors.
Normally the users are different, too. OLAP vendors
have had little success with their data mining efforts.
OLAP REPORT
What does Data Mining Do?
Explores
Your Data
Finds
Patterns
Performs
Predictions
Query, Reporting, Analysis
Data Mining
What
Why
How
Comparative Benefits
Predictive Projects versus Nonpredictive Projects
80%
70%
60%
50%
40%
30%
20%
10%
0%
Technology
Predictive
Productivity
Business Process
Enhancement
Nonpredictive
Source: IDC, 2003
Data Mining terminology
•
•
•
•
•
mining structure
mining model
mining algorithm
training dataset
testing dataset
SQL Server 2005 Algorithms
Decision Trees
Sequence
Clustering
Clustering
Time Series
Association
Naïve Bayes
Neural Net
Plus: Linear
and Logistic Regression
Sequence Clustering
• Applied to
– Click stream analysis
– Customer segmentation with
sequence data
– Sequence prediction
• Mix of clustering and sequence
technologies
• Group individuals based on
their profiles including
sequence data
Time Series
• Applied to
– Forecast sales
– Web hits prediction
– Stock value estimation
• Patented technique from
Microsoft Research
• Uses regression tree
technology to describe and
predict series values
Clustering
• Applied to
– Segmentation: Customer grouping,
Mailing campaign
– Also support classification and
regression
• Expectation Maximization
– Probabilistic Clustering
• K-Means
– Distance based
• Clusters both discrete and continuous
values
– Discrete values are “binarized”
• Anomaly detection
• Check variable independence
– “Predict Only” attributes not used for
clustering
Clustering
Age
Discrete
Male
Son
Daughter
Parent
Female
Clustering
Age
Anomaly Detection
Male
Son
Daughter
Parent
Female
dm data flow
Model
Browsing
LOB
Application
Reporting
Data Transform (SSIS)
Historical
Dataset
Prediction
Mining Models
Cube
Cube
New
Dataset
the steps to a successful model
MS BOL
DMX
CREATE MINING MODEL CreditRisk
INSERT INTO CreditRisk
(CustID
LONG KEY,
Gender
TEXT DISCRETE,
(CustId, Gender, Income, Profession,
Risk)
Income
LONG CONTINUOUS,
Profession TEXT DISCRETE,
Risk
TEXT DISCRETE PREDICT)
Select
CustomerID, Gender, Income,
Profession,Risk
From Customers
USING Microsoft_Decision_Trees
Select NewCustomers.CustomerID, CreditRisk.Risk,
PredictProbability(CreditRisk)
FROM CreditRisk PREDICTION JOIN NewCustomers
ON CreditRisk.Gender=NewCustomer.Gender
AND CreditRisk.Income=NewCustomer.Income
AND CreditRisk.Profession=NewCustomer.Profession
Myths around data mining
• You have to be a propeller head
• It’s a new concept.
• Only works with SSAS cubes
Excel 2007 DMAddin
•
•
•
•
•
DM visualisation
table analysis
Create session models/permanent models
Connect to ssas for full blown models
intuitive interface
Demos
•
•
•
•
•
setup
key Influencers
categories
Make a prediction
other sexy stuff
Resources
• Loads to be honest (DMX, API to name two things)
• Big Subject but very sexy
Contact Details
[email protected]