Attend College
Download
Report
Transcript Attend College
DAT205
Advanced Data Mining Using
SQL Server 2000
ZhaoHui Tang
Program Manager
SQL Server Analysis Services
Microsoft Corporation
Agenda
• Microsoft Data Mining Algorithms
• OLE DB for DM Data mining query
• Data Mining Case Study: Click Stream
Analysis
– Customer Segmentation
– Site affiliation
– Target ads in banner
• Performance of Microsoft Data Mining
Algorithm
• Q&A
Data Mining Algorithms in SQL
Server 2000
Decision Tree
All Students
Attend College:
55% Yes
45% No
IQ=High
IQ < > High
Attend College:
79% Yes
21% No
Parent Income
= High
Attend College:
94% Yes
6% No
Attend College:
35% Yes
65% No
Parent Income
= Low
Attend College:
69% Yes
31% No
• Popular technique for
classification,
Prediction task
– Churn analysis
– Credit risk analysis
– …
• Easy to understand
– any path from node to
leaf forms a rule
• Fast to build
• Prediction based on
leaf node stats
• Variation: C4.5, C5,
CART, Chaid
How tree works
IQ
CollegeP
lan
Parent
Encouragement
Parent
Income
Gender
High
Medium
Low
True
False
High
False
Male
Female
Yes
300
500
200
700
300
400
600
500
500
No
100
1000
900
400
1600
400
1600
1100
900
1000
1800
900
1600
800
700
1400
600
1200
500
1000
400
800
300
600
200
400
100
200
0
0
IQ=High
IQ=Medium
IQ=Low
Yes
PE=TRUE
PE=FALSE
1200
1800
1600
1000
1400
800
1200
1000
600
800
No
400
600
400
200
200
0
0
PI=High
PI=FALSE
Male
Female
Split recursively
All Students
College Plan
33% Yes
67% No
Parent Encouragement = True
Parent Encouragement = False
College Plan
63% Yes
37% No
College Plan
16% Yes
84% No
IQ
CollegeP
lan
Parent
Encouragement
Parent
Income
Gender
High
Medium
Low
True
False
High
False
Male
Female
Yes
200
400
100
700
0
300
400
400
250
No
50
250
100
400
0
100
300
250
150
Microsoft Decision Trees
• Probabilistic Classification Tree
• Splitting methods: Bayesian score and
Entropy
• Forward pruning
• Tree shape: Binary and Nary tree
• Scalable framework
Clustering Algorithm (EM)
• A popular method for customer
segmentation, mailing list, profiling…
• Algorithm process
– Assign a set of Initial Points
– Assign initial cluster to each points
– Assign data points to each cluster with a
probability
– Computer new central point based on weighted
computation
– Cycle until convergence
EM Illustration
X
X
X
Microsoft Clustering Algorithm
(Scalable EM)
Fill Buffer
Data
Build/Update
Model
Stop?
Final Model
Identify Data to
be Compressed
Compressed date
Sufficient stats
OLE DB for Data Mining
OLE DB for DM
• Industry standard for data mining
• Based on existing technologies
– SQL
– OLE DB
• Define common concepts for DM
–
–
–
–
–
Case, Nested Case
Mining Model
Model Creation
Model Training
Prediction
• Language based API
Customer Table
Customer ID
Profession
Income
Gender
Risk
1
Engineer
85
Male
No
2
Worker
40
Male
Yes
3
Doctor
90
Female
No
4
Teacher
50
Female
No
5
Worker
45
Male
No
…
…
…
…
…
DM Query Language
Create Mining Model CreditRisk
(CustomerID long key,
Gender
text discrete,
Insert into CreditRisk
Income long continuous,
(CustomerId, Gender, Income,
Profession text discrete, Select NewCustomers.CustomerID,
Profession, Risk)
CreditRisk.Risk, PredictProbability(CreditRisk)
Risk
text discrete predict)
Select
From CreditRisk Prediction Join NewCustomers
Using Microsoft_Decision_Trees
CustomerID, Gender, Income,
Profession,Risk On CreditRisk.Gender=NewCustomer.Gender
From Customers And CreditRisk.Income=NewCustomer.Income
And
CreditRisk.Profession=NewCustomer.Profession
Schema Rowsets
• Tabular data to provide meta data
information
• List of Schema Rowsets in OLE DB for DM
–
–
–
–
–
–
Mining_Services
Mining_Service_Parameters
Mining_Models
Mining_Columns
Mining_Model_Contents
Model_Content_PMML
Mining Model Contents Schema
Rowsets
Schema Rowsets & Thin Client
Browser
Case Study: Click Stream
Analysis
Schema
Customer
CustomerGuid
DayTimeOnLine
NightTimeOnLin
e
WebClick
CustomerGuid
BrowserType
URLCategory
EmailTime
Time
ChatTime
Duration
GeoLocation
ReferPage
Web Customer Segmentation
Web Visitors Segmentation
Segmentation based on Customer
table
Create Mining Model CustomerClustering
(CustomerID text key,
DayTimeOnline long continuous
NightTimeOnline long continuous,
BrowserType
text discrete,
ChatTime
long
continuous,
EmailTime
long continuous,
GeoLocation
text
discrete
)
Using Microsoft_Clustering
Segmentation based on Customer
and WebClick
Create Mining Model CustomerClustering
(CustomerID text key,
DayTimeOnline long continuous,
NightTimeOnline long continuous,
BrowserType
text discrete,
ChatTime
long continuous,
EmailTime
long continuous,
GeoLocation
text
WebClick
table (
discrete
UrlCategory text key )
)Using Microsoft_Clustering
MSFTies Segmentation
Web Site Affiliation
Association analysis using
Microsoft Decision Trees
Insurance
Business
Loan
No Loan
Stock
No Stock
Stock
Business
No Loan
Loan
No Insurance
Insurance
Loan
No Business
Shopping
No Shopping
Stock
Insurance
No Stock
No Insurance
Association analysis using
Microsoft Decision Trees
Insurance
Business
Loan
No Loan
Stock
No Stock
Stock
Business
No Loan
Loan
No Insurance
Insurance
Loan
No Business
Shopping
No Shopping
Stock
Insurance
No Stock
No Insurance
Site Affiliation
Site Affiliation
Create Mining Model SiteAffiliation
(CustomerID text key,
WebClick
table predict (
UrlCategory text key )
)Using Microsoft_Decision_Trees
Insert into SiteAffiliation (CustomerID,WebClick (skip,
UrlCategory)
OpenRowset(‘MSDataShape’, 'data
provider=SQLOLEDB;Server=myserver;UID=me;
PWD=mypass' ,
'Shape{Select CustomerID from Customer}
Append ( {Select customerid, URLCategory
from WebClick }
relate CustomerID to CustomerID) as WebClick’
)
Path Prediction
Path Prediction
Singleton Prediction
Select Flattened
Topcount((select URLCategory, $adjustedProbability as
prob
From Predict([Web Click], INCLUDE_STATISTICS,
EXCLUSIVE)), prob, 5)
From
WebLog PREDICTION JOIN (select (select 'Business'
as URLCategory) union (select ‘Telecom’ as
URLCategory) as WebClick) as input
On
WebLog.[Web Click].URLCategory =
input.WebClick.URLCategory
Architecture
ADO/DSO
DM Provider
Web
Customer
Internet
IIS
ASP
DMM
Real Time
Predictio
n
Performance of DM Algorithms
DM Performance Study
• Joint effort between Unisys & Microsoft
•
Two parts of the white paper:
First part: Use AS2k to build DM Models for a
banking business scenario
•
•
Second Part: Performance results of DM
algorithms study
Some results in this session…
Details in the paper and SQL Server
magazine articles…
Data Source for DMMs
Training Performance Results…
Sample Business Question for
Non Nested MDT
1
Identify those customers that are
most likely to churn (leave) based
on
customer
demographical
information.
Non Nested: Training Times for varying Number of Input attributes
Assumptions:
•
1 mm cases
•
25 states
•
1 predictable attribute
Observations:
10
20
Training Time
4.08
7.27
50
31.54
100
40.55
Training Time (minutes)
I/P Attributes
140.00
120.00
100.00
80.00
Training Time
60.00
40.00
20.00
0.00
0
200
129.35
50
100
150
Number of Attributes
200
250
Non Nested: Training Times for varying Number of Cases
Assumptions:
•
20 attributes
•
25 states
•
1 predictable attribute
Observations:
Cases
Training
Time
10,000
0.38
1,000,000
11.32
5,000,000
34.19
10,000,000
100.53
Sample Business Question for
Nested MDT
2
Find the list of other products that the
customer may be interested in based on the
products the customer has purchased.
Nested Cases: Training Times for varying Sample size of Case Table
Assumptions:
•
Avg. customer
purchases=25
•
States in nested=200
•
Nested key predictable
Master Cases
Training
Time
10,000
15.09
50,000
67.79
100,000
120.88
200,000
240.62
300
Training Time (minutes)
Observations:
Training Time
250
200
150
Training Time
100
50
0
0
50000 100000 150000 200000 250000
Number of Master Cases
Nested Cases: Training Times for varying Number of Products
purchased per customer
Assumptions:
•
200000 cases
•
1000 products in nested
Observations:
Nested Cases
Training
Time
10
85.26
25
120.82
50
172.96
100
281.65
For more info…
• DM URL
– www.microsoft.com/data/oledb
– www.microsoft.com/data/oledb/DMResKit.htm
• News Group:
– Microsoft.public.SQLserver.datamining
– Communities.msn.com/AnalysisServicesDataMining
• White papers:
– Performance paper:
www.unisys.com/windows2000/default-07.asp
www.microsoft.com/SQL/evaluation/compare/analysisdmwp.asp
Don’t forget to complete the
on-line Session Feedback form
on the Attendee Web site
https://web.mseventseurope.com/teched/