DAT439: Data Mining Algorithms and Usage


Data Mining in SQL Server 2000 and Yukon
Richard Lees
[email protected]
RichardLees.com.au
Agenda
• What isn’t Data Mining
  • Demo
• What is Data Mining
  • Demo
    • Create a data mine
    • 4 ways to view data mine
• What’s Coming in Yukon
  • Demo
• Questions
  • Throughout
Which Questions are Data Mining?
• Who are our biggest customers?
• What are customers buying with cigars?
• What are the customer retention levels of our branches?
• Which customers have bought olives, feta cheese but no ciabatta bread?
• Which regions have the highest male/female ratio of single 20 somethings?
• Which region has lowest customer retention levels and list out lost customers?
Demonstration
• Ad hoc query
• Drill through to details
• Business Intelligence tool
History of OLAP and Data Mining
19xx: Custom Data Mining available to Fortune 100
1993: Codd defined 12 rules for OLAP
1998: Microsoft SQL 7
  • OLAP v1
1999: OLAP on the Web
  • ThinSlicer
  • Many others
2000: Microsoft SQL 2000
  • OLAP v2
  • Data Mining
  • English Query
Future: Data Mining V2
  • SQL 2005
  • BI Tools
SAS and SPSS offer Data Mining tools to those who can afford them
Sample Data I Will be Using
• Wellington Libraries Loan DB
• We wanted sample data for data mining
• They were just writing off a data warehouse project
  “The experts have spent 12 months trying to import data!”
  “How could Microsoft help us? The data are in IBM databases!”
What is Data Mining?
“Data mining is the use of powerful software tools to discover significant traits or relationships, from databases or data warehouses and often used to predict future events”
• It exploits statistical algorithms such as decision trees, clustering, sequence clustering, association, naïve Bayes, neural network and time series algorithms
• Once the “knowledge” is extracted it:
  • Can be used to discover
  • Can be used to predict values of other cases
OLAP versus Data Mining
• OLAP
  • Is about fast ad hoc querying
  • Analysis by dimensions and measures
  • Gives precise answers
• Data Mining
  • May use rdbms or OLAP source
  • Is about discovering and predicting
  • Gives imprecise answers
  • OLAP is not a prerequisite for data mining, but it almost always comes first (learning to ride a bike before a car)
Clusters
[Figure: cluster plot with axes Annual Income and Age]
Library Clusters
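The idea behind the cluster plot can be illustrated with a small sketch. It groups cases by age and annual income using plain k-means, which is used here only as a stand-in: the slides do not say which clustering method the product applies. The case values, the starting centroids and the function name `kmeans` are all hypothetical.

```python
import math

def kmeans(points, centroids, iterations=10):
    """Plain k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Keep the old centroid if a cluster ends up empty
        centroids = [
            tuple(sum(v) / len(c) for v in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical cases: (age, annual income in thousands)
points = [(22, 25), (25, 30), (24, 28), (48, 90), (52, 95), (50, 85)]
centroids, clusters = kmeans(points, centroids=[points[0], points[3]])
print(centroids)
print(clusters)
```

In practice the two attributes would usually be scaled first, since income in raw units would otherwise dominate the distance.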
Decision Trees
• Input data
• About cases
• Discovering relationships
• Predicting outcomes
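As a rough illustration of how a decision tree discovers relationships in case data, the sketch below picks the attribute that best separates the cases by information gain. Information gain is one common split criterion; the slides do not state which criterion the product uses. The library-style cases and the attribute names are invented for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(cases, attribute, target):
    """Reduction in entropy of `target` after splitting the cases on `attribute`."""
    base = entropy([c[target] for c in cases])
    remainder = 0.0
    for value in {c[attribute] for c in cases}:
        subset = [c[target] for c in cases if c[attribute] == value]
        remainder += len(subset) / len(cases) * entropy(subset)
    return base - remainder

# Hypothetical cases: one row per library member
cases = [
    {"age_group": "child", "branch": "central", "borrows_fiction": "yes"},
    {"age_group": "child", "branch": "suburb", "borrows_fiction": "yes"},
    {"age_group": "adult", "branch": "central", "borrows_fiction": "no"},
    {"age_group": "adult", "branch": "suburb", "borrows_fiction": "no"},
]

# The attribute with the highest gain becomes the first split of the tree
best = max(["age_group", "branch"],
           key=lambda a: information_gain(cases, a, "borrows_fiction"))
print(best)
```

A full tree builder would repeat this choice on each resulting subset until the cases in a node are pure or too few to split.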
Data Mining
Demo with real data
• Build a data mine
• View data mine
  1. Browse dependencies
  2. Browse decision trees
  3. Query using MDX
  4. Query using ThinMiner
  5. Batch update
Elite
Embedded
Uses of Data Mining
• Risk assessment
• Claim likelihood
• Customer profitability predictions
• Fraud detection
• Treatment efficacy
• Product suggestions
  • Web shopping
  • Call centre tool
Successful Data Mining Projects
Two additional Critical Success Factors
1. Discover something interesting
2. Profit from discovery
For example
ComputerFleet
(Localhost)
What’s Coming in Yukon
• Decision Trees
• Sequence Clustering
• Clustering
• Association
• Time Series
• Naïve Bayes
• Neural Networks
• Confusion Matrix
• Lift Charts
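Two of the evaluation tools listed here, the confusion matrix and the lift chart, reduce to simple counting. The sketch below shows that counting in plain Python; it is not the Yukon tooling. The function names and the scored cases are hypothetical.

```python
def confusion_matrix(actual, predicted):
    """Count how often each (actual, predicted) pair occurs."""
    counts = {}
    for a, p in zip(actual, predicted):
        counts[(a, p)] = counts.get((a, p), 0) + 1
    return counts

def lift_at(scored_cases, fraction):
    """Lift = response rate in the top-scored fraction / overall response rate."""
    ranked = sorted(scored_cases, key=lambda c: c[0], reverse=True)
    top = ranked[: max(1, round(len(ranked) * fraction))]
    overall = sum(hit for _, hit in ranked) / len(ranked)
    return (sum(hit for _, hit in top) / len(top)) / overall

# Hypothetical model output: (predicted score, 1 if the case really responded)
scored = [(0.9, 1), (0.8, 1), (0.7, 0), (0.6, 1), (0.4, 0),
          (0.3, 0), (0.2, 0), (0.1, 0), (0.05, 0), (0.01, 0)]

matrix = confusion_matrix(["OK", "NOK", "OK"], ["OK", "OK", "OK"])
print(matrix)
print(lift_at(scored, 0.2))
```

A lift chart plots this ratio (or the cumulative share of responders captured) for every fraction from 0 to 1, so a model that ranks well shows a curve well above the random baseline.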
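Naïve Bayes
Actual (prior):
  NOK = .30
  OK = .70
Judged (declared), given actual, with joint probability in brackets:
  Actual NOK: J NOK .90 (.27), J OK .10 (.03)
  Actual OK: J NOK .20 (.14), J OK .80 (.56)
Total judged:
  J NOK = (.3 x .9) + (.7 x .2) = .41
  J OK = (.3 x .1) + (.7 x .8) = .59
Posterior (actual, given judged):
  NOK given J NOK = .27 / .41 = .66
  OK given J NOK = .14 / .41 = .34
  NOK given J OK = .03 / .59 = .05
  OK given J OK = .56 / .59 = .95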
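The arithmetic on this slide is Bayes' rule applied to a single attribute, which is the building block that a naïve Bayes model repeats for each input attribute. A minimal sketch for checking the numbers follows; the dictionary layout and the function name `posterior` are assumptions made for illustration, while the probabilities are the ones on the slide.

```python
# Prior probability of the actual state
prior = {"NOK": 0.30, "OK": 0.70}

# P(judged | actual), taken from the slide
likelihood = {
    "NOK": {"J NOK": 0.90, "J OK": 0.10},
    "OK": {"J NOK": 0.20, "J OK": 0.80},
}

def posterior(judged):
    """Return P(actual | judged) for each actual state via Bayes' rule."""
    joint = {actual: prior[actual] * likelihood[actual][judged]
             for actual in prior}
    evidence = sum(joint.values())  # P(judged), e.g. .41 for "J NOK"
    return {actual: joint[actual] / evidence for actual in joint}

for judged in ("J NOK", "J OK"):
    print(judged, {k: round(v, 2) for k, v in posterior(judged).items()})
```

Rounded to two places, a case judged NOK is actually NOK with probability .66 and actually OK with probability .34; a case judged OK is actually OK with probability .95.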
Demonstration
• Yukon
  • Development
  • New algorithms
  • Lift chart
  • Profit curve
  • Query tool
Questions:
References
Microsoft Research http://Research.Microsoft.com/research/pubs
Richard Lees
[email protected]
http://RichardLees.com.au