KDD systems & DBMS
Data Mining Systems and Languages
CS240A Notes
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
Pattern Evaluation
Data Mining
Task-relevant Data
Data Warehouse
Selection
Data Cleaning
Data Integration
Databases
DM Experience for DBMS:
Dreams vs. Reality
Decision Support and business intelligence:
OLAP & data warehouses: a resounding success for DBMS vendors, via simple extensions of SQL (aggregates & analytics)
Relational DBMS extensions for DM queries: a flop
OR-DBMS do not fare much better [Sarawagi’ 98].
Imielinski & Mannila [CACM’96] proposed a ‘high-road’ approach, calling for a quantum leap in functionality based on:
Simple declarative extensions of SQL for Data Mining (DM)
Efficiency through DM query optimization techniques (yet to be invented)
The research area of Inductive DBMS was thus born, producing
Interesting language work: DMQL, Mine Rule, MSQL, …
Implementation technology that suffers from limited generality and performance
A real question is whether optimizers will ever take us there.
DBMS Limitations
DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
Extending DBMSs for mining has proven much harder
Limited expressive power and flexibility of the languages
Apriori in DB2 [Sarawagi’ 98]
Because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
Cache mining: move data from the database to a cache and then use PL algorithms to mine the cache.
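The cache-mining idea above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: SQLite stands in for the DBMS, the `baskets` table, its columns, and the toy rows are hypothetical, and simple pair counting stands in for a full Apriori implementation.

```python
import sqlite3
from collections import Counter
from itertools import combinations

def cache_mine_pairs(conn, min_support):
    """Copy the transaction table into memory, then count item pairs there."""
    # Step 1: move the data from the database into an in-memory cache.
    cache = {}
    for tid, item in conn.execute("SELECT tid, item FROM baskets"):
        cache.setdefault(tid, set()).add(item)

    # Step 2: mine the cache with ordinary programming-language code.
    counts = Counter()
    for items in cache.values():
        for pair in combinations(sorted(items), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical toy data; table and column names are assumptions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "bread"), (1, "milk"), (2, "bread"), (2, "milk"),
     (2, "beer"), (3, "milk"), (3, "beer")],
)
frequent_pairs = cache_mine_pairs(conn, min_support=2)
```

The only SQL involved is a plain scan; all of the mining logic runs outside the DBMS, which is why the approach is easy to program but pays for the data transfer.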
Mining Systems Desiderata
Problem: How to efficiently support the vast
variety of online mining algorithms in an integrated
framework?
Generality over a wide spectrum of mining tasks
Ease of use for naïve users and flexibility and
customizability for experts
Efficiency, scalability
Databases: where the data is. But DBMSs do not support KDD tasks well. Three approaches:
1. Inductive DBMS
2. Commercial DBMS extensions
3. Dedicated KDD systems with DBMS connections.
Inductive DBMSs vs. Vendor Extensions
Imielinski & Mannila introduced the notion of
A high-level Data Mining Query Language for DBMS
Optimization techniques for DM queries
Inductive DBMS: a new research field
MSQL, DMQL, Mine Rule: DM query languages
Performance and generality remain an open problem.
DBMS Vendors
Ad-hoc approaches based on mining libraries
DBMS extensions: DB2 Intelligent Miner
Model creation
Training
CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS',
'TASK', 'ID', 'HeartClasTask',
'IDMMX.CLASSIFMODELS',
'MODEL', 'MODELNAME', 'HeartClasModel' );
Prediction
Stored procedures and virtual mining views
Most of the implementation is outside the DBMS (cache mining)
Data transfer delays
http://www-306.ibm.com/software/data/iminer/
Oracle Data Miner
Algorithms
Adaptive Naïve Bayes
SVM regression
K-means clustering
Association rules, text mining, etc.
PL/SQL with extensions for mining
Models as first class objects
Create_Model, Prediction, Prediction_Cost,
Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html
21-Mar-08
OLE DB for DM by Microsoft
Model creation. Descriptive phase
Prediction joins
Other features
Nested cases
http://research.microsoft.com/dmx/DataMining/
PMML: a descriptive XML language for exchanging information between systems
OLE DB for DM (DMX) (cont.)
Mining objects as first class objects
Schema rowsets
Mining_Models
Mining_Model_Content
Mining_Functions
Other features
Column value distribution
Nested cases
http://research.microsoft.com/dmx/DataMining/
OLE DB for DM (DMX): 3 steps
Model creation
CREATE MINING MODEL MemCard_Pred (
    CustomerId LONG KEY,
    Age LONG CONTINUOUS,
    Profession TEXT DISCRETE,
    Income LONG CONTINUOUS,
    Risk TEXT DISCRETE PREDICT)
USING Microsoft_Decision_Trees;
Training
INSERT INTO MemCard_Pred OPENROWSET(
    "'sqloledb', 'sa', 'mypass'",
    'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
Prediction Join
SELECT C.CustomerId, C.Risk, PredictProbability(MP.Risk)
FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
ON MP.Profession = C.Profession AND
   MP.Income = C.Income AND MP.Age = C.Age;
Defining a Mining Model:
E.g., a model to predict students’ plans to attend college
The format of “training cases” (top-level entity)
Attributes, Input/output type, distribution
Algorithms and parameters
Example
CREATE MINING MODEL CollegePlanModel
(
    StudentID      LONG  KEY,
    Gender         TEXT  DISCRETE,
    ParentIncome   LONG  NORMAL CONTINUOUS,
    Encouragement  TEXT  DISCRETE,
    CollegePlans   TEXT  DISCRETE PREDICT
) USING Microsoft_Decision_Trees
Training
INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome,
     Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome,
            Encouragement, CollegePlans
     FROM CollegePlansTrainData')
Prediction Join
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND
   CPModel.IQ = t.IQ
[Figure: the CPModel mining model (columns ID, Gender, IQ, Plan) joined with the NewStudents table (columns ID, Gender, IQ)]
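To make the semantics of the three steps (create, train, prediction join) concrete, here is a minimal Python sketch. It is not how any vendor implements DMX: a majority-vote lookup table stands in for the decision-tree algorithm, and the function names, column values, and toy rows are all hypothetical.

```python
from collections import Counter, defaultdict

def train(cases, inputs, target):
    """Learn the most common target value for each combination of inputs."""
    votes = defaultdict(Counter)
    for case in cases:
        key = tuple(case[col] for col in inputs)
        votes[key][case[target]] += 1
    return {key: c.most_common(1)[0][0] for key, c in votes.items()}

def prediction_join(model, inputs, new_cases, default=None):
    """Pair every new case with the model's prediction for its input values."""
    for case in new_cases:
        key = tuple(case[col] for col in inputs)
        yield case["ID"], model.get(key, default)

# Hypothetical training cases (these already carry the Plan column).
training = [
    {"ID": 1, "Gender": "F", "IQ": 110, "Plan": "yes"},
    {"ID": 2, "Gender": "F", "IQ": 110, "Plan": "yes"},
    {"ID": 3, "Gender": "M", "IQ": 90, "Plan": "no"},
]
# Hypothetical new cases (no Plan column; it is what we want predicted).
new_students = [
    {"ID": 10, "Gender": "F", "IQ": 110},
    {"ID": 11, "Gender": "M", "IQ": 90},
    {"ID": 12, "Gender": "M", "IQ": 130},
]
model = train(training, inputs=("Gender", "IQ"), target="Plan")
predictions = list(prediction_join(model, ("Gender", "IQ"), new_students))
```

The `inputs` tuple plays the role of the ON clause: it says which columns of the new cases are matched against the model's input attributes. Unlike an ordinary join, a new case with an unseen combination still produces an output row; this toy stand-in returns `None` there, whereas a real mining algorithm would generalize.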
Summary of Vendors’ Approaches
Built-in library of mining methods
Script language or GUI tools
Limitations
Closed systems (internals hidden from users)
Adding new algorithms or customizing old ones is difficult
Poor integration with SQL
Limited interoperability across DBMSs
Predictive Model Markup Language (PMML) as a palliative
PMML
Predictive Model Markup Language
XML based language for vendor independent
definition of statistical and data mining models
Share models among PMML compliant products
A descriptive language
Supported by all major vendors
PMML Example
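As an illustration only (this is not the example from the original slide), the Python sketch below builds a small PMML-style document for a one-split decision tree and reads it back, the way a second PMML-compliant product might. The field names, threshold, and scores are hypothetical, and the XML namespace that real PMML documents declare is omitted to keep the sketch short.

```python
import xml.etree.ElementTree as ET

# Root element and header of the model document.
pmml = ET.Element("PMML", version="4.4")
ET.SubElement(pmml, "Header", description="Hypothetical credit-risk tree")

# Data dictionary: the fields the model talks about.
dictionary = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dictionary, "DataField", name="Income",
              optype="continuous", dataType="double")
ET.SubElement(dictionary, "DataField", name="Risk",
              optype="categorical", dataType="string")

# The model itself: predict "high" risk below an income threshold, else "low".
tree = ET.SubElement(pmml, "TreeModel", functionName="classification")
schema = ET.SubElement(tree, "MiningSchema")
ET.SubElement(schema, "MiningField", name="Income")
ET.SubElement(schema, "MiningField", name="Risk", usageType="target")
root = ET.SubElement(tree, "Node", score="low")
ET.SubElement(root, "True")
child = ET.SubElement(root, "Node", score="high")
ET.SubElement(child, "SimplePredicate", field="Income",
              operator="lessThan", value="30000")

document = ET.tostring(pmml, encoding="unicode")

# A second "product" reads the document back and lists the declared fields.
parsed = ET.fromstring(document)
field_names = [f.get("name") for f in parsed.iter("DataField")]
```

The document describes the fitted model, not how to build it, which is what "descriptive language" means here: the producer and the consumer only need to agree on the XML vocabulary.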
Much Competition
Vendors
SAS Institute (Enterprise Miner)
IBM (DB2 Intelligent Miner for Data)
Oracle (ODM option to Oracle 10g)
SPSS (Clementine)
Unica Technologies, Inc. (Pattern Recognition Workbench)
Insightful (Insightful Miner)
KXEN (Analytic Framework)
Prudsys (Discoverer and its family)
Microsoft (SQL Server 2005)
Angoss (KnowledgeServer and its family)
DBMiner (DB2)
Platforms
IBM
Oracle
SAS
Tools
SPSS
Angoss
KXEN
Megaputer
Fair Isaac
Insightful
Stand Alone Systems
WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
It provides many different machine learning algorithms
Applicable to generic data described in Attribute-Relation File Format (ARFF)
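To show what "generic data described in ARFF" looks like, here is a minimal Python sketch that parses a tiny ARFF document. It is an assumption-laden illustration, not WEKA's own loader: it handles only dense rows with unquoted values, and the `students` relation and its rows are hypothetical.

```python
def parse_arff(text):
    """Parse a small, dense ARFF document into (relation, attributes, rows)."""
    relation, attributes, rows = None, [], []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):   # blank line or comment
            continue
        if in_data:
            # After @data, every line is one comma-separated instance.
            rows.append([value.strip() for value in line.split(",")])
            continue
        keyword = line.split(None, 1)[0].lower()
        if keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

sample = """\
% Hypothetical data set, for illustration only
@relation students
@attribute gender {F,M}
@attribute iq numeric
@attribute plan {yes,no}
@data
F,110,yes
M,90,no
"""
relation, attributes, rows = parse_arff(sample)
```

Because the header declares every attribute and its type, an algorithm can be written once against this description and applied to tables with any number of columns.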
Weka
A comprehensive set of DM algorithms, and tools.
Generic algorithms over arbitrary data sets.
Independent of the number of columns in tables.
Open and extensible system based on Java.
* Also free …