KDD systems & DBMS

Data Mining Systems and Languages
CS240A Notes
1
Knowledge Discovery (KDD) Process
Data mining: the core of the knowledge discovery process
[Figure: the KDD process as a pipeline]
Databases -> Data Cleaning / Data Integration -> Data Warehouse -> Selection -> Task-relevant Data -> Data Mining -> Pattern Evaluation
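The stages in the figure can be read as a pipeline in which each step feeds the next. The following is a minimal Python sketch under assumed toy data: the function names (`clean`, `integrate`, `select`, `mine`, `evaluate`) and all records are hypothetical and only illustrate the order of the steps, not any particular system.

```python
from collections import Counter

# Hypothetical raw records from two "databases" (not from the notes).
SALES_DB = [
    {"cust": 1, "item": "milk"},
    {"cust": 1, "item": "bread"},
    {"cust": 2, "item": "milk"},
    {"cust": 2, "item": None},      # dirty record, dropped by cleaning
    {"cust": 3, "item": "milk"},
]
CUSTOMER_DB = {1: "LA", 2: "LA", 3: "NY"}


def clean(rows):
    """Data cleaning: drop records with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]


def integrate(rows, customers):
    """Data integration: attach the customer's city from a second source."""
    return [dict(r, city=customers[r["cust"]]) for r in rows]


def select(warehouse, city):
    """Selection: keep only the task-relevant data."""
    return [r for r in warehouse if r["city"] == city]


def mine(rows):
    """Data mining: here simply item frequencies."""
    return Counter(r["item"] for r in rows)


def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns that meet a support threshold."""
    return {item: n for item, n in patterns.items() if n >= min_count}


warehouse = integrate(clean(SALES_DB), CUSTOMER_DB)
patterns = evaluate(mine(select(warehouse, "LA")), min_count=2)
print(patterns)
```

In a real deployment each stage would be far more elaborate; the point is only that mining is one step between data preparation and pattern evaluation.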
2
DM Experience for DBMS:
Dreams vs. Reality
Decision support and business intelligence:
• OLAP & data warehouses: a resounding success for DBMS vendors, achieved via
  - Simple extensions of SQL (aggregates & analytics)
• Relational DBMS extensions for DM queries: a flop
• OR-DBMSs do not fare much better [Sarawagi ’98].
• Imielinski & Mannila [CACM ’96] proposed a ‘high-road’ approach, calling for a quantum leap in functionality based on:
  - Simple declarative extensions of SQL for Data Mining (DM)
  - Efficiency through DM query optimization techniques (yet to be invented)
• The research area of Inductive DBMS was thus born, producing
  - Interesting language work: DMQL, Mine Rule, MSQL, …
  - Implementation technology that lacks generality and has performance limitations
  - Real questions as to whether optimizers will ever take us there.
3
DBMS Limitations
• DBMSs were easily and very successfully extended for data warehouses with the help of OLAP functions
• Extending DBMSs for mining has proven much harder
  - Limited expressive power
  - Flexibility of the languages
• Apriori in DB2 [Sarawagi ’98]
  - Because of the lack of suitable primitives, the task proved extremely difficult, and the result was not as efficient as cache mining
  - Cache mining: move data from the database to a cache, and then use PL algorithms to mine the cache.
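The cache-mining idea can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the DB2 experiment from [Sarawagi ’98]: an in-memory SQLite database stands in for the DBMS, the `baskets` table and its rows are hypothetical, and the mining step only computes frequent items and frequent pairs (the first two levels of Apriori).

```python
import sqlite3
from collections import Counter, defaultdict
from itertools import combinations

# An in-memory SQLite database stands in for the DBMS (an assumption;
# the notes discuss DB2). Table name and data are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE baskets (tid INTEGER, item TEXT)")
conn.executemany(
    "INSERT INTO baskets VALUES (?, ?)",
    [(1, "beer"), (1, "chips"), (1, "salsa"),
     (2, "beer"), (2, "chips"),
     (3, "chips"), (3, "salsa"),
     (4, "beer"), (4, "chips")],
)

# Step 1: move the data out of the database into a local cache.
cache = defaultdict(set)
for tid, item in conn.execute("SELECT tid, item FROM baskets"):
    cache[tid].add(item)
conn.close()

# Step 2: mine the cache with ordinary procedural code.
MIN_SUPPORT = 3  # absolute number of transactions

item_counts = Counter(item for items in cache.values() for item in items)
frequent_items = {i for i, n in item_counts.items() if n >= MIN_SUPPORT}

# Apriori pruning: only pairs made of frequent items can be frequent.
pair_counts = Counter(
    pair
    for items in cache.values()
    for pair in combinations(sorted(items & frequent_items), 2)
)
frequent_pairs = {p: n for p, n in pair_counts.items() if n >= MIN_SUPPORT}
print(frequent_pairs)
```

The database is used only as a data source; all counting happens outside it, which is why data transfer becomes the cost to watch.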
4
Mining Systems Desiderata
• Problem: how to efficiently support the vast variety of online mining algorithms in an integrated framework?
  - Generality over a wide spectrum of mining tasks
  - Ease of use for naïve users, and flexibility and customizability for experts
  - Efficiency, scalability
• Databases: where the data is. But DBMSs do not support KDD tasks well. Three approaches:
  1. Inductive DBMS
  2. Commercial DBMS extensions
  3. Dedicated KDD systems with DBMS connections.
5
Inductive DBMSs vs. Vendor Extensions
Imielinski & Mannila introduced the notion of
• A high-level Data Mining Query Language for DBMSs
• Optimization techniques for DM queries
Inductive DBMS: a new research field
  - MSQL, DMQL, Mine Rule: DM query languages
  - Performance and generality remain an open problem.
DBMS vendors
• Ad hoc approaches based on mining libraries
6
DBMS extensions: DB2 Intelligent Miner
• Model creation
• Training
    CALL IDMMX.DM_buildClasModelCmd('IDMMX.CLASTASKS',
        'TASK', 'ID', 'HeartClasTask',
        'IDMMX.CLASSIFMODELS',
        'MODEL', 'MODELNAME', 'HeartClasModel');
• Prediction
• Stored procedures and virtual mining views
• Most of the implementation is outside the DBMS (cache mining)
  - Data transfer delays
• http://www-306.ibm.com/software/data/iminer/
7
Oracle Data Miner
• Algorithms
  - Adaptive Naïve Bayes
  - SVM regression
  - K-means clustering
  - Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first-class objects
  - Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
• http://www.oracle.com/technology/products/bi/odm/index.html
21-Mar-08
8
OLE DB for DM by Microsoft
• Model creation; descriptive phase
• Prediction joins
• Other features
  - Nested cases
• http://research.microsoft.com/dmx/DataMining/
• PMML: a descriptive XML language for exchanging information between systems
9
OLE DB for DM (DMX) (cont.)
• Mining objects as first-class objects
• Schema rowsets
  - Mining_Models
  - Mining_Model_Content
  - Mining_Functions
• Other features
  - Column value distribution
  - Nested cases
• http://research.microsoft.com/dmx/DataMining/
10
OLE DB for DM (DMX): 3 steps
• Model creation
    CREATE MINING MODEL MemCard_Pred (
        CustomerId LONG KEY,
        Age LONG CONTINUOUS,
        Profession TEXT DISCRETE,
        Income LONG CONTINUOUS,
        Risk TEXT DISCRETE PREDICT)
    USING Microsoft_Decision_Trees;
• Training
    INSERT INTO MemCard_Pred OPENROWSET(
        "'sqloledb', 'sa', 'mypass'",
        'SELECT CustomerId, Age, Profession, Income, Risk FROM Customers')
• Prediction join
    SELECT C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
    FROM MemCard_Pred AS MP PREDICTION JOIN Customers AS C
    ON MP.Profession = C.Profession AND
       MP.Income = C.Income AND MP.Age = C.Age;
11
Defining a Mining Model:
E.g., a model to predict students’ plans to attend college
• The format of “training cases” (top-level entity)
• Attributes, input/output type, distribution
• Algorithms and parameters
Example
CREATE MINING MODEL CollegePlanModel
(
    StudentID     LONG  KEY,
    Gender        TEXT  DISCRETE,
    ParentIncome  LONG  NORMAL CONTINUOUS,
    Encouragement TEXT  DISCRETE,
    CollegePlans  TEXT  DISCRETE PREDICT
) USING Microsoft_Decision_Trees
12
Training
INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome,
     Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome,
            Encouragement, CollegePlans
     FROM CollegePlansTrainData')
13
Prediction Join
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND
   CPModel.IQ = t.IQ
[Figure: the model CPModel (ID, Gender, IQ, Plan) joined with the table NewStudents (ID, Gender, IQ)]
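The create / train / prediction-join life cycle can be mimicked in plain Python to make its semantics concrete. This is a toy sketch under loud assumptions: `ToyMiningModel` is a hypothetical class, the "model" is a majority-vote lookup table rather than Microsoft_Decision_Trees, and the student records are invented for illustration.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a mining model: a majority-vote lookup
    table keyed on the input columns (not a decision tree)."""

    def __init__(self, inputs, target):
        self.inputs = inputs          # plays the role of the input columns
        self.target = target          # plays the role of the PREDICT column
        self._votes = defaultdict(Counter)

    def train(self, cases):
        """Analogue of INSERT INTO <model>: consume training cases."""
        for case in cases:
            key = tuple(case[c] for c in self.inputs)
            self._votes[key][case[self.target]] += 1

    def prediction_join(self, rows):
        """Analogue of PREDICTION JOIN ... ON model.col = t.col:
        match each new row on the input columns and emit a prediction."""
        for row in rows:
            key = tuple(row[c] for c in self.inputs)
            votes = self._votes.get(key)
            plan = votes.most_common(1)[0][0] if votes else None
            yield {"ID": row["ID"], self.target: plan}


# Invented training cases and new students.
model = ToyMiningModel(inputs=("Gender", "IQ"), target="Plan")
model.train([
    {"Gender": "F", "IQ": 110, "Plan": "college"},
    {"Gender": "F", "IQ": 110, "Plan": "college"},
    {"Gender": "F", "IQ": 110, "Plan": "work"},
    {"Gender": "M", "IQ": 95, "Plan": "work"},
])
new_students = [
    {"ID": 1, "Gender": "F", "IQ": 110},
    {"ID": 2, "Gender": "M", "IQ": 95},
    {"ID": 3, "Gender": "M", "IQ": 120},   # unseen combination
]
results = list(model.prediction_join(new_students))
print(results)
```

Unlike this lookup table, a real mining algorithm generalizes to unseen input combinations; the sketch only shows how the ON clause pairs model columns with columns of the new data.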
14
Summary of Vendors’ Approaches
• Built-in library of mining methods
  - Script language or GUI tools
• Limitations
  - Closed systems (internals hidden from users)
  - Adding new algorithms or customizing old ones is difficult
  - Poor integration with SQL
  - Limited interoperability across DBMSs
• Predictive Model Markup Language (PMML) as a palliative
15
PMML
• Predictive Model Markup Language
  - XML-based language for vendor-independent definition of statistical and data mining models
  - Share models among PMML-compliant products
  - A descriptive language
• Supported by all major vendors
16
PMML Example
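The example shown on this slide is not preserved in the transcript. As a stand-in, the Python sketch below builds a small PMML-style tree model with the standard library. It is an illustration only: the element and attribute names follow PMML's tree-model vocabulary as best understood here, the version string, field names, and threshold are assumptions, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET

# Root element; the version string is chosen for illustration only.
pmml = ET.Element("PMML", version="3.2")
ET.SubElement(pmml, "Header", description="Toy college-plans tree")

# Data dictionary: the fields the model talks about (hypothetical names).
dd = ET.SubElement(pmml, "DataDictionary", numberOfFields="2")
ET.SubElement(dd, "DataField", name="ParentIncome",
              optype="continuous", dataType="double")
plans = ET.SubElement(dd, "DataField", name="CollegePlans",
                      optype="categorical", dataType="string")
for v in ("yes", "no"):
    ET.SubElement(plans, "Value", value=v)

# The model itself: a one-split classification tree.
tree = ET.SubElement(pmml, "TreeModel", modelName="CollegePlanTree",
                     functionName="classification")
schema = ET.SubElement(tree, "MiningSchema")
ET.SubElement(schema, "MiningField", name="ParentIncome")
ET.SubElement(schema, "MiningField", name="CollegePlans",
              usageType="predicted")

root_node = ET.SubElement(tree, "Node", score="no")
ET.SubElement(root_node, "True")          # root predicate: always matches
high_income = ET.SubElement(root_node, "Node", score="yes")
ET.SubElement(high_income, "SimplePredicate", field="ParentIncome",
              operator="greaterThan", value="50000")

xml_text = ET.tostring(pmml, encoding="unicode")
print(xml_text)
```

The document only describes the model (fields, schema, tree nodes); it contains no procedure for building or scoring it, which is what makes PMML a descriptive language.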
17
Much Competition
Vendors
• SAS Institute (Enterprise Miner)
• IBM (DB2 Intelligent Miner for Data)
• Oracle (ODM option to Oracle 10g)
• SPSS (Clementine)
• Unica Technologies, Inc. (Pattern Recognition Workbench)
• Insightful (Insightful Miner)
• KXEN (Analytic Framework)
• Prudsys (Discoverer and its family)
• Microsoft (SQL Server 2005)
• Angoss (KnowledgeServer and its family)
• DBMiner (DB2)
Platforms
  - IBM
  - Oracle
  - SAS
Tools
  - SPSS
  - Angoss
  - KXEN
  - Megaputer
  - Fair Isaac
  - Insightful
Stand Alone Systems
WEKA is open-source Java code created by researchers at the University of Waikato in New Zealand.
• It provides many different machine learning algorithms
• Applicable to generic data described in Attribute-Relation File Format (ARFF)
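An ARFF file carries its own schema in a header, which is what lets generic algorithms run on data with any number of columns. The following is a minimal Python sketch, not Weka's own loader: `parse_arff` is a hypothetical helper that handles only `@relation`, `@attribute`, and `@data` lines, and the sample data is invented.

```python
# Invented sample in ARFF layout: header with schema, then data rows.
ARFF_TEXT = """\
% toy data set
@relation college_plans
@attribute gender {F,M}
@attribute parent_income numeric
@attribute plans {yes,no}
@data
F,62000,yes
M,31000,no
"""


def parse_arff(text):
    """Parse a minimal ARFF subset: @relation, @attribute, @data rows.
    Quoting, sparse rows, and date/string types are not handled."""
    relation, attributes, rows = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # '%' starts a comment
            continue
        if in_data:
            rows.append(line.split(","))
            continue
        keyword, _, rest = line.partition(" ")
        keyword = keyword.lower()
        if keyword == "@relation":
            relation = rest.strip()
        elif keyword == "@attribute":
            name, _, kind = rest.strip().partition(" ")
            attributes.append((name, kind.strip()))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows


relation, attributes, rows = parse_arff(ARFF_TEXT)
print(relation, attributes, rows)
```

Because the attribute list is read from the file, the same code works unchanged for a table with three columns or three hundred.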
19
Weka
• A comprehensive set of DM algorithms and tools.
• Generic algorithms over arbitrary data sets.
  - Independent of the number of columns in tables.
• Open and extensible system based on Java.
* Also free …
20