The Data Mining Software Vendors

Supporting Data Stream Mining Applications in DBMS & DSMS
Carlo Zaniolo
UCLA CSD
4/4/2017
DM Experience for DBMS: Dreams vs. Reality
Decision support and business intelligence:
• OLAP & data warehouses: a resounding success for DBMS vendors, via
  – Simple extensions of SQL (aggregates & analytics)
• Relational DBMS extensions for DM queries: a flop
  – OR-DBMS do not fare much better [Sarawagi’ 98].
A ‘high-road’ approach was suggested by Imielinski & Mannila [CACM’96], who called for a quantum leap in functionality based on:
• Simple declarative extensions of SQL for Data Mining (DM)
• Efficiency through DM query optimization techniques (yet to be invented)
The research area of Inductive DBMS was thus born, producing
• Interesting language work: DMQL, Mine Rule, MSQL, …
• Implementation technology that lacks generality and suffers from performance limitations
• Real questions about whether optimizers will ever take us there.
http://wis.cs.ucla.edu
DM Experience for DBMS: Dreams vs. Reality
The Low-Road Approach by Commercial DBMS
• Approaches largely based on cache mining
• Stored procedures and virtual mining views
• Outside the DBMS
  – Data transfer delays
• No move toward standardization
IBM DB2
http://www-306.ibm.com/software/data/iminer/
Intelligent Miner no longer supported.
Oracle Data Miner

• Algorithms
  – Adaptive Naïve Bayes
  – SVM regression
  – K-means clustering
  – Association rules, text mining, etc.
• PL/SQL with extensions for mining
• Models as first class objects
  – Create_Model, Prediction, Prediction_Cost, Prediction_Details, etc.
http://www.oracle.com/technology/products/bi/odm/index.html
MS: OLE DB for DM (DMX): 3 steps

Model creation
Create mining model MemCard_Pred (
    CustomerId long key,
    Age long continuous,
    Profession text discrete,
    Income long continuous,
    Risk text discrete predict)
Using Microsoft_Decision_Trees;

Training
Insert into MemCard_Pred OpenRowSet(
    "'sqloledb', 'sa', 'mypass'",
    'SELECT CustomerId, Age, Profession, Income, Risk from Customers')

Prediction Join
Select C.Id, C.Risk, PredictProbability(MemCard_Pred.Risk)
From MemCard_Pred AS MP Prediction Join Customers AS C
On MP.Profession = C.Profession AND MP.Income = C.Income
    AND MP.Age = C.Age;
MS: Defining a Mining Model:

E.g., a model to predict students’ plans to attend college
• The format of “training cases” (top-level entity)
• Attributes, input/output type, distribution
• Algorithms and parameters
Example
CREATE MINING MODEL CollegePlanModel
(
    StudentID      LONG  KEY,
    Gender         TEXT  DISCRETE,
    ParentIncome   LONG  NORMAL CONTINUOUS,
    Encouragement  TEXT  DISCRETE,
    CollegePlans   TEXT  DISCRETE PREDICT
) USING Microsoft_Decision_Trees
Training
INSERT INTO CollegePlanModel
    (StudentID, Gender, ParentIncome,
     Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
    'SELECT StudentID, Gender, ParentIncome,
            Encouragement, CollegePlans
     FROM CollegePlansTrainData')
Prediction Join
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN
    OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND
   CPModel.IQ = t.IQ
[Figure: the CPModel mining model (columns ID, Gender, IQ, Plan) prediction-joined with the NewStudents table (columns ID, Gender, IQ)]
OLE DB for DM (DMX) (cont.)

• Mining objects as first class objects
• Schema rowsets
  – Mining_Models
  – Mining_Model_Content
  – Mining_Functions
• Other features
  – Column value distribution
  – Nested cases
http://research.microsoft.com/dmx/DataMining/
Summary of Vendors’ Approaches

• Built-in library of mining methods
  – Script language or GUI tools
• Limitations
  – Closed systems (internals hidden from users)
  – Adding new algorithms or customizing old ones is difficult
  – Poor integration with SQL
  – Limited interoperability across DBMSs
  – Predictive Model Markup Language (PMML) as a palliative
PMML

• Predictive Model Markup Language
  – XML-based language for vendor-independent definition of statistical and data mining models
  – Share models among PMML-compliant products
  – A descriptive language
  – Supported by all major vendors
PMML Example
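The original figure for this slide did not survive extraction. As a stand-in, here is a minimal sketch of what a PMML document can look like and how a consumer might read it. The tree, its threshold, and the helper name `describe` are hypothetical; only the field names echo the CollegePlanModel example, and the PMML 4.4 namespace is an assumption about the version in use.

```python
import xml.etree.ElementTree as ET

# Hypothetical PMML document: the model content is made up for illustration.
PMML_DOC = """
<PMML version="4.4" xmlns="http://www.dmg.org/PMML-4_4">
  <Header description="Illustrative decision tree"/>
  <DataDictionary numberOfFields="3">
    <DataField name="Gender" optype="categorical" dataType="string"/>
    <DataField name="ParentIncome" optype="continuous" dataType="double"/>
    <DataField name="CollegePlans" optype="categorical" dataType="string"/>
  </DataDictionary>
  <TreeModel modelName="CollegePlanModel" functionName="classification">
    <MiningSchema>
      <MiningField name="Gender"/>
      <MiningField name="ParentIncome"/>
      <MiningField name="CollegePlans" usageType="target"/>
    </MiningSchema>
    <Node score="No">
      <True/>
      <Node score="Yes">
        <SimplePredicate field="ParentIncome" operator="greaterThan" value="50000"/>
      </Node>
    </Node>
  </TreeModel>
</PMML>
""".strip()

# Namespace prefix map used in the ElementTree path expressions below.
NS = {"p": "http://www.dmg.org/PMML-4_4"}


def describe(pmml_text):
    """Return the declared fields, model name, and target of a PMML tree model."""
    root = ET.fromstring(pmml_text)
    fields = [f.get("name")
              for f in root.findall("p:DataDictionary/p:DataField", NS)]
    model = root.find("p:TreeModel", NS)
    target = [m.get("name")
              for m in model.findall("p:MiningSchema/p:MiningField", NS)
              if m.get("usageType") == "target"]
    return {"version": root.get("version"), "fields": fields,
            "model": model.get("modelName"), "target": target}


if __name__ == "__main__":
    print(describe(PMML_DOC))
```

The document only describes the model (data dictionary, mining schema, tree nodes); it says nothing about how the model was trained, which is why PMML is called a descriptive language above.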
The Data Mining World According to
The Data Mining Software Vendors
Market Competition
Disclaimer
This presentation contains preliminary information that may be changed substantially prior to final
commercial release of the software described herein.
The information contained in this presentation represents the current view of Microsoft Corporation on the
issues discussed as of the date of the presentation. Because Microsoft must respond to changing
market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft
cannot guarantee the accuracy of any information presented after the date of the presentation.
This presentation is for informational purposes only. MICROSOFT MAKES NO WARRANTIES,
EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.
Microsoft may have patents, patent applications, trademarks, copyrights, or other intellectual property
rights covering subject matter in this presentation. Except as expressly provided in any written license
agreement from Microsoft, the furnishing of this information does not give you any license to these
patents, trademarks, copyrights, or other intellectual property.
© 2005 Microsoft Corporation. All rights reserved.
Major Data Mining Vendors
• Platforms
 IBM
 Oracle
 SAS
• Tools
 SPSS
 Angoss
 KXEN
 Megaputer
 FairIsaac
 Insightful
Competition
Product: SQL Server 2005: SQL Server Analysis Services | Oracle 10g: Oracle Data Mining | IBM: DB2 Intelligent Miner, WebSphere | SAS: Enterprise Miner
Link: Oracle 10g: http://otn.oracle.com/products/bi/odm/odmining.html | IBM: http://www-306.ibm.com/software/data/iminer/ | SAS: http://www.sas.com/technologies/analytics/datamining/miner/factsheet.pdf
API: SQL Server 2005: OLEDB/DM, DMX, XMLA, ADOMD.Net | Oracle 10g: Java DM, PL/SQL | IBM: SQL MM/6 based on UDF, SQL SPROC | SAS: SAS Script
Algorithms: SQL Server 2005: 7 (+2) | Oracle 10g: 8 | IBM: 6 | SAS: 8+
Text Mining: SQL Server 2005: Yes | Oracle 10g: Yes | IBM: Yes | SAS: Yes
Marketing Pages: SQL Server 2005: N/A | Oracle 10g: 18 | IBM: 10 | SAS: Dozens
Client Tools
Embeddable Viewers, Reporting Services
Analysis tools, Web-based targeted reports
WebSphere Portal (vertical solution)
None
Discoverer
Excel AddIn
IM Visualization
Distribution: SQL Server 2005: Included | Oracle 10g: Additional Package | IBM: Additional Packages | SAS: Separate Product
Target: SQL Server 2005: Developers | Oracle 10g: Developers | IBM: DB2 IM Scoring module is for developers; other modules are for analysts | SAS: Analysts
Strengths
Powerful yet simple API
Good credibility with enterprise customers
Integration with other BI technologies
New GUI, Leader of JDM API
Mature, Market Leader. Extensive customization and modelling abilities. Robust, industry tested and accepted algorithms and methodologies. Export to DB2 Scoring.
New GUI
CRM Integration
Mature product (6 years). Good service model. Scoring inside relational engine. Strong partnership with SAS
Weaknesses
Not in-process with relational engine. Lacking statistical functions
Poor Analyst experience
API overly complex
High price. Standard Functionality. Poor API (SQL MM). Confusing product line.
Expensive. Proprietary. Customer relations range from congenial to hostile.
Inconsistent
Major DM Vendors
 SAS Institute (Enterprise Miner)
 IBM (DB2 Intelligent Miner for Data)
 Oracle (ODM option to Oracle 10g)
 SPSS (Clementine)
 Unica Technologies, Inc. (Pattern Recognition Workbench)
 Insightful (Insightful Miner)
 KXEN (Analytic Framework)
 Prudsys (Discoverer and its family)
 Microsoft (SQL Server 2005)
 Angoss (KnowledgeServer and its family)
 DBMiner (DBMiner)
 etc…
ORACLE
Strengths
 Oracle Data Mining (ODM) Integrated into relational engine
– Performance benefits
– Management integration
– SQL Language integration
 ODM Client
– “Walks through” Data Mining Process
– Data Mining tailored data preparation
– Generates code
 Integration into Oracle CRM
– “EZ” Data Mining for customer churn, other applications
 Full suite of algorithms
– Typical algorithms, plus text mining and bioinformatics
 Nice marketing/user education
ORACLE
Weaknesses
 Additional Licensing Fees (base $400/user, $20K proc)
 Confusing API Story
– Certain features only work with Java API
– Certain features only work with PL/SQL API
– Same features work differently with different APIs
 Difficult to use
– Different modeling concepts for each algorithm
 Poor connectivity – ORACLE only
SAS
• Entrenched Data Mining Leader
 Market Share
 Mind Share
• “Best of Breed”
 Always will attract the top ?% of customers
• Overall poor product
 Only for the expert user (SAS Philosophy)
 Integration of results generally involves source code
• Integrated with ETL, other SAS tools
• Partnership with IBM
 Model in SAS, deploy in DB2
My View ...





• DBMS pachyderms have made some progress toward high-level data models and integration with SQL, but
  – Closed systems,
  – Lacking in coverage and user-extensibility.
• Not as popular as dedicated, stand-alone, open-software DM systems, such as Weka
• OS experience again?
Weka

• A comprehensive set of DM algorithms and tools.
• Generic algorithms over arbitrary data sets.
  – Independent of the number of columns in tables.
• Open and extensible system based on Java.
* These are the desiderata for a DSMS (or a CEP system) that supports the data stream mining task
References







[Imielinski’ 96] Tomasz Imielinski and Heikki Mannila. A database perspective on knowledge discovery. Commun. ACM, 39(11):58–64, 1996.
[Sarawagi’ 98] S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD, 1998.
T. Imielinski and A. Virmani. MSQL: a query language for database mining. Data Mining and Knowledge Discovery, 3:373–408, 1999.
J. Han, Y. Fu, W. Wang, K. Koperski, and O. R. Zaiane. DMQL: A data mining query language for relational databases. In Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD), pages 27–33, Montreal, Canada, June 1996.
R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. In VLDB, pages 122–133, Bombay, India, 1996.
Marco Botta, Jean-Francois Boulicaut, Cyrille Masson, and Rosa Meo. Query languages supporting descriptive rule mining: A comparative study. In Database Support for Data Mining Applications, pages 24–51, 2004.
Road Map for Next Three Weeks


• Fast & Light Algorithms for Mining Data Streams
  – Classifiers and Classifier Ensembles,
  – Clustering methods,
  – Association Rules,
  – Time series
• Supporting the mining task in a DSMS
  – Data Mining Query Languages and support for the mining process
  – Toward a Data stream mining workbench
References







IBM. DB2 Intelligent Miner. http://www-306.ibm.com/software/data/iminer
ORACLE. Oracle Data Miner Release 10gR2. http://www.oracle.com/technology/products/bi/odm.
Z. Tang, J. Maclennan, and P. Kim. Building data mining solutions with OLE DB for DM and XML for Analysis. SIGMOD Record, 34(2):80–85, 2005.
Data Mining Group (DMG). Predictive model markup language (PMML). http://sourceforge.net/projects/pmml.
Carlo Zaniolo: Mining Databases and Data Streams with Query Languages and Rules. Invited Talk, Fourth International Workshop on Knowledge Discovery in Inductive Databases, KDID 2005.
Hetal Thakkar Mozafari and Carlo Zaniolo: Designing an
Hetal Thakkar, Nikolay Laptev, Hamid Mousavi, Barzan Mozafari, Vincenzo Russo, Carlo Zaniolo: SMM: A data stream management system for Knowledge Discovery. ICDE 2011: 757-768
Thank you!
Supporting DM Tasks and the Process in DSMS or a CEP System
• I had a dream: WEKA for Data Streams! But with a DSMS we have to start from SQL rather than Java!
• Case Study: Naïve Bayesian Classifiers, arguably the simplest mining algorithm. It is doable in SQL/DBMS. Is it also doable in SQL/DSMS?
• What about the various CEP systems, which claim to be powerful (e.g., support rules)? Can they support NBC? In general, can they be extended to support generic versions of NBC, and perhaps other data stream mining methods?
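To make the case study concrete, here is a minimal sketch of a Naïve Bayesian classifier over categorical attributes, written generically so that it does not depend on the number of columns. It is an illustration, not the course's reference implementation: the class name `NaiveBayes`, the Laplace smoothing, and the toy `SAMPLES` data are all assumptions.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Categorical Naive Bayes with Laplace smoothing, for any number of columns."""

    def __init__(self):
        self.class_counts = Counter()             # label -> number of samples
        self.value_counts = defaultdict(Counter)  # (column, label) -> value counts
        self.values_seen = defaultdict(set)       # column -> distinct values

    def learn(self, row, label):
        """Update the counts with one pre-classified sample."""
        self.class_counts[label] += 1
        for col, val in enumerate(row):
            self.value_counts[(col, label)][val] += 1
            self.values_seen[col].add(val)

    def predict(self, row):
        """Return the label with the highest smoothed log-probability."""
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            score = math.log(n / total)
            for col, val in enumerate(row):
                count = self.value_counts[(col, label)][val]
                k = len(self.values_seen[col]) or 1
                score += math.log((count + 1) / (n + k))
            if score > best_score:
                best, best_score = label, score
        return best


# Hypothetical toy data: (outlook, temperature) -> play?
SAMPLES = [
    (("sunny", "hot"), "no"),
    (("sunny", "mild"), "no"),
    (("rainy", "mild"), "yes"),
    (("overcast", "hot"), "yes"),
    (("overcast", "mild"), "yes"),
]

if __name__ == "__main__":
    nbc = NaiveBayes()
    for row, label in SAMPLES:
        nbc.learn(row, label)
    print(nbc.predict(("sunny", "hot")))
```

Training only increments counters, one sample at a time, which is why the algorithm maps naturally onto SQL aggregates and, at least in principle, onto incremental computation over a stream.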
Assignment: due on May 10th.
Download a DSMS or a CEP system of your choice and (after explaining why you have
selected this and not the others) explore how you can implement the following
tasks:
1. Testing of a Naïve Bayesian Classifier: you can assume that the NBC has already
been trained and you can read it from the input, or a DB, a file, or memory.
2. Assume now that you also have a stream of pre-classified samples. Use this to
determine the accuracy of your current classifier, at periodic intervals. Output
the accuracy, and if this falls below a certain threshold repeat Step 1.
3. Periodically retrain a new NBC from the stream of pre-classified tuples; then
use the newly built classifier to predict the class of unclassified tuples (Step 1).
4. See if you can generalize your software, and e.g., design/develop generic NBCs,
ensemble methods, other classifiers, etc.
It is understood that the limitations of DSMS and CEP systems will probably
prevent you from completing all these tasks (listed in order of increasing
difficulty). So, you should make sure that you (1) download a good system (but
not Stream Mill), and (2) write a clear report explaining your efforts and the reasons
that prevented you from going further. (For test sets, refer to CS240A.)
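As a language-neutral reference for tasks 2 and 3, the following sketch shows one possible test-then-retrain loop in plain Python, outside any DSMS or CEP system. The function names (`train`, `classify`, `monitor`), the window size, the threshold, and the policy of retraining from the most recent window when accuracy drops are all assumptions made for illustration; the assignment itself leaves these choices open.

```python
import math
from collections import Counter, deque


def train(samples):
    """Build a categorical Naive Bayes model from (row, label) pairs."""
    classes = Counter(label for _, label in samples)
    counts = Counter()   # (column, value, label) -> frequency
    domains = {}         # column -> set of values seen
    for row, label in samples:
        for col, val in enumerate(row):
            counts[(col, val, label)] += 1
            domains.setdefault(col, set()).add(val)
    return classes, counts, domains


def classify(model, row):
    """Return the most probable class for row (Laplace-smoothed)."""
    classes, counts, domains = model
    total = sum(classes.values())

    def log_score(label):
        n = classes[label]
        score = math.log(n / total)
        for col, val in enumerate(row):
            k = len(domains.get(col, ())) or 1
            score += math.log((counts[(col, val, label)] + 1) / (n + k))
        return score

    return max(classes, key=log_score)


def monitor(model, labeled_stream, window=4, threshold=0.75):
    """Yield (accuracy, retrained) once per full window of labeled samples."""
    recent = deque(maxlen=window)   # the most recent pre-classified samples
    hits = 0
    for i, (row, label) in enumerate(labeled_stream, start=1):
        hits += int(classify(model, row) == label)   # test first ...
        recent.append((row, label))                  # ... then keep for retraining
        if i % window == 0:
            accuracy = hits / window
            retrained = accuracy < threshold
            if retrained:
                model = train(list(recent))
            yield accuracy, retrained
            hits = 0


if __name__ == "__main__":
    initial = train([(("a",), "x"), (("b",), "y")])
    stable = [(("a",), "x"), (("b",), "y")] * 2
    drifted = [(("a",), "y"), (("b",), "x")] * 2
    for accuracy, retrained in monitor(initial, stable + drifted + drifted):
        print(accuracy, retrained)
```

In the usage example the labels flip in the second window (a simple concept drift), so accuracy collapses, the model is rebuilt from that window, and accuracy recovers in the third. Expressing the same windowed aggregation and model swap in a DSMS query language is the actual challenge of the assignment.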