An Investigation into Commercial Data Mining


An Evaluation of Commercial Data Mining
Proposed and presented by Emily Davis
Supervisor: John Ebden
Statement of the Problem
An evaluation of commercial data mining capabilities, for example Oracle9i's Data Mining Suite.
Background
Data mining is a relatively new offshoot of database technology. It has arisen from the ability of computers to:
• Store vast quantities of data in data warehouses.
• Implement ingenious algorithms for mining data.
• Use these algorithms to analyse these vast quantities of data in a reasonable amount of time.
Data mining discovers the patterns in data that represent knowledge.
• Of interest are which algorithms data mining suites use, how well each category of data mining algorithm performs on data, and what kind of results it produces.
• Another important issue is the usability of the algorithms.


Random Number Example (taken from http://www.saltspring.com/brochmann/math/mining/mining1.html)
 #   data a      data b      data c
 1   0.71132700  0.15379400  1.88403600
 2   0.62219935  0.83119106  3.73797189
 3   0.33872289  0.80881084  3.10387831
 4   0.54262732  0.35427095  2.14806749
 5   0.50631348  0.71599532  3.16061290
 6   0.00132503  0.22447315  0.67606951
 7   0.76211535  0.94620700  4.36285170
 8   0.91026206  0.89499186  4.50549970
 9   0.92640874  0.47156928  3.26752532
10   0.49323546  0.27673696  1.81668179
11   0.04501477  0.30142353  0.99430013
12   0.49180000  0.17909135  1.52087404
13   0.06747225  0.85629071  2.70381663
14   0.84239974  0.41916601  2.94229750
...
49   0.07845276  0.69584199  2.24443147
50   0.07548299  0.52973340  1.74016616
51   0.72301849  0.97594044  ????????
Data a and data b are random numbers generated in Excel.
Data c = 2*(data a) + 3*(data b).

51st value calculated by Excel: 4.37385831

Value calculated using Knowledge Miner, a Macintosh data mining tool: 4.34791231.
Equation found by Knowledge Miner: 1.97*(data a) + 2.96*(data b) + 0.0324
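The first experiment can be reproduced in outline with an ordinary least-squares fit. The sketch below is an illustration only: the slides do not say how Knowledge Miner works, so this is not its method. It fits c = w_a*a + w_b*b (no intercept, an assumption made here) to the first five rows of the random number table and then predicts row 51. The function name `fit_two_coefficients` is hypothetical.

```python
# First five rows of the table: (data a, data b, data c).
rows = [
    (0.71132700, 0.15379400, 1.88403600),
    (0.62219935, 0.83119106, 3.73797189),
    (0.33872289, 0.80881084, 3.10387831),
    (0.54262732, 0.35427095, 2.14806749),
    (0.50631348, 0.71599532, 3.16061290),
]


def fit_two_coefficients(rows):
    """Least-squares fit of c = w_a*a + w_b*b via the 2x2 normal equations."""
    s_aa = sum(a * a for a, b, c in rows)
    s_ab = sum(a * b for a, b, c in rows)
    s_bb = sum(b * b for a, b, c in rows)
    s_ac = sum(a * c for a, b, c in rows)
    s_bc = sum(b * c for a, b, c in rows)
    # The determinant is nonzero as long as columns a and b are not proportional.
    det = s_aa * s_bb - s_ab * s_ab
    w_a = (s_ac * s_bb - s_ab * s_bc) / det
    w_b = (s_aa * s_bc - s_ab * s_ac) / det
    return w_a, w_b


w_a, w_b = fit_two_coefficients(rows)

# Inputs of row 51, whose data c value is the unknown.
a51, b51 = 0.72301849, 0.97594044
fitted_prediction = w_a * a51 + w_b * b51

# The equation as printed on the slide (its coefficients look rounded).
slide_equation = 1.97 * a51 + 2.96 * b51 + 0.0324

print(f"fitted coefficients: {w_a:.4f}, {w_b:.4f}")
print(f"least-squares prediction for row 51: {fitted_prediction:.6f}")
print(f"slide equation evaluated for row 51: {slide_equation:.6f}")
```

Because data c was generated exactly from the formula, the fit should recover coefficients very close to 2 and 3. The slide's equation is printed with only three significant digits per coefficient, so evaluating it as printed need not reproduce the reported 4.34791231 exactly.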
The experiment was repeated using three columns of random numbers and this equation:
Data d = 23*(data a) - 4.5*(data b) + (data a + data c).
• The last five entries for data d were missing from the column.

• These were generated by Excel:
14.7314558
12.0720505
22.0008992
7.52633344
5.25167700
• These are what Knowledge Miner predicted:
14.7341613
12.0731391
22.0080223
7.52465867
5.24861860
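To put a number on how close the predictions are, the sketch below compares the two columns above, treating the Excel values as the reference. It is only an illustration; the variable names are chosen here and do not come from the slides.

```python
# Values generated by Excel (reference) and Knowledge Miner's predictions,
# copied from the two lists above.
excel = [14.7314558, 12.0720505, 22.0008992, 7.52633344, 5.25167700]
predicted = [14.7341613, 12.0731391, 22.0080223, 7.52465867, 5.24861860]

# Absolute and relative error of each prediction.
abs_errors = [abs(p - e) for p, e in zip(predicted, excel)]
rel_errors = [err / abs(e) for err, e in zip(abs_errors, excel)]

max_abs_error = max(abs_errors)
mean_abs_error = sum(abs_errors) / len(abs_errors)
max_rel_error = max(rel_errors)

for e, p, err in zip(excel, predicted, abs_errors):
    print(f"{e:12.8f}  {p:12.8f}  {err:.8f}")
print(f"max absolute error:  {max_abs_error:.6f}")
print(f"mean absolute error: {mean_abs_error:.6f}")
print(f"max relative error:  {max_rel_error:.6%}")
```

Every prediction lies within 0.01 of the Excel value, which is a relative error below 0.1% for all five entries.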

Plan of Action
• Literature Survey (and other resources)
• Install Software for Oracle
• Get to know the Oracle Suite
• Evaluate Oracle9i's Data Mining Suite

Install Software for Oracle
• Including JDeveloper.
• May be extended to the installation of other commercial data mining suites, e.g.:
  DB2's Intelligent Miner
  Informix's Data Mine

Investigate Oracle9i's Data Mining Suite


• Two major algorithm types: supervised and unsupervised learning.
• A medical example:
  Supervised learning: researchers input medical profiles into a leukaemia model to predict propensity for the disease.
  Unsupervised learning: the algorithm searches for clusters of related information in data sets to reveal insights about diseases and patient populations.
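The distinction can be illustrated with a toy sketch. Everything in it is invented for illustration: the numbers are hypothetical, not medical data, and the two functions are simple textbook techniques, not the algorithms in Oracle's suite. The supervised part predicts a label from labelled examples with a 1-nearest-neighbour rule; the unsupervised part groups unlabelled values into two clusters with a basic two-means loop.

```python
# Hypothetical toy data: one numeric measurement per patient.
labelled = [(1.0, "low risk"), (1.2, "low risk"), (4.8, "high risk"), (5.1, "high risk")]
unlabelled = [0.9, 1.1, 1.3, 4.7, 5.0, 5.2]


def predict_nearest(value, examples):
    """Supervised: return the label of the closest labelled example."""
    return min(examples, key=lambda ex: abs(ex[0] - value))[1]


def two_means(values, iterations=10):
    """Unsupervised: split values into two clusters around two moving centres.

    Assumes the values contain at least two distinct numbers.
    """
    low, high = min(values), max(values)  # initial centres
    for _ in range(iterations):
        # Assign each value to the nearer centre, then move the centres.
        near_low = [v for v in values if abs(v - low) <= abs(v - high)]
        near_high = [v for v in values if abs(v - low) > abs(v - high)]
        low = sum(near_low) / len(near_low)
        high = sum(near_high) / len(near_high)
    return near_low, near_high


print(predict_nearest(4.5, labelled))
print(two_means(unlabelled))
```

The supervised function needs the labels to make a prediction, whereas the clustering function is given no labels at all and only reports which values belong together.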
Get to know the Oracle DM Suite (a major task).
• Explore JDeveloper, Oracle's Java development environment, and Oracle9i's Java-based data mining API.
• The API follows JDM (Java Data Mining), used by Oracle, Sun, IBM and others.
• Explore DM4J (Data Mining for Java), the new graphical user interface for Oracle DM.

Addressing the Problem:
• Run the different algorithms available in the data mining suite.
• Document and analyse the results in terms of the performance and effectiveness of each algorithm.

Expected Results:

The ability to say conclusively whether Oracle's data mining capabilities are inferior or superior to anything else in the marketplace, and why this can be stated.
Possible Extensions to the Project:
• To have sufficient knowledge of the topic to give recommendations or feedback:
  to Oracle regarding their data mining suite.
  to IT customers wanting to purchase data mining suites.
• Explore the field of random stereograms: could a computer see them? If not, why not?
Literature Survey
• Principles of Data Mining by David Hand, Heikki Mannila and Padhraic Smyth, Cambridge, Massachusetts, MIT Press, 2001 - algorithmic concepts
• Data Mining: Concepts and Techniques by Jiawei Han and Micheline Kamber, San Francisco, California, Morgan Kaufmann, 2001 - algorithmic evaluations
• Data Mining: A Tutorial-Based Primer by Richard J. Roiger and Michael W. Geatz, Boston, Massachusetts, Addison Wesley, 2003 - practical knowledge and processing
• Data Mining by Pieter Adriaans and Dolf Zantinge, Harlow, England, Addison Wesley, 1996 - real life application
• Data Mining and Statistical Analysis Using SQL by Robert P. Trueblood and John N. Lovett, Jnr., USA, Apress, 2001 - statistical principles
• Data Mining Using SAS Applications by George Fernandez, USA, Chapman and Hall/CRC, 2003 - methodologies
• Mastering Data Mining: The Art and Science of Customer Relationship Management by Michael J.A. Berry and Gordon S. Linoff, USA, Wiley Computer Publishing, 2000 - building effective models
• Data Preparation for Data Mining by Dorian Pyle, San Francisco, California, Morgan Kaufmann, 2000 - demo code, 10 Golden Rules
• The white paper Data Mining - Beyond Algorithms by Dr Akeel Al-Attar, available at http://www.attar.com/tutor/mining.htm
• Summary from the KDD-03 Panel, Data Mining: The Next Ten Years, available at http://www.acm.org/sigs/sigkdd/explorations/issue5-2/pnl_10yrs_final1.pdf
• Oracle website
• Oracle Magazine