Integrated Rule-Based Data Mining System
C.-C. Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003
USA
[email protected]
UA Faculty Forum 2008 by C.-C. Chan
Outline
Overview of Data Mining
Software Tools
A Rule-Based System for Data Mining
Concluding Remarks
Data Mining (KDD)
From Data to Knowledge
Process of KDD (Knowledge Discovery in Databases)
Related Technologies
Comparisons
Why KDD?
"We are drowning in information, but starving for knowledge." (John Naisbitt)
Growing Gap between Data Generation and Data
Understanding:
Automation of business activities:
Telephone calls, credit card charges, medical tests, etc.
Earth observation satellites:
Estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of
one picture per second.
Biology:
The Human Genome database project has collected gigabytes of data on the
human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
US Census data:
NASA databases:
…
World Wide Web:
Process of KDD
[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol.1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an
overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al (Eds.), MIT Press, 1996.
Process of KDD
1. Selection
   Learning the application domain
   Creating a target dataset
2. Pre-Processing
   Data cleaning and preprocessing
3. Transformation
   Data reduction and projection
4. Data Mining
   Choosing the functions and algorithms of data mining
   Association rules, classification rules, clustering rules
5. Interpretation and Evaluation
   Validate and verify discovered patterns
6. Using discovered knowledge
Typical Data Mining Tasks
Finding Association Rules [Rakesh Agrawal et al, 1993]
Each transaction is a set of items.
Given a set of transactions, an association rule is of the
form X → Y, where X and Y are sets of items.
e.g.: 30% of transactions that contain beer also contain diapers (the rule's confidence);
2% of all transactions contain both of these items (the rule's support).
Applications:
Market basket analysis and cross-marketing
Catalog design
Store layout
Buying patterns
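The support and confidence of a rule X → Y can be computed directly from their definitions. The following is a minimal sketch, not a mining algorithm such as Apriori; the transactions and the function names `support` and `confidence` are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of all transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, lhs, rhs):
    """Fraction of the transactions containing lhs that also contain rhs."""
    both = support(transactions, set(lhs) | set(rhs))
    return both / support(transactions, lhs)


# Hypothetical market-basket data: each transaction is a set of items.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"beer", "milk"},
    {"milk", "bread"},
    {"diapers", "bread"},
]

# Rule: {beer} -> {diapers}
rule_support = support(transactions, {"beer", "diapers"})
rule_confidence = confidence(transactions, {"beer"}, {"diapers"})
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

In this toy data, two of the five transactions contain both items, and two of the three beer transactions also contain diapers.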
Finding Sequential Patterns
Each data sequence is a list of transactions.
Find all sequential patterns with a user-specified minimum support.
e.g.: Consider a book-club database
A sequential pattern might be
5% of customers bought “Harry Potter I”, then “Harry Potter II”,
and then “Harry Potter III”.
Applications:
Add-on sales
Customer satisfaction
Identify symptoms/diseases that precede certain diseases
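Checking a single sequential pattern against a user-specified minimum support can be sketched as follows. This only tests one candidate pattern; it does not enumerate all patterns as a full sequential-pattern miner would. The customer data and the names `contains_pattern` and `sequence_support` are hypothetical.

```python
def contains_pattern(sequence, pattern):
    """True if the pattern's itemsets occur, in order, in distinct transactions."""
    pos = 0
    for transaction in sequence:
        if pos < len(pattern) and set(pattern[pos]) <= set(transaction):
            pos += 1
    return pos == len(pattern)


def sequence_support(sequences, pattern):
    """Fraction of data sequences (customers) that contain the pattern."""
    return sum(contains_pattern(s, pattern) for s in sequences) / len(sequences)


# Hypothetical book-club data: one list of transactions per customer.
customers = [
    [{"Harry Potter I"}, {"Harry Potter II"}, {"Harry Potter III"}],
    [{"Harry Potter I"}, {"Harry Potter III"}],
    [{"Harry Potter II"}, {"Harry Potter I"}, {"Harry Potter II", "Atlas"}, {"Harry Potter III"}],
    [{"Harry Potter III"}, {"Harry Potter II"}, {"Harry Potter I"}],
]
pattern = [{"Harry Potter I"}, {"Harry Potter II"}, {"Harry Potter III"}]

min_support = 0.05  # user-specified threshold (5%)
s = sequence_support(customers, pattern)
print(s, s >= min_support)
```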
Finding Classification Rules
Finding discriminant rules for objects of different classes.
Approaches:
Finding Decision Trees
Finding Production Rules
Applications:
Process loans and credit cards applications
Model identification
Text Mining
Web Usage Mining
Etc.
Related Technologies
Database Systems
MS SQL server
Transaction databases
OLAP (Data Cubes)
Data Mining
Decision Trees
Clustering Tools
Machine Learning/Data Mining Systems
CART (Classification And Regression Trees)
C 5.x (Decision Trees)
WEKA (Waikato Environment for Knowledge Analysis)
LERS
ROSE 2
Rule-Based Expert System Development Environments
CLIPS, JESS
EXSYS
Web-based Platforms
Java
MS .Net
Comparisons
System         Preprocessing  Learning / Data Mining                 Inference Engine  End-User Interface  Web-Based Access  Reasoning with Uncertainties
MS SQL Server  N/A            Decision Trees, Clustering             N/A               N/A                 N/A               N/A
CART, C 5.x    N/A            Decision Trees                         Built-in          Embedded            N/A               N/A
WEKA           Yes            Trees, Rules, Clustering, Association  N/A               Embedded            Need Programming  N/A
CLIPS, JESS    N/A            N/A                                    Built-in          Embedded            Need Programming  3rd-party extensions
Rule-Based Data Mining System
Objectives
Develop an integrated rule-based data mining system
that provides:
Synergy of database systems, machine learning, and
expert systems
Dealing with uncertain rules
Delivery of web-based user interface
Structure of Rule-Based Systems
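[Figure: structure of a rule-based system. Components: Rule Base, Working Memory, Matcher, Selector, Execution. A decision node labeled "Answer" has Yes and No branches; the output is the Inference Result.]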
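One conventional reading of these components is a match-select-execute cycle that repeats until an answer is reached. The sketch below assumes that reading; the rule format, the first-match selection strategy, and the example rules are illustrative assumptions, not the design of any particular system.

```python
def matcher(rules, memory):
    """Return rules whose conditions all hold and whose conclusion is new."""
    return [r for r in rules if r["if"] <= memory and r["then"] not in memory]


def selector(candidates):
    """Conflict resolution: here simply the first matching rule (an assumption)."""
    return candidates[0]


def infer(rules, facts, goal):
    """Cycle match -> select -> execute until the goal is derived or no rule fires."""
    memory = set(facts)                    # working memory
    while goal not in memory:              # answer reached? No -> keep cycling
        candidates = matcher(rules, memory)
        if not candidates:
            return None, memory            # nothing left to fire, no answer
        rule = selector(candidates)
        memory.add(rule["then"])           # execution updates working memory
    return goal, memory                    # answer reached? Yes -> inference result


# Hypothetical rule base.
rules = [
    {"if": {"has_feathers"}, "then": "is_bird"},
    {"if": {"is_bird", "cannot_fly", "swims"}, "then": "is_penguin"},
]

result, memory = infer(rules, {"has_feathers", "cannot_fly", "swims"}, "is_penguin")
print(result, sorted(memory))
```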
System Workflow
[Figure: Input Data Set → Data Preprocessing → Rule Generator → User Interface Generator]
Input Data Set:
Text file with comma separated values (CSV)
It is assumed that there are N columns of values corresponding to N
variables or parameters, which may be real or symbolic values.
The first N – 1 variables are considered as inputs and the last one is
the output variable.
Data Preprocessing:
Discretize domains of real variables into a finite number of intervals
Discretized data file is then used to generate an attribute
information file and a training data file.
Rule Generator:
A symbolic learning program called BLEM2 is used to generate
rules with uncertainty
User Interface Generator:
Generate a web-based rule-based system from a rule file and
corresponding attribute file
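The input and preprocessing steps can be sketched as follows: read a CSV with N columns, treat the last column as the output variable, and discretize a real-valued input into intervals. The slides do not say which discretization method the system uses; equal-width binning is an assumption here, and the sample data and function names are hypothetical. Rule generation with BLEM2 is not shown.

```python
import csv
import io


def load_csv(text):
    """Split each row into N - 1 input values and the final output value."""
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    inputs = [row[:-1] for row in rows]
    outputs = [row[-1] for row in rows]
    return inputs, outputs


def equal_width_bins(values, n_bins):
    """Map each real value to an interval index 0..n_bins-1 (equal-width, assumed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins or 1.0      # avoid zero width when all values are equal
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical data set: temperature (real), cough (symbolic), diagnosis (output).
sample = """36.6,no,healthy
37.2,no,healthy
38.5,yes,sick
40.0,yes,sick
"""

inputs, outputs = load_csv(sample)
temperatures = [float(row[0]) for row in inputs]
bins = equal_width_bins(temperatures, n_bins=2)

# Discretized training rows: real values replaced by symbolic interval labels.
training_rows = [
    [f"temp_bin{b}"] + row[1:] + [out]
    for b, row, out in zip(bins, inputs, outputs)
]
print(training_rows)
```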
Architecture of RBC generator
[Figure: Client, Middle Tier, and SQL DB server, connected by Requests and Responses]
Workflow of RBC generator
[Figure: Rule set File and Metadata File → RBC Generator → SQL Rule Table and Rule Table Definition]
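The idea of turning a rule set and attribute metadata into an SQL rule table can be sketched with SQLite. Everything specific here is an assumption made for illustration: the table layout (one column per attribute, NULL meaning "any value"), the `certainty` column, the example rules, and the highest-certainty selection policy. The actual schema and SQL server used by the system are not described in the slides.

```python
import sqlite3

# Hypothetical metadata (attribute names) and rule set (conditions, decision, certainty).
attributes = ["temperature", "cough"]
rules = [
    ({"temperature": "high"}, "sick", 0.9),
    ({"temperature": "normal", "cough": "no"}, "healthy", 1.0),
    ({"cough": "yes"}, "sick", 0.6),
]


def build_rule_table(conn, attributes, rules):
    """Create the rule table definition from the metadata, then load the rules."""
    # Attribute names come from trusted metadata, so they are interpolated directly.
    cols = ", ".join(f"{a} TEXT" for a in attributes)
    conn.execute(
        f"CREATE TABLE rules (rule_id INTEGER PRIMARY KEY, {cols}, "
        "decision TEXT, certainty REAL)"
    )
    names = ", ".join(attributes + ["decision", "certainty"])
    marks = ", ".join("?" for _ in range(len(attributes) + 2))
    for conditions, decision, certainty in rules:
        values = [conditions.get(a) for a in attributes] + [decision, certainty]
        conn.execute(f"INSERT INTO rules ({names}) VALUES ({marks})", values)


def classify(conn, attributes, case):
    """Return (decision, certainty) of the most certain matching rule, or None."""
    where = " AND ".join(f"({a} IS NULL OR {a} = ?)" for a in attributes)
    return conn.execute(
        f"SELECT decision, certainty FROM rules WHERE {where} "
        "ORDER BY certainty DESC LIMIT 1",
        [case[a] for a in attributes],
    ).fetchone()


conn = sqlite3.connect(":memory:")
build_rule_table(conn, attributes, rules)
print(classify(conn, attributes, {"temperature": "high", "cough": "yes"}))
```

Keeping the rules in a table lets a middle tier answer classification requests with a single query instead of embedding rule logic in application code.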
Concluding Remarks
A system for generating rule-based classifiers from data
with the following benefits:
No end-user programming needed
Automatic rule-based system creation
Web-based delivery system provides easy access
Project Status
The current version (1.4) of our system provides
fundamental features for data mining,
including:
Data Preprocessing
Management of preprocessed data files
Machine Learning tool to generate rules from data
Rule-Based Classifier system supporting uncertain rules
Web-Based access
Future Work
More advanced features in Data Preprocessing such as
data cleansing, data transformation, and data statistics
Learning from multi-criteria inputs with preferential
rankings to support Multiple Criteria Decision
Making processes
Concept-Oriented information retrieval and search
Thank You!