Integrated Rule-Based Data Mining System

C.-C. Chan
Department of Computer Science
University of Akron
Akron, OH 44325-4003
USA
UA Faculty Forum 2008 by C.-C. Chan
Outline
• Overview of Data Mining
• Software Tools
• A Rule-Based System for Data Mining
• Concluding Remarks
Data Mining (KDD)
• From Data to Knowledge
• Process of KDD (Knowledge Discovery in Databases)
• Related Technologies
• Comparisons
Why KDD?
“We are drowning in information, but starving for knowledge.” (John Naisbitt)
Growing gap between data generation and data understanding:
• Automation of business activities:
  • Telephone calls, credit card charges, medical tests, etc.
• Earth observation satellites:
  • Estimated to generate one terabyte (10^12 bytes) of data per day, at a rate of one picture per second.
• Biology:
  • The Human Genome database project has collected gigabytes of data on the human genetic code [Fasman, Cuticchia, Kingsbury, 1994].
• US Census data
• NASA databases
• …
• World Wide Web
Process of KDD
[1] Fayyad, U., Editorial, Int. J. of Data Mining and Knowledge Discovery, Vol. 1, Issue 1, 1997.
[2] Fayyad, U., G. Piatetsky-Shapiro, and P. Smyth, "From data mining to knowledge discovery: an overview," in Advances in Knowledge Discovery and Data Mining, Fayyad et al. (Eds.), MIT Press, 1996.
Process of KDD
1. Selection
   • Learning the application domain
   • Creating a target dataset
2. Pre-Processing
   • Data cleaning and preprocessing
3. Transformation
   • Data reduction and projection
4. Data Mining
   • Choosing the functions and algorithms of data mining
   • Association rules, classification rules, clustering rules
5. Interpretation and Evaluation
   • Validate and verify discovered patterns
6. Using discovered knowledge
Typical Data Mining Tasks
• Finding Association Rules [Rakesh Agrawal et al., 1993]
  • Each transaction is a set of items.
  • Given a set of transactions, an association rule is of the form X ⇒ Y, where X and Y are sets of items.
  • e.g.: 30% of transactions that contain beer also contain diapers (the confidence of the rule); 2% of all transactions contain both of these items (the support of the rule).
  • Applications:
    • Market basket analysis and cross-marketing
    • Catalog design
    • Store layout
    • Buying patterns
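
The two percentages in the beer and diapers example are the confidence and the support of the rule {beer} ⇒ {diapers}. The following is a minimal Python sketch of how they could be computed. The transactions are made up for illustration, and the function names `support` and `confidence` are assumptions, not part of the talk.

```python
from typing import FrozenSet, List

Transaction = FrozenSet[str]


def support(transactions: List[Transaction], items: FrozenSet[str]) -> float:
    """Fraction of all transactions that contain every item in `items`."""
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(transactions: List[Transaction],
               x: FrozenSet[str], y: FrozenSet[str]) -> float:
    """Fraction of the transactions containing X that also contain Y."""
    support_x = support(transactions, x)
    if support_x == 0:
        return 0.0
    return support(transactions, x | y) / support_x


# Hypothetical market-basket data (not from the talk).
transactions = [frozenset(t) for t in [
    {"beer", "diapers", "chips"},
    {"beer", "chips"},
    {"milk", "bread"},
    {"beer", "diapers"},
    {"milk", "diapers"},
]]

X = frozenset({"beer"})
Y = frozenset({"diapers"})

rule_support = support(transactions, X | Y)        # both items: 2 of 5 baskets
rule_confidence = confidence(transactions, X, Y)   # 2 of the 3 beer baskets
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

An association rule miner would keep only the rules whose support and confidence both exceed user-chosen thresholds.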
• Finding Sequential Patterns
  • Each data sequence is a list of transactions.
  • Find all sequential patterns with a user-specified minimum support.
  • e.g.: Consider a book-club database. A sequential pattern might be: 5% of customers bought “Harry Potter I”, then “Harry Potter II”, and then “Harry Potter III”.
  • Applications:
    • Add-on sales
    • Customer satisfaction
    • Identify symptoms/diseases that precede certain diseases
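
As a rough illustration of what "support of a sequential pattern" means, the sketch below counts the fraction of customers whose purchase history contains the pattern's itemsets in order, in separate transactions. The customer data and the names `contains_pattern` and `sequence_support` are hypothetical, and this only checks a given pattern; it is not a pattern-mining algorithm.

```python
from typing import List, Set

Sequence = List[Set[str]]  # one customer's transactions, in time order


def contains_pattern(sequence: Sequence, pattern: Sequence) -> bool:
    """True if the pattern's itemsets occur in order in distinct transactions."""
    pos = 0
    for itemset in pattern:
        # Advance to the next transaction that contains this itemset.
        while pos < len(sequence) and not itemset <= sequence[pos]:
            pos += 1
        if pos == len(sequence):
            return False
        pos += 1  # the next itemset must come from a later transaction
    return True


def sequence_support(sequences: List[Sequence], pattern: Sequence) -> float:
    """Fraction of data sequences that contain the pattern."""
    if not sequences:
        return 0.0
    return sum(contains_pattern(s, pattern) for s in sequences) / len(sequences)


# Hypothetical book-club purchase histories.
customers = [
    [{"HP1"}, {"HP2"}, {"HP3"}],
    [{"HP1"}, {"Cookbook"}, {"HP2"}, {"HP3"}],
    [{"HP2"}, {"HP1"}, {"HP3"}],      # wrong order
    [{"HP1", "HP2"}, {"HP3"}],        # HP1 and HP2 bought together
]
pattern = [{"HP1"}, {"HP2"}, {"HP3"}]

pattern_support = sequence_support(customers, pattern)
min_support = 0.05  # user-specified threshold, as on the slide
is_frequent = pattern_support >= min_support
```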
• Finding Classification Rules
  • Finding discriminant rules for objects of different classes.
  • Approaches:
    • Finding Decision Trees
    • Finding Production Rules
  • Applications:
    • Processing loan and credit card applications
    • Model identification
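
The two approaches are closely related: every root-to-leaf path of a decision tree can be read as one production rule. The sketch below illustrates this on a small loan-approval tree. The tree, its attribute names, and the function `tree_to_rules` are invented for illustration and do not come from the talk.

```python
def tree_to_rules(tree, conditions=()):
    """Turn each root-to-leaf path into one (conditions, class) production rule."""
    if not isinstance(tree, dict):
        # A leaf holds the class label; the path so far is the rule's IF part.
        return [(list(conditions), tree)]
    rules = []
    attribute = tree["attribute"]
    for value, subtree in tree["branches"].items():
        rules.extend(tree_to_rules(subtree, conditions + ((attribute, value),)))
    return rules


# Hypothetical decision tree for loan applications.
loan_tree = {
    "attribute": "income",
    "branches": {
        "high": "approve",
        "low": {
            "attribute": "has_collateral",
            "branches": {"yes": "approve", "no": "reject"},
        },
    },
}

rules = tree_to_rules(loan_tree)
for conds, label in rules:
    if_part = " AND ".join(f"{a} = {v}" for a, v in conds)
    print(f"IF {if_part} THEN class = {label}")
```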
• Text Mining
• Web Usage Mining
• Etc.
Related Technologies
• Database Systems
  • MS SQL Server
    • Transaction databases
    • OLAP (Data Cubes)
    • Data Mining
      • Decision Trees
      • Clustering Tools
• Machine Learning/Data Mining Systems
  • CART (Classification And Regression Trees)
  • C 5.x (Decision Trees)
  • WEKA (Waikato Environment for Knowledge Analysis)
  • LERS
  • ROSE 2
• Rule-Based Expert System Development Environments
  • CLIPS, JESS
  • EXSYS
• Web-based Platforms
  • Java
  • MS .NET
Comparisons

System        | Pre-Processing | Learning / Data Mining                | Inference Engine | End-User Interface | Web-Based Access | Reasoning with Uncertainties
--------------|----------------|---------------------------------------|------------------|--------------------|------------------|-----------------------------
MS SQL Server | N/A            | Decision Trees, Clustering            | N/A              | N/A                | N/A              | N/A
CART, C 5.x   | N/A            | Decision Trees                        | Built-in         | Embedded           | N/A              | N/A
WEKA          | Yes            | Trees, Rules, Clustering, Association | N/A              | Embedded           | Need Programming | N/A
CLIPS, JESS   | N/A            | N/A                                   | Built-in         | Embedded           | Need Programming | 3rd-party extensions
Rule-Based Data Mining System
Objectives
• Develop an integrated rule-based data mining system that provides:
  • Synergy of database systems, machine learning, and expert systems
  • Dealing with uncertain rules
  • Delivery of a web-based user interface
Structure of Rule-Based Systems
[Figure: inference cycle. Components: Working Memory, Rule Base, Matcher, Selector, Execution. Decision: Answer? (No / Yes). Output: Inference Result.]
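
The components named in the figure form the usual match, select, execute cycle of a rule-based system. The following is a minimal sketch of such a cycle, assuming that the cycle repeats until the goal attribute is known. The rule format, the selection strategy (most specific rule first), and the name `run_inference` are assumptions made for this example, not details from the talk.

```python
def run_inference(rules, facts, goal_attribute, max_cycles=100):
    """Repeat match -> select -> execute until the goal attribute is known."""
    working_memory = dict(facts)
    fired = set()
    for _ in range(max_cycles):
        # "Answer? Yes": the goal has a value, so return the inference result.
        if goal_attribute in working_memory:
            return working_memory[goal_attribute]
        # Matcher: unfired rules whose conditions all hold in working memory.
        conflict_set = [
            r for r in rules
            if r["name"] not in fired
            and all(working_memory.get(a) == v for a, v in r["if"].items())
        ]
        if not conflict_set:
            return None  # no rule applies, so no answer can be derived
        # Selector: an assumed strategy, prefer the most specific rule.
        chosen = max(conflict_set, key=lambda r: len(r["if"]))
        # Execution: add the rule's conclusions to working memory.
        working_memory.update(chosen["then"])
        fired.add(chosen["name"])
    return None


# Hypothetical rule base.
rules = [
    {"name": "r1", "if": {"income": "low", "has_collateral": "no"},
     "then": {"risk": "high"}},
    {"name": "r2", "if": {"risk": "high"}, "then": {"decision": "reject"}},
    {"name": "r3", "if": {"income": "high"}, "then": {"decision": "approve"}},
]

result = run_inference(rules, {"income": "low", "has_collateral": "no"}, "decision")
print(result)
```

In this run, r1 fires first and adds `risk = high` to working memory, which lets r2 fire in the next cycle and produce the decision.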
System Workflow
[Figure: Input Data Set → Data Preprocessing → Rule Generator → User Interface Generator]
• Input Data Set:
  • Text file with comma-separated values (CSV)
  • It is assumed that there are N columns of values corresponding to N variables or parameters, which may be real or symbolic values.
  • The first N − 1 variables are considered as inputs and the last one is the output variable.
• Data Preprocessing:
  • Discretize domains of real variables into a finite number of intervals
  • The discretized data file is then used to generate an attribute information file and a training data file.
• Rule Generator:
  • A symbolic learning program called BLEM2 is used to generate rules with uncertainty
• User Interface Generator:
  • Generate a web-based rule-based system from a rule file and the corresponding attribute file
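
The input and preprocessing steps can be sketched as follows. The talk does not say which discretization method the system uses, so this example assumes simple equal-width binning. It also assumes a CSV file without a header row and leaves the output column unchanged. The function names and the sample data are hypothetical, and rule generation with BLEM2 is not reproduced here.

```python
import csv
import io


def is_real(values):
    """True if every value in the column parses as a number."""
    try:
        for v in values:
            float(v)
        return True
    except ValueError:
        return False


def equal_width_bins(values, k):
    """Map each number to one of k equal-width intervals, numbered 0..k-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k or 1.0  # avoid a zero width when all values are equal
    return [min(int((v - lo) / width), k - 1) for v in values]


def discretize(csv_text, k=3):
    """Discretize the real-valued input columns; the last column is the output."""
    rows = [r for r in csv.reader(io.StringIO(csv_text)) if r]
    columns = list(zip(*rows))
    out_columns = []
    for col in columns[:-1]:          # the first N - 1 columns are inputs
        if is_real(col):
            bins = equal_width_bins([float(v) for v in col], k)
            out_columns.append([f"bin{b}" for b in bins])
        else:
            out_columns.append(list(col))  # symbolic values are kept as they are
    out_columns.append(list(columns[-1]))  # output variable, left unchanged
    return [list(r) for r in zip(*out_columns)]


# Hypothetical data set: one real input, one symbolic input, one output.
sample = "5.0,red,yes\n7.0,blue,no\n9.5,red,yes\n8.0,blue,no\n"
discretized = discretize(sample, k=3)
for row in discretized:
    print(",".join(row))
```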
Architecture of RBC generator
[Figure: Client, Middle Tier, and SQL DB server, connected by Requests and Responses.]
Workflow of RBC generator
[Figure: Rule set File and Metadata File → RBC Generator → SQL Rule Table and Rule Table Definition.]
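
One way to picture a rule set stored as an SQL rule table is sketched below with Python's built-in `sqlite3` module. The schema, the use of NULL for "attribute not tested by this rule", and the policy of taking the highest certainty per decision are all assumptions made for this example; the talk does not describe the actual table definition or how uncertain rules are combined.

```python
import sqlite3


def build_rule_table(conn, attributes, rules):
    """Create a rule table with one column per attribute, then load the rules."""
    # Attribute names come from trusted metadata; they are interpolated as
    # column identifiers, so they must not contain user-supplied text.
    cols = ", ".join(f'"{a}" TEXT' for a in attributes)
    conn.execute(
        f"CREATE TABLE rules (id INTEGER PRIMARY KEY, {cols}, "
        "decision TEXT, certainty REAL)"
    )
    names = ", ".join(f'"{a}"' for a in attributes)
    marks = ", ".join("?" for _ in range(len(attributes) + 2))
    for conditions, decision, certainty in rules:
        # Attributes a rule does not test are stored as NULL.
        row = [conditions.get(a) for a in attributes] + [decision, certainty]
        conn.execute(
            f"INSERT INTO rules ({names}, decision, certainty) VALUES ({marks})",
            row,
        )


def classify(conn, attributes, case):
    """Return (decision, best certainty) pairs for all rules matching the case."""
    where = " AND ".join(f'("{a}" IS NULL OR "{a}" = ?)' for a in attributes)
    sql = (
        "SELECT decision, MAX(certainty) AS best FROM rules "
        f"WHERE {where} GROUP BY decision ORDER BY best DESC"
    )
    return conn.execute(sql, [case[a] for a in attributes]).fetchall()


# Hypothetical attribute list (metadata) and rule set.
attributes = ["temperature", "humidity"]
rules = [
    ({"temperature": "bin2"}, "alarm", 0.9),
    ({"humidity": "bin0"}, "normal", 0.75),
    ({"temperature": "bin2", "humidity": "bin0"}, "normal", 0.6),
]

conn = sqlite3.connect(":memory:")
build_rule_table(conn, attributes, rules)
ranked = classify(conn, attributes, {"temperature": "bin2", "humidity": "bin0"})
print(ranked)
```

Keeping the rules in a table means that a middle tier can answer a classification request with a single parameterized query.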
Concluding Remarks
A system for generating rule-based classifiers from data, with the following benefits:
• No need for end-user programming
• Automatic rule-based system creation
• The delivery system is web-based, which provides easy access
Project Status
The current version 1.4 of our system provides fundamental features for data mining, including:
• Data Preprocessing
• Management of preprocessed data files
• Machine Learning tool to generate rules from data
• Rule-Based Classifier system supporting uncertain rules
• Web-Based access
Future Work
• More advanced features in Data Preprocessing such as data cleansing, data transformation, and data statistics
• Learning from multi-criteria inputs with preferential rankings to support Multiple Criteria Decision Making processes
• Concept-Oriented information retrieval and search
Thank You!