
An Introduction
Student Name: Riaz Ahmad
Program: MSIT (2005-2007)
Subject: Data Warehouse & Data Mining
Terms Related to My Research
 KDD process
 Data warehouse
 Materialized or Indexed View
 Data Mining
 Data Mining Techniques
 Classification
 Decision Tree
 ID3 algorithm
 Objective
 Conclusion
KDD Process
[Figure: the KDD process pipeline: Databases → Data Cleaning and Data Integration → Data Warehouse → Selection of Task-relevant Data → Data Mining → Pattern Evaluation → Knowledge]
Steps of a KDD Process
 Learning the application domain: relevant prior knowledge and the goals of the application
 Creating a target data set: data selection
 Data cleaning and preprocessing (may take 60% of the effort!)
 Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
 Choosing the functions of data mining: summarization, classification, regression, association, clustering
 Choosing the mining algorithm(s)
 Data mining: search for patterns of interest
 Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
 Use of discovered knowledge
Data Warehouse
 A data warehouse is a subject-oriented, integrated (consolidated), nonvolatile (read-only), time-variant collection of data designed to support management DSS needs.
 Any read-only collection of accumulated historical data is called a data
warehouse.
 A data warehouse is a database specifically structured for query and
analysis.
 A data warehouse typically contains data representing the business
history of an organization.
Materialized or Indexed View
 A materialized view is a special type of summary table that is constructed by aggregating one or more columns of data from a single table, or from a series of tables joined together.
 Materialized views can dramatically improve query performance and significantly decrease the load on the system.
 Materialized views are needed in a data warehouse environment more than in an OLTP environment: they pay off in huge databases, not in a table that has 3 records.
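To make the idea concrete, here is a loose analogy in plain Python rather than warehouse SQL (the data and names are made up): the aggregation cost is paid once when the summary is built, and later queries read the stored result instead of rescanning the base data.

from collections import defaultdict

# Base "table": (region, amount) rows of raw sales data (made-up values).
sales = [("east", 100), ("west", 250), ("east", 75), ("north", 300)]

# "Materialize" the aggregate: one pass over the base data builds the summary.
total_by_region = defaultdict(int)
for region, amount in sales:
    total_by_region[region] += amount

# Later queries hit the precomputed summary instead of the raw rows.
print(total_by_region["east"])  # 175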
Data Mining
 Data mining (DM) is defined as the process of discovering patterns
in data. The process must be automatic or (more usually) semiautomatic.
The patterns discovered must be meaningful in that they lead to some
advantage.
 Simply defined, data mining refers to extracting or mining "knowledge" from large amounts of data.
 There are many other terms such as knowledge mining from databases,
knowledge extraction, data/pattern analysis, data archaeology, and data
dredging.
 Data mining is concerned with finding hidden relationships present in
business data to allow businesses to make predictions for future use.
Data Mining Tasks or Techniques
 Classification
 Regression
 Segmentation
 Association
 Forecasting
 Text Analysis
 Advanced Data Exploration
Among these data mining techniques or tasks, we select only classification for the research work.
Classification
 Classification is a form of data analysis that can be used to extract models describing important data classes or to predict future data trends.
 Given a number of pre-specified classes, examine a new object, record, or individual and assign it, based on a model, to one of these classes.
 Examples
Which credit applicants are low, medium, or high risk?
Which hotel customers are likely or unlikely to return?
Which residents are likely or unlikely to vote?
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory-based reasoning
• Neural Networks
• Genetic algorithms
• Bayesian networks
Among these classification techniques, we select only the decision tree.
Decision Tree
 A decision tree is a flow-chart-like tree structure, where each internal node denotes a test on an attribute, each branch represents an outcome of the test, and leaf nodes represent classes. The topmost node in a tree is the root node.
 A decision tree is a tree in which each branch node represents a choice between a number of alternatives, and each leaf node represents a decision. Decision trees are commonly used for gaining information for the purpose of decision-making. A decision tree starts with a root node from which the process begins. From this node, each node is split recursively according to the decision tree learning algorithm. The final result is a decision tree in which each branch represents a possible scenario of a decision and its outcome.
Decision Tree
[Figure: example decision tree, with a test on age at the root, a student test under age < 30, and a credit rating test under age > 40]
Generating Classification Rules from the Decision Tree
IF age = "<30" AND student = no
THEN buys computer = no
IF age = "<30" AND student = yes
THEN buys computer = yes
IF age = "30-40"
THEN buys computer = yes
IF age = ">40" AND credit rating = excellent
THEN buys computer = yes
IF age = ">40" AND credit rating = fair
THEN buys computer = no
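These rules translate directly into executable form. A minimal Python sketch follows (the function name and the string encodings of the attribute values are illustrative choices, not from the original work):

def buys_computer(age, student, credit_rating):
    # Classify a record by walking the rules read off the decision tree.
    # age is one of "<30", "30-40", ">40"; student and credit_rating are strings.
    if age == "<30":
        return "yes" if student == "yes" else "no"
    if age == "30-40":
        return "yes"
    if age == ">40":
        return "yes" if credit_rating == "excellent" else "no"
    raise ValueError("unknown age band: " + age)

print(buys_computer("<30", "yes", "fair"))  # yes
print(buys_computer(">40", "no", "fair"))   # no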
ID3 algorithm
Originator of the ID3 Algorithm
ID3 and its successors were developed by Ross Quinlan, who introduced the algorithm in the 1970s.
 Implementation of ID3 Algorithm
ID3 (Learning set S, Attribute set A, Attribute values V)
Returns a decision tree.
Begin
Load the learning set first and create the decision tree root node 'rootNode';
add learning set S to the root node as its subset.
For rootNode, first compute Entropy(rootNode.subset).
If Entropy(rootNode.subset) == 0, then
rootNode.subset consists of records that all have the same value of the categorical
attribute; return a leaf node labelled with that attribute value.
If Entropy(rootNode.subset) != 0, then
compute the information gain of each remaining attribute (not yet used in splitting),
find the attribute A with Maximum(Gain(S, A)),
and create child nodes of rootNode, one per value of A, adding them to the decision tree.
For each child of rootNode, apply ID3 recursively (on the child's subset, with the
chosen attribute removed) until reaching a node with entropy = 0 or a leaf node.
End ID3.
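For reference, the same procedure can be sketched in runnable Python. This is a compact rendering of my own (dict-based records and a nested-dict tree), not the implementation from this work; it omits handling of attribute values unseen in a subset.

import math
from collections import Counter

def entropy(records, target):
    # Shannon entropy of the target class over a list of dict records.
    counts = Counter(r[target] for r in records)
    total = len(records)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

def gain(records, attr, target):
    # Information gain of splitting the records on attribute attr.
    total = len(records)
    remainder = sum(
        len(subset) / total * entropy(subset, target)
        for value in {r[attr] for r in records}
        for subset in [[r for r in records if r[attr] == value]]
    )
    return entropy(records, target) - remainder

def id3(records, attributes, target):
    # Build the tree recursively; leaves are class labels.
    classes = {r[target] for r in records}
    if len(classes) == 1:                      # entropy = 0: pure node
        return classes.pop()
    if not attributes:                         # nothing left to split on: majority class
        return Counter(r[target] for r in records).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(records, a, target))
    tree = {best: {}}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        rest = [a for a in attributes if a != best]
        tree[best][value] = id3(subset, rest, target)
    return tree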
Mathematical Formulae
The following mathematical formulae are used for the calculation of Entropy
and Gain.
 Entropy Equation
 Information Gain Equation
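Written out, in the form consistent with the calculations in the following steps (p_i is the relative frequency of class i in dataset S, and S_v is the subset of S for which attribute A takes the value v):

Entropy(S) = -\sum_{i=1}^{c} p_i \log_2 p_i

Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \cdot Entropy(S_v)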
Classification Experiments
First we take a training dataset (S) for classification purposes.
[Table: the training dataset S of fourteen (14) records]
Step (1): Entropy of the Original Dataset
 The entropy calculation process for the dataset is shown below.
First, count the records whose class value is No: there are five (5), while nine (9) records have the class value Yes. The total number of records is fourteen (14).
Relative frequency of the No class: 5/14.
Relative frequency of the Yes class: 9/14.
The entropy of dataset S is calculated with the entropy formula above.
Entropy (5, 9) = -5/14 log2 5/14 - 9/14 log2 9/14
= 0.9403
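As a quick check, this value can be reproduced in a few lines of Python:

import math

p_no, p_yes = 5 / 14, 9 / 14
entropy_S = -(p_no * math.log2(p_no) + p_yes * math.log2(p_yes))
print(round(entropy_S, 4))  # 0.9403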
Step (2): Calculate the Gain of Each Input Attribute in the Dataset
The following information is required for the gain calculation of the Outlook attribute.
To calculate an attribute's gain, first check the number of values the attribute can take, and then classify the S dataset on the basis of each value. The Outlook attribute has three values, rain, overcast, and sunny, so three subsets of S are possible on the basis of the Outlook attribute values.
I. The first subset (S1) contains five (5) records, on the basis of the rain value of the Outlook attribute.
II. The second subset (S2) contains four (4) records, on the basis of the overcast value of the Outlook attribute.
III. The third subset (S3) contains five (5) records, on the basis of the sunny value.
Proportionality measure for S1: 5/14
Proportionality measure for S2: 4/14
Proportionality measure for S3: 5/14
Step (2): Calculate the Gain of Each Input Attribute in the Dataset (continued)
The following steps are required to calculate the gain value of each attribute in the original dataset (S):
 Calculate the entropy of each subset (S1, S2, S3)
 Calculate the entropy of the attribute (Outlook)
 Calculate the gain of the attribute (Outlook)

 Calculate the Entropy of Each Subset (S1, S2, S3)
The first subset S1 has three (3) Yes class records and two (2) No class records, five (5) records in total.
Entropy (3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5
= 0.971
The second subset S2 has four (4) Yes class records and no No class records, four (4) records in total.
Entropy (4, 0) = -4/4 log2 4/4
= 0
The third subset S3 has three (3) No class records and two (2) Yes class records, five (5) records in total.
Entropy (3, 2) = -3/5 log2 3/5 - 2/5 log2 2/5
= 0.971
 Calculate the Entropy of the Outlook Attribute
The following formula is used for the calculation:
Entropy (Outlook) = |S1|/|S| * Entropy (S1) + |S2|/|S| * Entropy (S2) + |S3|/|S| * Entropy (S3)
Entropy (5, 4, 5) or Entropy (Outlook)
= 5/14 * 0.971 + 4/14 * 0 + 5/14 * 0.971
= 0.694
 Calculate the Gain Value of the Outlook Attribute
The following formula is used for the calculation:
Gain (S, A) = Entropy (S) - Entropy (A)
Gain (S, Outlook)
= 0.9403 - 0.694
= 0.2463
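These steps can also be verified numerically in Python (the helper H below is a hypothetical shorthand that computes entropy from class counts):

import math

def H(counts):
    # Entropy from a list of class counts; zero counts contribute nothing.
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

h_rain, h_overcast, h_sunny = H([3, 2]), H([4, 0]), H([3, 2])
print(round(h_rain, 3))                 # 0.971 (the sunny subset gives the same value)
e_outlook = 5/14 * h_rain + 4/14 * h_overcast + 5/14 * h_sunny
print(round(e_outlook, 3))              # 0.694
print(round(H([5, 9]) - e_outlook, 4))  # 0.2467; the slides get 0.2463 from rounded intermediates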
The above three steps are repeated for the three remaining input attributes. The following tables contain the gain of each attribute for the original set, the rain subset, and the sunny subset.
Attributes Along with Gain Information (Original Set, Rain and Sunny Subsets)
[Table: gain values of the input attributes for the original set and for the rain and sunny subsets]
Step (3): Select the Maximum-Gain Attribute for the Classification of Dataset (S)
In the above table, the attribute with the maximum gain value is Outlook. The gain value of Outlook is 0.2463, which is the highest value. After this process we can split the dataset into three different subsets on the basis of the Outlook attribute values, namely rain, overcast, and sunny. Repeating this classification on each subset completes the classification process, which generates the decision tree or the classification rules (see the sketch below).
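The root selection can be reproduced end to end. The records below are Quinlan's classic weather ("play") dataset, assumed here as a stand-in for S because its class counts (9 Yes / 5 No; rain 3/2, overcast 4/0, sunny 2/3) match all the figures used in these slides:

import math
from collections import Counter

# Assumed training set S: (outlook, temperature, humidity, windy, play).
S = [
    ("sunny", "hot", "high", "false", "no"),
    ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"),
    ("rain", "mild", "high", "false", "yes"),
    ("rain", "cool", "normal", "false", "yes"),
    ("rain", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"),
    ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"),
    ("rain", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"),
    ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"),
    ("rain", "mild", "high", "true", "no"),
]
ATTRS = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(rows):
    counts = Counter(r[-1] for r in rows)
    return -sum(c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, i):
    remainder = sum(
        len(sub) / len(rows) * entropy(sub)
        for v in {r[i] for r in rows}
        for sub in [[r for r in rows if r[i] == v]]
    )
    return entropy(rows) - remainder

for name, i in ATTRS.items():
    print(name, round(gain(S, i), 4))
# outlook has the maximum gain (about 0.247), so it is selected as the root split.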
Decision Tree Developed in My Work
[Figure: the decision tree generated from the training dataset S]
Objectives
1. To integrate the decision tree with the data warehouse or database.
2. To reduce the construction time of the decision tree at the root node.
The computational process for constructing the tree is highly complex and recursive in nature. It includes repeatedly calculating various values, i.e. the entropy of the dataset and the entropy and gain values of each input attribute in the dataset. Here, I have pre-computed the results required at least for the selection and classification of the root node (see the sketch below).
There is no change in the intelligence approach; only the required values are stored instead of being calculated at run time in memory. However, this integration gives a jump start to the construction of the classification model, enhancing the overall efficiency of the model.
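An illustrative Python sketch of this pre-computation idea (all names are hypothetical; the gain values for the attributes other than Outlook are the standard textbook values for this dataset, shown only for illustration):

# Statistics computed once, at data-load time, and stored alongside the data.
precomputed_stats = {
    "entropy_S": 0.9403,  # from Step (1)
    "gain": {
        "outlook": 0.2463,      # from Step (2)
        "temperature": 0.029,   # standard textbook value, assumed
        "humidity": 0.152,      # standard textbook value, assumed
        "windy": 0.048,         # standard textbook value, assumed
    },
}

def pick_root(stats):
    # Select the root attribute from the stored gains; no runtime scan of S needed.
    return max(stats["gain"], key=stats["gain"].get)

print(pick_root(precomputed_stats))  # outlook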
Pre-Calculated Structure
[Figure: the pre-calculated structure holding the stored entropy and gain values]
Conclusion
Classification algorithms are memory resident, calculating various statistical values
at runtime. Storage of these statistical values, even for the selection of the best
attribute at the root node, greatly increases the performance of the classification
algorithms.
A materialized view will hold the input training dataset, while these statistical values will be stored in a dependent table. This table will be updated according to the chosen policy. Modern data warehouses offer many methods to update a materialized view.
However, each time a new target class is introduced or new data is loaded, the table containing the statistical values will be updated accordingly. The accuracy of the algorithm is in no way affected, neither positively nor negatively. The significant improvement introduced is in efficiency, in the selection of the root-level attribute.