ICT619 Intelligent Systems
Download
Report
Transcript ICT619 Intelligent Systems
ICT619 Intelligent
Systems
Topic 6: Data Mining
Data Mining
Introduction
Business Applications of Data Mining
Data Mining Activities
Data Mining Techniques
How to Apply Data Mining
Data Mining Development Methodology
ICT619
2
Why data mining?
“Customers who bought this title also bought … “
- from Amazon.com
Why? – More effective (targeted) marketing
How? – Targeting through association
Abundance of business data typically in terabytes
- point-of-sale (POS) devices, customer call detail
databases, web log files in e-commerce etc
Data is being collected mostly for improving efficiency
of underlying operations
But not for analysis
ICT619
3
Why data mining? (cont’d)
Useful information (business intelligence) to gain
competitive advantage can be extracted by "mine"-ing
data
Examples: underlying trends, associations or patterns
in market behaviour
According to (Hirji 2001),
“ … data mining is the analysis and non-trivial extraction
of data from databases for the purpose of discovering
new and valuable information, in the form of patterns
and rules, from relationships between data elements.”
ICT619
4
Data mining in perspective
OLAP with data warehouses tells us what is
happening and how
Data mining tells us what is likely to happen
Data mining is knowledge discovery in
(commercial) databases - KDD
Data mining is a process rather than a product
ICT619
5
Data mining in perspective (cont’d)
Statistical methods do not scale up to today's problems
New "intelligent" tools are needed
Data mining draws from artificial intelligence/soft
computing, database theory, data visualization,
marketing, statistics, and so on
Our objectives:
Understand the role of data mining in business
Distinguish between different data mining techniques
Understand how to go about making use of data mining
ICT619
6
Business Applications of Data
Mining
Fastest growing segment of business intelligence
market
Increasingly an integral and necessary component of
an organization’s portfolio of analytical techniques
Data mining for marketing
Uses data on customer behaviour to identify target
groups for marketing
Reduces cost by avoiding groups unlikely to respond
ICT619
7
Business Applications of Data
Mining (cont’d)
Data mining for customer relationship
management
Anticipating customers’ needs and responding to
them proactively
Data mining in R&D
Can lower costs during the R&D phase of the
product life cycle by analysing voluminous test data
Bioinformatics - data mining in biology and medicine
ICT619
8
Data Mining Activities
Two broad groups – directed and undirected data mining
Directed data mining
We know what we are looking for
We aim to find the value of a pre-identified target variable in
terms of a collection of input variables, eg, classifying
insurance claims
Undirected data mining
Finds patterns in data
Leaves it to the user to find the significance of these
patterns
Eg, identifying groups of customers with similar buying
patterns
ICT619
9
Different types of data mining tasks
ICT619
10
Data Mining Tasks
Classification
Assigns a given object to a predefined category (class)
based on the object’s attributes (features)
Objects to be classified are generally database
records.
Discrete outcomes – yes/no, low/medium/high etc,
Examples of classification tasks:
Assigning keywords to articles
Classifying credit applicants as low, medium and high risk
Assigning customers to predefined customer segments
ICT619
11
Data Mining Tasks
Estimation
Continuously varying outcomes
Eg income, probability of a customer leaving (known in data
mining circles as churning)
Outcomes can also be used for classification by ranking and
thresholding
Prediction
Classification or estimation task performed to predict some
future behaviour
Examples include:
- Predicting which customers will churn in the next six
months
- Predicting the size of a balance that will be transferred
ICT619
12
Data Mining Tasks (cont’d)
Finding affinity grouping or association rules
Finds out, which things go together, eg, in a supermarket
shopping trolley
Used for arranging items in shelves or catalogues
Identifying cross-selling opportunities
Clustering
Segments a group of diverse records into subgroups or
clusters containing similar records
No predefined classes in clustering; records grouped based
on similarities in their attributes
Eg, people with similar buying habits
Data miner must interpret clusters and decide what to do
ICT619
13
Data Mining Tasks (cont’d)
Description and visualisation
To help increase our understanding of people,
products or processes that produced the data
A good description can provide an explanation
of their behaviour
Data visualisation can be very effective in
explaining things by exploiting our ability to
utilise visual clues
ICT619
14
Data Mining Techniques
Our aim is a basic understanding of data mining
techniques to find out
When to apply them
How to interpret their results
How to evaluate their performance
Three major approaches are:
Decision trees
Automatic cluster detection
Artificial neural networks (supervised and unsupervised)
ICT619
15
Decision Trees
Visual representation of
a reasoning process
Particularly suitable for
solving classification
problems
Consists of internal
nodes, leaf nodes and
edges
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
ICT619
16
Decision Trees (cont’d)
Each leaf node is labelled
with a class label
The class label decided by
the class of the records
that ended up in that leaf
during training
A leaf node may also
contain a value depending
upon the average of the
values of such records
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
Group A contains any self-employed person aged <=40 and earning a
salary of more than $50,000.
ICT619
17
Decision Trees (cont’d)
Each edge originating
from an internal node is
labelled with a splitting
predicate involving that
node’s splitting attribute
The splitting predicate
forces any record to take
a unique path from the
root to exactly one leaf
node.
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
Group A contains any self-employed person aged less than 41 and
earning a salary of more than $50,000.
ICT619
18
How decision trees work
Each record with N attributes is a point in an Ndimensional record space
Each branch in a decision tree is a test on a single
variable that splits the space into two or more regions
With each successive test and split, the resulting
regions get more and more segregated with increasing
homogeneity among the records
Ultimately, the leaf nodes will contain the purest batch
of records
ICT619
19
How decision trees work
For example, in the
example decision
tree, any selfemployed person
aged <= 40 and
earning a salary of
more than $50,000
will be classified as
belonging to group
A.
ICT619
20
How decision trees work (cont’d)
Overfitting in decision trees
A decision tree that correctly classifies every single
record
Such a tree is unlikely to generalise to new data sets
To prevent overfitting, test data set are used to prune
decision trees once it has been built using the training
data set.
ICT619
21
How decision trees are built
Recursive partitioning
An iterative process of splitting the training data into
partitions (regions of record space)
Initially, all records are in a training set consisting of
pre-classified records
An algorithm splits up the data, using every possible
binary split on every field of the records
The best split is defined as one that creates partitions
where a single class predominates
ICT619
22
How decision trees are built
(cont’d)
Recursive partitioning (cont’d)
The most important task in building a decision tree is to
decide which of the attributes (independent fields in a
record) gives the best split
The measure used to evaluate a potential splitter is the
reduction in diversity (or increase in purity)
The best split has the largest reduction in diversity
One measure of diversity is the Gini index:
2p1 * (1 – p2)
ICT619
23
How decision trees are built
(cont’d)
Recursive partitioning (cont’d)
The splitting process is applied to each of the new
parts and so on until no more useful splits can be found
A node becomes a leaf node when no split can be
found that significantly decreases the diversity
Pruning
The full decision tree needs to be pruned to improve its
performance
Pruning is done by removing leaves and branches
(edges leading to leaves) that fail to generalise
There are a number of pruning methods
Eg, a tree is pruned back to the subtree that minimises
error on the test set.
ICT619
24
How decision trees are built
(cont’d)
Different types of decision trees
Types depend upon
the number of splits allowed at each level
how these splits are chosen when the tree is built
how the tree is pruned to prevent overfitting
More broadly, decision trees can be grouped
as:
Classification trees (leaves represent classes)
Regression trees (leaves represent a numeric
value)
ICT619
25
Algorithms for building decision
trees
Most notable are
- CHAID, C4.5/C5.0 and CART
Data mining software tools allow approximation
of any of these algorithms by providing choice
of
splitting criteria and pruning strategies
control parameters such as maximum tree depth
ICT619
26
Application of decision trees
Useful when the data mining task is
classification of records or prediction of
outcomes
Also chosen to generate understandable rules,
which can be explained and translated into
SQL or a natural language
For example,
IF age < 41
AND income < $50,000
AND employment = self
THEN belongs to group A
ICT619
27
Automatic Cluster Detection
Aims to discover structure in a complex data set as a
whole in order to carve it up into simpler groups
Examples of clustering
- finding products that should be grouped together in a
catalogue, or
- identifying groups of customers with similar tastes in
music
Many methods for finding clusters in data, a prominent
one is K-means clustering
ICT619
28
K-means clustering
Available in a wide
variety of commercial
data mining tools
Divides the data set
into a predetermined
number, k, of clusters
Initial clusters centred
at random points
(seeds) in the record
space
ICT619
29
K-means clustering (cont’d)
Records are assigned to the
clusters through an iterative
process
In the first step, k data points
are selected to be the seeds
Each seed is an embryonic
cluster with only one element
In the second step, each
record is assigned to the
cluster whose centroid is
nearest to that record
This forms the new clusters
with new intercluster
boundaries.
ICT619
30
K-means clustering (cont’d)
The centroid of a cluster of
records calculated by taking
average of each field for all
the records in that cluster
Euclidean distance most
commonly used for
measuring distance by data
mining software.
Distance between two points
P(x1, x2, .. , xn) and Q(y1,
y2, .. , yn) in n-dimensional
space is ((x1-y1)2 + (x2y2)2 + .. + (xn-yn)2).
ICT619
31
K-means clustering (cont’d)
In the k-means method, the original choice of the value
of k determines the number of clusters that will be
found
Unless advanced knowledge is available on the likely
number of clusters, value of k is determined by trialand-error
Best results are obtained when k matches the
underlying structure of the data.
ICT619
32
Interpreting clusters
Automatic clustering is undirected data mining
- We look for something useful without having to know
what we are looking for
Both an advantage and possible disadvantage!
The most frequently used approaches interpreting
clusters are
Building a decision tree with the cluster labels as target
variables, and
using it to derive rules explaining how to assign new
records to the correct cluster
Using visualisation to see how the clusters are affected
by changes in input variables.
Examining the differences in the distributions of variables
from cluster to cluster, one variable at a time.
ICT619
33
Application of clusters
Clustering is used
When natural groupings are suspected,
Eg, groups representing customers or products that
have a lot in common with each other
When there are many competing patterns in the data
making it hard to spot any single pattern
Creating clusters reduces the complexity within
clusters so that other data mining techniques are more
likely to succeed
ICT619
34
Artificial Neural Networks
Main generic application of artificial neural networks is
pattern recognition or classification
Estimation and prediction can be viewed as variants of
classification
The best ANN model for performing classification is the
backpropagation network (or the multilayer perceptron)
The ANN model particularly suited for clustering is the
Kohonen net or the self-organising map (SOM)
ICT619
35
Artificial Neural Networks (cont’d)
SOM learning algorithms are unsupervised
Clusters are represented in a SOM by groups of
adjacent neurons in output layer
SOM reduces dimensionality from N to 2
SOM can serve as a clustering tool as well as
visualisation tool for high-dimensional data
SOMs claimed to be often more effective than k-means
for complex shaped clusters
ICT619
36
Application of neural nets
Artificial neural networks can produce very good results
But require extensive data preparation involving
normalisation and conversion of categorical values to
numeric values
Do not work well when there are many hundreds or
thousands of input features - long training phases
Difficult to understand because they represent complex
non-linear models
Unlike decision trees, do not produce rules readily.
A good choice for most classification and prediction
tasks when the results are more important than
understanding how the model works
ICT619
37
How to Apply Data Mining
Four ways of utilising data mining expertise in business:
1. By purchasing readymade scores (such as on credit
worthiness for a loan applicant) from outside vendors.
2. By purchasing software that embodies data mining
expertise designed for a particular application such as
credit approval, fraud detection or churn prevention
3. By hiring outside consultants to perform data mining
for special projects
4. By developing own data mining skills within the
business organization
Purchasing scores is quick and easy, but the intelligence
limited to single score values
ICT619
38
Purchasing Software
Two possibilities:
Software may be an actual model
Eg, in the form of a set of rules for decision support, or
a fully-trained neural network applied to a particular
domain
Software may embody knowledge of the process of
building models for a particular domain in the form of a
model-creation wizard or template
Purchased models work well if the products, customers,
and market conditions match those used to develop the
model
ICT619
39
Tasks for the data mining model
builder
Model building software automate the process of
creating candidate models and selecting the ones that
perform best
Significant tasks left for the user:
Choosing a suitable business problem to be addressed by
data mining.
Identifying and collecting data that is likely to contain the
information needed to answer the business question.
Pre-processing the data so that the data mining tool can make
use of it
Transforming the database so that the input variables needed
by the model are available
Designing a plan of action based on the model and
implementing it in the marketplace
Measuring results of the actions and feeding them back into
the database for future mining.
ICT619
40
Hiring Outside Experts
Recommended approach if
Organization in early stages of integrating data
mining in its business
Data mining activity is to be an one-off process
Not if
it is to be an ongoing process, eg, data mining for
customer relationship management
Outside expertise for data mining is likely to be
available in three possible places
ICT619
41
Hiring Outside Experts (cont’d)
Outside expertise for data mining is likely to be available
in three possible places:
1. From a data mining software vendor
2. Data mining centres
Usually collaborations between universities and
private companies
Eg, Monash Data Mining Centre
3. Consulting companies
Consulting company chosen should have had
experience specifically in the area of interest to the
organisation
ICT619
42
Developing In-house Expertise for
data mining
Applies particularly to companies which have
many products and customers
Should be a core competency of all large scale
businesses
ICT619
43
Data Mining Development
Methodology
Best practice yet to emerge (Hirji 2001)
A proposed a five-stage model (Cabena
1998):
1.
2.
3.
4.
5.
Business objective determination
Data preparation
Data mining
Results analysis
Knowledge assimilation
ICT619
44
Data Mining Development
Methodology (cont’d)
Cabena’s five-stage model:
Business objective determination
Clearly identifying the business problem to be mined
Data preparation
Data selection, preprocessing and transformation
Data mining
Algorithm selection and execution
Results analysis
Has anything new or interesting been found
Knowledge assimilation
Formulate ways of exploiting the new information extracted
ICT619
45
A Case Study (Hirji 2001)
Involved a large fast food outlet
Brought out some deficiencies of the above
methodology
A new set of stages for data mining development and
use proposed:
1. Business objective determination
2. Data preparation
3. Data audit
4. Interactive data mining and results analysis
5. Back end data mining
6. Results synthesis and presentation
ICT619
46
A Case Study (Hirji 2001) (cont’d)
Case study used IBM’s Intelligent Miner for Data on AIX as
the data mining tool
Took 20 actual days of effort across the 6 stages above
Back end data mining involves data enrichment and
additional data mining algorithm execution by the data
mining specialist
Distribution of time required
45% taken up by stages 4, 5 and 6
30% required by the data preparation stage (70% predicted
in the earlier model)
Use of a data warehouse saved time needed for selecting,
cleaning, transforming, coding, and loading the data
ICT619
47
A Case Study (Hirji 2001) (cont’d)
Interactive data mining and results analysis stage
Linking data mining results with business
strategy and using application software such
as spreadsheets to perform sensitivity analysis
of results obtained
Aims to demonstrate how data mining results
support business strategy
ICT619
48
REFERENCES
Berry, M., & Linoff, G. Mastering Data Mining, Wiley Computer
Publishing, New York 2000.
Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A.
Discovering Data Mining: From Concept to Implementation.
Prentice Hall, Englewood Cliffs, NJ 1998.
Dhar, V., & Stein, R.,”Deriving Rules from Data” in Seven Methods
for Transforming Corporate Data into Business Intelligence.,
Prentice Hall 1997, pp. 167-189, 251-258.
Ganti, V., Gehrke, J., & Ramakrishnan, R. Mining Very Large
Databases, IEEE Computer, Vol.32 No.8, August 1999, pp.38-45.
Hirji, K., Exploring Data Mining Implementation, Communications
of the ACM, Vol.44, No.7, July 2001, pp. 87-93.
Web site on Data Mining and Web Mining http://www.kdnuggets.com/software/suites.html
ICT619
49