ICT619 Intelligent Systems

Download Report

Transcript ICT619 Intelligent Systems

ICT619 Intelligent
Systems
Topic 6: Data Mining
Data Mining






Introduction
Business Applications of Data Mining
Data Mining Activities
Data Mining Techniques
How to Apply Data Mining
Data Mining Development Methodology
ICT619
2
Why data mining?
 “Customers who bought this title also bought … “
- from Amazon.com
Why? – More effective (targeted) marketing
How? – Targeting through association
 Abundance of business data typically in terabytes
- point-of-sale (POS) devices, customer call detail
databases, web log files in e-commerce etc
 Data is being collected mostly for improving efficiency
of underlying operations
But not for analysis
ICT619
3
Why data mining? (cont’d)
 Useful information (business intelligence) to gain
competitive advantage can be extracted by "mine"-ing
data
 Examples: underlying trends, associations or patterns
in market behaviour
 According to (Hirji 2001),
“ … data mining is the analysis and non-trivial extraction
of data from databases for the purpose of discovering
new and valuable information, in the form of patterns
and rules, from relationships between data elements.”
ICT619
4
Data mining in perspective
 OLAP with data warehouses tells us what is
happening and how
 Data mining tells us what is likely to happen
 Data mining is knowledge discovery in
(commercial) databases - KDD
 Data mining is a process rather than a product
ICT619
5
Data mining in perspective (cont’d)
 Statistical methods do not scale up to today's problems
 New "intelligent" tools are needed
 Data mining draws from artificial intelligence/soft
computing, database theory, data visualization,
marketing, statistics, and so on
 Our objectives:
 Understand the role of data mining in business
 Distinguish between different data mining techniques
 Understand how to go about making use of data mining
ICT619
6
Business Applications of Data
Mining
 Fastest growing segment of business intelligence
market
 Increasingly an integral and necessary component of
an organization’s portfolio of analytical techniques
 Data mining for marketing
 Uses data on customer behaviour to identify target
groups for marketing
 Reduces cost by avoiding groups unlikely to respond
ICT619
7
Business Applications of Data
Mining (cont’d)
 Data mining for customer relationship
management
 Anticipating customers’ needs and responding to
them proactively
 Data mining in R&D
 Can lower costs during the R&D phase of the
product life cycle by analysing voluminous test data
 Bioinformatics - data mining in biology and medicine
ICT619
8
Data Mining Activities
 Two broad groups – directed and undirected data mining
 Directed data mining
 We know what we are looking for
 We aim to find the value of a pre-identified target variable in
terms of a collection of input variables, eg, classifying
insurance claims
 Undirected data mining
 Finds patterns in data
 Leaves it to the user to find the significance of these
patterns
 Eg, identifying groups of customers with similar buying
patterns
ICT619
9
Different types of data mining tasks
ICT619
10
Data Mining Tasks
Classification
 Assigns a given object to a predefined category (class)
based on the object’s attributes (features)
 Objects to be classified are generally database
records.
 Discrete outcomes – yes/no, low/medium/high etc,
 Examples of classification tasks:
 Assigning keywords to articles
 Classifying credit applicants as low, medium and high risk
 Assigning customers to predefined customer segments
ICT619
11
Data Mining Tasks
Estimation
 Continuously varying outcomes
 Eg income, probability of a customer leaving (known in data
mining circles as churning)
 Outcomes can also be used for classification by ranking and
thresholding
Prediction
 Classification or estimation task performed to predict some
future behaviour
 Examples include:
- Predicting which customers will churn in the next six
months
- Predicting the size of a balance that will be transferred
ICT619
12
Data Mining Tasks (cont’d)
Finding affinity grouping or association rules
 Finds out, which things go together, eg, in a supermarket
shopping trolley
 Used for arranging items in shelves or catalogues
 Identifying cross-selling opportunities
Clustering
 Segments a group of diverse records into subgroups or
clusters containing similar records
 No predefined classes in clustering; records grouped based
on similarities in their attributes
 Eg, people with similar buying habits
 Data miner must interpret clusters and decide what to do
ICT619
13
Data Mining Tasks (cont’d)
Description and visualisation
 To help increase our understanding of people,
products or processes that produced the data
 A good description can provide an explanation
of their behaviour
 Data visualisation can be very effective in
explaining things by exploiting our ability to
utilise visual clues
ICT619
14
Data Mining Techniques
 Our aim is a basic understanding of data mining
techniques to find out
 When to apply them
 How to interpret their results
 How to evaluate their performance
 Three major approaches are:
 Decision trees
 Automatic cluster detection
 Artificial neural networks (supervised and unsupervised)
ICT619
15
Decision Trees
 Visual representation of
a reasoning process
 Particularly suitable for
solving classification
problems
 Consists of internal
nodes, leaf nodes and
edges
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
ICT619
16
Decision Trees (cont’d)
 Each leaf node is labelled
with a class label
 The class label decided by
the class of the records
that ended up in that leaf
during training
 A leaf node may also
contain a value depending
upon the average of the
values of such records
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
Group A contains any self-employed person aged <=40 and earning a
salary of more than $50,000.
ICT619
17
Decision Trees (cont’d)
 Each edge originating
from an internal node is
labelled with a splitting
predicate involving that
node’s splitting attribute
 The splitting predicate
forces any record to take
a unique path from the
root to exactly one leaf
node.
Fig. A sample decision tree for catalogue mailing (Ganti et al. 1999).
Group A contains any self-employed person aged less than 41 and
earning a salary of more than $50,000.
ICT619
18
How decision trees work
 Each record with N attributes is a point in an Ndimensional record space
 Each branch in a decision tree is a test on a single
variable that splits the space into two or more regions
 With each successive test and split, the resulting
regions get more and more segregated with increasing
homogeneity among the records
 Ultimately, the leaf nodes will contain the purest batch
of records
ICT619
19
How decision trees work
For example, in the
example decision
tree, any selfemployed person
aged <= 40 and
earning a salary of
more than $50,000
will be classified as
belonging to group
A.
ICT619
20
How decision trees work (cont’d)
Overfitting in decision trees
 A decision tree that correctly classifies every single
record
 Such a tree is unlikely to generalise to new data sets
 To prevent overfitting, test data set are used to prune
decision trees once it has been built using the training
data set.
ICT619
21
How decision trees are built
Recursive partitioning
 An iterative process of splitting the training data into
partitions (regions of record space)
 Initially, all records are in a training set consisting of
pre-classified records
 An algorithm splits up the data, using every possible
binary split on every field of the records
 The best split is defined as one that creates partitions
where a single class predominates
ICT619
22
How decision trees are built
(cont’d)
Recursive partitioning (cont’d)
 The most important task in building a decision tree is to
decide which of the attributes (independent fields in a
record) gives the best split
 The measure used to evaluate a potential splitter is the
reduction in diversity (or increase in purity)
 The best split has the largest reduction in diversity
 One measure of diversity is the Gini index:
2p1 * (1 – p2)
ICT619
23
How decision trees are built
(cont’d)
Recursive partitioning (cont’d)
 The splitting process is applied to each of the new
parts and so on until no more useful splits can be found
 A node becomes a leaf node when no split can be
found that significantly decreases the diversity
Pruning
 The full decision tree needs to be pruned to improve its
performance
 Pruning is done by removing leaves and branches
(edges leading to leaves) that fail to generalise
 There are a number of pruning methods
Eg, a tree is pruned back to the subtree that minimises
error on the test set.
ICT619
24
How decision trees are built
(cont’d)
Different types of decision trees
 Types depend upon
 the number of splits allowed at each level
 how these splits are chosen when the tree is built
 how the tree is pruned to prevent overfitting
 More broadly, decision trees can be grouped
as:
 Classification trees (leaves represent classes)
 Regression trees (leaves represent a numeric
value)
ICT619
25
Algorithms for building decision
trees
 Most notable are
- CHAID, C4.5/C5.0 and CART
 Data mining software tools allow approximation
of any of these algorithms by providing choice
of
 splitting criteria and pruning strategies
 control parameters such as maximum tree depth
ICT619
26
Application of decision trees
 Useful when the data mining task is
classification of records or prediction of
outcomes
 Also chosen to generate understandable rules,
which can be explained and translated into
SQL or a natural language
 For example,
IF age < 41
AND income < $50,000
AND employment = self
THEN belongs to group A
ICT619
27
Automatic Cluster Detection
 Aims to discover structure in a complex data set as a
whole in order to carve it up into simpler groups
 Examples of clustering
- finding products that should be grouped together in a
catalogue, or
- identifying groups of customers with similar tastes in
music
 Many methods for finding clusters in data, a prominent
one is K-means clustering
ICT619
28
K-means clustering
 Available in a wide
variety of commercial
data mining tools
 Divides the data set
into a predetermined
number, k, of clusters
 Initial clusters centred
at random points
(seeds) in the record
space
ICT619
29
K-means clustering (cont’d)
 Records are assigned to the
clusters through an iterative
process
 In the first step, k data points
are selected to be the seeds
 Each seed is an embryonic
cluster with only one element
 In the second step, each
record is assigned to the
cluster whose centroid is
nearest to that record
 This forms the new clusters
with new intercluster
boundaries.
ICT619
30
K-means clustering (cont’d)
 The centroid of a cluster of
records calculated by taking
average of each field for all
the records in that cluster
 Euclidean distance most
commonly used for
measuring distance by data
mining software.
 Distance between two points
P(x1, x2, .. , xn) and Q(y1,
y2, .. , yn) in n-dimensional
space is ((x1-y1)2 + (x2y2)2 + .. + (xn-yn)2).
ICT619
31
K-means clustering (cont’d)
 In the k-means method, the original choice of the value
of k determines the number of clusters that will be
found
 Unless advanced knowledge is available on the likely
number of clusters, value of k is determined by trialand-error
 Best results are obtained when k matches the
underlying structure of the data.
ICT619
32
Interpreting clusters
 Automatic clustering is undirected data mining
 - We look for something useful without having to know
what we are looking for
 Both an advantage and possible disadvantage!
 The most frequently used approaches interpreting
clusters are
 Building a decision tree with the cluster labels as target
variables, and
using it to derive rules explaining how to assign new
records to the correct cluster
 Using visualisation to see how the clusters are affected
by changes in input variables.
 Examining the differences in the distributions of variables
from cluster to cluster, one variable at a time.
ICT619
33
Application of clusters
Clustering is used
 When natural groupings are suspected,
Eg, groups representing customers or products that
have a lot in common with each other
 When there are many competing patterns in the data
making it hard to spot any single pattern
 Creating clusters reduces the complexity within
clusters so that other data mining techniques are more
likely to succeed
ICT619
34
Artificial Neural Networks
 Main generic application of artificial neural networks is
pattern recognition or classification
 Estimation and prediction can be viewed as variants of
classification
 The best ANN model for performing classification is the
backpropagation network (or the multilayer perceptron)
 The ANN model particularly suited for clustering is the
Kohonen net or the self-organising map (SOM)
ICT619
35
Artificial Neural Networks (cont’d)
 SOM learning algorithms are unsupervised
 Clusters are represented in a SOM by groups of
adjacent neurons in output layer
 SOM reduces dimensionality from N to 2
 SOM can serve as a clustering tool as well as
visualisation tool for high-dimensional data
 SOMs claimed to be often more effective than k-means
for complex shaped clusters
ICT619
36
Application of neural nets
 Artificial neural networks can produce very good results
 But require extensive data preparation involving
normalisation and conversion of categorical values to
numeric values
 Do not work well when there are many hundreds or
thousands of input features - long training phases
 Difficult to understand because they represent complex
non-linear models
 Unlike decision trees, do not produce rules readily.
 A good choice for most classification and prediction
tasks when the results are more important than
understanding how the model works
ICT619
37
How to Apply Data Mining
Four ways of utilising data mining expertise in business:
1. By purchasing readymade scores (such as on credit
worthiness for a loan applicant) from outside vendors.
2. By purchasing software that embodies data mining
expertise designed for a particular application such as
credit approval, fraud detection or churn prevention
3. By hiring outside consultants to perform data mining
for special projects
4. By developing own data mining skills within the
business organization
Purchasing scores is quick and easy, but the intelligence
limited to single score values
ICT619
38
Purchasing Software
Two possibilities:
 Software may be an actual model
Eg, in the form of a set of rules for decision support, or
a fully-trained neural network applied to a particular
domain
 Software may embody knowledge of the process of
building models for a particular domain in the form of a
model-creation wizard or template
Purchased models work well if the products, customers,
and market conditions match those used to develop the
model
ICT619
39
Tasks for the data mining model
builder
 Model building software automate the process of
creating candidate models and selecting the ones that
perform best
 Significant tasks left for the user:
 Choosing a suitable business problem to be addressed by
data mining.
 Identifying and collecting data that is likely to contain the
information needed to answer the business question.
 Pre-processing the data so that the data mining tool can make
use of it
 Transforming the database so that the input variables needed
by the model are available
 Designing a plan of action based on the model and
implementing it in the marketplace
 Measuring results of the actions and feeding them back into
the database for future mining.
ICT619
40
Hiring Outside Experts
 Recommended approach if
 Organization in early stages of integrating data
mining in its business
 Data mining activity is to be an one-off process
 Not if
 it is to be an ongoing process, eg, data mining for
customer relationship management
 Outside expertise for data mining is likely to be
available in three possible places
ICT619
41
Hiring Outside Experts (cont’d)
Outside expertise for data mining is likely to be available
in three possible places:
1. From a data mining software vendor
2. Data mining centres
Usually collaborations between universities and
private companies
Eg, Monash Data Mining Centre
3. Consulting companies
Consulting company chosen should have had
experience specifically in the area of interest to the
organisation
ICT619
42
Developing In-house Expertise for
data mining
 Applies particularly to companies which have
many products and customers
 Should be a core competency of all large scale
businesses
ICT619
43
Data Mining Development
Methodology


Best practice yet to emerge (Hirji 2001)
A proposed a five-stage model (Cabena
1998):
1.
2.
3.
4.
5.
Business objective determination
Data preparation
Data mining
Results analysis
Knowledge assimilation
ICT619
44
Data Mining Development
Methodology (cont’d)
Cabena’s five-stage model:
 Business objective determination
 Clearly identifying the business problem to be mined
 Data preparation
 Data selection, preprocessing and transformation
 Data mining
 Algorithm selection and execution
 Results analysis
 Has anything new or interesting been found
 Knowledge assimilation
 Formulate ways of exploiting the new information extracted
ICT619
45
A Case Study (Hirji 2001)
 Involved a large fast food outlet
 Brought out some deficiencies of the above
methodology
 A new set of stages for data mining development and
use proposed:
1. Business objective determination
2. Data preparation
3. Data audit
4. Interactive data mining and results analysis
5. Back end data mining
6. Results synthesis and presentation
ICT619
46
A Case Study (Hirji 2001) (cont’d)
 Case study used IBM’s Intelligent Miner for Data on AIX as
the data mining tool
 Took 20 actual days of effort across the 6 stages above
 Back end data mining involves data enrichment and
additional data mining algorithm execution by the data
mining specialist
 Distribution of time required
 45% taken up by stages 4, 5 and 6
 30% required by the data preparation stage (70% predicted
in the earlier model)
 Use of a data warehouse saved time needed for selecting,
cleaning, transforming, coding, and loading the data
ICT619
47
A Case Study (Hirji 2001) (cont’d)
Interactive data mining and results analysis stage
 Linking data mining results with business
strategy and using application software such
as spreadsheets to perform sensitivity analysis
of results obtained
 Aims to demonstrate how data mining results
support business strategy
ICT619
48
REFERENCES
 Berry, M., & Linoff, G. Mastering Data Mining, Wiley Computer
Publishing, New York 2000.
 Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A.
Discovering Data Mining: From Concept to Implementation.
Prentice Hall, Englewood Cliffs, NJ 1998.
 Dhar, V., & Stein, R.,”Deriving Rules from Data” in Seven Methods
for Transforming Corporate Data into Business Intelligence.,
Prentice Hall 1997, pp. 167-189, 251-258.
 Ganti, V., Gehrke, J., & Ramakrishnan, R. Mining Very Large
Databases, IEEE Computer, Vol.32 No.8, August 1999, pp.38-45.
 Hirji, K., Exploring Data Mining Implementation, Communications
of the ACM, Vol.44, No.7, July 2001, pp. 87-93.
 Web site on Data Mining and Web Mining http://www.kdnuggets.com/software/suites.html
ICT619
49