5.Data Mining
Introduction
Let us try to understand the technology in a business context.
Like all decision support systems, data mining delivers
information.
Decision support progresses to data mining.
Results of Data Mining
1. Forecasting what may happen in the future
2. Classifying people or things into groups by recognizing patterns
3. Clustering people or things into groups based on their attributes
4. Associating what events are likely to occur together
5. Sequencing what events are likely to lead to later events
Data mining is not
• Brute-force crunching of bulk data
• "Blind" application of algorithms
• Going to find relationships where none exist
• Presenting data in different ways
• A database-intensive task
• A difficult-to-understand technology requiring an advanced degree in computer science
What is Data Mining?
Data mining is the exploration and analysis of large
quantities of data in order to discover valid, novel,
potentially useful, and ultimately understandable
patterns in data.
What is Data Mining? (contd.)
Valid: The patterns hold in general.
Novel: We did not know the pattern beforehand.
Useful: We can devise actions from the patterns.
Understandable: We can interpret and comprehend
the patterns.
Why Use Data Mining Today?
Competitive pressure!
“The secret of success is to know something that nobody
else knows.” Aristotle Onassis
Human analysis skills are inadequate:
• Volume and dimensionality of the data
• High data growth rate
Other drivers:
• Competition on service, not only on price (banks, phone companies, hotel chains, rental car companies)
• Personalization, CRM
• The real-time enterprise
• "Systemic listening"
• Security, homeland defense
Data Mining Features
1. It is a knowledge discovery process.
2. Data mining helps to understand the substance of the data in a special, unsuspected way.
3. It unearths patterns and trends in the raw data you never knew existed.
Data Mining Features (contd.)
4. Data mining centers around the automated discovery of new facts and relationships in data.
5. Data mining tools enable you to uncover hidden information.
6. The assumption is that more useful knowledge lies hidden beneath the surface.
Examples of What People Are Doing with Data Mining (Applications)
• Fraud/Non-Compliance
  - Anomaly detection
  - Isolate the factors that lead to fraud, waste and abuse
  - Target auditing and investigative efforts more effectively
• Credit/Risk Scoring
• Intrusion detection
• Parts failure prediction
• Recruiting/Attracting customers
• Maximizing profitability (cross selling, identifying profitable customers)
• Service Delivery and Customer Retention
  - Build profiles of customers likely to use which services
Data Mining versus OLAP
OLAP (On-line Analytical Processing) provides you with a very good view of what is happening, but it cannot predict what will happen in the future or why it is happening.
The Knowledge Discovery Process
Data mining is the process of knowledge discovery.
Data mining discovers knowledge or information that
you never knew was present in your data.
The Knowledge Discovery Process
Steps:
1. Identify business problem
2. Data mining
3. Action
4. Evaluation and measurement
5. Deployment and integration into business processes
Data Mining Step in Detail
1. Data preprocessing
   • Data selection: identify target datasets and relevant fields
   • Data cleaning: remove noise and outliers
   • Data transformation: create common units, generate new fields
2. Data mining model construction
3. Model evaluation
What is a Data Mining Model?
A data mining model is a description of a specific aspect
of a dataset. It produces output values for an assigned
set of input values.
Examples:
• Linear regression model
• Classification model
• Clustering
Data Mining Models (Contd.)
A data mining model can be described at two
levels:
Functional level:
• Describes the model in terms of its intended usage. Examples: classification, clustering.
Representational level:
• Gives the specific representation of the model. Examples: log-linear model, classification tree, nearest neighbor method.
Association Rule Mining (or Market Basket Analysis)
Introduction
As noted earlier, a huge amount of data is stored electronically in many retail outlets due to the barcoding of goods sold.
It makes good business sense to try to find some useful information from these mountains of data.
A conceptually simple yet interesting technique is to find
association rules from these large databases.
Introduction
Association rules mining (or market basket analysis)
searches for interesting customer habits by looking at
associations.
The classical example is the one where a store in the USA was reported to have discovered that people buying bread tend also to buy cheese (it is not clear whether this is actually true).
Applications in marketing, store layout, customer
segmentation, medicine, finance, and many more.
Question: Suppose you are able to find that two products x and y (say bread and milk or crackers
and cheese) are frequently bought together. How can you use that information?
A Simple Example
Consider the following ten transactions of a shop selling only nine items. We
will return to it after explaining some terminology.
Transaction ID   Items
10               Bread, Cheese, Newspaper
20               Bread, Cheese, Juice
30               Bread, Milk
40               Cheese, Juice, Milk, Coffee
50               Sugar, Tea, Coffee, Biscuits, Newspaper
60               Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70               Bread, Cheese
80               Bread, Cheese, Juice, Coffee
90               Bread, Milk
100              Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Terminology
Let the number of different items sold be n
Let the number of transactions be N
Let the set of items be {i1, i2, …, in}. The number of items
may be large, perhaps several thousands.
Let the set of transactions be {t1, t2, …, tN}. Each transaction
ti contains a subset of items from the itemset {i1, i2, …, in}.
These are the things a customer buys when they visit the
supermarket.
Terminology
We want to find a group of items that tend to occur
together frequently.
The association rules are often written as X→Y meaning
that whenever X appears Y also tends to appear.
X and Y may be single items or sets of items but the
same item does not appear in both.
Terminology
Suppose X and Y appear together in only 10% of the transactions, but whenever X appears there is an 80% chance that Y also appears.
The 10% presence of X and Y together is called the support (or prevalence) of the rule, and 80% is called the confidence (or predictability) of the rule.
These are measures of interestingness of the rule.
Terminology
• Confidence denotes the strength of the association
between X and Y.
• Support indicates the frequency of the pattern. A
minimum support is necessary if an association is
going to be of some business value.
• Let the chance of finding an item X in the N transactions be x%; then the probability of X is P(X) = x/100, since probability values are always between 0.0 and 1.0.
Question: Why is minimum support necessary? Why is minimum confidence necessary?
Terminology
• Now suppose we have two items X and Y with
probabilities of P(X) = 0.2 and P(Y) = 0.1. What
does the product P(X) x P(Y) mean?
• What is likely to be the chance of both items X and
Y appearing together, that is P(X and Y) = P(X U
Y)?
Terminology
• Suppose we know P(X U Y); then what is the probability of Y appearing if we know that X already exists? It is written as P(Y|X).
• The support for X→Y is the probability of both X
and Y appearing together, that is P(X U Y).
• The confidence of X→Y is the conditional
probability of Y appearing given that X exists. It
is written as P(Y|X) and read as P of Y given X.
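In symbols (a restatement of the two definitions above, using the itemset-union notation of these slides):
support(X → Y) = P(X U Y)
confidence(X → Y) = P(Y|X) = P(X U Y) / P(X)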
THE TASK & NAÏVE ALGORITHM
We want to find all associations that have at least p% support and at least q% confidence, such that:
• all rules satisfying any user constraints are found
• the rules are found efficiently from large databases
• the rules found are actionable
Applications
Although we are only considering basket
analysis, the technique has many other
applications as noted earlier.
For example, we may have many patients coming to a hospital with various diseases and from various backgrounds.
We therefore have a number of attributes (items), and we may be interested in knowing which attributes appear together, and perhaps which attributes are frequently associated with some particular disease.
Example
We have 9 items and 10 transactions. Find support and confidence for each pair.
How many pairs are there?
Transaction ID   Items
10               Bread, Cheese, Newspaper
20               Bread, Cheese, Juice
30               Bread, Milk
40               Cheese, Juice, Milk, Coffee
50               Sugar, Tea, Coffee, Biscuits, Newspaper
60               Sugar, Tea, Coffee, Biscuits, Milk, Juice, Newspaper
70               Bread, Cheese
80               Bread, Cheese, Juice, Coffee
90               Bread, Milk
100              Sugar, Tea, Coffee, Bread, Milk, Juice, Newspaper
Example
There are 9 x 8 or 72 pairs, half of them duplicates. 36 is too many to
analyse in a class, so we use an even simpler example of n = 4 (Bread,
Cheese, Juice, Milk) and N = 4. We want to find rules with minimum
support of 50% and minimum confidence of 75%.
Transaction ID   Items
100              Bread, Cheese
200              Bread, Cheese, Juice
300              Bread, Milk
400              Cheese, Juice, Milk
Example
N = 4 results in the following frequencies. All
items have the 50% support we require. Only two
pairs have the 50% support and no 3-itemset has
the support.
We will look at the two pairs that have the
minimum support to find if they have the required
confidence.
Example
Itemsets                      Frequency
Bread                         3
Cheese                        3
Juice                         2
Milk                          2
(Bread, Cheese)               2
(Bread, Juice)                1
(Bread, Milk)                 1
(Cheese, Juice)               2
(Cheese, Milk)                1
(Juice, Milk)                 1
(Bread, Cheese, Juice)        1
(Bread, Cheese, Milk)         0
(Bread, Juice, Milk)          0
(Cheese, Juice, Milk)         1
(Bread, Cheese, Juice, Milk)  0
Terminology
Items or itemsets that have the minimum support are called
frequent.
In our example, all the four items and two pairs are frequent.
We will now determine if the two pairs {Bread, Cheese} and
{Cheese, Juice} lead to association rules with 75% confidence.
Every pair {A, B} can lead to two rules A → B and B → A if both
satisfy the minimum confidence. Confidence of A → B is given by
the support for A and B together divided by the support of A.
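To make this computation concrete, here is a minimal Python sketch (not part of the original slides; the variable names are mine) that counts item and pair frequencies for the four-transaction example above and prints the support and confidence of every pair rule whose pair has at least 50% support:

```python
from itertools import combinations
from collections import Counter

# The four transactions from the example (n = 4 items, N = 4 transactions).
transactions = [
    {"Bread", "Cheese"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
N = len(transactions)
min_support, min_confidence = 0.5, 0.75

# Count how many transactions contain each single item and each pair.
counts = Counter()
for t in transactions:
    for item in t:
        counts[frozenset([item])] += 1
    for pair in combinations(sorted(t), 2):
        counts[frozenset(pair)] += 1

# A rule A -> B is confident if support(A and B) / support(A) >= min_confidence.
for pair, freq in counts.items():
    if len(pair) == 2 and freq / N >= min_support:
        a, b = sorted(pair)
        for lhs, rhs in ((a, b), (b, a)):
            conf = freq / counts[frozenset([lhs])]
            print(f"{lhs} -> {rhs}: support={freq/N:.0%}, confidence={conf:.0%},",
                  "confident" if conf >= min_confidence else "not confident")
```

Running this reproduces the four rules discussed on the next slide, with only Juice -> Cheese reaching 75% confidence.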
Rules
We have four possible rules and their confidence is given as follows:
Bread → Cheese with confidence of 2/3 = 67%
Cheese → Bread with confidence of 2/3 = 67%
Cheese → Juice with confidence of 2/3 = 67%
Juice → Cheese with confidence of 100%
Therefore only the last rule, Juice → Cheese, has confidence above the minimum 75% and qualifies. Rules that have more than the user-specified minimum confidence are called confident.
Problems with NAÏVE ALGORITHM
This simple algorithm works well with four items, since there were only 16 combinations that we needed to look at, but if the number of items is, say, 100, the number of combinations is astronomically larger.
The number of combinations becomes about a million with 20 items, since the number of combinations is 2^n with n items (why?). The naïve algorithm can be improved to deal more effectively with larger data sets.
Improved NAÏVE ALGORITHM
Rather than counting all possible item combinations we can look at
each transaction and count only the combinations that actually occur
as shown below (that is, we don’t count itemsets with zero frequency).
Transaction ID   Items                  Combinations
100              Bread, Cheese          {Bread, Cheese}
200              Bread, Cheese, Juice   {Bread, Cheese}, {Bread, Juice}, {Cheese, Juice}, {Bread, Cheese, Juice}
300              Bread, Milk            {Bread, Milk}
400              Cheese, Juice, Milk    {Cheese, Juice}, {Cheese, Milk}, {Juice, Milk}, {Cheese, Juice, Milk}
Improved naïve algorithm
Removing zero frequencies leads to the smaller
table below. We then proceed as before but the
improvement is not large and we need better
techniques.
Itemsets                 Frequency
Bread                    3
Cheese                   3
Juice                    2
Milk                     2
(Bread, Cheese)          2
(Bread, Juice)           1
(Bread, Milk)            1
(Cheese, Juice)          2
(Juice, Milk)            1
(Bread, Cheese, Juice)   1
(Cheese, Juice, Milk)    1
The Classical Apriori Algorithm
The Apriori Algorithm
The basic algorithm to find association rules was proposed in 1993; an improved algorithm followed in 1994.
This classical Apriori algorithm may be simply described by a two-step approach:
Step 1 ─ discover all frequent (single) items that have support above the minimum support required
Step 2 ─ use the set of frequent items to generate the association rules that have a high enough confidence level
A more formal description is given on the slide after the next.
Apriori Algorithm Terminology
A k-itemset is a set of k items (e.g. for k = 10, the itemset {1, 2, 3, …, 10}).
The set Ck is the set of candidate k-itemsets that are potentially frequent (e.g. {1,2,3}, {2,3,4}).
The set Lk is the subset of Ck consisting of the k-itemsets that are actually frequent (e.g. {1,2}).
Step-by-step Apriori Algorithm
Step 1 – Computing L1:
Scan all transactions. Find all frequent items that have
support above the required p%.
Let these frequent items be labeled L1.
Step 2 – Apriori-gen Function :
Use the frequent items L1 to build all possible item pairs, like {Bread, Cheese} if Bread and Cheese are in L1.
The set of these item pairs is called C2, the candidate set.
Question: Do all members of C2 have the support? Why?
The Apriori Algorithm
Step 3 – Pruning:
Scan all transactions and find all pairs in the candidate pair set C2 that are frequent.
Let these frequent pairs be L2.
Step 4 – General rule :
A generalization of Step 2. Build candidate set of k
items Ck by combining frequent itemsets in the set
Lk-1.
The Apriori Algorithm
Step 5 – Pruning
A generalization of Step 3. Scan all transactions and find all itemsets in Ck that are frequent.
Let these frequent itemsets be Lk.
Step 6 – Continue
Continue with Step 4 unless Lk is empty.
Step 7 – Stop
Stop when Lk is empty.
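A minimal Python sketch of this level-wise procedure (my own illustration, not an optimized implementation; the function and variable names are assumptions). It is run here on the five-transaction dataset introduced on the next slide, with 50% minimum support:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all frequent itemsets (as frozensets) with their transaction counts."""
    N = len(transactions)
    transactions = [frozenset(t) for t in transactions]
    items = {i for t in transactions for i in t}

    def count(itemsets):
        return {c: sum(1 for t in transactions if c <= t) for c in itemsets}

    def prune(counts):
        return {c: n for c, n in counts.items() if n / N >= min_support}

    # Step 1: frequent 1-itemsets L1.
    frequent = prune(count({frozenset([i]) for i in items}))
    result, k = dict(frequent), 2

    while frequent:
        # Steps 2 and 4 (apriori-gen): join L(k-1) with itself to build Ck,
        # keeping only candidates whose (k-1)-subsets are all frequent.
        prev = list(frequent)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent for s in combinations(c, k - 1))}
        # Steps 3 and 5: scan all transactions and keep the frequent candidates Lk.
        frequent = prune(count(candidates))
        result.update(frequent)
        k += 1          # Steps 6-7: continue until Lk is empty.
    return result

# The five transactions from the example on the next slide, 50% minimum support.
transactions = [
    {"Bread", "Cheese", "Eggs", "Juice"},
    {"Bread", "Cheese", "Juice"},
    {"Bread", "Milk", "Yogurt"},
    {"Bread", "Juice", "Milk"},
    {"Cheese", "Juice", "Milk"},
]
for itemset, n in sorted(apriori(transactions, 0.5).items(),
                         key=lambda x: (len(x[0]), sorted(x[0]))):
    print(sorted(itemset), n)
```

This reproduces the L1 and L2 sets derived by hand on the following slides and stops because no candidate 3-itemset survives the subset check.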
An Example
This example is similar to the last example but
we have added two more items and another
transaction making five transactions and six items.
We want to find association rules with 50%
support and 75% confidence.
Transaction ID   Items
100              Bread, Cheese, Eggs, Juice
200              Bread, Cheese, Juice
300              Bread, Milk, Yogurt
400              Bread, Juice, Milk
500              Cheese, Juice, Milk
Example
First find L1.
50% support requires that each frequent item appear in at
least three transactions.
Therefore L1 is given by:
Item     Frequency
Bread    4
Cheese   3
Juice    4
Milk     3
Example
The candidate 2-itemsets or C2 therefore has six pairs.
These pairs and their frequencies are:
Item Pairs        Frequency
(Bread, Cheese)   2
(Bread, Juice)    3
(Bread, Milk)     2
(Cheese, Juice)   3
(Cheese, Milk)    1
(Juice, Milk)     2
Question: Why did we not select (Egg, Milk)?
Deriving Rules
L2 has only two frequent item pairs, {Bread, Juice} and {Cheese, Juice}. Given these two frequent pairs, there are no candidate 3-itemsets (why?), since we do not have two 2-itemsets that have the same first item (why?).
The two frequent pairs lead to the following possible rules:
Bread → Juice
Juice → Bread
Cheese → Juice
Juice → Cheese
Deriving Rules
The confidence of these rules is obtained by dividing the
support for both items in the rule by the support of the
item on the left hand side of the rule.
The confidence of the four rules therefore are
3/4 = 75% (Bread → Juice)
3/4 = 75% (Juice → Bread)
3/3 = 100% (Cheese → Juice)
3/4 = 75% (Juice → Cheese)
Since all of them have a minimum 75% confidence, they all
qualify. We are done since there are no 3–itemsets.
Questions
1. How do we compute C3 from L2 or more
generally Ci from Li-1?
2. If there were several members of the set C3,
how do we derive L3 by pruning C3?
3. Why did we not consider (Bread, Cheese, Juice)
in the last example?
4. Assume that the 3-itemset {A, B, M} has the
minimum support. What do we do next? How do
we derive association rules?
Answers
To compute C3 from L2, or more generally Ci from Li-1, we
join members of Li-1 to other members in Li-1 by first sorting
the itemsets in their lexicographic order and then joining
those itemsets in which the first (i-2) items are common.
Observe that if an itemset in C3 is (a, b, c) then L2 must
have had itemsets (a, b), (b, c) and (a, c) since all subsets
of C3 must be frequent. Why?
To repeat, itemsets in a candidate set Ci or a frequent set Li
will be frequent only if every subset is frequent.
Answers
For example, {A, B, M} will be frequent only if
{A, B}, {A,M} and {B, M} are frequent, which in
turn requires each of A, B and M also to be
frequent.
We did not consider (Bread, Cheese, Juice) because the pairs (Bread, Cheese) and (Bread, Juice) would both have to be in L2, but (Bread, Cheese) was not, so the three-itemset could not be frequent.
Example
If the set {A, B, M} has the minimum support, we can find association rules that satisfy the conditions of support and confidence by first generating all nonempty proper subsets of {A, B, M} and using each of them on the LHS, with the remaining symbols on the RHS.
Therefore possible rules involving all three items are
A → BM, B → AM, M → AB, AB → M, AM → B and BM
→ A.
Now we need to test their confidence.
Answers
To test the confidence of the possible rules, we
proceed as we have done before. We know that
confidence(A → B) = P(B|A) = P(A U B) / P(A)
This confidence is the likelihood of finding B if A
already has been found in a transaction. It is the
ratio of the support for A and B together and
support for A by itself.
The confidence of all these rules can thus be
computed.
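A small Python sketch of this rule-generation step (my own illustration; the function name and the sample counts are assumptions). Given a table of frequent-itemset counts, it enumerates every non-empty proper subset of a frequent itemset as the LHS and keeps the rules whose confidence reaches the threshold:

```python
from itertools import combinations

def rules_from_itemset(itemset, counts, min_confidence):
    """Generate all confident rules LHS -> RHS from one frequent itemset.

    `counts` maps frozensets to transaction counts and must contain every
    non-empty subset of `itemset` (guaranteed if it came from an Apriori pass).
    """
    itemset = frozenset(itemset)
    rules = []
    for r in range(1, len(itemset)):                       # non-empty proper subsets
        for lhs in map(frozenset, combinations(itemset, r)):
            rhs = itemset - lhs
            confidence = counts[itemset] / counts[lhs]     # P(A U B) / P(A)
            if confidence >= min_confidence:
                rules.append((set(lhs), set(rhs), confidence))
    return rules

# Hypothetical counts for {A, B, M} and its subsets, just to exercise the function.
counts = {frozenset("A"): 5, frozenset("B"): 4, frozenset("M"): 6,
          frozenset("AB"): 4, frozenset("AM"): 4, frozenset("BM"): 3,
          frozenset("ABM"): 3}
for lhs, rhs, conf in rules_from_itemset("ABM", counts, 0.75):
    print(lhs, "->", rhs, f"{conf:.0%}")
```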
Efficiency
Consider an example of a supermarket database
which might have several thousand items including
1000 frequent items and several million
transactions.
Which part of the apriori algorithm will be the
most expensive to compute? Why?
Efficiency
The algorithm to construct the candidate set Ci to find the frequent set Li is crucial to the performance of the Apriori algorithm. The larger the candidate set, the higher the processing cost of discovering the frequent itemsets, since the transactions must be scanned for each candidate. Given that the number of early candidate itemsets is very large, the initial iterations dominate the cost.
In a supermarket database with about 1000 frequent items, there will be almost half a million candidate pairs in C2 that need to be counted in order to find L2.
Efficiency
Generally the number of frequent pairs out of half
a million pairs will be small and therefore (why?)
the number of 3-itemsets should be small.
Therefore, it is the generation of the frequent set
L2 that is the key to improving the performance of
the Apriori algorithm.
Classification
Classical problem in statistics and machine learning.
Separation of objects or ordering of objects into classes.
Usually a set of classes is pre-defined, and a set of training samples is used for building the class models.
Each new object is then assigned to one of the classes using the model.
Typical applications: credit approval, treatment effectiveness analysis.
A Priori Classification
The classification based on training samples with
class labels known is called supervised learning.
New data is classified based on the model
developed during training.
Each object has a number of attributes. One special attribute is the class label, also called the output or dependent attribute; its value is known for the training samples but not for the others. The other attributes are independent attributes.
Attributes may be numerical (e.g. income) or
categorical (ordered e.g. course grade or unordered
e.g. gender, job title, race). The class label is
categorical.
A Posteriori Classification
No training data with known class labels is available; the classes may not even be known a priori.
Usually clustering is used in such situations, although some unsupervised classification techniques do exist.
Classification
Classification is a prediction technique. Given the
values of the independent attributes the class can
be predicted.
Regression is another prediction technique.
Classification is used to predict categorical values
while regression is used to predict continuous or
ordered values.
Two Phases
I. Model construction phase:
   1) Each training sample belongs to a predefined class given by the class label attribute. A model is built.
II. Usage and testing phase:
   1) Each test object (not used for training) is classified.
   2) Accuracy of the model is estimated:
      1. If the class label of some test samples is known, it is compared with the result from the model.
      2. The accuracy rate is the percentage of test samples correctly classified by the model.
Quality evaluation
Many quality measures exist. Han and Kamber list these:
• Predictive accuracy
• Efficiency (speed to construct the model and to use it)
• Robustness in handling noise and missing values
• Scalability: efficiency in handling large, disk-resident data
• Interpretability: understanding and insight provided by the model
• Goodness of rules (size, compactness of the model)
Decision Tree
1. A decision tree is an approach like that needed
to support the game of twenty questions where
each internal node of the tree denotes a
question or a test on the value of an
independent attribute, each branch represents
an outcome of the test, and each leaf
represents a class.
2. Assume that each object has a number of
independent attributes and a dependent
attribute.
Example (From Han and Kamber)
age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
30…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no
A decision Tree (From Han and Kamber)
age?
  <=30 → student?
    no → buys_computer = no
    yes → buys_computer = yes
  31..40 → buys_computer = yes
  >40 → credit rating?
    excellent → buys_computer = no
    fair → buys_computer = yes
Decision Tree
To classify an object, the appropriate attribute
value is used at each node, starting from the root,
to determine the branch taken. The path found by
tests at each node leads to a leaf node which is the
class the model believes the object belongs to.
A decision tree is an attractive technique since the results are easy to understand.
The rules can often be expressed in natural
language e.g. if the student has GPA > 3.0 and
class attendance > 90% then the student is likely
to get a D.
Basic Algorithm
1. The training data is S. Discretise all continuous-valued attributes. Let the root node contain S.
2. If all objects S in the root node belong to the same class, then stop.
3. Split the next leaf node by selecting an attribute A from amongst the independent attributes that best divides or splits the objects in the node into subsets, and create a decision tree node.
4. Split the node according to the values of A.
5. Stop if either of the following conditions is met, otherwise continue with Step 3:
   -- the data in each subset belongs to a single class;
   -- there are no remaining attributes on which the sample may be further divided.
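The slides do not fix a particular splitting criterion in step 3 (ID3/C4.5 use information gain, CART uses the Gini index). The sketch below uses a simple weighted Gini impurity as one reasonable choice; the function names and the tiny sample table are mine, in the style of the sore-throat example that follows:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0 means the group is pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split_attribute(rows, attributes, target):
    """Pick the attribute whose value-groups have the lowest weighted impurity."""
    def weighted_impurity(attr):
        groups = {}
        for row in rows:
            groups.setdefault(row[attr], []).append(row[target])
        return sum(len(g) / len(rows) * gini(g) for g in groups.values())
    return min(attributes, key=weighted_impurity)

# A few rows in the style of the sore-throat example on the next slides.
rows = [
    {"SwollenGlands": "Yes", "Fever": "Yes", "Diagnosis": "Strep throat"},
    {"SwollenGlands": "No",  "Fever": "Yes", "Diagnosis": "Cold"},
    {"SwollenGlands": "No",  "Fever": "No",  "Diagnosis": "Allergy"},
    {"SwollenGlands": "Yes", "Fever": "No",  "Diagnosis": "Strep throat"},
]
print(best_split_attribute(rows, ["SwollenGlands", "Fever"], "Diagnosis"))  # SwollenGlands
```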
Building A Decision Tree
The aim is to build a decision tree consisting of a root node, a number of internal nodes, and a number of leaf nodes. Building the tree starts with the root node; the data is then split into two or more child nodes, which are in turn split into lower-level nodes, and so on until the process is complete.
The method uses induction based on the
training data. We illustrate it using a simple
example.
An Example (from the text)
Sore Throat   Fever   Swollen Glands   Congestion   Headache   Diagnosis
Yes           Yes     Yes              Yes          Yes        Strep throat
No            No      No               Yes          Yes        Allergy
Yes           Yes     No               Yes          No         Cold
Yes           No      Yes              No           No         Strep throat
No            Yes     No               Yes          No         Cold
No            No      No               Yes          No         Allergy
No            No      Yes              No           No         Strep throat
Yes           No      No               Yes          Yes        Allergy
No            Yes     No               Yes          Yes        Cold
Yes           Yes     No               Yes          Yes        Cold
• The first five attributes are symptoms and the last attribute is the diagnosis. All attributes are categorical.
• We wish to predict the diagnosis class.
An Example (from the text)
Consider each of the attributes in turn to see which would be a "good" one to start with.
Sore Throat   Diagnosis
No            Allergy
No            Cold
No            Allergy
No            Strep throat
No            Cold
Yes           Strep throat
Yes           Cold
Yes           Strep throat
Yes           Allergy
Yes           Cold
• Sore throat does not predict diagnosis.
An Example (from the text)
Is symptom fever any better?
Fever   Diagnosis
No      Allergy
No      Strep throat
No      Allergy
No      Strep throat
No      Allergy
Yes     Strep throat
Yes     Cold
Yes     Cold
Yes     Cold
Yes     Cold
Fever is better but not perfect.
An Example (from the text)
Try swollen glands
Swollen Glands   Diagnosis
No               Allergy
No               Cold
No               Cold
No               Allergy
No               Allergy
No               Cold
No               Cold
Yes              Strep throat
Yes              Strep throat
Yes              Strep throat
Good. Swollen glands = yes means Strep Throat
An Example (from the text)
Try congestion
Congestion   Diagnosis
No           Strep throat
No           Strep throat
Yes          Allergy
Yes          Cold
Yes          Cold
Yes          Allergy
Yes          Allergy
Yes          Cold
Yes          Cold
Yes          Strep throat
Not helpful.
An Example (from the text)
Try the symptom headache
Headache   Diagnosis
No         Cold
No         Cold
No         Allergy
No         Strep throat
No         Strep throat
Yes        Allergy
Yes        Allergy
Yes        Cold
Yes        Cold
Yes        Strep throat
Not helpful.
Decision Tree
Continuing the process one more step will find Fever as the next split attribute, giving the final result:

Swollen Glands?
  Yes → Diagnosis = Strep throat
  No  → Fever?
          Yes → Diagnosis = Cold
          No  → Diagnosis = Allergy
An Example
This approach does not work if there are many
attributes and a large training set. Need an
algorithm to select an attribute that best
discriminates among the target classes as the split
attribute.
The tree continues to grow until it is no longer
possible to find better ways to split the objects.
What is Clustering?
Clustering is the process of grouping data
into classes, or clusters, so that objects
within a cluster have high similarity in
comparison to one another, but are very
dissimilar to objects in other clusters.
Cluster Analysis terminology
1. Unsupervised classification.
2. Aims to decompose or partition a data set into clusters.
3. Inter-cluster (between-groups) distance is maximized and intra-cluster (within-group) distance is minimized.
4. It differs from classification: there are no predefined classes and no training data, and we may not even know how many clusters to expect.
Cluster Analysis features
Not a new field, a branch of statistics
Many old algorithms are available
Renewed interest due to data mining and machine
learning
New algorithms are being developed, in particular to
deal with large amounts of data
Evaluating a clustering method is difficult
Cluster Analysis Desired features
1. Scalability
2. Only one scan of the dataset
3. Ability to stop and resume
4. Minimal input parameters
5. Robustness
6. Ability to discover different cluster shapes
7. Different data types
8. Result independent of data input order
Questions
• How to define a cluster?
• How to find if objects are similar or not?
• How to define some concept of distance between
individual objects and sets of objects?
Types of Data
Numerical data - continuous variables on a roughly
linear scale e.g. marks, age, weight.
Units can affect the analysis, e.g. a distance in metres is obviously a larger number than the same distance in kilometres. We may need to scale or normalise the data.
Binary data – many variables are binary e.g. gender,
married or not, u/g or p/g.
Types of Data
Nominal data - similar to binary but can take more
than two states e.g. colour, staff position.
Ordinal data - similar to nominal but the different
values are ordered in a meaningful sequence.
Ratio-scaled data - nonlinear scale data
Distance
A simple, well-understood concept.
Distance has the following properties (assume x and y to be vectors):
• distance is always positive
• the distance from x to x is zero
• the distance from x to y cannot be greater than the sum of the distances from x to z and from z to y
• the distance from x to y is the same as from y to x
Examples: the sum of absolute differences ∑|x-y| (the Manhattan distance) or the square root of ∑(x-y)² (the Euclidean distance).
Distance
Three common distance measures are;
1. Manhattan distance, or the sum of absolute differences:
   D(x, y) = ∑|x - y|
2. Euclidean distance:
   D(x, y) = ( ∑(x - y)² )^½
3. Maximum difference, or Chebyshev distance:
   D(x, y) = max_i |x_i - y_i|
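The three measures translate directly into code. A minimal Python sketch (my own illustration; the function names are assumptions), shown on the two small vectors (1, 2) and (4, 6):

```python
import math

def manhattan(x, y):
    """Sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    """Square root of the sum of squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def chebyshev(x, y):
    """Maximum absolute difference over all attributes."""
    return max(abs(a - b) for a, b in zip(x, y))

print(manhattan((1, 2), (4, 6)), euclidean((1, 2), (4, 6)), chebyshev((1, 2), (4, 6)))
# 7 5.0 4
```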
Examples
Given two objects represented by the tuples (22, 1, 42, 10)
and (20, 0, 36, 8):
(a) Compute the Euclidean distance between the two objects.
(b) Compute the Manhattan distance between the two objects.
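Working the exercise through (straightforward arithmetic on the two tuples):
Manhattan distance = |22-20| + |1-0| + |42-36| + |10-8| = 2 + 1 + 6 + 2 = 11
Euclidean distance = ( 2² + 1² + 6² + 2² )^½ = √45 ≈ 6.71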
Categorical Data
It is difficult to find a sensible distance measure between, say, India and Australia, or between a male and a female, or between a student doing MCA and one doing BE.
1. One possibility is to convert all categorical data into numeric data (where that makes sense).
2. Another possibility is to treat the distance between two categorical values as a function that only takes the value 1 or 0.
Distance
Using the approach of treating the distance between two categorical values as 1 or 0, we are essentially just comparing how many attribute values are the same.
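A minimal sketch of that idea in Python (my own illustration; the records shown are hypothetical): the distance between two objects is simply the number of attributes on which they disagree, counting 1 per mismatch and 0 per match.

```python
def matching_distance(x, y):
    """Count the attributes on which two objects differ (1 per mismatch, 0 per match)."""
    return sum(1 for a, b in zip(x, y) if a != b)

# Two hypothetical records: (country, gender, programme)
print(matching_distance(("India", "M", "MCA"), ("Australia", "M", "BE")))  # 2
```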
Distance Measure
Need to carefully look at scales of different attributes’
values.
For example, the GPA varies from 0 to 4 while the age
attribute varies from say 20 to 50.
The total distance, the sum of distance between all
attribute values, is then dominated by the age attribute
and will be affected little by the variations in GPA.
To avoid such problems the data should be scaled
or normalized
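One common way to do this (a standard choice, not prescribed by the slide) is min-max normalisation, which maps each attribute onto the range 0 to 1:
x' = (x - x_min) / (x_max - x_min)
With this scaling, a GPA of 2.0 on a 0-4 scale and an age of 35 in a 20-50 range both become 0.5, so neither attribute dominates the total distance.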
Clustering
It is not always easy to predict how many clusters to
expect from a set of data.
One may choose a number of clusters based on some
knowledge of data. The result may be an acceptable set
of clusters.
This does not however mean that a different number
of clusters will not provide an even better set of
clusters.
Choosing the number of clusters is a difficult
problem.
Types of clustering Methods
1. Partitioning methods – given n objects, make k (k ≤ n) partitions (clusters) of the data and use an iterative relocation method. It is assumed that each cluster has at least one object and each object belongs to only one cluster.
2. Hierarchical methods – start with one cluster and then split it into smaller clusters, or start with each object in an individual cluster and then try to merge similar clusters.
Types of Methods
3. Density-based methods – for each data point in a cluster, at least a minimum number of points must exist within a given radius.
4. Grid-based methods – the object space is divided into a grid.
5. Model-based methods – a model is assumed, perhaps based on a probability distribution.
K-Means clustering, described next, is the most widely used partitioning method.
The K-Means Method
K-Means is the most commonly used clustering method.
1. Select the number of clusters; let this number be k.
2. Pick k seeds as the centroids of the k clusters. The seeds may be picked randomly.
3. Compute the Euclidean distance of each object in the dataset from each of the centroids.
4. Allocate each object to the cluster it is nearest to, based on the distances computed in the previous step.
The K-Means Method
5. Compute the centroids of the clusters by computing the means of the attribute values of the objects in each cluster.
6. Check whether the stopping criterion has been met (e.g. cluster membership is unchanged). If yes, go to step 7; if not, go to step 3.
7. (Optional) One may decide to stop at this stage or to split a cluster or combine two clusters until a stopping criterion is met.
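A compact Python sketch of steps 1-6 (my own illustration; the function name and parameters are assumptions). It is run on the ten student records from the example that follows, using the first three records as seeds, as the worked example does:

```python
import math
import random

def kmeans(points, k, seeds=None, max_iter=100):
    """Minimal K-Means sketch: returns (centroids, cluster index for each point)."""
    centroids = list(seeds) if seeds else random.sample(points, k)     # step 2
    assignment = [None] * len(points)
    for _ in range(max_iter):
        # Steps 3-4: allocate each object to the nearest centroid (Euclidean distance).
        new_assignment = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                          for p in points]
        if new_assignment == assignment:      # step 6: membership unchanged -> stop
            break
        assignment = new_assignment
        # Step 5: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# The ten student records (Age, Marks1, Marks2, Marks3) from the table on the next slides,
# with the first three records used as seeds, as in the worked example.
students = [(18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52), (20, 55, 55, 55),
            (22, 85, 86, 87), (19, 91, 90, 89), (20, 70, 65, 65), (21, 53, 56, 59),
            (19, 82, 82, 60), (40, 76, 60, 78)]
centroids, assignment = kmeans(students, k=3, seeds=students[:3])
print(assignment)
```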
The K-Means Method
Essentially the algorithm tries to build clusters with a high level of similarity within clusters and a low level of similarity between clusters.
The method requires the means of the variables to be computed.
The method is scalable and efficient. The algorithm does not find a global minimum; rather, it terminates at a local minimum.
Example
Student   Age   Marks1   Marks2   Marks3
S1        18    73       75       57
S2        18    79       85       75
S3        23    70       70       52
S4        20    55       55       55
S5        22    85       86       87
S6        19    91       90       89
S7        20    70       65       65
S8        21    53       56       59
S9        19    82       82       60
S10       40    76       60       78
Example
1. Let the three seeds be the first three records:
Student   Age   Mark1   Mark2   Mark3
S1        18    73      75      57
S2        18    79      85      75
S3        23    70      70      52
2. Now we compute the distances between the objects
based on the four attributes.
3. K-Means requires Euclidean distances, but we use the sum of absolute differences as well, to show how the distance metric can change the results.
Distances
Student   Age   Marks1   Marks2   Marks3   Manhattan distance   Cluster   Euclidean distance   Cluster
S1        18    73       75       57       Seed 1               C1        Seed 1               C1
S2        18    79       85       75       Seed 2               C2        Seed 2               C2
S3        23    70       70       52       Seed 3               C3        Seed 3               C3
S4        20    55       55       55       42/76/36             C3        27/43/22             C3
S5        22    85       86       87       57/23/67             C2        34/14/41             C2
S6        19    91       90       89       66/32/82             C2        40/19/47             C2
S7        20    70       65       60       23/41/21             C3        13/24/14             C1
S8        21    53       56       59       44/74/40             C3        28/42/23             C3
S9        19    82       82       60       20/22/36             C1        12/16/19             C1
S10       47    75       76       77       60/52/58             C2        33/34/32             C3
Example
Note the different allocations by the two different
distance metrics. K-Means uses the Euclidean distance.
The means of the new clusters are also different as
shown on the next slide.
We now start with the new means and compute
Euclidean distances. The process continues until there
is no change.
Cluster Means
Manhattan:
Cluster   Age    Mark1   Mark2   Mark3
C1        18.5   77.5    78.5    58.5
C2        26.5   82.8    80.3    82
C3        21     62      61.5    57.8

Original seeds (for reference):
Seed 1    18     73      75      57
Seed 2    18     79      85      75
Seed 3    23     70      70      52

Euclidean:
C1        19.0   75.0    74.0    60.7
C2        19.7   85.0    87.0    83.7
C3        26.0   63.5    60.3    60.8
Distances
Student   Age    Marks1   Marks2   Marks3   Distance   Cluster
C1        18.5   77.5     78.5     58.5
C2        24.8   82.8     80.3     82
C3        21     62       61.5     57.8
S1        18     73       75       57       8/52/36    C1
S2        18     79       85       75       30/18/63   C2
S3        23     70       70       52       22/67/28   C1
S4        20     55       55       55       46/91/26   C3
S5        22     85       86       87       51/7/78    C2
S6        19     91       90       89       60/15/93   C2
S7        20     70       65       60       19/56/22   C1
S8        21     53       56       59       44/89/22   C3
S9        19     82       82       60       16/32/48   C1
S10       47     75       76       77       52/63/43   C3
Comments
The example used the Euclidean distance, since that is what K-Means requires.
Using the Manhattan distance instead gives the method called K-Median.
The results of using the Manhattan distance and the Euclidean distance are quite different.
The last iteration changed the cluster membership of S3 from C3 to C1.
Quite different from partitioning.
Hierarchical Clustering
Hierarchical Clustering
This method attempts to capture the structure of the data by constructing a tree of clusters.
Types:
• Agglomerative: involves gradually merging different objects into clusters (bottom-up approach).
• Divisive: involves dividing large clusters into smaller ones (top-down approach).
Distance Computing between 2 Clusters
1. Single link method (nearest neighbour) – the distance between two clusters is the distance between the two closest points, one from each cluster.
2. Complete linkage method (furthest neighbour) – the distance between two clusters is the distance between the two furthest points, one from each cluster.
Distance between Clusters
3. Centroid method – the distance between two clusters is the distance between the two centroids, or the two centres of gravity, of the two clusters.
4. Unweighted pair-group average method – the distance between two clusters is the average distance between all pairs of objects in the two clusters. This means p*n distances need to be computed if p and n are the numbers of objects in the two clusters.
Single Link
Consider two clusters, each with 5 objects. The smallest distance between an object in one cluster and an object in the other needs to be computed.
Complete Link
Consider two clusters, each with 5 objects. The largest distance between an object in one cluster and an object in the other needs to be computed.
Centroid Link
Consider two clusters. The distance between the two centroids needs to be computed.
Average Link
Consider two clusters, each with 4 objects. All 16 pairwise distances need to be computed and their average found.
Agglomerative Clustering
1. Each object is assigned to a cluster of its own.
2. Build a distance matrix by computing distances between every pair of objects.
3. Find the two nearest clusters.
4. Merge them and remove them from the list.
5. Update the matrix by including the distance of all objects from the new cluster.
6. Continue until all objects belong to one cluster.
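A minimal Python sketch of these steps (my own illustration; function names are assumptions), using the Manhattan distance and the centroid method for the distance between clusters, as in the worked example that follows. For simplicity it recomputes centroid distances at every step rather than maintaining and updating a distance matrix, and it prints each merge as it happens:

```python
from itertools import combinations

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def centroid(cluster):
    return tuple(sum(dim) / len(cluster) for dim in zip(*cluster))

def agglomerative(points, target_clusters=1):
    """Centroid-linkage agglomerative clustering with Manhattan distance."""
    clusters = [[p] for p in points]                   # step 1: one object per cluster
    while len(clusters) > target_clusters:             # step 6: until one cluster remains
        # Steps 2-3: find the two nearest clusters (distance between centroids).
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: manhattan(centroid(clusters[ij[0]]),
                                            centroid(clusters[ij[1]])))
        # Steps 4-5: merge the two nearest clusters into one.
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        print("merged ->", merged)
    return clusters

# A handful of (Age, Marks1, Marks2, Marks3) records in the style of the example below.
points = [(18, 73, 75, 57), (18, 79, 85, 75), (23, 70, 70, 52),
          (20, 55, 55, 55), (21, 53, 56, 59)]
agglomerative(points)
```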
Example
We now consider the simple example about students’ marks
and use the distance between two objects as the Manhattan
distance.
We will use the centroid method for distance between
clusters.
We first calculate a matrix of distances.
Example
Student   Age   Marks1   Marks2   Marks3
971234    18    73       75       57
994321    18    79       85       75
965438    23    70       70       52
987654    20    55       55       55
968765    22    85       86       87
978765    19    91       90       89
984567    20    70       65       60
985555    21    53       56       59
998888    19    82       82       60
995544    47    75       76       77
Distance Matrix
Example
The matrix gives the distance of each object from every other object.
The smallest distance of 8 is between object S4 and object S8. They are combined and put where object S4 was.
Compute distances from this cluster and update the distance matrix.
Updated Matrix
Note
The smallest distance now is 15 between the objects S5 and
S6. They are combined in a cluster and S5 and S6 are
removed.
Compute distances from this cluster and update the
distance matrix.
Updated Matrix
Contd….
Look at shortest distance again.
S3 and S7 are at a distance 16 apart. We merge them and
put C3 where S3 was.
The updated distance matrix is given on the next slide. It
shows that C2 and S1 have the smallest distance and are
then merged in the next step.
We stop short of finishing the example.
Updated matrix
Result will be like this
Thank you
Note: you can study search engines from my slides.