Chapter 2
Data Mining
Faculty of Computer Science and Engineering
HCM City University of Technology
October 2010
1
Outline
1. Overview of data mining
2. Association rules
3. Classification
4. Regression
5. Clustering
6. Other data mining problems
7. Applications of data mining
2
DATA MINING
- Data mining refers to the mining or discovery of new information in the form of patterns or rules from vast amounts of data.
- To be practically useful, data mining must be carried out efficiently on large files and databases.
- This chapter briefly reviews the state of the art of this extensive field.
- Data mining uses techniques from areas such as:
  - machine learning,
  - statistics,
  - neural networks,
  - genetic algorithms.
3
1. OVERVIEW OF DATA MINING
Data Mining as a Part of the Knowledge Discovery Process
- Knowledge Discovery in Databases, abbreviated as KDD, encompasses more than data mining.
- The knowledge discovery process comprises six phases: data selection, data cleansing, enrichment, data transformation or encoding, data mining, and the reporting and display of the discovered information.
4
Example
- Consider a transaction database maintained by a specialty consumer goods retailer. Suppose the client data include customer name, zip code, phone number, date of purchase, item code, price, quantity, and total amount.
- A variety of new knowledge can be discovered by KDD processing on this client database.
- During data selection, data about specific items or categories of items, or from stores in a specific region or area of the country, may be selected.
- The data cleansing process then may correct invalid zip codes or eliminate records with incorrect phone prefixes. Enrichment enhances the data with additional sources of information. For example, given the client names and phone numbers, the store may purchase other data about age, income, and credit rating and append them to each record.
- Data transformation and encoding may be done to reduce the amount of data.
5
Example (cont.)
The result of mining may be to discover the following types of “new” information:
- Association rules – e.g., whenever a customer buys video equipment, he or she also buys another electronic gadget.
- Sequential patterns – e.g., suppose a customer buys a camera, and within three months he or she buys photographic supplies, then within six months he or she is likely to buy an accessory item. This defines a sequential pattern of transactions. A customer who buys more than twice in regular periods may be likely to buy at least once during the Christmas period.
- Classification trees – e.g., customers may be classified by frequency of visits, by types of financing used, by amount of purchase, or by affinity for types of items, and some revealing statistics may be generated for such classes.
6
- We can see that many possibilities exist for discovering new knowledge about buying patterns, relating factors such as age, income group, and place of residence to what and how much the customers purchase.
- This information can then be utilized:
  - to plan additional store locations based on demographics,
  - to run store promotions,
  - to combine items in advertisements, or to plan seasonal marketing strategies.
- As this retail store example shows, data mining must be preceded by significant data preparation before it can yield useful information that can directly influence business decisions.
- The results of data mining may be reported in a variety of formats, such as listings, graphic outputs, summary tables, or visualizations.
7
Goals of Data Mining and Knowledge Discovery
- Data mining is carried out with some end goals. These goals fall into the following classes:
  - Prediction – Data mining can show how certain attributes within the data will behave in the future.
  - Identification – Data patterns can be used to identify the existence of an item, an event, or an activity.
  - Classification – Data mining can partition the data so that different classes or categories can be identified based on combinations of parameters.
  - Optimization – One eventual goal of data mining may be to optimize the use of limited resources such as time, space, money, or materials and to maximize output variables such as sales or profits under a given set of constraints.
8
Data Mining: On What Kind of Data?
- Relational databases
- Data warehouses
- Transactional databases
- Advanced DB and information repositories:
  - Object-oriented and object-relational databases
  - Spatial databases
  - Time-series data and temporal data
  - Text databases and multimedia databases
  - Heterogeneous and legacy databases
  - World Wide Web
9
Types of Knowledge Discovered During Data Mining
- Data mining addresses inductive knowledge, which discovers new rules and patterns from the supplied data.
- Knowledge can be represented in many forms: in an unstructured sense, it can be represented by rules; in a structured form, it may be represented in decision trees, semantic networks, or hierarchies of classes or frames.
- It is common to describe the knowledge discovered during data mining in five ways:
  - Association rules – These rules correlate the presence of a set of items with another range of values for another set of variables.
10
Types of Knowledge Discovered (cont.)
- Classification hierarchies – The goal is to work from an existing set of events or transactions to create a hierarchy of classes.
- Patterns within time series.
- Sequential patterns – A sequence of actions or events is sought. Detection of sequential patterns is equivalent to detecting associations among events with certain temporal relationships.
- Clustering – A given population of events can be partitioned into sets of “similar” elements.
11
Main function phases of the KD process
- Learning the application domain: relevant prior knowledge and goals of the application
- Creating a target data set: data selection
- Data cleaning and preprocessing (may take 60% of the effort!)
- Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
- Choosing the functions of data mining: summarization, classification, regression, association, clustering
- Choosing the mining algorithm(s)
- Data mining: search for patterns of interest
- Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
- Use of discovered knowledge
12
Main phases of data mining
[Figure: Data Sources → Data Integration → Data Cleaning → Data Warehouse → Selection/Transformation → Task-relevant Data → Data Mining → Patterns → Pattern Evaluation/Presentation]
13
2. ASSOCIATION RULES
What Is Association Rule Mining?
- Association rule mining is finding frequent patterns, associations, correlations, or causal structures among sets of items or objects in transaction databases, relational databases, and other information repositories.
- Applications: basket data analysis, cross-marketing, catalog design, clustering, classification, etc.
- Rule form: “Body → Head [support, confidence]”.
14
Association rule mining
- Examples:
  buys(x, “diapers”) → buys(x, “beers”) [0.5%, 60%]
  major(x, “CS”) ∧ takes(x, “DB”) → grade(x, “A”) [1%, 75%]
- Association Rule Mining Problem:
  Given: (1) a database of transactions, and (2) each transaction is a list of items (purchased by a customer in a visit),
  Find: all rules that correlate the presence of one set of items with that of another set of items.
  E.g., 98% of people who purchase tires and auto accessories also get automotive services done.
15
Rule Measures: Support and Confidence
- Let J = {i1, i2, …, im} be a set of items. Let D, the task-relevant data, be a set of database transactions where each transaction T is a set of items such that T ⊆ J. A transaction T is said to contain A if and only if A ⊆ T.
- An association rule is an implication of the form A ⇒ B, where A ⊂ J, B ⊂ J and A ∩ B = ∅.
- The rule A ⇒ B holds in the transaction set D with support s, where s is the percentage of transactions in D that contain A ∪ B (i.e., both A and B). This is taken to be the probability P(A ∪ B).
- The rule A ⇒ B has confidence c in the transaction set D if c is the percentage of transactions in D containing A that also contain B.
16
Support and confidence
That is:
- support, s: the probability that a transaction contains A ∪ B, i.e., s = P(A ∪ B)
- confidence, c: the conditional probability that a transaction containing A also contains B, i.e., c = P(B|A)
- Rules that satisfy both a minimum support threshold (min_sup) and a minimum confidence threshold (min_conf) are called strong.
17
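As an illustration (not part of the slides), here is a minimal Python sketch that computes support and confidence for a candidate rule A ⇒ B over a list of transactions; the helper name rule_measures and the inline data are assumptions:

    def rule_measures(transactions, A, B):
        """Return (support, confidence) of the rule A => B.
        support    = P(A U B): fraction of transactions containing both A and B
        confidence = P(B | A): fraction of transactions containing A that also contain B"""
        A, B = set(A), set(B)
        n = len(transactions)
        n_A = sum(1 for t in transactions if A <= set(t))
        n_AB = sum(1 for t in transactions if (A | B) <= set(t))
        support = n_AB / n
        confidence = n_AB / n_A if n_A else 0.0
        return support, confidence

    # Data of Example 2.1 (a later slide), min_sup = min_conf = 50%
    transactions = [{"A", "B", "C"}, {"A", "C"}, {"A", "D"}, {"B", "E", "F"}]
    print(rule_measures(transactions, {"A"}, {"C"}))   # (0.5, 0.666...) as on the slide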
Frequent itemset
- A set of items is referred to as an itemset. An itemset that contains k items is a k-itemset. The occurrence frequency of an itemset is the number of transactions that contain the itemset.
- An itemset satisfies minimum support if its occurrence frequency is greater than or equal to the product of min_sup and the total number of transactions in D. The number of transactions required for the itemset to satisfy minimum support is referred to as the minimum support count.
- If an itemset satisfies minimum support, then it is a frequent itemset. The set of frequent k-itemsets is commonly denoted by Lk.
18
Example 2.1
Transaction-ID   Items_bought
-----------------------------
2000             A, B, C
1000             A, C
4000             A, D
5000             B, E, F

Let minimum support be 50% and minimum confidence 50%; we have:
A ⇒ C (50%, 66.6%)
C ⇒ A (50%, 100%)
19
Types of Association Rules
Boolean vs. quantitative associations (based on the types of values handled):
  buys(x, “SQLServer”) ∧ buys(x, “DMBook”) → buys(x, “DBMiner”) [0.2%, 60%]
  age(x, “30..39”) ∧ income(x, “42..48K”) → buys(x, “PC”) [1%, 75%]
Single-dimensional vs. multi-dimensional associations:
  A rule that references two or more dimensions, such as the dimensions buys, income and age, is a multi-dimensional association rule.
Single-level vs. multiple-level analysis:
  Some methods for association rule mining can find rules at different levels of abstraction. For example, suppose that a set of association rules mined includes the following rules:
  age(x, “30..39”) → buys(x, “laptop computer”)
  age(x, “30..39”) → buys(x, “computer”)
  in which “computer” is a higher-level abstraction of “laptop computer”.
20
How to mine association rules from large databases?
- Association rule mining is a two-step process:
  1. Find all frequent itemsets (the sets of items that have minimum support).
     - A subset of a frequent itemset must also be a frequent itemset (the Apriori principle), i.e., if {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets.
     - Iteratively find frequent itemsets with cardinality from 1 to k (k-itemsets).
  2. Generate strong association rules from the frequent itemsets.
- The overall performance of mining association rules is determined by the first step.
21
The Apriori Algorithm
- Apriori is an important algorithm for mining frequent itemsets for Boolean association rules.
- The Apriori algorithm employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)-itemsets.
- First, the set of frequent 1-itemsets is found; this set is denoted L1. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database.
- To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used to reduce the search space.
22
Apriori property
- Apriori property: all nonempty subsets of a frequent itemset must also be frequent.
- The Apriori property is based on the following observation. By definition, if an itemset I does not satisfy the minimum support threshold min_sup, then I is not frequent, that is, P(I) < min_sup. If an item A is added to the itemset I, then the resulting itemset I ∪ A cannot occur more frequently than I. Therefore, I ∪ A is not frequent either, i.e., P(I ∪ A) < min_sup.
- This property belongs to a special category of properties called anti-monotone, in the sense that if a set cannot pass a test, all of its supersets will fail the same test as well.
23
Finding Lk using Lk-1
A two-step process is used in finding Lk from Lk-1:
- Join step: Ck is generated by joining Lk-1 with itself.
- Prune step: any (k-1)-itemset that is not frequent cannot be a subset of a frequent k-itemset.
24
Pseudo code
Ck: candidate itemsets of size k
Lk: frequent itemsets of size k

L1 = {frequent items};
for (k = 1; Lk ≠ ∅; k++) do begin
    Ck+1 = apriori_gen(Lk, min_sup);
    for each transaction t in database do            // scan D for counts
        increment the count of all candidates in Ck+1 that are contained in t;
    Lk+1 = candidates in Ck+1 with min_support;
end
return ∪k Lk;
25
procedure apriori_gen(Lk: frequent k-itemsets; min_sup: minimum support threshold)
  for each itemset l1 ∈ Lk
    for each itemset l2 ∈ Lk
      if (l1[1] = l2[1]) ∧ (l1[2] = l2[2]) ∧ … ∧ (l1[k-1] = l2[k-1]) ∧ (l1[k] < l2[k]) then {
        c = l1 ⋈ l2;                       // join step: form a candidate (k+1)-itemset
        if some k-subset s of c ∉ Lk then
          delete c;                         // prune step: remove unfruitful candidate
        else add c to Ck+1;
      }
  return Ck+1;
end procedure
26
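For concreteness, a compact Python sketch of the level-wise Apriori search described above (candidate generation by self-join plus the prune step, then one database scan per level); the function name, the use of frozensets, and the printed summary are assumptions, not part of the slides:

    from itertools import combinations

    def apriori(transactions, min_sup_count):
        """Return a dict mapping each frequent itemset (frozenset) to its support count."""
        transactions = [frozenset(t) for t in transactions]
        counts = {}                                       # L1: count every single item
        for t in transactions:
            for item in t:
                counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
        Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
        frequent, k = dict(Lk), 1
        while Lk:
            items, candidates = list(Lk), set()
            # join step: merge frequent k-itemsets whose union has k+1 items
            for i in range(len(items)):
                for j in range(i + 1, len(items)):
                    union = items[i] | items[j]
                    if len(union) == k + 1:
                        # prune step: every k-subset of the candidate must be frequent
                        if all(frozenset(sub) in Lk for sub in combinations(union, k)):
                            candidates.add(union)
            # scan the database once to count the surviving candidates
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            Lk = {s: c for s, c in counts.items() if c >= min_sup_count}
            frequent.update(Lk)
            k += 1
        return frequent

    # Example 2.2 data (next slide), minimum support count = 2
    D = [["I1","I2","I5"], ["I2","I4"], ["I2","I3"], ["I1","I2","I4"], ["I1","I3"],
         ["I2","I3"], ["I1","I3"], ["I1","I2","I3","I5"], ["I1","I2","I3"]]
    print(sorted((sorted(s), c) for s, c in apriori(D, 2).items()))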
Example 2.2:
TID    List of item_IDs
------------------------
T100   I1, I2, I5
T200   I2, I4
T300   I2, I3
T400   I1, I2, I4
T500   I1, I3
T600   I2, I3
T700   I1, I3
T800   I1, I2, I3, I5
T900   I1, I2, I3

Assume that the minimum transaction support count required is 2 (i.e., min_sup = 2/9 ≈ 22%).
27
The candidate and frequent itemsets generated on this database (each entry is itemset: support count):

C1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2
L1: {I1}: 6, {I2}: 7, {I3}: 6, {I4}: 2, {I5}: 2

C2: {I1,I2}: 4, {I1,I3}: 4, {I1,I4}: 1, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2, {I3,I4}: 0, {I3,I5}: 1, {I4,I5}: 0
L2: {I1,I2}: 4, {I1,I3}: 4, {I1,I5}: 2, {I2,I3}: 4, {I2,I4}: 2, {I2,I5}: 2
28
C3: {I1,I2,I3}: 2, {I1,I2,I5}: 2; the other join candidates {I1,I3,I5}, {I2,I3,I4}, {I2,I3,I5} and {I2,I4,I5} are eliminated (marked X) because they contain 2-subsets that are not in L2.
L3: {I1,I2,I3}: 2, {I1,I2,I5}: 2

C4 = {{I1, I2, I3, I5}}
L4 = ∅
29
Generating Association Rules from Frequent Itemsets
- Once the frequent itemsets from transactions in a database D have been found, it is straightforward to generate strong association rules from them.
- This can be done using the following equation for confidence, where the conditional probability is expressed in terms of itemset support counts:

  confidence(A ⇒ B) = P(B|A) = support_count(A ∪ B) / support_count(A)

  where support_count(X) is the number of transactions containing the itemset X.
30

Based on this equation, association rules can be generated as follows:
- For each frequent itemset l, generate all nonempty subsets of l.
- For every nonempty subset s of l, output the rule “s ⇒ (l − s)” if support_count(l)/support_count(s) ≥ min_conf, where min_conf is the minimum confidence threshold.
- Since the rules are generated from frequent itemsets, each one automatically satisfies minimum support.
31
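A minimal Python sketch of this rule-generation step, reusing the support counts returned by the apriori sketch above (the helper name generate_rules is an assumption):

    from itertools import combinations

    def generate_rules(frequent, min_conf):
        """frequent: dict {frozenset itemset: support count}. Yields (antecedent, consequent, confidence)."""
        for itemset, count in frequent.items():
            if len(itemset) < 2:
                continue
            # every nonempty proper subset s of the itemset can be an antecedent
            for r in range(1, len(itemset)):
                for s in combinations(itemset, r):
                    s = frozenset(s)
                    conf = count / frequent[s]           # support_count(l) / support_count(s)
                    if conf >= min_conf:
                        yield s, itemset - s, conf

    # Example 2.3 setting: rules with min_conf = 70%
    freq = apriori(D, 2)                                  # from the previous sketch
    for a, b, c in generate_rules(freq, 0.7):
        print(set(a), "=>", set(b), f"confidence = {c:.0%}")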
Example 2.3. From Example 2.2, suppose the data contain the frequent itemset l = {I1, I2, I5}. The nonempty subsets of l are {I1, I2}, {I1, I5}, {I2, I5}, {I1}, {I2} and {I5}. The resulting association rules are shown below:
I1 ∧ I2 ⇒ I5     confidence = 2/4 = 50%
I1 ∧ I5 ⇒ I2     confidence = 2/2 = 100%
I2 ∧ I5 ⇒ I1     confidence = 2/2 = 100%
I1 ⇒ I2 ∧ I5     confidence = 2/6 = 33%
I2 ⇒ I1 ∧ I5     confidence = 2/7 = 29%
I5 ⇒ I1 ∧ I2     confidence = 2/2 = 100%
If the minimum confidence threshold is, say, 70%, then only the second, third and last rules above are output.
32
Properties of the Apriori algorithm
- It generates a huge number of candidate itemsets:
  - 10^4 frequent 1-itemsets lead to more than 10^7 (≈ 10^4 (10^4 − 1)/2) candidate 2-itemsets.
  - Each frequent k-itemset requires at least 2^k − 1 candidate itemsets.
- It examines the dataset several times:
  - High cost when the sizes of the itemsets increase.
  - If k-itemsets are identified, then the algorithm examines the dataset k+1 times.
33
Improving the efficiency of Apriori
- Hash-based technique: hashing itemsets into corresponding buckets.
- Transaction reduction: reducing the number of transactions scanned in future iterations.
- Partitioning: partitioning the data to find candidate itemsets.
- Sampling: mining on a subset of the given data.
- Dynamic itemset counting: adding candidate itemsets at different points during a scan.
34
3. CLASSIFICATION
- Classification is the process of learning a model that describes different classes of data. The classes are predetermined.
- Example: in a banking application, customers who apply for a credit card may be classified as a “good risk”, a “fair risk” or a “poor risk”. Hence, this type of activity is also called supervised learning.
- Once the model is built, it can be used to classify new data.
35
- The first step, learning the model, is accomplished by using a training set of data that has already been classified. Each record in the training data contains an attribute, called the class label, that indicates which class the record belongs to.
- The model that is produced is usually in the form of a decision tree or a set of rules.
- Some of the important issues with regard to the model and the algorithm that produces the model include:
  - the model’s ability to predict the correct class of new data,
  - the computational cost associated with the algorithm,
  - the scalability of the algorithm.
- Let us examine the approach where the model is in the form of a decision tree.
- A decision tree is simply a graphical representation of the description of each class, or in other words, a representation of the classification rules.
36
Example 3.1
- Example 3.1: Suppose that we have a database of customers on the AllElectronics mailing list. The database describes attributes of the customers, such as their name, age, income, occupation, and credit rating. The customers can be classified as to whether or not they have purchased a computer at AllElectronics.
- Suppose that new customers are added to the database and that you would like to notify these customers of an upcoming computer sale. Sending out promotional literature to every new customer in the database can be quite costly. A more cost-efficient method would be to target only those new customers who are likely to purchase a new computer. A classification model can be constructed and used for this purpose.
- Figure 2 shows a decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer.
37
[Figure 2: A decision tree for the concept buys_computer, indicating whether or not a customer at AllElectronics is likely to purchase a computer. Each internal node represents a test on an attribute; each leaf node represents a class.]
38
Algorithm for decision tree induction
Input: a set of training data records R1, R2, …, Rm and a set of attributes A1, A2, …, An
Output: a decision tree
Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all the training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).
39
Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf).
- There are no samples left.
40
Procedure Build_tree(Records, Attributes);
Begin
  Create a node N;
  If all Records belong to the same class C then
    Return N as a leaf node with the class label C;
  If Attributes is empty then
    Return N as a leaf node with the class label C such that the majority of Records belong to it;
  Select the attribute Ai with the highest information gain from Attributes;
  Label node N with Ai;
  For each known value aj of Ai do
  begin
    Add a branch from node N for the condition Ai = aj;
    Sj = subset of Records where Ai = aj;
    If Sj is empty then
      Add a leaf L with the class label C such that the majority of Records belong to it, and return L
    else
      Add the node returned by Build_tree(Sj, Attributes − {Ai});
  end
End
41
Attribute Selection Measure
- The expected information needed to classify a training set of s samples, where the class attribute has m values (a1, …, am) and si is the number of samples belonging to class label ai, is given by:

  I(s1, s2, …, sm) = − Σ_{i=1}^{m} pi log2(pi)      (1)

  where pi is the probability that a random sample belongs to the class with label ai. An estimate of pi is si/s.
- Consider an attribute A with values {a1, …, av} used as the test attribute for splitting in the decision tree. Attribute A partitions the samples into the subsets S1, …, Sv, where Sj contains those samples that have value aj for A. Each Sj may contain samples that belong to any of the classes. The number of samples of class i in subset Sj is denoted sij. The entropy, or expected information based on the partitioning by A, is given by:

  E(A) = Σ_{j=1}^{v} ((s1j + … + smj)/s) · I(s1j, …, smj)      (2)
42
I(s1j, …, smj) is defined using the formulation for I(s1, …, sm), with pi replaced by pij = sij/|Sj|. The information gain obtained by partitioning on attribute A is then defined as:

  Gain(A) = I(s1, s2, …, sm) − E(A)

- Example 3.1 (cont.): Table 1 presents a training set of data tuples taken from the AllElectronics customer database. The class label attribute, buys_computer, has two distinct values; therefore there are two distinct classes (m = 2). Let class C1 correspond to yes and class C2 to no. There are 9 samples of class yes and 5 samples of class no.
- To compute the information gain of each attribute, we first use Equation (1) to compute the expected information needed to classify a given sample:

  I(s1, s2) = I(9, 5) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.940
43
Table 1. Training data tuples from the AllElectronics customer database

age      income   student  credit_rating  Class (buys_computer)
<=30     high     no       fair           no
<=30     high     no       excellent      no
31…40    high     no       fair           yes
>40      medium   no       fair           yes
>40      low      yes      fair           yes
>40      low      yes      excellent      no
31…40    low      yes      excellent      yes
<=30     medium   no       fair           no
<=30     low      yes      fair           yes
>40      medium   yes      fair           yes
<=30     medium   yes      excellent      yes
31…40    medium   no       excellent      yes
31…40    high     yes      fair           yes
>40      medium   no       excellent      no
44
Next, we need to compute the entropy of each attribute. Let us start with the attribute age. We look at the distribution of yes and no samples for each value of age and compute the expected information for each of these distributions.

For age = “<=30”:  s11 = 2, s21 = 3,  I(s11, s21) = −(2/5)log2(2/5) − (3/5)log2(3/5) = 0.971
For age = “31…40”: s12 = 4, s22 = 0,  I(s12, s22) = −(4/4)log2(4/4) − (0/4)log2(0/4) = 0
For age = “>40”:   s13 = 3, s23 = 2,  I(s13, s23) = −(3/5)log2(3/5) − (2/5)log2(2/5) = 0.971

Using Equation (2), the expected information needed to classify a given sample if the samples are partitioned according to age is:

E(age) = (5/14)·I(s11, s21) + (4/14)·I(s12, s22) + (5/14)·I(s13, s23) = (10/14) × 0.971 = 0.694
45

Hence, the gain in information from such a partitioning would be:

Gain(age) = I(s1, s2) − E(age) = 0.940 − 0.694 = 0.246

Similarly, we can compute Gain(income) = 0.029, Gain(student) = 0.151, and Gain(credit_rating) = 0.048. Since age has the highest information gain among the attributes, it is selected as the test attribute. A node is created and labeled with age, and branches are grown for each of the attribute’s values. The samples are then partitioned accordingly, as shown in Figure 3.
46
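A small Python sketch (not from the slides) that reproduces this information-gain computation on the Table 1 data; the record layout as a list of dicts and the function names are assumptions:

    from math import log2

    def info(counts):
        """I(s1, ..., sm) = -sum pi log2 pi for a list of class counts."""
        total = sum(counts)
        return -sum((c / total) * log2(c / total) for c in counts if c)

    def gain(records, attribute, class_attr="Class"):
        """Information gain of 'attribute' = I(class counts) - E(attribute)."""
        classes = sorted({r[class_attr] for r in records})
        dist = lambda rows: [sum(1 for r in rows if r[class_attr] == c) for c in classes]
        e = 0.0
        for value in {r[attribute] for r in records}:
            subset = [r for r in records if r[attribute] == value]
            e += len(subset) / len(records) * info(dist(subset))
        return info(dist(records)) - e

    rows = [("<=30","high","no","fair","no"), ("<=30","high","no","excellent","no"),
            ("31..40","high","no","fair","yes"), (">40","medium","no","fair","yes"),
            (">40","low","yes","fair","yes"), (">40","low","yes","excellent","no"),
            ("31..40","low","yes","excellent","yes"), ("<=30","medium","no","fair","no"),
            ("<=30","low","yes","fair","yes"), (">40","medium","yes","fair","yes"),
            ("<=30","medium","yes","excellent","yes"), ("31..40","medium","no","excellent","yes"),
            ("31..40","high","yes","fair","yes"), (">40","medium","no","excellent","no")]
    table1 = [dict(zip(("age","income","student","credit_rating","Class"), r)) for r in rows]
    for a in ("age", "income", "student", "credit_rating"):
        print(a, round(gain(table1, a), 3))   # age 0.246, income 0.029, student 0.151, credit_rating 0.048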
[Figure 3: The attribute age is placed at the root with branches for “<=30”, “31…40” and “>40”; each branch holds the corresponding partition of the Table 1 training samples (income, student, credit_rating, class).]
47
Extracting Classification Rules from Trees
Represent the knowledge in the form of IF-THEN rules:
- One rule is created for each path from the root to a leaf.
- Each attribute-value pair along a path forms a conjunction.
- The leaf node holds the class prediction.
- Rules are easier for humans to understand.

Example:
IF age = “<=30” AND student = “no”  THEN buys_computer = “no”
IF age = “<=30” AND student = “yes” THEN buys_computer = “yes”
IF age = “31…40”                    THEN buys_computer = “yes”
IF age = “>40” AND credit_rating = “excellent” THEN buys_computer = “no”
IF age = “>40” AND credit_rating = “fair”      THEN buys_computer = “yes”
48
Neural Networks and Classification
- A neural network is a technique derived from AI that uses generalized approximation and provides an iterative method to carry it out. ANNs use a curve-fitting approach to infer a function from a set of samples.
- This technique provides a “learning approach”; it is driven by a test sample that is used for the initial inference and learning. With this kind of learning method, responses to new inputs may be interpolated from the known samples. This interpolation depends on the model developed by the learning method.
49
ANN and classification
- ANNs can be classified into two categories: supervised and unsupervised networks. Adaptive methods that attempt to reduce the output error are supervised learning methods, whereas those that develop internal representations without sample outputs are called unsupervised learning methods.
- ANNs can learn from information on a specific problem. They perform well on classification tasks and are therefore useful in data mining.
50
Information processing at a neuron in an
ANN
51
A multilayer Feed-Forward Neural Network
52
Backpropagation algorithm
53
54
Classification with ANN

Input vector xi (input nodes), hidden nodes and output nodes with weights wij:
  Input to unit j:          Ij = Σi wij Oi + θj
  Output of unit j:         Oj = 1 / (1 + e^(−Ij))
  Error at an output node:  Errj = Oj (1 − Oj)(Tj − Oj)
  Error at a hidden node:   Errj = Oj (1 − Oj) Σk Errk wjk
  Weight update:            wij = wij + (l) Errj Oi
  Bias update:              θj = θj + (l) Errj
(l is the learning rate; the output vector is compared with the target vector Tj.)
55
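As a hedged illustration of these update rules (not the slides' own code), here is a single backpropagation step for one training sample on a small feed-forward network with one hidden layer, written with NumPy; the layer sizes and the learning rate l = 0.1 are assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    n_in, n_hid, n_out, l = 3, 4, 1, 0.1                         # assumed sizes and learning rate
    W1, b1 = rng.normal(size=(n_in, n_hid)), np.zeros(n_hid)     # input -> hidden weights, biases
    W2, b2 = rng.normal(size=(n_hid, n_out)), np.zeros(n_out)    # hidden -> output weights, biases

    sigmoid = lambda I: 1.0 / (1.0 + np.exp(-I))                 # Oj = 1 / (1 + e^-Ij)

    def backprop_step(x, t):
        global W1, b1, W2, b2
        # forward pass: Ij = sum_i wij * Oi + theta_j
        O_hid = sigmoid(x @ W1 + b1)
        O_out = sigmoid(O_hid @ W2 + b2)
        # error terms
        err_out = O_out * (1 - O_out) * (t - O_out)              # Errj = Oj(1-Oj)(Tj-Oj)
        err_hid = O_hid * (1 - O_hid) * (W2 @ err_out)           # Errj = Oj(1-Oj) * sum_k Errk wjk
        # updates: wij += l * Errj * Oi ; theta_j += l * Errj
        W2 += l * np.outer(O_hid, err_out); b2 += l * err_out
        W1 += l * np.outer(x, err_hid);     b1 += l * err_hid
        return O_out

    print(backprop_step(np.array([1.0, 0.0, 1.0]), np.array([1.0])))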
Example:
56
Example:
57
58
Other Classification Methods
- k-nearest neighbor classifier
- Case-based reasoning
- Genetic algorithms
- Rough set approach
- Fuzzy set approaches
59
The k-Nearest Neighbor Algorithm
- All instances (samples) correspond to points in an n-dimensional space.
- The nearest neighbors are defined in terms of Euclidean distance. The Euclidean distance between two points X = (x1, x2, …, xn) and Y = (y1, y2, …, yn) is:

  d(X, Y) = √( Σ_{i=1}^{n} (xi − yi)^2 )

- When given an unknown sample xq, the k-nearest neighbor classifier searches the space for the k training samples that are closest to xq. The unknown sample is assigned the most common class among its k nearest neighbors; the algorithm takes a vote among these neighbors to determine that class. When k = 1, the unknown sample is assigned the class of the training sample that is closest to it in the space.
- Once we have obtained xq’s k nearest neighbors using the distance function, it is time for the neighbors to vote in order to determine xq’s class.
60
Two approaches are common:
- Majority voting: all votes are equal. For each class, we count how many of the k neighbors have that class; we return the class with the most votes.
- Inverse distance-weighted voting: closer neighbors get higher votes. While there are better-motivated methods, the simplest version is to take a neighbor’s vote to be the inverse of its squared distance to xq:

  w = 1 / d(xq, xi)^2

  Then we sum the votes per class and return the class with the highest vote.
61
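A minimal Python sketch of the k-NN classifier with both voting schemes described above; the data format (a list of (point, label) pairs) and the function names are assumptions:

    from math import sqrt
    from collections import defaultdict

    def euclidean(x, y):
        return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def knn_classify(training, xq, k=3, weighted=False):
        """training: list of (point, label). Returns the predicted label for xq."""
        neighbors = sorted(training, key=lambda p: euclidean(p[0], xq))[:k]
        votes = defaultdict(float)
        for point, label in neighbors:
            d = euclidean(point, xq)
            # inverse squared-distance weight, or an equal vote for majority voting
            votes[label] += 1.0 / (d * d) if weighted and d > 0 else 1.0
        return max(votes, key=votes.get)

    data = [((30, 5), "A"), ((50, 25), "B"), ((50, 15), "B"), ((25, 5), "A")]
    print(knn_classify(data, (28, 6), k=3, weighted=True))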
Genetic Algorithms
- GA: based on an analogy to biological evolution.
- Each rule is represented by a string of bits.
  Example: the rule “IF A1 AND NOT A2 THEN C2” can be encoded as the bit string “100”, where the two leftmost bits represent attributes A1 and A2, respectively, and the rightmost bit represents the class. Similarly, the rule “IF NOT A1 AND NOT A2 THEN C1” can be encoded as “001”.
- An initial population is created consisting of randomly generated rules.
- Based on the notion of survival of the fittest, a new population is formed to consist of the fittest rules and their offspring.
- The fitness of a rule is represented by its classification accuracy on a set of training examples.
- Offspring are generated by crossover and mutation.
62
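As an illustration of the crossover and mutation operators on such bit-string rules (a sketch under assumed parameters, not the slides' algorithm; the mutation rate 0.1 is an assumption):

    import random

    random.seed(1)

    def crossover(parent1, parent2):
        """Single-point crossover of two equal-length bit strings."""
        point = random.randrange(1, len(parent1))
        return parent1[:point] + parent2[point:], parent2[:point] + parent1[point:]

    def mutate(rule, rate=0.1):
        """Flip each bit independently with the given (assumed) mutation rate."""
        return "".join(b if random.random() > rate else ("1" if b == "0" else "0") for b in rule)

    # Two encoded rules from the slide: "100" (IF A1 AND NOT A2 THEN C2) and "001"
    child1, child2 = crossover("100", "001")
    print(child1, child2, mutate(child1))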
4. REGRESSION
- Predictive data mining
- Descriptive data mining
- Definition (J. Han et al., 2001 & 2006): Regression is a method used to predict continuous values for a given input.
63
Regression
- Regression analysis can be used to model the relationship between one or more independent (predictor) variables and one or more dependent (response) variables.
- Categories:
  - Linear regression and nonlinear regression
  - Univariate and multivariate regression
  - Parametric, nonparametric and semi-parametric regression
64
Regression function
- Regression function: Y = f(X, β)
  - X: predictor/independent variables
  - Y: response/dependent variables
  - β: regression coefficients
- X is used to explain the changes of the response variable Y.
- Y is used to describe the target phenomenon.
- The relationship between Y and X can be represented by the functional dependence of Y on X.
- β describes the influence of X on Y.
65
Regression with a single predictor variable
Given N observed objects, the model is described in the following form, with εi representing the part of the response Y that cannot be explained from X:
- Line form:     yi = β0 + β1 xi + εi,  i = 1..N
- Parabola form: yi = β0 + β1 xi + β2 xi^2 + εi,  i = 1..N
66
Linear regression with a single predictor variable
- Estimate the parameter set β = (β0, β1) in order to obtain the linear regression model ŷ = β0 + β1 x.
- Each residual is the difference yi − ŷi between an observed and a fitted value; the estimated β is the one that minimizes the sum of squared residuals Σi (yi − β0 − β1 xi)^2.
67
Linear multiple regression
- This linear regression model analyses the relationship between a response/dependent variable and two or more independent variables:

  yi = b0 + b1 xi1 + b2 xi2 + … + bk xik,   i = 1..n

  where
  n  = the number of observed objects
  k  = the number of predictor variables (the number of attributes)
  Y  = the dependent variable
  X  = the independent variables
  b0 = Y’s value when all X = 0
  b1..bk = regression coefficients
68
Linear multiple regression
Estimated value of Y:

  ŷ = b0 + b1 x1 + b2 x2 + … + bk xk

Estimated values of b (in matrix form):

  b = (X^T X)^(−1) X^T Y

where Y = [Y1, Y2, …, Yn]^T, X is the n × (k+1) matrix whose i-th row is [1, xi,1, xi,2, …, xi,k], and b = [b0, b1, …, bk]^T.
69
Linear multiple regression
- Example: a sales manager of Tackey Toys needs to predict sales of Tackey products in selected market areas. He believes that advertising expenditures and the population in each market area can be used to predict sales. He gathered a sample of toy sales, advertising expenditures and population figures as below. Find the linear multiple regression equation that best fits the data.
70
Linear multiple regression
Market   Advertising Expenditures       Population        Toy Sales
Area     (Thousands of Dollars) x1      (Thousands) x2    (Thousands of Dollars) y
A        1.0                            200               100
B        5.0                            700               300
C        8.0                            800               400
D        6.0                            400               200
E        3.0                            100               100
F        10.0                           600               400
71
Linear multiple regression
Software: SPSS, SAS, R

  ŷ = 6.3972 + 20.4921 x1 + 0.2805 x2
72
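A short NumPy sketch (an illustration, not the slides' SPSS/SAS/R output) fitting this model by least squares, b = (X^T X)^(−1) X^T Y; it should reproduce the coefficients above up to rounding:

    import numpy as np

    x1 = np.array([1.0, 5.0, 8.0, 6.0, 3.0, 10.0])       # advertising expenditures
    x2 = np.array([200, 700, 800, 400, 100, 600.0])      # population
    y  = np.array([100, 300, 400, 200, 100, 400.0])      # toy sales

    X = np.column_stack([np.ones_like(x1), x1, x2])      # design matrix with intercept column
    b, *_ = np.linalg.lstsq(X, y, rcond=None)            # least-squares estimate of (b0, b1, b2)
    print(b)                                             # approximately [6.3972, 20.4921, 0.2805]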
Nonlinear regression
- Y = f(X, β), where Y is a nonlinear function of the combination of the coefficients β.
  - Examples: exponential function, logarithmic function, Gauss function, …
- Determining the optimal β requires optimization algorithms:
  - Local optimization
  - Global optimization of the sum of squared residuals
73
Applications of regression
- In data mining:
  - Preprocessing stage
  - Data mining stage
    - Descriptive data mining
    - Predictive data mining
- Application areas: biology, agriculture, social issues, economy, business, …
74
Some problems with regression
- Some assumptions go along with regression.
- Danger of extrapolation.
- Evaluation of regression models.
- Other advanced techniques for regression:
  - Artificial Neural Network (ANN)
  - Support Vector Machine (SVM)
75
5. CLUSTERING
What is Cluster Analysis?
- Cluster: a collection of data objects that are
  - similar to one another within the same cluster,
  - dissimilar to the objects in other clusters.
- Cluster analysis: grouping a set of data objects into clusters.
- Clustering is unsupervised learning: no predefined classes, no class-labeled training samples.
- Typical applications:
  - as a stand-alone tool to get insight into the data distribution,
  - as a preprocessing step for other algorithms.
76
General Applications of Clustering
- Pattern recognition
- Spatial data analysis
  - create thematic maps in GIS by clustering feature spaces
  - detect spatial clusters and explain them in spatial data mining
- Image processing
- Economic science (especially market research)
- World Wide Web
  - document classification
  - cluster Weblog data to discover groups of similar access patterns
77
Examples of Clustering Applications
- Marketing: help marketers discover distinct groups in their customer bases, and then use this knowledge to develop targeted marketing programs.
- Land use: identification of areas of similar land use in an earth observation database.
- Insurance: identifying groups of motor insurance policy holders with a high average claim cost.
- City planning: identifying groups of houses according to their house type, value, and geographical location.
- Earthquake studies: observed earthquake epicenters should be clustered along continent faults.
78
Similarity and Dissimilarity Between Objects
- Distances are normally used to measure the similarity or dissimilarity between two data objects.
- A popular one is the Minkowski distance:

  d(i, j) = ( |xi1 − xj1|^q + |xi2 − xj2|^q + … + |xip − xjp|^q )^(1/q)

  where i = (xi1, xi2, …, xip) and j = (xj1, xj2, …, xjp) are two p-dimensional data objects, and q is a positive integer.
- If q = 1, d is the Manhattan distance:

  d(i, j) = |xi1 − xj1| + |xi2 − xj2| + … + |xip − xjp|
79
Euclidean distance
- If q = 2, d is the Euclidean distance:

  d(i, j) = √( |xi1 − xj1|^2 + |xi2 − xj2|^2 + … + |xip − xjp|^2 )

- Properties:
  - d(i, j) ≥ 0
  - d(i, i) = 0
  - d(i, j) = d(j, i)
  - d(i, j) ≤ d(i, k) + d(k, j)
- One can also use a weighted distance or other dissimilarity measures.
80
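A tiny Python sketch of the Minkowski distance family used above (q = 1 gives the Manhattan distance, q = 2 the Euclidean distance); the function name and the sample points are assumptions:

    def minkowski(x, y, q=2):
        """Minkowski distance between two p-dimensional points; q=1 Manhattan, q=2 Euclidean."""
        return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

    i, j = (1.0, 2.0, 3.0), (4.0, 6.0, 3.0)
    print(minkowski(i, j, q=1), minkowski(i, j, q=2))    # 7.0 and 5.0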
Types of data in cluster analysis
- Interval-scaled variables/attributes
- Binary variables/attributes
- Categorical variables/attributes
- Ordinal variables/attributes
- Ratio-scaled variables/attributes
- Variables/attributes of mixed types
81
Type of data
- Interval-scaled variables/attributes
  Mean absolute deviation:

    sf = (1/n)( |x1f − mf| + |x2f − mf| + … + |xnf − mf| )

  Mean:

    mf = (1/n)( x1f + x2f + … + xnf )

  Z-score measurement:

    zif = (xif − mf) / sf
82
Type of data
- Binary variables/attributes
  Contingency table for objects i and j:

                        Object j
                        1        0        sum
      Object i   1      a        b        a + b
                 0      c        d        c + d
      sum               a + c    b + d    p (= a + b + c + d)

  Dissimilarity (if symmetric):   d(i, j) = (b + c) / (a + b + c + d)
  Dissimilarity (if asymmetric):  d(i, j) = (b + c) / (a + b + c)
83
Type of data
- Binary variables/attributes – Example

  Name   Gender   Fever   Cough   Test-1   Test-2   Test-3   Test-4
  Jack   M        Y       N       P        N        N        N
  Mary   F        Y       N       P        N        P        N
  Jim    M        Y       P       N        N        N        N

  - gender: symmetric attribute
  - the other binary attributes: asymmetric
  - let Y, P → 1 and N → 0; using the asymmetric dissimilarity:

    d(Jack, Mary) = (0 + 1) / (2 + 0 + 1) = 0.33
    d(Jack, Jim)  = (1 + 1) / (1 + 1 + 1) = 0.67
    d(Jim, Mary)  = (1 + 2) / (1 + 1 + 2) = 0.75
84
Type of data
- Variables/attributes of mixed types
  - General case:

      d(i, j) = ( Σ_{f=1}^{p} δij(f) dij(f) ) / ( Σ_{f=1}^{p} δij(f) )

    where the indicator δij(f) = 0 if xif or xjf is missing, and 1 otherwise.
  - f binary or categorical: dij(f) = 0 if xif = xjf, and dij(f) = 1 otherwise.
  - f interval-scaled: dij(f) = |xif − xjf| / (maxh xhf − minh xhf)
  - f ordinal:
    - compute the ranks rif and zif = (rif − 1) / (Mf − 1),
    - then treat zif as interval-scaled.
85
Partitioning Algorithms: Basic Concept
- Partitioning method: construct a partition of a database D of n objects into a set of k clusters.
- Given a k, find a partition of k clusters that optimizes the chosen partitioning criterion.
  - Global optimum: exhaustively enumerate all partitions.
  - Heuristic methods: the k-means and k-medoids algorithms.
- k-means (MacQueen’67): each cluster is represented by the center of the cluster.
- k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw’87): each cluster is represented by one of the objects in the cluster.
86
The K-Means Clustering Method
- Input: a database D of m records r1, r2, …, rm and a desired number of clusters k.
- Output: a set of k clusters that minimizes the square-error criterion.
- Given k, the k-means algorithm is implemented in 4 steps:
  - Step 1: Randomly choose k records as the initial cluster centers.
  - Step 2: Assign each record ri to the cluster such that the distance between ri and the cluster centroid (mean) is the smallest among the k clusters.
  - Step 3: Recalculate the centroid (mean) of each cluster based on the records assigned to the cluster.
  - Step 4: Go back to Step 2; stop when no more new assignments are made.
87
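A compact Python sketch of these four steps (an illustration; the function name, the use of Euclidean distance, and the iteration cap are assumptions):

    import math, random

    def kmeans(records, k, max_iter=100):
        """records: list of numeric tuples. Returns (centroids, assignment list)."""
        centroids = random.sample(records, k)                        # Step 1: random initial centers
        assignment = [None] * len(records)
        for _ in range(max_iter):
            # Step 2: assign each record to the closest centroid
            new_assignment = [min(range(k), key=lambda c: math.dist(r, centroids[c]))
                              for r in records]
            if new_assignment == assignment:                         # Step 4: stop when nothing changes
                break
            assignment = new_assignment
            # Step 3: recompute each centroid as the mean of its assigned records
            for c in range(k):
                members = [r for r, a in zip(records, assignment) if a == c]
                if members:
                    centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
        return centroids, assignment

    # Example 4.1 data: (age, years of service), k = 2
    data = [(30, 5), (50, 25), (50, 15), (25, 5), (30, 10), (55, 25)]
    print(kmeans(data, 2))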



- The algorithm begins by randomly choosing k records to represent the centroids (means) m1, m2, …, mk of the clusters C1, C2, …, Ck. All the records are placed in a given cluster based on the distance between the record and the cluster mean. If the distance between mi and record rj is the smallest among all cluster means, then rj is placed in cluster Ci.
- Once all records have been placed in a cluster, the mean for each cluster is recomputed.
- Then the process repeats, by examining each record again and placing it in the cluster whose mean is closest. Several iterations may be needed, but the algorithm will converge, although it may terminate at a local optimum.
88
Square-error criterion
- The terminating condition is usually the square-error criterion, defined as:

  E = Σ_{i=1}^{k} Σ_{p∈Ci} |p − mi|^2

  where E is the sum of the square error for all objects in the database, p is the point in space representing a given object, and mi is the mean of cluster Ci.
- This criterion tries to make the resulting clusters as compact and as separate as possible.
89
Example 4.1: Consider the k-means clustering algorithm working with the (2-dimensional) records in Table 2. Assume that the number of desired clusters k is 2.

RID   Age   Years of Service
1     30    5
2     50    25
3     50    15
4     25    5
5     30    10
6     55    25

- Let the algorithm choose the record with RID 3 for cluster C1 and the record with RID 6 for cluster C2 as the initial cluster centroids.
- The first iteration:
  distance(r1, C1) = √((50−30)^2 + (15−5)^2) = 22.4 and distance(r1, C2) = 32.0, so r1 ∈ C1.
  distance(r2, C1) = 10.0 and distance(r2, C2) = 5.0, so r2 ∈ C2.
  distance(r4, C1) = 26.9 and distance(r4, C2) = 36.1, so r4 ∈ C1.
  distance(r5, C1) = 20.6 and distance(r5, C2) = 29.2, so r5 ∈ C1.
- Now the new means (centroids) for the two clusters are computed.
90

- The mean of a cluster Ci with n records of m dimensions is the vector:

  ( (1/n) Σ_{rj∈Ci} rj1, …, (1/n) Σ_{rj∈Ci} rjm )

- The new mean for C1 is (33.75, 8.75) and the new mean for C2 is (52.5, 25).
- The second iteration: r1, r4, r5 ∈ C1 and r2, r3, r6 ∈ C2. The means for C1 and C2 are recomputed as (28.3, 6.7) and (51.7, 21.7).
- In the next iteration, all records stay in their previous clusters and the algorithm terminates.
91
Clustering of a set of objects based on the k-means method.
92
Comments on the K-Means Method
- Strengths:
  - Relatively efficient: O(tkn), where n is the number of objects, k the number of clusters, and t the number of iterations. Normally, k, t << n.
  - Often terminates at a local optimum. The global optimum may be found using techniques such as deterministic annealing and genetic algorithms.
- Weaknesses:
  - Applicable only when the mean is defined; what about categorical data?
  - Need to specify k, the number of clusters, in advance.
  - Unable to handle noisy data and outliers.
  - The quality of the clustering depends on the choice of the initial cluster centroids.
93
Partitioning Around Medoids (k-medoids)
94
k-medoids
Compute the total cost S of swapping a current medoid Oj with a non-medoid object Orandom: S = Σp Cp, where Cp is the reassignment cost of each non-medoid object p.
95
k-medoids
The reassignment cost Cp (Oi denotes another current medoid):
- p is reassigned from Oj to Oi:        Cp = d(p, Oi) − d(p, Oj)
- p is reassigned from Oj to Orandom:   Cp = d(p, Orandom) − d(p, Oj)
- p stays with its medoid Oi:           Cp = 0
- p is reassigned from Oi to Orandom:   Cp = d(p, Orandom) − d(p, Oi)
96
Properties of the k-medoids algorithm
- Each cluster has a representative object, the medoid, which is the most centrally located object of its cluster.
  - This reduces the influence of noise (outliers/irregularities/extreme values).
  - The number of clusters k needs to be predetermined.
- Complexity of each iteration: O(k(n − k)^2)
  - The algorithm becomes very costly for large values of n and k.
97
Hierarchical Clustering
- A hierarchical clustering method works by grouping data objects into a tree of clusters.
- In general, there are two types of hierarchical clustering methods:
  - Agglomerative hierarchical clustering: this bottom-up strategy starts by placing each object in its own cluster and then merges these atomic clusters into larger and larger clusters, until all of the objects are in a single cluster or until certain termination conditions are satisfied. Most hierarchical clustering methods belong to this category. They differ only in their definition of intercluster similarity.
  - Divisive hierarchical clustering: this top-down strategy does the reverse of agglomerative hierarchical clustering by starting with all objects in one cluster. It subdivides the cluster into smaller and smaller pieces, until each object forms a cluster on its own or until certain termination conditions are satisfied, such as a desired number of clusters being obtained or the distance between the two closest clusters exceeding a certain threshold.
98
Agglomerative algorithm
Assume that we are given n data records r1, r2, …, rn and a function D(Ci, Cj) for measuring the distance between two clusters Ci and Cj. Then an agglomerative algorithm for clustering can be as follows:

for i = 1, …, n do let Ci = {ri};
while there is more than one cluster left do
begin
    let Ci and Cj be the pair of clusters that minimizes the distance D(Ck, Ch) over all pairs of clusters Ck, Ch;
    Ci = Ci ∪ Cj;
    remove cluster Cj;
end
99
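A short Python sketch of this agglomerative loop, using the single-link (nearest neighbor) intercluster distance defined a few slides below; the data format, the function names, and stopping at a target number of clusters are assumptions:

    import math
    from itertools import combinations

    def single_link(Ci, Cj):
        """Dsl(Ci, Cj) = minimum distance between any point of Ci and any point of Cj."""
        return min(math.dist(x, y) for x in Ci for y in Cj)

    def agglomerative(records, target_k=1):
        """Repeatedly merge the two closest clusters until target_k clusters remain."""
        clusters = [[r] for r in records]              # start: each record in its own cluster
        while len(clusters) > target_k:
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda p: single_link(clusters[p[0]], clusters[p[1]]))
            clusters[i].extend(clusters[j])            # Ci = Ci U Cj
            del clusters[j]                            # remove cluster Cj
        return clusters

    print(agglomerative([(1, 1), (1, 2), (5, 5), (6, 5), (9, 9)], target_k=2))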
Example
- Example 4.2: Figure 4 shows the application of AGNES (AGglomerative NESting), an agglomerative hierarchical clustering method, and DIANA (DIvisive ANAlysis), a divisive hierarchical clustering method, to a data set of five objects, {a, b, c, d, e}.
- Initially, AGNES places each object into a cluster of its own. The clusters are then merged step by step according to some criterion. For example, clusters C1 and C2 may be merged if an object in C1 and an object in C2 form the minimum Euclidean distance between any two objects from different clusters. This is a single-link approach, in that each cluster is represented by all of the objects in the cluster, and the similarity between two clusters is measured by the similarity of the closest pair of data points belonging to different clusters.
- The cluster merging process repeats until all of the objects are eventually merged to form one cluster.
100
[Figure 4: Agglomerative (AGNES) and divisive (DIANA) hierarchical clustering on the data objects {a, b, c, d, e}.]
101
Hierarchical Clustering
- In DIANA, all of the objects are used to form one initial cluster. The cluster is split according to some principle, such as the maximum Euclidean distance between the closest neighboring objects in the cluster. The cluster splitting process repeats until, eventually, each new cluster contains only a single object.
- In general, divisive methods are more computationally expensive and tend to be less widely used than agglomerative methods.
- There are a variety of methods for defining the intercluster distance D(Ck, Ch). However, local pairwise distance measures (i.e., between pairs of clusters) are especially suited to hierarchical methods.
102
Hierarchical Clustering
- One of the most important intercluster distances is the nearest neighbor or single-link method. It defines the distance between two clusters as the distance between the two closest points, one from each cluster:

  Dsl(Ci, Cj) = min { d(x, y) | x ∈ Ci, y ∈ Cj }

  where d(x, y) is the distance between objects x and y.
- If the distance between two clusters is defined as the distance between the two farthest points, one from each cluster:

  Dcl(Ci, Cj) = max { d(x, y) | x ∈ Ci, y ∈ Cj }

  the method is called the complete-linkage algorithm.
103
[Figure: Criteria for merging two clusters: single-linkage (distance between the closest pair of points) vs. complete-linkage (distance between the farthest pair of points).]
104
Major weaknesses of hierarchical clustering methods:
1. They do not scale well: the time complexity is at least O(n^2), where n is the total number of objects. The agglomerative algorithm requires in the first iteration that we locate the closest pair of objects, which takes O(n^2) time; so, in most cases, the algorithm requires O(n^2) time, and frequently much more.
2. We can never undo what was done previously.
105
Some advanced hierarchical clustering algorithms
- BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies): partitions the objects using two concepts: the clustering feature and the clustering-feature tree.
- ROCK (RObust Clustering using linKs): a clustering algorithm for categorical/discrete attributes.
- Chameleon: a hierarchical clustering algorithm using dynamic modeling.
106
Introduction to BIRCH
(1996)
- Hierarchical representation of a clustering
- The data set needs to be scanned only once
- Clustering decisions are made on-the-fly when new data points are inserted
- Outlier removal
- Good for very large datasets with limited computational resources
107
Introduction to BIRCH
Features of a cluster:
- Centroid: the Euclidean mean of the cluster's points
- Radius: the average distance from any member point to the centroid
- Diameter: the average pairwise distance between two data points in the cluster
108
Alternative measures for closeness of two clusters
Any of the following can be used as the distance metric to compare a new data point to existing clusters in the BIRCH algorithm:
- D0: Euclidean distance between centroids
- D1: Manhattan distance between centroids
109
Alternative measures for closeness of two clusters
And for deciding whether to merge clusters:
- D2: average inter-cluster distance between two clusters
- D3: average intra-cluster distance inside a cluster (geometrically, the diameter of the new cluster if the two clusters are merged)
- D4: variance-increase distance: the amount by which the intra-cluster distance variance changes if the two clusters are merged
110
Clustering feature
- Maintained for each subcluster
- Contains enough information to calculate intra-cluster distances
111
Additivity theorem
- Clustering features are additive
- This allows us to merge two subclusters
112
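The slides show the clustering feature only as a figure; in the standard BIRCH formulation it is the triple CF = (N, LS, SS), the count, linear sum and square sum of the subcluster's points, which is additive. A small Python sketch under that assumption (the class name and the 2-D default are hypothetical):

    import math

    class CF:
        """Clustering feature CF = (N, LS, SS): count, linear sum and square sum of the points."""
        def __init__(self, n=0, ls=(0.0, 0.0), ss=0.0):
            self.n, self.ls, self.ss = n, ls, ss

        @classmethod
        def of(cls, point):
            return cls(1, tuple(point), sum(x * x for x in point))

        def merge(self, other):
            # Additivity theorem: CF1 + CF2 = (N1 + N2, LS1 + LS2, SS1 + SS2)
            return CF(self.n + other.n,
                      tuple(a + b for a, b in zip(self.ls, other.ls)),
                      self.ss + other.ss)

        def centroid(self):
            return tuple(x / self.n for x in self.ls)

        def radius(self):
            # square root of the average squared distance to the centroid, derived from N, LS, SS
            c = self.centroid()
            return math.sqrt(max(self.ss / self.n - sum(x * x for x in c), 0.0))

    a = CF.of((1.0, 2.0)).merge(CF.of((3.0, 4.0)))
    print(a.centroid(), a.radius())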
CF tree
- Hierarchical representation of a clustering
- Updated dynamically when new data points are inserted
- Each entry in a leaf node is not a data point but a subcluster
113
CF tree parameters
- The diameter of a leaf entry has to be less than T
- A nonleaf node contains at most B entries
- A leaf node contains at most L entries
- The tree size is a function of T
- B and L are determined by the requirement that each node fits into a memory page of a given size
114
Example CF tree
115
CF-Tree
116
The BIRCH algorithm
1. Build an in-memory CF tree
2. Optional: condense into a smaller CF tree
3. Global clustering
4. Optional: cluster refining
117
Phase 1
- Insert data points one at a time, building a CF tree dynamically
- If the tree grows too large, increase the threshold T and rebuild from the current tree
- Optionally remove outliers when rebuilding the tree
118
Insertion algorithm
- Start from the root node and find the closest leaf node and leaf entry
- If the distance to the centroid is less than the threshold T, the entry will absorb the new data point
- Otherwise, create a new leaf entry
- If there is no space on the leaf for the new entry, split the leaf node
119
CF-Tree Insertion
- Choose the farthest pair of entries and redistribute the remaining entries
- Insert a new nonleaf entry into the parent node
- We may have to split the parent node as well
- If the root is split, the tree height increases by one
120
CF-Tree Insertion
[Slides 121–127: step-by-step CF-tree insertion example (figures only).]
127
Phase 2
- Scan the leaf entries to rebuild a smaller CF tree
- Remove outliers
- Group more crowded subclusters into larger ones
- The idea is to make the next phase more efficient
128
Phase 3
- Leaf nodes do not necessarily represent clusters
- A regular clustering algorithm is used to cluster the leaf entries
129
Phase 4
- Redistribute the data points to their closest centroids
130
6. OTHER DATA MINING PROBLEMS
Discovering Sequential Patterns
- The discovery of sequential patterns is based on the concept of a sequence of itemsets. We assume that transactions are ordered by time of purchase; that ordering yields a sequence of itemsets.
- For example, {milk, bread, juice}, {bread, eggs}, {cookies, milk, coffee} may be such a sequence of itemsets based on three visits of the same customer to the store.
- The support for a sequence S of itemsets is the percentage of the given set U of sequences of which S is a subsequence.
- In this example, {milk, bread, juice}, {bread, eggs} and {bread, eggs}, {cookies, milk, coffee} are considered subsequences.
131


- The problem of identifying sequential patterns, then, is to find all subsequences from the given sets of sequences that have a user-defined minimum support. The sequence S1, S2, S3, … is a predictor of the fact that a customer who buys itemset S1 is likely to buy itemset S2, and then S3, and so on. This prediction is based on the frequency (support) of this sequence in the past.
- Various algorithms have been investigated for sequence detection.
132
Mining Time Series
Discovering Patterns in Time Series
- A time-series database consists of sequences of values or events changing with time. The values are typically measured at equal time intervals. Time-series databases are popular in many applications, such as studying daily fluctuations of a stock market, traces of scientific experiments, medical treatments, and so on.
- A time series can be illustrated as a time-series graph, which describes a point moving with the passage of time.
133
Time series data: Stock price of IBM over
time
134
Categories of Time-Series Movements:
- Long-term or trend movements
- Cyclic movements or cyclic variations, e.g., business cycles
- Seasonal movements or seasonal variations, i.e., almost identical patterns that a time series appears to follow during corresponding months of successive years
- Irregular or random movements
135
Similarity Search in Time-Series Analysis
- A normal database query finds exact matches.
- A similarity search finds data sequences that differ only slightly from the given query sequence.
- Two categories of similarity queries:
  - Whole matching: find a sequence that is similar to the query sequence.
  - Subsequence matching: find all pairs of similar sequences.
- Typical applications:
  - Financial markets
  - Transaction data analysis
  - Scientific databases (e.g., power consumption analysis)
  - Medical diagnosis (e.g., cardiogram analysis)
136
Data transformation
- For similarity analysis of time-series data, Euclidean distance is typically used as the similarity measure.
- Many techniques for signal analysis require the data to be in the frequency domain. Therefore, distance-preserving transformations are often used to transform the data from the time domain to the frequency domain.
- Usually, data-independent transformations are used, where the transformation matrix is determined a priori, e.g., the discrete Fourier transform (DFT) or the discrete wavelet transform (DWT).
- The Euclidean distance between two signals in the time domain is the same as their Euclidean distance in the frequency domain.
- DFT does a good job of concentrating energy in the first few coefficients.
- If we keep only the first few coefficients of the DFT, we can compute a lower bound of the actual distance.
137
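A small NumPy sketch of this idea (an illustration; the synthetic sequences, the number of kept coefficients, and the helper name are assumptions): keeping only the first few DFT coefficients of an orthonormal transform yields a distance that lower-bounds the true Euclidean distance.

    import numpy as np

    def dft_distance(x, y, n_coeff=4):
        """Euclidean distance computed from the first n_coeff DFT coefficients only.
        With an orthonormal DFT, dropping coefficients can only remove energy,
        so this is a lower bound of the true Euclidean distance between x and y."""
        X = np.fft.fft(x, norm="ortho")[:n_coeff]
        Y = np.fft.fft(y, norm="ortho")[:n_coeff]
        return np.sqrt(np.sum(np.abs(X - Y) ** 2))

    rng = np.random.default_rng(0)
    x = np.cumsum(rng.normal(size=128))                   # two synthetic "price" series
    y = x + rng.normal(scale=0.5, size=128)
    print(dft_distance(x, y), np.linalg.norm(x - y))      # lower bound <= true distance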
Multidimensional Indexing
- A multidimensional index is constructed for efficient access using the first few Fourier coefficients.
- Use the index to retrieve the sequences that are at most a certain small distance away from the query sequence.
- Perform post-processing by computing the actual distance between sequences in the time domain and discarding any false matches.
138
Subsequence Matching
- Break each sequence into a set of pieces using a window of length w.
- Extract the features of the subsequence inside the window.
- Map each sequence to a “trail” in the feature space.
- Divide the trail of each sequence into “subtrails” and represent each of them with a minimum bounding rectangle. (R-trees and R*-trees have been used to store minimum bounding rectangles so as to speed up the similarity search.)
- Use a multipiece assembly algorithm to search for longer sequence matches.
139
We can group clusters of data points with “boxes”, called Minimum Bounding Rectangles (MBRs). We can further recursively group MBRs into larger MBRs…
[Figure: data points grouped into leaf MBRs R1–R9, which are grouped into larger MBRs R10, R11, R12.]
140
…these nested MBRs are organized as a tree (called a spatial access tree or a multidimensional tree). Examples include the R-tree, Hybrid-Tree, etc.
[Figure: the MBR hierarchy as a tree with root entries R10, R11, R12, child entries R1–R9, and data nodes containing the points.]
141
Discretization
- Discretization of a time series is transforming it into a symbolic string.
- The main benefit of this discretization is that there is an enormous wealth of existing algorithms and data structures that allow the efficient manipulation of symbolic representations.
- Lin and Keogh et al. (2003) proposed a method called Symbolic Aggregate Approximation (SAX), which allows the discretization of an original time series into a symbolic string.
142
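A minimal Python sketch of the SAX idea described on the following slides (z-normalize, reduce with PAA, then map segment means to symbols); the Gaussian breakpoints for an alphabet of size 3 and the function name are assumptions:

    import numpy as np

    def sax(series, word_size=8, breakpoints=(-0.43, 0.43), alphabet="abc"):
        """Convert a numeric time series to a SAX word of length word_size."""
        x = np.asarray(series, dtype=float)
        x = (x - x.mean()) / x.std()                           # z-normalize
        # PAA: mean of each of word_size equal-length segments
        paa = np.array([seg.mean() for seg in np.array_split(x, word_size)])
        # map each segment mean to a symbol via the breakpoints
        return "".join(alphabet[np.searchsorted(breakpoints, v)] for v in paa)

    rng = np.random.default_rng(0)
    print(sax(np.cumsum(rng.normal(size=128))))                # an 8-symbol word over {a, b, c}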
Symbolic Aggregate Approximation (SAX) [Lin et al. 2003]
The first symbolic representation of time series that allows:
- Lower bounding of the Euclidean distance
- Dimensionality reduction
- Numerosity reduction
[Figure: a time series discretized into the SAX word “baabccbc”.]
143
How do we obtain SAX?
First convert the time series to its PAA representation, then convert the PAA to symbols. It takes linear time.
[Figure: a time series C is reduced to an 8-segment PAA and mapped to the symbols “baabccbc”.]
144
Two parameter choices
[Figure: the same series discretized with word size 8 (segments numbered 1–8) and alphabet size (cardinality) 3 (symbols a, b, c).]
145
Time Series Data Mining tasks
• Similarity Search
• Classification
• Clustering
• Motif Discovery
• Novelty/Anomaly Detection
• Time series visualization
• Time series prediction
146
Some representative data mining tools
- Acknosoft (Kate): decision trees, case-based reasoning
- DBMiner Technology (DBMiner): OLAP analysis, associations, classification, clustering
- IBM (Intelligent Miner): classification, association rules, predictive models
- NCR (Management Discovery Tool): association rules
- SAS (Enterprise Miner): decision trees, association rules, neural networks, regression, clustering
- Silicon Graphics (MineSet): decision trees, association rules
- Oracle (Oracle Data Mining): classification, prediction, regression, clustering, association, feature selection, feature extraction, anomaly detection
- Weka (http://www.cs.waikato.ac.nz/ml/weka), University of Waikato, New Zealand: the system is written in Java; platforms: Linux, Windows, Macintosh
147
7. POTENTIAL APPLICATIONS OF DM
- Database analysis and decision support
  - Market analysis and management: target marketing, customer relationship management, market basket analysis, cross-selling, market segmentation
  - Risk analysis and management: forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and management
- Other applications:
  - Text mining (newsgroups, email, documents) and Web analysis
  - Intelligent query answering
148
Market Analysis and Management
- Where are the data sources for analysis?
  - Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
- Target marketing
  - Find clusters of “model” customers who share the same characteristics: interests, income level, spending habits, etc.
- Determine customer purchasing patterns over time
  - Conversion of a single to a joint bank account: marriage, etc.
149

Cross-market analysis
  - Associations/correlations between product sales
  - Prediction based on the association information
- Customer profiling
  - Data mining can tell you what types of customers buy what products (clustering or classification)
- Identifying customer requirements
  - Identify the best products for different customers
  - Use prediction to find what factors will attract new customers
- Provide summary information
  - Various multidimensional summary reports
  - Statistical summary information (data central tendency and variation)
150
Corporate Analysis and Risk Management
- Finance planning and asset evaluation
  - Cash flow analysis and prediction
  - Contingent claim analysis to evaluate assets
  - Cross-sectional and time-series analysis (financial ratios, trend analysis, etc.)
- Resource planning
  - Summarize and compare the resources and spending
- Competition
  - Monitor competitors and market directions
  - Group customers into classes and apply a class-based pricing procedure
  - Set pricing strategy in a highly competitive market
151
Fraud Detection and Management
- Applications
  - Widely used in health care, retail, credit card services, telecommunications (phone card fraud), etc.
- Approach
  - Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
- Examples
  - Auto insurance: detect groups of people who stage accidents to collect on insurance
  - Money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
  - Medical insurance: detect professional patients and rings of doctors and rings of references
152
Fraud Detection and Management (cont.)
- Detecting inappropriate medical treatment
  - The Australian Health Insurance Commission identified that in many cases blanket screening tests were requested (saving Australian $1M per year).
- Detecting telephone fraud
  - Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  - British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion-dollar fraud.
- Retail
  - Analysts estimate that 38% of retail shrink is due to dishonest employees.
153
Other Applications
- Sports
  - IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain a competitive advantage for the New York Knicks and Miami Heat.
- Astronomy
  - JPL and the Palomar Observatory discovered 22 quasars with the help of data mining.
- Internet Web Surf-Aid
  - IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.
154