Ch2-DataMiningTasks
Download
Report
Transcript Ch2-DataMiningTasks
Chapter 2
Data Mining Tasks
1
Data Mining Tasks
Prediction methods
Use some variables to predict unknown or future values
of the same or other variables.
Inference on the current data in order to make
prediction
Description methods
Find human interpretable patterns that describe data
Characterize the general properties of data in db
Descriptive mining is complementary to predictive mining
but it is closer to decision support than decision making
2
Cont’d
Association Rule Mining (descriptive)
Classification and Prediction (predictive)
Clustering (descriptive)
Sequential Pattern Discover (descriptive)
Regression (predictive)
Deviation Detection (predictive)
3
Association Rule Mining
Initially developed for market basket analysis
Goal is to discover relationships between
attributes
Data is typically stored in very large databases,
sometimes in flat files or images
Uses include decision support, classification and
clustering
Application areas include business, medicine and
engineering
4
Association Rule Mining
Given a set of transactions, each
of which is a set of items, find
all rules (XY) that satisfy
user specified minimum support
and confidence constraints
Support = (#T containing X and
Y)/(#T)
Confidence=(#T containing X
and Y)/ (#T containing X)
Applications
Cross selling and up selling
Supermarket shelf
management
Transaction
T1
T2
T3
T4
T5
Items
Bread, Jelly, Jem
Bread, Jem
Bread, Milk, Jem
Coffee, Bread
Coffee, Milk
Some rules discovered
Bread Jem
Sup=60%, conf=75%
Jelly Bread
Sup=60%, conf=100%
Jelly Jem
Sup=20%, conf=100%
Jelly Milk
Sup=0%
5
Association Rule Mining:
Definition
Given a set of records, each of which
contain some number of items from a given
collection:
Produce dependency rules which will predict
occurrence of an item based on occurrences of
other items
Example:
{Bread} {Jem}
{Jelly} {Jem}
6
Association Rule Mining:
Marketing and sales promotion
Say the rule discovered is
{Bread, …} {Jem}
Jem as a consequent: can be used to determine what
products will boost its sales.
Bread as antecedent: can be used to see which products
will be impacted if the store stops selling bread
Bread as an antecedent and Jem as a consequent: can be
used to see what products should be stocked along with
Bread to promote the sale of Jem.
7
Association Rule Mining:
Supermarket shelf management
Goal: To identify items that are bought
concomitantly by a reasonable fraction of
customers so that they can be shelved.
Data Used: Point-of sale data collected with
barcode scanners to find dependencies among
products.
Example
If customer buys jelly, then he is very likely to by Jem.
So don’t be surprised if you find Jem next to Jelly on an
aisle in the super market. Also salsa next to tortilla
chips.
8
Association Rule Mining
Association rule mining will produce LOTS of rules
How can you tell which ones are important?
High Support
High Confidence
Rules involving certain attributes of interest
Rules with a specific structure
Rules with support / confidence higher than expected
Completeness – Generating all interesting rules
Efficiency – Generating only rules that are interesting
9
Clustering
Determine object groupings such that objects within the
same cluster are similar to each other, while objects in
different groups are not
Typically objects are represented by data points in a
multidimensional space with each dimension
corresponding to one or more attributes. Clustering
problem in this case reduces to the following:
Given a set of data points, each having a set of attributes, and a
similarity measure, find cluster such that
Data points in one cluster are more similar to one another
Data points in separate clusters are less similar to one another
10
Cont’d
Similarity measures:
Euclidean distance (continuous attr.)
Other problem – specific measures
Types of Clustering
Group-Based Clustering
Hierarchical Clustering
11
Clustering Example
Euclidean distance
based clustering in 3D
space
Intra cluster distances
are minimised
Inter cluster distances
are maximised
12
Clustering: Market Segmentation
Goal: To subdivide a market into distinct subset of
customers where each subset can be targeted with
a distinct marketing mix
Approach:
Collect different attributes of customers based on their
geographical and lifestyle related information
Find clusters of similar customers
Measure the clustering quality by observing the buying
patterns of customers in the same cluster vs. those from
different clusters.
13
Clustering: Document Clustering
Goal: To find groups of documents that are similar
to each other based on important terms appearing
in them
Approach: To identify frequently occurring terms
in each document. Form a similarity measure
based on frequencies of different terms. Use it to
generate clusters.
Gain: Information Retrieval can utilize the clusters
to relate a new document or search to clustered
documents
14
Clustering:
Document Clustering Example
Clustering points: 3204 articles of LA Times
Similarity measure: Number of common words in
documents (after some word filtering)
Category
Financial
Foreign
National
Metro
Sports
Entertainment
Total articles
555
341
273
943
738
354
Correctly placed articles
364
260
36
746
573
278
15
Classification: Definition
Given a set of records (called the training set)
Each record contains a set of attributes. One of the
attributes is the class
Find a model for the class attribute as a function of
the values of other attributes
Goal: Previous unseen records should be assigned
to a class as accurately as possible
Usually, the given data set is divided into training and
test set, with training set used to build the model and
test set used to validate it. The accuracy of the model is
determined on the test set.
16
Classification: cont’d
Classifiers are created using labeled training samples
Classifiers are evaluated using independent labeled
samples (test set)
Training samples created by ground truth / experts
Classifier later used to classify unknown samples
Measurements must be able to predict the phenomenon!
Examples
Direct marketing
Fraud detection
Customer churn
Sky survey cataloging
Classifying galaxies
17
cla
ss
uo
us
co
nt
in
ca
te
go
ric
al
ca
te
go
ric
al
Classification Example
Tid
Refund
Marital
Status
Taxable
Income
Cheat
1
2
3
4
5
6
7
8
9
10
Yes
No
No
Yes
No
No
Yes
No
No
No
Single
Married
Single
Married
Divorced
Married
Divorced
Single
Married
Single
125K
100K
70K
120K
95K
60K
220K
85K
75K
90K
No
No
No
No
Yes
No
No
Yes
No
Yes
Training
Set
Refund
Marital
Status
Taxable
Income
Cheat
Yes
No
No
Yes
No
No
Yes
No
No
No
Single
Married
Single
Married
Divorced
Married
Divorced
Single
Married
Single
125K
100K
70K
120K
95K
60K
220K
85K
75K
90K
No
No
No
No
Yes
No
No
Yes
No
Yes
Test
set
Learn
Classifier
Model
18
Classification: Direct Marketing
Goal: Reduce cost of mailing by targeting a set of
consumers likely to buy a new cell phone product
Approach:
Use the data collected for a similar product introduced in the recent
past.
Use the profiles of consumers along with their (buy, didn’t buy}
decision. The latter becomes the class attribute.
The profile of the information may consist of demographic,
lifestyle and company interaction.
Demographic – Age, Gender, Geography, Salary
Psychographic - Hobbies
Company Interaction – Recentness, Frequency, Monetary
Use these information as input attributes to learn a classifier model
19
Classification: Fraud Detection
Goal: Predict fraudulent cases in credit card
transactions
Approach:
Use credit card transactions and the information on its
account holders as attributes (important: when and
where the card was used)
Label past transactions as {fraud, fair} transactions.
This forms the class attribute
Learn a model for the class of transactions
Use this model to detect fraud by observing credit card
transactions on an account.
20
Regression
Predict the value of a given continuous valued
variable based on the values of other variables,
assuming a linear or non-linear model of
dependency
Extensively studied in the fields of Statistics and
Neural Networks
Predicting sales number of new product based on
advertising expenditure
Predicting wind velocities based on temperature,
humidity, air pressure, etc
Time series prediction of stock market indices
21
Deviation/Anomaly Detection
Some data objects do not comply with the general
behavior or model of the data. Data objects that
are different from or inconsistent with the
remaining set are called outliers
Outliers can be caused by measurement or
execution error. Or they represent some kind of
fraudulent activity
Goal of deviation/anomaly detection is to detect
significant deviations from normal behavior
22
Deviation/Anomaly Detection:
Definition
Given a set of n points or objects, and k, the
expected number of outliers, find the top k
objects that considerably dissimilar,
exceptional or inconsistent with the
remaining data
This can be viewed as two sub problems
Define what data can be considered as
inconsistent in a given data set
Find an efficient method to mine the outliers
23
Deviation:
Credit Card Fraud Detection
Goal: to detect fraudulent credit card transactions
Approach:
Based on past usage patterns, develop model for
authorized credit card transactions
Check for deviation from model, before authenticating
new credit card transactions
Hold payment and verify authenticity of “doubtful”
transaction by other means (phone call, etc.)
24
Anomaly detection:
Network Intrusion Detection
Goal: to detect intrusion of a computer
network
Approach:
Define and develop a model for normal user
behavior on the computer network
Continuously monitor behavior of users to
check if it deviates from the defined normal
behavior
Raise an alarm, if such deviation is found
25
Sequential pattern discovery:
definition
Given is a set of objects, with each object
associated with its own time of events, find
rules that predict strong sequential
dependencies among different events
Sequence discovery aims at extracting sets
of events that commonly occur over a
period of time
(A B) (C) (D E)
26
Sequential pattern discovery:
Telecommunication Alarm Logs
Telecommunication alarm logs
(Inverter_Problem Excessive_Line_Current)
(Rectifier_Alarm) (Fire_Alarm)
27
Sequential pattern discovery:
Point of Sell Up Sell / Cross Sell
Point of sale transaction sequences
Computer bookstore
(Intro_to_Visual_C) (C++ Primer)
(Perl_For_Dummies, Tcl_Tk)
60% customers who buy Intro toVisual C and C++
Primer also buy Perl for dummies and Tcl Tk within
a month
Athletic apparel store
(Shoes) (Racket, Racket ball) (Sport_Jacket)
28
Example: Data Mining(Weather
data)
By applying various data mining techniques, we
can find
associations and regularities in our data
Extract knowledge in the forms of rules, decision trees
etc.
Predict the value of the dependent variable in new
situation
Some example
Mining association rules
Classification by decision trees and rules
Prediction methods
29
Mining association rules
First, discretize the numeric attributes (a part of
the data preprocessing stage)
Group the temperature values in three intervals
(hot, mild, cool) and humidity values in two (high,
normal)
Substitute the values in data with the
corresponding names
Apply the Apriori algorithm and get the following
rules
30
Discretized weather data
Day
outlook
temperature
humidity
windy
play
1
sunny
hot
high
false
No
2
sunny
hot
high
true
No
3
overcast
hot
high
False
Yes
4
rainy
mild
high
False
Yes
5
rainy
cool
normal
False
Yes
6
rainy
cool
normal
True
No
7
overcast
cool
normal
True
Yes
8
sunny
mild
high
False
No
9
sunny
cool
normal
False
Yes
10
rainy
mild
normal
False
Yes
11
sunny
mild
normal
True
Yes
12
overcast
mild
high
True
Yes
13
overcast
hot
normal
False
Yes
14
rainy
mild
high
true
no
31
Cont’d
humidity=normal windy=false play=yes (4,1)
temperature=cool humidity=normal (4,1)
outlook=overcast play=yes (4,1)
temperature=cool play=yes humidity=normal (3,1)
outlook=rainy windy=false play=yes (3, 1)
outlook=rainy play=yes windy=false (3, 1)
outlook=sunny humidity=high play=no (3, 1)
outlook=sunny play=no humidity=high (3, 1)
temperature=cool windy=false humidity=normal play=yes (2,
1)
10. temperature=cool humidity=normal windy=false play=yes (2,
1)
1.
2.
3.
4.
5.
6.
7.
8.
9.
32
Cont’d
These rules show some attribute values sets
(itemsets) that appear frequently in the data
Support (the number of occurrences of the
itemset in the data)
Confidence (accuracy) of the rules
Rule 3 – the same as the one that is
produced by observing the data cube
33
Classification by Decision Trees
and Rules
Using ID3 algorithm, the following decision tree
is produced
Outlook=sunny
Humidity=high:no
Humidity=normal:yes
Outlook=overcast:yes
Outlook=rainy
Windy=true:no
Windy=false:yes
34
Cont’d
Decision tree consists of:
Decision nodes that test the values of their
corresponding attribute
Each value of this attribute leads to a subtree and so on,
until the leaves of the tree are reached
They determine the value of the dependent variable
Using a decision tree we can classify new tuples
35
Cont’d
A decision tree can be presented as a set of rules
Each rule represents a path through the tree from the
root to a leaf
Other data mining techniques can produce rules
directly: Prism algorithm
if outlook=overcast then yes
if humidity=normal and windy=false then yes
If temperature=mild and humidity=normal the yes
If outlook=rainy and windy=false then yes
If outlook=sunny and humidity=high then no
If outlook=rainy and windy=true then no
36
Prediction methods
DM offers techniques to predict the value of the
dependent variable directly without first
generating a model
The most popular approaches is based of statistical
methods
Uses the Bayes rule to predict the probability of
each value of the dependent variable given the
values of the independent variables
37
Cont’d
Eg: applying Bayes to the new tuple:
(sunny, mild, normal, false, ?)
P(play=yes| outlook=sunny, temperature=mild,
humidity=normal, windy=false) = 0.8
P(play=no| outlook=sunny, temperature=mild,
humidity=normal, windy=false) = 0.2
The predicted value must be “yes”
38
Data Mining : Problems and Challenges
Noisy
data
Large
Database
s
Dynamic
Database
s
39
Noisy data
many of attribute values will be inexact or
incorrect
erroneous instruments measuring some property
human errors occurring at data entry
two forms of noise in the data
corrupted values - some of the values in the training set
are altered from the original form
missing values - one or more of the attribute values
may be missing both for examples in the training set
and for object which are to be classified.
40
Difficult Training Set
Non-representative data
Learning are based on a few examples
Using large db, the rules probably representative
Absence of boundary cases
To find the real differences between two classes
Limited information
Two objects to be classified give the same conditional
attributes but are classified in the diff class
Not have enough information of distinguishing two
types of objects
41
Dynamic databases
Db change continually
Rules that reflect the content of the db at all
time (preferred)
If same changes are made, the whole
learning process may have to be conducted
again
42
Large databases
The size of db to be ever increasing
Machine learning algorithms – handling a
small training set (a few hundred examples)
Much care on using similar techniques in
larger db
Large db – provide more knowledge (eg.
rules may be enormous)
43
Data Mining – Issues in Data Mining
User Interaction / Visualization
Incorporation of Background Knowledge
Noisy or Incomplete Data
Determining Interestingness of Patterns
Efficiency and Scalability
Parallel and Distributed Mining
Incremental Learning / Mining Time-Changing Phenomena
Mining from Image / Video / Audio Data
Mining Unstructured Data
44