Data Mining

Knowledge Discovery & Data Mining
• Knowledge discovery in databases (KDD): the process of
extracting previously unknown, valid, and actionable
(understandable) information from large databases
• Data mining is the step in the KDD process that
applies data analysis and discovery algorithms
• Draws on machine learning, pattern recognition, statistics,
databases, and data visualization
• Traditional techniques may be inadequate
– the data is too large
Why Mine Data?
• Huge amounts of data being collected and
warehoused
– Walmart records 20 million transactions per day
– health care transactions: multi-gigabyte databases
– Mobil Oil: geological data of over 100 terabytes
• Affordable computing
• Competitive pressure
– gain an edge by providing improved, customized services
– information as a product in its own right
Data mining
• Pattern
– 1212121?
– The ‘12’ pattern occurs often enough that, with some confidence, we
can say ‘?’ is 2
– “If ‘1’, then ‘2’ follows”
– Pattern → Model
• Confidence
– 12121?
– 12121231212123121212?
– 121212 → 3
• Models are created from historical data by detecting
patterns. A model is a calculated guess about the likelihood
that a pattern will repeat.
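The "calculated guess" can be sketched in a few lines. This is only an illustration: the function name `next_symbol_confidence` is hypothetical, and confidence is assumed here to mean the fraction of past occurrences of a context that were followed by the predicted symbol.

```python
def next_symbol_confidence(history, context):
    """Guess the symbol that follows `context`, with a confidence in [0, 1]."""
    counts = {}
    n = len(context)
    # Count what followed every earlier occurrence of the context.
    # The final position is skipped because nothing follows it yet.
    for i in range(len(history) - n):
        if history[i:i + n] == context:
            nxt = history[i + n]
            counts[nxt] = counts.get(nxt, 0) + 1
    total = sum(counts.values())
    if total == 0:
        return None, 0.0  # context never seen with a successor
    best = max(counts, key=counts.get)
    return best, counts[best] / total


print(next_symbol_confidence("1212121", "1"))                    # ('2', 1.0)
print(next_symbol_confidence("12121231212123121212", "12"))      # ('1', 0.75)
print(next_symbol_confidence("12121231212123121212", "121212"))  # ('3', 1.0)
```

A longer context gives a different answer for the same history: "12" alone suggests 1, while the full "121212" has so far always been followed by 3.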
Where are Models Used?
1. Selection
Business trying to select prospective customers (Profitability)
A model that predicts the LD usage based on credit history.
2. Acquisition
Selection is deciding whom to invite to a party. Acquisition is
about getting them to agree: putting together a plan that will
make them say yes. Again, a model.
3. Retention
Keeping your flock together! Sensing when customers are about to jump ship.
4. Extension
Extending services to existing customers. Cross-selling
Data Mining Techniques
• Classification
• Clustering
• Association Rule Discovery
• Sequential Pattern Discovery
Classification
• Data defined in terms of attributes, one of which is the
class
• Find a model for class attribute as a function of
the values of other (predictor) attributes, such
that previously unseen records can be assigned
a class as accurately as possible.
• Training Data: used to build the model
• Test data: used to validate the model (determine
accuracy of the model)
Given data is usually divided into training and test sets.
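A minimal sketch of the training/test workflow above. The classifier is a 1-nearest-neighbour rule, chosen only for brevity (the slides do not name a method), and the customer records, attribute choice, and function names are all made up for illustration.

```python
import math

# Hypothetical records: (age, income in $1000s) -> class attribute
records = [
    ((25, 30), "not buy"),
    ((30, 35), "not buy"),
    ((28, 32), "not buy"),
    ((45, 80), "buy"),
    ((50, 90), "buy"),
    ((48, 85), "buy"),
    ((27, 33), "not buy"),
    ((52, 88), "buy"),
]


def split(data, train_fraction=0.75):
    """Divide the given data into a training set and a test set."""
    cut = int(len(data) * train_fraction)
    return data[:cut], data[cut:]


def predict(train, x):
    """Assign the class of the closest training record (1-nearest-neighbour)."""
    nearest = min(train, key=lambda rec: math.dist(rec[0], x))
    return nearest[1]


def accuracy(train, test):
    """Fraction of previously unseen test records classified correctly."""
    correct = sum(1 for x, label in test if predict(train, x) == label)
    return correct / len(test)


train, test = split(records)
print(accuracy(train, test))  # 1.0 on this toy data
```

The split here is positional so the result is reproducible; real data would normally be shuffled before splitting.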
Classification: Example
Classification: Direct Marketing
• Goal: Reduce cost of soliciting (mailing) by
targeting a set of consumers likely to buy a new
product.
• Data
– for similar product introduced earlier
– we know which customers decided to buy and which did
not; this {buy, not buy} decision is the class attribute
– collect various demographic, lifestyle, and
company-related information about all such customers
as possible predictor variables.
• Learn classifier model
Classification: Fraud detection
• Goal: Predict fraudulent cases in credit card
transactions.
• Data
– use credit card transactions and information about the account holder as input variables
– label past transactions as fraud or fair.
• Learn a model for the class of transactions
• Use the model to detect fraud by observing credit
card transactions on a given account.
Clustering
• Given a set of data points, each having a set of
attributes, and a similarity measure among them,
find clusters such that
– data points in one cluster are more similar to one another
– data points in separate clusters are less similar to one
another.
• Similarity measures
– Euclidean distance if attributes are continuous
– Problem-specific measures
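One way to find such clusters with Euclidean distance is k-means, sketched below. The slides do not name an algorithm, so k-means is an assumption here, as are the data points, the starting centroids, and the fixed iteration count.

```python
import math


def kmeans(points, centroids, iterations=10):
    """Group points around centroids, using Euclidean distance as the similarity measure."""
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            idx = min(range(len(centroids)), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its old centroid).
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    return clusters, centroids


# Hypothetical two-attribute data points and starting centroids
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5), (8.0, 8.0), (9.0, 9.0), (8.0, 9.5)]
clusters, centroids = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
print(clusters)
print(centroids)
```

Points within each resulting cluster are closer to one another than to points in the other cluster, which is the property stated above.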
Clustering: Market Segmentation
• Goal: subdivide a market into distinct subsets of
customers where any subset may conceivably be
selected as a market target to be reached with a
distinct marketing mix.
• Approach:
– collect different attributes on customers based on
geographical and lifestyle-related information
– identify clusters of similar customers
– measure the clustering quality by observing buying patterns
of customers in same cluster vs. those from different
clusters.
Association Rule Discovery
• Given a set of records, each of which contains
some number of items from a given collection
– produce dependency rules which will predict occurrence of
an item based on occurrences of other items
Association Rules: Application
• Marketing and Sales Promotion:
• Consider discovered rule:
{Bagels, … } --> {Potato Chips}
– Potato Chips as consequent: can be used to determine
what may be done to boost sales
– Bagels as an antecedent: can be used to see which
products may be affected if bagels are discontinued
– Can be used to see which products should be sold with
Bagels to promote sale of Potato Chips
Association Rules: Application
• Supermarket shelf management
• Goal: to identify items which are bought together
(by sufficiently many customers)
• Approach: process point-of-sale data (collected
with barcode scanners) to find dependencies
among items.
• Example
– If a customer buys Diapers and Milk, then he is very likely to
buy Beer
– so stack six-packs next to diapers?
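"Sufficiently many customers" and "very likely" are usually quantified as the support and confidence of a rule. The sketch below computes both for {Diapers, Milk} --> {Beer}. The transactions are invented for illustration, and the function names are hypothetical.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= set(t)) / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = support(transactions, set(antecedent) | set(consequent))
    ante = support(transactions, antecedent)
    return both / ante if ante else 0.0


# Hypothetical point-of-sale baskets
transactions = [
    {"Diapers", "Milk", "Beer"},
    {"Diapers", "Milk", "Beer", "Bread"},
    {"Diapers", "Milk"},
    {"Milk", "Bread"},
    {"Beer", "Chips"},
]

print(support(transactions, {"Diapers", "Milk", "Beer"}))       # 0.4
print(confidence(transactions, {"Diapers", "Milk"}, {"Beer"}))  # about 0.67
```

A rule is typically kept only if both values clear thresholds chosen by the analyst.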
Sequential Pattern Discovery
• Given: set of objects, each associated with its
own timeline of events, find rules that predict
strong sequential dependencies among different
events, of the form (A B) (C) (D E) --> (F)
• xg: max allowed time between consecutive
event-sets
• ng: min required time between consecutive
event-sets
• ws: window size, max time difference between
earliest and latest events in an event-set (events
within an event-set may occur in any order)
• ms: max allowed time between earliest and
latest events of the sequence.
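The four timing constraints can be checked against one candidate occurrence of a sequence, as sketched below. The slides do not say how the gap between consecutive event-sets is measured. This sketch assumes it runs from the latest event of one set to the earliest event of the next. The data layout, function name, and time units are likewise hypothetical.

```python
def satisfies_constraints(occurrence, xg, ng, ws, ms):
    """occurrence: list of event-sets, each a list of (event, time) pairs."""
    # Earliest and latest timestamp of every event-set.
    spans = [(min(t for _, t in es), max(t for _, t in es)) for es in occurrence]

    # ws: events inside one event-set must fall within the window.
    if any(end - start > ws for start, end in spans):
        return False

    # xg / ng: gap between consecutive event-sets (assumed definition, see above).
    for (_, prev_end), (next_start, _) in zip(spans, spans[1:]):
        gap = next_start - prev_end
        if gap > xg or gap < ng:
            return False

    # ms: total span from the earliest to the latest event of the sequence.
    first = min(start for start, _ in spans)
    last = max(end for _, end in spans)
    return last - first <= ms


# (A B) (C) (D E) with hypothetical timestamps in days
occurrence = [[("A", 1), ("B", 2)], [("C", 4)], [("D", 6), ("E", 7)]]
print(satisfies_constraints(occurrence, xg=3, ng=1, ws=1, ms=7))  # True
print(satisfies_constraints(occurrence, xg=3, ng=1, ws=1, ms=5))  # False: total span is 6
```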
Sequential Pattern Discovery: Examples
• sequences in which customers purchase goods/services
• understanding long term customer behavior -- timely
promotions.
• In point-of-sale transaction sequences
– Computer bookstore:
(Intro to Visual C++) (C++ Primer) --> (Perl for Dummies,
TCL/TK)
– Athletic Apparel Store:
(Shoes) (Racket, Racketball) --> (Sports Jacket)