Transcript Data Mining

Data Mining
An Introduction
Definitions
• Data Mining is …
knowledge discovery using sophisticated blend of
techniques from traditional statistics, artificial
intelligence and computer graphics.
• Data mining is …
discovery of useful, possibly unexpected, patterns in
data.
Definitions
• Data mining is the analysis of (often large)
observational data sets to find unsuspected
relationships and to summarize the data in novel
ways that are both understandable and useful to the
data owner
• Data mining is the extraction of interesting (nontrivial, implicit, previously unknown and potentially
useful) information or patterns from data in large
databases.
Goals of Data Mining
• Explanatory
– To explain some observed event or condition, e.g., why
sales of trucks has increased in region A?
• Confirmatory
– To confirm a hypothesis, e.g., whether two-income families
are more likely to buy family medical coverage as
compared to single-income families
• Exploratory
– To analyze data for new or unexpected relationships, e.g.,
what spending patterns are likely to accompany credit card
fraud?
Data Mining Techniques
• Data mining can be performed on a data mart or
EDW.
• Depending on the nature of the data to be analyzed
and size of data set, any of the following techniques
can be used:
– Regression
• Test or discover relationships from historical data
– Decision tree induction
• Test or discover ‘if…Then’ rules for decision tendency
– Clustering and Signal Processing
• Discover subgroups or segments
Data Mining Techniques(Contd.)
– Affinity
• Discover things with strong mutual relationships
– Sequence Association
• Discover cycles of events & behaviours
– Case-based reasoning
• Derives rules from real-world case examples
– Rule discover
• Searches for patterns & correlations in large datasets
– Fractals
• Compresses large databases without losing information
– Neural nets
• Develops predictive models based on principles modeled after
human brain
Analytical Processing vs. Models
• To a database person, data-mining is a
powerful form of analytic processing --queries that examine large amounts of data.
– Result is the data that answers the query.
• To a statistician, data-mining is the inference
of models.
– Result is the parameters of the model
Data Mining & Knowledge Discovery in
Databases
CRISP-DM methodology
• Cross Industry Standard Process for Data Mining
• Six phases
–
–
–
–
–
–
Business Understanding
Data Understanding
Data Preparation
Modeling
Evaluation
Deployment
Data Mining Applications
• Profiling Populations
– Developing profiles of a high-value customers, credit risks,
and credit-card frauds
• Analysis of business trends
– Identifying markets with above/below average growth
• Target marketing
– Identifying customers for promotional activity
• Usage analysis
– Identifying usage patterns for products and services
• Campaign effectiveness
– Comparing campaign strategies for effectivesness
Data Mining Applications
• Product Affinity
– Identifying products that are purchased concurrently, or
the shoppers characteristics for certain product groups
• Customer retention and churn
– Examining the behaviour of customers who left for
competitors to prevent remaining customers from leaving
• Profitability analysis
– Determining which customers are profitable
• Customer Value Analysis
– Determining where valuable customers are at different
stages in their life
Data Mining Applications
• Up-selling
– Identifying new products and services to sell to a customer
based upon critical events and life-style changes
• Intelligence gathering
– total information awareness
• Web analysis
– Ranking web pages
• Financial
– predict price movements to make better investments
Frequent ItemSets
and
“Association Rules”
Some related concepts to
Data Mining
Slides by Jeffrey Ullman, Stanford
13
The Market-Basket Model
• A large set of items, e.g., things sold in a
supermarket.
• A large set of baskets, each of which is a small
set of the items, e.g., the things one customer
buys on one day.
14
Association Rule Mining
sales
records:
tran1
tran2
tran3
tran4
tran5
tran6
cust33
cust45
cust12
cust40
cust12
cust12
p2,
p5,
p1,
p5,
p2,
p9
p5, p8
p8, p11
p9
p8, p11
p9
market-basket
data
• Trend: Products p5, p8 often bought together
15
Support
• Simplest question: find sets of items that
appear “frequently” in the baskets.
• Support for itemset I = the number of baskets
containing all items in I.
• Given a support threshold s, sets of items that
appear in > s baskets are called frequent
itemsets.
16
Applications --- (1)
• Real market baskets: chain stores keep
terabytes of information about what
customers buy together.
– Tells how typical customers navigate stores, lets
them position tempting items.
– Suggests tie-in “tricks,” e.g., run sale on diapers
and raise the price of beer.
• High support needed, or no $$’s .
17
Applications --- (2)
• “Baskets” = documents; “items” = words in
those documents.
– Lets us find words that appear together unusually
frequently, i.e., linked concepts.
• “Baskets” = sentences, “items” = documents
containing those sentences.
– Items that appear together too often could
represent plagiarism.
18
Applications --- (3)
• “Baskets” = Web pages; “items” = linked
pages.
– Pairs of pages with many common references may
be about the same topic.
• “Baskets” = Web pages p ; “items” = pages
that link to p .
– Pages with many of the same links may be mirrors
or about the same topic.
19
Important Point
• “Market Baskets” is an abstraction that
models any many-many relationship between
two concepts: “items” and “baskets.”
– Items need not be “contained” in baskets.
• The only difference is that we count cooccurrences of items related to a basket, not
vice-versa.
20
Scale of Problem
• WalMart sells 100,000 items and can store
billions of baskets.
• The Web has over 100,000,000 words and
billions of pages.
21
Association Rules
• If-then rules about the contents of baskets.
• {i1, i2,…,ik} → j means: “if a basket contains all
of i1,…,ik then it is likely to contain j.
• Confidence of this association rule is the
probability of j given i1,…,ik.
22
Example
+
_
_
B1 = {m, c, b}
B3 = {m, b}
B5 = {m, p, b}
B7 = {c, b, j}
B2 = {m, p, j}
B4 = {c, j}
+
B6 = {m, c, b, j}
B8 = {b, c}
• An association rule: {m, b} → c.
– Confidence = 2/4 = 50%.
23
Interest
• The interest of an association rule is the
absolute value of the amount by which the
confidence differs from what you would
expect, where items are selected
independently of one another.
24
Example
B1 = {m, c, b}
B3 = {m, b}
B5 = {m, p, b}
B7 = {c, b, j}
B2 = {m, p, j}
B4 = {c, j}
B6 = {m, c, b, j}
B8 = {b, c}
• For association rule {m, b} → c, item c
appears in 5/8 of the baskets.
• Interest = | 2/4 - 5/8 | = 1/8 --- not very
interesting.
25
Relationships Among Measures
• Rules with high support and confidence may
be useful even if they are not “interesting.”
– We don’t care if buying bread causes people to
buy milk, or whether simply a lot of people buy
both bread and milk.
• But high interest suggests a cause that might
be worth investigating.
26
Finding Association Rules
• A typical question: “find all association rules
with support ≥ s and confidence ≥ c.”
– Note: “support” of an association rule is the
support of the set of items it mentions.
• Hard part: finding the high-support (frequent )
itemsets.
– Checking the confidence of association rules
involving those sets is relatively easy.
27
Computation Model
• Typically, data is kept in a “flat file” rather than
a database system.
– Stored on disk.
– Stored basket-by-basket.
– Expand baskets into pairs, triples, etc. as you read
baskets.
28
Computation Model --- (2)
• The true cost of mining disk-resident data is
usually the number of disk I/O’s.
• In practice, association-rule algorithms read
the data in passes --- all baskets read in turn.
• Thus, we measure the cost by the number of
passes an algorithm takes.
29
Main-Memory Bottleneck
• In many algorithms to find frequent itemsets
we need to worry about how main memory is
used.
– As we read baskets, we need to count something,
e.g., occurrences of pairs.
– The number of different things we can count is
limited by main memory.
– Swapping counts in/out is a disaster.
30
Finding Frequent Pairs
• The hardest problem often turns out to be
finding the frequent pairs.
31
Other Effective Approaches
• Naïve Algorithm
• A-Priori Algorithm