Transcript Data Mining

Chase Repp
• Knowledge discovery: searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within them
• Data mining differs from database querying in the following manner: database querying asks “Which company purchased $100,000 worth of widgets last year?” while data mining asks “Which company is likely to purchase over $100,000 of widgets next year, and why?”
• Term coined in the 1960s
• Data mining was used to find basic information in collections of data, such as total revenue over the last three years
• Classical statistics
• Artificial intelligence
• Machine learning
Predictive Data Mining
• Target value
• Future trends
Descriptive Data Mining
• No target value
• Focuses on relations
• Predictive data mining focuses on discovering relationships among independent variables and between dependent and independent variables
• Used to forecast specific target values, such as future trends
• Descriptive data mining describes a data set in a brief but comprehensive way and gives interesting characteristics of the data without having any predefined target
• Focuses on relations
• Association: patterns are discovered based on the relationship of a specific item with other items in the same transaction
• Descriptive
• Example: groceries bought together in one transaction
• Classification: used to classify each item in a set of data into one of a predefined set of classes or groups
• Often used with machine learning
• Predictive
• Example: cat or dog person?
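As a rough illustration of classifying into predefined classes, here is a minimal nearest-neighbor sketch in Python. The slide does not name a specific classifier, so the choice of 1-nearest-neighbor, the two numeric features, and the training points are all assumptions made up for this example.

```python
def classify(sample, labeled_examples):
    """Return the label of the labeled example closest to sample (1-nearest-neighbor)."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    _, label = min(labeled_examples, key=lambda pair: squared_distance(sample, pair[0]))
    return label

# Hypothetical survey scores (feature_1, feature_2) with a known class for each person.
training = [
    ((1, 9), "cat"),
    ((2, 8), "cat"),
    ((8, 2), "dog"),
    ((9, 1), "dog"),
]

print(classify((3, 7), training))  # cat
print(classify((7, 3), training))  # dog
```

The classes ("cat", "dog") are fixed in advance by the labeled data, which is what makes this a classification task.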
• Clustering: unlike classification, the clustering technique also defines the classes itself and puts objects in them
• Descriptive
• Example: a library
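To show how clustering defines its own groups, here is a small k-means sketch in Python on one-dimensional data. k-means is only one possible clustering method and is not named on the slide; the points and starting centers are hypothetical.

```python
def kmeans_1d(points, centers, iterations=10):
    """Very small 1-D k-means: assign each point to its nearest center, then re-average."""
    clusters = [[] for _ in centers]
    for _ in range(iterations):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster; keep it if the cluster is empty.
        centers = [sum(c) / len(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, clusters

# Hypothetical values; no class labels are given in advance.
points = [1, 2, 3, 10, 11, 12]
centers, clusters = kmeans_1d(points, centers=[0, 5])
print(centers)   # [2.0, 11.0]
print(clusters)  # [[1, 2, 3], [10, 11, 12]]
```

Unlike the classification case, nothing in the input says which group a point belongs to; the groups emerge from the data.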
• Used to predict numbers from data sets that have known target values
• Predictive
• Example: sales, distance, temperature, value, etc.
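One common way to predict a number from data with known target values is a least-squares line fit. The slide does not name a method, so the sketch below is an assumption, and the data points are invented so that they lie exactly on y = 2x + 1.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: x is a known input, y is the known target value.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]

slope, intercept = fit_line(xs, ys)
prediction = slope * 5 + intercept
print(slope, intercept, prediction)  # 2.0 1.0 11.0
```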
• Sequential pattern mining discovers frequent sequences or subsequences as patterns in a sequence database
• Descriptive
• Derived from association mining
The main sequential pattern mining techniques fall into three categories:
• Apriori-based
• Pattern-growth
• Early-pruning
• Apriori-based techniques follow the Apriori property: all nonempty subsets of a frequent itemset must also be frequent
• If {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
• Examples: AprioriAll, GSP, PSP, and SPAM
• Transaction data
• Assume:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
minsup = 30%
minconf = 80%
• An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7], about 43%
• Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
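The support and confidence figures above can be checked with a short Python sketch. The function names `support` and `confidence` are illustrative choices, not taken from the slides; the transactions are the seven listed above.

```python
# The seven transactions t1..t7 from the example.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of its left-hand side."""
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)

sup = support({"Chicken", "Clothes", "Milk"}, transactions)
conf = confidence({"Clothes"}, {"Milk", "Chicken"}, transactions)
print(f"sup = {sup:.2f}, conf = {conf:.2f}")  # sup = 0.43, conf = 1.00
```

Both values clear the thresholds given above (minsup = 30%, minconf = 80%).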
Two steps:
• Find all itemsets that have minimum support (frequent itemsets).
• Use the frequent itemsets to generate rules.
E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
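The second step can be sketched as follows: take one frequent itemset, try every nonempty proper subset as the left-hand side of a rule, and keep the rules whose confidence reaches minconf. This is a simple brute-force illustration, assuming the seven grocery transactions from the example and a hypothetical helper name `rules_from`; it also assumes the itemset passed in is already known to be frequent.

```python
from itertools import combinations

transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def rules_from(itemset, minconf):
    """All rules antecedent -> consequent from one frequent itemset with conf >= minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for antecedent in combinations(sorted(itemset), size):
            conf = count(itemset) / count(antecedent)
            if conf >= minconf:
                consequent = tuple(sorted(itemset - set(antecedent)))
                rules.append((antecedent, consequent, conf))
    return rules

for antecedent, consequent, conf in rules_from({"Chicken", "Clothes", "Milk"}, 0.8):
    print(", ".join(antecedent), "->", ", ".join(consequent), f"[conf = {conf:.2f}]")
```

With minconf = 80%, rules such as Chicken → Clothes, Milk (confidence 3/5) are rejected, while Clothes → Milk, Chicken is kept.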
Dataset T (minsup = 50%)
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5

Counts are written as itemset:count.
1. scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   → F1: {1}:2, {2}:3, {3}:3, {5}:3
   → C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   → F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   → C3: {2,3,5}
3. scan T → C3: {2,3,5}:2 → F3: {2,3,5}
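The three scans above can be reproduced with a compact level-wise sketch in Python. It is a minimal illustration of the candidate-generate-and-count loop rather than an optimized implementation, and the function name `apriori` and its return format are choices made for this example.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Return {frozenset: count} for every itemset whose support is >= minsup."""
    min_count = minsup * len(transactions)
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([item]) for item in items]  # C1
    frequent = {}
    k = 1
    while candidates:
        # Scan T: count how many transactions contain each candidate (Ck).
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        # Keep the candidates that meet minimum support (Fk).
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join: merge pairs of frequent k-itemsets into (k+1)-itemsets.
        k += 1
        joined = {a | b for a in level for b in level if len(a | b) == k}
        # Prune: drop candidates that have an infrequent (k-1)-subset.
        candidates = [
            c for c in joined
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        ]
    return frequent

T = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(T, minsup=0.5)
for itemset, n in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), n)
```

The prune step is where the Apriori property is used: {1,2,3} and {1,3,5} are never counted because {1,2} and {1,5} are not frequent.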
• Pattern-growth techniques use a divide-and-conquer strategy
• The aim is to focus the search on a restricted portion of the initial database and generate as few candidate sequences as possible
• Examples: FreeSpan, PrefixSpan, WAP-mine, and FS-Miner
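To give a feel for the divide-and-conquer idea, here is a simplified prefix-projection sketch in the spirit of PrefixSpan. It assumes each sequence is a plain string of single-character items, which is a simplification of the general algorithm; the function name and the input sequences are illustrative.

```python
def prefixspan(sequences, min_count):
    """Return {pattern: support} for all frequent sequential patterns.

    Simplified sketch: each sequence is a string and each character is one item.
    """
    results = {}

    def grow(prefix, projected):
        # Count, for each symbol, how many projected suffixes contain it.
        counts = {}
        for suffix in projected:
            for symbol in set(suffix):
                counts[symbol] = counts.get(symbol, 0) + 1
        for symbol, n in sorted(counts.items()):
            if n < min_count:
                continue
            pattern = prefix + symbol
            results[pattern] = n
            # Project: keep only what follows the first occurrence of symbol,
            # so the recursive search works on a smaller database.
            smaller = [s[s.index(symbol) + 1:] for s in projected if symbol in s]
            grow(pattern, smaller)

    grow("", list(sequences))
    return results

patterns = prefixspan(["abdac", "eaebcac", "babfaec", "abfac"], min_count=4)
print(patterns["abac"])  # 4
```

No candidate sequences are generated and then tested here; each recursive call only looks at the suffixes that remain after the current prefix.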
• Early-pruning techniques utilize a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible
• Examples: LAPIN, HVSM, and DISC-all
Web mining: searching for patterns in data through
Content mining
• Search engines
Structure mining
• Hyperlinks (HITS / PageRank)
Usage mining
• User’s browser data and forms submitted
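As one example of structure mining, a PageRank-style score can be approximated with a simple power iteration over the link graph. The sketch below is a textbook-style simplification, not a description of any search engine's actual implementation; the three-page graph and the damping value of 0.85 are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    links maps each page to the list of pages it links to; every linked
    page must also appear as a key.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            if outgoing:
                share = damping * rank[page] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank evenly.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank

# Hypothetical link graph: A links to B and C, B links to C, C links to A.
ranks = pagerank({"A": ["B", "C"], "B": ["C"], "C": ["A"]})
print(max(ranks, key=ranks.get))  # C
```

Page C ends up with the highest score because it receives links from both A and B.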
• One use of sequential pattern mining is finding user navigational patterns on the World Wide Web by extracting knowledge from web logs
• An example of applying sequential pattern mining:
S = {a, b, c, d, e, f}
[P1, <abdac>] [P2, <eaebcac>] [P3, <babfaec>] [P4, <abfac>]
• Frequent pattern: abac
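The claim that abac is frequent can be checked by counting how many of the four sequences contain it as a subsequence (items in order, not necessarily adjacent). This is a small verification sketch; the names `is_subsequence` and `sequence_support` are illustrative.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern appear in sequence in order, gaps allowed."""
    remaining = iter(sequence)
    # 'symbol in remaining' advances the iterator past the first match.
    return all(symbol in remaining for symbol in pattern)

web_logs = {"P1": "abdac", "P2": "eaebcac", "P3": "babfaec", "P4": "abfac"}

def sequence_support(pattern, logs):
    """Number of logged sequences that contain pattern as a subsequence."""
    return sum(1 for seq in logs.values() if is_subsequence(pattern, seq))

print(sequence_support("abac", web_logs))  # 4
print(sequence_support("abdc", web_logs))  # 1
```

The pattern abac is supported by all four sequences, whereas a pattern such as abdc appears only in P1.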
• Visual data mining combines traditional mining methods and information visualization techniques
• The user is directly involved
• VDMS: simplicity, reliability, reusability, availability, and security
• http://www.youtube.com/user/quiterian
• http://www.youtube.com/watch?v=MtJ4Xa4-J8g
• http://www.youtube.com/watch?v=_8HzwQCFFfw