Transcript Data Mining
Chase Repp
Knowledge discovery
Searching, analyzing, and sifting through large data sets to find new patterns, trends, and relationships contained within
Data mining differs from database querying in the following manner: database querying asks “what company purchased $100,000 worth of widgets last year?” while data mining asks “what company is likely to purchase over $100,000 of widgets next year and why?”
Coined in the 1960s
Data mining was used to find basic information from collections of data, such as total revenue over the last three years.
Classic statistics
Artificial intelligence
Machine learning
Predictive Data Mining
• Target value
• Future trends
Descriptive Data Mining
• No target value
• Focuses on relations
Focuses on discovering relationships among independent variables and between dependent and independent variables
Used to forecast specific things
Describes a data set in a brief but comprehensive way and gives interesting characteristics of the data without having any predefined target
Focus on relations
Patterns are discovered based on the relationship of a specific item with other items in the same transaction
Descriptive
Example: groceries
Classifies each item in a set of data into one of a predefined set of classes or groups
Often used with machine learning
Predictive
Example: cat or dog person?
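A minimal sketch of the "cat or dog person?" idea, using a 1-nearest-neighbour rule. The two features (hours outdoors per week, number of pets owned) and all training values are hypothetical and are not from the slides; the point is only that the classes are fixed in advance and each new item is assigned to one of them.

```python
import math

# Hypothetical training data: (hours outdoors per week, pets owned) -> label.
# These values are invented for illustration only.
training = [
    ((1.0, 1.0), "cat"),
    ((2.0, 2.0), "cat"),
    ((8.0, 1.0), "dog"),
    ((9.0, 2.0), "dog"),
]

def classify(point):
    """Return the label of the closest training example (1-nearest neighbour)."""
    _, label = min(training, key=lambda example: math.dist(example[0], point))
    return label

print(classify((1.5, 1.0)))  # cat
print(classify((7.0, 2.0)))  # dog
```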
Unlike classification, the clustering technique also defines the classes and puts objects in them
Descriptive
Example: a library
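A rough sketch of how a clustering method can define the groups itself, here with a tiny one-dimensional k-means. The values and the starting centroids are hypothetical (not from the slides), and k-means is only one of many clustering techniques.

```python
def kmeans_1d(values, centroids, iterations=10):
    """Alternate between assigning values to the nearest centroid
    and moving each centroid to the mean of its assigned values."""
    clusters = [[] for _ in centroids]
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Keep the old centroid if a cluster ends up empty.
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data: no labels are given, only the values themselves.
values = [1, 2, 3, 10, 11, 12]
centroids, clusters = kmeans_1d(values, centroids=[0, 5])
print(centroids)  # [2.0, 11.0]
print(clusters)   # [[1, 2, 3], [10, 11, 12]]
```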
Used to predict numbers from data sets that have known target values
Predictive
Example: sales, distance, temperature, value, etc.
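One common way to predict a number from examples with known target values is a least-squares line fit. The sketch below assumes that is the kind of technique meant here; the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical known data: xs are inputs, ys are the known target values.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
slope, intercept = fit_line(xs, ys)

# Predict the target for an unseen input.
print(slope * 5 + intercept)  # 11.0
```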
Discovers frequent sequences or subsequences as patterns in a sequence database
Descriptive
Derived from association mining
There are three categories that the main sequential pattern mining techniques fall into:
• Apriori-based
• Pattern-growth
• Early-pruning
Follow the Apriori property: all nonempty subsets of a frequent itemset must also be frequent
If {A, B} is a frequent itemset, both {A} and {B} must be frequent itemsets
AprioriAll, GSP, PSP, and SPAM
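The Apriori property can be checked directly by counting: an itemset can never occur in more transactions than any of its subsets. A small sketch, using hypothetical toy transactions over items A, B, and C (not from the slides):

```python
from itertools import combinations

# Hypothetical toy transactions, invented for illustration.
transactions = [{"A", "B"}, {"A", "B", "C"}, {"A", "C"}, {"B"}]

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if set(itemset) <= t)

def nonempty_proper_subsets(itemset):
    """Yield every nonempty subset smaller than the itemset itself."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            yield subset

ab = ("A", "B")
for subset in nonempty_proper_subsets(ab):
    # Each subset occurs at least as often as {A, B} itself.
    print(subset, count(subset), ">=", count(ab))
```

This is why an Apriori-based miner can discard any candidate that has an infrequent subset without counting it.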
Transaction data
Assume:
t1: Beef, Chicken, Milk
t2: Beef, Cheese
t3: Cheese, Boots
t4: Beef, Chicken, Cheese
t5: Beef, Chicken, Clothes, Cheese, Milk
t6: Chicken, Clothes, Milk
t7: Chicken, Milk, Clothes
minsup = 30%
minconf = 80%
An example frequent itemset:
{Chicken, Clothes, Milk} [sup = 3/7], about 43%
Association rules from the itemset:
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
…
…
Clothes, Chicken → Milk [sup = 3/7, conf = 3/3]
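The support and confidence figures above can be reproduced by counting over t1 to t7. A minimal sketch; the function names `support` and `confidence` are illustrative choices, not from the slides.

```python
from fractions import Fraction

# Transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))

def confidence(lhs, rhs):
    """support(lhs and rhs together) / support(lhs) for the rule lhs -> rhs."""
    return support(lhs | rhs) / support(lhs)

print(support({"Chicken", "Clothes", "Milk"}))       # 3/7
print(confidence({"Clothes"}, {"Milk", "Chicken"}))  # 1
print(confidence({"Clothes", "Chicken"}, {"Milk"}))  # 1
```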
Two steps:
• Find all itemsets that have minimum support (frequent itemsets).
• Use frequent itemsets to generate rules.
E.g., a frequent itemset
{Chicken, Clothes, Milk} [sup = 3/7]
and one rule from the frequent itemset
Clothes → Milk, Chicken [sup = 3/7, conf = 3/3]
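The second step, generating rules from a frequent itemset, can be sketched by trying every nonempty proper subset as the left-hand side and keeping the rules that reach minconf = 80%. The helper name `rules_from` is an illustrative choice; the transactions are t1 to t7 from the earlier slide.

```python
from itertools import combinations

# Transactions t1..t7 from the slide.
transactions = [
    {"Beef", "Chicken", "Milk"},
    {"Beef", "Cheese"},
    {"Cheese", "Boots"},
    {"Beef", "Chicken", "Cheese"},
    {"Beef", "Chicken", "Clothes", "Cheese", "Milk"},
    {"Chicken", "Clothes", "Milk"},
    {"Chicken", "Milk", "Clothes"},
]
MINCONF = 0.8

def count(itemset):
    """Number of transactions containing every item in itemset."""
    return sum(1 for t in transactions if itemset <= t)

def rules_from(itemset, minconf):
    """Return (lhs, rhs, confidence) for every rule lhs -> rhs that meets minconf."""
    itemset = frozenset(itemset)
    rules = []
    for size in range(1, len(itemset)):
        for lhs in combinations(sorted(itemset), size):
            lhs = frozenset(lhs)
            conf = count(itemset) / count(lhs)
            if conf >= minconf:
                rules.append((lhs, itemset - lhs, conf))
    return rules

for lhs, rhs, conf in rules_from({"Chicken", "Clothes", "Milk"}, MINCONF):
    print(sorted(lhs), "->", sorted(rhs), f"conf = {conf:.2f}")
```

On this data three of the six possible rules pass the threshold; for example, Chicken → Clothes, Milk has confidence 3/5 and is dropped.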
Dataset T (minsup = 50%)
TID    Items
T100   1, 3, 4
T200   2, 3, 5
T300   1, 2, 3, 5
T400   2, 5
Notation: itemset:count
1. Scan T → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
   F1: {1}:2, {2}:3, {3}:3, {5}:3
   C2: {1,2}, {1,3}, {1,5}, {2,3}, {2,5}, {3,5}
2. Scan T → C2: {1,2}:1, {1,3}:2, {1,5}:1, {2,3}:2, {2,5}:3, {3,5}:2
   F2: {1,3}:2, {2,3}:2, {2,5}:3, {3,5}:2
   C3: {2,3,5}
3. Scan T → C3: {2,3,5}:2
   F3: {2,3,5}
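The three scans above can be reproduced with a short level-wise loop: count the candidates, keep the frequent ones, join them into larger candidates, and prune any candidate with an infrequent subset. This is a simplified sketch of the Apriori idea for itemsets, not an optimized implementation; the function name `apriori` and its return shape are illustrative choices.

```python
from itertools import combinations

# Dataset T from the slide.
T = {
    "T100": {1, 3, 4},
    "T200": {2, 3, 5},
    "T300": {1, 2, 3, 5},
    "T400": {2, 5},
}
MINSUP = 0.5

def apriori(transactions, minsup):
    """Return a list [F1, F2, ...] of dicts mapping frequent itemsets to counts."""
    min_count = minsup * len(transactions)
    items = sorted({i for t in transactions for i in t})
    candidates = [frozenset([i]) for i in items]  # C1
    levels = []
    k = 1
    while candidates:
        # Scan: count how many transactions contain each candidate.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {c: n for c, n in counts.items() if n >= min_count}
        if not frequent:
            break
        levels.append(frequent)
        k += 1
        # Join: unions of two frequent itemsets that have exactly k items.
        joined = {a | b for a in frequent for b in frequent if len(a | b) == k}
        # Prune: every (k-1)-subset must itself be frequent (Apriori property).
        candidates = [
            c for c in joined
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        ]
    return levels

for level, frequent in enumerate(apriori(list(T.values()), MINSUP), start=1):
    print(f"F{level}:", {tuple(sorted(s)): n for s, n in frequent.items()})
```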
Divide-and-conquer strategy
Focus the search on a restricted portion of the initial database and generate as few candidate sequences as possible
FreeSpan, PrefixSpan, WAP-mine, and FS-Miner
Utilize a sort of position induction to prune candidate sequences very early in the mining process and to avoid support counting as much as possible
LAPIN, HVSM, and DISC-all
Searching for patterns in data through:
Content mining
• Search engines
Structure mining
• Hyperlinks (HITS / PageRank)
Usage mining
• User’s browser data and forms submitted
One use is for finding user navigational patterns on the World Wide Web by extracting knowledge from web logs
An example of applying sequential pattern mining
S = {a, b, c, d, e, f}
[P1, <abdac>]
[P2, <eaebcac>]
[P3, <babfaec>]
[P4, <abfac>]
Frequent pattern: abac
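Whether abac is frequent can be checked by testing it as a subsequence of each of P1 to P4: the symbols must appear in order, but gaps are allowed. A minimal sketch; the function names are illustrative.

```python
# Sequences P1..P4 from the slide.
sequences = {
    "P1": "abdac",
    "P2": "eaebcac",
    "P3": "babfaec",
    "P4": "abfac",
}

def is_subsequence(pattern, sequence):
    """True if the pattern's symbols occur in sequence in order (gaps allowed)."""
    remaining = iter(sequence)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each later symbol is searched only in what follows.
    return all(symbol in remaining for symbol in pattern)

def support(pattern):
    """Number of sequences that contain the pattern as a subsequence."""
    return sum(is_subsequence(pattern, s) for s in sequences.values())

print(support("abac"))  # 4
```

Here abac is supported by all four sequences, which is why it is reported as a frequent pattern.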
Combines traditional mining methods and information visualization techniques
• User is directly involved
VDMS: simplicity, reliability, reusability, availability, and security
http://www.youtube.com/user/quiterian
http://www.youtube.com/watch?v=MtJ4Xa4-J8g
http://www.youtube.com/watch?v=_8HzwQCFFfw