Data Mining

Outline
• Knowledge discovery in databases.
• Data warehousing.
• Data mining.
• Different types of data mining.
• The Apriori algorithm for generating association rules.
Knowledge Discovery in Databases
• What is it?
• Why do we need it?
• How does data mining fit in?
Steps of a KDD Process
• Learn the application domain.
• Create a target dataset.
• Data cleaning and preprocessing.
• Choose the type of data mining to perform.
• Pick an algorithm.
• Data mining!
• Interpretation.
Data Warehousing
• How does it differ from a database?
– Databases provide support for:
• Queries over current data
• Persistent storage
• Atomic updates
– Data warehouses provide support for:
• Storage of all data, current or not
• Details or summaries
• Metadata
• Integrated data (to reduce time in cleaning data)
Types of Data Mining
• Classification Rules
• Clustering
• Sequence similarity
• Sampling/summarizing
• Association Rules
Classification Rules
• Rules that partition data into separate
groups.
• Used to classify people as good/bad credit
risks, etc.
• A variation is the BestN problem: decide the
best N of a set for a given problem (such as
finding the best N people to mail ski vacation
packages to).
Clustering
• Goal is to group data tuples into clusters.
• Map each data tuple to a point in
n-dimensional space, and identify clusters
based on spatial proximity.
• Differs from classification rules because
here the idea is to group based on similarity
overall, not to find the ones that lead to a
similar outcome.
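The map-to-points idea above can be sketched with a minimal k-means-style grouping. This is an illustrative assumption (the slides do not name a specific clustering algorithm), and the seeding and sample points are invented for the example:

```python
import math

def kmeans(points, k, iters=10):
    """Minimal k-means sketch: treat each tuple as a point in
    n-dimensional space and group points by proximity to the
    nearest cluster centroid."""
    centroids = list(points[:k])  # naive deterministic seeding
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest centroid.
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Recompute each centroid as the mean of its cluster.
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(kmeans(pts, 2))  # two clusters of three nearby points each
```

Note that, as the slides say, the grouping here is driven purely by overall similarity (distance), not by any labeled outcome.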
Sequence Similarity
• Used where there are time series or ordered
data.
• The goal is generally to give an example
and look for “similar” patterns.
• Example: “When AT&T stock goes up on 2
consecutive days and DEC stock does not
fall during this period, IBM stock goes up
the next day 75% of the time.”
Sampling/Summarization
• Sampling: finding samples of the data to
carry out more detailed analysis on. Goal is
to get the best sample.
• Summarization: finding summaries of all of
the data. Goal is to help people figure out
what to do with their data, or to prepare
reports.
Association Rules
• Rules that express when two or more
items are found in the same “basket”
of data.
• Used to try to find when certain
items in the data tend to accompany
other items: for example, people who
buy diapers tend to also buy beer.
Support and Confidence
• Association rules are measured in terms of:
– support: a and b occur together in at least
s% of the n baskets.
– confidence: of all of the baskets
containing a, at least c% also contain b.
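These two measures can be sketched over a small hypothetical set of baskets (the item names and data are invented for illustration):

```python
# Hypothetical mini-dataset: each basket is a set of purchased items.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"diapers", "bread"},
    {"beer", "chips"},
]

def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(a, b, baskets):
    """Of the baskets containing a, the fraction also containing b."""
    return support(a | b, baskets) / support(a, baskets)

print(support({"diapers", "beer"}, baskets))       # 0.5
print(confidence({"diapers"}, {"beer"}, baskets))  # 2/3
```

So the rule "diapers ⇒ beer" holds in 50% of all baskets (support) and in two thirds of the baskets that contain diapers (confidence).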
Association Rules Algorithms
• Focuses on measuring support for “itemsets” (the
number of transactions that contain the itemset).
Confidence is an easier problem and is computed
later.
• The naïve method: Start with all the itemsets of
size 1. Find the “large” itemsets. Combine to find
itemsets of size 2, etc.
• Apriori algorithm: ensures not only that each
step combines only large itemsets, but also
that every subset of each candidate is itself
a large itemset.
Apriori Algorithm
• In general, the set of large itemsets of size k
is referred to as Lk, and Ck is the candidate
set of size k.
• For each pair of itemsets in Lk-1 whose items
are all the same except the last item, combine
the two itemsets and put the result into Ck,
the list of candidates for large itemsets of
size k.
• Check that for each itemset in Ck all of its
subsets of size k-1 are in Lk-1. This allows you
to discard some candidates without having to
count them in the dataset.
• Count the support of the remaining itemsets in
Ck and remove those without enough
support. The result is the set Lk.
• Increment k and repeat steps 2 through 4
until Lk is empty. At that point you have
found all of the large itemsets.
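The steps above can be sketched in Python. This is a minimal implementation under the slides' definitions, with support counted as an absolute number of transactions rather than a percentage:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets contained in at least `min_support`
    transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count(candidates):
        return {c: sum(1 for t in transactions if c <= t)
                for c in candidates}

    # Step 1: L1, the large itemsets of size 1.
    singles = {frozenset([i]) for t in transactions for i in t}
    Lk = {c for c, n in count(singles).items() if n >= min_support}
    large = set(Lk)

    k = 2
    while Lk:
        # Step 2 (join): combine itemsets that differ in one item.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Step 3 (prune): every (k-1)-subset must itself be large.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Step 4: count support and keep only the large itemsets.
        Lk = {c for c, n in count(Ck).items() if n >= min_support}
        large |= Lk
        k += 1
    return large

# Usage, on the transaction data from the example slides:
result = apriori([{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}], 2)
print(sorted(map(sorted, result), key=len))
```

On that data the sketch finds nine large itemsets, the largest being {2 3 5}.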
Example
Find all itemsets with a support of at least 2.

Items (the four transactions)
1 3 4
2 3 5
1 2 3 5
2 5

Itemsets of size 1
Itemset   Support
{1}       2
{2}       3
{3}       3
{4}       1
{5}       3
Only {4} falls below the support threshold, so
L1 = { {1}, {2}, {3}, {5} }.
C2 (candidate itemsets of size 2)
Itemset
{1 2}
{1 3}
{1 5}
{2 3}
{2 5}
{3 5}
Finding C3
Itemset
Support
{1 3}
2
{2 3}
2
{2 5}
3
{3 5}
2
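The counts in the tables above can be checked with a few lines of Python, using the four transactions from the example:

```python
transactions = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def support_count(itemset):
    """Number of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t)

# Supports of the large 2-itemsets from the table above:
for s in ({1, 3}, {2, 3}, {2, 5}, {3, 5}):
    print(sorted(s), support_count(s))

# The only size-3 candidate surviving the prune step is {2 3 5}:
print(support_count({2, 3, 5}))  # 2
```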