Transcript ppt
Data Warehousing
1
Data Warehouse
• A data warehouse is a huge database that
stores historical data
• Example: Store information about all sales of
products in all branches of a supermarket
• Example: Information about all telephone
calls made through a certain company
2
OLTP vs. OLAP
• The queries we have seen so far were Online
Line Transaction Processing, i.e., many small
queries with results needed immediately
• Typically, Online Line Analytical Processing
is done against a data warehouse.
– A lot of aggregation
– Results aren’t needed immediately (queries can
run for hours)
– Used for decision support (more about this later)
3
Creating a Data Warehouse
• Data from different sources is combined
– need to transform data to a single schema
– need to transform data to have same type of
measurements (e.g., same currency)
• Occasionally, data is refreshed (get new data)
• Occasionally, data is purged (throw out old
data)
• Need a meta-data repository to store
information for the above
4
Design
• Usually, a star schema:
– large fact table that continuously grows
– small dimension tables that remain basically
static
Sales is the fact table,
• Example:
–
–
–
–
the rest are dimensions
Sales(pid, timeid, locid, amount)
Products(pid, pname, category, price)
Locations(locid, city, start, country)
Times(timeid, date, week, month, holiday_flag)
5
Normal Forms
• Fact Table: In BCNF
• Dimension Tables: Generally not in BCNF
– few updates, insert, deletes – few anomalies
– data is relatively small so duplication doesn’t
matter
– Allow queries to be faster since less joins are
needed
6
Data Mining
7
Data Mining
• Data Mining is the process of finding
interesting trends or patterns in large
datasets in order to guide future decisions.
• Related to exploratory data analysis (area
of statistics) and knowledge discovery (area
in artificial intelligence, machine learning).
• Data Mining is characterized by having VERY
LARGE datasets.
8
Some Real Problems
• Associations between products bought in a
store
– milk
– beers and diapers
• Deciding on product discounts
• Figuring out how to keep customer at lowest
cost to company
• Personal recommendation page at Amazon
• Stock market trends
9
Data Mining vs. Machine
Learning
• Size: Databases are usually very large so
algorithms must scale well
• Design Purpose: Databases are not usually
designed for data mining (but for other
purposes), and thus, may not have
convenient attributes
• Errors and Noise: Databases almost always
contain errors
10
Data Mining Techniques
Technique
Course Where it is Taught
Decision Trees
Introduction to Artificial
Intelligence
Neural Networks
Neural Networks 1,
Neural Networks 2
Computational Geometry
K-Nearest Neighbor
Association Rules
(Sequence Patterns,
Classification Rules)
Introduction to Database
Systems!
11
Association Rules
• A market basket is a collection of items
bought by a single customer in a single
customer transaction
• Problem: Identify sets of items that are
purchased together
• Attempt to identify rules of the form
{pen} {ink}
Meaning: If a pen is bought in a transaction, it
is likely that ink will also be bought
12
Example Purchases Table
transid
item
111
111
111
111
pen
ink
diary
soap
112
112
112
pen
ink
diary
transid
113
113
114
item
pen
diary
pen
114
114
114
ink
soap
tissues
13
Measures for Association
Rules
• General form: LR, where L and R are sets
of items
• Support: The support for LR is the
percentage of transactions containing all
items from L and from R
• Confidence: The confidence for LR are
the percentage of transactions that contain R,
from among the transactions that contain L
14
Examples
• The rule {pen}{ink} has:
– support:
– confidence:
• The rule {tissues}{ink} has:
– support:
– confidence:
• The rule {pen}{soap} has:
– support:
– confidence:
15
Understanding the Measures:
The Intuition
• If a rule has low support then it might have
arisen by chance. There is not enough
evidence to draw a conclusion.
• If a rule LR has low confidence then it is
likely that there is no relationship between
buying L and buying R
• Note: LR and RL will always have the
same support, but may have different
confidence
16
Goal
• Find association rules with high support and
high confidence. Support and confidence
values are specified by the user.
• Remember: Finding such a rule does not
mean that there must be a relationship
between the left and right sides. A person
must evaluate such rules by hand.
– Example: {milk} {bread}
17
Algorithm
• Given values s and c, find all association
rules with support >= s and confidence >= c.
• Can be done in 2 steps.
• Step 1: Find frequent itemsets, i.e., sets of
items that appear with support >= s
• Step 2: For each frequent itemset F, try all
ways to partition F to L and R and check if the
rule generated will have confidence >= c.
18
Finding Frequent Itemsets
• How would you do this?
Think!
19
Finding Frequent Itemsets
• The A Priori Property: Every subset of a
frequent itemset must also be a frequent
itemset.
• This means that we can create frequent
itemsets iteratively, by taking frequent
itemsets of size n, and extending them to
frequent itemsets of size n+1.
20
Finding Frequent Itemsets (2)
Freq = {}
scan all transactions once and add to Freq the items that
have support > s
k=1
repeat
foreach Ik in Freq with k items
generate all itemsets Ik+1 with k+1 items,
such that Ik is contained in Ik+1
scan all transactions once and add to Freq the
k+1-itemsets that have support > s
k++
until no new frequent itemsets are found
21
Example
Run the algorithm with support = 0.7
• Level 1: Find frequent itemsets:
{pen}, {ink}, {diary}
• Level 2: Check each of:
{pen, ink}, {pen, diary}, {pen, soap},
{pen, tissues}, {ink, diary}, {ink, soap},
{ink, tissues}, {diary, soap}, {diary, tissues}
The frequent itemsets are:
{pen, ink}, {pen, diary}
22
Example
• Level 3: Check each of:
{pen, ink, diary}, {pen, ink, soap},
{pen, ink, tissues}, {pen, diary, soap},
{pen, diary, tissues}
No frequent itemsets!
• To summarize, the frequent itemsets were:
{pen}, {ink}, {diary}
{pen, ink}, {pen, diary}
23
Refinements to the Algorithm
• Note that we checked if {pen, tissues} is a
frequent itemset even though we already
knew that {tissues} is not a frequent itemset.
Refinement: Try to extend frequent itemsets
in a fashion that will ensure that all their
subsets are frequent itemsets
• Scanning the transactions can be expensive
if the table does not fit in main memory.
Refinement: Find rules over a sample of the
database that fits in memory
24
Deriving Association Rules
foreach frequent itemset I
foreach partition of I to two sets L, R
generate a candidate rule LR
foreach transaction T in the database
foreach candidate rule LR
if L in T then
lnum(LR)++
if R in T then rnum(LR)++
return all rules LR with
rnum(LR)/lnum(LR) > c
25
Other Types of Rules
• Sequential Patterns
– Each transaction for each customer is a set of
items
– A sequence of transactions is a sequence of sets.
For example: {pen, ink, soap}, {pen, ink diary,
soap}
– A subsequence is derived from a sequence by
deleting some itemsets and some items in other
itemsets
26
Sequence Patterns (2)
• {pen}, {ink, diary}, {pen, soap} is a
subsequence of {pen, ink}, {shirt}, {milk, ink,
diary}, {soap, pen, diary}
• {pen}, {ink, diary}, {pen, soap} is not a
subsequence of {pen, ink}, {shirt}, {soap, pen,
diary}, {milk, ink, diary}
27
Sequence Patterns (3)
• The support for a sequence pattern S is the
percentage of customer sequences that have
S as a subsequence.
• A sequence s1, s2, ..., sn with high support
means that someone who buys s1 is likely to
buy s2 later on, etc.
• Finding sequence patterns: In a similar
fashion to finding frequent itemsets. Start with
small sequences and try to extend them
28
Classification Rules
• Consider a table:
BankInfo(custid, balance, age, returnLoan)
• Would like to find rules of the type: “If age is
in a certain range and balance is in a certain
range, then a customer is likely to not return a
loan.”
• We can apply this rule to new customers who
ask for a loan.
29
Classification Rules (2)
• General Form:
(l1<X1<h1) and ... and (lk < Xk < hk)Y=c
• Use values for X1,...,Xk to “predict” value for Y
• We can define support and confidence as
before.
30