CS 345:
Topics in Data Warehousing
Tuesday, November 16, 2004
Review of Thursday’s Class
• Dimension Key Mapping Revisited
– Comments on Assignment #2
• Updating the Data Warehouse
– Incremental maintenance vs. drop & rebuild
– Self-maintainable views
• Approximate Query Processing
– Sampling-based techniques
• Computing confidence intervals
• Online vs. pre-computed samples
• Sampling and joins
– Alternative techniques
Outline of Today’s Class
• Data Mining
– What is data mining?
– Types of data mining
– Data mining pitfalls
• Decision Tree Classifiers
– What is a decision tree?
– Learning decision trees
– Entropy
– Information Gain
– Cross-Validation
Data Mining
• What is data mining?
– Many definitions…
– Basically: identify interesting patterns in data
– Most often, the term “data mining” refers to automatic detection
of patterns through machine learning
• Data mining is one part of the broader process of
knowledge discovery in databases (KDD)
– KDD: “the process of identifying valid, novel, potentially useful,
and ultimately understandable patterns in data”
– This is what data warehousing is all about.
• Data mining is a field of research
– Draws from databases, artificial intelligence, statistics
– Relatively new research community
– Several conferences and journals
• ACM KDD, SIAM Data Mining, IEEE ICDM
Knowledge Discovery in Databases
[Figure: the KDD process: Data → Preprocessing → Data Mining → Interpretation/Evaluation → Knowledge]
• Preprocessing: Selection, Cleaning, Transformation, Feature Extraction
• Data Mining: Identify Patterns, Generate Models
• Interpretation/Evaluation: Validation Tests, Visualization
Types of Data Mining
• OLAP
– Group-by aggregation queries are a simple type of data mining
– Summarize the data set
• Classification
– Build predictive model to categorize records into discrete classes
– Examples:
• Classify mortgage applicants as “will default” or “will not default”
• Face recognition in image database
• Identify likely terrorists vs. unlikely terrorists
• Regression
– Build predictive model to predict real-valued function
– Examples:
• Predict how much revenue each customer will generate
• Predict profitability of planned marketing campaign
• Clustering
– Separate data records into groups of similar items
– Clustering vs. Classification
• Classification is supervised, clustering is unsupervised
• Classification uses pre-defined class labels, clustering doesn’t.
• Classification has a “right answer”, clustering doesn’t.
Types of Data Mining
• Outlier detection
– Identify unusual or atypical data records
• Sometimes to investigate them further
• Sometimes to exclude them from a broad analysis
• Trend analysis / forecasting
– Identify changes in patterns of data over time
– Example: What will be next month’s revenue?
• Dependency detection
– Which attributes are correlated with one another?
– Which attribute values are likely to occur together?
• Popular technique: Association rule mining
• Also known as market basket analysis
• Find products that are often bought together as part of same transaction
• Temporal pattern detection / time series mining
– Recognize commonly recurring patterns in time series data
– Example: “Technical analysis” of financial markets
Data Mining Pitfalls
• Overfitting
– Spurious patterns may emerge by chance
– Don’t mistake coincidence for causality
– Example: ESP experiment
• Ask 10,000 test subjects to predict whether each of 10 face-down playing
cards is red or black
• 10 subjects predicted all 10 cards correctly!
• “Conclusion”: 1 out of every 1000 people have ESP
– Can be a particular concern in datasets with
• Lots of attributes
• Not too many records
• Reporting “obvious” patterns
– Learning cancer risk factors
• Women are more likely than men to have breast cancer
• Men are more likely than women to have prostate cancer
• These patterns are not “novel”
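A quick arithmetic check of the ESP example from the overfitting discussion, as a minimal sketch. It assumes every guess is an independent 50/50 coin flip; the variable names are illustrative only.

```python
# Chance that one subject guesses all 10 red/black cards correctly,
# assuming each guess is an independent 50/50 coin flip.
p_perfect = 0.5 ** 10          # 1/1024

# Expected number of perfect scores among 10,000 subjects by luck alone.
subjects = 10_000
expected_perfect = subjects * p_perfect

print(f"P(perfect score)        = {p_perfect:.6f}")
print(f"Expected perfect scores = {expected_perfect:.2f}")
```

Roughly ten perfect scores are expected from guessing alone, so the ten observed subjects are no evidence of ESP.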
Data Mining Pitfalls
• Confusing correlation and causation
– Data mining can identify attributes that are correlated
– Correlation doesn’t necessarily imply causation
– Example: Studying causes of obesity
• Overweight people are more likely to drink diet soda
• “Conclusion”: Drinking diet soda causes obesity
• Moral of the story: Interpretation and evaluation
of patterns is crucial
– Data mining algorithms are not magical
– Patterns they identify must be examined carefully to
avoid drawing inappropriate conclusions
Decision Tree Classifiers
• Decision trees are one type of classification model
• Internal nodes of decision tree labeled with attributes
– Each internal node represents a test
• Edges labeled with attribute values
– Edges represent the results of the tests
• Leaves labeled with class values
– Leaves represent the classifier’s predictions
• To classify a record, walk down the tree starting at the root
– The path that is followed depends on the attribute values of the record being classified
[Figure: example decision tree. Internal nodes: Employed? (Yes/No), Credit Score? (High/Low), Income? (High/Low). Leaves: Approve, Reject.]
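The walk from root to leaf can be sketched in a few lines. The node and edge labels come from the slide's figure, but the exact arrangement of the tests is not recoverable from the transcript, so the tree below is an assumed layout used purely for illustration. The tuple representation and the `classify` helper are likewise hypothetical.

```python
# Hypothetical tree: the arrangement of tests is an assumption, not the
# slide's exact figure. An internal node is (attribute, {edge_value: subtree});
# a leaf is simply a class label (a string).
tree = ("Employed?", {
    "No": "Reject",
    "Yes": ("Credit Score?", {
        "High": "Approve",
        "Low": ("Income?", {"High": "Approve", "Low": "Reject"}),
    }),
})

def classify(node, record):
    """Walk from the root to a leaf, following the edge that matches the record."""
    while not isinstance(node, str):       # internal node: apply its test
        attribute, children = node
        node = children[record[attribute]]
    return node                            # leaf: the predicted class

applicant = {"Employed?": "Yes", "Credit Score?": "Low", "Income?": "High"}
print(classify(tree, applicant))           # Approve
```

Only the attributes along the followed path are consulted, which is why an unemployed applicant can be classified without looking at credit score or income.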
Decision Tree Learning
• We’re given a data set with unknown values for an attribute of
interest
– Example:
• Data set is Customer records
• Attribute of interest is “Will Close Account in Next 3 Months”
– Unknown attribute referred to as target attribute
– This data set is referred to as the test set
• We also have a second data set where the values of the target
attribute are known
– Referred to as the training set
• We would like to build a decision tree classifier to predict the value
of the target attribute
• Construct a decision tree that accurately classifies the records in the
training set
• Use the decision tree to predict the value of the target attribute for
the records from the test set
– Hopefully a classifier that works well on the training set will also work
well on the other data set!
Decision Tree Learning
• When does decision tree learning work well?
– Training set and test set are similar
• Patterns in the training set are also present in the test set
• Rules learned from one data set apply to the other
– Decision tree identifies general, globally valid patterns
• And not specific, idiosyncratic properties of the training records
• Need to avoid overfitting the model to the training set
• Occam’s razor: simple explanations are usually the best
– Simple (small) decision trees are usually preferable
• Easier for humans to interpret
• Usually less prone to overfitting
– Finding the smallest accurate decision tree is NP-Hard
• Decision trees are usually built top-down using greedy heuristic
• Idea: First test attributes that do best job of separating the classes
Decision Tree Learning
• Basic decision tree learning algorithm
– Do all records in training set belong to same class?
• Yes → Return leaf node with that class.
– Do all records in training set have the same values for
all attributes (other than target)?
• Yes → Return leaf node with most common class.
– Otherwise:
• Pick the single attribute that best separates records from
different classes
• Use that attribute for the root of the decision tree
• Children of root node are decision trees
– Build them recursively using same algorithm
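The recursive algorithm above might be sketched as follows. This is one possible rendering, not a definitive implementation: it assumes records are dictionaries, uses weighted average entropy (covered in the later slides) as the measure of how well an attribute separates the classes, and borrows the nine weather records from the Decision Tree Example slides as input. All function names are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(records, attr, target):
    """Weighted average entropy of the subsets produced by splitting on attr."""
    groups = {}
    for r in records:
        groups.setdefault(r[attr], []).append(r[target])
    return sum(len(g) / len(records) * entropy(g) for g in groups.values())

def build_tree(records, attrs, target):
    labels = [r[target] for r in records]
    # Case 1: all records belong to the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: records agree on every attribute -> leaf with most common class.
    usable = [a for a in attrs if len({r[a] for r in records}) > 1]
    if not usable:
        return Counter(labels).most_common(1)[0][0]
    # Otherwise: pick the attribute that best separates the classes
    # (assumed criterion: lowest weighted entropy after the split).
    best = min(usable, key=lambda a: expected_entropy(records, a, target))
    # Children are decision trees built recursively on each subset.
    children = {}
    for value in sorted({r[best] for r in records}):
        subset = [r for r in records if r[best] == value]
        children[value] = build_tree(subset, attrs, target)
    return (best, children)

columns = ["State", "Season", "Barometer", "Weather"]
rows = """AK Winter Down Snow
HI Winter Down Sun
HI Summer Up Sun
CA Summer Up Rain
AK Winter Up Snow
CA Winter Down Sun
AK Summer Down Sun
CA Winter Up Rain
HI Summer Down Sun"""
records = [dict(zip(columns, line.split())) for line in rows.splitlines()]

tree = build_tree(records, ["State", "Season", "Barometer"], "Weather")
print(tree)
```

On this data the sketch splits on State at the root, then on Season for AK and on Barometer for CA, while HI becomes a leaf.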
Splitting Criterion
• How to decide which attribute is best to test first?
– Each attribute splits data into subsets
– Ideally, each subset should be as homogenous as possible
– Need metric for homogeneity of a data set
• Example:
– Two classes, +/–
– 100 records overall (50 +s and 50 -s)
– A and B are two binary attributes
• Records with A=0: 48+, 2-      Records with A=1: 2+, 48-
• Records with B=0: 26+, 24-     Records with B=1: 24+, 26-
– Splitting on A is better than splitting on B
• A does a good job of separating +s and -s
• B does a poor job of separating +s and -s
Entropy
• Entropy is a good way to measure homogeneity
– Measures minimum number of bits per record needed to optimally
encode class values
• Entropy example:
– Three classes (A, B, C)
– A occurs ½ of the time
– B and C each occur ¼ of the time
– Optimal encoding: A = 0, B = 10, C = 11
– Entropy = Average bits / record = 1.5
• Entropy formula:
$$H(S) = -\sum_{c_i} p_i \log_2 p_i$$
– Entropy of data set S is denoted H(S)
– The c_i are the possible classes
– p_i = fraction of records from S that have class c_i
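The three-class example can be checked directly against the formula. The sketch below is illustrative (the `entropy` helper and variable names are not from the slides); it compares the formula's value with the average code length of the encoding A = 0, B = 10, C = 11.

```python
import math

def entropy(probabilities):
    """H(S) = -sum(p_i * log2(p_i)); zero-probability classes contribute 0."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)

# Classes A, B, C occur 1/2, 1/4 and 1/4 of the time.
p = {"A": 0.5, "B": 0.25, "C": 0.25}
code = {"A": "0", "B": "10", "C": "11"}    # the encoding from the slide

# Average number of bits per record under that encoding.
avg_bits = sum(p[c] * len(code[c]) for c in p)

print(entropy(p.values()), avg_bits)       # 1.5 1.5
```

The two numbers agree here because every probability is a power of ½, so each class can be given a code of exactly log2(1/p_i) bits.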
Entropy Examples
• Example:
– 10 records have class A
– 20 records have class B
– 30 records have class C
– 40 records have class D
– Entropy = -[(.1 log2 .1) + (.2 log2 .2) + (.3 log2 .3) + (.4 log2 .4)]
– Entropy = 1.85
• Earlier example revisited
– Two classes, +/–
– 100 records overall (50 +s and 50 -s)
– A and B are two binary attributes
• Records with A=0: 48+, 2-    Entropy = 0.24
• Records with A=1: 2+, 48-    Entropy = 0.24
• Records with B=0: 26+, 24-   Entropy = 0.999
• Records with B=1: 24+, 26-   Entropy = 0.999
– A is better than B because average entropy is less after splitting on A
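The entropy figures on this slide can be reproduced from the class counts. This is a small sketch with an illustrative helper name; the B subsets are printed to three decimals to show how close they are to the 1-bit maximum for two classes.

```python
import math

def entropy_from_counts(counts):
    """Entropy (in bits) of a data set described by its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

# Four-class example: 10 A, 20 B, 30 C, 40 D.
four_class = entropy_from_counts([10, 20, 30, 40])

# Subsets produced by splitting on A and on B, as [+ count, - count].
a_subsets = [[48, 2], [2, 48]]
b_subsets = [[26, 24], [24, 26]]
a_entropies = [entropy_from_counts(s) for s in a_subsets]
b_entropies = [entropy_from_counts(s) for s in b_subsets]

print(round(four_class, 2))                    # 1.85
print([round(h, 3) for h in a_entropies])      # [0.242, 0.242]
print([round(h, 3) for h in b_entropies])      # [0.999, 0.999]
```

Splitting on B leaves subsets that are almost as mixed as the original 50/50 data set, whereas splitting on A removes most of the uncertainty.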
Information Gain
• Information gain = Expected reduction in entropy
• Expected entropy after splitting on attribute A: H(S|A)
– H(S|A) = Sum [(percentage of records with A=ai)*(Entropy of
records with A=ai)]
– Sum is taken over all possible values of attribute A
– Computes weighted average entropy across all subsets
• Weight of subset = fraction of all records that fall in the subset
• Always split on attribute with greatest information gain
– This is one possible splitting rule for building decision trees
– However, other splitting criteria are also used sometimes
– Gain ratio, Gini index, etc.
– Alternative methods of measuring homogeneity
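Information gain for attribute A can be written as H(S) − H(S|A), the entropy before the split minus the expected entropy after it. The sketch below applies that to the A-versus-B example from the entropy slides. The function names and the list-of-counts input format are assumptions made for illustration.

```python
import math

def entropy(counts):
    """Entropy (in bits) from per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def information_gain(subsets):
    """subsets: one list of per-class counts for each value of the attribute."""
    # Class counts before the split: add up each class across the subsets.
    totals = [sum(per_class) for per_class in zip(*subsets)]
    n = sum(totals)
    # H(S|A): entropy of each subset, weighted by its fraction of the records.
    h_after = sum(sum(s) / n * entropy(s) for s in subsets)
    return entropy(totals) - h_after

gain_a = information_gain([[48, 2], [2, 48]])      # A=0, A=1 as [+, -]
gain_b = information_gain([[26, 24], [24, 26]])    # B=0, B=1 as [+, -]

print(f"gain(A) = {gain_a:.3f}, gain(B) = {gain_b:.3f}")
```

Because H(S) is the same for every candidate attribute, choosing the greatest information gain is equivalent to choosing the lowest expected entropy H(S|A).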
Decision Tree Example
State   Season   Barometer   Weather
AK      Winter   Down        Snow
HI      Winter   Down        Sun
HI      Summer   Up          Sun
CA      Summer   Up          Rain
AK      Winter   Up          Snow
CA      Winter   Down        Sun
AK      Summer   Down        Sun
CA      Winter   Up          Rain
HI      Summer   Down        Sun
Predicting the weather
Target attribute = Weather
Source attributes = State, Season, Barometer
Decision Tree Example
State:
AK: 2 Snow, 1 Sun → 0.92
HI: 3 Sun → 0.00
CA: 2 Rain, 1 Sun → 0.92
Expected entropy = 0.61
Season:
Winter: 2 Snow, 2 Sun, 1 Rain → 1.52
Summer: 3 Sun, 1 Rain → 0.81
Expected entropy = 1.21
Barometer:
Down: 1 Snow, 4 Sun → 0.72
Up: 1 Snow, 1 Sun, 2 Rain → 1.50
Expected entropy = 1.07
State has the lowest expected entropy, so it is tested first.
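The per-attribute figures can be recomputed from the nine weather records. This is a sketch with illustrative names; results are rounded to two decimals, so the last digit may differ slightly from hand-rounded values.

```python
import math
from collections import Counter, defaultdict

COLUMNS = ("State", "Season", "Barometer", "Weather")
ROWS = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"),    ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"),   ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),  ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def expected_entropy(rows, attr_index, target_index=3):
    """Weighted average entropy of Weather after splitting on one attribute."""
    subsets = defaultdict(list)
    for row in rows:
        subsets[row[attr_index]].append(row[target_index])
    return sum(len(s) / len(rows) * entropy(s) for s in subsets.values())

results = {COLUMNS[i]: round(expected_entropy(ROWS, i), 2) for i in range(3)}
print(results)   # {'State': 0.61, 'Season': 1.21, 'Barometer': 1.07}
```

State leaves the least uncertainty about Weather, which is why the example tree tests it at the root.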
Decision Tree Example
State   Season   Barometer   Weather
AK      Winter   Down        Snow
AK      Winter   Up          Snow
AK      Summer   Down        Sun
HI      Winter   Down        Sun
HI      Summer   Up          Sun
HI      Summer   Down        Sun
CA      Summer   Up          Rain
CA      Winter   Down        Sun
CA      Winter   Up          Rain
State = AK:
Split on Season
Winter = Snow
Summer = Sun
State = HI:
Leaf node = Sun
State = CA:
Split on Barometer
Up = Rain
Down = Sun
Decision Tree Example
State
  AK → Season
         Winter → Snow
         Summer → Sun
  HI → Sun
  CA → Barometer
         Down → Sun
         Up → Rain
Overfitting and Pruning
• Performance graph exhibits a typical phenomenon
– Accuracy on training data increases as the decision tree grows
– Accuracy on test data initially increases, then decreases
• Why does this happen?
– Highly predictive attributes near root of decision tree capture general patterns
– Less predictive attributes added later are mostly capturing statistical noise
– Goal: Stop building the decision tree before overfitting kicks in
• Pruning → eliminate lower portions of the decision tree
• Replace sub-tree with a leaf node
[Figure: accuracy (y-axis) vs. decision tree size (x-axis). Training set accuracy keeps rising as the tree grows; test set accuracy peaks at the optimal tree size and then declines.]
Pruning via Cross-Validation
• Cross-validation
– Separate training set into two parts
– Most of the training set is used to build tree
– Small holdout set is used to validate accuracy
• Post-pruning approach
– Build decision tree with training data (less holdout set)
– Traverse tree in bottom-up fashion
– For each sub-tree:
• Consider pruning sub-tree, replacing with leaf node
• If pruned tree is more accurate on holdout set, then use it
• Otherwise, stick with original sub-tree
• Idea behind pruning
– Portion of tree that models general patterns works well on holdout set
– Portion of tree that fits random noise works poorly on holdout set
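The post-pruning procedure might look roughly like the sketch below. Several things here are assumptions rather than part of the slides: the dictionary-based tree representation, the `majority` field (the most common training class at each node, used as the replacement leaf), and the tiny hand-built tree and holdout set, which are made up for illustration.

```python
def classify(node, record):
    """Walk down the tree; fall back to the majority class on unseen values."""
    while isinstance(node, dict):
        node = node["children"].get(record[node["attr"]], node["majority"])
    return node

def prune(node, holdout, target):
    """Bottom-up pruning. `holdout` holds only the records that reach `node`.

    Comparing accuracy on those records is equivalent to comparing the
    accuracy of the whole tree, since other records never enter this sub-tree.
    Note: the tree is modified in place.
    """
    if not isinstance(node, dict):
        return node                               # leaves cannot be pruned
    # Prune the children first, passing down the holdout records they receive.
    for value, child in list(node["children"].items()):
        reaching = [r for r in holdout if r[node["attr"]] == value]
        node["children"][value] = prune(child, reaching, target)
    if not holdout:
        return node                               # no evidence: keep sub-tree
    leaf_correct = sum(r[target] == node["majority"] for r in holdout)
    tree_correct = sum(classify(node, r) == r[target] for r in holdout)
    # Use the leaf only if it is more accurate on the holdout set.
    return node["majority"] if leaf_correct > tree_correct else node

# Hypothetical tree whose AK sub-tree has fit noise in the training data.
tree = {"attr": "State", "majority": "Sun", "children": {
    "AK": {"attr": "Barometer", "majority": "Snow",
           "children": {"Up": "Snow", "Down": "Sun"}},
    "HI": "Sun",
}}
holdout = [
    {"State": "AK", "Barometer": "Up", "Weather": "Snow"},
    {"State": "AK", "Barometer": "Down", "Weather": "Snow"},
    {"State": "AK", "Barometer": "Down", "Weather": "Snow"},
    {"State": "HI", "Barometer": "Up", "Weather": "Sun"},
]
pruned = prune(tree, holdout, "Weather")
print(pruned["children"]["AK"])                   # Snow
```

In this made-up case the Barometer test under AK gets two of three holdout records wrong, so it is replaced by the leaf Snow, while the State test at the root is kept.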
Sufficient Statistics
• What information is needed to determine which attribute to split on?
– Need to compute expected entropy of each attribute
– To compute expected entropy after splitting on attribute A:
• How many records are there with each value of A?
• Among the records with each A value, how many belong to each class?
– These counts are called sufficient statistics
• Computing sufficient statistics via SQL
– Use a simple group-by SQL query (one per attribute):
SELECT A, Class, COUNT(*)
FROM Table
GROUP BY A, Class
– For non-root nodes, need a WHERE clause for earlier splits:
SELECT A, Class, COUNT(*)
FROM Table
WHERE B=x AND C=y
GROUP BY A, Class
– Full data cube contains all sufficient statistics for entire decision tree
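To make the group-by queries concrete, the sketch below runs them with Python's built-in `sqlite3` module against an in-memory copy of the weather table from the Decision Tree Example slides. The table name `weather`, the helper `sufficient_statistics`, and the added ORDER BY (for a deterministic row order) are choices made for this illustration.

```python
import sqlite3

rows = [
    ("AK", "Winter", "Down", "Snow"), ("HI", "Winter", "Down", "Sun"),
    ("HI", "Summer", "Up", "Sun"),    ("CA", "Summer", "Up", "Rain"),
    ("AK", "Winter", "Up", "Snow"),   ("CA", "Winter", "Down", "Sun"),
    ("AK", "Summer", "Down", "Sun"),  ("CA", "Winter", "Up", "Rain"),
    ("HI", "Summer", "Down", "Sun"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE weather (State TEXT, Season TEXT, Barometer TEXT, Weather TEXT)"
)
conn.executemany("INSERT INTO weather VALUES (?, ?, ?, ?)", rows)

def sufficient_statistics(conn, attr, where=""):
    """Counts per (attribute value, class) for one candidate split attribute.

    attr and where are pasted into the SQL text, so pass only trusted,
    hard-coded strings (column names cannot be bound as parameters).
    """
    query = (
        f"SELECT {attr}, Weather, COUNT(*) FROM weather {where} "
        f"GROUP BY {attr}, Weather ORDER BY {attr}, Weather"
    )
    return conn.execute(query).fetchall()

# Root node: one query per attribute, no WHERE clause.
root_stats = sufficient_statistics(conn, "State")
# Non-root node: restrict to the records that followed the State = AK edge.
ak_stats = sufficient_statistics(conn, "Season", "WHERE State = 'AK'")

print(root_stats)
print(ak_stats)
```

The returned counts are all that is needed to compute the expected entropy of a split; the individual records never have to leave the database.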
Decision Trees and Data Warehouses
• Building a decision tree generally involves
dimension-focused queries
– As opposed to typical fact-focused queries
– Records for which predictions are made are
dimension rows (e.g. Customers, Accounts)
– Sometimes queries just involve the dimension table
– Other times dimension attributes may be
supplemented by virtual behavioral attributes
• Two approaches for gathering sufficient statistics
– Compute entire data cube (including subtotals) in one
query
– Issue a series of small group-by queries