Transcript: Lectures 2+3
Data Mining and Big Data
Ahmed K. Ezzat, Data Mining Concepts and Techniques
1
Outline
Data Pre-processing
Data Mining Under the Hood
2
• Data Preprocessing Overview
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Preprocessing
3
1. Why Preprocess the Data: Data Quality?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how much can the data be trusted to be correct?
Interpretability: how easily the data can be understood?
4
1. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
5
2. Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
noisy: containing noise, errors, or outliers
e.g., Occupation=“ ” (missing data)
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
6
2. Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistency with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of entry
no history or changes of the data registered
Missing data may need to be inferred
7
2. How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same
class: smarter
the most probable value: inference-based such as Bayesian
formula or decision tree
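As a concrete illustration (not from the slides), here is a minimal Python/pandas sketch of three of these options: fill with a constant, fill with the attribute mean, and fill with the per-class mean. The column names `income` and `class` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with missing incomes
df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 30_000, None, 40_000],
})

# 1) Fill with a global constant (the constant itself is arbitrary)
filled_const = df["income"].fillna(-1)

# 2) Fill with the overall attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Fill with the mean of samples belonging to the same class (smarter)
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(filled_const.tolist())
print(filled_mean.tolist())
print(filled_class_mean.tolist())
```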
8
2. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
9
2. How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
10
2. Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering to
find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
11
3. Data Integration
Data integration:
Schema integration: e.g., A.cust-id ≡ B.cust-#
Combines data from multiple sources into a coherent store
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton
= William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources
are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
12
3. Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
Object identification: The same attribute or object may
have different names in different databases
Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
13
4. Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies:
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
14
4. Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
15
4. Mapping Data to a New Space
Fourier transform
Wavelet transform
[Figure: a signal of two sine waves, and two sine waves plus noise, shown in the time domain and mapped to the frequency domain]
16
4. What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve relative distance between objects at
different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
17
4. Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
Compressed approximation: store only a small fraction of the strongest
of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy compression,
localized in space
Method (illustrated with the Haar-2 and Daubechies-4 wavelets):
Length, L, must be an integer power of 2 (padding with 0's when necessary)
Each transform has 2 functions: smoothing, difference
Applies to pairs of data, resulting in two sets of data of length L/2
Applies the two functions recursively, until reaching the desired length
18
4. Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the principal eigenvector e along the direction of largest variation]
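As a rough illustration of the idea described above (a sketch, not the book's algorithm), the following Python/NumPy code centers the data, takes the eigenvectors of the covariance matrix, and projects onto the top components:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    top = eigvecs[:, order[:k]]             # top-k eigenvectors define the new space
    return Xc @ top

# Toy example: 2-D points projected onto 1 principal component
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_project(X, 1))
```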
19
4. Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Ex.: Log-linear models: obtain the value at a point in m-D
space as the product of appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling, …
20
4. Parametric Data Reduction:
Regression and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability
distributions
21
4. Regression Analysis
Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
[Figure: data points (x, y) with fitted line y = x + 1; the fitted value Y1' approximates the observed value Y1 at X1]
22
4. Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1,
X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
Useful for dimensionality reduction and data smoothing
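As a minimal sketch of parametric reduction by linear regression (store only w and b, discard the raw points), assuming a simple 1-D least-squares fit on hypothetical values:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of w and b for y ~ w*x + b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b = y.mean() - w * x.mean()
    return w, b

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
w, b = fit_line(x, y)
print(f"y = {w:.3f} x + {b:.3f}")   # only these two parameters need to be stored
```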
23
4. Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): approximately equal number of values per bucket
[Figure: histogram of values in buckets from 10,000 to 90,000, with counts from 0 to 40 on the vertical axis]
24
4. Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 10
25
4. Sampling
Sampling: obtaining a small sample s to represent the whole
data set N
Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods, e.g., stratified
sampling:
Note: Sampling may not reduce database I/Os (page at a time)
26
4. Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
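A short Python/pandas sketch of these sampling types (the `stratum` column is a hypothetical grouping attribute used to illustrate skewed data):

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "stratum": ["rare"] * 10 + ["common"] * 90,   # skewed group sizes
})

srswor = df.sample(n=10, replace=False, random_state=0)   # without replacement
srswr = df.sample(n=10, replace=True, random_state=0)     # with replacement

# Stratified: draw ~10% from each stratum, so the "rare" group is still represented
stratified = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(len(srswor), len(srswr), len(stratified))
```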
27
4. Sampling: With or without Replacement
[Figure: raw data with samples drawn with replacement and without replacement]
28
4. Sampling: Cluster or Stratified Sampling
[Figure: raw data and the corresponding cluster/stratified sample]
29
4. Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Reference appropriate levels
Further reduce the size of data to deal with
Use the smallest representation which is enough to solve
the task
Queries regarding aggregated information should be
answered using data cube, when possible
30
4. Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is possible
without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequences are not like audio: typically short and vary slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
31
4. Data Compression
[Figure: original data reduced to compressed data (lossless) vs. original data approximated (lossy)]
32
5. Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values, such that each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
33
5. Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
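A small Python sketch of the three normalizations, reproducing the income example above (the decimal-scaling input list is made up for illustration):

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max(73_600, 12_000, 98_000))    # ~0.716
print(z_score(73_600, 54_000, 16_000))    # 1.225
print(decimal_scaling([-986, 120, 917]))  # [-0.986, 0.12, 0.917]
```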
34
5. Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—numeric values, e.g., integers or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
35
5. Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-up
merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
36
5. Simple Discretization: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately
same number of samples
Good data scaling
Managing categorical attributes can be tricky
37
5. Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
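A sketch of equal-frequency binning with two of the smoothing options, reproducing the price example above in plain Python:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the closer boundary
by_bounds = [
    [b[0] if abs(v - b[0]) <= abs(v - b[-1]) else b[-1] for v in b]
    for b in bins
]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```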
38
5. Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-frequency binning, by equal-width binning, and by K-means clustering]
K-means clustering leads to better results
39
5. Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details are covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
Merge performed recursively, until a predefined stopping condition
40
5. Correlation Analysis (Nominal Data)
Χ2 (chi-square) test
χ2 = Σ (Observed - Expected)^2 / Expected
The larger the Χ2 value, the more likely the variables are
related
The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to the third variable: population
41
5. Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ2 (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):
χ2 = (250 - 90)^2/90 + (50 - 210)^2/210 + (200 - 360)^2/360 + (1000 - 840)^2/840 = 507.93
It shows that like_science_fiction and play_chess are
correlated in the group
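The same computation in a short Python sketch, done directly from the observed counts (scipy's chi2_contingency would give the same statistic):

```python
observed = [[250, 200],   # like science fiction: play chess / not play chess
            [50, 1000]]   # not like science fiction

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_sums[i] * col_sums[j] / total   # expected count, e.g. 450*300/1500 = 90
        chi2 += (obs - exp) ** 2 / exp

print(f"{chi2:.1f}")   # 507.9, i.e. the chi-square of about 507.93 computed above
```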
42
5. Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product
moment coefficient)
r(A,B) = [ Σi (ai - mean(A)) (bi - mean(B)) ] / [ (n - 1) σA σB ]
       = [ Σi (ai · bi) - n · mean(A) · mean(B) ] / [ (n - 1) σA σB ]
where n is the number of tuples, mean(A) and mean(B) are the respective means
of A and B, σA and σB are the respective standard deviations of A and B, and
Σ(ai · bi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated
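A direct Python/NumPy sketch of rA,B on two hypothetical attribute vectors (np.corrcoef gives the same value):

```python
import numpy as np

def pearson_r(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    # sample standard deviations (divide by n - 1), matching the formula above
    return ((a - a.mean()) * (b - b.mean())).sum() / (
        (n - 1) * a.std(ddof=1) * b.std(ddof=1)
    )

a = [2, 4, 6, 8, 10]
b = [1, 3, 5, 9, 11]
print(pearson_r(a, b))            # close to +1: positively correlated
print(np.corrcoef(a, b)[0, 1])    # same value from NumPy
```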
43
5. Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
44
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
45
• Mining Frequent Patterns
• Classification Overview
• Cluster Analysis Overview
• Outlier Detection
Data Mining Under The Hood
46
1. What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
47
1. Why Is Freq. Pattern Mining Important?
Frequent pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
48
1. Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

itemset: A set of one or more items
k-itemset X = {x1, …, xk}
(absolute) support, or support count of X: frequency or number of
occurrences of an itemset X
(relative) support, s, is the fraction of transactions that contain X
(i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup
threshold
49
1. Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also
contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
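A short Python sketch that recomputes the support and confidence of these two rules from the five transactions above:

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction having lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%
```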
50
1. Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g.,
{a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent
super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
51
1. Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all patterns?
!!
52
1. Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach
Apriori (Agrawal & Srikant@VLDB’94)
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
ECLAT: Frequent Pattern Mining with Vertical Data Format
Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
53
1. Apriori: A Candidate Generation & Test
Approach
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
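A compact Python sketch of the generate-and-test loop described above (a sketch, not an optimized implementation; itemsets are frozensets and candidates are pruned by the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: s for c, s in count(items).items() if s >= min_sup}
    frequent, k = dict(level), 2
    while level:
        # join step: merge frequent (k-1)-itemsets into k-item candidates
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: s for c, s in count(cands).items() if s >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# The TDB from the example on the next slide, with Supmin = 2
tdb = [frozenset(t) for t in
       [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```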
54
1. The Apriori Algorithm—An Example
Database TDB (Supmin = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3: {B, C, E}
3rd scan, L3: {B, C, E}: 2
55
1. Further Improvement of the Apriori Method
Major computational challenges
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink the number of candidates
Facilitate support counting of candidates
56
1. Sampling for Frequent Patterns
Select a sample of original database, mine frequent patterns
within sample using Apriori
Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules.
In VLDB’96
57
1. Frequent Pattern-Growth Approach: Mining
Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local frequent
items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent pattern
58
1. Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                  (ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again, construct FP-tree

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
F-list = f-c-a-b-m-p
[Figure: the resulting FP-tree, rooted at {}, with branches f:4 -> c:3 -> a:3 -> m:2 -> p:2, f:4 -> c:3 -> a:3 -> b:1 -> m:1, f:4 -> b:1, and c:1 -> b:1 -> p:1, plus node-links from the header table]
59
1. Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to
f-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but no a nor b, m, p
Pattern f
Completeness and non-redundancy
60
1. Find Patterns Having P From P-conditional
Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's
conditional pattern base

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
[Figure: FP-tree with branches f:4 -> c:3 -> a:3 -> m:2 -> p:2, f:4 -> c:3 -> a:3 -> b:1 -> m:1, f:4 -> b:1, and c:1 -> b:1 -> p:1]

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
61
1. From Conditional Pattern-bases to Conditional
FP-trees
For each pattern-base
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam
62
1. Benefits of the FP-tree Structure
Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant info: infrequent items are gone
Items in frequency descending order: the more frequently occurring,
the more likely to be shared
Never larger than the original database (not counting node-links and
the count field)
63
1. Performance of FP Growth in Large
Datasets
[Figure, left: run time (sec.) vs. support threshold (%) for D1 FP-growth runtime and D1 Apriori runtime on data set T25I20D10K; FP-Growth vs. Apriori]
[Figure, right: runtime (sec.) vs. support threshold (%) for D2 FP-growth and D2 TreeProjection on data set T25I20D100K; FP-Growth vs. Tree-Projection]
64
1. ECLAT: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
t(X) = t(Y): X and Y always happen together
t(X) ⊆ t(Y): a transaction having X always has Y
Using diffset to accelerate mining
Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset (XY, X) = {T2}
Eclat (Zaki et al. @KDD’97)
Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
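A tiny Python sketch of the vertical format, tid-list intersection, and diffsets, using the small TDB from the Apriori example (slide 55):

```python
# Vertical format: each item maps to the set of transaction ids containing it
# (item D omitted: it appears in only one transaction)
tidlists = {
    "A": {10, 30},
    "B": {20, 30, 40},
    "C": {10, 20, 30},
    "E": {20, 30, 40},
}

def t(items):
    """tid-list of an itemset = intersection of its items' tid-lists."""
    out = None
    for i in items:
        out = tidlists[i] if out is None else out & tidlists[i]
    return out

print(t("BE"))           # {20, 30, 40}: t(B) = t(E), so B and E always occur together
print(t("BC"))           # {20, 30}; support of {B, C} is the size of this set
print(t("B") - t("BC"))  # diffset(BC, B) = {40}: only the difference needs to be kept
```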
65
1. Interestingness Measure: Correlations (Lift)
play basketball → eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although
with lower support and confidence
Measure of dependent/correlated events: lift
lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33
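The same lift computation for the basketball/cereal table in a few lines of Python:

```python
total = 5000
n_b = 3000        # basketball
n_c = 3750        # cereal
n_bc = 2000       # basketball and cereal
n_b_notc = 1000   # basketball and not cereal

def lift(n_xy, n_x, n_y, n):
    # lift = P(X ∪ Y) / (P(X) P(Y))
    return (n_xy / n) / ((n_x / n) * (n_y / n))

print(round(lift(n_bc, n_b, n_c, total), 2))              # 0.89 -> negatively correlated
print(round(lift(n_b_notc, n_b, total - n_c, total), 2))  # 1.33 -> positively correlated
```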
66
2. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble
Methods
67
2. Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
68
2. Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
Numeric Prediction
Models continuous-valued functions, i.e., predicts unknown or
missing values
Typical applications
Credit/loan approval
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is
69
2. Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
70
2. Process (1): Model Construction
Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The classification algorithm produces the classifier (model), e.g.:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
71
2. Process (2): Using the Model in Prediction
Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
72
2. Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan's ID3 (Playing Tennis)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Resulting tree:
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
73
2. Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D:
InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - InfoA(D)
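A short Python sketch of Info(D), InfoA(D), and Gain(A), checked against the age attribute of the buys_computer data worked through on the next slide:

```python
from math import log2
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, label="buys_computer"):
    labels = [r[label] for r in rows]
    info_a = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[label] for r in rows if r[attr] == value]
        info_a += len(part) / len(rows) * info(part)
    return info(labels) - info_a

# 9 "yes" and 5 "no" tuples, split by age as in the worked example
rows = (
    [{"age": "<=30", "buys_computer": "yes"}] * 2 +
    [{"age": "<=30", "buys_computer": "no"}] * 3 +
    [{"age": "31..40", "buys_computer": "yes"}] * 4 +
    [{"age": ">40", "buys_computer": "yes"}] * 3 +
    [{"age": ">40", "buys_computer": "no"}] * 2
)
print(round(gain(rows, "age"), 2))   # 0.25, i.e. Gain(age) ≈ 0.246 up to rounding
```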
74
2. Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
(Training data: the Buys_computer table from the previous example.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     pi   ni   I(pi, ni)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes'es and 3 no's.
Hence Gain(age) = Info(D) - Infoage(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
75
2. Presentation of Classification Results
76
2. Visualization of a Decision Tree in
SGI/MineSet 3.0
77
3. What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
78
3. Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
Its ability to discover some or all of the hidden patterns
79
2. Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
80
2. Bayesian Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (posteriori probability),
the probability that the hypothesis holds given the observed
data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing the sample
X, given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
81
2. Bayesian Theorem
Given training data X, posteriori probability of a hypothesis
H, P(H|X), follows the Bayes theorem
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
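A tiny numeric sketch of the theorem; the probabilities below are made up purely to show the computation:

```python
# Hypothetical numbers: prior, likelihood, and evidence
p_h = 0.3          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.4  # P(X|H): prob. of the observed profile X among buyers
p_x = 0.2          # P(X): overall probability of observing profile X

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6 -> posterior probability that this customer buys
```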
82
2. Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or
misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
83
2. Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction
Rules are mutually exclusive and exhaustive
Example: Rule extraction from our buys_computer decision tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
84
2. Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
Use test set of class-labeled tuples instead of training set
when assessing accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
85
3. Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
Climate: understanding earth climate, finding patterns in atmospheric
and ocean data
Economic Science: market research
86
3. Clustering as a Preprocessing Tool (Utility)
Summarization:
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Preprocessing for regression, PCA, classification, and
association analysis
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
87
3. Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures
the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
88
4. What Are Outliers?
Outlier: A data object that deviates significantly from the normal objects
as if it were generated by a different mechanism
Ex.: Unusual credit card purchase; in sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from the noise data
Noise is random error or variance in a measured variable
Noise should be removed before outlier detection
Outliers are interesting: they violate the mechanism that generates the
normal data
Outlier detection vs. novelty detection: early stage, outlier; but later merged
into the model
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
89
4. Types of Outliers (I)
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
Global Outlier
Object is Og if it significantly deviates from the rest of the data set
Ex. Intrusion detection in computer networks
Issue: Find an appropriate measurement of deviation
Contextual outlier (or conditional outlier)
Object is Oc if it deviates significantly based on a selected context
Ex. 80° F in Urbana: outlier? (depending on summer or winter?)
Attributes of data objects should be divided into two groups
Contextual attributes: defines the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
Issue: How to define or formulate meaningful context?
90
4. Types of Outliers (II)
Collective Outliers
A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data
objects may not be outliers
Applications: E.g., intrusion detection:
Collective Outlier
When a number of computers keep sending denial-of-service packets to each other
Detection of collective outliers:
Consider not only behavior of individual objects, but also that of
groups of objects
Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure on
objects.
A data set may have multiple types of outliers
One object may belong to more than one type of outlier
91
4. Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
E.g., in clinical data, a small deviation could be an outlier; in
marketing analysis, only larger fluctuations would be
Handling noise in outlier detection
Noise may distort the normal objects and blur the distinction between
normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
92
END
93