Transcript: Lectures 2+3
Data Mining and Big Data
Ahmed K. Ezzat, Data Mining Concepts and Techniques
1
Outline
Data Pre-processing
Data Mining Under the Hood
2
• Data Preprocessing Overview
• Data Cleaning
• Data Integration
• Data Reduction
• Data Transformation and Data Discretization
Data Preprocessing
3
1. Why Preprocess the Data: Data Quality?
Measures for data quality: A multidimensional view
Accuracy: correct or wrong, accurate or not
Completeness: not recorded, unavailable, …
Consistency: some modified but some not, dangling, …
Timeliness: timely update?
Believability: how much can the data be trusted to be correct?
Interpretability: how easily the data can be understood?
4
1. Major Tasks in Data Preprocessing
Data cleaning
Fill in missing values, smooth noisy data, identify or remove
outliers, and resolve inconsistencies
Data integration
Integration of multiple databases, data cubes, or files
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
5
2. Data Cleaning
Data in the Real World Is Dirty: Lots of potentially incorrect data,
e.g., faulty instruments, human or computer error, transmission error
incomplete: lacking attribute values, lacking certain attributes of
interest, or containing only aggregate data
noisy: containing noise, errors, or outliers
e.g., Occupation=“ ” (missing data)
e.g., Salary=“−10” (an error)
inconsistent: containing discrepancies in codes or names, e.g.,
Age=“42”, Birthday=“03/07/2010”
Was rating “1, 2, 3”, now rating “A, B, C”
discrepancy between duplicate records
Intentional (e.g., disguised missing data)
Jan. 1 as everyone’s birthday?
6
2. Incomplete (Missing) Data
Data is not always available
E.g., many tuples have no recorded value for several attributes,
such as customer income in sales data
Missing data may be due to
equipment malfunction
inconsistency with other recorded data and thus deleted
data not entered due to misunderstanding
certain data not considered important at the time of entry
no history or changes of the data registered
Missing data may need to be inferred
7
2. How to Handle Missing Data?
Ignore the tuple: usually done when class label is missing (when
doing classification)—not effective when the % of missing values
per attribute varies considerably
Fill in the missing value manually: tedious + infeasible?
Fill it in automatically with:
a global constant : e.g., “unknown”, a new class?!
the attribute mean
the attribute mean for all samples belonging to the same
class: smarter
the most probable value: inference-based such as Bayesian
formula or decision tree
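As a concrete illustration (not from the slides), here is a minimal Python/pandas sketch of three of these options: fill with a constant, fill with the attribute mean, and fill with the per-class mean. The column names `income` and `class` and the toy values are hypothetical.

```python
import pandas as pd

# Hypothetical toy data with missing incomes
df = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "income": [50_000, None, 30_000, None, 40_000],
})

# 1) Fill with a global constant (the constant itself is arbitrary)
filled_const = df["income"].fillna(-1)

# 2) Fill with the overall attribute mean
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Fill with the mean of samples belonging to the same class (smarter)
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean")
)

print(filled_const.tolist())
print(filled_mean.tolist())
print(filled_class_mean.tolist())
```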
8
2. Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
faulty data collection instruments
data entry problems
data transmission problems
technology limitation
inconsistency in naming convention
Other data problems which require data cleaning
duplicate records
incomplete data
inconsistent data
9
2. How to Handle Noisy Data?
Binning
first sort data and partition into (equal-frequency) bins
then one can smooth by bin means, smooth by bin
median, smooth by bin boundaries, etc.
Regression
smooth by fitting the data into regression functions
Clustering
detect and remove outliers
Combined computer and human inspection
detect suspicious values and check by human (e.g.,
deal with possible outliers)
10
2. Data Cleaning as a Process
Data discrepancy detection
Use metadata (e.g., domain, range, dependency, distribution)
Check field overloading
Check uniqueness rule, consecutive rule and null rule
Use commercial tools
Data scrubbing: use simple domain knowledge (e.g., postal
code, spell-check) to detect errors and make corrections
Data auditing: by analyzing data to discover rules and
relationship to detect violators (e.g., correlation and clustering to
find outliers)
Data migration and integration
Data migration tools: allow transformations to be specified
ETL (Extraction/Transformation/Loading) tools: allow users to
specify transformations through a graphical user interface
Integration of the two processes
Iterative and interactive (e.g., Potter's Wheel)
11
3. Data Integration
Data integration:
Schema integration: e.g., A.cust-id ≡ B.cust-#
Combines data from multiple sources into a coherent store
Integrate metadata from different sources
Entity identification problem:
Identify real world entities from multiple data sources, e.g., Bill Clinton
= William Clinton
Detecting and resolving data value conflicts
For the same real world entity, attribute values from different sources
are different
Possible reasons: different representations, different scales, e.g.,
metric vs. British units
12
3. Handling Redundancy in Data Integration
Redundant data often occur when integrating multiple
databases
Object identification: The same attribute or object may
have different names in different databases
Derivable data: One attribute may be a “derived” attribute in
another table, e.g., annual revenue
Redundant attributes may be able to be detected by
correlation analysis and covariance analysis
Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
13
4. Data Reduction Strategies
Data reduction: Obtain a reduced representation of the data set that is
much smaller in volume but yet produces the same (or almost the
same) analytical results
Why data reduction? — A database/data warehouse may store
terabytes of data. Complex data analysis may take a very long time to
run on the complete data set.
Data reduction strategies:
Dimensionality reduction, e.g., remove unimportant attributes
Wavelet transforms
Principal Components Analysis (PCA)
Feature subset selection, feature creation
Numerosity reduction (some simply call it: Data Reduction)
Regression and Log-Linear Models
Histograms, clustering, sampling
Data cube aggregation
Data compression
14
4. Data Reduction 1: Dimensionality Reduction
Curse of dimensionality
When dimensionality increases, data becomes increasingly sparse
Density and distance between points, which are critical to clustering and
outlier analysis, become less meaningful
The possible combinations of subspaces will grow exponentially
Dimensionality reduction
Avoid the curse of dimensionality
Help eliminate irrelevant features and reduce noise
Reduce time and space required in data mining
Allow easier visualization
Dimensionality reduction techniques
Wavelet transforms
Principal Component Analysis
Supervised and nonlinear techniques (e.g., feature selection)
15
4. Mapping Data to a New Space
Fourier transform
Wavelet transform
[Figure: a signal of two sine waves, and two sine waves plus noise, shown in the time domain and mapped to the frequency domain]
16
4. What Is Wavelet Transform?
Decomposes a signal into different frequency subbands
Applicable to n-dimensional signals
Data are transformed to preserve relative distance between objects at
different levels of resolution
Allows natural clusters to become more distinguishable
Used for image compression
17
4. Wavelet Transformation
Discrete wavelet transform (DWT) for linear signal processing,
multi-resolution analysis
Compressed approximation: store only a small fraction of the strongest
of the wavelet coefficients
Similar to discrete Fourier transform (DFT), but better lossy compression,
localized in space
Method (illustrated with the Haar-2 and Daubechies-4 wavelets):
Length, L, must be an integer power of 2 (padding with 0's when necessary)
Each transform has 2 functions: smoothing, difference
Applies to pairs of data, resulting in two sets of data of length L/2
Applies the two functions recursively, until reaching the desired length
18
4. Principal Component Analysis (PCA)
Find a projection that captures the largest amount of variation in data
The original data are projected onto a much smaller space, resulting
in dimensionality reduction. We find the eigenvectors of the
covariance matrix, and these eigenvectors define the new space
[Figure: data points in the (x1, x2) plane with the principal eigenvector e along the direction of largest variation]
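As a rough illustration of the idea described above (a sketch, not the book's algorithm), the following Python/NumPy code centers the data, takes the eigenvectors of the covariance matrix, and projects onto the top components:

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the k principal components."""
    Xc = X - X.mean(axis=0)                 # center each attribute
    cov = np.cov(Xc, rowvar=False)          # covariance matrix of attributes
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh: symmetric matrix
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    top = eigvecs[:, order[:k]]             # top-k eigenvectors define the new space
    return Xc @ top

# Toy example: 2-D points projected onto 1 principal component
X = np.array([[2.5, 2.4], [0.5, 0.7], [2.2, 2.9], [1.9, 2.2], [3.1, 3.0]])
print(pca_project(X, 1))
```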
19
4. Data Reduction 2: Numerosity Reduction
Reduce data volume by choosing alternative, smaller
forms of data representation
Parametric methods (e.g., regression)
Assume the data fits some model, estimate model
parameters, store only the parameters, and discard the
data (except possible outliers)
Ex.: Log-linear models: obtain the value at a point in m-D
space as the product of appropriate marginal subspaces
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling, …
20
4. Parametric Data Reduction:
Regression and Log-Linear Models
Linear regression
Data modeled to fit a straight line
Often uses the least-square method to fit the line
Multiple regression
Allows a response variable Y to be modeled as a
linear function of multidimensional feature vector
Log-linear model
Approximates discrete multidimensional probability
distributions
21
4. Regression Analysis
Regression analysis: A collective name for techniques for the modeling and
analysis of numerical data consisting of values of a dependent variable (also
called response variable or measurement) and of one or more independent
variables (aka explanatory variables or predictors)
The parameters are estimated so as to give a "best fit" of the data
Most commonly the best fit is evaluated by using the least squares method,
but other criteria have also been used
Used for prediction (including forecasting of time-series data), inference,
hypothesis testing, and modeling of causal relationships
[Figure: data points (x, y) with fitted line y = x + 1; the fitted value Y1' approximates the observed value Y1 at X1]
22
4. Regression Analysis and Log-Linear Models
Linear regression: Y = w X + b
Two regression coefficients, w and b, specify the line and are to be
estimated by using the data at hand
Using the least squares criterion on the known values of Y1, Y2, …, X1,
X2, …
Multiple regression: Y = b0 + b1 X1 + b2 X2
Many nonlinear functions can be transformed into the above
Log-linear models:
Approximate discrete multidimensional probability distributions
Estimate the probability of each point (tuple) in a multi-dimensional
space for a set of discretized attributes, based on a smaller subset of
dimensional combinations
Useful for dimensionality reduction and data smoothing
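As a minimal sketch of parametric reduction by linear regression (store only w and b, discard the raw points), assuming a simple 1-D least-squares fit on hypothetical values:

```python
import numpy as np

def fit_line(x, y):
    """Least-squares estimates of w and b for y ~ w*x + b."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
    b = y.mean() - w * x.mean()
    return w, b

x = [1, 2, 3, 4, 5]
y = [2.1, 2.9, 4.2, 4.8, 6.1]
w, b = fit_line(x, y)
print(f"y = {w:.3f} x + {b:.3f}")   # only these two parameters need to be stored
```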
23
4. Histogram Analysis
Divide data into buckets and store the average (sum) for each bucket
Partitioning rules:
Equal-width: equal bucket range
Equal-frequency (or equal-depth): approximately equal number of values per bucket
[Figure: histogram of values in buckets from 10,000 to 90,000, with counts from 0 to 40 on the vertical axis]
24
4. Clustering
Partition data set into clusters based on similarity, and store
cluster representation (e.g., centroid and diameter) only
Can be very effective if data is clustered but not if data is
“smeared”
Can have hierarchical clustering and be stored in multidimensional index tree structures
There are many choices of clustering definitions and
clustering algorithms
Cluster analysis will be studied in depth in Chapter 10
25
4. Sampling
Sampling: obtaining a small sample s to represent the whole
data set N
Allow a mining algorithm to run in complexity that is potentially
sub-linear to the size of the data
Key principle: Choose a representative subset of the data
Simple random sampling may have very poor performance
in the presence of skew
Develop adaptive sampling methods, e.g., stratified
sampling:
Note: Sampling may not reduce database I/Os (page at a time)
26
4. Types of Sampling
Simple random sampling
There is an equal probability of selecting any particular
item
Sampling without replacement
Once an object is selected, it is removed from the
population
Sampling with replacement
A selected object is not removed from the population
Stratified sampling
Partition the data set, and draw samples from each
partition (proportionally, i.e., approximately the same
percentage of the data)
Used in conjunction with skewed data
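A short Python/pandas sketch of these sampling types (the `stratum` column is a hypothetical grouping attribute used to illustrate skewed data):

```python
import pandas as pd

df = pd.DataFrame({
    "value": range(100),
    "stratum": ["rare"] * 10 + ["common"] * 90,   # skewed group sizes
})

srswor = df.sample(n=10, replace=False, random_state=0)   # without replacement
srswr = df.sample(n=10, replace=True, random_state=0)     # with replacement

# Stratified: draw ~10% from each stratum, so the "rare" group is still represented
stratified = df.groupby("stratum", group_keys=False).apply(
    lambda g: g.sample(frac=0.1, random_state=0)
)
print(len(srswor), len(srswr), len(stratified))
```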
27
4. Sampling: With or without Replacement
[Figure: raw data with samples drawn with replacement and without replacement]
28
4. Sampling: Cluster or Stratified Sampling
[Figure: raw data and the corresponding cluster/stratified sample]
29
4. Data Cube Aggregation
The lowest level of a data cube (base cuboid)
The aggregated data for an individual entity of interest
E.g., a customer in a phone calling data warehouse
Multiple levels of aggregation in data cubes
Reference appropriate levels
Further reduce the size of data to deal with
Use the smallest representation which is enough to solve
the task
Queries regarding aggregated information should be
answered using data cube, when possible
30
4. Data Reduction 3: Data Compression
String compression
There are extensive theories and well-tuned algorithms
Typically lossless, but only limited manipulation is possible
without expansion
Audio/video compression
Typically lossy compression, with progressive refinement
Sometimes small fragments of signal can be reconstructed
without reconstructing the whole
Time sequences are not like audio: typically short and vary slowly with time
Dimensionality and numerosity reduction may also be
considered as forms of data compression
31
4. Data Compression
[Figure: original data reduced to compressed data (lossless) vs. original data approximated (lossy)]
32
5. Data Transformation
A function that maps the entire set of values of a given attribute to a
new set of replacement values, such that each old value can be identified
with one of the new values
Methods
Smoothing: Remove noise from data
Attribute/feature construction
New attributes constructed from the given ones
Aggregation: Summarization, data cube construction
Normalization: Scaled to fall within a smaller, specified range
min-max normalization
z-score normalization
normalization by decimal scaling
Discretization: Concept hierarchy climbing
33
5. Normalization
Min-max normalization: to [new_minA, new_maxA]
v' = ((v - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
Ex. Let income range $12,000 to $98,000 be normalized to [0.0, 1.0].
Then $73,600 is mapped to ((73,600 - 12,000) / (98,000 - 12,000)) * (1.0 - 0) + 0 = 0.716
Z-score normalization (μ: mean, σ: standard deviation):
v' = (v - μA) / σA
Ex. Let μ = 54,000, σ = 16,000. Then (73,600 - 54,000) / 16,000 = 1.225
Normalization by decimal scaling:
v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
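A small Python sketch of the three normalizations, reproducing the income example above (the decimal-scaling input list is made up for illustration):

```python
def min_max(v, lo, hi, new_lo=0.0, new_hi=1.0):
    return (v - lo) / (hi - lo) * (new_hi - new_lo) + new_lo

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = len(str(int(max(abs(v) for v in values))))
    return [v / 10 ** j for v in values]

print(min_max(73_600, 12_000, 98_000))    # ~0.716
print(z_score(73_600, 54_000, 16_000))    # 1.225
print(decimal_scaling([-986, 120, 917]))  # [-0.986, 0.12, 0.917]
```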
34
5. Discretization
Three types of attributes
Nominal—values from an unordered set, e.g., color, profession
Ordinal—values from an ordered set, e.g., military or academic
rank
Numeric—numeric values, e.g., integers or real numbers
Discretization: Divide the range of a continuous attribute into intervals
Interval labels can then be used to replace actual data values
Reduce data size by discretization
Supervised vs. unsupervised
Split (top-down) vs. merge (bottom-up)
Discretization can be performed recursively on an attribute
Prepare for further analysis, e.g., classification
35
5. Data Discretization Methods
Typical methods: All the methods can be applied recursively
Binning
Top-down split, unsupervised
Histogram analysis
Top-down split, unsupervised
Clustering analysis (unsupervised, top-down split or bottom-up
merge)
Decision-tree analysis (supervised, top-down split)
Correlation (e.g., χ2) analysis (unsupervised, bottom-up
merge)
36
5. Simple Discretization: Binning
Equal-width (distance) partitioning
Divides the range into N intervals of equal size: uniform grid
if A and B are the lowest and highest values of the attribute, the
width of intervals will be: W = (B –A)/N.
The most straightforward, but outliers may dominate presentation
Skewed data is not handled well
Equal-depth (frequency) partitioning
Divides the range into N intervals, each containing approximately
same number of samples
Good data scaling
Managing categorical attributes can be tricky
37
5. Binning Methods for Data Smoothing
Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28,
29, 34
* Partition into equal-frequency (equi-depth) bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
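A sketch of equal-frequency binning with two of the smoothing options, reproducing the price example above in plain Python:

```python
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted

n_bins = 3
size = len(prices) // n_bins
bins = [prices[i * size:(i + 1) * size] for i in range(n_bins)]

# Smoothing by bin means: every value replaced by its bin's mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value snaps to the closer boundary
by_bounds = [
    [b[0] if abs(v - b[0]) <= abs(v - b[-1]) else b[-1] for v in b]
    for b in bins
]

print(bins)       # [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```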
38
5. Discretization Without Using Class Labels
(Binning vs. Clustering)
[Figure: the same data discretized by equal-frequency binning, by equal-width binning, and by K-means clustering]
K-means clustering leads to better results
39
5. Discretization by Classification &
Correlation Analysis
Classification (e.g., decision tree analysis)
Supervised: Given class labels, e.g., cancerous vs. benign
Using entropy to determine split point (discretization point)
Top-down, recursive split
Details are covered in Chapter 7
Correlation analysis (e.g., Chi-merge: χ2-based discretization)
Supervised: use class information
Bottom-up merge: find the best neighboring intervals (those having
similar distributions of classes, i.e., low χ2 values) to merge
Merge performed recursively, until a predefined stopping condition
40
5. Correlation Analysis (Nominal Data)
Χ2 (chi-square) test
χ2 = Σ (Observed - Expected)^2 / Expected
The larger the Χ2 value, the more likely the variables are
related
The cells that contribute the most to the Χ2 value are
those whose actual count is very different from the
expected count
Correlation does not imply causality
# of hospitals and # of car thefts in a city are correlated
Both are causally linked to the third variable: population
41
5. Chi-Square Calculation: An Example
                           Play chess   Not play chess   Sum (row)
Like science fiction       250 (90)     200 (360)        450
Not like science fiction   50 (210)     1000 (840)       1050
Sum (col.)                 300          1200             1500

Χ2 (chi-square) calculation (numbers in parentheses are expected counts
calculated based on the data distribution in the two categories):
χ2 = (250 - 90)^2/90 + (50 - 210)^2/210 + (200 - 360)^2/360 + (1000 - 840)^2/840 = 507.93
It shows that like_science_fiction and play_chess are
correlated in the group
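The same computation in a short Python sketch, done directly from the observed counts (scipy's chi2_contingency would give the same statistic):

```python
observed = [[250, 200],   # like science fiction: play chess / not play chess
            [50, 1000]]   # not like science fiction

row_sums = [sum(r) for r in observed]
col_sums = [sum(c) for c in zip(*observed)]
total = sum(row_sums)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        exp = row_sums[i] * col_sums[j] / total   # expected count, e.g. 450*300/1500 = 90
        chi2 += (obs - exp) ** 2 / exp

print(f"{chi2:.1f}")   # 507.9, i.e. the chi-square of about 507.93 computed above
```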
42
5. Correlation Analysis (Numeric Data)
Correlation coefficient (also called Pearson’s product
moment coefficient)
r(A,B) = [ Σi (ai - mean(A)) (bi - mean(B)) ] / [ (n - 1) σA σB ]
       = [ Σi (ai · bi) - n · mean(A) · mean(B) ] / [ (n - 1) σA σB ]
where n is the number of tuples, mean(A) and mean(B) are the respective means
of A and B, σA and σB are the respective standard deviations of A and B, and
Σ(ai · bi) is the sum of the AB cross-product.
If rA,B > 0, A and B are positively correlated (A’s values
increase as B’s). The higher, the stronger correlation.
rA,B = 0: independent; rA,B < 0: negatively correlated
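A direct Python/NumPy sketch of rA,B on two hypothetical attribute vectors (np.corrcoef gives the same value):

```python
import numpy as np

def pearson_r(a, b):
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    # sample standard deviations (divide by n - 1), matching the formula above
    return ((a - a.mean()) * (b - b.mean())).sum() / (
        (n - 1) * a.std(ddof=1) * b.std(ddof=1)
    )

a = [2, 4, 6, 8, 10]
b = [1, 3, 5, 9, 11]
print(pearson_r(a, b))            # close to +1: positively correlated
print(np.corrcoef(a, b)[0, 1])    # same value from NumPy
```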
43
5. Concept Hierarchy Generation
Concept hierarchy organizes concepts (i.e., attribute values)
hierarchically and is usually associated with each dimension in a data
warehouse
Concept hierarchies facilitate drilling and rolling in data warehouses to
view data in multiple granularity
Concept hierarchy formation: Recursively reduce the data by collecting
and replacing low level concepts (such as numeric values for age) by
higher level concepts (such as youth, adult, or senior)
Concept hierarchies can be explicitly specified by domain experts
and/or data warehouse designers
Concept hierarchy can be automatically formed for both numeric and
nominal data. For numeric data, use discretization methods shown.
44
Summary
Data quality: accuracy, completeness, consistency, timeliness,
believability, interpretability
Data cleaning: e.g. missing/noisy values, outliers
Data integration from multiple sources:
Entity identification problem
Remove redundancies
Detect inconsistencies
Data reduction
Dimensionality reduction
Numerosity reduction
Data compression
Data transformation and data discretization
Normalization
Concept hierarchy generation
45
• Mining Frequent Patterns
• Classification Overview
• Cluster Analysis Overview
• Outlier Detection
Data Mining Under The Hood
46
1. What Is Frequent Pattern Analysis?
Frequent pattern: a pattern (a set of items, subsequences, substructures,
etc.) that occurs frequently in a data set
First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of
frequent itemsets and association rule mining
Motivation: Finding inherent regularities in data
What products were often purchased together?— Beer and diapers?!
What are the subsequent purchases after buying a PC?
What kinds of DNA are sensitive to this new drug?
Can we automatically classify web documents?
Applications
Basket data analysis, cross-marketing, catalog design, sale campaign
analysis, Web log (click stream) analysis, and DNA sequence analysis.
47
1. Why Is Freq. Pattern Mining Important?
Frequent pattern: An intrinsic and important property of datasets
Foundation for many essential data mining tasks
Association, correlation, and causality analysis
Sequential, structural (e.g., sub-graph) patterns
Pattern analysis in spatiotemporal, multimedia, time-series,
and stream data
Classification: discriminative, frequent pattern analysis
Cluster analysis: frequent pattern-based clustering
Data warehousing: iceberg cube and cube-gradient
Semantic data compression: fascicles
Broad applications
48
1. Basic Concepts: Frequent Patterns
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

[Figure: Venn diagram of customers who buy beer, customers who buy diapers, and customers who buy both]

itemset: A set of one or more items
k-itemset X = {x1, …, xk}
(absolute) support, or support count of X: frequency or number of
occurrences of an itemset X
(relative) support, s, is the fraction of transactions that contain X
(i.e., the probability that a transaction contains X)
An itemset X is frequent if X's support is no less than a minsup
threshold
49
1. Basic Concepts: Association Rules
Tid   Items bought
10    Beer, Nuts, Diaper
20    Beer, Coffee, Diaper
30    Beer, Diaper, Eggs
40    Nuts, Eggs, Milk
50    Nuts, Coffee, Diaper, Eggs, Milk

Find all the rules X → Y with minimum support and confidence
support, s: probability that a transaction contains X ∪ Y
confidence, c: conditional probability that a transaction having X also
contains Y
Let minsup = 50%, minconf = 50%
Freq. Pat.: Beer:3, Nuts:3, Diaper:4, Eggs:3, {Beer, Diaper}:3
Association rules: (many more!)
Beer → Diaper (60%, 100%)
Diaper → Beer (60%, 75%)
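A short Python sketch that recomputes the support and confidence of these two rules from the five transactions above:

```python
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Milk"},
]

def support(itemset):
    # fraction of transactions containing the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(lhs, rhs):
    # conditional probability that a transaction having lhs also contains rhs
    return support(lhs | rhs) / support(lhs)

print(support({"Beer", "Diaper"}))        # 0.6  -> 60%
print(confidence({"Beer"}, {"Diaper"}))   # 1.0  -> 100%
print(confidence({"Diaper"}, {"Beer"}))   # 0.75 -> 75%
```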
50
1. Closed Patterns and Max-Patterns
A long pattern contains a combinatorial number of sub-patterns, e.g.,
{a1, …, a100} contains C(100,1) + C(100,2) + … + C(100,100) = 2^100 - 1 ≈ 1.27 × 10^30 sub-patterns!
Solution: Mine closed patterns and max-patterns instead
An itemset X is closed if X is frequent and there exists no super-pattern
Y ⊃ X with the same support as X (proposed by Pasquier, et al. @ ICDT'99)
An itemset X is a max-pattern if X is frequent and there exists no frequent
super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
Closed pattern is a lossless compression of freq. patterns
Reducing the # of patterns and rules
51
1. Closed Patterns and Max-Patterns
Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, Min_sup = 1
What is the set of closed itemsets?
<a1, …, a100>: 1
<a1, …, a50>: 2
What is the set of max-patterns?
<a1, …, a100>: 1
What is the set of all patterns?
!!
52
1. Scalable Frequent Itemset Mining Methods
Apriori: A Candidate Generation-and-Test Approach
Apriori (Agrawal & Srikant@VLDB’94)
Improving the Efficiency of Apriori
FPGrowth: A Frequent Pattern-Growth Approach
Freq. pattern growth (FPgrowth—Han, Pei & Yin @SIGMOD’00)
ECLAT: Frequent Pattern Mining with Vertical Data Format
Vertical data format approach (Charm—Zaki & Hsiao @SDM’02)
53
1. Apriori: A Candidate Generation & Test
Approach
Apriori pruning principle: If there is any itemset which is
infrequent, its superset should not be generated/tested!
(Agrawal & Srikant @VLDB’94, Mannila, et al. @ KDD’ 94)
Method:
Initially, scan DB once to get frequent 1-itemset
Generate length (k+1) candidate itemsets from length k
frequent itemsets
Test the candidates against DB
Terminate when no frequent or candidate set can be
generated
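A compact Python sketch of the generate-and-test loop described above (a sketch, not an optimized implementation; itemsets are frozensets and candidates are pruned by the Apriori principle):

```python
from itertools import combinations

def apriori(transactions, min_sup):
    """Return {itemset: support_count} for all frequent itemsets."""
    def count(cands):
        return {c: sum(c <= t for t in transactions) for c in cands}

    items = {frozenset([i]) for t in transactions for i in t}
    level = {c: s for c, s in count(items).items() if s >= min_sup}
    frequent, k = dict(level), 2
    while level:
        # join step: merge frequent (k-1)-itemsets into k-item candidates
        cands = {a | b for a in level for b in level if len(a | b) == k}
        # prune step: every (k-1)-subset must itself be frequent
        cands = {c for c in cands
                 if all(frozenset(s) in level for s in combinations(c, k - 1))}
        level = {c: s for c, s in count(cands).items() if s >= min_sup}
        frequent.update(level)
        k += 1
    return frequent

# The TDB from the example on the next slide, with Supmin = 2
tdb = [frozenset(t) for t in
       [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]]
for itemset, sup in sorted(apriori(tdb, 2).items(),
                           key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```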
54
1. The Apriori Algorithm—An Example
Database TDB (Supmin = 2):
Tid   Items
10    A, C, D
20    B, C, E
30    A, B, C, E
40    B, E

1st scan, C1: {A}: 2, {B}: 3, {C}: 3, {D}: 1, {E}: 3
L1: {A}: 2, {B}: 3, {C}: 3, {E}: 3

C2 (candidates from L1): {A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}
2nd scan, C2 with counts: {A, B}: 1, {A, C}: 2, {A, E}: 1, {B, C}: 2, {B, E}: 3, {C, E}: 2
L2: {A, C}: 2, {B, C}: 2, {B, E}: 3, {C, E}: 2

C3: {B, C, E}
3rd scan, L3: {B, C, E}: 2
55
1. Further Improvement of the Apriori Method
Major computational challenges
Multiple scans of the transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce passes of transaction database scans
Shrink the number of candidates
Facilitate support counting of candidates
56
1. Sampling for Frequent Patterns
Select a sample of original database, mine frequent patterns
within sample using Apriori
Scan database once to verify frequent itemsets found in
sample, only borders of closure of frequent patterns are
checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent patterns
H. Toivonen. Sampling large databases for association rules.
In VLDB’96
57
1. Frequent Pattern-Growth Approach: Mining
Frequent Patterns Without Candidate Generation
Bottlenecks of the Apriori approach
Breadth-first (i.e., level-wise) search
Candidate generation and test
Often generates a huge number of candidates
The FPGrowth Approach (J. Han, J. Pei, and Y. Yin, SIGMOD’ 00)
Depth-first search
Avoid explicit candidate generation
Major philosophy: Grow long patterns from short ones using local frequent
items only
“abc” is a frequent pattern
Get all transactions having “abc”, i.e., project DB on abc: DB|abc
“d” is a local frequent item in DB|abc → abcd is a frequent pattern
58
1. Construct FP-tree from a Transaction Database
min_support = 3

TID   Items bought                  (ordered) frequent items
100   {f, a, c, d, g, i, m, p}      {f, c, a, m, p}
200   {a, b, c, f, l, m, o}         {f, c, a, b, m}
300   {b, f, h, j, o, w}            {f, b}
400   {b, c, k, s, p}               {c, b, p}
500   {a, f, c, e, l, p, m, n}      {f, c, a, m, p}

1. Scan DB once, find frequent 1-itemsets (single item patterns)
2. Sort frequent items in frequency descending order, f-list
3. Scan DB again, construct FP-tree

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
F-list = f-c-a-b-m-p
[Figure: the resulting FP-tree, rooted at {}, with branches f:4 -> c:3 -> a:3 -> m:2 -> p:2, f:4 -> c:3 -> a:3 -> b:1 -> m:1, f:4 -> b:1, and c:1 -> b:1 -> p:1, plus node-links from the header table]
59
1. Partition Patterns and Databases
Frequent patterns can be partitioned into subsets according to
f-list
F-list = f-c-a-b-m-p
Patterns containing p
Patterns having m but no p
…
Patterns having c but no a nor b, m, p
Pattern f
Completeness and non-redundancy
60
1. Find Patterns Having P From P-conditional
Database
Starting at the frequent item header table in the FP-tree
Traverse the FP-tree by following the link of each frequent item p
Accumulate all of the transformed prefix paths of item p to form p's
conditional pattern base

Header table (item: frequency): f: 4, c: 4, a: 3, b: 3, m: 3, p: 3
[Figure: FP-tree with branches f:4 -> c:3 -> a:3 -> m:2 -> p:2, f:4 -> c:3 -> a:3 -> b:1 -> m:1, f:4 -> b:1, and c:1 -> b:1 -> p:1]

Conditional pattern bases:
item   cond. pattern base
c      f:3
a      fc:3
b      fca:1, f:1, c:1
m      fca:2, fcab:1
p      fcam:2, cb:1
61
1. From Conditional Pattern-bases to Conditional
FP-trees
For each pattern-base
Accumulate the count for each item in the base
Construct the FP-tree for the frequent items of the pattern base

m-conditional pattern base: fca:2, fcab:1
m-conditional FP-tree: {} -> f:3 -> c:3 -> a:3
All frequent patterns related to m: m, fm, cm, am, fcm, fam, cam, fcam
62
1. Benefits of the FP-tree Structure
Completeness
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
Compactness
Reduces irrelevant info: infrequent items are gone
Items in frequency descending order: the more frequently occurring,
the more likely to be shared
Never larger than the original database (not counting node-links and
the count field)
63
1. Performance of FP Growth in Large
Datasets
[Figure, left: run time (sec.) vs. support threshold (%) for D1 FP-growth runtime and D1 Apriori runtime on data set T25I20D10K; FP-Growth vs. Apriori]
[Figure, right: runtime (sec.) vs. support threshold (%) for D2 FP-growth and D2 TreeProjection on data set T25I20D100K; FP-Growth vs. Tree-Projection]
64
1. ECLAT: Mining by Exploring Vertical Data Format
Vertical format: t(AB) = {T11, T25, …}
tid-list: list of trans.-ids containing an itemset
Deriving frequent patterns based on vertical intersections
t(X) = t(Y): X and Y always happen together
t(X) ⊆ t(Y): a transaction having X always has Y
Using diffset to accelerate mining
Only keep track of differences of tids
t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
Diffset (XY, X) = {T2}
Eclat (Zaki et al. @KDD’97)
Mining Closed patterns using vertical format: CHARM (Zaki &
Hsiao@SDM’02)
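A tiny Python sketch of the vertical format, tid-list intersection, and diffsets, using the small TDB from the Apriori example (slide 55):

```python
# Vertical format: each item maps to the set of transaction ids containing it
# (item D omitted: it appears in only one transaction)
tidlists = {
    "A": {10, 30},
    "B": {20, 30, 40},
    "C": {10, 20, 30},
    "E": {20, 30, 40},
}

def t(items):
    """tid-list of an itemset = intersection of its items' tid-lists."""
    out = None
    for i in items:
        out = tidlists[i] if out is None else out & tidlists[i]
    return out

print(t("BE"))           # {20, 30, 40}: t(B) = t(E), so B and E always occur together
print(t("BC"))           # {20, 30}; support of {B, C} is the size of this set
print(t("B") - t("BC"))  # diffset(BC, B) = {40}: only the difference needs to be kept
```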
65
1. Interestingness Measure: Correlations (Lift)
play basketball → eat cereal [40%, 66.7%] is misleading
The overall % of students eating cereal is 75% > 66.7%.
play basketball → not eat cereal [20%, 33.3%] is more accurate, although
with lower support and confidence
Measure of dependent/correlated events: lift
lift = P(A ∪ B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

lift(B, C) = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33
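The same lift computation for the basketball/cereal table in a few lines of Python:

```python
total = 5000
n_b = 3000        # basketball
n_c = 3750        # cereal
n_bc = 2000       # basketball and cereal
n_b_notc = 1000   # basketball and not cereal

def lift(n_xy, n_x, n_y, n):
    # lift = P(X ∪ Y) / (P(X) P(Y))
    return (n_xy / n) / ((n_x / n) * (n_y / n))

print(round(lift(n_bc, n_b, n_c, total), 2))              # 0.89 -> negatively correlated
print(round(lift(n_b_notc, n_b, total - n_c, total), 2))  # 1.33 -> positively correlated
```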
66
2. Classification: Basic Concepts
Classification: Basic Concepts
Decision Tree Induction
Bayes Classification Methods
Rule-Based Classification
Model Evaluation and Selection
Techniques to Improve Classification Accuracy: Ensemble
Methods
67
2. Supervised vs. Unsupervised Learning
Supervised learning (classification)
Supervision: The training data (observations,
measurements, etc.) are accompanied by labels indicating
the class of the observations
New data is classified based on the training set
Unsupervised learning (clustering)
The class labels of training data are unknown
Given a set of measurements, observations, etc. with the
aim of establishing the existence of classes or clusters in
the data
68
2. Prediction Problems: Classification vs.
Numeric Prediction
Classification
predicts categorical class labels (discrete or nominal)
classifies data (constructs a model) based on the training set and
the values (class labels) in a classifying attribute and uses it in
classifying new data
Numeric Prediction
Models continuous-valued functions, i.e., predicts unknown or
missing values
Typical applications
Credit/loan approval
Medical diagnosis: if a tumor is cancerous or benign
Fraud detection: if a transaction is fraudulent
Web page categorization: which category it is
69
2. Classification—A Two-Step Process
Model construction: describing a set of predetermined classes
Each tuple/sample is assumed to belong to a predefined class,
as determined by the class label attribute
The set of tuples used for model construction is training set
The model is represented as classification rules, decision trees,
or mathematical formulae
Model usage: for classifying future or unknown objects
Estimate accuracy of the model
The known label of test sample is compared with the
classified result from the model
Accuracy rate is the percentage of test set samples that are
correctly classified by the model
Test set is independent of training set (otherwise overfitting)
If the accuracy is acceptable, use the model to classify data
tuples whose class labels are not known
70
2. Process (1): Model Construction
Training data:
NAME   RANK             YEARS   TENURED
Mike   Assistant Prof   3       no
Mary   Assistant Prof   7       yes
Bill   Professor        2       yes
Jim    Associate Prof   7       yes
Dave   Assistant Prof   6       no
Anne   Associate Prof   3       no

The classification algorithm produces the classifier (model), e.g.:
IF rank = 'professor' OR years > 6
THEN tenured = 'yes'
71
2. Process (2): Using the Model in Prediction
Testing data:
NAME      RANK             YEARS   TENURED
Tom       Assistant Prof   2       no
Merlisa   Associate Prof   7       no
George    Professor        5       yes
Joseph    Assistant Prof   7       yes

Unseen data: (Jeff, Professor, 4) → Tenured?
72
2. Decision Tree Induction: An Example
Training data set: Buys_computer
The data set follows an example of Quinlan's ID3 (Playing Tennis)

age     income   student   credit_rating   buys_computer
<=30    high     no        fair            no
<=30    high     no        excellent       no
31…40   high     no        fair            yes
>40     medium   no        fair            yes
>40     low      yes       fair            yes
>40     low      yes       excellent       no
31…40   low      yes       excellent       yes
<=30    medium   no        fair            no
<=30    low      yes       fair            yes
>40     medium   yes       fair            yes
<=30    medium   yes       excellent       yes
31…40   medium   no        excellent       yes
31…40   high     yes       fair            yes
>40     medium   no        excellent       no

Resulting tree:
age?
  <=30: student?
    no: no
    yes: yes
  31..40: yes
  >40: credit rating?
    excellent: no
    fair: yes
73
2. Attribute Selection Measure:
Information Gain (ID3/C4.5)
Select the attribute with the highest information gain
Let pi be the probability that an arbitrary tuple in D belongs to
class Ci, estimated by |Ci, D|/|D|
Expected information (entropy) needed to classify a tuple in D:
Info(D) = - Σ_{i=1..m} pi log2(pi)
Information needed (after using A to split D into v partitions) to classify D:
InfoA(D) = Σ_{j=1..v} (|Dj| / |D|) × Info(Dj)
Information gained by branching on attribute A:
Gain(A) = Info(D) - InfoA(D)
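A short Python sketch of Info(D), InfoA(D), and Gain(A), checked against the age attribute of the buys_computer data worked through on the next slide:

```python
from math import log2
from collections import Counter

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain(rows, attr, label="buys_computer"):
    labels = [r[label] for r in rows]
    info_a = 0.0
    for value in {r[attr] for r in rows}:
        part = [r[label] for r in rows if r[attr] == value]
        info_a += len(part) / len(rows) * info(part)
    return info(labels) - info_a

# 9 "yes" and 5 "no" tuples, split by age as in the worked example
rows = (
    [{"age": "<=30", "buys_computer": "yes"}] * 2 +
    [{"age": "<=30", "buys_computer": "no"}] * 3 +
    [{"age": "31..40", "buys_computer": "yes"}] * 4 +
    [{"age": ">40", "buys_computer": "yes"}] * 3 +
    [{"age": ">40", "buys_computer": "no"}] * 2
)
print(round(gain(rows, "age"), 2))   # 0.25, i.e. Gain(age) ≈ 0.246 up to rounding
```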
74
2. Attribute Selection: Information Gain
Class P: buys_computer = “yes” (9 tuples); Class N: buys_computer = “no” (5 tuples)
(Training data: the Buys_computer table from the previous example.)

Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

age     pi   ni   I(pi, ni)
<=30    2    3    0.971
31…40   4    0    0
>40     3    2    0.971

Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694

(5/14) I(2,3) means “age <= 30” has 5 out of 14 samples, with 2 yes'es and 3 no's.
Hence Gain(age) = Info(D) - Infoage(D) = 0.246

Similarly,
Gain(income) = 0.029
Gain(student) = 0.151
Gain(credit_rating) = 0.048
75
2. Presentation of Classification Results
76
2. Visualization of a Decision Tree in
SGI/MineSet 3.0
77
3. What is Cluster Analysis?
Cluster: A collection of data objects
similar (or related) to one another within the same group
dissimilar (or unrelated) to the objects in other groups
Cluster analysis (or clustering, data segmentation, …)
Finding similarities between data according to the
characteristics found in the data and grouping similar data
objects into clusters
Unsupervised learning: no predefined classes (i.e., learning by
observations vs. learning by examples: supervised)
Typical applications
As a stand-alone tool to get insight into data distribution
As a preprocessing step for other algorithms
78
3. Quality: What Is Good Clustering?
A good clustering method will produce high quality clusters
high intra-class similarity: cohesive within clusters
low inter-class similarity: distinctive between clusters
The quality of a clustering method depends on
the similarity measure used by the method
its implementation, and
Its ability to discover some or all of the hidden patterns
79
2. Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e.,
predicts class membership probabilities
Foundation: Based on Bayes’ Theorem.
Performance: A simple Bayesian classifier, naïve Bayesian
classifier, has comparable performance with decision tree and
selected neural network classifiers
Incremental: Each training example can incrementally
increase/decrease the probability that a hypothesis is correct —
prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally
intractable, they can provide a standard of optimal decision
making against which other methods can be measured
80
2. Bayesian Theorem: Basics
Let X be a data sample (“evidence”): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X), (posteriori probability),
the probability that the hypothesis holds given the observed
data sample X
P(H) (prior probability), the initial probability
E.g., X will buy computer, regardless of age, income, …
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing the sample
X, given that the hypothesis holds
E.g., Given that X will buy computer, the prob. that X is
31..40, medium income
81
2. Bayesian Theorem
Given training data X, posteriori probability of a hypothesis
H, P(H|X), follows the Bayes theorem
P(H|X) = P(X|H) P(H) / P(X)
Informally, this can be written as
posteriori = likelihood x prior/evidence
Predicts X belongs to Ci iff the probability P(Ci|X) is the
highest among all the P(Ck|X) for all the k classes
Practical difficulty: require initial knowledge of many
probabilities, significant computational cost
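A tiny numeric sketch of the theorem; the probabilities below are made up purely to show the computation:

```python
# Hypothetical numbers: prior, likelihood, and evidence
p_h = 0.3          # P(H): prior probability that a customer buys a computer
p_x_given_h = 0.4  # P(X|H): prob. of the observed profile X among buyers
p_x = 0.2          # P(X): overall probability of observing profile X

# Bayes' theorem: P(H|X) = P(X|H) P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(p_h_given_x)   # 0.6 -> posterior probability that this customer buys
```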
82
2. Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
R: IF age = youth AND student = yes THEN buys_computer = yes
Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
ncovers = # of tuples covered by R
ncorrect = # of tuples correctly classified by R
coverage(R) = ncovers /|D| /* D: training data set */
accuracy(R) = ncorrect / ncovers
If more than one rule is triggered, need conflict resolution
Size ordering: assign the highest priority to the triggering rule that has
the “toughest” requirement (i.e., with the most attribute tests)
Class-based ordering: decreasing order of prevalence or
misclassification cost per class
Rule-based ordering (decision list): rules are organized into one long
priority list, according to some measure of rule quality or by experts
83
2. Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds
the class prediction
Rules are mutually exclusive and exhaustive
Example: Rule extraction from our buys_computer decision tree
IF age = young AND student = no THEN buys_computer = no
IF age = young AND student = yes THEN buys_computer = yes
IF age = mid-age THEN buys_computer = yes
IF age = old AND credit_rating = excellent THEN buys_computer = no
IF age = old AND credit_rating = fair THEN buys_computer = yes
84
2. Model Evaluation and Selection
Evaluation metrics: How can we measure accuracy? Other
metrics to consider?
Use test set of class-labeled tuples instead of training set
when assessing accuracy
Methods for estimating a classifier’s accuracy:
Holdout method, random subsampling
Cross-validation
Bootstrap
Comparing classifiers:
Confidence intervals
Cost-benefit analysis and ROC Curves
85
3. Clustering for Data Understanding and Applications
Biology: taxonomy of living things: kingdom, phylum, class, order,
family, genus and species
Information retrieval: document clustering
Land use: Identification of areas of similar land use in an earth
observation database
Marketing: Help marketers discover distinct groups in their customer
bases, and then use this knowledge to develop targeted marketing
programs
City-planning: Identifying groups of houses according to their house
type, value, and geographical location
Earthquake studies: Observed earthquake epicenters should be
clustered along continent faults
Climate: understanding earth climate, finding patterns in atmospheric
and ocean data
Economic Science: market research
86
3. Clustering as a Preprocessing Tool (Utility)
Summarization:
Compression:
Image processing: vector quantization
Finding K-nearest Neighbors
Preprocessing for regression, PCA, classification, and
association analysis
Localizing search to one or a small number of clusters
Outlier detection
Outliers are often viewed as those “far away” from any
cluster
87
3. Measure the Quality of Clustering
Dissimilarity/Similarity metric
Similarity is expressed in terms of a distance function,
typically metric: d(i, j)
The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal,
ratio, and vector variables
Weights should be associated with different variables
based on applications and data semantics
Quality of clustering:
There is usually a separate “quality” function that measures
the “goodness” of a cluster.
It is hard to define “similar enough” or “good enough”
The answer is typically highly subjective
88
4. What Are Outliers?
Outlier: A data object that deviates significantly from the normal objects
as if it were generated by a different mechanism
Ex.: Unusual credit card purchase; in sports: Michael Jordan, Wayne
Gretzky, ...
Outliers are different from the noise data
Noise is random error or variance in a measured variable
Noise should be removed before outlier detection
Outliers are interesting: they violate the mechanism that generates the
normal data
Outlier detection vs. novelty detection: early stage, outlier; but later merged
into the model
Applications:
Credit card fraud detection
Telecom fraud detection
Customer segmentation
Medical analysis
89
4. Types of Outliers (I)
Three kinds: global, contextual and collective outliers
Global outlier (or point anomaly)
Global Outlier
Object is Og if it significantly deviates from the rest of the data set
Ex. Intrusion detection in computer networks
Issue: Find an appropriate measurement of deviation
Contextual outlier (or conditional outlier)
Object is Oc if it deviates significantly based on a selected context
Ex. 80° F in Urbana: outlier? (depending on summer or winter?)
Attributes of data objects should be divided into two groups
Contextual attributes: defines the context, e.g., time & location
Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
Can be viewed as a generalization of local outliers—whose density
significantly deviates from its local area
Issue: How to define or formulate meaningful context?
90
4. Types of Outliers (II)
Collective Outliers
A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data
objects may not be outliers
Applications: E.g., intrusion detection:
Collective Outlier
When a number of computers keep sending denial-of-service packets to each other
Detection of collective outliers:
Consider not only behavior of individual objects, but also that of
groups of objects
Need to have the background knowledge on the relationship
among data objects, such as a distance or similarity measure on
objects.
A data set may have multiple types of outliers
One object may belong to more than one type of outlier
91
4. Challenges of Outlier Detection
Modeling normal objects and outliers properly
Hard to enumerate all possible normal behaviors in an application
The border between normal and outlier objects is often a gray area
Application-specific outlier detection
Choice of distance measure among objects and the model of
relationship among objects are often application-dependent
E.g., in clinical data, a small deviation could be an outlier; in
marketing analysis, only larger fluctuations would be
Handling noise in outlier detection
Noise may distort the normal objects and blur the distinction between
normal objects and outliers. It may help hide outliers and reduce the
effectiveness of outlier detection
Understandability
Understand why these are outliers: Justification of the detection
Specify the degree of an outlier: the unlikelihood of the object being
generated by a normal mechanism
92
END
93