Course 2: Data Preprocessing (資料預處理)
Data Mining (資料探勘)
國立聯合大學 資訊管理學系 陳士杰老師
 Outline

Why preprocess the data? (為何要做資料的預處理)

Descriptive data summarization (資料的摘要性描述)

Data cleaning (資料清理)

Data integration and transformation (資料整合與轉換)

Data reduction (資料縮減)

Discretization and concept hierarchy generation (離散化與
概念分層的產生)

Summary
 Why Data Preprocessing?

Data in the real world is dirty – once enough people are involved, all kinds of odd cases can show up!!

Incomplete (不完整的):
- lacking attribute values, or lacking certain attributes of interest
- e.g., occupation=“ ”

Noisy (含噪音的):
- containing errors or outliers
- e.g., Salary=“-10”

Inconsistent (不一致的):
- containing discrepancies in codes or names
- e.g., Age=“42”, Birthday=“03/07/1997”
- e.g., ratings were “1, 2, 3”, now they are “A, B, C”
- e.g., discrepancies between duplicate records
Why Is Data Dirty?

Incomplete data may come from
- “Not applicable (不合用)” data values when collected
- Different considerations between the time when the data was collected and when it is analyzed
- Human/hardware/software problems

Noisy data (incorrect values) may come from
- Faulty data collection instruments (e.g., poorly designed questionnaires)
- Human or computer error at data entry
- Errors in data transmission

Inconsistent data may come from
- Different data sources
- Functional dependency (功能相依性) violations (e.g., modifying some linked data)
Why Is Data Preprocessing Important?

No quality data, no quality mining results!

Quality decisions must be based on quality data
- e.g., duplicate or missing data may cause incorrect or even misleading statistics

A data warehouse needs consistent integration of quality data

Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse
Multi-Dimensional Measure of Data Quality

A well-accepted multidimensional view:

Accuracy (精確性)

Completeness (完整性)

Consistency (一致性)

Timeliness (及時性)

Believability (可信度)

Value added (附加價值)

Interpretability (可解釋性)

Accessibility (易接受)
Major Tasks in Data Preprocessing

Data cleaning (資料清理)
- Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies

Data integration (資料整合)
- Integration of multiple databases, data cubes, or files

Data transformation (資料轉換)
- Normalization and aggregation

Data reduction (資料縮減)
- Obtains a reduced representation in volume but produces the same or similar analytical results

Data discretization (資料離散化)
- Part of data reduction but with particular importance, especially for numerical data
Forms of Data Preprocessing
 Data Cleaning

Data cleaning tasks

Fill in missing values (填寫空缺值)

Identify outliers and smooth out noisy data (識別孤立點與消除
噪音資料)

Correct inconsistent data (解決不一致資料)
Missing Data

Data is not always available



E.g., many tuples have no recorded value for several attributes, such
as customer income in sales data
Missing data may be due to

equipment malfunction (設備異常)

inconsistent with other recorded data and thus deleted (與其它已存
在資料不一致而遭刪除)

data not entered due to misunderstanding (因為誤解而資料沒有被
輸入)

certain data may not be considered important at the time of entry
(在輸入時,因為得不到應用的重視而沒有被輸入)
Missing data may need to be inferred.
How to Handle Missing Data?

Ignore the tuple:
- usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably

Fill in the missing value manually:
- tedious + infeasible?

Fill it in automatically with
- a global constant: e.g., “unknown” – effectively a new class?!
- the attribute mean
- the attribute mean for all samples belonging to the same class: smarter
- the most probable value: inference-based, such as a Bayesian formula or a decision tree
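As a concrete illustration of the automatic filling strategies above, here is a minimal pandas sketch (the column and class names are made up for the example) that fills missing values with a global constant, the attribute mean, and the per-class attribute mean.

import pandas as pd
import numpy as np

# Toy data: 'income' has missing values, 'class' is the class label.
df = pd.DataFrame({
    "income": [30_000, np.nan, 52_000, np.nan, 61_000, 45_000],
    "class":  ["low",  "low",  "high", "high", "high", "low"],
})

# 1) Fill with a global constant (acts like an artificial "unknown" value).
filled_const = df["income"].fillna(-1)

# 2) Fill with the attribute mean.
filled_mean = df["income"].fillna(df["income"].mean())

# 3) Fill with the attribute mean of samples in the same class (smarter).
filled_class_mean = df.groupby("class")["income"].transform(
    lambda s: s.fillna(s.mean())
)

print(filled_mean.tolist())
print(filled_class_mean.tolist())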
Noisy Data

Noise: random error or variance in a measured variable

Incorrect attribute values may be due to


faulty data collection instruments (資料收集工具的缺失)

data entry problems (資料輸入問題)

data transmission problems (資料傳輸問題)

technology limitation (技術限制)

inconsistency in naming convention (命名規則不一致)
Other data problems which require data cleaning

duplicate records (重覆記錄)

incomplete data (不完整的資料)

inconsistent data (不一致的資料)
How to Handle Noisy Data?


Binning (分箱)
- first sort data and partition into (equal-depth) bins
- then one can smooth by bin means, smooth by bin median, smooth by bin boundaries, etc.

Regression (回歸)
- smooth by fitting the data into regression functions

Clustering (聚類)
- detect and remove outliers

Combined computer and human inspection (電腦與人工判斷的結合)
- detect suspicious values by computer, and check by human (e.g., deal with possible outliers)
Simple Discretization Methods: Binning

Binning methods smooth a stored data value by consulting its
“neighborhood” (the values around it).

Equal-width (distance) partitioning

Divides the range into N intervals of equal size


e.g., Box 1: 1 ~ 10, Box 2: 11 ~ 20, Box 3: 21 ~ 30, …
Equal-depth (frequency) partitioning

Divides the range into N intervals, each containing approximately the same number of samples
- e.g., each bin can hold 4 records
Binning Methods for Data Smoothing

Step 1:
* Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
* Partition into equal-depth bins:
- Bin 1: 4, 8, 9, 15
- Bin 2: 21, 21, 24, 25
- Bin 3: 26, 28, 29, 34

Step 2:
* Smoothing by bin means:
- Bin 1: 9, 9, 9, 9
- Bin 2: 23, 23, 23, 23
- Bin 3: 29, 29, 29, 29
* Smoothing by bin boundaries:
- Bin 1: 4, 4, 4, 15
- Bin 2: 21, 21, 25, 25
- Bin 3: 26, 26, 26, 34
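The following is a small Python sketch of the example above, assuming equal-depth bins of 4 values; it reproduces smoothing by (rounded) bin means and by bin boundaries.

# Equal-depth binning with smoothing by means and by boundaries (sketch).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]  # already sorted
depth = 4                                               # 4 values per bin

bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value becomes its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the closer of the bin's min/max.
by_bounds = [[min(b) if v - min(b) <= max(b) - v else max(b) for v in b]
             for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]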
Regression
(figure: data smoothed by fitting a regression line y = x + 1; the observed value Y1 at X1 is replaced by the fitted value Y1’)
Cluster Analysis
 Data Integration

Data integration:


Combines data from multiple sources into a coherent store
There are a number of issues to consider during data
integration.

Schema integration (概要整合)

Redundancy (冗餘資料)

Detection and resolution of data value conflicts (資料值衝突的偵測與
解決)

Careful integration of the data from multiple sources may help
reduce/avoid redundancies and inconsistencies and improve
mining speed and quality
Schema integration:
- Identify real world entities from multiple data sources
- Entity identification problem:
  - e.g., A.cust-id and B.cust-# refer to the same attribute
- Integrate metadata from different sources
  - The most common definition of metadata is “data about data”: data that describes other data, mainly information about the attributes of the data
  - Examples of metadata for each attribute include the name, meaning, data type, and range of values permitted for the attribute
  - Metadata can be used to help avoid errors in schema integration

Detecting and resolving data value conflicts:
- For the same real world entity, attribute values from different sources may differ
- Possible reasons:
  - different representations
  - different scales, e.g., metric (公制) vs. British (英制) units
Handling Redundancy in Data Integration


Redundant data often occur when integrating multiple databases, because:
- The same attribute or object may have different names in different databases
- One attribute may be a “derived” attribute in another table, e.g., annual revenue

Redundant attributes may be detected by correlation analysis
Correlation Analysis (Numerical Data)

Correlation coefficient (also called Pearson’s product moment coefficient)

    r_{A,B} = \frac{\sum (A - \bar{A})(B - \bar{B})}{(n-1)\,\sigma_A \sigma_B} = \frac{\sum (AB) - n\,\bar{A}\,\bar{B}}{(n-1)\,\sigma_A \sigma_B}

where n is the number of tuples, \bar{A} and \bar{B} are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ(AB) is the sum of the AB cross-product.

If r_{A,B} > 0, A and B are positively correlated (A’s values increase as B’s do). The higher the value, the stronger the correlation.

r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated
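A minimal numpy sketch of the coefficient defined above; the two attributes used here are made-up toy data.

import numpy as np

def pearson_r(a, b):
    """Pearson product-moment correlation between two numeric attributes."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n = len(a)
    # Sample standard deviations (ddof=1) to match the (n-1) in the formula.
    return ((a - a.mean()) * (b - b.mean())).sum() / ((n - 1) * a.std(ddof=1) * b.std(ddof=1))

age    = [23, 31, 45, 52, 60]
income = [30_000, 42_000, 55_000, 61_000, 72_000]
print(pearson_r(age, income))        # close to +1: strongly positively correlated
# np.corrcoef(age, income)[0, 1] gives the same value.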
Correlation Analysis (Categorical Data 類別資料)

Χ² (chi-square) test

    \chi^2 = \sum \frac{(\text{Observed} - \text{Expected})^2}{\text{Expected}}

The larger the Χ2 value, the more likely the variables are
related

The cells that contribute the most to the Χ2 value are those
whose actual count is very different from the expected count

Correlation does not imply causality

# of hospitals and # of car-theft in a city are correlated

Both are causally linked to the third variable: population
Chi-Square Calculation: An Example

                            Play chess   Not play chess   Sum (row)
Like science fiction          250 (90)        200 (360)         450
Not like science fiction      50 (210)       1000 (840)        1050
Sum (col.)                         300             1200         1500

Χ² (chi-square) calculation (numbers in parentheses are the expected counts, calculated from the data distribution in the two categories):

    \chi^2 = \frac{(250-90)^2}{90} + \frac{(50-210)^2}{210} + \frac{(200-360)^2}{360} + \frac{(1000-840)^2}{840} = 507.93

It shows that like_science_fiction and play_chess are correlated in the group
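A small numpy sketch, assuming the contingency table above, that recomputes the expected counts from the row and column sums and then the χ² statistic.

import numpy as np

# Observed counts: rows = like / not like science fiction,
# columns = play chess / not play chess.
observed = np.array([[250, 200],
                     [50, 1000]], dtype=float)

row_sums = observed.sum(axis=1, keepdims=True)   # 450, 1050
col_sums = observed.sum(axis=0, keepdims=True)   # 300, 1200
total = observed.sum()                           # 1500

expected = row_sums @ col_sums / total           # [[90, 360], [210, 840]]
chi2 = ((observed - expected) ** 2 / expected).sum()
print(expected)
print(chi2)   # about 507.94; the slide's 507.93 comes from rounding each term first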
 Data Transformation

Smoothing (平滑): remove noise from data

Aggregation (聚集): summarization, data cube
construction

Generalization (一般化): concept hierarchy climbing

Normalization (正規化): scaled to fall within a small,
specified range


min-max normalization

z-score normalization

normalization by decimal scaling
Attribute construction (屬性構造):

New attributes constructed from the given ones
Data Transformation: Normalization

Min-max normalization: to [new_min_A, new_max_A]

    v' = \frac{v - min_A}{max_A - min_A}\,(new\_max_A - new\_min_A) + new\_min_A

- Ex. Let income range from $12,000 to $98,000 be normalized to [0.0, 1.0]. Then $73,600 is mapped to
      (73,600 − 12,000) / (98,000 − 12,000) × (1.0 − 0) + 0 = 0.716

Z-score normalization (μ: mean, σ: standard deviation):

    v' = \frac{v - \mu_A}{\sigma_A}

- Ex. Let μ = 54,000 and σ = 16,000. Then (73,600 − 54,000) / 16,000 = 1.225

Normalization by decimal scaling

    v' = \frac{v}{10^j}, where j is the smallest integer such that \max(|v'|) < 1
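A minimal numpy sketch of the three normalizations, reusing the income figures from this slide (the values used for decimal scaling are made up).

import numpy as np

def min_max(v, vmin, vmax, new_min=0.0, new_max=1.0):
    return (v - vmin) / (vmax - vmin) * (new_max - new_min) + new_min

def z_score(v, mu, sigma):
    return (v - mu) / sigma

def decimal_scaling(values):
    # j is the smallest integer such that max(|v'|) < 1
    j = int(np.floor(np.log10(np.max(np.abs(values))))) + 1
    return np.asarray(values) / 10 ** j

print(min_max(73_600, 12_000, 98_000))      # ~0.716
print(z_score(73_600, 54_000, 16_000))      # 1.225
print(decimal_scaling([-10, 250, 730]))     # [-0.01, 0.25, 0.73]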
 Data Reduction Strategies

Why data reduction? (為何需要資料縮減?)
- A database/data warehouse may store terabytes of data
- Complex data analysis/mining may take a very long time to run on the complete data set

Data reduction
- Obtain a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results

Data reduction strategies
- Data cube aggregation (資料立方體聚集) – aggregation operations
- Attribute subset selection (屬性子集合選擇) – e.g., remove unimportant attributes
- Dimensionality reduction (維度縮減) / data compression (資料壓縮) – encoding mechanisms
- Numerosity reduction (數值縮減) – e.g., fit data into models
- Discretization and concept hierarchy generation (離散化和概念分層產生)
Data Cube Aggregation

Data cubes store multidimensional aggregated information.

Data cubes are discussed in detail in Chapter 3

For example: (data cube figure)
Attribute Subset Selection

Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task or redundant.

Reduce the data set size by removing irrelevant or
redundant attributes (or dimensions).

The goal of attribute subset selection is to find a minimum set of
attributes such that the resulting probability distribution of the
data classes is as close as possible to the original distribution
obtained using all attributes.
How can we find a ‘good’ subset of the original attributes?
- For n attributes, there are 2^n possible subsets.
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection (向前逐步選擇)
  - Step-wise backward elimination (向後逐步排除)
  - Combining forward selection and backward elimination (結合向前逐步選擇與向後逐步排除)
  - Decision-tree induction (決策樹歸納)
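A minimal sketch of step-wise forward selection, assuming a scikit-learn style classifier and cross-validated accuracy as the selection criterion; the DataFrame `df` and its `label` column in the usage note are placeholders.

from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def forward_selection(X, y, estimator, k):
    """Greedily add the attribute that most improves CV accuracy, k times."""
    selected, remaining = [], list(X.columns)
    for _ in range(k):
        scores = {a: cross_val_score(estimator, X[selected + [a]], y, cv=5).mean()
                  for a in remaining}
        best = max(scores, key=scores.get)
        selected.append(best)
        remaining.remove(best)
    return selected

# Usage (placeholder pandas DataFrame `df` with a 'label' column):
# best_attrs = forward_selection(df.drop(columns="label"), df["label"],
#                                DecisionTreeClassifier(), k=3)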
Dimensionality Reduction

Data encoding or transformations are applied so as to obtain a
reduced or “compressed” representation of the original data.


Lossless (無損壓縮): If the original data can be reconstructed from the
compressed data without any loss of information.
Lossy (有損壓縮): we can reconstruct only an approximation of the
original data.
(figure: the original data is encoded into compressed data; a lossless scheme reconstructs the original data exactly, while a lossy scheme reconstructs only an approximation)
String compression (字串壓縮)
- There are extensive theories and well-tuned algorithms
- Typically lossless
- But only limited manipulation is possible without expansion

Audio/video compression
- Typically lossy compression, with progressive refinement
- Sometimes small fragments of the signal can be reconstructed without reconstructing the whole

Two popular and effective methods of lossy dimensionality reduction:
- Wavelet transforms (小波轉換)
- Principal components analysis (主成份分析)
Numerosity Reduction


Reduce data volume by choosing alternative, smaller forms
of data representation
Parametric methods



Assume the data fits some model, estimate model parameters,
store only the parameters, and discard the data (except possible
outliers)
Example: Log-linear models obtain the value at a point in m-D space as the product of values on appropriate marginal subspaces
Non-parametric methods


Do not assume models
Major families: histograms, clustering, sampling
Regression Analysis and Log-Linear Models

Linear regression: Y = w X + b
- Two regression coefficients, w and b, specify the line and are to be estimated by using the data at hand
- Use the least squares criterion on the known values of Y1, Y2, …, X1, X2, …

Multiple regression: Y = b0 + b1 X1 + b2 X2
- Many nonlinear functions can be transformed into the above

Log-linear models:
- The multi-way table of joint probabilities is approximated by a product of lower-order tables
- Probability: p(a, b, c, d) = \alpha_{ab}\,\beta_{ac}\,\chi_{ad}\,\delta_{bcd}
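A minimal numpy sketch of estimating the two coefficients w and b by the least squares criterion; the toy data is made up to roughly follow y = 2x + 1.

import numpy as np

# Toy data roughly following y = 2x + 1 with a little noise.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 9.0, 10.8])

# Least squares estimates of the two regression coefficients:
#   w = Σ(x - x̄)(y - ȳ) / Σ(x - x̄)²,   b = ȳ - w·x̄
w = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()
b = y.mean() - w * x.mean()
print(w, b)     # close to 2 and 1

# np.polyfit(x, y, deg=1) returns the same (w, b) pair.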
Histograms

Divide data into buckets and store average (sum) for each
bucket

Partitioning rules:

Equal-width

Equal-frequency (or equal-depth)
(figure: equal-width histogram of prices; x-axis from 10,000 to 90,000, y-axis counts from 0 to 40)
Clustering

Partition data set into clusters based on similarity, and store cluster
representation (e.g., centroid and diameter) only

Can be very effective if data is clustered, but not if data is “smeared”
(資料界限模糊)

Can have hierarchical clustering and be stored in multi-dimensional
index tree structures

There are many choices of clustering definitions and clustering
algorithms

Cluster analysis will be studied in depth in Chapter 7
Sampling


Sampling: obtaining a small sample s to represent the whole data set N

The most common ways:
- Simple random sample without replacement (SRSWOR) of size s
  - Draw a tuple, record it, and do not replace it
  - The probability of drawing any tuple in D is 1/N
- Simple random sample with replacement (SRSWR) of size s
  - Draw a tuple, record it, and replace it
  - A tuple may be drawn again
- Cluster sample
  - Group the tuples in D into M mutually disjoint clusters
  - An SRS of s clusters can be obtained
- Stratified sample
  - D is divided into mutually disjoint parts called strata
  - A stratified sample of D is generated by obtaining an SRS from each stratum
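A minimal pandas sketch of the four sampling schemes, assuming a DataFrame D with a `stratum` column for the stratified case; all names and sizes are illustrative.

import pandas as pd
import numpy as np

D = pd.DataFrame({
    "value":   np.arange(100),
    "stratum": np.repeat(["young", "middle", "senior"], [40, 40, 20]),
})
s = 10

srswor = D.sample(n=s, replace=False)    # without replacement
srswr  = D.sample(n=s, replace=True)     # with replacement (duplicates possible)

# Stratified sample: an SRS drawn from each stratum (10% of each here).
stratified = D.groupby("stratum", group_keys=False).apply(lambda g: g.sample(frac=0.1))

# Cluster sample: partition tuples into M clusters, then take an SRS of whole clusters.
D["cluster"] = D.index // 20             # M = 5 clusters of 20 tuples each
chosen = np.random.choice(D["cluster"].unique(), size=2, replace=False)
cluster_sample = D[D["cluster"].isin(chosen)]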
Sampling: with or without Replacement
(figure: the raw data sampled with and without replacement)
Sampling: Cluster or Stratified Sampling
(figure: the raw data partitioned into a cluster/stratified sample)
 Discretization


Three types of attributes:

Nominal — values from an unordered set, e.g., color, profession

Ordinal — values from an ordered set, e.g., military or academic rank

Continuous — numbers, e.g., integer or real numbers
Discretization:

Divide the range of a continuous attribute into intervals

Some classification algorithms only accept categorical attributes.

Reduce data size by discretization

Prepare for further analysis
Discretization and Concept Hierarchy

Discretization

Reduce the number of values for a given continuous attribute by dividing the
range of the attribute into intervals


Interval labels can then be used to replace actual data values

Discretization can be performed recursively on an attribute
Concept hierarchy formation

Recursively reduce the data by collecting and replacing low level concepts (such as numeric values for age) by higher level concepts (such as young, middle-aged, or senior)
Discretization and Concept Hierarchy Generation
for Numeric Data

Typical methods (all of them can be applied recursively):
- Binning (covered above)
  - Top-down split, unsupervised
- Histogram analysis (covered above)
  - Top-down split, unsupervised
- Clustering analysis (covered above)
  - Either top-down split or bottom-up merge, unsupervised
- Entropy-based discretization: supervised, top-down split
- Segmentation by natural partitioning: top-down split, unsupervised
Entropy-Based Discretization

Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the information gain after partitioning is

    I(S, T) = \frac{|S_1|}{|S|}\,\mathrm{Entropy}(S_1) + \frac{|S_2|}{|S|}\,\mathrm{Entropy}(S_2)

Entropy is calculated based on the class distribution of the samples in the set. Given m classes, the entropy of S1 is

    \mathrm{Entropy}(S_1) = -\sum_{i=1}^{m} p_i \log_2(p_i)

where p_i is the probability of class i in S1

The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization

The process is recursively applied to the partitions obtained until some stopping criterion is met

Such a boundary may reduce data size and improve classification accuracy
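A minimal numpy sketch of one step of entropy-based discretization: it scans candidate boundaries (midpoints between adjacent sorted values) and keeps the one minimizing the weighted entropy I(S, T). The toy values and class labels are made up.

import numpy as np

def entropy(labels):
    """Class entropy of a set of labels: -sum p_i * log2(p_i)."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_split(values, labels):
    """Boundary T minimizing |S1|/|S| * H(S1) + |S2|/|S| * H(S2)."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_i = None, np.inf
    for k in range(1, len(v)):
        if v[k] == v[k - 1]:
            continue
        t = (v[k] + v[k - 1]) / 2
        i = k / len(v) * entropy(y[:k]) + (len(v) - k) / len(v) * entropy(y[k:])
        if i < best_i:
            best_t, best_i = t, i
    return best_t, best_i

ages   = [22, 25, 27, 35, 41, 48, 52, 60]
labels = ["no", "no", "no", "yes", "yes", "yes", "yes", "yes"]
print(best_split(ages, labels))   # boundary near 31 separates the two classes perfectly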
Segmentation by Natural Partitioning



Cluster analysis may produce a concept hierarchy that splits a salary range into intervals such as [51263.98, 60872.34]

Data analysts usually prefer to see the partition in a form such as [50000, 60000]

A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.

Steps:

- If an interval covers 3, 6, 7, or 9 distinct values at its most significant digit, partition it into 3 equal-width sub-intervals (for 7, use three sub-intervals in the proportion 2-3-2)
- If it covers 2, 4, or 8 distinct values at its most significant digit, partition it into 4 equal-width sub-intervals
- If it covers 1, 5, or 10 distinct values at its most significant digit, partition it into 5 equal-width sub-intervals
- Apply the rule recursively to each sub-interval, producing a concept hierarchy for the given numeric attribute
- To avoid distortion caused by extremely large or small values in the data set, the top-level segmentation can be based on the bulk of the probability mass, e.g., the 5%–95% range
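A minimal Python sketch of one level of segmentation under these steps, including the 2-3-2 proportional case for 7 distinct values; the endpoint values in the usage line are made up.

import math

def natural_partition(lo, hi):
    """One level of the 3-4-5 rule: split [lo, hi] into 'natural' cut points."""
    msd = 10 ** math.floor(math.log10(hi - lo))   # magnitude of the most significant digit
    lo_r = math.floor(lo / msd) * msd             # round endpoints to that magnitude
    hi_r = math.ceil(hi / msd) * msd
    n = round((hi_r - lo_r) / msd)                # distinct values at the most significant digit
    if n == 7:                                    # special case: widths in proportion 2-3-2
        return [lo_r, lo_r + 2 * msd, lo_r + 5 * msd, hi_r]
    k = {3: 3, 6: 3, 9: 3, 2: 4, 4: 4, 8: 4, 1: 5, 5: 5, 10: 5}.get(n, n)
    step = (hi_r - lo_r) / k
    return [lo_r + i * step for i in range(k + 1)]

# Top-level segmentation on the bulk of the data (e.g., its 5%-95% range):
print(natural_partition(-159_876, 1_838_761))   # [-1000000.0, 0.0, 1000000.0, 2000000.0]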
Example of 3-4-5 Rule
Concept Hierarchy Generation for Categorical Data


Categorical data are unordered, discrete data; an attribute has a finite (but possibly large) number of values

Methods for generating concept hierarchies for categorical data:

- Specification of a partial/total ordering of attributes explicitly at the schema level by users or experts
  - e.g., street < city < state < country
- Specification of a hierarchy for a set of values by explicit data grouping
  - e.g., {Urbana, Champaign, Chicago} < Illinois
- Specification of only a partial set of attributes
  - e.g., only street < city is given, not the others
- Automatic generation of hierarchies (or attribute levels) by analysis of the number of distinct values
  - e.g., for a set of attributes: {street, city, state, country}
Some hierarchies can be automatically generated based on
the analysis of the number of distinct values per attribute in
the data set

The attribute with the most distinct values is placed at the lowest
level of the hierarchy
country: 15 distinct values
province_or_state: 365 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
 Mining Data Descriptive Characteristics

Motivation
- To better understand the data: its central tendency, variation, and other measures of spread

Data dispersion characteristics
- median, max, min, quartiles, outliers, variance, etc.
Measuring the Central Tendency

Mean (algebraic measure) (sample vs. population):

    \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \mu = \frac{\sum x}{N}

- Weighted arithmetic mean:

    \bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}

- Trimmed mean: chopping extreme values

Median: A holistic measure
- Middle value if odd number of values, or average of the middle two values otherwise
- Estimated by interpolation (for grouped data):

    \mathrm{median} = L_1 + \left(\frac{n/2 - (\sum f)_l}{f_{\mathrm{median}}}\right) c
Mode

Value that occurs most frequently in the data

Unimodal, bimodal, trimodal

Empirical formula: \mathrm{mean} - \mathrm{mode} = 3 \times (\mathrm{mean} - \mathrm{median})
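A minimal Python sketch computing the three measures for a made-up sample and comparing both sides of the empirical formula (it is a rule of thumb, so the two sides only agree approximately).

import numpy as np
from statistics import mode

ages = [22, 25, 25, 27, 30, 33, 35, 41, 52]

mean   = np.mean(ages)     # about 32.2
median = np.median(ages)   # 30 (middle value of the 9 sorted values)
most   = mode(ages)        # 25 (most frequent value)

# Rule of thumb for moderately skewed data: mean - mode ≈ 3 * (mean - median)
print(mean - most, 3 * (mean - median))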
Symmetric vs. Skewed Data

Median, mean and mode of symmetric, positively and
negatively skewed data
Measuring the Dispersion of Data

Quartiles, outliers and boxplots
- Quartiles: Q1 (25th percentile), Q3 (75th percentile)
- Inter-quartile range: IQR = Q3 – Q1
- Five number summary: min, Q1, M (median), Q3, max
- Boxplot: the ends of the box are the quartiles, the median is marked, whiskers are added, and outliers are plotted individually
- Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
Variance and standard deviation (sample: s, population: σ)

- Variance (algebraic, scalable computation):

    s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2 = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{1}{n}\left(\sum_{i=1}^{n} x_i\right)^2\right]

    \sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2 = \frac{1}{N}\sum_{i=1}^{N} x_i^2 - \mu^2

- Standard deviation s (or σ) is the square root of variance s² (or σ²)
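A small numpy sketch, using a made-up sample, verifying that the two algebraic forms of the sample variance above agree and match numpy's own sample variance.

import numpy as np

x = np.array([4.0, 8.0, 9.0, 15.0, 21.0, 24.0])
n = len(x)

s2_direct   = ((x - x.mean()) ** 2).sum() / (n - 1)
s2_scalable = ((x ** 2).sum() - (x.sum() ** 2) / n) / (n - 1)

print(s2_direct, s2_scalable)                      # identical values
print(np.isclose(s2_direct, np.var(x, ddof=1)))    # True: matches numpy's sample variance
print(np.sqrt(s2_direct))                          # sample standard deviation s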
Properties of Normal Distribution Curve

The normal (distribution) curve



From μ–σ to μ+σ: contains about 68% of the measurements (μ: mean, σ:
standard deviation)
From μ–2σ to μ+2σ: contains about 95% of it
From μ–3σ to μ+3σ: contains about 99.7% of it
Boxplot Analysis

Five-number summary of a distribution:
Minimum, Q1, M, Q3, Maximum

Boxplot

Data is represented with a box

The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR

The median is marked by a line within the box

Whiskers: two lines outside the box extend to Minimum and Maximum
The boxplots show the unit prices of items sold at four branches of AllElectronics during a given time period. For branch 1:
- median: $80
- Q1: $60
- Q3: $100
Visualization of Data Dispersion: Boxplot Analysis
Histogram Analysis

Frequency histograms


A univariate graphical method
Consists of a set of rectangles that reflect the counts or frequencies
of the classes present in the given data
Quantile Plot


Displays all of the data (allowing the user to assess both the
overall behavior and unusual occurrences)
Plots quantile information

For data xi sorted in increasing order, fi indicates that approximately 100·fi% of the data are below or equal to the value xi
Quantile-Quantile (Q-Q) Plot


Graphs the quantiles of one univariate distribution against the
corresponding quantiles of another
Allows the user to view whether there is a shift in going from
one distribution to another
Scatter plot


Provides a first look at bivariate data to see clusters of points,
outliers, etc
Each pair of values is treated as a pair of coordinates and
plotted as points in the plane
Loess (Local regression) Curve


Adds a smooth curve to a scatter plot in order to provide
better perception of the pattern of dependence
Loess curve is fitted by setting two parameters:
- a smoothing parameter α
- the degree λ of the polynomials that are fitted by the regression
Positively and Negatively Correlated Data
Not Correlated Data
 Summary

Data preparation or preprocessing is a big issue for both data warehousing and data mining

Descriptive data summarization is needed for quality data preprocessing

Data preparation includes
- Data cleaning and data integration
- Data reduction and feature selection
- Discretization

A lot of methods have been developed, but data preprocessing is still an active area of research