
Data Analysis and Mining 02
— Data Preprocessing
School of Computer Science and Engineering, University of Electronic Science and Technology of China
Lecturer: 张玉宏
[email protected]
Outline
Why preprocess the data?
Data cleaning
Data integration and transformation
Data reduction
Discretization and concept hierarchy generation
Summary
2
I.
Why Data Preprocessing?
Data in the real world is dirty
 incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
 e.g., occupation=“”
 noisy: containing errors or outliers
 e.g., Salary=“-10”
 inconsistent: containing semantic discrepancies in codes or names
 e.g., Age=“42”, Birthday=“03/07/1997”
 e.g., Was rating “1,2,3”, now rating “A, B, C”
 e.g., discrepancy between duplicate records
3
Outliers
Outliers are data objects with characteristics that are considerably different from most of the other data objects in the data set
4
Why Is Data Dirty?
Incomplete data comes from
 “n/a” data values when collected
 different considerations between the time when the data was collected and when it is analyzed
 human/hardware/software problems
Noisy data comes from the process of data
 collection
 entry
 transmission
Inconsistent data comes from
 different data sources
 functional dependency violations
5
(Figure slides 6-7: examples of dirty data; annotations: attributes of interest, stopwatch, discrepancy)
What is redundancy?
(Figure slides 8-9; annotations: redundancy, inherited from or inferred from)
Why Is Data Preprocessing Important?
Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse. —Bill Inmon
10
Major Tasks in Data Preprocessing
Data cleaning
Data integration
Data transformation
Data reduction
Data discretization
11
II.
Data Cleaning
Importance
 “Data cleaning is one of the three biggest problems in data warehousing” —Ralph Kimball
 “Data cleaning is the number one problem in data warehousing” —DCI survey
Data cleaning tasks
 Fill in missing values
 Identify outliers and smooth out noisy data
 Correct inconsistent data
 Resolve redundancy caused by data integration
12
13
14
Missing Data
Data is not always available
 E.g., many tuples have no recorded value for several attributes, such as customer income in sales data
Missing data may be due to
 equipment malfunction
 inconsistency with other recorded data, and thus deletion
 data not entered due to misunderstanding
 certain data may not have been considered important at the time of entry
 failure to register the history or changes of the data
Missing data may need to be inferred.
15
16
How to Handle Missing Data?
Ignore the tuple
 usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
Fill in the missing value manually
 tedious + infeasible when the data set is large
Fill it in automatically with
 a global constant: e.g., “unknown” (in effect, a new class?!)
 the attribute mean
 the attribute mean for all samples belonging to the same class: smarter
 the most probable value: inference-based, such as a Bayesian formula or a decision tree
17
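As a rough illustration of the automatic fill-in strategies above, here is a small pandas sketch; the table and its column names (income, occupation, churn) are hypothetical, not from the slides.

```python
# Sketch of automatic missing-value filling strategies (hypothetical column names).
import pandas as pd

df = pd.DataFrame({
    "income":     [42000, None, 58000, None, 61000, 39000],
    "occupation": ["eng", "eng", None, "sales", "sales", "eng"],
    "churn":      ["no", "no", "yes", "no", "yes", "yes"],  # class label
})

# 1. Global constant for a categorical attribute.
df["occupation"] = df["occupation"].fillna("unknown")

# 2. Attribute mean.
df["income_mean_fill"] = df["income"].fillna(df["income"].mean())

# 3. Attribute mean per class (smarter): use the mean of tuples sharing the class label.
df["income_class_fill"] = df["income"].fillna(
    df.groupby("churn")["income"].transform("mean")
)

print(df)
```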
(Figure slides 18-19; annotations: standardization, notation)
Noise
Noise refers to the modification of original values
 Examples: distortion of a person’s voice when talking on a poor phone connection, and “snow” on a television screen
(Figure: two sine waves, and the same two sine waves plus noise)
20
Noisy Data
Noise: random error or variance in a measured variable
Incorrect attribute values may be due to
 faulty data collection instruments
 data entry problems
 data transmission problems
 technology limitations
 inconsistency in naming conventions
Other data problems that require data cleaning
 duplicate records
 incomplete data
 inconsistent data
21
How to Handle Noisy Data?
Binning methods
Clustering
Combined computer and human inspection
Regression
22
Simple Discretization Methods: Binning
1. Equal-width (distance) partitioning:
 Divides the range into N intervals of equal size: a uniform grid
 If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N.
 The most straightforward approach, but outliers may dominate the presentation!
 Skewed data is not handled well.
23
Simple Discretization Methods: Binning
2. Equal-depth (frequency) partitioning:
 Divides the range into N intervals, each containing approximately the same number of samples
24
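To make the two partitioning schemes concrete, here is a minimal NumPy sketch of equal-width and equal-depth binning, followed by smoothing by bin means (a common binning-based smoothing step, not spelled out on the slides here); the price values and the choice of N = 3 bins are illustrative.

```python
# Equal-width vs. equal-depth binning, plus smoothing by bin means (illustrative values).
import numpy as np

prices = np.array([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34], dtype=float)
N = 3  # number of bins

# Equal-width: W = (B - A) / N
A, B = prices.min(), prices.max()
edges = A + (B - A) / N * np.arange(1, N)          # the N - 1 interior boundaries
ew_bin = np.digitize(prices, edges)                # bin index 0 .. N-1 for each value

# Equal-depth: each bin holds roughly the same number of (sorted) samples
ed_bin = np.repeat(np.arange(N), int(np.ceil(len(prices) / N)))[: len(prices)]

# Smoothing by bin means (using the equal-depth bins): replace each value by its bin mean
smoothed = np.array([prices[ed_bin == b].mean() for b in ed_bin])

print("equal-width bins:", ew_bin)
print("equal-depth bins:", ed_bin)
print("smoothed values :", smoothed)
```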
25
26
27
(Figure slides 28-31; annotations: resort to, intervention, metadata)
Review of a few probability concepts
33
Min-max normalization preserves the relationships among the original data values.
It may, however, run into an “out-of-bounds” error.
34
35
In z-score normalization (zero-mean normalization), the values of attribute A are normalized based on the mean and standard deviation of A. A value v of A is normalized to v'.
Normalization by decimal scaling normalizes by moving the decimal point of the values of attribute A. The number of decimal places moved depends on the maximum absolute value of A.
36
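A small NumPy sketch of the three normalization methods described above (min-max, z-score, and decimal scaling); the sample values and the target range [0, 1] are assumptions, since the original formula slides are not reproduced in this transcript.

```python
# Min-max, z-score, and decimal-scaling normalization (illustrative values).
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# Min-max: map [min, max] onto a new range, here [0, 1] (assumed target range)
new_min, new_max = 0.0, 1.0
v_minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# Z-score (zero-mean): subtract the mean of A, divide by its standard deviation
v_zscore = (v - v.mean()) / v.std()

# Decimal scaling: divide by 10**j, j the smallest integer such that max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
v_decimal = v / (10 ** j)

print(v_minmax, v_zscore, v_decimal, sep="\n")
```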
37
Data Reduction
Comparable to induction (summarization)
e.g., wavelet transforms: extract the dominant data
Principal component analysis: analysis of a sparse matrix
Discretization
38
Data Cube Aggregation
Imagine that you have collected the data for your analysis. These data consist of the AllElectronics sales per quarter, for the years 1997 to 1999.
You are, however, interested in the annual sales (total per year), rather than the total per quarter.
Thus the data can be aggregated so that the resulting data summarize the total sales per year instead of per quarter.
39
Data Cube Aggregation
The resulting data set is smaller in volume,
without loss of information necessary for the
analysis task.
This aggregation is illustrated in Figure 3.4.
40
Data Cube Aggregation
Figure 3.4: Sales data for a given branch of AllElectronics for the
years 1997 to 1999.
In the data on the left, the sales are shown per quarter. In the
data on the right, the data are aggregated to provide the annual
sales.
41
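The quarter-to-year roll-up described above can be expressed as a simple group-by; the sales figures below are made up, since Figure 3.4 itself is not reproduced in this transcript.

```python
# Aggregating quarterly sales to annual sales, as described above (hypothetical figures).
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [1997] * 4 + [1998] * 4 + [1999] * 4,
    "quarter": ["Q1", "Q2", "Q3", "Q4"] * 3,
    "sales":   [224, 408, 350, 586, 301, 456, 396, 612, 334, 470, 412, 654],
})

# Roll the data up from the quarter level to the year level of the time dimension.
annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```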
•When replying to such OLAP queries or data mining
requests, the smallest available cuboid relevant to the
given task should be used.
42
43
Dimensionality Reduction
Data sets for analysis may contain hundreds of attributes, many of which may be irrelevant to the mining task, or redundant.
For example, if the task is to classify customers as to whether or not they are likely to purchase a popular new CD at AllElectronics when notified of a sale, attributes such as the customer's telephone number are likely to be irrelevant, unlike attributes such as age or music taste.
44
Dimensionality Reduction
Leaving out relevant attributes, or keeping irrelevant attributes, may be detrimental, causing confusion for the mining algorithm employed.
This can result in discovered patterns of poor quality.
In addition, the added volume of irrelevant or redundant attributes can slow down the mining process.
45
Dimensionality Reduction
Feature selection (i.e., attribute subset selection):
 Select a minimum set of features such that the probability distribution of the different classes, given the values for those features, is as close as possible to the original distribution given the values of all features
46
Dimensionality Reduction
Heuristic methods:
 step-wise forward selection
 step-wise backward elimination
 combining forward selection and backward elimination
 decision-tree induction
47
Dimensionality Reduction
1. Step-wise forward selection:
 The procedure starts with an empty set of attributes.
 The best of the original attributes is determined and added to the set.
 At each subsequent iteration or step, the best of the remaining original attributes is added to the set.
48
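A sketch of greedy step-wise forward selection. The slides do not fix how the “best” attribute is measured, so cross-validated accuracy of a small decision tree is used here as an assumed scoring function, with the scikit-learn iris data as a stand-in dataset.

```python
# Greedy step-wise forward selection: start empty, repeatedly add the attribute
# that most improves a chosen score (assumption: cross-validated accuracy).
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
n_features = X.shape[1]

selected, remaining = [], list(range(n_features))
best_score = 0.0
while remaining:
    scored = []
    for f in remaining:
        cols = selected + [f]
        score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                X[:, cols], y, cv=5).mean()
        scored.append((score, f))
    score, f = max(scored)
    if score <= best_score:        # stopping criterion: no further improvement
        break
    best_score = score
    selected.append(f)
    remaining.remove(f)

print("selected feature indices:", selected, "score:", round(best_score, 3))
```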
49
Dimensionality Reduction
2. Step-wise backward elimination:
 The procedure starts with the full set of
attributes.
 At each step, it removes the worst
attribute remaining in the set.
50
51
Dimensionality Reduction
3. Combining forward selection and backward elimination:
 The step-wise forward selection and
backward elimination methods can be
combined, where at each step one selects
the best attribute and removes the worst
from among the remaining attributes.
52
53
Dimensionality Reduction
The stopping criteria for methods 1 to 3 may vary.
The procedure may employ a threshold on the measure used to determine when to stop the attribute selection process.
54
Dimensionality Reduction
4. Decision tree induction:
 Decision tree algorithms, such as ID3 and
C4.5, were originally intended for
classification.
55
Dimensionality Reduction
4. Decision tree induction:
 constructs a flow-chart-like structure where each internal (non-leaf) node denotes a test on an attribute,
 each branch corresponds to an outcome of the test, and each external (leaf) node denotes a class prediction.
 At each node, the algorithm chooses the “best” attribute to partition the data into individual classes.
56
57
This tree classifies Saturday mornings according to whether or
not they are suitable for playing tennis.
Initial attribute set:
{Outlook, Sunny, overcast, rain, humidity, wind, man, woman, marriage}
Reduced set:
{Outlook, Sunny, overcast, rain, humidity, wind}
58
4. Decision tree induction
When decision tree induction is used for attribute subset selection, a tree is constructed from the given data.
All attributes that do not appear in the tree are assumed to be irrelevant. The set of attributes appearing in the tree form the reduced subset of attributes.
59
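A sketch of attribute subset selection via decision tree induction using scikit-learn: fit a tree, then keep only the attributes that appear as internal-node tests. The dataset and the depth limit are stand-ins, not from the slides.

```python
# Attribute subset selection via decision tree induction: attributes not appearing
# in the tree are treated as irrelevant (scikit-learn; stand-in dataset).
from sklearn.datasets import load_breast_cancer
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X, y, names = data.data, data.target, data.feature_names

tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X, y)

# tree_.feature holds, for every node, the index of the attribute tested there
# (internal nodes) or -2 for leaves; the tested attributes form the reduced subset.
used = {names[i] for i in tree.tree_.feature if i >= 0}
print("reduced attribute subset:", sorted(used))
```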
Q: Is the computer stupid or not?
Example: to have a robot hand a piece of chalk to a girl, how many parameters will the programmer have to specify for the robot?
60
61
Data Compression
(Figure: lossless compression maps the Original Data to Compressed Data and back exactly; lossy compression recovers only an approximation of the Original Data)
62
63
Data Compression
String compression
 There are extensive theories and well-tuned algorithms
 Typically lossless
 But only limited manipulation is possible without expansion
Audio/video compression
 Typically lossy compression, with progressive refinement
 Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
64
Q&Exp.
Try to compress a text document with WinRAR and check the result; then explain why.
65
66
Principal component analysis (PCA), also called principal components analysis, uses the idea of dimensionality reduction to turn many indicators into a small number of composite indicators.
In statistics, PCA is a technique for simplifying a data set. It is a linear transformation that maps the data into a new coordinate system such that the first coordinate (the first principal component) carries the greatest variance of any projection of the data, the second coordinate (the second principal component) carries the second-greatest variance, and so on. PCA is often used to reduce the dimensionality of a data set while preserving the features that contribute most to its variance; this is done by keeping the low-order principal components and discarding the high-order ones.
67
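A compact sketch of PCA via the singular value decomposition of mean-centered data, keeping the low-order (largest-variance) components as described above; the data matrix is random stand-in data.

```python
# PCA by SVD: center the data, take the top-k right singular vectors as the
# principal components, and project (random stand-in data).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated 5-D data

Xc = X - X.mean(axis=0)                    # mean-center each attribute
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                                      # keep the first k principal components
components = Vt[:k]                        # each row is a principal direction
X_reduced = Xc @ components.T              # projected, lower-dimensional data

explained = (s ** 2) / np.sum(s ** 2)
print("variance explained by first", k, "components:", explained[:k].sum().round(3))
```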
(Figure slide 68: a histogram)
69
Numerosity Reduction
Parametric methods
 Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
 Log-linear models: obtain the value at a point in m-D space as the product on appropriate marginal subspaces
Non-parametric methods
 Do not assume models
 Major families: histograms, clustering, sampling
70
Regression Analysis and Log-Linear Models
Linear regression: Y = α + β X
 The two parameters, α and β, specify the line and are to be estimated by using the data at hand,
 applying the least-squares criterion to the known values of Y1, Y2, …, X1, X2, ….
Multiple regression: Y = b0 + b1 X1 + b2 X2
 Many nonlinear functions can be transformed into the above.
Log-linear models:
 The multi-way table of joint probabilities is approximated by a product of lower-order tables.
 Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd
71
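A sketch of estimating the regression parameters by the least-squares criterion with NumPy; the synthetic data and the true coefficients are assumptions made for the example.

```python
# Estimating regression parameters by least squares (synthetic data).
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 3.0 + 2.0 * x + rng.normal(scale=0.5, size=100)   # true alpha = 3, beta = 2

# Simple linear regression Y = alpha + beta * X
A = np.column_stack([np.ones_like(x), x])
(alpha, beta), *_ = np.linalg.lstsq(A, y, rcond=None)

# Multiple regression Y = b0 + b1*X1 + b2*X2
x2 = rng.uniform(0, 5, size=100)
y2 = 1.0 + 2.0 * x - 0.5 * x2 + rng.normal(scale=0.5, size=100)
B = np.column_stack([np.ones_like(x), x, x2])
coeffs, *_ = np.linalg.lstsq(B, y2, rcond=None)

print("alpha, beta:", round(alpha, 2), round(beta, 2))
print("b0, b1, b2 :", np.round(coeffs, 2))
```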
Histograms
(Figure: a histogram with buckets labeled 10,000 through 90,000 on the horizontal axis and counts from 0 to 40 on the vertical axis)
72
Histograms
Histograms use binning to approximate data
distributions and are a popular form of data
reduction.
A histogram for an attribute A partitions the data
distribution of A into disjoint subsets, or buckets.
The buckets are displayed on a horizontal axis
while the height (and area) of a bucket typically
reflects the average frequency of the values
represented by the bucket.
73
Histograms
If each bucket represents only a single attribute-value/frequency pair, the buckets are called singleton buckets.
Often, buckets instead represent continuous
ranges for the given attribute.
74
Example 3.4
The following data are a list of prices of
commonly sold items at AllElectronics
(rounded to the nearest dollar). The numbers
have been sorted.
1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15, 15, 15,
15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21,
25, 25, 25, 25, 25, 28, 28, 30, 30, 30.
75
A histogram for price using singleton buckets - each bucket
represents one price-value/frequency pair.
76
The figure above shows a histogram for the data using singleton buckets. To further reduce the data, it is common to have each bucket denote a continuous range of values for the given attribute. In the figure below, each bucket represents a different $10 range for price.
77
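One way to reproduce the $10-range bucketing of the Example 3.4 prices with NumPy; the exact bucket boundaries used here ($1-10, $11-20, $21-30) are an assumption about how the figure groups the values.

```python
# The Example 3.4 price list, bucketed into $10-wide ranges.
import numpy as np

prices = [1, 1, 5, 5, 5, 5, 5, 8, 8, 10, 10, 10, 10, 12, 14, 14, 14, 15, 15, 15,
          15, 15, 15, 18, 18, 18, 18, 18, 18, 18, 18, 20, 20, 20, 20, 20, 20, 20,
          21, 21, 21, 21, 25, 25, 25, 25, 25, 28, 28, 30, 30, 30]

counts, edges = np.histogram(prices, bins=[1, 11, 21, 31])   # $1-10, $11-20, $21-30
for lo, hi, c in zip(edges[:-1], edges[1:], counts):
    print(f"${int(lo)}-{int(hi) - 1}: {c} items")
```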
Each bucket contains roughly the same number of contiguous samples (equi-depth).
Each bucket width is a constant, e.g., $10 per bucket in the example above (equi-width).
Given the number of buckets, the method with the least variance (V-Optimal).
78
(Figure slides 79-80; annotation: centroid)
Sampling
Sampling can be used as a data reduction technique since
it allows a large data set to be represented by a much
smaller random sample (or subset) of the data. Suppose
that a large data set, D, contains N tuples.
Let's have a look at some possible samples for D.
81
Sampling
1. Simple random sample without
replacement (SRSWOR) of size n:
This is created by drawing n of the N tuples from D (n < N), where the probability of drawing any tuple in D is 1/N, i.e., all tuples are equally likely.
82
Sampling
2. Simple random sample with replacement
(SRSWR) of size n:
This is similar to SRSWOR, except that each
time a tuple is drawn from D, it is recorded
and then replaced.
That is, after a tuple is drawn, it is placed
back in D so that it may be drawn again.
83
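A minimal sketch of SRSWOR and SRSWR using NumPy; D is a stand-in for the N tuples of the data set, and n = 8 is arbitrary.

```python
# Simple random sampling without (SRSWOR) and with (SRSWR) replacement.
import numpy as np

rng = np.random.default_rng(42)
D = np.arange(1000)          # stand-in for the N tuples of data set D
n = 8

srswor = rng.choice(D, size=n, replace=False)  # each tuple drawn at most once
srswr = rng.choice(D, size=n, replace=True)    # tuples may be drawn repeatedly

print("SRSWOR:", srswor)
print("SRSWR :", srswr)
```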
84
85
Sampling
3. Cluster sample:
If the tuples in D are grouped into M mutually disjoint “clusters”, then an SRS of m clusters can be obtained, where m < M.
 SRS: simple random sample
86
87
Sampling
4. Stratified sample:
 If D is divided into mutually disjoint parts called
“strata", a stratified sample of D is generated by
obtaining a SRS at each stratum.
 This helps to ensure a representative sample,
especially when the data are skewed.
 In this way, the age group having the smallest
number of customers will be sure to be represented.
88
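A sketch of a stratified sample: draw an SRS within each stratum so that even small groups are represented. The age-group strata and the 20% sampling fraction are assumptions; groupby(...).sample requires pandas 1.1 or later.

```python
# Stratified sample: an SRS within each stratum (hypothetical age groups).
import pandas as pd

customers = pd.DataFrame({
    "customer_id": range(1, 101),
    "age_group":   ["youth"] * 10 + ["middle_aged"] * 60 + ["senior"] * 30,
})

# Sample 20% of the tuples in each stratum (pandas >= 1.1).
stratified = customers.groupby("age_group").sample(frac=0.2, random_state=0)
print(stratified["age_group"].value_counts())
```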
89
90
(Figure slide: Shannon sampling theory)
91
Entropy-based discretization
An information-based measure called “entropy” can be used to recursively partition the values of a numeric attribute A, resulting in a hierarchical discretization.
Such a discretization forms a numerical concept hierarchy for the attribute.
Given a set of data tuples, S, the basic method for entropy-based discretization of A is as follows.
92
Step 1
Each value of A can be considered a potential
interval boundary or threshold T.
For example, a value v of A can partition the
samples in S into two subsets satisfying the
conditions A < v and A ≥ v, respectively,
thereby creating a binary discretization.
93
Step 2
Given S, the threshold value selected is the one that maximizes the information gain resulting from the subsequent partitioning.
The information gain is Ent(S) − E(S, T), where E(S, T) is the weighted entropy after the split (its formula is given on a later slide) and S1 and S2 correspond to the samples in S satisfying the conditions A < T and A ≥ T, respectively.
94
Step 2
The entropy function Ent for a given set is calculated based on the class distribution of the samples in the set.
For example, given m classes, the entropy of S1 is Ent(S1) = −Σ_{i=1..m} p_i log2(p_i), where p_i is the probability of class i in S1, determined by dividing the number of samples of class i in S1 by the total number of samples in S1. The value of Ent(S2) can be computed similarly.
95
Note :
where pi is the proportion of S belonging
to class i.
Note the logarithm is still base 2 because
entropy is a measure of the expected
encoding length measured in bits.
Note also that if the target attribute can
take on c possible values, the entropy can
be as large as log2c.
96
Step 3
The process of determining a threshold value is recursively applied to each partition obtained, until some stopping criterion is met, such as Ent(S) − E(T, S) < δ (see the slide on entropy-based discretization later in this section).
97
98
99
100
101
Summarize
102
Hierarchical Reduction
Use multi-resolution structure with different degrees of reduction
Hierarchical clustering is often performed but tends to define partitions of data sets rather than “clusters”
Parametric methods are usually not amenable to hierarchical representation
Hierarchical aggregation
 An index tree hierarchically divides a data set into partitions by value range of some attributes
 Each partition can be considered as a bucket
 Thus an index tree with aggregates stored at each node is a hierarchical histogram
103
V.
Discretization
Three types of attributes:
 Nominal — values from an unordered set
 Ordinal — values from an ordered set
 Continuous — real numbers
Discretization:
 divide the range of a continuous attribute into intervals
 Some classification algorithms only accept categorical attributes.
 Reduce data size by discretization
 Prepare for further analysis
104
Discretization and Concept Hierarchy
Discretization
 reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals. Interval labels can then be used to replace actual data values
Concept hierarchies
 reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
105
Discretization and Concept Hierarchy Generation for Numeric Data
Binning (see sections before)
Histogram analysis (see sections before)
Clustering analysis (see sections before)
Entropy-based discretization
Segmentation by natural partitioning
106
Entropy-Based Discretization
Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
E(S, T) = (|S1| / |S|) · Ent(S1) + (|S2| / |S|) · Ent(S2)
The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization.
The process is recursively applied to partitions obtained until some stopping criterion is met, e.g., Ent(S) − E(T, S) < δ
Experiments show that it may reduce data size and improve classification accuracy
107
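A sketch of the recursive entropy-based discretization described above, using the E(S, T) formula and the Ent(S) − E(S, T) < δ stopping criterion; the toy data and the value of δ are assumptions.

```python
# Recursive entropy-based (binary) discretization of a numeric attribute,
# following the E(S, T) formula above (toy data; delta is an assumption).
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(values, labels):
    """Return (T, E(S, T)) for the boundary T minimizing the weighted entropy."""
    best = (None, np.inf)
    for t in np.unique(values)[1:]:            # candidate boundaries
        left, right = labels[values < t], labels[values >= t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best[1]:
            best = (t, e)
    return best

def discretize(values, labels, delta=0.1):
    t, e = best_split(values, labels)
    if t is None or entropy(labels) - e < delta:   # stopping criterion
        return []
    left, right = values < t, values >= t
    return (discretize(values[left], labels[left], delta) + [t]
            + discretize(values[right], labels[right], delta))

values = np.array([1, 2, 3, 4, 10, 11, 12, 13, 25, 26, 27, 28], dtype=float)
labels = np.array(["a", "a", "a", "a", "b", "b", "b", "b", "c", "c", "c", "c"])
print("split points:", discretize(values, labels))
```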
Segmentation by Natural Partitioning
A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, “natural” intervals.
 If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
 If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
 If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
108
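A sketch of the top level of the 3-4-5 rule, following the slide's statement literally (3, 4, or 5 equi-width intervals based on the number of distinct values at the most significant digit). The way Low and High are rounded to the msd, and the omission of the Step 4 boundary adjustment for Min/Max, are simplifications.

```python
# Top-level 3-4-5 partition of (low, high); Step 4 boundary adjustment omitted.
import math

def three_four_five(low, high):
    msd = 10 ** (len(str(int(max(abs(low), abs(high))))) - 1)  # most significant digit position
    low2 = math.floor(low / msd) * msd      # round low down to the msd
    high2 = math.ceil(high / msd) * msd     # round high up to the msd
    distinct = round((high2 - low2) / msd)  # distinct values covered at the msd

    if distinct in (3, 6, 7, 9):
        parts = 3
    elif distinct in (2, 4, 8):
        parts = 4
    else:                                   # 1, 5, or 10 distinct values (fallback)
        parts = 5
    width = (high2 - low2) / parts
    return [(low2 + i * width, low2 + (i + 1) * width) for i in range(parts)]

# The profit example from the slides: Low = -$159, High = $1,838
print(three_four_five(-159, 1838))   # -> [(-1000, 0), (0, 1000), (1000, 2000)]
```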
Example of 3-4-5 Rule
(Figure, reconstructed from the slide: profit data with Min = -$351 and Max = $4,700. Step 1: Low (5th percentile) = -$159, High (95th percentile) = $1,838. Step 2: msd = 1,000, so Low is rounded down to -$1,000 and High up to $2,000. Step 3: (-$1,000, $2,000] covers 3 distinct values at the msd and is split into (-$1,000, 0], (0, $1,000], ($1,000, $2,000]. Step 4: the boundary intervals are adjusted to cover Min and Max, e.g., (-$400, 0] and ($2,000, $5,000], and each top-level interval is further subdivided, e.g., (-$400, -$300], ..., ($1,000, $1,200], ..., ($4,000, $5,000].)
109
Concept Hierarchy Generation for Categorical Data
Specification of a partial ordering of attributes explicitly at the schema level by users or experts
 street < city < state < country
Specification of a portion of a hierarchy by explicit data grouping
 {Urbana, Champaign, Chicago} < Illinois
Specification of a set of attributes
 System automatically generates partial ordering by analysis of the number of distinct values
 E.g., street < city < state < country
Specification of only a partial set of attributes
 E.g., only street < city, not others
110
Automatic Concept Hierarchy Generation
Some concept hierarchies can be automatically generated based on the analysis of the number of distinct values per attribute in the given data set
 The attribute with the most distinct values is placed at the lowest level of the hierarchy
 Note: Exception—weekday, month, quarter, year
country: 15 distinct values
province_or_state: 65 distinct values
city: 3,567 distinct values
street: 674,339 distinct values
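The ordering heuristic can be sketched in a few lines: sort the attributes by their number of distinct values (counts as transcribed from the slide) and place the one with the fewest at the top. Note the slide's own caveat that this fails for attributes such as weekday, month, quarter, and year.

```python
# Order attributes into a concept hierarchy by their number of distinct values:
# fewest distinct values at the top (counts taken from the slide).
distinct_counts = {
    "country": 15,
    "province_or_state": 65,
    "city": 3567,
    "street": 674339,
}

hierarchy = sorted(distinct_counts, key=distinct_counts.get)   # top level first
print(" < ".join(reversed(hierarchy)))   # street < city < province_or_state < country
```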
111
VI.
Summary
Data preparation is a big issue for both warehousing and mining
Data preparation includes
 Data cleaning and data integration
 Data reduction and feature selection
 Discretization
A lot of methods have been developed, but this is still an active area of research
112
References
E. Rahm and H. H. Do. Data Cleaning: Problems and Current Approaches. IEEE Bulletin of the Technical Committee on Data Engineering, 23(4).
D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
H. V. Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
A. Maydanchik. Challenges of Efficient Data Cleansing (DM Review - Data Quality resource portal).
D. Pyle. Data Preparation for Data Mining. Morgan Kaufmann, 1999.
D. Quass. A Framework for Research in Data Cleaning. (Draft, 1999)
V. Raman and J. Hellerstein. Potter's Wheel: An Interactive Framework for Data Cleaning and Transformation. VLDB 2001.
T. Redman. Data Quality: Management and Technology. Bantam Books, New York, 1992.
Y. Wand and R. Wang. Anchoring data quality dimensions in ontological foundations. Communications of the ACM, 39:86-95, 1996.
R. Wang, V. Storey, and C. Firth. A framework for analysis of data quality research. IEEE Trans. Knowledge and Data Engineering, 7:623-640, 1995.
http://www.cs.ucla.edu/classes/spring01/cs240b/notes/data-integration1.pdf
113
End of lecture.