Data Preprocessing
Compiled By:
Umair Yaqub
Lecturer
Govt. Murray College Sialkot
Data Reduction
• Databases and data warehouses may store terabytes of data, so complex data analysis or mining may take a very long time to run on the complete data set
• Data reduction
  - Obtains a reduced representation of the data set that is much smaller in volume yet produces the same (or almost the same) analytical results
• Strategies
  - Data cube aggregation
  - Dimensionality reduction
  - Attribute subset selection
  - Numerosity reduction
  - Discretization and concept hierarchy generation
Data Cube Aggregation
• The lowest level of a data cube (base cuboid)
• Multiple levels of aggregation in data cubes
  - Each higher level further reduces the size of the data to deal with
• The highest level of a data cube (apex cuboid)
• Reference appropriate levels
  - Use the smallest representation that is sufficient to solve the task
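
The levels of aggregation above can be sketched with a small roll-up. This is only an illustration: the sales rows, the branch names, and the helper `roll_up` are hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows: (year, quarter, branch, sales).
SALES = [
    (2023, "Q1", "A", 200), (2023, "Q2", "A", 300),
    (2023, "Q1", "B", 100), (2023, "Q2", "B", 150),
    (2024, "Q1", "A", 250), (2024, "Q2", "A", 350),
    (2024, "Q1", "B", 120), (2024, "Q2", "B", 130),
]

def roll_up(rows, key):
    """Sum the sales measure over all rows that share the same key."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += row[-1]
    return dict(totals)

# One level up: drop the quarter dimension.
by_year_branch = roll_up(SALES, key=lambda r: (r[0], r[2]))
# Two levels up: keep only the year.
by_year = roll_up(SALES, key=lambda r: r[0])
# Apex cuboid: every dimension aggregated away, one grand total.
apex = roll_up(SALES, key=lambda r: ())

print(by_year)   # {2023: 750, 2024: 850}
print(apex)      # {(): 1600}
```

If the task only needs annual totals, `by_year` (two numbers) can stand in for the eight base rows.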
Attribute Subset Selection
• Not all attributes may be relevant to the mining task
• A reduced set of attributes should result in
  - Less data, so faster learning
  - Higher accuracy
  - Simpler results
• If the behaviour of the data is not known, manual selection of useful attributes may be a time-consuming task
• Careful selection is required
  - Keep relevant attributes
  - Leave out irrelevant attributes
Attribute Subset Selection (contd…)
• Attribute subset selection is a search problem
• Heuristic methods (due to the exponential number of choices), typically greedy:
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction
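
A minimal sketch of greedy step-wise forward selection follows. The scoring function is supplied by the caller; the attribute names, the `USEFULNESS` table, and `toy_score` are assumptions made up for this illustration, standing in for something like a classifier's validation accuracy.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` maps a list of attributes to a number (higher is better).
    Starts from the empty set and adds the best attribute at each step.
    """
    selected = []
    remaining = list(attributes)
    best = score(selected)
    while remaining:
        # Evaluate adding each remaining attribute to the current subset.
        cand_score, cand = max((score(selected + [a]), a) for a in remaining)
        if cand_score - best <= min_gain:
            break  # no attribute improves the score, so stop
        selected.append(cand)
        remaining.remove(cand)
        best = cand_score
    return selected, best

# Toy scoring function (illustration only): each attribute contributes a
# fixed usefulness, and every selected attribute costs 0.1.
USEFULNESS = {"age": 0.5, "income": 0.3, "zip_code": 0.05, "customer_id": 0.0}

def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.1 * len(subset)

selected, best = forward_selection(list(USEFULNESS), toy_score)
print(selected, round(best, 2))   # ['age', 'income'] 0.6
```

Step-wise backward elimination mirrors this loop: start from the full attribute set and repeatedly remove the attribute whose removal improves the score most.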
Dimensionality Reduction
• Data encoding or transformations are applied so as to obtain a reduced or ‘compressed’ representation of the original data.
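[Figure: Original Data is encoded into Compressed Data. With lossless reduction the Original Data can be recovered exactly; otherwise only an approximation (Original Data Approximated) is recovered.]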
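
As one concrete case of lossless encoding, the sketch below builds a Huffman code and checks that decoding recovers the original exactly. The sample string and the function names are assumptions for illustration; the 8-bits-per-character baseline is also an assumption about how the raw text would be stored.

```python
import heapq
from collections import Counter

def huffman_code(text):
    """Build a prefix code {symbol: bitstring} for a non-empty string."""
    freq = Counter(text)
    if len(freq) == 1:                      # single distinct symbol
        return {next(iter(freq)): "0"}
    # Heap entries: (weight, unique tie-breaker, {symbol: code so far}).
    heap = [(w, i, {s: ""}) for i, (s, w) in enumerate(sorted(freq.items()))]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        # Merge the two lightest subtrees, prefixing their codes with 0 / 1.
        w1, _, c1 = heapq.heappop(heap)
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(text, code):
    return "".join(code[s] for s in text)

def decode(bits, code):
    inverse = {c: s for s, c in code.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:                  # prefix code: first match is final
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

text = "aaaabbbccd"
code = huffman_code(text)
bits = encode(text, code)
print(len(bits), "bits instead of", 8 * len(text))   # 19 bits instead of 80
print(decode(bits, code) == text)                    # True
```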
Dimensionality Reduction…
• Data encoding techniques:
  - Huffman algorithm
  - Wavelets
  - Principal components analysis
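
To make the wavelet entry concrete, here is a rough sketch using the Haar wavelet with plain pairwise averaging and differencing. The choice of Haar, the unnormalised convention, the sample values, and the threshold of 1 are all assumptions for illustration. Zeroing the small coefficients gives a lossy, reduced representation.

```python
def haar_forward(values):
    """Haar decomposition of a list whose length is a power of two.

    Returns [overall average, then detail coefficients from coarse to fine].
    """
    data = list(values)
    details = []
    while len(data) > 1:
        avgs = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details           # coarser details go in front
        data = avgs
    return data + details

def haar_inverse(coeffs):
    """Rebuild the values from coefficients produced by haar_forward."""
    data = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(data)]
        pos += len(data)
        rebuilt = []
        for avg, diff in zip(data, diffs):
            rebuilt.extend([avg + diff, avg - diff])
        data = rebuilt
    return data

values = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_forward(values)
print(coeffs)   # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]

# Lossy step: zero every coefficient with magnitude below 1.
truncated = [c if abs(c) >= 1 else 0.0 for c in coeffs]
approx = haar_inverse(truncated)
print(approx)   # [1.5, 1.5, 0.5, 2.5, 3.0, 5.0, 4.0, 4.0]
```

Only four non-zero coefficients remain after truncation, yet the reconstruction stays close to the original eight values.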
Numerosity Reduction
• Reduce data volume by choosing alternative, smaller forms of data representation
• Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
• Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling
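
The two families can be contrasted in a short sketch. The parametric part assumes a straight-line model fitted by least squares; the non-parametric part shows an equal-width histogram and a simple random sample without replacement. All data values and function names here are made up for illustration.

```python
import random
from statistics import mean

def fit_line(xs, ys):
    """Parametric: least-squares fit of y = intercept + slope * x."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return my - slope * mx, slope

def equal_width_histogram(values, n_buckets):
    """Non-parametric: bucket counts (assumes max(values) > min(values))."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        idx = min(int((v - lo) / width), n_buckets - 1)  # put max in last bucket
        counts[idx] += 1
    return lo, width, counts

def sample_without_replacement(values, n, seed=0):
    """Non-parametric: simple random sample of n distinct items."""
    return random.Random(seed).sample(values, n)

# Parametric: ten (x, y) points reduce to two stored parameters.
xs = list(range(10))
ys = [3 + 2 * x for x in xs]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)   # 3.0 2.0

# Non-parametric: twelve prices reduce to three bucket counts.
prices = [1, 5, 8, 10, 12, 14, 15, 18, 20, 21, 25, 30]
lo, width, counts = equal_width_histogram(prices, 3)
print(counts)             # [4, 5, 3]

subset = sample_without_replacement(prices, 4)
```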