Data Preprocessing
Compiled By:
Umair Yaqub
Lecturer
Govt. Murray College Sialkot
Data Reduction
Databases and data warehouses may store terabytes of data, so complex data
analysis or mining may take a very long time to run on the complete data set
Data reduction
Obtains a reduced representation of the data set that is much smaller in volume yet
produces the same (or almost the same) analytical results
Strategies
Data cube aggregation
Dimensionality reduction
Attribute subset selection
Numerosity reduction
Discretization and concept hierarchy generation
Data Cube Aggregation
The lowest level of a data cube (base cuboid) holds the least aggregated data
Multiple levels of aggregation in data cubes
Each higher level further reduces the size of the data to deal with
The highest level of a data cube (apex cuboid) holds the fully aggregated total
Reference appropriate levels
Use the smallest representation that is enough to solve the task
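The roll-up from a base cuboid to higher levels can be sketched as follows. This is a minimal illustration only: the sales rows, the column layout and the `roll_up` helper are made up for the example and are not from the slides.

```python
from collections import defaultdict

# Hypothetical base-cuboid rows: (year, quarter, branch, sales).
base_cuboid = [
    (2023, "Q1", "A", 200), (2023, "Q2", "A", 300),
    (2023, "Q3", "A", 250), (2023, "Q4", "A", 350),
    (2024, "Q1", "A", 220), (2024, "Q2", "A", 310),
    (2024, "Q3", "A", 280), (2024, "Q4", "A", 390),
]


def roll_up(rows, key_indexes):
    """Sum the sales measure over every dimension not listed in key_indexes."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[i] for i in key_indexes)
        totals[key] += row[-1]   # the last column is the measure
    return dict(totals)


per_year = roll_up(base_cuboid, [0])   # one row per year instead of one per quarter
apex = roll_up(base_cuboid, [])        # apex cuboid: a single grand total

print(per_year)   # {(2023,): 1100, (2024,): 1200}
print(apex)       # {(): 2300}
```

If the mining task only needs annual sales, the two-row `per_year` result is the smallest representation that is enough, and the eight quarterly rows need not be scanned.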
Attribute Subset Selection
Not all attributes may be relevant to the mining task
A reduced attribute set should result in
Less data, so faster learning
Higher accuracy
Simpler results
If the behaviour of the data is not known, manual selection of useful attributes may
be a time-consuming task
Careful selection is required
Keep relevant attributes
Leave out irrelevant attributes
Attribute Subset Selection (contd…)
Attribute subset selection is a search problem
Heuristic methods are used because the number of possible attribute subsets is
exponential (2^n for n attributes); they are typically greedy:
step-wise forward selection
step-wise backward elimination
combining forward selection and backward elimination
decision-tree induction
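Step-wise forward selection, the first method listed above, can be sketched as a greedy loop. This is a minimal sketch under stated assumptions: the `score` function is supplied by the caller (in practice it might be cross-validated accuracy), and the attribute names, `USEFULNESS` values and `toy_score` are invented purely for illustration.

```python
def forward_selection(attributes, score, min_gain=1e-9):
    """Greedy step-wise forward selection.

    `score` is a caller-supplied function (hypothetical here) that maps a
    list of attribute names to a quality value, higher being better.
    """
    selected = []
    best = score(selected)
    remaining = list(attributes)
    while remaining:
        # Try each remaining attribute and keep the best single addition.
        candidate = max(remaining, key=lambda a: score(selected + [a]))
        candidate_score = score(selected + [candidate])
        if candidate_score - best < min_gain:
            break  # no attribute improves the score enough, so stop
        selected.append(candidate)
        remaining.remove(candidate)
        best = candidate_score
    return selected


# Toy scoring function, invented for illustration: each attribute has a
# made-up usefulness, and every selected attribute costs a small penalty.
USEFULNESS = {"age": 0.30, "income": 0.25, "student_id": 0.0, "credit": 0.10}


def toy_score(subset):
    return sum(USEFULNESS[a] for a in subset) - 0.05 * len(subset)


chosen = forward_selection(list(USEFULNESS), toy_score)
print(chosen)   # ['age', 'income', 'credit']
```

Each pass evaluates at most n candidate subsets rather than all 2^n, which is why the greedy search is practical. Backward elimination would run the same loop in reverse, starting from the full set and dropping the attribute whose removal hurts the score least.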
Dimensionality Reduction
Data encoding or transformations are applied so as to obtain a reduced
or ‘compressed’ representation of the original data.
[Figure: Original Data is encoded into Compressed Data. Decoding returns the Original Data exactly (lossless) or an Approximated version of it (lossy).]
Dimensionality Reduction (contd…)
Data encoding techniques:
Huffman algorithm
Wavelets
Principal components analysis
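As an illustration of the wavelet technique, the sketch below applies a Haar transform by repeated pairwise averaging and differencing, then keeps only the largest coefficients. It assumes an unnormalised Haar transform and an input whose length is a power of 2. The function names and the sample signal are chosen for the example and are not from the slides.

```python
def haar_transform(values):
    """Unnormalised Haar wavelet transform; len(values) must be a power of 2."""
    data = list(values)
    details = []
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details = diffs + details   # coarser levels go in front
        data = averages
    return data + details           # [overall average, detail coefficients...]


def haar_inverse(coeffs):
    """Invert haar_transform, rebuilding one resolution level per pass."""
    data = list(coeffs[:1])
    pos = 1
    while pos < len(coeffs):
        diffs = coeffs[pos:pos + len(data)]
        pos += len(data)
        data = [x for avg, d in zip(data, diffs) for x in (avg + d, avg - d)]
    return data


def keep_largest(coeffs, k):
    """Zero all but the k largest-magnitude coefficients (the lossy step)."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    keep = set(order[:k])
    return [c if i in keep else 0.0 for i, c in enumerate(coeffs)]


signal = [2, 2, 0, 2, 3, 5, 4, 4]
coeffs = haar_transform(signal)            # [2.75, -1.25, 0.5, 0.0, 0.0, -1.0, -1.0, 0.0]
exact = haar_inverse(coeffs)               # lossless: equals the original signal
approx = haar_inverse(keep_largest(coeffs, 4))   # lossy: an approximation
```

Storing all coefficients gives back the data exactly. Storing only the four largest (as a sparse list of positions and values) halves the stored numbers and reconstructs `[1.5, 1.5, 0.5, 2.5, 3.0, 5.0, 4.0, 4.0]`, which is close to the original.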
Numerosity Reduction
Reduce data volume by choosing alternative, smaller forms of data
representation
Parametric methods
Assume the data fits some model, estimate model parameters, store only the
parameters, and discard the data (except possible outliers)
Non-parametric methods
Do not assume models
Major families: histograms, clustering, sampling
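The two families can be contrasted with a short sketch. Assumptions for illustration only: the parametric model is a straight line fitted by least squares, the non-parametric examples are an equal-width histogram and a simple random sample without replacement, and all data values and helper names are made up.

```python
import random


def fit_line(xs, ys):
    """Least-squares fit of y = w * x + b; returns the two parameters."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w = sxy / sxx
    return w, mean_y - w * mean_x


def equal_width_histogram(values, n_buckets):
    """Return (low, high, count) buckets of equal width.

    Assumes the values are not all identical (the width would be zero).
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # The maximum value is clamped into the last bucket.
        idx = min(int((v - lo) / width), n_buckets - 1)
        counts[idx] += 1
    return [(lo + i * width, lo + (i + 1) * width, c) for i, c in enumerate(counts)]


# Parametric: store only (w, b) and discard the points.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]            # made-up data lying exactly on y = 2x + 1
w, b = fit_line(xs, ys)

# Non-parametric: store bucket counts, or a small sample, instead of all values.
prices = [1, 1, 5, 5, 5, 8, 8, 10, 10, 10, 12, 14, 14, 15, 15, 18, 20, 21, 25, 30]
buckets = equal_width_histogram(prices, 3)
bucket_counts = [count for _, _, count in buckets]   # [10, 7, 3]

random.seed(0)                          # fixed seed so the run is repeatable
sample = random.sample(prices, 5)       # simple random sample without replacement
```

The line replaces five points with two numbers, while the histogram replaces twenty prices with three bucket counts. The histogram and the sample make no assumption about the shape of the data, which is the distinction drawn above.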