#### Transcript Lecture slides

Data preparation: Selection, Preprocessing, and Transformation Literature: I.H. Witten and E. Frank, Data Mining, chapter 2 and chapter 7 1 Fayyad’s KDD Methodology Transformed data Target data Patterns Processed data Data Mining data Selection Knowledge Interpretation Evaluation Transformation Preprocessing & feature selection & cleaning 2 Contents Data Selection Data Preprocessing Data Transformation 3 Data Selection Goal Understanding the data Explore the data: possible attributes their values distribution, outliers 4 Getting to know the data Simple visualization tools are very useful for identifying problems Nominal attributes: histograms (Distribution consistent with background knowledge?) Numeric attributes: graphs (Any obvious outliers?) 2-D and 3-D visualizations show dependencies Domain experts need to be consulted Too much data to inspect? Take a sample! 5 Data preprocessing Problem: different data sources (e.g. sales department, customer billing department, …) Differences: styles of record keeping, conventions, time periods, data aggregation, primary keys, errors Data must be assembled, integrated, cleaned up “Data warehouse”: consistent point of access External data may be required (“overlay data”) Critical: type and level of data aggregation 6 Data Preprocessing Choose data structure (table, tree or set of tables) Choose attributes with enough information Decide on a first representation of the attributes (numeric or nominal) Decide on missing values Decide on inaccurate data (cleansing) 7 Attribute types used in practice Most schemes accommodate just two levels of measurement: nominal and ordinal Nominal attributes are also called “categorical”, “enumerated”, or “discrete” But: “enumerated” and “discrete” imply order Special case: dichotomy (“boolean” attribute) Ordinal attributes are called “numeric”, or “continuous” But: “continuous” implies mathematical continuity 8 The ARFF format % ARFF file for weather data with some numeric features % @relation weather @attribute outlook {sunny, overcast, rainy} @attribute temperature numeric @attribute humidity numeric @attribute windy {true, false} @attribute play? {yes, no} @data sunny, 85, 85, false, no sunny, 80, 90, true, no overcast, 83, 86, false, yes ... 9 Attribute types ARFF supports numeric and nominal attributes Interpretation depends on learning scheme Numeric attributes are interpreted as ordinal scales if less-than and greater-than are used ratio scales if distance calculations are performed (normalization/standardization may be required) Instance-based schemes define distance between nominal values (0 if values are equal, 1 otherwise) Integers: nominal, ordinal, or ratio scale? 10 Nominal vs. ordinal Attribute “age” nominal If age = young and astigmatic = no and tear production rate = normal then recommendation = soft If age = pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft Attribute “age” ordinal (e.g. “young” < “pre-presbyopic” < “presbyopic”) If age pre-presbyopic and astigmatic = no and tear production rate = normal then recommendation = soft 11 Missing values Frequently indicated by out-of-range entries Types: unknown, unrecorded, irrelevant Reasons: malfunctioning equipment, changes in experimental design, collation of different datasets, measurement not possible Missing value may have significance in itself (e.g. missing test in a medical examination) Most schemes assume that is not the case “missing” may need to be coded as additional value 12 Inaccurate values Reason: data has not been collected for mining it Result: errors and omissions that don’t affect original purpose of data (e.g. age of customer) Typographical errors in nominal attributes values need to be checked for consistency Typographical and measurement errors in numeric attributes outliers need to be identified Errors may be deliberate (e.g. wrong zip codes) Other problems: duplicates, stale data 13 Transformation Attribute selection Adding a random (i.e. irrelevant) attribute can significantly degrade C4.5’s performance Problem: attribute selection based on smaller and smaller amounts of data IBL is also very susceptible to irrelevant attributes Number of training instances required increases exponentially with number of irrelevant attributes Naïve Bayes doesn’t have this problem. Relevant attributes can also be harmful 14 Scheme-independent selection Filter approach: assessment based on general characteristics of the data One method: find subset of attributes that is enough to separate all the instances Another method: use different learning scheme (e.g. C4.5, 1R) to select attributes IBL-based attribute weighting techniques can also be used (but can’t find redundant attributes) CFS: uses correlation-based evaluation of subsets 15 Attribute subsets for weather data 16 Searching the attribute space Number of possible attribute subsets is exponential in the number of attributes Common greedy approaches: forward selection and backward elimination More sophisticated strategies: Bidirectional search Best-first search: can find the optimum solution Beam search: approximation to best-first search Genetic algorithms 17 Scheme-specific selection Wrapper approach: attribute selection implemented as wrapper around learning scheme Evaluation criterion: cross-validation performance Time consuming: adds factor k2 even for greedy approaches with k attributes Linearity in k requires prior ranking of attributes Scheme-specific attribute selection essential for learning decision tables Can be done efficiently for DTs and Naïve Bayes 18 Discretizing numeric attributes Can be used to avoid making normality assumption in Naïve Bayes and Clustering Simple discretization scheme is used in 1R C4.5 performs local discretization Global discretization can be advantageous because it’s based on more data Learner can be applied to discretized attribute or It can be applied to binary attributes coding the cut points in the discretized attribute 19 Unsupervised discretization Unsupervised discretization generates intervals without looking at class labels Only possible way when clustering Two main strategies: Equal-interval binning Equal-frequency binning (also called histogram equalization) Inferior to supervised schemes in classification tasks 20 Entropy-based discretization Supervised method that builds a decision tree with pre-pruning on the attribute being discretized Entropy used as splitting criterion MDLP used as stopping criterion State-of-the-art discretization method Application of MDLP: “Theory” is the splitting point (log2[N-1] bits) plus class distribution in each subset DL before/after adding splitting point is compared 21 Example: temperature attribute 22 Formula for MDLP N instances and k classes and entropy E in original set k1 classes and entropy E1 in first subset k2 classes and entropy E2 in first subset log 2 ( N 1) log 2 (3k 2) kE k1E1 k2 E2 gain N N Doesn’t result in any discretization intervals for the temperature attribute 23 Other discretization methods Top-down procedure can be replaced by bottomup method MDLP can be replaced by chi-squared test Dynamic programming can be used to find optimum k-way split for given additive criterion Requires time quadratic in number of instances if entropy is used as criterion Can be done in linear time if error rate is used as evaluation criterion 24 Transformation WEKA provides a lot of filters that can help you transforming and selecting your attributes! Use them to build an promising model for the caravan data! 25