#### Transcript Statistical Techniques: data preparation, descriptive statistics

Statistical Techniques
Robyn & Valerie
CSC 426
5/14/15

Outline
1. Motivation
2. Background & Getting to Know Your Data
3. Data Pre-processing
4. Inferential Statistics (Analysis)
5. Inference / Results

Motivation
Why are we here?

Motivation
• Effectively conduct research
• Know what statistics to use before collecting data
• To better read journal articles
• To further develop critical and analytic thinking
• To be an informed consumer

(Statistical) conclusion validity
• Degree to which conclusions we reach about relationships in data are reasonable
• Is there actually a relationship?
▫ Conclude there is a relationship when there isn't one
▫ Conclude there isn't a relationship when there is one!

Threats
• Reliability of measures/observations
• Statistical power
▫ Sample size
▫ Alpha level (Type I)
▫ Power (Type II)
• Fishing and the error rate problem
• Violated assumptions of statistical tests

Get to know your data
With some background

Get to Know Your Data
• Population vs. Sample
• Independent vs. Dependent Variables
• Data Types
• Descriptives
• Distributions
• Correlation

Population vs Sample

Data types
• Nominal
• Ordinal
• Interval/Ratio

GDP

| Country | GDP | Rank | Percentile |
|---|---|---|---|
| USA | $16,768,100,000,000 | 1 | 100th |
| Cyprus | $22,767,000,000 | 102 | 46th |

Actual vs. change

| Country | GDP | GDP Growth |
|---|---|---|
| Japan | $4.7913 trillion | 2.26% |
| China | $1.1838 trillion | 8.43% |

Descriptive Statistics

Univariate: describes the distribution of a single variable
• Central Tendency
• Five Number Summary
• Dispersion
• Measures of Spread
• Shape

Bivariate: describes the relationship between pairs of variables
• Cross-tabulations and Contingency Tables
• Graphical Representation via Scatterplots
• Quantitative Measures of Dependence

Measures of Central Tendency
Hockey Player Points Scored: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50

| Mean | Median | Mode |
|---|---|---|
| ≈ 24 | 24 | 24 |

Measures of Central Tendency
Hockey Player Points Scored: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517

| Mean | Median | Mode |
|---|---|---|
| ≈ 54.7 | 24 | 24 |

Five Number Summary / Measures of Dispersion / Measures of Spread
Hockey Player Points Scored: 6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
• Five number summary: Minimum, First Quartile, Median, Third Quartile, Maximum
• Range = Max - Min = 44
• Standard Deviation (SD) = 11.2
• Variance = $s^2$ = 126.4

Correlation

Correlation vs. Causation
Correlation does not imply causation

Data preparation / pre-processing
Cleaning, integrating and transforming your data!

Dirty Data
• Incomplete
▫ occupation=“ ”
• Noisy
▫ Salary=“-10”
• Inconsistent
▫ Age=“42”, Birthday=“03/07/1997”
▫ Was rating “1, 2, 3”, now rating “A, B, C”
▫ Discrepancy between duplicate records
Major threats to conclusion validity

Forms of data pre-processing
• Cleaning
• Integration
• Transformation
• Reduction

Data cleaning
• Fill in missing values (manual vs. automatic)
▫ Ignore
▫ Constant: “unknown” (a new class?!)
▫ Attribute mean (of entire set or subset)
▫ Most probable value: inference-based
• Identify outliers and smooth out noisy data
▫ Binning method
▫ Clustering
▫ Combined computer and human inspection
▫ Regression
• Correct inconsistent data
• Resolve redundancy caused by data integration

Outlier Detection
[Figures: outlier detection by cluster analysis and by regression; the regression panel shows the fitted line y = x + 1, with an observed value Y1 at X1 and its fitted value Y1']

Data Integration
• Remove redundancies
▫ Correlational analysis
• Integrate schemas
• Detect and resolve value conflicts

Data Transformation
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute/feature construction
▫ $\text{BMI} = \dfrac{\text{weight (kg)}}{\text{height (m)}^2}$

Normalization
• Min-max normalization
• Z-score normalization (standardization)
• Normalization by decimal scaling: $v' = \dfrac{v}{10^{j}}$, where $j$ is the smallest integer such that $\max(|v'|) < 1$

Inferential Statistics (Analysis)
Parametric and Non-parametric

Parametric vs Nonparametric

Parametric
• Interval or ratio scales
• Data fall into a normal distribution
• More complex and powerful analysis
• For each analysis method, check which assumptions are absolutely necessary for its use
• Do not violate assumptions

Nonparametric
• Ordinal
• Bi-modal or skewed distributions
• Fewer assumptions in general
• Number of parameters grows with the training data
• More robust
• Simpler, can be used when less is known about the application
• Downside: a larger sample size may be needed to draw conclusions with the same confidence

Inferential procedures

| Purpose | Parametric | Non-parametric |
|---|---|---|
| Significant difference between 2 MCTs | Student's t-test (means) | Mann-Whitney U (median); Wilcoxon signed rank test (median, correlated) |
| Significant difference between 3 or more MCTs | ANOVA | Kruskal-Wallis test |
| Significant difference among MCTs while controlling for a covariate | ANCOVA | |
| Is r larger than it would be by chance? | t-test for r | Fisher's exact test |
| How closely observations match expected (frequency or probability) | | Chi-square ($\chi^2$) goodness of fit |

MCT: Measure of central tendency (mean, median, mode)

So you designed an experiment, what now?

| Experimental Design | Analysis |
|---|---|
| Two-group posttest-only randomized | t-test, one-way ANOVA |
| Factorial | ANOVA |
| Randomized block design | ANOVA with blocking |
| Analysis of Covariance | ANCOVA |
| Quasi-experimental (no random assignment): | |
| Nonequivalent Groups (NEGD) | Reliability-corrected ANCOVA |
| Regression-Discontinuity | Polynomial regression |
| Regression Point Displacement | ANCOVA variant |

General Linear Model (GLM)
$Y = X\beta + \epsilon$
• Assumption: $\epsilon \sim N(0, \sigma^2)$
▫ Independent
▫ Identically
▫ Normally
▫ … distributed
• Basis for
▫ Student t-test
▫ ANOVA
▫ ANCOVA
▫ correlation t-test
▫ regression
▫ factor analysis

Checking assumptions: iid residuals

Checking assumptions: Normality via Q-Q plots

Hypothesis/significance testing
• Testing whether claims or hypotheses regarding a population are likely to be true
• State hypotheses ($H_0$ and $H_a$)
▫ $H_0$ is assumed to be true, but we think it is wrong
▫ $H_a$ contradicts $H_0$ (what we think is wrong about $H_0$)
• Set criteria for decision
▫ Amount of error we wish to accept
• Compute test statistic
▫ Mathematical formula that allows researchers to determine the likelihood of obtaining sample outcomes if the null hypothesis were true
• Make a decision
▫ Reject or fail to reject the null hypothesis

t vs. z

One sample analysis

Confidence limits for the mean
• How well your sample average estimates the population mean

One-sample t-test
• Based on your sample, is the population mean different from some value?
▫ Is the true mean body temp 98.6 deg based on a sample from this class?
• Confidence limits for the mean: $\bar{x} \pm t_{1-\alpha/2,\,N-1}\,\dfrac{s}{\sqrt{N}}$
• $H_0: \mu = \mu_0$
• $H_a: \mu \neq \mu_0$, $\mu < \mu_0$, or $\mu > \mu_0$
• $T = \dfrac{\bar{x} - \mu_0}{s / \sqrt{N}}$
• Compare this value against the t-table and decide whether or not to reject

Two sample t-tests
• Paired
▫ Assume the two samples are correlated (or dependent)
▫ Same as the one-sample t-test, applied to the differences with $\mu_0 = 0$
• Not paired
▫ Pooled variance: assume populations have the same variance
▫ Not pooled variance

Example: Run time

| Alg 1 | Alg 2 | $d$ | $d - \bar{d}$ |
|---|---|---|---|
| 1.2 | 1.4 | 0.2 | -1.27 |
| 4.2 | 2.3 | 1.9 | 0.43 |
| 2.3 | 1.2 | 1.2 | -0.27 |
| 3.4 | 2.1 | 1.3 | 0.17 |
| 4.1 | 1.3 | 2.8 | 1.33 |
| 4.2 | 3.2 | 1 | -0.47 |
| 2.1 | 1.2 | 0.9 | -1.43 |
| 3.2 | 1.3 | 1.9 | 0.43 |
| 4.2 | 2.1 | 2.1 | 0.63 |

• $H_0: \mu_1 = \mu_2$
• $H_1: \mu_1 < \mu_2$ or $\mu_1 > \mu_2$ or $\mu_1 \neq \mu_2$
• Significance level ($\alpha$) of 0.05

| Statistic | Value |
|---|---|
| $n$ | 9 |
| $\bar{d}$ | 1.47 |
| $s$ | 0.28 |

Non-parametric
• Fewer assumptions in general
• No assumption made about the distribution of the data
• Number of parameters grows with the training data
• Used for data that takes on a ranked order without clear numerical interpretation
• More robust
• Simpler, can be used when less is known about the application
• Downside: a larger sample size may be needed to draw conclusions with the same confidence

Ordinal / Not Interval

| Experimental Design | Analysis |
|---|---|
| Two-group posttest-only randomized experiment (equivalent to independent samples t-test) | Mann-Whitney U |
| Two-group posttest-only randomized experiment (equivalent to dependent samples t-test) | Wilcoxon Signed-Rank Test |
| Three or more groups (equivalent to ANOVA) | Kruskal-Wallis Test |

Two Dichotomous Variables

| Experimental Design | Analysis |
|---|---|
| Nominal variables; significant correlation (equivalent to t-test for Pearson's r) | Odds Ratio |
| Nominal or ordinal variables; significant correlation; small sample size (equivalent to t-test for Pearson's r) | Fisher's Exact Test |

Chi-square ($\chi^2$) test
• Determines how closely observed frequencies or probabilities match expected
• Can be used for nominal, ordinal, interval, or ratio data types

The best paper I ever read
• Zhang, Min-Ling, and Kun Zhang. "Multi-label learning by exploiting label dependency." Proceedings of the 16th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 2010.
• “In multi-label learning, each training example is associated with a set of labels and the task is to predict the proper label set for the unseen example.”

What they did very well
• Explain their experimental design
▫ Ten-fold cross-validation
▫ Mean metric value as well as the standard deviation of each algorithm is recorded
▫ Pairwise t-tests at 5% significance level are conducted between the algorithms
• Use effective summary tables
• Specify parameters, algorithms
• Use many, many evaluation metrics
[Figures: excerpts from the paper labeled Data, Descriptives, and Results]

Inference/Results
P-values and visualization

The p-value
• Definition
▫ The probability, under assumption of the null hypothesis, of obtaining a result equal to or more extreme than what was actually observed
• Weighs the strength of the evidence
• Not a measure of how right the analysis is
• Not a measure of how significant the difference is
• You can only see whether your hypothesis is consistent with the data

The power of visualizing data
• Transform massive amounts of data into something meaningful
• More accessible and understandable to a broader audience
• Aim to make your data or results understandable through visual representation and presentation

Viz like a pro
1. Establish the visualization's context and ideas
2. Acquire, familiarize with and prepare your data
3. Determine the editorial focus of your subject matter
4. Conceive your design: data representation and presentation
5. Construct and evaluate your design solution

References/Resources
• Data Mining: Concepts and Techniques (Han, Kamber, Pei)
• Data Mining: Practical Machine Learning Tools and Techniques (Ch. 5, Degregori, Witten)
• Experiments: Planning, Analysis and Optimization (Wu, Hamada)
• Writing for CS (Ch. 15, Zobel)
• Practical Research (Ch. 8, LO)
• IS 567
• CSC 424
• Internet
• xkcd
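
Worked example: descriptive statistics for the hockey data. The sketch below recomputes the values quoted on the Measures of Central Tendency and Measures of Spread slides using only Python's standard `statistics` module, then repeats the calculation with the extreme score 517 added to show how the mean moves while the median and mode do not. The helper name `describe` is hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the slides do not say which one they used.

```python
import statistics

# Hockey player points from the Measures of Central Tendency slides.
points = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]


def describe(values):
    """Return common univariate descriptives for a list of numbers."""
    # Quartile conventions differ between textbooks and tools; "inclusive"
    # is one common choice and is an assumption here.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "min": min(values),
        "q1": q1,
        "q3": q3,
        "max": max(values),
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # sample variance (n - 1)
        "sd": statistics.stdev(values),           # sample standard deviation
    }


base = describe(points)
with_outlier = describe(points + [517])  # same data plus one extreme score

# The mean is pulled toward the outlier; median and mode stay put.
for name in ("mean", "median", "mode"):
    print(f"{name:>6}: {base[name]:.1f} -> {with_outlier[name]:.1f}")

print(f"range = {base['range']}, "
      f"SD = {base['sd']:.1f}, variance = {base['variance']:.1f}")
```

This is why the median is usually the safer summary of central tendency when a data set may contain outliers.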
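
Worked example: the three normalizations from the Normalization slide. This is a minimal sketch rather than a reference implementation. It assumes the usual textbook definitions: min-max maps values linearly onto a target interval (here [0, 1] by default), z-score subtracts the mean and divides by the standard deviation (the sample SD is used, which is an assumption), and decimal scaling divides by the smallest power of ten that brings every absolute value below 1. The function names are hypothetical, and the decimal-scaling helper only considers non-negative exponents.

```python
import statistics


def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values onto [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization needs at least two distinct values")
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


def z_score(values):
    """Standardize values to mean 0 and (sample) standard deviation 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample SD, not population SD
    return [(v - mean) / sd for v in values]


def decimal_scaling(values):
    """Divide by 10**j, with j the smallest j >= 0 such that max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j


# Reuse the hockey points from the descriptive statistics slides.
points = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

scaled = min_max(points)
standardized = z_score(points)
decimal, j = decimal_scaling(points)

print("min-max:", [round(v, 2) for v in scaled])
print("z-score:", [round(v, 2) for v in standardized])
print(f"decimal scaling (j = {j}):", decimal)
```

Min-max is sensitive to outliers because a single extreme value sets the scale; z-scores are less so, which is one reason to look at the data before choosing.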
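
Worked example: a paired t-test on the run-time data. The sketch below applies the paired procedure from the Two sample t-tests slide (a one-sample t-test on the differences with $\mu_0 = 0$) to the Alg 1 and Alg 2 columns of the run-time example. The differences are recomputed directly from those two columns, so they need not match the slide's $d$ column exactly. The function name `paired_t` is hypothetical, and the critical value 2.306 (two-sided, $\alpha = 0.05$, 8 degrees of freedom) is taken from a standard t-table rather than computed.

```python
import math
import statistics

# Run times from the "Example: Run time" slide.
alg1 = [1.2, 4.2, 2.3, 3.4, 4.1, 4.2, 2.1, 3.2, 4.2]
alg2 = [1.4, 2.3, 1.2, 2.1, 1.3, 3.2, 1.2, 1.3, 2.1]


def paired_t(a, b):
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    d = [x - y for x, y in zip(a, b)]   # per-pair differences
    n = len(d)
    d_bar = statistics.mean(d)          # mean difference
    s_d = statistics.stdev(d)           # sample SD of the differences
    standard_error = s_d / math.sqrt(n)
    return d_bar / standard_error, n - 1


t_stat, df = paired_t(alg1, alg2)

# Two-sided critical value for alpha = 0.05 and df = 8, from a t-table.
T_CRITICAL = 2.306
reject_h0 = abs(t_stat) > T_CRITICAL

print(f"t = {t_stat:.2f} with {df} degrees of freedom")
print("reject H0" if reject_h0 else "fail to reject H0")
```

Pairing matters here because both algorithms are timed on the same inputs; treating the two columns as independent samples would ignore that correlation.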
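
Worked example: confidence limits and a one-sample t-test. To make the body-temperature question from the One sample analysis slide concrete, the sketch below computes $\bar{x} \pm t_{1-\alpha/2,\,N-1}\, s/\sqrt{N}$ and $T = (\bar{x} - \mu_0)/(s/\sqrt{N})$ for $\mu_0 = 98.6$. The ten temperature readings are hypothetical values invented for illustration, not class data, and the critical value 2.262 (two-sided, $\alpha = 0.05$, 9 degrees of freedom) is read from a standard t-table.

```python
import math
import statistics

# HYPOTHETICAL body temperature readings (deg F), for illustration only.
temps = [98.2, 98.6, 97.9, 98.4, 98.8, 98.1, 98.3, 98.7, 98.0, 98.5]

MU_0 = 98.6          # hypothesized population mean
T_CRITICAL = 2.262   # two-sided, alpha = 0.05, df = 9, from a t-table

n = len(temps)
x_bar = statistics.mean(temps)
s = statistics.stdev(temps)            # sample standard deviation
standard_error = s / math.sqrt(n)

# Confidence limits for the mean: x_bar +/- t * s / sqrt(N)
margin = T_CRITICAL * standard_error
ci_low, ci_high = x_bar - margin, x_bar + margin

# One-sample t statistic: T = (x_bar - mu_0) / (s / sqrt(N))
t_stat = (x_bar - MU_0) / standard_error
reject_h0 = abs(t_stat) > T_CRITICAL   # two-sided test, Ha: mu != mu_0

print(f"95% confidence limits: ({ci_low:.2f}, {ci_high:.2f})")
print(f"T = {t_stat:.2f}:", "reject H0" if reject_h0 else "fail to reject H0")
```

The two views agree by construction: the two-sided test rejects $H_0$ at level $\alpha$ exactly when $\mu_0$ falls outside the corresponding confidence limits.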