Statistical Techniques: data preparation, descriptive statistics

Statistical Techniques
Robyn & Valerie
CSC 426
5/14/15
Outline
1. Motivation
2. Background & Getting to Know Your Data
3. Data Pre-processing
4. Inferential Statistics (Analysis)
5. Inference / Results
Motivation
Why are we here?
Motivation
• Effectively conduct research
• Know what statistics to use before collecting data
• To better read journal articles
• To further develop critical and analytic thinking
• To be an informed consumer
(Statistical) conclusion validity
• Degree to which conclusions we reach about
relationships in data are reasonable
• Is there actually a relationship ???
▫ Conclude there is a relationship when there isn’t one
▫ Conclude there isn’t a relationship when there is one!
Threats
• reliability of measures/observations
• statistical power
▫ Sample size
▫ Alpha level (Type I)
▫ Power (Type II)
• fishing and the error rate problem
• violated assumptions of statistical tests
Get to know your data
With some background
Get to Know Your Data
• Population vs. Sample
• Independent vs. Dependent Variables
• Data Types
• Descriptives
• Distributions
• Correlation
Population vs Sample
Data types
• Nominal
• Ordinal
• Interval/Ratio
GDP
USA
• $16,768,100,000,000
• Rank: 1
• Percentile: 100th
CYPRUS
• $22,767,000,000
• Rank: 102
• Percentile: 46th
Actual vs. change
Country | GDP | GDP Growth
Japan | $4.7913 Trillion | 2.26%
China | $1.1838 Trillion | 8.43%
Descriptive Statistics
Univariate
Describes the distribution of a single variable
• Central Tendency
• Five Number Summary
• Dispersion
• Measures of Spread
• Shape
Bivariate
Describes the relationship between pairs of variables
• Cross-tabulations and Contingency Tables
• Graphical Representation via Scatterplots
• Quantitative Measures of Dependence
Measures of Central Tendency
Hockey Player Points Scored
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
• Mean ≈ 24
• Median = 24
• Mode = 24
Measures of Central Tendency
Hockey Player Points Scored
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50, 517
• Mean ≈ 54.7
• Median = 24
• Mode = 24
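The effect of the single extreme value (517) on the mean, but not on the median or mode, can be checked directly. A minimal sketch using Python's standard `statistics` module; the helper name `central_tendency` is made up for this illustration.

```python
from statistics import mean, median, mode

# Hockey player points from the slides, without and with the extreme value.
points = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]
with_outlier = points + [517]

def central_tendency(values):
    """Return (mean, median, mode) for a list of numbers."""
    return mean(values), median(values), mode(values)

base = central_tendency(points)          # mean is about 23.9
skewed = central_tendency(with_outlier)  # mean jumps to about 54.7

print("without outlier:", base)
print("with outlier:   ", skewed)
```

The median and mode stay at 24 in both cases, which is why they are often preferred for skewed data.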
Five Number Summary / Measures of Dispersion / Measures of Spread
Hockey Player Points Scored
6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50
• Five number summary: Minimum, First Quartile, Median, Third Quartile, Maximum
• Range = Max - Min = 44
• Standard Deviation (SD) = 11.2
• Variance = s^2 = 126.4
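The spread figures above can be reproduced with the standard library. This is a sketch, not the presenters' own computation: it assumes the sample (n - 1) variance, which matches the 126.4 on the slide, and it uses the default "exclusive" method of `statistics.quantiles`. Quartile conventions differ between tools, so the quartile values depend on that choice.

```python
from statistics import quantiles, stdev, variance

points = [6, 7, 13, 17, 20, 22, 24, 24, 24, 25, 27, 28, 35, 36, 50]

# Quartiles with the default "exclusive" method (other methods may differ).
q1, q2, q3 = quantiles(points, n=4)
five_number = (min(points), q1, q2, q3, max(points))

value_range = max(points) - min(points)
sample_variance = variance(points)  # divides by n - 1
sample_sd = stdev(points)           # square root of the sample variance

print("five-number summary:", five_number)
print("range:", value_range)
print("variance:", round(sample_variance, 1))
print("SD:", round(sample_sd, 1))
```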
Correlation
Correlation vs. Causation
Correlation does not imply causation
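One common quantitative measure of dependence is Pearson's r. The sketch below computes it from its definition; the data are hypothetical and the function name `pearson_r` is only illustrative. A large r describes how strongly two variables move together and says nothing about which one, if either, causes the other.

```python
from math import sqrt
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired observations.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
r = pearson_r(xs, ys)
print("r =", round(r, 3))
```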
Data preparation / pre-processing
Cleaning, integrating and transforming your data!
Dirty Data
Major threats to conclusion validity
• Incomplete
▫ occupation=“ ”
• Noisy
▫ Salary=“-10”
• Inconsistent:
▫ Age=“42” Birthday=“03/07/1997”
▫ Was rating “1,2,3”, now rating “A, B, C”
▫ Discrepancy between duplicate records
Forms of data pre-processing
• Cleaning
• Integration
• Transformation
• Reduction
Data cleaning
▫ Fill in missing values (manual vs. automatic)
◦ Ignore
◦ Constant: “unknown”, a new class?!
◦ Attribute mean (of entire set or subset)
◦ Most probable value: inference-based
▫ Identify outliers and smooth out noisy data
◦ Binning method
◦ Clustering
◦ Combined computer and human inspection
◦ Regression
▫ Correct inconsistent data
▫ Resolve redundancy caused by data integration
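Two of the cleaning options listed above, filling missing values with the attribute mean and smoothing noisy values by binning, can be sketched in a few lines. This is an illustration under assumptions: missing values are represented as `None`, bins are equal-frequency, each value is replaced by its bin mean, and all data and function names are hypothetical.

```python
from statistics import mean

def fill_with_mean(values):
    """Replace None entries with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

def smooth_by_bin_means(values, bin_size):
    """Sort the values, split them into equal-frequency bins,
    and replace every value by the mean of its bin."""
    ordered = sorted(values)
    smoothed = []
    for start in range(0, len(ordered), bin_size):
        bin_values = ordered[start:start + bin_size]
        smoothed.extend([mean(bin_values)] * len(bin_values))
    return smoothed

# Hypothetical salaries (in thousands) with two missing entries.
salaries = [40, None, 52, 48, None, 60]
filled = fill_with_mean(salaries)

# Hypothetical noisy prices, smoothed in bins of three.
prices = [4, 8, 15, 21, 21, 24, 25, 28, 34]
smoothed = smooth_by_bin_means(prices, 3)

print(filled)
print(smoothed)
```

Mean imputation keeps the attribute mean unchanged but shrinks its variance, so it is worth reporting whenever it is used.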
Outlier Detection
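[Figure: outlier detection by cluster analysis and by regression; the regression panel shows the fitted line y = x + 1 with an observed value Y1 and its fitted value Y1’ at X1]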
Data Integration
• Remove redundancies
▫ Correlational analysis
• Integrate Schemas
• Detect and resolve value conflicts
Data Transformation
• Smoothing
• Aggregation
• Generalization
• Normalization
• Attribute/feature construction
▫ $\mathrm{BMI} = \dfrac{\text{weight (kg)}}{\text{height (m)}^2}$
Normalization
• min-max normalization
• z-score normalization (standardization)
• normalization by decimal scaling: $v' = v / 10^j$, where j is the smallest integer such that $\max(|v'|) < 1$
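The three normalization schemes can be sketched as follows. The formulas for min-max and z-score did not survive in the transcript, so the usual textbook definitions are assumed here: min-max maps values linearly onto a target range, and z-score subtracts the mean and divides by the standard deviation (the population SD is used in this sketch). Function names are illustrative, and the decimal-scaling helper only searches non-negative j.

```python
from statistics import mean, pstdev

def min_max(values, new_min=0.0, new_max=1.0):
    """Linearly rescale values onto [new_min, new_max].
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min
            for v in values]

def z_score(values):
    """Standardize: subtract the mean, divide by the population SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def decimal_scaling(values):
    """Divide by 10**j for the smallest j >= 0 with max(|v'|) < 1."""
    largest = max(abs(v) for v in values)
    j = 0
    while largest / 10 ** j >= 1:
        j += 1
    return [v / 10 ** j for v in values], j

# Hypothetical attribute values.
data = [10, 20, 30]
scaled = min_max(data)
standardized = z_score(data)
shifted, j = decimal_scaling([-986, 200, 917])

print(scaled)
print(standardized)
print(shifted, j)
```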
Inferential Statistics (Analysis)
Parametric and Non-parametric
Parametric vs Nonparametric
Parametric
• Interval or ratio scales
• Data fall into a normal distribution
• More complex and powerful analysis
• For each analysis method, check which assumptions are absolutely necessary for its use
• Do not violate assumptions
Nonparametric
• Ordinal
• Bi-modal or skewed distributions
• Fewer assumptions in general
• Number of parameters grows with the training data
• More robust
• Simpler, can be used when less is known about the application
• Downside - A larger sample size may be needed to draw conclusions with the same confidence
Inferential procedures
Purpose | Parametric | Non-parametric
Sig. difference between 2 MCTs | Student’s t-test (means) | Mann-Whitney U (median); Wilcoxon signed rank test (median, correlated)
Sig. difference between 3 or more MCTs | ANOVA | Kruskal-Wallis test
Sig. diff among MCT while controlling for covariate | ANCOVA | -
Is r larger than it would be by chance? | T-test for r | Fisher’s exact test
How closely observations match expected (freq. or probability) | - | Chi-square ($\chi^2$) goodness of fit
MCT: Measure of central tendency (mean, median, mode)
So you designed an experiment, what now?
Design | Analysis
Experimental
Two-group posttest-only randomized | T-test, One-way ANOVA
Factorial | ANOVA
Randomized block design | ANOVA with blocking
Analysis of Covariance | ANCOVA
Quasi-experimental (no random assignment)
Nonequivalent Groups (NEGD) | Reliability-corrected ANCOVA
Regression-Discontinuity | Polynomial regression
Regression Point Displacement | ANCOVA variant
General Linear Model (GLM)
• $Y = X\beta + \epsilon$
• Assumption: $\epsilon \sim N(0, \sigma^2)$
▫ Independent
▫ Identically
▫ Normally
▫ … distributed
• Basis for
▫ Student t-test
▫ ANOVA
▫ ANCOVA
▫ correlation t-test
▫ regression
▫ factor analysis
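The simplest instance of the general linear model has one predictor, $y = \beta_0 + \beta_1 x + \epsilon$, and can be fitted by ordinary least squares in closed form. The sketch below assumes hypothetical data scattered around y = x + 1; the function name `fit_line` is illustrative. The residuals it returns are what the assumption checks (independence, identical distribution, normality) are applied to.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical observations, roughly following y = x + 1.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 2.9, 4.2, 4.8, 6.0]

intercept, slope = fit_line(xs, ys)

# Residuals: observed minus fitted values (the estimates of epsilon).
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

print("intercept:", round(intercept, 2), "slope:", round(slope, 2))
print("residuals:", [round(e, 2) for e in residuals])
```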
Checking assumptions: iid residuals
Checking assumptions: Normality via Q-Q plots
Hypothesis/significance testing
• Testing whether claims or hypotheses regarding a population are
likely to be true
• State hypotheses (H0 and Ha)
▫ H0 assumed to be true but we think it is wrong
▫ Ha contradicts H0 (what we think is wrong about H0)
• Set criteria for decision
▫ amount of error we wish to accept
• Compute test statistic
▫ mathematical formula that allows researchers to determine the
likelihood of obtaining sample outcomes if the null hypothesis were true
• Make a decision
▫ reject or fail to reject null hypothesis
t vs. z
One sample analysis
Confidence limits for the mean
• How well your sample average estimates the population mean
• $\bar{x} \pm t_{1-\alpha/2,\,N-1} \dfrac{s}{\sqrt{N}}$
One-sample t-test
• Based on your sample, is the population mean different from some value
▫ Is the true mean body temp 98.6 deg based on a sample from this class?
• $H_0: \mu = \mu_0$
• $H_a: \mu \neq \mu_0$, $\mu < \mu_0$, or $\mu > \mu_0$
• $T = \dfrac{\bar{x} - \mu_0}{s / \sqrt{N}}$
• Compare this value with the t-table and decide whether or not to reject
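Both one-sample procedures can be sketched with the standard library. The body temperatures below are hypothetical, not data from the class. Python's standard library has no t-distribution, so the two-sided 5% critical value for 9 degrees of freedom (2.262) is copied from a t-table and hard-coded, which mirrors the table lookup described above.

```python
from math import sqrt
from statistics import mean, stdev

def one_sample_t(sample, mu0):
    """T = (x_bar - mu0) / (s / sqrt(N))."""
    n = len(sample)
    return (mean(sample) - mu0) / (stdev(sample) / sqrt(n))

def confidence_limits(sample, t_crit):
    """x_bar +/- t_crit * s / sqrt(N)."""
    n = len(sample)
    half_width = t_crit * stdev(sample) / sqrt(n)
    return mean(sample) - half_width, mean(sample) + half_width

# Hypothetical body temperatures (deg F), N = 10.
temps = [98.2, 98.6, 97.9, 98.4, 98.0, 98.8, 98.1, 98.3, 98.5, 98.2]

# Assumption: alpha = 0.05, two-sided, df = N - 1 = 9, value from a t-table.
T_CRIT_DF9 = 2.262

t_stat = one_sample_t(temps, 98.6)
low, high = confidence_limits(temps, T_CRIT_DF9)
reject = abs(t_stat) > T_CRIT_DF9

print("T =", round(t_stat, 2))
print("95% confidence limits:", round(low, 2), round(high, 2))
print("reject H0:", reject)
```

For a two-sided test the two views agree: the null value is rejected exactly when it falls outside the confidence limits.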
Two sample t-tests
• Paired
▫ Assume two samples are correlated (or dependent)
▫ Same as the one-sample t-test, using the mean of the paired differences as the sample mean and 0 as $\mu_0$
• Not paired
▫ Pooled variance
◦ Assume populations have the same variance
▫ Not pooled variance
Example: Run time
Alg 1 | Alg 2 | d | $d - \bar{d}$
1.2 | 1.4 | 0.2 | -1.27
4.2 | 2.3 | 1.9 | 0.43
2.3 | 1.2 | 1.2 | -0.27
3.4 | 2.1 | 1.3 | 0.17
4.1 | 1.3 | 2.8 | 1.33
4.2 | 3.2 | 1 | -0.47
2.1 | 1.2 | 0.9 | -1.43
3.2 | 1.3 | 1.9 | 0.43
4.2 | 2.1 | 2.1 | 0.63
• $H_0: \mu_1 = \mu_2$
• $H_1: \mu_1 < \mu_2$ or $\mu_1 > \mu_2$ or $\mu_1 \neq \mu_2$
• Significance level ($\alpha$) of 0.05
Statistic | Value
n | 9
$\bar{d}$ | 1.47
s | 0.28
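A paired t-test on the run-time columns can be sketched as a one-sample t-test on the differences. Note the assumptions: the differences are recomputed here as Alg 1 minus Alg 2 directly from the two columns, so the summary values need not match the transcribed d column exactly, and the two-sided 5% critical value for 8 degrees of freedom (2.306) is taken from a t-table.

```python
from math import sqrt
from statistics import mean, stdev

alg1 = [1.2, 4.2, 2.3, 3.4, 4.1, 4.2, 2.1, 3.2, 4.2]
alg2 = [1.4, 2.3, 1.2, 2.1, 1.3, 3.2, 1.2, 1.3, 2.1]

# Paired differences, one per run.
diffs = [a - b for a, b in zip(alg1, alg2)]

n = len(diffs)
d_bar = mean(diffs)            # mean difference
s_d = stdev(diffs)             # sample SD of the differences
std_err = s_d / sqrt(n)        # standard error of the mean difference
t_stat = (d_bar - 0) / std_err  # H0: mean difference is 0

# Assumption: alpha = 0.05, two-sided, df = n - 1 = 8, value from a t-table.
T_CRIT_DF8 = 2.306
reject = abs(t_stat) > T_CRIT_DF8

print("n =", n)
print("mean difference =", round(d_bar, 2))
print("standard error =", round(std_err, 2))
print("T =", round(t_stat, 2), "reject H0:", reject)
```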
Non-parametric
• Fewer assumptions in general
• No assumption made about the distribution of the data
• Number of parameters grows with the training data
• Used for data that takes on a ranked order without clear numerical interpretation
• More robust
• Simpler, can be used when less is known about the application
• Downside - A larger sample size may be needed to draw conclusions with the same confidence
Ordinal / Not Interval
Experimental Design | Analysis
Two-group posttest-only randomized experiment (equivalent to independent samples T-test) | Mann-Whitney U
Two-group posttest-only randomized experiment (equivalent to dependent samples T-test) | Wilcoxon Signed-Rank Test
Three or more groups (equivalent to ANOVA) | Kruskal-Wallis Test
Two Dichotomous Variables
Experimental Design | Analysis
Nominal Variables; Significant Correlation (equivalent to T-test for Pearson’s r) | Odds Ratio
Nominal or Ordinal; Significant Correlation; Small sample size (equivalent to T-test for Pearson’s r) | Fisher’s Exact Test
Chi-square ($\chi^2$) test
• Determines how closely observed frequencies or
probabilities match expected
• Can be used for nominal, ordinal, interval, or ratio
data types
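The goodness-of-fit statistic sums the squared differences between observed and expected counts, each scaled by the expected count. A minimal sketch with hypothetical die-roll counts; the 5% critical value for 5 degrees of freedom (11.07) is copied from a chi-square table because the standard library has no chi-square distribution.

```python
def chi_square_gof(observed, expected):
    """Sum of (observed - expected)^2 / expected over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical counts from 60 rolls of a die; a fair die expects 10 per face.
observed = [8, 12, 9, 11, 14, 6]
expected = [10] * 6

stat = chi_square_gof(observed, expected)

# Assumption: alpha = 0.05, df = categories - 1 = 5, value from a table.
CHI2_CRIT_DF5 = 11.07
reject = stat > CHI2_CRIT_DF5

print("chi-square =", stat, "reject H0:", reject)
```

Here the statistic is well below the critical value, so these counts give no evidence against a fair die.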
The best paper I ever read
• Zhang, Min-Ling, and Kun Zhang. "Multi-label learning
by exploiting label dependency." Proceedings of the
16th ACM SIGKDD international conference on
Knowledge discovery and data mining. ACM, 2010.
• “In multi-label learning, each training example is
associated with a set of labels and the task is to
predict the proper label set for the unseen example.”
What they did very well
• Explain their experimental design
▫ Ten-fold cross-validation
▫ mean metric value as well as the standard deviation of
each algorithm is recorded.
▫ pairwise t-tests at 5% significance level are conducted
between the algorithms.
• Use effective summary tables
• Specify parameters, algorithms
• Use many, many evaluation metrics
Data Descriptives
Results
Inference/Results
P-values and visualization
The p-value
• Definition
▫ The probability, under assumption of the null
hypothesis, of obtaining a result equal to or
more extreme than what was actually
observed.
• Weighs the strength of the evidence
• Not a measure of how right the analysis is
• Not a measure of how significant the difference
is
• You can only see whether your hypothesis is
consistent with the data
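The definition above can be made concrete for the simplest case, a test statistic that follows a standard normal distribution under the null hypothesis (the z case). This is a sketch using `statistics.NormalDist` from the standard library; a t statistic would need the t-distribution instead, which the standard library does not provide. The function name is illustrative.

```python
from statistics import NormalDist

def two_sided_p_from_z(z):
    """Probability, under H0, of a z statistic at least as extreme
    as the observed one in either direction."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

p_at_cutoff = two_sided_p_from_z(1.96)  # close to 0.05
p_at_zero = two_sided_p_from_z(0.0)     # no evidence against H0

print(round(p_at_cutoff, 4), p_at_zero)
```

A small p-value means the observed result would be unusual if the null hypothesis were true; it does not measure the size or importance of the difference.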
The power of visualizing data
• Transform massive amounts of data into something
meaningful
• More accessible and understandable to a broader
audience
• Aim to make your data or results understandable through visual representation and presentation
Viz like a pro
1 - Establish the visualization's context and ideas
2 - Acquire, familiarize with and prepare your data
3 - Determine the editorial focus of your subject matter
4 - Conceive your design: data representation and
presentation
5 - Construct and evaluate your design solution
