Data and Information Resources, Role of Hypothesis, Exploration and Distributions
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2a, February 2, 2016, LALLY 102
1
Contents
• Data sources
– Cyber
– Human
• “Munging”
• Exploring
– Distributions…
– Summaries
– Visualization
• Testing and evaluating
the results (beginning)
2
Lower layers in the Analytics Stack
3
“Cyber Data” …
4
“Human Data” …
5
Data Prepared for Analysis = Munging
• Missing values, null values, etc.
• E.g. in the EPI_data – they use “--”
– Most data applications provide built-ins for handling these – in R, missing values are coded as NA, and functions such as is.na(var) provide powerful filtering options (we'll cover these on Friday; a short sketch follows below)
• Of course, different variables are often missing "different" values
• In R – higher-order functions such as: Reduce,
Filter, Map, Find, Position and Negate will become
your enemies and then your friends:
http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
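For instance, a minimal sketch of the NA idiom (assuming the "--" coding in EPI_data noted above):
> EPI <- EPI_data$EPI
> EPI[EPI == "--"] <- NA       # recode the sentinel as a real missing value
> EPI <- as.numeric(EPI)
> sum(is.na(EPI))              # how many values are missing?
> Filter(Negate(is.na), EPI)   # higher-order style: keep only the non-missing values
> EPI[!is.na(EPI)]             # the equivalent vectorized idiom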
6
Getting started – summarize data
• Summary statistics
– Ranges, “hinges”
– Tukey’s five numbers
• Look for a distribution match
• Tests…for…
– Normality – Shapiro-Wilk – returns a statistic (W!)
and a p-value – what is the null hypothesis here?
> shapiro.test(EPI_data$EPI)
Shapiro-Wilk normality test
data: EPI_data$EPI
W = 0.9866, p-value = 0.1188
7
Accept or Reject?
• Reject the null hypothesis if the p-value
is less than the level of significance.
• You will fail to reject the null hypothesis if
the p-value is greater than or equal to the
level of significance.
• Typical significance level: 0.05 (!)
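In code the rule is just a comparison – a minimal sketch using the Shapiro-Wilk result above:
> res <- shapiro.test(EPI_data$EPI)
> res$p.value < 0.05   # FALSE here (p = 0.1188), so we fail to reject normality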
8
Another variable in EPI
> shapiro.test(EPI_data$DALY)
Shapiro-Wilk normality test
data: EPI_data$DALY
W = 0.9365, p-value = 1.891e-07
Accept or reject?
9
Distribution tests
• Binomial, … – most distributions have associated tests
• Wilcoxon (Mann-Whitney)
– Compares two populations, rather than comparing a sample to a distribution
• Kolmogorov-Smirnov (KS) – example below
• …
• It got out of control when people realized they could name the test after themselves, v. someone else…
10
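For example, a one-sample Kolmogorov-Smirnov test against a normal distribution (a sketch; note that estimating the mean and sd from the same data is what Lilliefors' variant corrects for):
> m <- mean(EPI_data$EPI, na.rm=TRUE)
> s <- sd(EPI_data$EPI, na.rm=TRUE)
> ks.test(EPI_data$EPI, "pnorm", m, s)   # null: the sample is drawn from Normal(m, s)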
Getting started – look at the data
• Visually
– What is the improvement in the understanding of
the data as compared to the situation without
visualization?
– Which visualization techniques are suitable for
one's data?
• Scatter plot diagrams
• Box plots (min, 1st quartile, median, 3rd quartile, max)
• Stem and leaf plots
• Frequency plots
• Grouped Frequency Distribution plots
• Cumulative Frequency plots
• Distribution plots
11
Why visualization?
• Reducing amount of data, quantization
• Patterns
• Features
• Events
• Trends
• Irregularities
• Leading to presentation of data, i.e. information products
• Exit points for analysis
12
Exploring the distribution
> summary(EPI)   # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
32.10 48.60 59.20 58.37 67.60 93.50 68
> fivenum(EPI,na.rm=TRUE)
[1] 32.1 48.6 59.2 67.6 93.5
Tukey: min, lower hinge, median, upper hinge, max
> boxplot(EPI)
[Boxplot of EPI; vertical axis from 30 to 90]
13
Stem and leaf plot
> stem(EPI)   # like a histogram
The decimal point is 1 digit(s) to the right of the | – but the scale of the stem is 10… watch carefully…
3 | 234
3 | 66889
4 | 00011112222223344444
4 | 5555677788888999
5 | 0000111111111244444
5 | 55666677778888999999
6 | 000001111111222333344444
6 | 5555666666677778888889999999
7 | 000111233333334
7 | 5567888
8 | 11
8 | 669
9 | 4
14
Grouped Frequency Distribution aka binning
> hist(EPI)   # defaults
15
Distributions
• Shape
• Character
• Parameter(s)
• Which one fits?
16
Histogram of EPI
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI, na.rm=TRUE, bw=1.))
> rug(EPI)
or
> lines(density(EPI, na.rm=TRUE, bw="SJ"))
[Histogram of EPI with density overlay and rug; x: EPI 30–90, y: Density 0.00–0.05]
17
Histogram of EPI
> hist(EPI, seq(30., 95., 1.0), prob=TRUE)
> lines(density(EPI, na.rm=TRUE, bw="SJ"))
[Histogram of EPI with "SJ"-bandwidth density overlay; x: EPI 30–90, y: Density 0.00–0.05]
18
Why are histograms so unsatisfying?
[Two histograms of EPI, side by side; x: EPI 30–90, y: Density 0.00–0.05]
19
Histogram of EPI
> xn <- seq(30,95,1)
> qn <- dnorm(xn, mean=63, sd=5, log=FALSE)
> lines(xn, qn)
> lines(xn, .4*qn)
> ln <- dnorm(xn, mean=44, sd=5, log=FALSE)
> lines(xn, .26*ln)
[Histogram of EPI with scaled normal density curves overlaid; x: EPI 30–90, y: Density 0.00–0.05]
20
Exploring the distribution
> summary(DALY)   # stats
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
0.00 37.19 60.35 53.94 71.97 91.50 39
> fivenum(DALY,na.rm=TRUE)
[1] 0.000 36.955 60.350 72.320 91.500
[Side-by-side boxplots: 1 = EPI, 2 = DALY; vertical axis 0–80]
21
Stem and leaf plot
> stem(DALY)
The decimal point is 1 digit(s) to the right of the |
0 | 0000111244
0 | 567899
1 | 0234
1 | 56688
2 | 000123
2 | 5667889
3 | 00001134
3 | 5678899
4 | 00011223444
4 | 555799
5 | 12223344
5 | 556667788999999
6 | 0000011111222233334444
6 | 6666666677788889999
7 | 00000000223333444
7 | 66888999
8 | 1113333333
8 | 555557777777777799999
9 | 22
22
Beyond histograms
• Cumulative distribution function: probability that a
real-valued random variable X with a given
probability distribution will be found at a value less
than or equal to x.
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
23
Beyond histograms
• Quantile ~ inverse cumulative distribution function –
points taken at regular intervals from the CDF, e.g.
2-quantile = median, 4-quantiles = quartiles
• Quantile-Quantile (versus default=normal dist.)
> par(pty="s")
> qqnorm(EPI); qqline(EPI)
24
Beyond histograms
• Simulated data from t-distribution (random):
> x <- rt(250, df = 5)
> qqnorm(x); qqline(x)
25
Beyond histograms
• Q-Q plot against the generating distribution:
> qqplot(qt(ppoints(250), df = 5), x, xlab = "Q-Q plot for t dsn")
> qqline(x)
26
But if you are not sure it is normal
> wilcox.test(EPI,DALY)
Wilcoxon rank sum test with continuity
correction
data: EPI and DALY
W = 15970, p-value = 0.7386
alternative hypothesis: true location shift is not
equal to 0
27
Comparing the CDFs
> plot(ecdf(EPI), do.points=FALSE, verticals=TRUE)
> plot(ecdf(DALY), do.points=FALSE, verticals=TRUE, add=TRUE)
[Overlaid ECDFs, titled ecdf(EPI); x: 30–100, y: Fn(x) 0.0–1.0]
28
More munging
• Bad values, outliers, corrupted entries,
thresholds …
• Noise reduction – low-pass filtering, binning
• Modal filtering
• REMEMBER: when you munge you MUST
record what you did (and why) and save
copies of the data pre- and post-operation… (a sketch follows below)
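A minimal sketch of that discipline (the threshold rule here is hypothetical):
> EPI_raw <- EPI_data$EPI          # keep an untouched pre-munge copy
> EPI_clean <- EPI_raw
> EPI_clean[EPI_clean < 0] <- NA   # example rule: negative EPI values are impossible
> # record: replaced values < 0 with NA, because the index scale starts at 0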
33
Populations within populations
• In the EPI example:
– Geographic regions (GEO_subregion)
– EPI_regions
– Eco-regions (EDC v. LEDC – know what that is?)
– Primary industry(ies)
– Climate region
• What would you do to start exploring? (one possibility sketched below)
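One possibility, as a sketch (it assumes GEO_subregion is a column of EPI_data, as named above):
> boxplot(EPI ~ GEO_subregion, data=EPI_data)             # compare EPI across subregions
> tapply(EPI_data$EPI, EPI_data$GEO_subregion, summary)   # per-group summary statistics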
36
Or, a twist – n=1 but many attributes?
The item of interest in relation to its attributes
38
Summary: explore
• Going from preliminary to initial analysis…
• Determining if there is one or more common
distributions involved – i.e. parametric
statistics (assumes or asserts a probability
distribution)
• Fitting that distribution -> provides a model! (a fitting sketch follows this list)
• Or NOT
– A hybrid or
– Non-parametric (statistics) approaches are
needed – more on this to come
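One way to fit a candidate distribution is fitdistr() from the MASS package – a sketch, not part of the lecture code:
> library(MASS)
> fitdistr(EPI[!is.na(EPI)], "normal")   # maximum-likelihood estimates of mean and sd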
39
Goodness of fit
• And, we cannot take the models at face
value; we must assess how well they fit:
– Chi-Square
– One-sided and two-sided Kolmogorov-Smirnov
tests (a two-sample example follows below)
– Lilliefors tests
– Ansari-Bradley tests
– Jarque-Bera tests
• Just a preview…
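The two-sample KS test is already in base R – a sketch comparing the two variables used earlier:
> ks.test(EPI, DALY)   # null hypothesis: EPI and DALY come from the same distribution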
40
Summary
• Cyber and Human data; quality, uncertainty and
bias – you will often spend a lot of time with the
data
• Distributions – the common and not-so common
ones and how cyber and human data can have
distinct distributions
• How simple statistical distributions can mislead
us
• Populations and samples and how inferential
statistics will lead us to model choices (no, we
have not actually done that in detail yet)
• Munging toward exploratory analysis
• Toward models!
41
How are the software installs going?
• R
• Data exercises?
– You can try some of the examples from today on
the EPI dataset
• More on Friday… and other datasets.
42