
Introductory Statistics/
Refresher and Relevant
Software Installation.
Peter Fox
Data Analytics – ITWS-4600/ITWS-6600
Week 1b, January 29, 2016
1
DATUM survey…
2
Admin info (keep/ print this slide)
• Class: ITWS-4600/ITWS-6600
• Hours: 12:00pm-1:50pm Tuesday/ Friday
• Location: Lally 102
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A, announced by email)
• TA: Rahul Divekar [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2016
– Schedule, lectures, syllabus, reading, assignments, etc.
– http://aquarius.tw.rpi.edu/html/DA/
3
Assignment 1
• Drop that Web link that does not resolve…
4
Today
• Initial review of stats and terms that are
important for this course
• Then… check in on installation of application
software, and
• Getting some data to read, explore, etc.
5
Definitions/ topics
• Statistic
• Statistics
• Population and
Samples
• Sampling
• Distributions and
parameters
• Central Tendencies
• Frequency
• Probability
• Significance tests
• Hypothesis (null and
alternate)
• P-value
• Density and cumulative
distributions
6
Statistic and Statistics
• Statistic (not to be confused with Statistics)
– Characteristic or measure obtained from a
sample.
• Statistics
– Collection of methods for planning experiments,
obtaining data, and then organizing,
summarizing, presenting, analyzing, interpreting,
and drawing conclusions.
7
Populations and samples
• A population is defined (“all” of the data)
– We must be able to say, for every object, if it is in the population or
not
– We must be able, in principle, to find every individual of the
population
– Inferential statistics apply here - Generalizing from samples to
populations using probabilities. Performing hypothesis testing,
determining relationships between variables, and making
predictions.
• A sample is a subset of a population (“some” of the data)
– We must be able to say, for every object in the population, if it is in
the sample or not (detecting “outliers”, “errors”, etc.)
– Sampling is the process of selecting a sample from a population
– Descriptive statistics apply here (especially distributions)
8
E.g. Election prediction
• Exit polls versus election results
– Human versus cyber
• How is the “population” defined here?
• What is the sample, how is it chosen?
– What is described and how is that used to
predict?
– Are results categorized? (where from, M/F, age)
• What is the uncertainty?
– It is reflected in the “sample distribution”
– And controlled/ constrained by “sampling theory”
9
Sampling Types (basic)
• Random Sampling
– Sampling in which the data is collected using chance methods or
random numbers.
• Systematic Sampling
– Sampling in which data is obtained by selecting every kth object.
• Convenience Sampling
– Sampling in which data that is readily available is used.
• Stratified Sampling
– Sampling in which the population is divided into groups (called strata)
according to some characteristic. Each of these strata is then
sampled using one of the other sampling techniques.
• Cluster Sampling
– Sampling in which the population is divided into groups (usually
geographically). Some of these groups are randomly selected, and
then all of the elements in those groups are selected.
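A minimal Python sketch of the five sampling types above, standard library only. The population of 100 numbered objects, the even/odd strata rule, and the cluster size of 20 are made-up assumptions for illustration.

```python
import random

random.seed(42)  # fixed seed so the sketch is repeatable

population = list(range(100))  # hypothetical population of 100 numbered objects

# Random sampling: chance selection without replacement
random_sample = random.sample(population, 10)

# Systematic sampling: every kth object, starting from a random offset
k = 10
start = random.randrange(k)
systematic_sample = population[start::k]

# Convenience sampling: whatever is readily available (here, the first 10)
convenience_sample = population[:10]

# Stratified sampling: divide into strata (even/odd is an arbitrary rule),
# then sample each stratum with one of the other techniques (random here)
strata = {
    "even": [x for x in population if x % 2 == 0],
    "odd": [x for x in population if x % 2 == 1],
}
stratified_sample = [x for group in strata.values() for x in random.sample(group, 5)]

# Cluster sampling: divide into groups, pick some groups at random,
# then keep every element of the chosen groups
clusters = [population[i:i + 20] for i in range(0, len(population), 20)]
chosen_clusters = random.sample(clusters, 2)
cluster_sample = [x for cluster in chosen_clusters for x in cluster]
```

Note the difference between the last two: stratified sampling takes some elements from every group, cluster sampling takes every element from some groups.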
10
Random Numbers
• Can a computer generate a random number?
• Can you?
• Origin – to reduce selection bias!
• In R – many ways – see help on Random
{base} and get familiar with set.seed
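A small sketch of why seeding matters, written in Python rather than R (R's set.seed plays the same role as the seed passed to random.Random here). The helper name draw and the seed values are illustrative assumptions.

```python
import random

def draw(seed, n=5):
    """Return n pseudo-random numbers from a generator started at `seed`."""
    rng = random.Random(seed)  # independent generator with its own seed
    return [rng.random() for _ in range(n)]

first = draw(2016)
second = draw(2016)  # same seed: the identical sequence comes back
other = draw(2017)   # different seed: a different sequence
```

The numbers are pseudo-random: fully determined by the seed, which is what makes an analysis reproducible.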
11
Sampling Theory
• See Nyquist–Shannon – for time-series*
• Basically, if there are no frequencies greater
than x, then you need to sample at a rate of at
least 2x per unit time
• Not well known application: good, better, best
– How many samples?
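A hedged illustration of what goes wrong below the 2x rate, in Python. The frequencies (3 Hz and 1 Hz) and the 4 samples-per-second rate are made-up values chosen so the aliasing is exact.

```python
import math

def sample_cosine(freq_hz, rate_hz, n_samples):
    """Sample cos(2*pi*f*t) at times t = n / rate_hz."""
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz) for n in range(n_samples)]

rate = 4.0                          # samples per second
high = sample_cosine(3.0, rate, 8)  # 3 Hz is above rate/2: undersampled
low = sample_cosine(1.0, rate, 8)   # 1 Hz is below rate/2: adequately sampled

# At this rate the 3 Hz signal produces the same samples as a 1 Hz signal
indistinguishable = all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(high, low))
```

Once sampled too slowly, the two signals cannot be told apart from the data alone.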
12
Minimum Sample Size
• Typical formula** is
– N=(z * std deviation)^2/ (margin of error)^2
– May need to estimate std deviation
– z is from confidence intervals (normal
distribution)
– Margin of error is your tolerance for being wrong
– E.g. for elections ~7000 ! Based on 1% error and
95% confidence…
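A minimal sketch of the formula above in Python. The function name and the example inputs (standard deviation 0.5, which is the worst case for a yes/no proportion, and a 3% margin of error) are assumptions for illustration, not the values behind the election figure on the slide.

```python
import math
from statistics import NormalDist

def min_sample_size(std_dev, margin_of_error, confidence=0.95):
    """N = (z * std_dev)^2 / margin_of_error^2, rounded up to a whole sample."""
    # z for a two-sided confidence level, from the standard normal distribution
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * std_dev / margin_of_error) ** 2)

n = min_sample_size(std_dev=0.5, margin_of_error=0.03)  # 1068
```

Halving the margin of error roughly quadruples N, and the result is only as good as the estimate of the standard deviation.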
13
Bias difference: between
cyber and human data
• Election results and exit polls
– What are examples of bias in election results?
– In exit polls?
14
Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)
– Mean
– Standard deviation
– Skewness
– Etc.
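A short Python sketch that computes the three parameters named above for a sample. The data values are made up, and the skewness helper uses the simple moment definition (one of several in common use).

```python
import statistics

def skewness(values):
    """Moment skewness: mean cubed deviation divided by the cubed population SD."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (len(values) * sd ** 3)

data = [1, 2, 2, 3, 3, 3, 4, 4, 10]  # made-up values with a long right tail

mean = statistics.fmean(data)
sd = statistics.stdev(data)  # sample standard deviation (n - 1 denominator)
skew = skewness(data)        # positive: the tail is on the right
```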
15
Plotting these distributions
• Histograms and binning
• Getting used to log scales
• Going beyond 2-D
• More of this next week (in more detail)
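A hedged sketch of how the choice of bins changes a histogram, in plain Python with no plotting. The data values and both sets of bin edges are made up; the point is only that log-spaced bins spread out data spanning several orders of magnitude.

```python
def histogram(values, edges):
    """Count values falling in the half-open bins [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(edges) - 1):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return counts

values = [1, 2, 3, 5, 8, 13, 40, 75, 300, 900]  # made-up data over three decades

linear_edges = [0, 250, 500, 750, 1000]
log_edges = [10 ** e for e in range(0, 4)]  # 1, 10, 100, 1000

linear_counts = histogram(values, linear_edges)  # [8, 1, 0, 1]
log_counts = histogram(values, log_edges)        # [5, 3, 2]
```

With linear bins nearly everything lands in the first bin; with log bins the structure at small values becomes visible.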
16
In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!
17
Heavy-tail distributions
• are probability distributions whose tails are
not exponentially bounded
• Common – long-tail… human v. cyber…
[Figure: heavy-tailed distribution. Labels: “Few that dominate”, “More that add up”, “Equal areas”. Source: http://en.wikipedia.org/wiki/Heavy-tailed_distribution]
18
Spatial example
19
Spatial roughness…
20
Central tendency –
median, mean, mode
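A minimal Python example of the three measures, using made-up data with one large value so that the mean and median differ.

```python
import statistics

data = [2, 3, 3, 4, 5, 5, 5, 9, 18]  # made-up sample; 18 pulls the mean upward

mean = statistics.mean(data)      # 6: sum of values divided by their count
median = statistics.median(data)  # 5: middle value of the sorted data
mode = statistics.mode(data)      # 5: most frequent value
```

The mean is sensitive to the single large value; the median and mode are not.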
21
Significance Tests
• Confidence intervals allow you to accept or
reject hypotheses… (critical region) - two-tailed test.
– If the hypothesized value of the parameter lies
within the confidence interval with a 1-alpha level
of confidence, then the decision at an alpha level
of significance is to fail to reject the null
hypothesis, i.e. accept
– If the hypothesized value of the parameter lies
outside the confidence interval with a 1-alpha
level of confidence, then the decision at an alpha
level of significance is to reject the null
hypothesis.
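A hedged Python sketch of the confidence-interval decision rule above, for a hypothesized mean. The sample values are made up, and the interval uses the normal quantile as an approximation; for a sample this small a t quantile would give a somewhat wider interval.

```python
import math
from statistics import NormalDist, fmean, stdev

def confidence_interval(sample, alpha=0.05):
    """Two-sided (1 - alpha) interval for the mean, normal approximation."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    centre = fmean(sample)
    half_width = z * stdev(sample) / math.sqrt(len(sample))
    return centre - half_width, centre + half_width

def decide(sample, hypothesized_mean, alpha=0.05):
    """Fail to reject H0 if the hypothesized value lies inside the interval."""
    low, high = confidence_interval(sample, alpha)
    if low <= hypothesized_mean <= high:
        return "fail to reject H0"
    return "reject H0"

sample = [9.8, 10.2, 10.1, 9.9, 10.0, 10.3, 9.7, 10.1, 10.0, 9.9]  # made-up data

inside = decide(sample, 10.0)   # hypothesized value within the interval
outside = decide(sample, 10.5)  # hypothesized value outside the interval
```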
22
Variability in normal distributions
23
F-test
F = S1^2 / S2^2
where S1^2 and S2^2 are the
sample variances.
The more this ratio deviates
from 1, the stronger the
evidence for unequal
population variances.
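A minimal Python sketch of the variance ratio. The two samples are made up. Turning the ratio into a p-value needs the F distribution (available, for example, as scipy.stats.f), which is not computed here.

```python
import statistics

def f_ratio(sample1, sample2):
    """Ratio of the two sample variances (n - 1 denominators)."""
    return statistics.variance(sample1) / statistics.variance(sample2)

a = [4.0, 5.0, 6.0, 5.0, 5.0]  # made-up sample, tightly clustered
b = [1.0, 9.0, 5.0, 2.0, 8.0]  # made-up sample, widely spread

f_ab = f_ratio(a, b)  # 0.5 / 12.5 = 0.04, far below 1
f_ba = f_ratio(b, a)  # 12.5 / 0.5 = 25, far above 1
```

Either way the ratio is far from 1, which is the evidence for unequal variances that the slide describes.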
24
T-test
25
Note on Standard Error
• Versus standard deviation = SD (i.e. from the
mean)
• SE ~ SD/sqrt(sample size)
• So, as size increases SE << SD !! Big data
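A quick numeric sketch in Python, using the usual definition of the standard error of the mean, SE = SD / sqrt(n). The SD of 10 and the sample sizes are made-up values.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

sd = 10.0  # made-up standard deviation
se_by_n = {n: standard_error(sd, n) for n in (10, 100, 10_000, 1_000_000)}
# SE shrinks with the square root of n: 100x more data gives a 10x smaller SE
```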
26
Frequencies v. Probabilities
• Actual rate of occurrence in a sample or
population – frequency
• Expected or estimated likelihood of a value or
outcome – probability
• Coin toss – two outcomes (binomial)
– p=0.5 (of “heads”)
• Male/Female
• Which US State you live in
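A hedged Python simulation of the coin-toss case: the probability p is fixed at 0.5, while the observed frequency of heads depends on the sample. The function name, seed, and toss counts are illustrative assumptions.

```python
import random

def heads_frequency(n_tosses, p=0.5, seed=0):
    """Observed fraction of heads in n_tosses simulated tosses."""
    rng = random.Random(seed)
    heads = sum(rng.random() < p for _ in range(n_tosses))
    return heads / n_tosses

small = heads_frequency(10)       # can be well away from 0.5
large = heads_frequency(100_000)  # settles close to 0.5
```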
27
Ranges: z, Percentiles, Quartiles
• The standard score is obtained by subtracting
the mean and dividing the difference by the
standard deviation. The symbol is z, which is
why it's also called a z-score.
• Percentiles (100 regions)
– The kth percentile is the number which has k% of
the values below it. The data must be ranked.
• Quartiles (4 regions)
– The quartiles divide the data into 4 equal regions.
– Note: The 2nd quartile is the same as the median.
– Note: The 1st quartile is the 25th percentile, the 3rd
quartile is the 75th percentile.
28
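A minimal Python sketch of z-scores, quartiles, and percentiles on made-up data. It relies on statistics.quantiles with its default method; other software may interpolate slightly differently.

```python
import statistics

data = [2, 4, 4, 5, 7, 8, 9, 10, 12, 13, 15]  # made-up, already ranked

# z-score: subtract the mean, divide by the standard deviation
mean = statistics.fmean(data)
sd = statistics.stdev(data)
z_scores = [(x - mean) / sd for x in data]

# Quartiles: 3 cut points dividing the data into 4 regions
quartiles = statistics.quantiles(data, n=4)

# Percentiles: 99 cut points dividing the data into 100 regions
percentiles = statistics.quantiles(data, n=100)
p25, p50, p75 = percentiles[24], percentiles[49], percentiles[74]
```

Here the 2nd quartile equals the median, and the 25th and 75th percentiles equal the 1st and 3rd quartiles.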
Hypothesis
1. Write the original claim and identify whether it is
the null hypothesis or the alternative hypothesis.
2. Write the null and alternative hypothesis. Use the
alternative hypothesis to identify the type of test.
3. Write down all information from the problem.
4. Find the critical value using the tables.
5. Compute the test statistic.
6. Make a decision to reject or fail to reject the null
hypothesis. A picture showing the critical value and
test statistic may be useful.
7. Write the conclusion.
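The seven steps above can be sketched in Python for a one-sample, two-tailed z-test. Every number in the problem (claimed mean 100, sample mean 103, known sigma 10, n = 50, alpha = 0.05) is made up, and the critical value comes from the normal distribution function instead of printed tables.

```python
import math
from statistics import NormalDist

# Steps 1-2: the claim "mean = 100" contains equality, so it is H0.
# H1 is "mean != 100", which makes this a two-tailed test.
mu0 = 100.0

# Step 3: information from the (made-up) problem
sample_mean, sigma, n, alpha = 103.0, 10.0, 50, 0.05

# Step 4: critical value for a two-tailed test at level alpha
z_critical = NormalDist().inv_cdf(1 - alpha / 2)

# Step 5: test statistic
z_stat = (sample_mean - mu0) / (sigma / math.sqrt(n))

# Step 6: decision
reject_h0 = abs(z_stat) > z_critical

# Step 7: conclusion
if reject_h0:
    conclusion = "Reject H0: the data are inconsistent with a mean of 100."
else:
    conclusion = "Fail to reject H0: no evidence against a mean of 100."
```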
29
Hypothesis
• What are you exploring?
• Regular data analytics features ~ well defined
hypotheses
– Big Data messes that up
• E.g. Stock market performance / trends
versus unusual events (crash/ boom):
– Populations versus samples – which is which?
– Why?
• E.g. Election results are predictable from exit
polls
30
Null and Alternate Hypotheses
• H0 - null
• H1 – alternate
• If a given claim contains equality, or a
statement of no change from the given or
accepted condition, then it is the null
hypothesis, otherwise, if it represents change,
it is the alternative hypothesis.
• It never snows in Troy in January
• Students will attend their scheduled classes
31
P-value
• One common way to evaluate significance,
especially in R output
– approaches hypothesis testing in a different
manner. Instead of comparing z-scores or t-scores as in the classical approach, you're
comparing probabilities, or areas.
• The level of significance (alpha) is the area in
the critical region. That is, the area in the tails
to the right or left of the critical values.
32
P-value
• The p-value is the area to the right or left of
the test statistic.
– If it is a two tail test, then look up the probability in
one tail and double it.
• If the test statistic is in the critical region, then
the p-value will be less than the level of
significance.
– It does not matter whether it is a left tail, right tail,
or two tail test. This rule always holds.
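A hedged Python sketch of the p-value as a tail area, assuming a z test statistic and the standard normal distribution. The function name, the tail labels, and the example statistics are illustrative assumptions.

```python
from statistics import NormalDist

def p_value(z_stat, tail="two"):
    """Area beyond the test statistic under the standard normal curve."""
    std_normal = NormalDist()
    if tail == "left":
        return std_normal.cdf(z_stat)
    if tail == "right":
        return 1 - std_normal.cdf(z_stat)
    # Two-tailed: take the area in one tail and double it
    return 2 * (1 - std_normal.cdf(abs(z_stat)))

alpha = 0.05
z_critical = NormalDist().inv_cdf(1 - alpha / 2)  # two-tailed critical value

p_in_region = p_value(2.12)  # |z| > z_critical, so p is below alpha
p_outside = p_value(1.00)    # |z| < z_critical, so p is above alpha
```

Comparing p with alpha gives the same decision as comparing the test statistic with the critical value.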
33
Accept or Reject?
• Reject the null hypothesis if the p-value
is less than the level of significance.
• You will fail to reject the null hypothesis if
the p-value is greater than or equal to the
level of significance.
• Typical significance 0.05 (!)
34
Probability Density
35
Cumulative…
36
Pause…
37
GNU R
• http://lib.stat.cmu.edu/R/CRAN/ - load this first
• http://cran.r-project.org/doc/manuals/
• http://cran.r-project.org/doc/manuals/R-lang.html
• R Studio – (see R-intro.html too)
https://www.rstudio.com/products/rstudio/
(desktop version)
• Manuals - Libraries – at the command line –
library(), or select the packages tab, and
check/ uncheck as needed
38
Files
• http://aquarius.tw.rpi.edu/html/DA/
• This is where the files for assignments and
exercises will be placed – data and code
fragments…
39
Exercises – getting data in
• Rstudio
– read in csv file (two ways to do this) GPW3_GRUMP_SummaryInformation_2010.csv
– Read in excel file (directly or by csv convert) 2010EPI_data.xls (2010EPI_data tab)
– See if you can plot some variables
– Anything in common between them?
• Enter
> data()
> help(data)
40
If time or for fun…
• se_eqs.xls
– Plot it
– Fit it
• PRESSURE.xls
– Plot it
– Smooth it
– Fit it …
41
No further reading this week
• Complete the installs as best you can
42