Data and Information Resources,
Role of Hypothesis, Synthesis
and Model Choices
Peter Fox
Data Analytics – ITWS-4963/ITWS-6965
Week 2a, January 28, 2014, SAGE 3101
1
Admin info (keep/ print this slide)
• Class: ITWS-4963/ITWS-6965
• Hours: 12:00pm-1:50pm Tuesday/ Friday
• Location: SAGE 3101
• Instructor: Peter Fox
• Instructor contact: [email protected], 518.276.4862 (do not
leave a msg)
• Contact hours: Monday** 3:00-4:00pm (or by email appt)
• Contact location: Winslow 2120 (sometimes Lally 207A
announced by email)
• TA: Lakshmi Chenicheri [email protected]
• Web site: http://tw.rpi.edu/web/courses/DataAnalytics/2014
– Schedule, lectures, syllabus, reading, assignments, etc.
2
Contents
• Back to the data
sources
– Cyber
– Human
• “Munging”
• Beginning with
hypothesis -> synthesis
• Distributions…
• Scoping out analysis
and model choices
3
Lower layers in the Analytics Stack
4
“Cyber Data” …
5
“Human Data” …
6
Descriptive / Inferential
• Descriptive statistics: numerical summaries of samples
– i.e., what was observed, distributions
– The ‘sample’ may be exhaustive, i.e., identical to the population
• Inferential statistics: from samples to populations
– i.e., what could have been or will be observed in a larger population
• Descriptive (report) to Inferential (model suggestion) is a key
process in analytics
• So often NOT a linear process.
• Sample bias – choice and awareness
Adapted from Marshall Ma (and other sources)
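A minimal sketch of the descriptive-to-inferential step, using only the Python standard library. The sample values, the function names and the use of a normal (z = 1.96) multiplier are assumptions made for illustration; they are not course data or a prescribed method.

```python
import math
import statistics

# Hypothetical sample of 10 measurements (illustrative values only).
sample = [52.1, 48.3, 61.0, 55.4, 49.9, 58.2, 47.5, 53.8, 60.3, 50.5]

def describe(values):
    """Descriptive step: summarize what was actually observed."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

def mean_confidence_interval(values, z=1.96):
    """Inferential step: approximate 95% interval for the population mean.

    Assumption: normal approximation. For a sample this small a Student-t
    multiplier would give a somewhat wider interval.
    """
    m = statistics.mean(values)
    se = statistics.stdev(values) / math.sqrt(len(values))
    return (m - z * se, m + z * se)

summary = describe(sample)
low, high = mean_confidence_interval(sample)
print(summary)
print(f"approx. 95% interval for the population mean: ({low:.2f}, {high:.2f})")
```

The first function only reports on the sample; the second makes a statement about a larger population and therefore depends on how the sample was chosen.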
7
Populations and samples
• A population is defined
– We must be able to say, for every object, if it is in the population or
not
– We must be able, in principle, to find every individual of the
population
A geographic example of a population is all pixels in a multi-spectral
satellite image
• A sample is a subset of a population
– We must be able to say, for every object in the population, if it is in
the sample or not
– Sampling is the process of selecting a sample from a population
• E.g. 2010EPI_data.xls (EPI2010_all countries or
EPI2010_onlyEPIcountries tabs)
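The two membership requirements above can be sketched as follows. The 20 x 20 "image", the sample size and the seed are hypothetical, chosen only to echo the pixel example; this is one simple sampling scheme (simple random sampling without replacement), not the only one.

```python
import random

# Hypothetical population: every pixel (row, col) of a tiny 20 x 20 image.
population = [(row, col) for row in range(20) for col in range(20)]
population_set = set(population)

rng = random.Random(42)                 # fixed seed so the draw is repeatable
sample = rng.sample(population, k=40)   # simple random sample, no replacement
sample_set = set(sample)

def in_population(obj):
    """For every object we can say whether it belongs to the population."""
    return obj in population_set

def in_sample(obj):
    """For every object in the population we can say whether it was sampled."""
    return obj in sample_set

print(len(population), "pixels in the population,", len(sample), "in the sample")
print("pixel (25, 0) in population?", in_population((25, 0)))
```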
8
Election prediction
• Exit polls versus election results
– Human versus cyber
• How is the “population” defined here?
• What is the sample, how chosen?
– What is described and how is that used to
predict?
– Are results categorized? (where from, M/F, age)
• What is the uncertainty?
– It is reflected in the “sample distribution”
– And controlled/constrained by “sampling theory”
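As a rough illustration of how sampling theory constrains the uncertainty of a poll, the sketch below simulates repeated exit polls and compares them with the usual normal-approximation margin of error for a proportion. The true vote share (0.52), the poll size and the number of simulated polls are hypothetical, and the simulation assumes an unbiased simple random sample, which real exit polls are not guaranteed to be.

```python
import math
import random

def margin_of_error(p_hat, n, z=1.96):
    """Approximate 95% margin of error for a sample proportion."""
    return z * math.sqrt(p_hat * (1.0 - p_hat) / n)

def simulate_exit_poll(true_share, n, rng):
    """Ask n randomly chosen voters; return the observed share for a candidate."""
    votes = sum(1 for _ in range(n) if rng.random() < true_share)
    return votes / n

rng = random.Random(2014)
true_share = 0.52   # hypothetical population value
n_voters = 1000     # hypothetical poll size
polls = [simulate_exit_poll(true_share, n_voters, rng) for _ in range(200)]

# How many simulated polls land within their own margin of error of the truth?
covered = sum(
    1 for p in polls if abs(p - true_share) <= margin_of_error(p, n_voters)
)
moe = margin_of_error(0.5, n_voters)   # worst case, p = 0.5

print(f"worst-case margin of error for n={n_voters}: {moe:.3f}")
print(f"{covered} of {len(polls)} simulated polls cover the true share")
```

The spread of the 200 simulated results is the sample distribution the slide refers to; quadrupling the poll size halves the margin of error.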
9
Bias difference: between
cyber and human data
• Election results and exit polls
– What are examples of bias in election results?
– In exit polls?
10
Hypothesis
• What are you exploring?
• Regular data analytics features ~ well defined
hypotheses
– Big Data messes that up
• E.g. Stock market performance / trends
versus unusual events (crash/ boom):
– Populations versus samples – which is which?
– Why?
• E.g. Election results are predictable from exit
polls
11
Distributions
• http://www.quantitativeskills.com/sisa/rojo/alldist.zip
• Shape
• Character
• Parameter(s)
12
Plotting these distributions
• Histograms and binning
• Getting used to log scales
• Going beyond 2-D
• More of this on Friday (in more detail)
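A small sketch of why binning and log scales matter, in plain Python. The helper names and the data (values spanning three orders of magnitude) are illustrative assumptions; the same values are binned with equal-width and with log-spaced edges.

```python
import math

def histogram(values, edges):
    """Count values in half-open bins [edges[i], edges[i+1]); last bin is closed."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        if v < edges[0] or v > edges[-1]:
            continue  # outside the binned range
        for i in range(len(counts)):
            is_last = i == len(counts) - 1
            if edges[i] <= v < edges[i + 1] or (is_last and v == edges[-1]):
                counts[i] += 1
                break
    return counts

def linear_edges(lo, hi, bins):
    """Bin edges equally spaced on a linear scale."""
    step = (hi - lo) / bins
    return [lo + i * step for i in range(bins + 1)]

def log_edges(lo, hi, bins):
    """Bin edges equally spaced in log10; requires lo > 0."""
    log_lo, log_hi = math.log10(lo), math.log10(hi)
    step = (log_hi - log_lo) / bins
    return [10 ** (log_lo + i * step) for i in range(bins + 1)]

# Illustrative values spanning three orders of magnitude.
values = [1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377, 610, 987]

lin_counts = histogram(values, linear_edges(1, 1000, 3))
log_counts = histogram(values, log_edges(1, 1000, 3))
print("linear bins:", lin_counts)
print("log bins:   ", log_counts)
```

With linear bins almost everything piles into the first bin; with log bins the same data spread evenly, which is the kind of structure a log axis reveals.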
13
In applications
• Scipy: http://docs.scipy.org/doc/scipy/reference/stats.html
• R: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/Distributions.html
• Matlab: http://www.mathworks.com/help/stats/_brn2irf.html
• Excel: HAH!
14
Heavy-tail distributions
• are probability distributions whose tails are
not exponentially bounded
• Common – long-tail… human v. cyber…
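To make "not exponentially bounded" concrete, the sketch below compares the tail probability P(X > x) of a Pareto distribution (a standard heavy-tailed example) with that of an exponential distribution. The parameter values are arbitrary choices for illustration, and a handful of points only illustrates the behaviour; it does not prove it.

```python
import math

def exponential_survival(x, rate=1.0):
    """P(X > x) for an exponential distribution (light-tailed reference)."""
    return math.exp(-rate * x)

def pareto_survival(x, x_min=1.0, alpha=2.0):
    """P(X > x) for a Pareto distribution with scale x_min and shape alpha."""
    return 1.0 if x < x_min else (x_min / x) ** alpha

def tail_ratio(x, rate):
    """Pareto tail divided by an exponential tail with the given rate."""
    return pareto_survival(x) / exponential_survival(x, rate)

xs = (50, 100, 200, 400)
# Even against a slowly decaying exponential (rate 0.1) the ratio keeps growing,
# i.e. the exponential eventually falls below the Pareto tail and stays there.
ratios = [tail_ratio(x, rate=0.1) for x in xs]
for x, r in zip(xs, ratios):
    print(f"x={x:>4}: Pareto tail / exponential tail = {r:.3g}")
```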
[Figure: long-tail distribution, labelled “Few that dominate”, “More that add up”, “Equal areas”]
http://en.wikipedia.org/wiki/Heavy-tailed_distribution
15
Spatial example
16
Spatial roughness…
17
Compare median, mean, mode
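A quick way to see how the three measures diverge is to compute them on a skewed sample. The values below are made up for illustration (one large value in an otherwise tight cluster).

```python
import statistics

# Hypothetical right-skewed sample: one extreme value among small ones.
values = [20, 22, 22, 25, 27, 30, 31, 35, 40, 250]

mean = statistics.mean(values)
median = statistics.median(values)
mode = statistics.mode(values)

print(f"mode={mode}, median={median}, mean={mean}")
```

For this sample the ordering is mode < median < mean: the single extreme value pulls the mean well above where most of the data sit, while the median barely moves.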
18
Huh, we have Big Data?
• Why would we care about samples?
– Let’s take it all?
• It gets messy == quality, gaps, …
• Very often goes beyond known patterns, i.e.
out of the range of previous values
– Anyone remember the financial crisis in 2008?
• Data becomes more subjective than objective
and especially human v. cyber..
• To start: let’s take a look at EPI data that you
started to explore last week (cyber)
19
Munging
• Missing values, null values, etc.
• E.g. in EPI_data – they use “--”
– Most data applications provide built-ins for these higher-order
functions – in R “NA” is used and functions such as
is.na(var), etc. provide powerful filtering options (we’ll
cover these on Friday)
• Of course, different variables often are missing
“different” values
• In R – higher-order functions such as: Reduce,
Filter, Map, Find, Position and Negate will become
your enemies and then friends:
http://www.johnmyleswhite.com/notebook/2010/09/23/higher-order-functions-in-r/
20
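The same munging pattern can be sketched in Python, whose built-in map, filter and functools.reduce play roughly the role of R's Map, Filter and Reduce. Only the “--” marker comes from the EPI data as described above; the other markers, the column values and the helper name are assumptions for illustration.

```python
from functools import reduce

# "--" is the marker mentioned for EPI_data; "" and "NA" are added as assumptions.
MISSING_MARKERS = {"--", "", "NA"}

def to_number(cell):
    """Return a float, or None when the cell holds a missing-value marker."""
    text = str(cell).strip()
    if text in MISSING_MARKERS:
        return None
    return float(text)

# Hypothetical raw column as it might be read from a spreadsheet.
raw_column = ["58.4", "--", "63.1", "", "47.9", "--", "71.2"]

parsed = list(map(to_number, raw_column))                  # like R's Map
present = list(filter(lambda v: v is not None, parsed))    # like Filter / !is.na
total = reduce(lambda acc, v: acc + v, present, 0.0)       # like Reduce
first_missing = next(                                      # like Position
    (i for i, v in enumerate(parsed) if v is None), None
)

print("parsed:", parsed)
print("mean of present values:", total / len(present))
print("index of first missing value:", first_missing)
```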
21
22
23
24
Patterns and Relationships
• Stepping from elementary/ distribution
analysis to algorithmic-based analysis
• I.e. pattern detection via data mining:
classification, clustering, rules; machine
learning; support vector machines, non-parametric models
• Relations – associations between/among
populations
• Outcome: model and an evaluation of its
fitness for purpose
25
More munging
• Bad values, outliers, corrupted entries,
thresholds …
• Noise reduction – low-pass filtering, binning
• A few examples today but the labs will bring
this into view soon
• REMEMBER: when you munge you MUST
record what you did (and why) and save
copies of pre- and post- operations…
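One possible shape for these steps, as a sketch: flag values outside a threshold, apply a simple moving average as a low-pass filter, and keep both the untouched input and a log of what was done and why. The data, the valid range of 0 to 100 and the log format are all hypothetical.

```python
def flag_out_of_range(values, low, high):
    """Return the indices of values outside the closed range [low, high]."""
    return [i for i, v in enumerate(values) if not (low <= v <= high)]

def moving_average(values, window=3):
    """Simple low-pass filter: centred moving average, shorter window at the edges."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - half): i + half + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed

# Hypothetical series; 250.0 stands in for a corrupted entry.
raw = [10.0, 11.0, 9.0, 250.0, 10.0, 12.0, 11.0]
original = list(raw)      # pre-operation copy, never modified
munging_log = []          # record of every operation and its reason

bad = flag_out_of_range(raw, low=0.0, high=100.0)
cleaned = [v for i, v in enumerate(raw) if i not in bad]
munging_log.append({
    "step": "drop out-of-range values",
    "why": "valid range assumed to be 0-100",
    "dropped_indices": bad,
})

smoothed = moving_average(cleaned, window=3)
munging_log.append({
    "step": "moving average, window=3",
    "why": "noise reduction before plotting",
})

print("cleaned: ", cleaned)
print("smoothed:", smoothed)
for entry in munging_log:
    print(entry)
```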
26
27
28
Populations within populations
• In the EPI example:
– Geographic regions (GEO_subregion)
– EPI_regions
– Eco-regions (EDC v. LEDC – know what that is?)
– Primary industry(ies)
– Climate region
• What would you do to start exploring?
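One way to start is to split the population by one of these groupings and summarize each sub-population separately. In the sketch below only the column name EPI_regions comes from the list above; the country labels, region names, the "EPI" score column and all values are hypothetical.

```python
from collections import defaultdict
import statistics

# Hypothetical rows; None marks a missing score.
records = [
    {"country": "Country A", "EPI_regions": "Europe", "EPI": 70.0},
    {"country": "Country B", "EPI_regions": "Europe", "EPI": 66.0},
    {"country": "Country C", "EPI_regions": "Sub-Saharan Africa", "EPI": 45.0},
    {"country": "Country D", "EPI_regions": "Sub-Saharan Africa", "EPI": 51.0},
    {"country": "Country E", "EPI_regions": "Sub-Saharan Africa", "EPI": None},
    {"country": "Country F", "EPI_regions": "East Asia and the Pacific", "EPI": 58.0},
]

def summarize_by(rows, group_key, value_key):
    """Group rows by group_key and summarize the non-missing values per group."""
    groups = defaultdict(list)
    for row in rows:
        value = row[value_key]
        if value is not None:
            groups[row[group_key]].append(value)
    return {
        name: {
            "n": len(vals),
            "mean": statistics.mean(vals),
            "min": min(vals),
            "max": max(vals),
        }
        for name, vals in groups.items()
    }

by_region = summarize_by(records, "EPI_regions", "EPI")
for region, stats in by_region.items():
    print(region, stats)
```

Swapping the group key (for example to GEO_subregion, if that column is present) gives a different partition of the same population.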
29
30
Or, a twist – n=1 but many attributes?
The item of interest in relation to its attributes
31
Summary: explore
• Going from preliminary to initial analysis…
• Determining if there is one or more common
distributions involved – i.e. parametric
statistics (assumes or asserts a probability
distribution)
• Fitting that distribution
• Or NOT
– A hybrid or
– Non-parametric (statistics) approaches are
needed – more on this to come
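The contrast between the two routes can be sketched with the standard library: a parametric estimate assumes a distribution (here, normal) and fits its parameters, while a non-parametric estimate reads the answer off the data directly. The sample is made up and deliberately skewed; the choice of the 90th percentile as the quantity of interest is also just an illustration.

```python
import statistics
from statistics import NormalDist

# Hypothetical right-skewed sample (illustrative values only).
values = [1.2, 1.5, 1.7, 2.0, 2.1, 2.3, 2.4, 2.8, 3.0, 3.1,
          3.3, 3.6, 4.0, 4.4, 5.0, 5.9, 7.2, 9.5, 14.0, 30.0]

# Parametric route: assert a normal distribution and fit mu and sigma.
fitted = NormalDist.from_samples(values)
parametric_p90 = fitted.inv_cdf(0.90)

# Non-parametric route: empirical 90th percentile (last of nine decile cut points).
empirical_p90 = statistics.quantiles(values, n=10)[-1]

print(f"fitted normal: mu={fitted.mean:.2f}, sigma={fitted.stdev:.2f}")
print(f"90th percentile, parametric:     {parametric_p90:.2f}")
print(f"90th percentile, non-parametric: {empirical_p90:.2f}")
```

When the two answers disagree noticeably, as they can for skewed data, that is a hint that the asserted distribution may be a poor fit.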
32
Models
• Assumptions are often used when
considering models, e.g. as being
representative of the population – since they
are so often derived from a sample – this
should be starting to make sense (a bit)
• Two key topics:
– N=all and the open world assumption
– Model of the thing of interest versus model of the
data (data model; structural form)
• “All models are wrong but some are useful”
(generally attributed to the statistician George Box)
33
Conceptual, logical and
physical models
Applied to a database:
However our
models will be
mathematical,
statistical, or a
combination.
The concept of the
model comes from
the hypothesis
The implementation
of the physical
model comes from
the data ;-)
34
Art or science?
• The form of the model, incorporating the
hypothesis, determines a “form”
• Thus, as much art as science because it
depends both on your world view and what
the data is telling you (or not)
• We will, however, be giving the models nice
mathematical properties; orthogonal/
orthonormal basis functions, etc…
35
Goodness of fit
• And, we cannot take the models at face
value, we must assess how fit they may be:
– Chi-Square
– One-sided and two-sided Kolmogorov-Smirnov
tests
– Lilliefors tests
– Ansari-Bradley tests
– Jarque-Bera tests
• Just a preview…
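As a preview of one item on this list, here is a hand-rolled sketch of the two-sided one-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical CDF of the data and a reference CDF. The helper names and both test samples are assumptions for illustration. The 1.36 / sqrt(n) cut-off is the usual large-sample approximation at the 5% level and applies only when the reference distribution is fully specified in advance; when its parameters are estimated from the same data, that is the situation the Lilliefors test addresses.

```python
import math
from statistics import NormalDist

def ks_statistic(values, cdf):
    """Largest distance between the empirical CDF of values and a reference cdf."""
    xs = sorted(values)
    n = len(xs)
    # Empirical CDF jumps from i/n to (i+1)/n at xs[i]; check both sides of the jump.
    d_plus = max((i + 1) / n - cdf(x) for i, x in enumerate(xs))
    d_minus = max(cdf(x) - i / n for i, x in enumerate(xs))
    return max(d_plus, d_minus)

def ks_critical_value_5pct(n):
    """Large-sample approximation of the two-sided 5% critical value."""
    return 1.36 / math.sqrt(n)

standard_normal = NormalDist(0.0, 1.0)
n = 50

# Sample 1: evenly spaced quantiles of the standard normal (an idealized good fit).
good_fit = [standard_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
# Sample 2: evenly spaced values on (0, 1), i.e. clearly not standard normal.
poor_fit = [(i + 0.5) / n for i in range(n)]

d_good = ks_statistic(good_fit, standard_normal.cdf)
d_poor = ks_statistic(poor_fit, standard_normal.cdf)
critical = ks_critical_value_5pct(n)

print(f"D (good fit) = {d_good:.3f}, D (poor fit) = {d_poor:.3f}, "
      f"5% critical value = {critical:.3f}")
```

A statistic above the critical value is evidence against the reference distribution; a statistic below it means only that the test found no evidence of misfit.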
36
Summary
• Cyber and Human data; quality, uncertainty and
bias
• Distributions – the common and not-so common
ones and how cyber and human data can have
distinct distributions
• How simple statistical distributions can mislead
us
• Populations and samples and how inferential
statistics will lead us to model choices (no, we
have not actually done that yet in detail)
• Big Data and some consequences
• Munging toward exploratory analysis
• Toward models!
37
Tentative assignments
• Assignment 2: Datasets and data infrastructures – lab
assignment. Held in week 3 (Feb. 7) 10% (lab; individual);
• Assignment 3: Preliminary and Statistical Analysis. Due ~
week 4. 15% (15% written and 0% oral; individual);
• Assignment 4: Pattern, trend, relations: model development
and evaluation. Due ~ week 5. 15% (10% written and 5%
oral; individual);
• Assignment 5: Term project proposal. Due ~ week 6. 5%
(0% written and 5% oral; individual);
• Assignment 6: Predictive and Prescriptive Analytics. Due ~
week 8. 15% (15% written and 5% oral; individual);
• Term project. Due ~ week 13. 30% (25% written, 5% oral;
individual).
38
How are the software installs going?
• R/Scipy (et al)/Matlab
• Data infrastructure
• Exercises?
• More on Friday…
39
Assignment 1 – how is it going?
• Choose a DA case study from a) readings, or
b) your choice (must be approved by me)
• Read it and provide a short written review/
critique (business case, area of application,
approach/ methods, tools used, results,
actions, benefits).
• Be prepared to discuss it in class this Friday
31st. Hand in the written report by 5pm that
day.
40