Exploratory Data Analysis
• Set of techniques
• The flexibility to respond to the patterns
revealed by successive iterations in the
discovery process is an important
attribute
• Free to take many paths in revealing
mysteries in the data
• Emphasizes visual representations and
graphical techniques over summary statistics
EDA
• Summary statistics may obscure or conceal
the underlying structure of the data
• When numerical summaries are used
exclusively and accepted without visual
inspection, the selection of confirmatory
modes may be based on flawed assumptions
and may produce erroneous conclusions
Previously Discussed Techniques
for Displaying Data
• Frequency Tables
• Bar Charts (Histograms)
• Pie Charts
• Stem and Leaf Displays
• Boxplots
Resistant Statistics
• Example: data set = [5,6,6,7,7,7,8,8,9]
• The mean is 7 and the sample standard deviation is 1.22
• Replace the 9 with 90 and the mean becomes 16
and the standard deviation 27.77.
• Changing only one of the nine values has
disturbed the location and spread summaries to the
point where they no longer represent the other
eight values. Both mean and standard deviation
are considered nonresistant statistics
• The median remained at 7 and the lower and upper
quartiles stayed at 6 and 8, respectively.
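The comparison on this slide can be checked with a short sketch. It uses only Python's standard `statistics` module; the helper name `summarize` is made up for illustration, and the quartiles come from that module's default method (other quartile conventions can give slightly different values on other data sets).

```python
import statistics

original = [5, 6, 6, 7, 7, 7, 8, 8, 9]
altered = original[:-1] + [90]  # replace the 9 with 90

def summarize(data):
    """Return nonresistant (mean, stdev) and resistant (median, quartiles) summaries."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartile cut points, default method
    return {
        "mean": statistics.mean(data),
        "stdev": statistics.stdev(data),  # sample standard deviation (n - 1 divisor)
        "median": statistics.median(data),
        "q1": q1,
        "q3": q3,
    }

before = summarize(original)
after = summarize(altered)
for name in before:
    print(f"{name:>6}: {before[name]:8.2f} -> {after[name]:8.2f}")
```

The mean and standard deviation move sharply after the single change, while the median and quartiles print the same value on both sides of the arrow.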
Visual Techniques of EDA
• Gain insight into the data
• More common ways of summarizing
location, spread, and shape
• Used resistant statistics
• From these we could make decisions on test
selection and whether the data should be
transformed or reexpressed before further
analysis
More Techniques
• The last section focused primarily on single-variable distributions
• Now inspect relationships between and among
variables
Crosstabulation
• Technique for comparing two classification
variables
• Uses tables having rows and columns that
correspond to the levels or values of each
variable's categories
Example of a Crosstabulation

                        Overseas Assignment
Gender                  YES       NO     Row Total
Male      Count          22       40        62
          Row %        35.5     64.5
          Col %        78.6     55.6
          Tot %        22.0     40.0      62.0
Female    Count           6       32        38
          Row %        15.8     84.2
          Col %        21.4     44.4
          Tot %         6.0     32.0      38.0
Column    Count          28       72       100
Total     Tot %        28.0     72.0     100.0
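As a rough sketch of how the row, column, and total percentages in this example are derived, the following recomputes them from the four cell counts (22, 40, 6, 32). The variable names and the `pct` helper are illustrative, not part of any statistics package.

```python
# Cell counts from the 100-person example: gender (rows) by overseas assignment (columns).
counts = {
    "Male": {"YES": 22, "NO": 40},
    "Female": {"YES": 6, "NO": 32},
}

row_totals = {gender: sum(cells.values()) for gender, cells in counts.items()}
col_totals = {}
for cells in counts.values():
    for answer, n in cells.items():
        col_totals[answer] = col_totals.get(answer, 0) + n
grand_total = sum(row_totals.values())

def pct(part, whole):
    """Percentage of `whole` represented by `part`, rounded to one decimal."""
    return round(100 * part / whole, 1)

table = {
    (gender, answer): {
        "count": n,
        "row_pct": pct(n, row_totals[gender]),  # base = all people of that gender
        "col_pct": pct(n, col_totals[answer]),  # base = all people with that answer
        "tot_pct": pct(n, grand_total),         # base = the whole sample
    }
    for gender, cells in counts.items()
    for answer, n in cells.items()
}

for key, cell in table.items():
    print(key, cell)
```

Each cell carries three percentages because there are three possible bases; which one to report depends on the comparison being made.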
The Use of Percentages
• simplify the data by reducing all numbers to
a range from 0 to 100
• translate the data into standard form, with a
base of 100, for relative comparisons
– A raw count (28 overseas assignments) has little value unless we know it is from a
sample of 100 (28%)
– While this is useful, it is even more useful when the
research calls for a comparison of several distributions
of the data
Comparison of Crosstabulations

                        Overseas Assignment
Gender                  YES       NO     Row Total
Male      Count         225      675       900
          Row %        25.0     75.0
          Col %        62.5     59.2
          Tot %        15.0     45.0      60.0
Female    Count         135      465       600
          Row %        22.5     77.5
          Col %        37.5     40.8
          Tot %         9.0     31.0      40.0
Column    Count         360     1140      1500
Total     Tot %        24.0     76.0     100.0
Use of Percentages
• Comparing the present sample (100) and the
previous sample (1500), we can view the
relative relationships and shifts in the data.
• In comparing two-dimensional tables, the
selection of either the row or the column
percentages will accentuate a particular distribution or
comparison. (Note that our last tables presented both
row and column percentages.)
Presenting Percentages
• When one variable is hypothesized to be the
presumed cause, that is, it is thought to affect or
predict a response, label it the independent
variable and compute percentages in the
direction of this variable
• In which direction should the percentages in the last example(s),
gender by overseas assignment, run?
Independent Variable
• Row: the implication is that gender
influences selection for overseas
assignments
• If you said column, you are implying that
assignment status has some effect on
gender, and this is implausible!
• Note that you can do the calculations, but
they may not make sense!
Other Guidelines for Percentages
• Averaging percentages: Percentages cannot
be averaged unless each is weighted by the
size of the group from which it is derived
(weighted average).
• Use of too-large percentages: A large
percentage is difficult to understand. If there is a
1,000% increase, it is better to state it as a
tenfold increase.
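A small sketch of the weighted-average rule, using the share of men with an overseas assignment in the two samples shown earlier (22 of 62 and 225 of 900). The variable names are illustrative.

```python
# (percentage, group size) pairs: men with an overseas assignment
# in the 100-person sample (22 of 62) and the 1,500-person sample (225 of 900).
groups = [(100 * 22 / 62, 62), (100 * 225 / 900, 900)]

# Naive average treats both groups as equally large.
simple_average = sum(p for p, _ in groups) / len(groups)

# Weighted average multiplies each percentage by the size of its group.
weighted_average = sum(p * n for p, n in groups) / sum(n for _, n in groups)

print(f"simple:   {simple_average:.1f}%")
print(f"weighted: {weighted_average:.1f}%")
```

The weighted figure equals the pooled proportion (247 of 962 men), which is why it is the defensible average; the simple average overstates the influence of the small group.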
Other Guidelines for Percentages
• Using too small a base: Percentages hide
the base from which they have been
computed
• Percentage decreases can never exceed 100
percent. The higher figure should always be
used as the base.
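Both guidelines on this slide can be illustrated with two tiny helpers. The function names and the sample figures are hypothetical, chosen only to show the arithmetic.

```python
def percent_decrease(higher, lower):
    """Percentage decrease from `higher` to `lower`, with the higher figure as base."""
    if not 0 <= lower <= higher or higher == 0:
        raise ValueError("expected 0 <= lower <= higher and higher > 0")
    return 100 * (higher - lower) / higher

def describe_share(part, base):
    """Report a percentage together with its base so a small base stays visible."""
    return f"{100 * part / base:.0f}% (n = {base})"

print(percent_decrease(200, 50))   # a drop from 200 to 50
print(describe_share(1, 2))        # same percentage as the next line, tiny base
print(describe_share(500, 1000))
```

With the higher figure as base and non-negative values, the decrease is bounded by 100 percent; printing the base next to the percentage keeps a two-case result from looking as solid as a thousand-case one.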
Other Table-Based Analysis
• Recognition of a meaningful relationship between
variables generally signals a need for further
investigation.
• Even if one finds a statistically significant
relationship, the questions of why and under what
conditions remain.
• Normally introduce a control variable
• Statistical packages can handle complex tables
Control and Nested Variables

                     Control Variable
          Category 1                Category 2
          Nested Variable           Nested Variable
labels    Cat 1  Cat 2  Cat 3       Cat 1  Cat 2  Cat 3
          Cells ...
Data Mining
• Describes the concept of discovering knowledge
from databases
• the idea behind it is the process of identifying
valid, novel, useful, and ultimately understandable
patterns in data
• Provides two unique capabilities to the researcher
– pattern discovery
– prediction of trends and behaviors
Data-Mining Process
(Flow diagram; stages as labeled on the slide)
• Investigative question
• Sampling: yes/no
• Data visualization
• Clustering, factor, correspondence analysis; variable selection, creation; data transformation
• Neural networks; tree-based models; classification models; other statistical models
• Model assessment
Sampling Yes/No
• Use the entire set or a sample of the data
• if fast turnaround is more important than
absolute accuracy, sampling may be
appropriate
• Sample if the data set is large (terabytes)
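The sample-or-not decision can be sketched as a simple size check followed by a random draw. The function name `maybe_sample`, the row limit, and the stand-in data are assumptions for illustration; a real project would choose the limit from its turnaround and accuracy needs.

```python
import random

def maybe_sample(records, max_rows, seed=0):
    """Return all records if the set is small enough, else a simple random sample."""
    if len(records) <= max_rows:
        return list(records)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return rng.sample(records, max_rows)  # sampling without replacement

records = list(range(1_000_000))  # stand-in for a large data set
subset = maybe_sample(records, max_rows=10_000)
print(len(subset))
```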
Modify
• Based on discoveries, data may require
modification
– Clustering, factor, correspondence analysis
– Variable selection, creation
– Data transformation
Factor Analysis
• General term for several specific
computational techniques
• All have the objective of reducing to a
manageable number many variables that
belong together and have overlapping
measurement characteristics
Factor Analysis Method
• Begins with construction of a new set of
variables based on the relationships in the
correlation matrix
• Can be done in a variety of ways
• The most popular is principal components
analysis.
Principal Components Analysis
• Transforms a set of variables into a new set
that are not correlated with each other.
• These linear combinations of variables,
called factors, account for the variance in
the data as a whole.
• Each successive factor is the best linear
combination of variables for explaining the
variance not accounted for by previous factors
Principal Components Analysis
• Process continues until all the variance is
accounted for

Extracted components   % of variance accounted for   Cumulative variance
Component 1            63%                           63%
Component 2            29                            92
Component 3            8                             100
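To make the idea concrete, here is a minimal sketch for the special case of two variables, where the principal components of the correlation matrix have a closed form: for a matrix [[1, r], [r, 1]] the eigenvalues are 1 + r and 1 - r, and the components are the scaled sum and difference of the standardized variables. The data are hypothetical and the general case with more variables needs a numerical eigen-decomposition, which is not shown.

```python
import math
import statistics

# Hypothetical scores on two overlapping measures (not from the slides).
x = [2.0, 4.0, 5.0, 7.0, 9.0, 10.0]
y = [1.0, 3.0, 6.0, 6.0, 8.0, 11.0]

def standardize(values):
    """Convert values to z-scores using the sample standard deviation."""
    m, s = statistics.mean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

def correlation(a, b):
    """Pearson correlation of two equal-length lists."""
    za, zb = standardize(a), standardize(b)
    return sum(p * q for p, q in zip(za, zb)) / (len(a) - 1)

zx, zy = standardize(x), standardize(y)
r = correlation(x, y)  # positive for this data, so the "sum" component comes first

# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]].
eigenvalues = sorted([1 + r, 1 - r], reverse=True)
explained = [100 * ev / sum(eigenvalues) for ev in eigenvalues]
cumulative = [sum(explained[: i + 1]) for i in range(len(explained))]

# Component scores: projections on the eigenvectors (1, 1)/sqrt(2) and (1, -1)/sqrt(2).
pc1 = [(a + b) / math.sqrt(2) for a, b in zip(zx, zy)]
pc2 = [(a - b) / math.sqrt(2) for a, b in zip(zx, zy)]

print("percent of variance:", [round(e, 1) for e in explained])
print("cumulative:", [round(c, 1) for c in cumulative])
print("correlation between components:", round(correlation(pc1, pc2), 6))
```

The two new variables are uncorrelated with each other, the first accounts for the larger share of variance, and the cumulative column reaches 100 percent once every component is extracted.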
Cluster Analysis
• Set of techniques for grouping similar
objects, unlike the techniques for analyzing the
relationships between variables
• Clustering starts with an undifferentiated group
• Different from discriminant analysis, where
you search for a set of variables to separate
known groups
Cluster Analysis Method
• Select the sample (employees, buyers)
• Definition of the variables on which to measure
the objects
• Computation of similarities among entities
through correlation, Euclidean distances, and other
techniques
• Selection of mutually exclusive clusters
(maximization of within-cluster similarity and
between-cluster differences)
• Cluster comparison and validation
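The steps above can be sketched with one of many possible choices: Euclidean distances and single-linkage agglomerative merging. The objects, their measurements, and the target of two clusters are hypothetical; other proximity measures or linkage rules may yield different groupings.

```python
import math
from itertools import combinations

# Hypothetical objects measured on two variables (names and values are illustrative).
objects = {
    "A": (1.0, 1.0),
    "B": (1.5, 1.2),
    "C": (1.2, 0.8),
    "D": (6.0, 6.5),
    "E": (6.4, 6.0),
    "F": (5.8, 6.2),
}

def single_linkage(a, b):
    """Distance between two clusters = Euclidean distance of their closest members."""
    return min(math.dist(objects[i], objects[j]) for i in a for j in b)

def agglomerate(names, k):
    """Start with one cluster per object and merge the closest pair until k remain."""
    clusters = [frozenset([n]) for n in names]
    merges = []  # (distance, merged cluster): the information a dendrogram plots
    while len(clusters) > k:
        a, b = min(combinations(clusters, 2), key=lambda pair: single_linkage(*pair))
        merges.append((single_linkage(a, b), a | b))
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
    return clusters, merges

clusters, merges = agglomerate(objects, k=2)
print(sorted(sorted(c) for c in clusters))
for distance, members in merges:
    print(f"{distance:.3f}", sorted(members))
```

The resulting clusters are mutually exclusive, and the recorded merge distances give a basis for comparing and validating the solution.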
Clustering
Different methods produce different solutions
• Cluster analysis methods are not clearly established.
There are many options one may select when doing
a cluster analysis using a statistical package. Cluster
analysis is thus open to the criticism that a
statistician may mine the data trying different
methods of computing the proximities matrix and
linking groups until he or she "discovers" the
structure that he or she originally believed was
contained in the data. One wonders why anyone
would bother to do a cluster analysis for such a
purpose.
A Very Simple Cluster Analysis
• In cases of one or two measures, a visual
inspection of the data using a frequency
polygon or scatterplot often provides a clear
picture of grouping possibilities. For
example, "Example Assignment" is data
from a cluster analysis homework
assignment.
• It is fairly clear from this
picture that two subgroups, the
first including Julie, John, and
Ryan and the second including
everyone else except Dave,
describe the data fairly well.
• When faced with complex multivariate data, such
visualization procedures are not available and computer
programs assist in assigning objects to groups.
Dendrogram
The clusters and their relative distances are displayed in
a diagram called a dendrogram
The following HTML page describes the
logic involved in cluster analysis
algorithms.
http://www.cs.bsu.edu/homepages/dmz/cs689/ppt/entire_cluster_example.html
Correspondence Analysis
• A descriptive/exploratory technique designed to
analyze simple two-way and multi-way tables
containing some measure of correspondence
between the rows and columns.
• Provides information similar in nature to
that produced by factor analysis techniques
• Allows one to explore the structure of the categorical
variables included in the table.
• The most common kind of table of this type is the
two-way frequency crosstabulation table
• See http://www.statsoft.com/textbook/stcoran.html
Variable Selection, Creation
• If important constructs were discovered,
new factors would be introduced to
categorize the data
• Some may be dropped
WinSTAT
http://www.winstat.com/
Welcome! (text from their home page)
WinSTAT is the statistics Add-In for Microsoft Excel, and this is
the place to find out all about it.
Tired of your hard-to-use, need-to-be-a-fulltime-expert statistics
package? Find out why WinSTAT is the program for you.
Wondering if WinSTAT covers the functions and graphics you
need? Let the function reference page surprise you, complete
with sample outputs of tables and graphics for all functions.
Still not convinced? There's no way to be sure until you've tried
WinSTAT for yourself. We've got the demo download right
here.
Dmz note: WinSTAT also does clustering, factor analysis,
and the usual EDA techniques
Model
• If a complex predictive model is needed, the
researcher will move to the next step of the
process, building a model
• Modeling techniques include neural
networks, decision trees, sequence-based analysis,
classification, and estimation
Neural Networks
• Also called artificial neural networks (ANN)
• Collections of simple processing nodes that are
connected
• Each node operates only on its local data and on the
inputs it receives through connections
• The result is a nonlinear predictive model that
resembles biological neural networks and learns
through training.
Neural Networks
• The neural model has to train its network on
a training data set.
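A very small sketch of connected processing nodes: two hidden nodes feed one output node, and together they compute exclusive-or, a pattern no single linear rule can reproduce. The weights here are set by hand purely for illustration; in practice they would be found by training on a data set, which this sketch does not do.

```python
def node(inputs, weights, bias):
    """One processing node: weighted sum of its inputs, then a step activation."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def tiny_network(x1, x2):
    """Two hidden nodes feeding one output node; weights are hand-set, not trained."""
    h_or = node([x1, x2], [1.0, 1.0], -0.5)    # fires if either input is on
    h_and = node([x1, x2], [1.0, 1.0], -1.5)   # fires only if both inputs are on
    # Output fires when h_or is on and h_and is off, i.e. exclusive-or.
    return node([h_or, h_and], [1.0, -1.0], -0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", tiny_network(a, b))
```

Each node sees only its own weights and the values passed along its connections, yet the combined result is nonlinear.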
Tree Models
• Segregate data by using a hierarchy of if-then
statements based on the values of
variables and create a tree-shaped structure
that represents the segregation decisions.
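A tree model, once built, reads as nested if-then statements. The sketch below writes one out by hand; the variables, cut points, and outcome labels are hypothetical, whereas a tree-building algorithm would choose them from the data.

```python
def classify_applicant(income, years_employed, has_prior_default):
    """Hand-written decision tree; every variable and threshold is hypothetical."""
    if has_prior_default:          # root split
        return "decline"
    if income >= 50_000:           # second-level split
        return "approve"
    if years_employed >= 5:        # third-level split
        return "approve"
    return "review"                # leaf reached when no earlier rule applies

print(classify_applicant(income=62_000, years_employed=1, has_prior_default=False))
print(classify_applicant(income=30_000, years_employed=7, has_prior_default=False))
print(classify_applicant(income=30_000, years_employed=2, has_prior_default=False))
print(classify_applicant(income=90_000, years_employed=9, has_prior_default=True))
```

Every path from the first test to a return statement corresponds to one branch of the tree, so the segregation decisions stay readable.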
Classification – Sky Survey Cataloging
• To predict class (star or galaxy) of sky objects,
especially faint ones, based on telescopic survey
images (from Palomar Observatory)
• 3000 images with 23,040 x 23,040 pixels per image
– Approach:
– Segment the image
– Measure the image attributes (features), 40 of them per
object.
– Model the class based on these features
– Success Story: Could find 16 new red-shift quasars,
some of the farthest objects that are difficult to find
Estimation
• Variation of classification
• Instead of just a “yes” or “no” outcome,
generates a score
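One common way to produce a score rather than a bare yes/no is a logistic function of a weighted sum; this is an assumed choice for illustration, not a method named on the slide. The weights, bias, and threshold are hypothetical rather than fitted.

```python
import math

def score(features, weights, bias):
    """Logistic score between 0 and 1; weights and bias are hypothetical, not fitted."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1 / (1 + math.exp(-z))

def classify(s, threshold=0.5):
    """Collapse a score back to a yes/no outcome at a chosen threshold."""
    return "yes" if s >= threshold else "no"

weights, bias = [0.8, -0.4], -0.2
for features in ([2.0, 1.0], [0.0, 3.0]):
    s = score(features, weights, bias)
    print(features, round(s, 3), classify(s))
```

Keeping the score allows cases to be ranked, and the yes/no outcome can still be recovered by applying a threshold.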
Other Mining Techniques
• Association – find patterns across transactions
– Bundling of services
• Sequence-based analysis – takes into account not
only the combination of items but also the order of
the items
– In health care, can be used to predict the course of a disease and
order preventive care
• Fuzzy logic – extension of Boolean – can have truth values
between completely true and completely false
• Fractal-based transformation – works on gigabytes of data,
offering the possibility of identifying tiny subsets of data that
have common characteristics
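As an illustration of association across transactions, the sketch below counts how often pairs of services are bought together. The transactions are hypothetical, and support and confidence are standard measures from the association-rule literature rather than terms used on these slides.

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: services bought together by each customer.
transactions = [
    {"phone", "internet", "tv"},
    {"phone", "internet"},
    {"internet", "tv"},
    {"phone", "internet", "tv"},
    {"phone"},
]

item_counts = Counter()
pair_counts = Counter()
for basket in transactions:
    item_counts.update(basket)
    pair_counts.update(combinations(sorted(basket), 2))  # each unordered pair once

def support(a, b):
    """Share of all transactions that contain both items."""
    return pair_counts[tuple(sorted((a, b)))] / len(transactions)

def confidence(antecedent, consequent):
    """Share of transactions with the antecedent that also contain the consequent."""
    return pair_counts[tuple(sorted((antecedent, consequent)))] / item_counts[antecedent]

print("support(phone, internet):", support("phone", "internet"))
print("confidence(tv -> internet):", confidence("tv", "internet"))
print("confidence(phone -> tv):", confidence("phone", "tv"))
```

Pairs with high support and confidence are candidates for bundling; sequence-based analysis would additionally track the order in which the items appear.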
Other Statistical Products
• http://www.statsoftinc.com/ - also includes
an online statistical textbook
• Statlib: a major site for statistical software
of all sorts.
– Gopher to lib.stat.cmu.edu
– Anonymous ftp to lib.stat.cmu.edu
– URL: http://lib.stat.cmu.edu/