Descriptive Exploratory Data Analysis
9/6/2007
Jagdish S. Gangolly
State University of New York at Albany
Data Manipulation:
– Matrices: bind rows (rbind), bind columns (cbind)
– Arrays: rowMeans, colMeans, rowSums, colSums, rowVars, colVars, …
– apply(data, dim, function,…)
– attach(framename): permits you to refer to variables without cumbersome notation. You can detach the frame when done.
– function (x) { function definition}: To define
your own functions
– rm(comma-separated S-Plus objects): To remove objects (these operations are sketched below)
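Example (a minimal sketch of these operations in R syntax, which is largely S-Plus compatible; rowVars/colVars are S-Plus functions not in base R, so a plain apply() call stands in for them, and the data frame is hypothetical):

m <- rbind(c(1, 2, 3), c(4, 5, 6))    # bind two rows into a 2 x 3 matrix
m <- cbind(m, c(7, 8))                # bind on an extra column
rowMeans(m); colMeans(m); rowSums(m); colSums(m)
apply(m, 1, var)                      # per-row variances (rowVars in S-Plus)

df <- data.frame(height = c(64, 70, 68), weight = c(120, 170, 150))
attach(df)
mean(height)                          # refer to the variable without writing df$height
detach(df)

cv <- function(x) { sd(x) / mean(x) } # a user-defined function
cv(df$weight)
rm(m, cv)                             # remove objects when done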
Trellis Graphics I
• A matrix of graphs
Example:
>par(mfrow=c(2,2)) # 2 X 2 matrix of figures
>x <- 1:100/100:1
>plot(x) # plot cell (1,1)
>plot(x, type="l") # plot cell (1,2) line
>hist(x) # plot cell (2,1) histogram
>boxplot(x) # plot cell (2,2) boxplot
Trellis Graphics II
Syntax:
Dependent variable ~ explanatory variable | conditioning variable, data = data set
Output:
>trellis.device(motif)
>dev.off() or >graphics.off()
Trellis Graphics III
Example:
histogram(~height | voice.part,
data=singer)
– No dependent variable for a histogram
– height is the explanatory variable
– The data set is singer
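Example (for contrast, a formula with a dependent variable; this sketch uses R's lattice package, the open-source implementation of Trellis, which ships with the singer data set):

library(lattice)
bwplot(voice.part ~ height, data = singer,
       xlab = "Height (inches)")      # one box per voice part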
Trellis Graphics IV
• Layout: the layout, skip, and aspect parameters (p. 147).
• Ordering of graphs: left to right, bottom to top. If as.table=T, left to right, top to bottom (p. 149); see the sketch below.
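Example (a sketch of the layout and as.table parameters using R's lattice; layout = c(2, 4) gives 2 columns by 4 rows of panels):

library(lattice)
histogram(~ height | voice.part, data = singer,
          layout = c(2, 4), as.table = TRUE)  # fill panels left to right, top to bottom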
Data Mining
• What is Data mining?
• Data mining primitives
– Task-relevant data
– Kinds of knowledge to be mined
– Background knowledge
– Interestingness measures
– Visualisation of discovered patterns
• Query language
Data Mining
• Concept Description (Descriptive
Datamining)
– Data generalisation
• Data cube (OLAP) approach (offline pre-computation)
• Attribute-oriented induction approach (online aggregation)
• Presentation of generalisation
• Descriptive Statistical Measures and Displays
What is Data mining?
• Discovery of knowledge from
Databases
– A set of data mining primitives to facilitate
such discovery (what data, what kinds of
knowledge, measures to be evaluated,
how the knowledge is to be visualised)
– A query language for the user to
interactively visualise knowledge mined
Data mining primitives I
• Task-relevant data: attributes relevant for the
study of the problem at hand
• Kinds of knowledge to be mined:
characterisation, discrimination, association,
classification, clustering, evolution,…
• Background knowledge: Knowledge about
the domain of the problem (concept
hierarchies, beliefs about the relationships,
expected patterns of data, …)
Data mining primitives II
• Interestingness measures: support measures (prevalence of the rule pattern) and confidence measures (strength of the implication of the rule)
• Visualisation of discovered patterns: rules,
tables, charts, graphs, decision trees, cubes,…
Task-relevant Data
Steps:
• Derivation of initial relation through
database queries (data retrieval
operations). (Obtaining a minable view)
• Data cleaning & transformation of the
initial relation to facilitate mining
• Data mining
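Example (a small sketch of the first two steps on a hypothetical customer relation; subset() stands in for the database query that derives the initial relation):

customers <- data.frame(id     = 1:5,
                        age    = c(34, 29, NA, 45, 38),
                        income = c(52000, 41000, 60000, NA, 47000),
                        region = c("NE", "NE", "SW", "NE", "SW"))
view <- subset(customers, region == "NE",
               select = c(age, income))   # derive the minable view
view <- na.omit(view)                     # clean: drop incomplete tuples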
Kinds of knowledge to be mined
• Kinds of knowledge & templates (meta-patterns, meta-rules, meta-queries)
– Association
An Example:
age(X: customer, W) ∧ income(X, Y) ⇒ buys(X, Z)
– Classification
– Discrimination
– Clustering
– Evolution analysis
Background knowledge
• Knowledge from the problem domain
– usually in the form of
• concept hierarchies (rolling up or drilling down)
• schema hierarchies (lattices)
• set-grouping hierarchies (successive sub-grouping of
attributes)
• rule-based hierarchies
Interestingness measures I
• Simplicity: The more complex the structure, the more difficult it is to interpret, and so the less interesting it is likely to be (rule length, …)
• Certainty: Validity, trustworthiness
confidence(A ⇒ B) = (# tuples containing both A and B) / (# tuples containing A)
Sometimes called the "certainty factor"
Interestingness measures II
• Utility: Support is the percentage of task-relevant data tuples for which the pattern is true
support(A ⇒ B) = (# tuples containing both A and B) / (total # tuples)
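Example (a worked sketch of both measures in R on a small hypothetical set of transactions, each flagged for whether it contains item A and item B):

A <- c(TRUE,  TRUE, FALSE, TRUE, TRUE, FALSE)
B <- c(TRUE, FALSE, FALSE, TRUE, TRUE,  TRUE)
support    <- sum(A & B) / length(A)   # tuples with both A and B / total tuples
confidence <- sum(A & B) / sum(A)      # tuples with both A and B / tuples with A
support                                # 3/6 = 0.5
confidence                             # 3/4 = 0.75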
Visualisation of discovered patterns
• Hierarchies
• Tables
• Pie/bar charts
• Dot/box plots
• …
Descriptive Datamining
(Concept Description & Characterisation )
• Concept description: Description of data generalised at multiple levels of abstraction
• Concept characterisation: Concise and
succinct summarisation of a given collection
of data
• Concept comparison: Discrimination
Data Generalisation
• Abstraction of task-relevant high
conceptual level data from a database
containing relatively low conceptual level
data
– Data cube (OLAP) approach (offline pre-computation) (Figs 2.1 & 2.2, pages 46 & 47)
– Attribute-oriented induction approach (online
aggregation)
• Presentation of generalisation (Tables 5.3 &
5.4 on p. 191, and Figs 5.2, 5.3, & 5.4 on
pages 192 & 193)
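Example (a minimal sketch of roll-up style generalisation with aggregate() and tapply() on a hypothetical sales table; the data cube and attribute-oriented induction approaches in the text generalise this idea):

sales <- data.frame(region  = c("NE", "NE", "SW", "SW", "NE"),
                    quarter = c("Q1", "Q2", "Q1", "Q2", "Q1"),
                    amount  = c(100, 150, 80, 120, 60))
aggregate(amount ~ region, data = sales, FUN = sum)           # roll up to the region level
tapply(sales$amount, list(sales$region, sales$quarter), sum)  # region x quarter summary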
Descriptive Statistical Measures and Displays I
• Measures of central tendency
– Mean, Weighted mean (weights signifying
importance or occurrence frequency)
– Median
– Mode
• Measures of dispersion
– Quartiles, outliers, boxplots
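Example (these measures in R; base R has no built-in mode in this sense, so a small helper function is sketched):

x <- c(2, 4, 4, 5, 7, 9, 23)            # 23 is an outlier
w <- c(1, 1, 2, 2, 1, 1, 1)             # occurrence frequencies as weights
mean(x); weighted.mean(x, w); median(x)
stat.mode <- function(v) as.numeric(names(which.max(table(v))))
stat.mode(x)                            # 4, the most frequent value
quantile(x)                             # minimum, quartiles, maximum
boxplot(x)                              # the boxplot flags 23 as an outlier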
Descriptive Statistical Measures and Displays II
• Displays
– Histograms (Fig 5.6, page 214)
– Barcharts
– Quantile plot (Fig 5.7, page 215)
– Quantile-Quantile plot (Fig 5.8, page 216)
– Scatter plot (Fig 5.9, page 216)
– Loess curve (Fig 5.10, page 217)
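Example (base-R sketches of these displays; the figure numbers above refer to the textbook):

x <- rnorm(100, mean = 50, sd = 10)
y <- x + rnorm(100, sd = 5)
hist(x)                                 # histogram
barplot(table(cut(x, 5)))               # bar chart of binned counts
plot(ppoints(length(x)), sort(x),
     xlab = "f-value", ylab = "x")      # quantile plot: f-values vs ordered data
qqplot(x, y)                            # quantile-quantile plot of two samples
plot(x, y); lines(lowess(x, y))         # scatter plot with a loess/lowess curve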
Descriptive Data Exploration
• summary: mean, median, quartiles (p. 171)
• stem: stem-and-leaf display (p. 171)
• quantile (p. 172)
• stdev (p. 173)
• tapply: splits data (p. 174)
• by (p. 175)
• mean works on vectors; other structures need to be converted to vectors before computing means
• (example on pp. 176-177)
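Example (a sketch of these exploration functions in R syntax; stdev is the S-Plus name, and R's equivalent is sd):

data(singer, package = "lattice")       # the singer data used in the Trellis examples
summary(singer$height)                  # mean, median, quartiles
stem(singer$height)                     # stem-and-leaf display
quantile(singer$height, c(0.1, 0.5, 0.9))
sd(singer$height)                       # stdev in S-Plus
tapply(singer$height, singer$voice.part, mean)   # split the data, then summarise
by(singer, singer$voice.part, function(d) summary(d$height))
sapply(singer["height"], mean)          # take column means of a data frame via sapply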
Data Preprocessing for Datamining I
• Why
– Incomplete
• Attribute values not available, equipment
malfunctions, not considered important
– Noisy (errors)
• instrument problems, human/computer errors,
transmission errors
– Inconsistent
• inconsistencies due to data definitions
Data Preprocessing for Datamining II
• Data Cleaning
– Missing values:
• ignore the tuple, fill in values manually, use a global constant ("unknown"), missing value = attribute mean, missing value = attribute group mean, missing value = most probable value (see the sketch after this slide)
– Noisy data:
• Binning: partitioning into equi-sized bins, smoothing by bin
means or bin boundaries
• Clustering
• Inspection: computer & human
• Regression
– Inconsistencies
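Example (a sketch of mean imputation and of binning with smoothing by bin means, on hypothetical values; cut() gives equal-width bins here, while equal-frequency binning is the other common reading of "equi-sized"):

price <- c(4, 8, NA, 15, 21, 21, 24, 25, NA, 34)
price[is.na(price)] <- mean(price, na.rm = TRUE)   # missing value = attribute mean
bins     <- cut(price, breaks = 4)                 # partition into 4 bins
smoothed <- ave(price, bins, FUN = mean)           # smooth: replace each value by its bin mean
cbind(price, bin = as.integer(bins), smoothed)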
Data Preprocessing for Datamining III
• Data Integration: Combining data from different
sources into a coherent whole
– Schema integration: combining data models (entity
identification problems)
– Redundancy (derived values, calculated fields, use of
different key attributes): use of correlations to detect
redundancies
– Resolution of data value conflicts (coding values in
different measures)
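Example (a sketch of redundancy detection with cor() on hypothetical attributes; a correlation near ±1 suggests one attribute is derivable from another):

annual_salary  <- c(40000, 55000, 62000, 48000, 75000)
monthly_salary <- annual_salary / 12               # a derived, redundant attribute
years_service  <- c(2, 8, 11, 5, 15)
cor(annual_salary, monthly_salary)                 # 1: redundant
cor(cbind(annual_salary, monthly_salary, years_service))   # full correlation matrix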
Data Preprocessing for Datamining III
• Transformation
– Smoothing
– Aggregation
– Generalisation
– Normalisation (see the sketch below)
– Attribute (or feature) construction
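Example (a sketch of min-max and z-score normalisation, plus a small aggregation, on hypothetical values):

x <- c(200, 300, 400, 600, 1000)
(x - min(x)) / (max(x) - min(x))        # min-max normalisation to [0, 1]
as.vector(scale(x))                     # z-score normalisation (zero mean, unit sd)

revenue <- c(5, 7, 6, 8, 9, 7, 10, 12, 11, 9, 8, 10)   # monthly values
tapply(revenue, rep(1:4, each = 3), sum)               # aggregate to quarterly totals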
Data Preprocessing for Datamining IV
• Data Reduction & compression
– Data cube aggregation (p.117)
– Dimension reduction: minimise loss of
information.
• Attribute selection
• Decision tree induction
• Principal components analysis
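Example (a sketch of principal components analysis with prcomp(), using R's built-in iris measurements as stand-in data):

num <- iris[, 1:4]                      # four numeric attributes
pca <- prcomp(num, scale. = TRUE)       # standardise, then rotate
summary(pca)                            # proportion of variance per component
reduced <- pca$x[, 1:2]                 # keep the first two components
dim(reduced)                            # 150 observations, 2 dimensions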
Data Preprocessing for Datamining IV
– Numerosity reduction
• Regression/log-linear regression
• Histograms
• Clustering
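Example (sketches of the three reductions above in base R, on simulated data):

set.seed(1)
x <- runif(200, 0, 100)
y <- 3 * x + rnorm(200, sd = 10)
fit <- lm(y ~ x); coef(fit)             # regression: keep coefficients, not raw points
h <- hist(x, breaks = 10, plot = FALSE)
cbind(upper = h$breaks[-1], count = h$counts)   # histogram: keep bin bounds and counts
km <- kmeans(cbind(x, y), centers = 3)
km$centers                              # clustering: keep cluster centres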