Transcript Document
Data Analysis and Mining
Peter Fox
Data Science – ITEC/CSCI/ERTH-6961-01
Week 7, October 18, 2011
Reading assignment
• Brief Introduction to Data Mining
• Longer Introduction to Data Mining and slide
sets
• Software resources list
• Data Analysis Tutorial
• Example: Data Mining
Contents
• Preparing for data analysis, completing and
presenting results
• Visualization as an information tool
• Visualization as an analysis tool
• New visualization methods (new types of
data)
• Managing the output of viz/ analysis
• Enabling discovery
• Use, citation, attribution and reproducibility
Contents
• Data Mining – what it is, is not, types
• Distributed applications – modern data
mining
• Science example
• A specific toolkit set of examples (next
week)
– Classifier
– Image analysis – clouds
• Assignment 3 and week 7 reading
• Week 8
Types of data
Data types
• Time-based, space-based, image-based, …
• Encoded in different formats
• May need to manipulate the data, e.g.
– In our Data Mining tutorial and conversion to
ARFF
– Coordinates
– Units
– Higher order, e.g. derivative, average
Induction or deduction?
• Induction: The development of theories from
observation
– Qualitative – usually information-based
• Deduction: The testing/application of theories
– Quantitative – usually numeric, data-based
‘Signal to noise’
• Understanding accuracy and precision
– Accuracy
– Precision
• Affects choices of analysis
• Affects interpretations (GIGO: garbage in, garbage out)
• Leads to data quality and assurance
specification
• Signal and noise are context dependent
Other considerations
• Continuous or discrete
• Underlying reference system
• Oh yeah: metadata standards and
conventions
• The underlying data structures are important
at this stage but there is a tendency to read in
partial data
– Why is this a problem?
– How to ameliorate any problems?
Outlier
• An extreme, or atypical, data value in a
sample.
• Outliers should be considered carefully before
exclusion from analysis.
• For example, data values may be recorded
erroneously, and hence they may be
corrected.
• However, in other cases they may just be
surprisingly different, but not necessarily
'wrong'.
Special values in data
• Fill value
• Error value
• Missing value
• Not-a-number
• Infinity
• Default
• Null
• Rational numbers
Errors
• Three main types: personal error, systematic
error, and random error
• Personal errors are mistakes on the part of
the experimenter. It is your responsibility to
make sure that there are no errors in
recording data or performing calculations
• Systematic errors tend to decrease or
increase all measurements of a quantity, (for
instance all of the measurements are too
large). E.g. calibration
Errors
• Random errors are also known as statistical
uncertainties, and are a series of small,
unknown, and uncontrollable events
• Statistical uncertainties are much easier to
assign, because there are rules for estimating
the size
• E.g. If you are reading a ruler, the statistical
uncertainty is half of the smallest division on
the ruler. Even if you are recording a digital
readout, the uncertainty is half of the smallest
place given. This type of error should always
be recorded for any measurement
Standard measures of error
• Absolute deviation
– is simply the difference between an
experimentally determined value and the
accepted value
• Relative deviation
– is a more meaningful value than the absolute
deviation because it accounts for the relative size
of the error. The relative percentage deviation is
given by the absolute deviation divided by the
accepted value and multiplied by 100%
• Standard deviation
– standard definition
Standard deviation
• First, the average value is found by summing and
dividing by the number of determinations.
• Then the residuals are found by taking the
absolute value of the difference between
each determination and the average value.
• Third, square the residuals and sum them.
• Last, divide the result by (the number of
determinations - 1) and take the square root.
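A minimal sketch of those steps in Python (the function name is illustrative):

```python
import math

def std_dev(values):
    """Sample standard deviation, following the steps above."""
    n = len(values)
    mean = sum(values) / n                       # 1. average
    residuals = [abs(v - mean) for v in values]  # 2. residuals
    squares = sum(r ** 2 for r in residuals)     # 3. square and sum
    return math.sqrt(squares / (n - 1))          # 4. divide by n-1, root

print(std_dev([4.0, 5.0, 6.0]))  # 1.0
```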
Propagating errors
• This is an unfortunate term – it means making
sure that the result of the analysis carries with
it a calculation (rather than an estimate) of the
error
• E.g. if C=A+B (your analysis), then ∂C=∂A+∂B
• E.g. if C=A-B (your analysis), then ∂C=∂A+∂B!
• Exercise – it’s not as simple for other calcs.
• When the function is not merely addition,
subtraction, multiplication, or division, the error
propagation must be defined by the total
derivative of the function.
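As a sketch of the worst-case rules for the simple operations (function names are illustrative, not from any particular library):

```python
def err_add_sub(dA, dB):
    # C = A + B or C = A - B: the absolute errors add in the worst case
    return dA + dB

def err_mul_div(A, dA, B, dB):
    # C = A*B or C = A/B: the relative errors add, so
    # dC = |C| * (dA/|A| + dB/|B|)  (from the total derivative)
    return abs(A * B) * (dA / abs(A) + dB / abs(B))

print(err_add_sub(0.1, 0.2))            # about 0.3
print(err_mul_div(2.0, 0.1, 3.0, 0.2))  # about 0.7
```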
Types of analysis
• Preliminary
• Detailed
• Summary
• Reporting the results and propagating
uncertainty
• Qualitative v. quantitative, e.g. see
http://hsc.uwe.ac.uk/dataanalysis/index.asp
What is preliminary analysis?
• Self-explanatory…?
• Down sampling…?
• The more measurements that can be made of
a quantity, the better the result
– Reproducibility is an axiom of science
• When time is involved, e.g. a signal – the
‘sampling theorem’ – having an idea of the
hypothesis is useful, e.g. periodic versus
aperiodic or other…
• http://en.wikipedia.org/wiki/Nyquist–Shannon_sampling_theorem
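A tiny illustrative check of the sampling-rate condition (a sketch, not a substitute for reading the theorem):

```python
def nyquist_ok(sample_rate_hz, max_signal_hz):
    """True when the rate is high enough to reconstruct a
    band-limited signal (the sampling theorem condition)."""
    return sample_rate_hz > 2 * max_signal_hz

print(nyquist_ok(44100, 20000))  # True
print(nyquist_ok(1000, 600))     # False -> the 600 Hz component aliases
```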
Detailed analysis
• The most important distinction between the initial
and the main analysis is that during initial data
analysis one refrains from any analysis aimed at
answering the original research question.
• Basic statistics of important variables
– Scatter plots
– Correlations
– Cross-tabulations
• Dealing with quality, bias, uncertainty,
accuracy, precision limitations - assessing
• Dealing with under- or over-sampling
• Filtering, cleaning
Summary analysis
• Collecting the results and accompanying
documentation
• Repeating the analysis (yes, it’s obvious)
• Repeating with a subset
• Assessing significance, e.g. the confusion
matrix we used in the supervised
classification example for data mining,
p-values (null hypothesis probability)
Reporting results/ uncertainty
• Consider the number of significant digits in
the result which is indicative of the certainty
of the result
• Number of significant digits depends on the
measuring equipment you use and the
precision of the measuring process - do not
report digits beyond what was recorded
• The number of significant digits in a value
implies the precision of that value
Reporting results…
• In calculations, it is important to keep enough
digits to avoid round off error.
• In general, keep at least one more digit than
is significant in calculations to avoid round off
error
• It is not necessary to round every
intermediate result in a series of calculations,
but it is very important to round your final
result to the correct number of significant
digits.
Uncertainty
• Results are usually reported as result ±
uncertainty (or error)
• The uncertainty is given to one significant
digit, and the result is rounded to that place
• For example, a result might be reported as
12.7 ± 0.4 m/s². A more precise result would
be reported as 12.745 ± 0.004 m/s². A result
should not be reported as 12.70361 ± 0.2
m/s²
• Units are very important to any result
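A small sketch of this rounding convention (a hypothetical helper, assuming a nonzero uncertainty):

```python
import math

def report(value, uncertainty):
    """Round the uncertainty to one significant digit and the
    value to the same decimal place."""
    place = math.floor(math.log10(abs(uncertainty)))  # position of leading digit
    return f"{round(value, -place)} ± {round(uncertainty, -place)}"

print(report(12.70361, 0.2))  # 12.7 ± 0.2
print(report(12.745, 0.004))  # 12.745 ± 0.004
```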
Secondary analysis
• Depending on where you are in the data
analysis pipeline (i.e. do you know?)
• Having a clear enough awareness of what
has been done to the data (either by you or
others) prior to the next analysis step is very
important – it is very similar to sampling bias
• Read the metadata (or create it) and
documentation
Tools
• 4GL
– Matlab
– IDL
– Ferret
– NCL
– Many others
• Statistics
– SPSS
– Gnu R
• Excel
• What have you used?
Considerations for viz. as analysis
• What is the improvement in the
understanding of the data as compared to the
situation without visualization?
• Which visualization techniques are suitable
for one's data?
– E.g. Are direct volume rendering techniques to be
preferred over surface rendering techniques?
Why visualization?
• Reducing amount of data, quantization
• Patterns
• Features
• Events
• Trends
• Irregularities
• Leading to presentation of data, i.e.
information products
• Exit points for analysis
Types of visualization
• Color coding (including false color)
• Classification of techniques is based on
– Dimensionality
– Information being sought, i.e. purpose
• Line plots
• Contours
• Surface rendering techniques
• Volume rendering techniques
• Animation techniques
• Non-realistic, including ‘cartoon/ artist’ style
Compression (any format)
• Lossless compression methods are methods for
which the original, uncompressed data can be
recovered exactly. Examples of this category are the
Run Length Encoding, and the Lempel-Ziv Welch
algorithm.
• Lossy methods - in contrast to lossless compression,
the original data cannot be recovered exactly after a
lossy compression of the data. An example of this
category is the Color Cell Compression method.
• Lossy compression techniques can reach reduction
rates of 0.9, whereas lossless compression
techniques normally have a maximum reduction rate
of 0.5.
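As an illustration of the lossless case, a minimal Run Length Encoding in Python; decoding recovers the input exactly:

```python
def rle_encode(seq):
    """Run Length Encoding: collapse runs into [value, count] pairs."""
    runs = []
    for x in seq:
        if runs and runs[-1][0] == x:
            runs[-1][1] += 1
        else:
            runs.append([x, 1])
    return runs

def rle_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    return [x for x, n in runs for _ in range(n)]

data = list("AAABBC")
encoded = rle_encode(data)
print(encoded)                      # [['A', 3], ['B', 2], ['C', 1]]
print(rle_decode(encoded) == data)  # True: recovered exactly (lossless)
```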
Remember - metadata
• Many of these formats already contain
metadata or fields for metadata, use them!
Tools
• Conversion
– Imtools
– GraphicConverter
– Gnu convert
– Many more
• Combination/Visualization
– IDV
– Matlab
– Gnuplot
– http://disc.sci.gsfc.nasa.gov/giovanni
New modes
• http://www.actoncopenhagen.decc.gov.uk/content/en/embeds/flash/4-degrees-large-mapfinal
• http://www.smashingmagazine.com/2007/08/02/data-visualization-modern-approaches/
• Many modes:
– http://www.siggraph.org/education/materials/HyperVis/domik/folien.html
Periodic table
Publications, web sites
• www.jove.com - Journal of Visualized
Experiments
• www.visualizing.org
• logd.tw.rpi.edu
Managing visualization products
• The importance of a ‘self-describing’ product
• Visualization products are not just consumed
by people
• How many images, graphics files do you
have on your computer for which the origin,
purpose, use is still known?
• How are these logically organized?
(Class 2) Management
• Creation of logical collections
• Physical data handling
• Interoperability support
• Security support
• Data ownership
• Metadata collection, management and
access.
• Persistence
• Knowledge and information discovery
• Data dissemination and publication
Use, citation, attribution
• Think about and implement a way for others
(including you) to easily use, cite, attribute
any analysis or visualization you develop
• This must include suitable connections to the
underlying (aka backbone) data – and note
this may not just be the full data set!
• Naming, logical organization, etc. are key
• Make them a resource, e.g. URI/ URL
Producibility/ reproducibility
• The documentation around procedures used
in the analysis and visualization are very
often neglected – DO NOT make this mistake
• Treat this just like a data collection (or
generation) exercise
• Follow your management plan
• Despite the lack of, or minimal, metadata/
metainformation standards, capture and
record it
• Get someone else to verify that it works
Data Mining – What it is
• Extracting knowledge from large amounts of data
• Motivation
– Our ability to collect data has expanded rapidly
– It is impossible to analyze all of the data manually
– Data contains valuable information that can aid in decision making
• Uses techniques from:
– Pattern Recognition
– Machine Learning
– Statistics
– High Performance Database Systems
– OLAP
• Plus techniques unique to data mining (Association rules)
• Data mining methods must be efficient and scalable
Data Mining – What it isn’t
• Small Scale
– Data mining methods are designed for large data sets
– Scale is one of the characteristics that distinguishes data mining
applications from traditional machine learning applications
• Foolproof
– Data mining techniques will discover patterns in any data
– The patterns discovered may be meaningless
– It is up to the user to determine how to interpret the results
– “Make it foolproof and they’ll just invent a better fool”
• Magic
– Data mining techniques cannot generate information that is not
present in the data
– They can only find the patterns that are already there
Data Mining – Types of Mining
• Classification (Supervised Learning)
– Classifiers are created using labeled training samples
– Training samples created by ground truth / experts
– Classifier later used to classify unknown samples
• Clustering (Unsupervised Learning)
– Grouping objects into classes so that similar objects are in the
same class and dissimilar objects are in different classes
– Discover overall distribution patterns and relationships between
attributes
• Association Rule Mining
– Initially developed for market basket analysis
– Goal is to discover relationships between attributes
– Uses include decision support, classification and clustering
• Other Types of Mining
– Outlier Analysis
– Concept / Class Description
– Time Series Analysis
Data Mining in the ‘new’ Distributed
Data/Services Paradigm
Science Motivation
• Study the impact of natural iron fertilization processes,
such as dust storms, on plankton growth and
subsequent DMS production
– Plankton plays an important role in the carbon cycle
– Plankton growth is strongly influenced by nutrient
availability (Fe/Ph)
– Dust deposition is an important source of Fe over the ocean
– Satellite data is an effective tool for monitoring the effects
of dust fertilization
Hypothesis
• In remote ocean locations there is a positive
correlation between the area averaged
atmospheric aerosol loading and oceanic
chlorophyll concentration
• There is a time lag between oceanic dust
deposition and the photosynthetic activity
[Figure: primary sources of ocean nutrients – ocean
upwelling, wind-blown dust, and sediments from rivers;
labels include SAHARA and CLOUDS]
[Figure: factors modulating the dust-ocean photosynthetic
effect – SST, chlorophyll, dust, nutrients; label SAHARA]
Objectives
• Use satellite data to determine if
atmospheric dust loading and
phytoplankton photosynthetic activity are
correlated.
• Determine physical processes responsible
for observed relationship
Data and Method
• Data sets obtained from SeaWiFS and
MODIS during 2000 – 2006 are employed
• MODIS derived AOT
• SeaWIFS
• MODIS
• AOT
The areas of study
*Figure: annual SeaWiFS chlorophyll image for 2001, with eight
study regions marked:
1-Tropical North Atlantic Ocean 2-West coast of Central Africa
3-Patagonia 4-South Atlantic Ocean 5-South Coast of Australia
6-Middle East 7-Coast of China 8-Arctic Ocean
Tropical North Atlantic Ocean dust from Sahara Desert
[Figure: AOT and chlorophyll time series; correlation values
(all negative): -0.0902, -0.328, -0.4595, -0.14019, -0.7253,
-0.1095, -0.75102, -0.66448, -0.72603, -0.17504, -0.68497,
-0.15874, -0.85611, -0.4467]
Arabian Sea Dust from Middle East
[Figure: AOT and chlorophyll time series; correlation values
(all positive): 0.66618, 0.65211, 0.76650, 0.37991, 0.45171,
0.52250, 0.36517, 0.5618, 0.4412, 0.75071, 0.708625, 0.8495,
0.59895, 0.69797]
Summary …
• Dust impacts the ocean's photosynthetic activity:
positive correlations in some areas, NEGATIVE
correlations in other areas, especially in the Saharan
basin
• Hypothesis for explaining observations of negative
correlation: In areas that are not nutrient limited,
dust reduces photosynthetic activity
• But also need to consider the effect of clouds,
ocean currents. Also need to isolate the effects of
dust. MODIS AOT product includes contribution
from dust, DMS, biomass burning etc.
Models/ types
• Trade-off between Accuracy and
Understandability
• Models range from “easy to understand” to
incomprehensible (listed from easier to harder):
– Decision trees
– Rule induction
– Regression models
– Neural Networks
Qualitative and Quantitative
• Qualitative
– Provide insight into the data you are working with
• If city = New York and 30 < age < 35 …
• Important age demographic was previously 20 to 25
• Change print campaign from Village Voice to New
Yorker
– Requires interaction capabilities and good
visualization
• Quantitative
– Automated process
• Score new gene chip datasets with error model every
night at midnight
– Bottom-line orientation
Management
• Creation of logical collections
• Physical data handling
• Interoperability support
• Security support
• Data ownership
• Metadata collection, management and
access.
• Persistence
• Knowledge and information discovery
• Data dissemination and publication
Provenance*
• Origin or source from which something
comes, intention for use, who/what it was
generated for, manner of manufacture,
history of subsequent owners, sense of
place and time of manufacture,
production or discovery; documented in
detail sufficient to allow
reproducibility
ADaM – System Overview
• Developed by the Information Technology and Systems
Center at the University of Alabama in Huntsville
• Consists of over 75 interoperable mining and image
processing components
• Each component is provided with a C++ application
programming interface (API), an executable in support of
scripting tools (e.g. Perl, Python, Tcl, Shell)
• ADaM components are lightweight and autonomous, and have
been used successfully in a grid environment
• ADaM has several translation components that provide data
level interoperability with other mining systems (such as
WEKA and Orange), and point tools (such as libSVM and
svmLight)
• Future versions will include Python wrappers and possible web
service interfaces
ADaM 4.0 Components
ADaM Classification Process
• Identify potential features which may characterize the
phenomenon of interest
• Generate a set of training instances where each instance
consists of a set of feature values and the corresponding class
label
• Describe the instances using ARFF file format
• Preprocess the data as necessary (normalize, sample etc.)
• Split the data into training / test set(s) as appropriate
• Train the classifier using the training set
• Evaluate classifier performance using test set
• K-Fold cross validation, leave one out or other more
sophisticated methods may also be used for evaluating
classifier performance
ADaM Classification Example
• Starting with an ARFF file, the ADaM system will be used to
create a Naïve Bayes classifier and evaluate it
• The source data will be an ARFF version of the Wisconsin
breast cancer data from the University of California Irvine
(UCI) Machine Learning Database:
http://www.ics.uci.edu/~mlearn/MLRepository.html
• The Naïve Bayes classifier will be trained to distinguish
malignant vs. benign tumors based on nine characteristics
Naïve Bayes Classification
• Classification problem with m classes C1, C2, … Cm
• Given an unknown sample X, the goal is to choose a class that
is most likely based on statistics from training data
P(Ci | X) can be computed using Bayes’ Theorem:
[1] Equations from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”,
Morgan Kaufmann, 2001.
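The equation itself appeared as an image in the original slides; the standard form of Bayes' theorem cited here is:

```latex
P(C_i \mid X) = \frac{P(X \mid C_i)\,P(C_i)}{P(X)}
```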
Naïve Bayes Classification
• P(X) is constant for all classes, so finding the most likely class
amounts to maximizing P(X | Ci) P(Ci)
• P(Ci ) is the prior probability of class i. If the probabilities are not
known, equal probabilities can be assumed.
• Assuming attributes are conditionally independent:
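The factorization shown on the original slide (as an image) is, in standard form, a product over the n attributes:

```latex
P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i)
```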
P(xk | Ci) is the probability density function for attribute k
[1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”,
Morgan Kaufmann, 2001.
Naïve Bayes Classification
P(xk | Ci) is estimated from the training samples
• Categorical attributes (non-numeric attributes)
– Estimate P(xk | Ci) as percentage of samples of class i with value xk
– Training involves counting percentage of occurrence of each
possible value for each class
• Numeric attributes
– Also use statistics of the sample data to estimate P(xk | Ci)
– Actual form of density function is generally not known, so Gaussian
density is often assumed
– Training involves computation of mean and variance for each
attribute for each class
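A minimal sketch of the training and classification steps just described, assuming numeric attributes with Gaussian densities (an illustration, not ADaM's implementation; names are ours):

```python
import math
from collections import defaultdict

def train(samples):
    """samples: list of (features, label). Stores the class prior and,
    per attribute, the (mean, variance) statistics described above."""
    by_class = defaultdict(list)
    for features, label in samples:
        by_class[label].append(features)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for k in range(len(rows[0])):
            col = [row[k] for row in rows]
            mean = sum(col) / len(col)
            var = sum((v - mean) ** 2 for v in col) / len(col)
            stats.append((mean, max(var, 1e-9)))  # guard zero variance
        model[label] = (len(rows) / len(samples), stats)
    return model

def classify(model, features):
    """Pick the class maximizing log P(Ci) + sum_k log P(xk | Ci)."""
    best, best_score = None, float("-inf")
    for label, (prior, stats) in model.items():
        score = math.log(prior)
        for v, (mean, var) in zip(features, stats):
            score += -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
        if score > best_score:
            best, best_score = label, score
    return best

model = train([([1.0, 1.1], 0), ([0.9, 1.0], 0),
               ([5.0, 5.2], 1), ([5.1, 4.9], 1)])
print(classify(model, [1.0, 1.0]))  # 0
print(classify(model, [5.0, 5.0]))  # 1
```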
Naïve Bayes Classification
Gaussian distribution for numeric attributes:

$$P(x_k \mid C_i) = \frac{1}{\sqrt{2\pi}\,\sigma_{C_i}} \exp\!\left(-\frac{(x_k - \mu_{C_i})^2}{2\sigma_{C_i}^2}\right)$$

– where $\mu_{C_i}$ is the mean of attribute k observed in samples of class Ci
– and $\sigma_{C_i}$ is the standard deviation of attribute k observed in samples
of class Ci
[1] Equation from J. Han and M. Kamber, “Data Mining: Concepts and Techniques”,
Morgan Kaufmann, 2001.
Sample Data Set – ARFF
Format
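The slide's data listing was an image; as a hedged illustration, a minimal fragment in the Weka ARFF format (these attribute names are placeholders, not the actual Wisconsin file) looks like:

```
@relation breast-cancer-wisconsin

@attribute clump_thickness numeric
@attribute uniformity_of_cell_size numeric
@attribute class {benign, malignant}

@data
5,1,benign
8,10,malignant
```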
Data management
• Metadata?
• Data?
• File naming?
• Documentation?
Splitting the Samples
• ADaM has utilities for splitting data sets into disjoint
groups for training and testing classifiers
• The simplest is ITSC_Sample, which splits the source
data set into two disjoint subsets
Splitting the Samples
• For this demo, we will split the breast cancer data set into two
groups, one with 2/3 of the patterns and another with 1/3 of the
patterns:
ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66
• The -i argument specifies the input file name
• The -o and -t arguments specify the names of the two output
files (-o = output one, -t = output two)
• The -p argument specifies the portion of data that goes into
output one (trn.arff), the remainder goes to output two (tst.arff)
• The -c argument tells the sample program which attribute is the
class attribute
Provenance?
• For this demo, we will split the breast cancer data set into two
groups, one with 2/3 of the patterns and another with 1/3 of the
patterns:
ITSC_Sample -c class -i bcw.arff -o trn.arff -t tst.arff -p 0.66
• What needs to be recorded and why?
• What about intermediate files and why?
• How are they logically organized?
Training the Classifier
• ADaM has several different types of classifiers
• Each classifier has a training method and an application
method
• ADaM’s Naïve Bayes classifier has the following syntax:
Training the Classifier
• For this demo, we will train a Naïve Bayes classifier:
ITSC_NaiveBayesTrain -c class -i trn.arff -b bayes.txt
• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file:
Applying the Classifier
• Once trained, the Naïve Bayes classifier can be used to
classify unknown instances
• The syntax for ADaM’s Naïve Bayes classifier is as follows:
Applying the Classifier
• For this demo, the classifier is run as follows:
ITSC_NaiveBayesApply -c class -i tst.arff -b bayes.txt -o res_tst.arff
• The -i argument specifies the input file name
• The -c argument specifies the name of the class attribute
• The -b argument specifies the name of the classifier file
• The -o argument specifies the name of the result file:
Evaluating Classifier
Performance
• By applying the classifier to a test set where the correct
class is known in advance, it is possible to compare the
expected output to the actual output.
• The ITSC_Accuracy utility performs this function:
Confusion matrix
Classified \ Actual |        0        |        1
--------------------+-----------------+----------------
         0          | TRUE POSITIVES  | FALSE POSITIVES
         1          | FALSE NEGATIVES | TRUE NEGATIVES
• Gives a guide to accuracy but samples (i.e. bias)
are important to take into account
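From the four counts one can derive the usual summary measures; a small sketch (sample bias still matters, as noted above):

```python
def metrics(tp, fp, fn, tn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp)  # fraction of positive calls that are correct
    recall = tp / (tp + fn)     # fraction of actual positives detected
    return accuracy, precision, recall

acc, prec, rec = metrics(tp=40, fp=10, fn=5, tn=45)
print(acc, prec, round(rec, 3))  # 0.85 0.8 0.889
```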
Evaluating Classifier
Performance
• For this demo, ITSC_Accuracy is run as follows:
ITSC_Accuracy -c class -t res_tst.arff -v tst.arff -o acc_tst.txt
Python Script for
Classification
How would you modify this?
What is the provenance?
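The script itself appeared as a screenshot in the original slides; a hypothetical reconstruction that simply chains the ADaM command-line tools from the preceding slides (file names follow the demo; the ITSC_* executables must be on your PATH) might look like:

```python
import subprocess

# Hypothetical reconstruction of the missing script: each step is one
# of the commands shown earlier in this demo.
STEPS = [
    ["ITSC_Sample", "-c", "class", "-i", "bcw.arff",
     "-o", "trn.arff", "-t", "tst.arff", "-p", "0.66"],
    ["ITSC_NaiveBayesTrain", "-c", "class", "-i", "trn.arff", "-b", "bayes.txt"],
    ["ITSC_NaiveBayesApply", "-c", "class", "-i", "tst.arff",
     "-b", "bayes.txt", "-o", "res_tst.arff"],
    ["ITSC_Accuracy", "-c", "class", "-t", "res_tst.arff",
     "-v", "tst.arff", "-o", "acc_tst.txt"],
]

def run_pipeline():
    for cmd in STEPS:
        print("running:", " ".join(cmd))  # log each command (provenance)
        subprocess.run(cmd, check=True)   # stop on the first failure
```

Calling run_pipeline() runs the four steps in order; logging each command line is one simple answer to the provenance question on this slide.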
ADaM Image Classification
• Classification of image data is a bit more involved, as there is
an additional set of steps that must be performed to extract
useful features from the images before classification can be
performed.
• In addition, it is also useful to transform the data back into
image format for visualization purposes.
• As an example problem, we will consider detection of
cumulus cloud fields in GOES satellite images
– GOES satellites produce a 5 channel image every 15 minutes
– The classifier must label each pixel as either belonging to a
cumulus cloud field or not based on the GOES data
– Algorithms based on spectral properties often miss cumulus
clouds because of the low resolution of the IR channels and the
small size of clouds
– Texture features computed from the GOES visible image provide
a means to detect cumulus cloud fields.
GOES Images Preprocessing
• Segmentation is based only on the high resolution (1km)
visible channel.
• In order to remove the effects of the light reflected from the
Earth’s surface, a visible reference background image is
constructed for each time of the day.
• The reference image is subtracted from the visible image
before it is segmented.
• GOES image patches containing cumulus cloud regions, other
cloud regions, and background were selected
• Independent experts labeled each pixel of the selected image
patches as cumulus cloud or not
• The expert labels were combined to form a single “truth”
image for each of the original image patches. In cases where
the experts disagreed, the truth image was given a “don’t
know” value
GOES Images - Example
GOES Visible Image
Expert Labels
Image Quantization
• Some texture features perform better when the image is
quantized to some small number of levels before the
features are computed.
• ITSC_RelLevel performs local image quantization
Image Quantization
• For this demo, we will reduce the number of levels from 256
to just three using local image statistics:
ITSC_RelLevel -d -s 30 -i src.bin -o q4.bin -k
• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -d argument tells the program to use standard deviation
to set the cutoffs instead of a fixed value
• The -k option tells the program to keep values in the range
0, 1, 2 rather than normalizing to 0..1.
• The -s argument indicates the size of the local area used to
compute statistics
Computing Texture Features
• ADaM is currently able to compute five different types of
texture features: gray level cooccurrence, gray level run
length, association rules, Gabor filters, and MRF models
• The syntax for gray level run length computation is:
Computing Texture Features
• For this demo, we will compute gray level run length features
using a tile size of 25:
ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25
• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -l argument tells the program the number of levels in
the input image
• The -B option tells the program to write a binary version of
the ARFF file (default is ASCII)
• The -t argument indicates the size of the tiles used to
compute the gray level run length features
Provenance alert!
• For this demo, we will compute gray level run length features
using a tile size of 25:
ITSC_Glrl -i q4.bin -o glrl.arff -l 3 -B -t 25
• What needs to be documented here and why?
Converting the Label Images
• Since the labels are in the form of images, it is
necessary to convert them to vector form
• ITSC_CvtImageToArff will do this:
Converting ??????
• Since the labels are in the form of images, it is
necessary to convert them to vector form
• Consequences?
• Do you save them?
• Discussion?
Converting the Label Images
• The labels can be converted to vector form using:
ITSC_CvtImageToArff -i lbl.bin -o lbl.arff -B
• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -B argument tells the program to write the output file in
binary form (default is ASCII)
Labeling the Patterns
• Once the labels are in vector form, they can be
appended to the patterns produced by ITSC_Glrl
• ITSC_LabelPatterns will do this:
Labeling the Patterns
• The labels are assigned to patterns as follows:
ITSC_LabelPatterns -i glrl.arff -c class -l lbl.bin -L lbl.arff -o all.arff -B
• The -i argument specifies the input file name (patterns)
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute in
the pattern set
• The -l argument specifies the name of the label attribute in
the label set
• The -L argument specifies the name of the input label file
• The -B argument tells the program to write the output file in
binary form (default is ASCII)
Eliminating “Don’t Know”
Patterns
• Some of the original pixels were classified differently by
different experts and marked as “don’t know”
• The corresponding patterns can be removed from the training
set using ITSC_Subset:
Eliminating “Don’t Know”
Patterns
• ITSC_Subset is used to remove patterns with unclear class
assignment. The subset is generated based on the value of
the class attribute:
ITSC_Subset -i all.arff -o subset.arff -a class -r 0 1 -B
• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -a argument tells which attribute to test
• The -r argument tells the legal range of the attribute
• The -B argument tells the program to write the output file in
binary form (default is ASCII)
Selecting Random Samples
• Random samples are selected from the original training data
using the same ITSC_Sample program shown in the
previous demo
• The program is used in a slightly different way:
ITSC_Sample -i subset.arff -c class -o s1.arff -n 2000
• The -i argument specifies the input file name
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute
• The -n option tells the program to select an equal number of
random samples (in this case 2000) from each class.
Python Script for Sample
Creation
What modifications here??
Merging Samples / Multiple
Images
• The procedure up to this point has created a random subset
of points from a particular image. Subsets from multiple
images can be combined using ITSC_MergePatterns:
Merging Samples / Multiple
Images
• Multiple pattern sets are merged using the following
command:
ITSC_MergePatterns -c class -o merged.arff -i s1.arff s2.arff
• The -i argument specifies the input file names
• The -o argument specifies the output file name
• The -c argument specifies the name of the class attribute
Python Script for Training
Results of Classifier
Evaluation
• The results of running this procedure using five sample
images of size 500x500 are as follows:
Applying the Classifier to
Images
• Once the classifier is trained, it can be applied to segment
images. One further program is required on the end to
convert the classified patterns back into an image:
Python Function for
Segmentation
Sample Image Results
Expert Labels
Segmentation Result
Remarks
• The procedure illustrated here is one specific example of
ADaM’s capabilities
• There are many other classifiers, texture features and other
tools that could be used for this problem
• Since all of the algorithms of a particular type work in more or
less the same way, the same general procedure could be
used with other tools
• DOWNLOAD the ADaM Toolkit
– http://datamining.itsc.uah.edu/adam/
Management
• What did you learn?
• Provenance elements?
• How to deal with both?
Summary
• Purpose of analysis should drive the type that
is conducted
• Many constraints due to prior management of
the data
• Become proficient in a variety of methods,
tools
• Many considerations around visualization,
similar to analysis, many new modes of viz.
• Management of the products is a significant
task
Reading
• For week 7 – data sources for project
definitions
• Note there is a lot of material to review
• Why – week 8 defines the group projects;
become familiar with the data out there!
Working with someone else's data