UNIVERSITY OF JYVÄSKYLÄ
DEPARTMENT OF MATHEMATICAL INFORMATION TECHNOLOGY
TIES443
Tutorial 1
Prototyping DM Techniques
with WEKA and YALE
Open-Source Software
Mykola Pechenizkiy
Course webpage: http://www.cs.jyu.fi/~mpechen/TIES443
November 7, 2006
Department of Mathematical Information Technology
University of Jyväskylä
TIES443: Introduction to DM
Tutorial 1: Introduction to WEKA and YALE
Contents
• Brief Review of DM Software
– Commercial
– Open-source
• WEKA http://www.cs.waikato.ac.nz/~ml/weka/index.html
• YALE http://rapid-i.com/
• The R Project for Statistical Computing http://www.r-project.org/
• Pentaho – whole BI solutions. http://www.pentaho.com/
– Matlab – Sami will tell you more during the 2nd Tutorial
• WEKA vs. YALE Comparison
– Exploration
– Experimentation
– Visualization
• 1st Assignment
http://www.cs.jyu.fi/~mpechen/TIES443/tutorials/assignment1.pdf
Data Mining Software
• Many providers of commercial DM software
– SAS Enterprise Miner, SPSS Clementine, Statistica Data Miner, MS
SQL Server, Polyanalyst, KnowledgeSTUDIO, …
– IBM Intelligent Miner.
• Universities can now receive free copies of DB2 and Intelligent Miner
for educational or research purposes.
– See http://www.kdnuggets.com/software/suites.html for a list
• Open Source:
– WEKA (Waikato Environment for Knowledge Analysis)
– YALE (Yet Another Learning Environment)
– Many others
• MLC++, Minitab, AlphaMiner, Rattle, KNIME
– The Pentaho BI project:
• “a pioneering initiative by the Open Source development community
to provide organizations with a comprehensive set of BI capabilities
that enable them to radically improve business performance,
efficiency, and effectiveness.”
Data Mining with WEKA
The following slides are from
http://prdownloads.sourceforge.net/weka/weka.ppt
by Eibe Frank
Copyright: Martin Kramer ([email protected])
WEKA: the software
• Machine learning/data mining software written in Java
(distributed under the GNU General Public License)
• Used for research, education, and applications
• Complements “Data Mining” book by Witten & Frank
– http://www.cs.waikato.ac.nz/~ml/weka/book.html
• Main features:
– Comprehensive set of data pre-processing tools, learning
algorithms and evaluation methods
– Graphical user interfaces (incl. data visualization)
– Environment for comparing learning algorithms
WEKA only deals with “flat” files
@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute chest_pain_type { typ_angina, asympt, non_anginal, atyp_angina}
@attribute cholesterol numeric
@attribute exercise_induced_angina { no, yes}
@attribute class { present, not_present}
@data
63,male,typ_angina,233,no,not_present
67,male,asympt,286,yes,present
67,male,asympt,229,yes,present
38,female,non_anginal,?,no,not_present
...
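The ARFF layout shown above (a header of @relation and @attribute lines followed by comma-separated @data rows) can be read with a few lines of code. The following is a minimal Python sketch, not WEKA's own loader. The function name `parse_arff` is hypothetical, and the sketch assumes only the simple subset used on the slide: numeric and nominal attributes, `?` for a missing value, no quoted strings and no sparse rows.

```python
def parse_arff(text):
    """Parse a simple ARFF string into (relation, attributes, rows).

    Hypothetical helper for illustration; covers only the ARFF subset
    shown on the slide.
    """
    relation = None
    attributes = []  # (name, "numeric") or (name, [nominal labels])
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blanks and comments
            continue
        lower = line.lower()
        if in_data:
            values = [v.strip() for v in line.split(",")]
            row = []
            for (name, kind), value in zip(attributes, values):
                if value == "?":            # missing value
                    row.append(None)
                elif kind == "numeric":
                    row.append(float(value))
                else:                       # nominal label, kept as text
                    row.append(value)
            rows.append(row)
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)
            if spec.startswith("{"):
                labels = [s.strip() for s in spec.strip("{}").split(",")]
                attributes.append((name, labels))
            else:
                attributes.append((name, spec.lower()))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


# A shortened version of the slide's example file.
SAMPLE = """@relation heart-disease-simplified
@attribute age numeric
@attribute sex { female, male}
@attribute cholesterol numeric
@attribute class { present, not_present}
@data
63,male,233,not_present
38,female,?,not_present
"""

relation, attributes, rows = parse_arff(SAMPLE)
```

Each row comes back as a list with numbers converted to floats and the missing cholesterol value represented as `None`.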
Command line tutorial
http://weka.sourceforge.net/wekadoc/index.php/en%3APrimer
Explorer: Pre-processing the Data
• Data can be imported from a file in various
formats: ARFF, CSV, C4.5, binary
• Data can also be read from a URL or from an SQL
database (using JDBC)
• Pre-processing tools in WEKA are called “filters”
• WEKA contains filters for:
– Discretization, normalization, resampling, attribute
selection, transforming and combining attributes, …
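To make the idea of a filter concrete, here is a small Python sketch of two of the operations listed above: normalization and discretization. It illustrates the general techniques only and is not WEKA's filter code. The function names `normalize` and `discretize` are hypothetical, and equal-width binning is an assumed choice of discretization method.

```python
def normalize(values):
    """Rescale numeric values linearly to [0, 1]; None marks a missing value."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    span = hi - lo
    return [
        None if v is None else (0.0 if span == 0 else (v - lo) / span)
        for v in values
    ]


def discretize(values, bins):
    """Equal-width binning: map each value to a bin index in 0..bins-1."""
    present = [v for v in values if v is not None]
    lo, hi = min(present), max(present)
    width = (hi - lo) / bins
    result = []
    for v in values:
        if v is None:
            result.append(None)           # missing values stay missing
        elif width == 0:
            result.append(0)              # constant attribute: single bin
        else:
            # The maximum value would land in bin `bins`, so clamp it.
            result.append(min(int((v - lo) / width), bins - 1))
    return result


# Cholesterol column from the ARFF example, including the missing value.
cholesterol = [233, 286, 229, None]
scaled = normalize(cholesterol)
binned = discretize(cholesterol, 3)
```

Both functions take one attribute column and return a new column of the same length, which is the same input-to-output shape a filter has: data in, transformed data out.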
Explorer: building “classifiers”
• Classifiers in WEKA are models for predicting
nominal or numeric quantities
• Implemented learning schemes include:
– Decision trees and lists, instance-based classifiers,
support vector machines, multi-layer perceptrons,
logistic regression, Bayes’ nets, …
• “Meta”-classifiers include:
– Bagging, boosting, stacking, error-correcting output
codes, locally weighted learning, …
Explorer: clustering data
• WEKA contains “clusterers” for finding groups of similar
instances in a dataset
• Implemented schemes are:
– k-Means, EM, Cobweb, X-means, FarthestFirst
• Clusters can be visualized and compared to “true” clusters
(if given)
• Evaluation is based on log-likelihood if the clustering scheme
produces a probability distribution
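As an illustration of the first scheme in the list, the following Python sketch implements plain k-means (Lloyd's algorithm). It is not WEKA's implementation. The function name `k_means` is hypothetical, and the initial centroids are passed in explicitly so that the run is deterministic; a real tool would typically pick them at random.

```python
def k_means(points, centroids, iterations=10):
    """Plain k-means on tuples of numbers.

    `centroids` holds the starting centroids (an assumption of this sketch).
    Returns the cluster index of every point and the final centroids.
    """
    centroids = list(centroids)
    k = len(centroids)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        for i, p in enumerate(points):
            assignment[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return assignment, centroids


# Made-up toy data: two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
assignment, centroids = k_means(points, centroids=[points[0], points[3]])
```

On this toy data the first three points end up in cluster 0 and the last three in cluster 1, which is the kind of grouping that can then be compared to "true" clusters when class labels are available.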
Explorer: finding associations
• WEKA contains an implementation of the Apriori
algorithm for learning association rules
– Works only with discrete data
• Can identify statistical dependencies between
groups of attributes:
– milk, butter → bread, eggs (with confidence 0.9 and
support 2000)
• Apriori can compute all rules that have a given
minimum support and exceed a given confidence
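The two measures attached to a rule can be computed directly from a list of transactions. The Python sketch below shows only those measures, not the Apriori search for frequent item sets. The function names and the five baskets are made up for illustration, and support is counted as a number of transactions, matching the "support 2000" style of the example above.

```python
def support_count(transactions, items):
    """Number of transactions that contain every item in `items`."""
    items = set(items)
    return sum(1 for t in transactions if items <= set(t))


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Fraction of the transactions containing the antecedent that also
    contain the consequent.
    """
    both = support_count(transactions, set(antecedent) | set(consequent))
    base = support_count(transactions, antecedent)
    return both / base if base else 0.0


# Made-up toy baskets.
baskets = [
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread", "eggs"},
    {"milk", "butter", "bread"},
    {"milk", "eggs"},
    {"bread"},
]

rule_support = support_count(baskets, {"milk", "butter", "bread", "eggs"})
rule_confidence = confidence(baskets, {"milk", "butter"}, {"bread", "eggs"})
```

Here the rule milk, butter → bread, eggs holds in 2 baskets out of the 3 that contain milk and butter, so its support is 2 and its confidence is 2/3.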
Explorer: attribute selection
• Panel that can be used to investigate which
(subsets of) attributes are the most predictive ones
• Attribute selection methods contain two parts:
– A search method: best-first, forward selection, random,
exhaustive, genetic algorithm, ranking
– An evaluation method: correlation-based, wrapper,
information gain, chi-squared, …
• Very flexible: WEKA allows (almost) arbitrary
combinations of these two
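One combination from the lists above is the "ranking" search method paired with the information gain evaluator: score every attribute on its own, then sort. The Python sketch below illustrates that idea for nominal attributes. It is not WEKA's code, the function names are hypothetical, and the four-row data set is made up.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in class entropy obtained by splitting on one attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [l for x, l in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


def rank_attributes(columns, labels):
    """Evaluate each attribute separately and return them best first."""
    scores = {name: information_gain(col, labels) for name, col in columns.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Made-up toy data using attribute names from the ARFF example.
columns = {
    "sex": ["male", "male", "female", "female"],
    "exercise_induced_angina": ["no", "yes", "yes", "no"],
}
labels = ["not_present", "present", "present", "not_present"]
ranking = rank_attributes(columns, labels)
```

In this toy data exercise_induced_angina separates the classes perfectly (gain of 1 bit) while sex carries no information (gain of 0), so the ranking puts exercise_induced_angina first.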
Explorer: Data Visualization
• Visualization very useful in practice: e.g. helps to
determine difficulty of the learning problem
• WEKA can visualize single attributes (1-d) and pairs of
attributes (2-d)
– To do: rotating 3-d visualizations (Xgobi-style)
• Color-coded class values
• “Jitter” option to deal with nominal attributes (and to
detect “hidden” data points)
• “Zoom-in” function
Performing Experiments
• Experimenter makes it easy to compare the performance
of different learning schemes
• For classification and regression problems
• Results can be written into file or database
• Evaluation options: cross-validation, learning curve, hold-out
• Can also iterate over different parameter settings
• Significance-testing built in!
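The cross-validation option can be sketched in a few lines of Python. This is a simplified illustration, not the Experimenter's procedure: it neither shuffles nor stratifies the data, and it performs no significance test. The function names are hypothetical, the data is made up, and the learner is a trivial majority-class baseline supplied as a pair of fit/predict functions.

```python
from collections import Counter


def k_fold_indices(n, k):
    """Yield (train, test) index lists for k-fold cross-validation."""
    indices = list(range(n))
    for fold in range(k):
        test = indices[fold::k]          # every k-th instance, offset by fold
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


def cross_validate(data, labels, k, fit, predict):
    """Return the accuracy on each fold for a learner given as fit/predict."""
    scores = []
    for train, test in k_fold_indices(len(data), k):
        model = fit([data[i] for i in train], [labels[i] for i in train])
        hits = sum(predict(model, data[i]) == labels[i] for i in test)
        scores.append(hits / len(test))
    return scores


def fit_majority(rows, ys):
    """Baseline learner: remember the most frequent class in the training set."""
    return Counter(ys).most_common(1)[0][0]


def predict_majority(model, row):
    """Baseline prediction: always answer with the remembered class."""
    return model


# Made-up toy data: nine instances, six "yes" and three "no".
data = list(range(9))
labels = ["yes"] * 6 + ["no"] * 3
scores = cross_validate(data, labels, 3, fit_majority, predict_majority)
```

Any other learning scheme can be compared against this baseline by passing a different fit/predict pair and looking at the per-fold scores side by side.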
The Knowledge Flow GUI
• New graphical user interface for WEKA
• Java-Beans-based interface for setting up and running
machine learning experiments
• Data sources, classifiers, etc. are beans and can be
connected graphically
• Data “flows” through components: e.g.,
“data source” -> “filter” -> “classifier” -> “evaluator”
• Layouts can be saved and loaded again later
Conclusion: try it yourself!
• WEKA is available at
http://www.cs.waikato.ac.nz/ml/weka
– The site also has a list of projects based on WEKA
• YALE has different interfaces and underlying ideas, but it also
integrates all available DM techniques from WEKA
The following slides are compiled from screenshots and
related descriptions available from YALE pages http://rapid-i.com/
YALE – Yet Another Learning Environment
Developed at the Artificial Intelligence Unit of the
University of Dortmund.
Features of YALE
• freely available open-source knowledge discovery
environment
• 100% pure Java (runs on every major platform and
operating system)
• KD processes are modeled as simple operator trees, which
is both intuitive and powerful
• operator trees or subtrees can be saved as building blocks
for later re-use
• internal XML representation ensures standardized
interchange format of data mining experiments
• simple scripting language allowing for automatic large-scale experiments
• multi-layered data view concept ensures efficient and
transparent data handling
Features of YALE
• Flexibility in using YALE:
– graphical user interface (GUI) for interactive prototyping
– command line mode (batch mode) for automated large-scale applications
– Java API to ease usage of YALE from your own programs
• simple plugin and extension mechanisms; some plugins already exist
and you can easily add your own
• powerful plotting facility offering a large set of sophisticated high-dimensional visualization techniques for data and models
• more than 350 machine learning, evaluation, in- and output, pre- and
post-processing, and visualization operators plus numerous meta
optimization schemes
• machine learning library WEKA fully integrated
• YALE’s potential applications include text mining, multimedia
mining, feature engineering, data stream mining and tracking
drifting concepts, development of ensemble methods, and
distributed data mining.
Experiment Setup
The initial operator tree,
which consists only of a
root node.
The "Tree View" tab is the most
often used editor for YALE
experiments.
Left: the current operator tree.
Right: a table with the parameters
of the currently selected operator.
The lower part of the
YALE main frame
displays log and
error messages.
After the learning operator "J48", a
breakpoint indicates that the intermediate
results can be inspected. Due to the
modular concept of YALE, it is always
possible to inspect and save
intermediate results, e.g. the results for
each individual run in a cross-validation.
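To make "the results for each individual run" concrete, here is a minimal Python sketch of a k-fold cross-validation that keeps the per-run scores instead of only their average. It is not YALE code: the function names and the toy majority-class learner are assumptions made for illustration.

```python
from collections import Counter


def k_fold_indices(n_examples, k):
    """Split range(n_examples) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_examples // k + (1 if i < n_examples % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(examples, labels, k, train, evaluate):
    """Run k-fold cross-validation; return (per-run scores, mean score)."""
    per_run = []
    for test_idx in k_fold_indices(len(examples), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(examples)) if i not in held_out]
        model = train([examples[i] for i in train_idx],
                      [labels[i] for i in train_idx])
        score = evaluate(model,
                         [examples[i] for i in test_idx],
                         [labels[i] for i in test_idx])
        per_run.append(score)  # the intermediate result of this run
    return per_run, sum(per_run) / len(per_run)


# Toy learner (hypothetical): always predict the most frequent training label.
def train_majority(examples, labels):
    return Counter(labels).most_common(1)[0][0]


def accuracy(model, examples, labels):
    return sum(1 for y in labels if y == model) / len(labels)


data = list(range(10))
classes = ["a"] * 8 + ["b"] * 2
runs, mean = cross_validate(data, classes, 5, train_majority, accuracy)
print(runs, mean)
```

Inspecting the list of per-run scores shows which fold pulled the average down, which is the kind of information a breakpoint inside the validation loop exposes.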
New operators can be added to the
experiment in two ways:
• directly from the context
menu of the parent operator;
• from the new operator dialog
shown in this screenshot.
Several search constraints are available,
and a short description of each
operator is shown.
The operator trees are coded and
represented by a simple XML format.
The XML editor tab allows for fast and
direct manipulations of the current
experiment.
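As a rough illustration of how a nested operator tree maps onto XML, the Python sketch below builds a small tree and serializes it. The element names, attribute names, operator classes and parameter keys are assumptions chosen for the example; they are not a verified copy of YALE's file format.

```python
import xml.etree.ElementTree as ET


def build_operator(name, op_class, params=None, children=()):
    """Create one operator node with its parameters and nested operators."""
    node = ET.Element("operator", {"name": name, "class": op_class})
    for key, value in (params or {}).items():
        ET.SubElement(node, "parameter", {"key": key, "value": str(value)})
    node.extend(children)
    return node


def nesting_depth(node):
    """Depth of the operator tree, counting the given node as level 1."""
    inner = node.findall("operator")
    return 1 + max((nesting_depth(child) for child in inner), default=0)


# Hypothetical experiment: load data, then cross-validate a J48 learner.
root = build_operator("Root", "Experiment", children=[
    build_operator("Input", "ExampleSource", {"attributes": "data.att"}),
    build_operator("XVal", "XValidation", {"number_of_validations": 10},
                   children=[
                       build_operator("Learner", "J48"),
                       build_operator("Applier", "ModelApplier"),
                   ]),
])

xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)
```

Because the whole experiment is one XML document, it can be edited as text, stored under version control, or generated by a script.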
All views can
also be printed
and exported to a
wide range of
graphic formats
including jpg, png,
ps and pdf.
The "Box View" is
another viewer for
YALE experiments.
• The box format is
an intuitive way of
representing the
nesting of the
operators.
• Editing is not
possible in this view.
The "Monitor" tab provides an
overview of the currently used
memory and is an important
tool for large-scale data
mining tasks on huge data sets.
The amount of memory used
during an experiment run can
even be logged in the same way
as all other logging
values.
Data can be imported from several file
formats with the Attribute Editor. Other
file formats such as ARFF, C4.5, CSV, and
dBase can be loaded with specialized
operators.
The Attribute Editor can be used to create
meta data descriptions from almost
arbitrary file formats. These meta data
descriptions can then be used by an input
operator, which actually loads the data.
Additional attributes (features) can
easily be constructed from your data.
YALE provides several approaches to
construct the best feature space
automatically. These approaches range
from feature space transformations like
PCA, GHA, ICA or the kernel versions
to standard feature selection techniques
to several evolutionary approaches for
feature
construction
and extraction.
Help features to ease the
learning phase for new
users:
• An online tutorial,
• tool tip texts,
• a beginner and expert
mode, operator info
screens,
• a GUI manual,
• and the YALE tutorial.
Data Visualization
Each time a data set is
presented in the results tab (e.g.
after loading it), several views
appear: a meta data view
describing all attributes, a data
view showing the actual data
and a plot view providing a
large set of (high-dimensional)
plotters for the data set at hand.
The basic scatter plotter: two
of the attributes are used as
axes, the class label attribute is
used for colorization. The
legend at the top maps the
colors used to the classes or, in
case of a real-valued color plot
column, to the corresponding
real values.
The standard
scatter plotter even
allows jittering,
zooming, and
displaying example
ids. Double-clicking a data
point opens a
visualizer. The
standard example
visualizer is
presented here.
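Jittering helps when many examples share the same coordinates and hide each other in the plot. A minimal Python sketch of the idea follows; the function name and the choice of uniform noise scaled by the value range are assumptions, not YALE's actual implementation.

```python
import random


def jitter(values, amount, rng):
    """Shift each value by uniform noise of at most amount * value range."""
    span = (max(values) - min(values)) or 1.0  # avoid a zero range
    return [v + rng.uniform(-amount, amount) * span for v in values]


rng = random.Random(0)  # fixed seed so the plot is reproducible
xs = [1, 1, 1, 2, 2, 3]
jittered = jitter(xs, 0.05, rng)
print(jittered)
```

The noise only affects the drawing; the underlying data set stays unchanged.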
2D scatter plots can be combined
into a scatter plot matrix, in which
an ordinary scatter plot is drawn for
every pair of dimensions. This
plotter is only available for fewer
than 10 dimensions. For a higher
number of dimensions, one of the
other high-dimensional data plotters
presented below should be used.
A 3D scatter plot exists similar to
the colorized 2D scatter plot
discussed above. The viewport can
be rotated and zoomed to fit your
needs. The built-in 2D and 3D
plotters are a quick and easy way
to view your numerical and
nominal results, even as an online
plot at experiment runtime!
The SOM (Self-Organizing Map)
plotter uses a Kohonen net
for dimensionality reduction.
Plotting of the U-, the P-, and the
U*-Matrix is supported with
different color schemes. The data
points can be colorized by one of
the data columns, e.g. with the
prediction label.
The SOM plotter again. Here a
gray-scale color scheme was
used to plot the U-Matrix.
The parallel plotter prints
the axes of all dimensions
parallel to each other. This
is the natural visualization
technique for series data but
can also be useful for other
types of data. The main
advantage of parallel plots
is that a very high number
of dimensions can be
visualized with this
technique. The dimensions
are colorized with the
feature weights. The more
yellow a dimension is
marked, the more important
this column is.
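The geometry behind a parallel plot can be sketched in a few lines of Python: every dimension is rescaled to its own axis, and each example becomes a polyline of (axis index, height) pairs. The function name and the min-max scaling are assumptions made for this illustration.

```python
def parallel_coordinates(rows):
    """Turn each row into a polyline of (axis index, height in [0, 1])."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    highs = [max(col) for col in columns]
    polylines = []
    for row in rows:
        line = []
        for axis, value in enumerate(row):
            width = highs[axis] - lows[axis]
            # A constant column has no spread; draw it at mid-height.
            height = (value - lows[axis]) / width if width else 0.5
            line.append((axis, height))
        polylines.append(line)
    return polylines


rows = [(1.0, 10.0, 100.0), (2.0, 30.0, 100.0), (3.0, 20.0, 100.0)]
for line in parallel_coordinates(rows):
    print(line)
```

Since each extra dimension only adds one more vertical axis, the number of dimensions is limited mainly by the screen width.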
Quartile plots (also
known as box
plots) are often
used for experiment
results like
performance values
but it is possible to
summarize the
statistical properties
of data columns in
general with this
type of plot.
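The numbers a quartile plot draws can be computed with the Python standard library. The sketch below returns the five-number summary (minimum, lower quartile, median, upper quartile, maximum); the accuracy values are made up for the example, and the "inclusive" quantile method is one of several common conventions.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, median, q3, max(values))


# Hypothetical accuracies (in percent) from five experiment runs.
accuracies = [81, 79, 85, 80, 83]
print(five_number_summary(accuracies))
```

The box spans Q1 to Q3, the line inside the box marks the median, and the whiskers reach towards the minimum and maximum.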
Histogram plots
(also known as
distribution plots)
RadViz is another high-dimensional data plotter
where the data columns are
placed as radial dimension
anchors. Each data point is
connected to each anchor
with a spring corresponding
to the feature values. This
will lead to a fixed position in
the two-dimensional plane.
Again, weights are used to
mark the more important
columns.
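The spring picture has a simple closed form: with the anchors spread evenly on the unit circle, a point comes to rest at the average of the anchor positions weighted by its feature values. A minimal Python sketch, assuming the values are already scaled to [0, 1] (the function name is hypothetical):

```python
import math


def radviz_position(values):
    """Equilibrium (x, y) of a point pulled towards d anchors on a circle.

    Anchor j sits at angle 2*pi*j/d; the spring towards it has stiffness
    values[j], so the rest position is the weighted mean of the anchors.
    """
    d = len(values)
    total = sum(values)
    if total == 0:
        return (0.0, 0.0)  # no spring pulls: stay at the centre
    x = sum(v * math.cos(2 * math.pi * j / d) for j, v in enumerate(values))
    y = sum(v * math.sin(2 * math.pi * j / d) for j, v in enumerate(values))
    return (x / total, y / total)


print(radviz_position([1.0, 0.0, 0.0, 0.0]))  # sits on the first anchor
print(radviz_position([1.0, 1.0, 1.0, 1.0]))  # balanced: close to the centre
```

Points with one dominant feature are drawn near that feature's anchor, while points with balanced features collect near the centre.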
A survey plot is a sort of
vertical histogram matrix also
suitable for a large number of
dimensions. Each line
corresponds to one data point
and can be colorized by one
of the columns. The length of
each section corresponds to
the value of the data point for
that dimension. For up to
three dimensions the order of
the histograms can be
selected.
Andrews curves are another
way of visualizing high-dimensional data. Each data
point is projected onto a set
of orthogonal trigonometric
functions and displayed as a
curve. It is known that
Andrews curves preserve
distances, so they have many
uses for data analysis and
exploration. Often outliers
and hidden patterns can be
well detected in these plots.
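In the usual definition, a data point x = (x1, x2, x3, ...) is drawn as the curve f(t) = x1/sqrt(2) + x2 sin(t) + x3 cos(t) + x4 sin(2t) + x5 cos(2t) + ... for t between -pi and pi. The Python sketch below evaluates such a curve and checks the distance property numerically; the function names are assumptions, and this is not YALE's plotting code.

```python
import math


def andrews_curve(point, t):
    """Value at t of the Andrews curve of one data point."""
    value = point[0] / math.sqrt(2)
    for i, x in enumerate(point[1:], start=1):
        harmonic = (i + 1) // 2          # 1, 1, 2, 2, 3, 3, ...
        if i % 2 == 1:
            value += x * math.sin(harmonic * t)
        else:
            value += x * math.cos(harmonic * t)
    return value


def curve_distance_sq(a, b, samples=360):
    """Integral over [-pi, pi] of the squared difference of two curves."""
    step = 2 * math.pi / samples
    return sum(
        (andrews_curve(a, -math.pi + k * step)
         - andrews_curve(b, -math.pi + k * step)) ** 2
        for k in range(samples)
    ) * step


a = [1.0, 2.0, 0.5]
b = [0.0, 1.0, 1.5]
euclid_sq = sum((p - q) ** 2 for p, q in zip(a, b))
print(curve_distance_sq(a, b), math.pi * euclid_sq)
```

The two printed numbers agree: the squared distance between two curves equals pi times the squared Euclidean distance between the points, so points that are close together give curves that stay close together.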
Visualization of Models and other Results
The result of a learning step
is called a model. Some models
provide a graphical
representation of the
learned hypothesis. This
screenshot presents a learned
decision tree for the widely
known "labor negotiations"
data set from the UCI
repository. Results like
learned models, performance
values, data sets or selected
attributes are displayed when
the experiment is completed
or a breakpoint is reached.
In cases where no graphical
representation of a learned
model is available, at least a
textual description of the
learned model is presented. In
this screenshot you see a
Stacking model consisting of
a rule model (the upper half)
and a neural network (starts
at the lower half). Both base
models are described by
simple and understandable
texts.
This is a density plot (similar
to a contour plot) of the
decision function of a
Support Vector Machine
(SVM). Almost all SVM
implementations in YALE
provide a table and a plot
view of the learned model. In
this screenshot, red points
refer to support vectors, blue
points to normal training
examples. Bluish regions will
be predicted negative, reddish
regions will be predicted
positive.
Here only the support vectors are
shown, colorized by the
predicted function value for
the corresponding data point.
Examples on the red side will
be predicted positive;
examples on the blue side
will be predicted negative.
There is a perfectly linear
separation in two of the
dimensions, and it seems
that the parameters were not
chosen optimally, since the
number of support vectors is
rather high.
Here the alpha values (Lagrange
multipliers) of the SVM are
plotted against the function
values and colorized with the
true label. We applied a slight
jittering to make more points
visible. This model seems to
be "well-learned", since only
a few points have an alpha value
not equal to zero and these
are the points with function
values approximately 0.
This surface plot presents
the result of a meta
optimization experiment:
the parameters of one of
the operators are
optimized. The plot can be
rotated and zoomed.
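A surface like this is what a plain grid search produces: one performance value for every combination of two parameter values. The Python sketch below shows the idea. The parameter names C and gamma and the toy performance function are assumptions made for illustration; in a real experiment the function would run the inner operators and return, for example, the cross-validated accuracy.

```python
from itertools import product


def grid_search(evaluate, grid):
    """Evaluate every parameter combination; return best and full surface."""
    names = sorted(grid)
    surface = []
    for combo in product(*(grid[name] for name in names)):
        params = dict(zip(names, combo))
        surface.append((params, evaluate(**params)))
    best_params, best_score = max(surface, key=lambda item: item[1])
    return best_params, best_score, surface


# Hypothetical stand-in for "run the experiment and measure performance".
def toy_performance(C, gamma):
    return -((C - 10) ** 2) / 100 - (gamma - 0.1) ** 2


grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]}
best_params, best_score, surface = grid_search(toy_performance, grid)
print(best_params, best_score)
```

Plotting the scores in the surface list over the two parameter axes gives the kind of surface shown on the slide.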
WEKA & YALE Comparison
• You tell me in your report
• Now let's go through the first assignment
• 1st Assignment
http://www.cs.jyu.fi/~mpechen/TIES443/tutorials/assignment1.pdf
• My advice is to come back to this assignment and to the
WEKA and YALE tools after each forthcoming lecture to
see how things are implemented and can be used in
practice.