Safarii Walkthrough


This slide show presents in detail the different features
of the Multi-Relational Data Mining package Safarii.
Walkthrough version 2.0. Copyright © Kiminkii, 2007. Parts of
this presentation may be reused as long as clear reference is
made to Safarii and http://www.kiminkii.com/safarii.html.
Key Features of Safarii
• Multi-Relational Data Mining
• Mine relational data directly, no unnatural flattening
• User-friendly graphical user interface
• Versatile, allows many analytical settings (predictive, exploratory, associative)
• Platform and RDBMS-independent
• Mining process done ‘in-database’
• Scalable
Contents
• Tell me about Multi-Relational Data Mining first...
• Go straight to the Safarii Walkthrough...
– Subgroup Discovery
– Filtering & Pattern Teams
– Building Classifiers from Collections of Patterns
– Decision Lists
– Pre-processing with ProSafarii
• Frequently Asked Questions
• Contact information
Multi-Relational Data Mining
Safarii is the only available commercial
implementation of a new data analysis
paradigm called Multi-Relational Data Mining
(MRDM). It allows the efficient and
automated analysis of structured data stored
in relational databases
Multi-Relational Data Mining
Mainstream Data Mining packages require the data to fit into a single table, where
rows describe cases, and columns describe the different features. But how would
you deal with more complex domains such as web-logs, social networks, extensive
customer descriptions, molecules etc., if such a restriction exists?
[Figure: example molecules, including histamine, each labelled active or inactive]
The solution from relational database theory is to use multiple tables. MRDM
generalises traditional Data Mining by working with cases spread over multiple
tables. In other words, cases are structured, and consist of multiple related parts
(such as atoms within a molecule, or contracts within a customer description)
Structured Data
Structured data consists of parts that appear as records in different tables. MRDM
creates patterns based on the properties of parts, but also on the existence of
particular parts and the relations between parts.
[Figure: a structured case made up of related parts: owners, accounts, credit cards and transactions]
Structured Patterns
MRDM builds models based on patterns that are structured, so-called Selection
Graphs. These graphs capture structural, as well as nominal and numeric
properties of structured cases. Multiple patterns are combined into predictive
models
[Figure: the data, and a pattern that occurs in some of the cases]
‘all accounts that have at least one
transaction for which the resulting
balance does not exceed 303.0’
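To make the semantics of the quoted pattern concrete, here is a minimal sketch in Python. It is not Safarii code: the account names, the data layout and the helper `covers` are hypothetical and serve only as illustration. The point is that the pattern is existential: a case is covered as soon as one of its parts satisfies the condition.

```python
# Hypothetical structured cases: each account owns a list of transactions.
accounts = {
    "account x": [{"balance": 950.0}, {"balance": 120.5}],
    "account y": [{"balance": 800.0}, {"balance": 410.0}],
    "account z": [{"balance": 303.0}],
    "account w": [],  # an account without transactions
}

def covers(transactions, max_balance=303.0):
    """True if at least one transaction has a resulting balance <= max_balance."""
    return any(t["balance"] <= max_balance for t in transactions)

# The subgroup selected by the pattern.
covered = sorted(name for name, txs in accounts.items() if covers(txs))
print(covered)  # ['account x', 'account z']
```

An account without any transactions is not covered, because the pattern requires the existence of a matching part.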
Safarii Methodology
The different tools and techniques in Safarii
are organised in the Safarii methodology.
The following slides help to explain the
structure of this methodology, and hence
how to work with Safarii
Safarii Methodology
[Diagram of the Safarii methodology, with components: Preprocessing (ProSafarii), Interesting Patterns, Pattern Team, Closed Pattern Team, Classifier, Classifier (Decision List)]
The Safarii methodology consists of two main streams:
building classifiers based on multi-relational structure
directly, and finding interesting multi-relational patterns
and potentially building classifiers from those
Build Classifiers Directly: Safarii lets you analyse (possibly
pre-processed) multi-relational data by inducing classifiers
that capture predictive structural features of the data.
Find Interesting Patterns: the second stream is based on the
discovery of interesting (predictive) patterns or rules. In this mode,
you will find more regularities compared to building classifiers
directly, because alternative or embedded structural features are
not overlooked. This enables a more explorative survey of the
dependencies in the database. The important predictive patterns
can be combined to form classifiers in a number of ways.
As an option, an accompanying tool, called ProSafarii, helps you
pre-process the data in a number of multi-relational-sensitive
ways. Optimised datasets can be mined directly using Safarii
Subgroup Discovery
Find interesting subgroups within the
database, identified by Selection Graphs,
that show significant deviation from the
whole database
Subgroup Discovery
Subgroup Discovery finds collections of subgroups within the database that show
significant deviation from the whole database. Several parameters can be set, in
order to define subgroups of interest
• Find subgroups of molecules where mutagenicity is common
• 188 molecules appear, of which 66.5% are mutagenic
• Search deep for 1 minute, and find subgroups of at least 47 molecules (= 25%)
• Judge and report subgroups on the basis of a range of interestingness measures (Novelty balances accuracy and coverage)
Subgroup Discovery: search for Patterns
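[Contingency table of the current pattern against the target, as fractions of the database]
       T     F
T     .42   .13
F     .12   .33
(marginals shown: .54 and .55; total 1.0)
novelty(ST) = p(ST) − p(S)p(T)
= .42 − .297 = .123
(novelty lies between −.25 and .25; 0 means uninteresting)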
A user-specified interestingness measure is used to guide the search for
predictive patterns. In this example Novelty is used. Note the high
numbers along the diagonal of the contingency table, indicating a
positive dependency between the current pattern and the target
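The Novelty computation can be checked with a few lines of Python. This is only a sketch of the formula novelty(ST) = p(ST) − p(S)p(T) using the fractions from the contingency table; the function name and the way the table is keyed are assumptions, and which axis holds the pattern and which the target does not change the result.

```python
def novelty(p_st, p_s, p_t):
    """Novelty of subgroup S with respect to target T: p(ST) - p(S)p(T)."""
    return p_st - p_s * p_t

# Cell fractions from the contingency table, keyed by (row, column).
# Assigning the pattern to rows and the target to columns is an assumption;
# swapping them gives the same novelty.
table = {
    (True, True): 0.42, (True, False): 0.13,
    (False, True): 0.12, (False, False): 0.33,
}

p_st = table[(True, True)]
p_row = table[(True, True)] + table[(True, False)]  # marginal .55
p_col = table[(True, True)] + table[(False, True)]  # marginal .54

score = novelty(p_st, p_row, p_col)
print(round(score, 3))  # 0.123

# The extremes quoted on the slide: perfect positive and negative dependency
# for a subgroup and a target that each cover half of the database.
upper = novelty(0.5, 0.5, 0.5)  # 0.25
lower = novelty(0.0, 0.5, 0.5)  # -0.25
```

A pattern that is statistically independent of the target has p(ST) = p(S)p(T), which gives a novelty of 0.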
Subgroup Discovery
All subgroups that satisfy the search conditions are
reported, and details of the subgroups can be inspected
Turn Selection Graphs into SQL
Each subgroup (Selection Graph) corresponds to a SQL statement that can be saved
for future deployment, or turned into a database view, such that the interesting
subgroup is always virtually present, even after updates of the original data.
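As an illustration of the view idea, the sketch below uses Python's built-in sqlite3 module. The schema (`account`, `trans`), the view name `low_balance_accounts` and the SQL text are assumptions made for this example; they are not the statements Safarii generates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY);
    CREATE TABLE trans (
        id INTEGER PRIMARY KEY,
        account_id INTEGER REFERENCES account(id),
        balance REAL
    );
    INSERT INTO account (id) VALUES (1), (2);
    INSERT INTO trans (account_id, balance) VALUES (1, 120.5), (2, 800.0);

    -- Hypothetical SQL for the subgroup 'accounts with at least one
    -- transaction whose resulting balance does not exceed 303.0'.
    CREATE VIEW low_balance_accounts AS
        SELECT DISTINCT a.id
        FROM account AS a
        JOIN trans AS t ON t.account_id = a.id
        WHERE t.balance <= 303.0;
""")

def members():
    """Current members of the subgroup, read through the view."""
    query = "SELECT id FROM low_balance_accounts ORDER BY id"
    return [row[0] for row in conn.execute(query)]

before = members()  # [1]
# Update the original data: account 2 now also has a low-balance transaction.
conn.execute("INSERT INTO trans (account_id, balance) VALUES (2, 50.0)")
after = members()   # [1, 2]
print(before, after)
```

Because the view is evaluated at query time, the subgroup follows the data without being recomputed by hand.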
ROC Space Analysis
Plot the set of discovered patterns in ROC
space, and find optimal patterns. Safarii will
report patterns that lie on the convex hull
ROC Space Analysis
Each dot represents a single subgroup.
Interesting subgroups appear in the top
left corner. Uninteresting subgroups
appear on the diagonal. Dots on the
convex hull represent subgroups that
perform better than those under the
hull
This line represents the minimum
support threshold: all subgroups of
more than 30 molecules appear above it
ROC Space Analysis
Safarii will list the subgroups appearing
on the convex hull
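How the convex hull in ROC space can be found is sketched below. This is a standard upper-hull scan (monotone chain), shown as an assumption about one reasonable way to do it, not as Safarii's implementation. The subgroup names and their (false positive rate, true positive rate) coordinates are made up.

```python
def cross(o, a, b):
    """Cross product of vectors o->a and o->b (negative means a right turn)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def roc_convex_hull(points):
    """Upper convex hull of ROC points, from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop earlier points that would end up on or under the new hull edge.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull

# Hypothetical subgroups as (false positive rate, true positive rate).
subgroups = {
    "A": (0.1, 0.6), "B": (0.3, 0.7), "C": (0.2, 0.5),
    "D": (0.4, 0.9), "E": (0.5, 0.5), "F": (0.7, 0.97),
}

hull = roc_convex_hull(list(subgroups.values()))
on_hull = sorted(name for name, point in subgroups.items() if point in hull)
print(on_hull)  # ['A', 'D', 'F']
```

Subgroup E lies exactly on the diagonal and subgroups B and C lie under the hull, so none of them is reported.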
Creating Pattern Teams
Filter the initial set of discovered patterns
to obtain a small team of patterns that are
interesting as well as non-redundant
Creating Pattern Teams
Subgroup Discovery will typically produce an
abundance of interesting subgroups. Within this collection of
subgroups there will be a lot of overlap and redundancy. Apply
filtering to obtain Pattern Teams: small collections of subgroups that
are predictive and each add unique expertise to the team. Multiple
mechanisms allow for teams with different properties:
• Joint Entropy maximises the independence of subgroups
• Exclusive Coverage selects subgroups that are mutually exclusive, but cover a lot of the database
• DTM Purity selects those subgroups that together lead to the best performing Decision Table Majority classifier
• SVM Purity works in the same way, but uses a Support Vector Machine
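The Joint Entropy criterion can be illustrated with a small sketch. Each pattern is treated as a binary feature over the cases, and the team of size k with the highest joint entropy is selected. The exhaustive search over all teams, the pattern names and the data are assumptions for illustration only; how Safarii searches for teams is not described here.

```python
from collections import Counter
from itertools import combinations
from math import log2

def joint_entropy(columns):
    """Entropy (in bits) of the joint distribution of several binary features."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum((c / n) * log2(c / n) for c in Counter(rows).values())

def best_team(patterns, k):
    """Exhaustively pick the k patterns with maximal joint entropy."""
    return max(
        combinations(sorted(patterns), k),
        key=lambda team: joint_entropy([patterns[name] for name in team]),
    )

# Hypothetical patterns: one 0/1 entry per case (1 = case is covered).
patterns = {
    "p1": [1, 1, 1, 1, 0, 0, 0, 0],
    "p2": [1, 1, 1, 1, 0, 0, 0, 1],  # nearly a copy of p1, hence redundant
    "p3": [1, 1, 0, 0, 1, 1, 0, 0],
    "p4": [1, 0, 1, 0, 1, 0, 1, 0],
}

team = best_team(patterns, 3)
team_entropy = joint_entropy([patterns[name] for name in team])
print(team, team_entropy)  # ('p1', 'p3', 'p4') 3.0
```

The redundant pattern p2 is left out: together with p1 it adds almost no new information, whereas p1, p3 and p4 split the eight cases into eight distinct groups.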
Pattern Team
Four patterns that appear
often, are correlated with
the target (positively or
negatively), and are
relatively uncorrelated
with each other
Pattern Team
[Figure: two plots of subgroups in a 2-dimensional domain]
The diagram on the left shows 82 subgroups
discovered in a 2-dimensional domain. On the right
a Pattern Team of 4 subgroups captures most of the
discriminatory power.
Bayesian Network of Patterns
A Bayesian Network
captures the relationships
between patterns. Similar
patterns are connected.
The Pattern Team (blue
nodes) can be indicated
in this network. The
highlighted patterns tend
to end up in separate
clusters, as by definition,
there is little redundancy
in Pattern Teams.
Building Classifiers from
Collections of Patterns
Turn the set of relevant binary features
formed by the discovered patterns into a
predictive model
Building Classifiers from Patterns
Two classifiers are available for turning patterns into predictive models:
• Decision Table Majority classifier: works on Pattern Teams only
• Support Vector Machine: works on both Pattern Teams and the whole collection of patterns
Pattern Team Decision Table
Four patterns lead to
potentially 16 groups of
molecules. A decision
table shows how many
molecules appear for
each combination
Decision Table Majority classifier
A Decision Table Majority
classifier uses this decision
table to classify new cases. Any
Pattern Team can be used to
build a classifier, regardless of
the filtering mechanism used.
Results on the test-set can be
compared to the actual class
Decision Table Majority classifier
A Decision Table Majority table
can be stored in a database as
a separate table. Simply
classify future cases or data
sets by joining over the first k
columns
Alternatively, the DTM classifier
can be applied to the currently
defined test set, with results
being stored in the database
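A Decision Table Majority classifier is simple enough to sketch in a few lines. The version below is an illustration under assumptions: two patterns instead of four, made-up training data, and a fall-back to the overall majority class for combinations that were never seen during training. None of the names are taken from Safarii.

```python
from collections import Counter, defaultdict

def build_decision_table(feature_rows, labels):
    """Map each combination of pattern values to its majority class."""
    counts = defaultdict(Counter)
    for row, label in zip(feature_rows, labels):
        counts[tuple(row)][label] += 1
    table = {key: c.most_common(1)[0][0] for key, c in counts.items()}
    default = Counter(labels).most_common(1)[0][0]  # overall majority class
    return table, default

def classify(table, default, row):
    """Look up the combination; unseen combinations get the default class."""
    return table.get(tuple(row), default)

# Hypothetical training data: one row of pattern values (0/1) per molecule.
train_rows = [(1, 0), (1, 0), (1, 0), (1, 0), (0, 1), (0, 1), (1, 1)]
train_labels = ["active", "active", "active", "inactive",
                "inactive", "inactive", "active"]

table, default = build_decision_table(train_rows, train_labels)
print(classify(table, default, (1, 0)))  # active
print(classify(table, default, (0, 1)))  # inactive
print(classify(table, default, (0, 0)))  # active (unseen, falls back to default)
```

The dictionary plays the role of the stored decision table: the key corresponds to the first k columns, and the lookup corresponds to the join.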
Support Vector Machine
Support Vector Machines find linear hyperplanes in the
space of binary features resulting from the discovered
patterns. The hyperplane attempts to separate the
positive and negative cases as well as possible. The
hyperplane is determined by assigning weights to
patterns. The weights indicate the influence of individual
patterns on the overall model.
Propositionalisation
Use the interesting patterns discovered, to
create a single table where the original
cases are described in terms of binary
features, each corresponding to a pattern: is
a case covered or not?
Propositionalisation
id  f1  f2  f3
1   1   0   0
2   1   1   0
3   0   0   0
With a simple press of a button, a list of
discovered subgroups can be stored as a
table of binary data. Join this table with the
original target table to obtain a rich
propositional table that captures structural
information
Any propositional Data Mining tool, as well
as Safarii, can now be used to mine the
flattened information
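The step from patterns to a binary table can be sketched as follows. The molecules, the three patterns and their meaning are hypothetical; they are chosen so that the result resembles a small table of ids and features f1 to f3.

```python
# Hypothetical structured cases: each molecule is a list of (element, charge) atoms.
molecules = {
    1: [("C", 0.10), ("H", 0.05)],
    2: [("C", -0.20), ("N", 0.30)],
    3: [("O", 0.00), ("H", 0.10)],
}

# Hypothetical patterns, each answering: is the case covered or not?
patterns = {
    "f1": lambda atoms: any(el == "C" for el, _ in atoms),          # has a carbon atom
    "f2": lambda atoms: any(charge < -0.1 for _, charge in atoms),  # has a low-charge atom
    "f3": lambda atoms: any(el == "Cl" for el, _ in atoms),         # has a chlorine atom
}

header = ["id"] + list(patterns)
rows = [
    [mol_id] + [int(pattern(atoms)) for pattern in patterns.values()]
    for mol_id, atoms in sorted(molecules.items())
]

print(header)
for row in rows:
    print(row)
```

Each row is now a flat description of one molecule, so any single-table learner can use it, while the features still encode structural information about the atoms.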
Graph Mining Features
Make Safarii mimic the behaviour of Graph
Mining systems — on a relational database
Object Identity
Interpretation: ‘all molecules that contain a
carbon atom and an atom of low charge
(potentially the same atom)’
Can these two atoms actually be one and the
same atom?
Object Identity
‘all molecules that contain at least one carbon
atom’
or
‘all molecules that contain at least two carbon
atoms’
Safarii lets you choose between two alternative semantics of
Selection Graphs: traditional or object identity. This last
mode is common in Graph Mining, and allows for a
rudimentary form of counting substructures
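The difference between the two semantics shows up clearly in SQL. The sketch below, using Python's sqlite3 module, evaluates a pattern with two carbon-atom nodes under both interpretations. The `atom` table and the query text are assumptions for illustration; the only difference between the two queries is the extra inequality that forces the two atoms to be distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE atom (id INTEGER PRIMARY KEY, molecule_id INTEGER, element TEXT);
    INSERT INTO atom (molecule_id, element) VALUES
        (1, 'C'), (1, 'H'),
        (2, 'C'), (2, 'C'), (2, 'N');
""")

# A pattern with two atom nodes, both required to be carbon.
base = """
    SELECT DISTINCT a1.molecule_id
    FROM atom AS a1
    JOIN atom AS a2 ON a2.molecule_id = a1.molecule_id
    WHERE a1.element = 'C' AND a2.element = 'C' {extra}
    ORDER BY a1.molecule_id
"""

# Traditional semantics: both nodes may match the same atom.
traditional = [r[0] for r in conn.execute(base.format(extra=""))]
# Object identity: the two nodes must match different atoms.
object_identity = [
    r[0] for r in conn.execute(base.format(extra="AND a1.id <> a2.id"))
]

print(traditional)      # [1, 2]  (at least one carbon atom)
print(object_identity)  # [2]     (at least two carbon atoms)
```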
‘Closed’ Patterns
In many cases, patterns capture the minimal requirements to select an
interesting subgroup. Often one is interested in finding out what other features
typically hold for such a pattern. Safarii lets you expand a Selection Graph to
get a closed pattern: no extra constraints can be added without reducing the
size of the subgroup covered by the pattern
More complex
Selection Graph,
but same subgroup
Building Decision Lists
Build a Decision List classifier from the data
directly, rather than first building a Pattern
Team and deriving a classifier from this
Decision List
Decision Lists are induced by a method
called Separate & Conquer (also known as
covering approach): run Subgroup Discovery
to find a good subgroup, then ‘remove’ this
subgroup from the database, and continue in
the same way with the remainder, etc. The
result is a Decision List, an ordered list of
Selection Graphs with associated predictions
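The Separate & Conquer loop can be sketched as follows. To keep the example short, Subgroup Discovery is replaced by picking, from a fixed set of candidate patterns, the one with the highest absolute Novelty on the remaining cases. That simplification, the candidate patterns and the data are all assumptions; this is not Safarii's search.

```python
from collections import Counter

def majority(labels):
    return Counter(labels).most_common(1)[0][0]

def novelty(cases, pattern, target=True):
    """p(ST) - p(S)p(T), measured on the given cases."""
    n = len(cases)
    p_s = sum(1 for c in cases if pattern(c)) / n
    p_t = sum(1 for c in cases if c["label"] == target) / n
    p_st = sum(1 for c in cases if pattern(c) and c["label"] == target) / n
    return p_st - p_s * p_t

def separate_and_conquer(cases, patterns):
    remaining, decision_list = list(cases), []
    while remaining:
        # Stand-in for Subgroup Discovery: best candidate on the remainder.
        name, pattern = max(patterns.items(),
                            key=lambda item: abs(novelty(remaining, item[1])))
        covered = [c for c in remaining if pattern(c)]
        if not covered or len(covered) == len(remaining):
            break  # no pattern splits the remainder any further
        decision_list.append((name, majority([c["label"] for c in covered])))
        # 'Remove' the subgroup and continue with the remainder.
        remaining = [c for c in remaining if not pattern(c)]
    default = majority([c["label"] for c in (remaining or cases)])
    return decision_list, default

def predict(decision_list, default, patterns, case):
    """The first matching rule in the ordered list decides."""
    for name, prediction in decision_list:
        if patterns[name](case):
            return prediction
    return default

# Hypothetical molecules; label True means mutagenic.
cases = [
    {"ring": True, "heavy": False, "label": True},
    {"ring": True, "heavy": False, "label": True},
    {"ring": True, "heavy": True, "label": True},
    {"ring": False, "heavy": True, "label": False},
    {"ring": False, "heavy": True, "label": False},
    {"ring": False, "heavy": False, "label": True},
    {"ring": False, "heavy": False, "label": False},
    {"ring": False, "heavy": False, "label": False},
]
patterns = {"has_ring": lambda c: c["ring"], "heavy": lambda c: c["heavy"]}

decision_list, default = separate_and_conquer(cases, patterns)
print(decision_list, default)
# [('has_ring', True), ('heavy', False)] False
```

The order of the rules matters: a heavy molecule with a ring is caught by the first rule and predicted mutagenic, before the second rule is ever consulted.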
Decision List Characteristics
Decision Lists can be plotted in ROC space,
in much the same way as collections of
rules. The area under the ROC-curve is a
good measure for its quality
Other graphs are available to
analyse the quality of each
individual decision that
appears in the Decision List
Testing Decision List
Decision Lists are not only models of the
data, they are also classifiers: they can
be used to predict held-out data (test-set) for validation, or to predict future
data for which the target is unknown
Safarii lets you compare predictions made on the
test-set to the actual target of those cases. The
classification score tells you what score to expect
on future data
Deploying Decision List
In order to deploy the Decision List as a classifier for use in other
applications or on-line scoring systems, Safarii allows the
exporting of the model as a series of SQL statements
Alternatively, if the data
to be classified appears
in the same table as the
original table (identified
as a different sample),
the classifier can be
applied directly, as
shown before
Pre-processing using
ProSafarii
Use ProSafarii, the companion of Safarii, to
pre-process the initial data, using a number
of pre-defined transformations that are
specifically relevant for MRDM.
Opportunities for improvement are
identified by ProSafarii automatically
Pre-processing with ProSafarii
ProSafarii is a separate
software package for pre-processing
in a multi-relational domain. It
considers any given
relational database, and
identifies opportunities for
transformation
Selected operations can be
executed automatically
The resulting modified
database is visible to
Safarii, and can be mined
directly
Pre-processing with ProSafarii
• Classes of transformation ProSafarii supports
• Opportunities for transformation
• Original or modified data model
Aggregation
Enrich the information available in individual
tables, by applying aggregate functions to
the relationships between pairs of tables
ProSafarii: Aggregation
Enrich the data in one
table by aggregating over a
one-to-many relationship
with a neighbouring table
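In SQL terms, this kind of aggregation is a GROUP BY over the one-to-many join. The sketch below uses Python's sqlite3 module with a hypothetical `account` and `trans` schema; the choice of aggregate functions (count, sum, maximum) is an assumption, and ProSafarii may offer a different set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE account (id INTEGER PRIMARY KEY);
    CREATE TABLE trans (id INTEGER PRIMARY KEY, account_id INTEGER, amount REAL);
    INSERT INTO account (id) VALUES (1), (2), (3);
    INSERT INTO trans (account_id, amount) VALUES (1, 100.0), (1, 50.0), (2, 20.0);
""")

# Enrich each account with aggregates over its transactions.
# LEFT JOIN keeps accounts that have no transactions at all.
rows = conn.execute("""
    SELECT a.id,
           COUNT(t.id)                  AS n_trans,
           COALESCE(SUM(t.amount), 0.0) AS total_amount,
           MAX(t.amount)                AS max_amount
    FROM account AS a
    LEFT JOIN trans AS t ON t.account_id = a.id
    GROUP BY a.id
    ORDER BY a.id
""").fetchall()

for row in rows:
    print(row)
# (1, 2, 150.0, 100.0)
# (2, 1, 20.0, 20.0)
# (3, 0, 0.0, None)
```

The new columns describe each account by a summary of its related records, so some of the structural information becomes available in the account table itself.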
Discretisation
Transform or enrich numeric data appearing
in any table, by applying a number of MRDM-sensitive discretisation procedures
ProSafarii: Discretisation
Different discretisation
procedures, sensitive to
the multi-relational
structure of the data
Two representations for the
discretised attributes
The original numeric columns can
be kept alongside the new
nominal attributes
Two Alternative Representations
• Nominal (1 attribute)
• Cumulative binary (n-1 attributes)
[Figure labels: a, b, c, d]
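The two representations can be compared in a short sketch. It assumes that a numeric attribute is cut into n = 4 intervals labelled a to d by three cut points, and that the cumulative binary attributes test "value <= cut point". The cut points, the direction of the comparison and the function names are assumptions, not ProSafarii's documented behaviour.

```python
from bisect import bisect_left

def nominal(value, cut_points, labels):
    """One nominal attribute: the label of the interval the value falls in."""
    return labels[bisect_left(cut_points, value)]

def cumulative_binary(value, cut_points):
    """n-1 binary attributes: is the value at or below each cut point?"""
    return [int(value <= cut) for cut in cut_points]

cut_points = [10.0, 20.0, 30.0]   # hypothetical thresholds
labels = ["a", "b", "c", "d"]     # n = 4 intervals

for value in (10.0, 15.0, 35.0):
    print(value, nominal(value, cut_points, labels),
          cumulative_binary(value, cut_points))
# 10.0 a [1, 1, 1]
# 15.0 b [0, 1, 1]
# 35.0 d [0, 0, 0]
```

With the cumulative representation a single binary condition selects a whole range of values, for example everything at or below the second cut point, which one nominal value cannot express.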
Sampling
Use ProSafarii to define multiple samples of
your database, or use existing definitions of
samples, in order to distinguish training and
test sets, and apply cross-validation
ProSafarii: Sampling
Divide the target table up in two or
more subsets by defining a new
attribute that identifies each sample.
Either create two samples with
specified probability, or split into n
samples with uniform distribution
Use sampling in Safarii to
compare different samples,
classify test-sets, and do cross-validation
Training set is either currently
selected sample (positive), or all
samples except current (negative)
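The sample attribute and its use for cross-validation can be sketched as follows. The code assigns each case in the target table to one of n samples and then, for each sample in turn, trains on all samples except the current one. The function names, the fixed seed and the equal-sized split are assumptions made for illustration.

```python
import random

def assign_samples(case_ids, n_samples, seed=0):
    """Return a new 'sample' attribute: case id -> sample number (0..n-1)."""
    ids = list(case_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    return {case_id: i % n_samples for i, case_id in enumerate(ids)}

def cross_validation_splits(sample_of, n_samples):
    """Yield (training ids, test ids): train on all samples except the current."""
    for current in range(n_samples):
        test = sorted(c for c, s in sample_of.items() if s == current)
        train = sorted(c for c, s in sample_of.items() if s != current)
        yield train, test

case_ids = range(1, 11)  # ten hypothetical cases in the target table
sample_of = assign_samples(case_ids, n_samples=5)
splits = list(cross_validation_splits(sample_of, n_samples=5))

for train, test in splits:
    print(len(train), "training cases,", len(test), "test cases")
```

Every case ends up in the test-set exactly once, so the scores over the five runs together cover the whole target table.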
Frequently Asked Questions (1)
• What operating systems does Safarii run on?
– The software is written in Java, so it will run on most modern
operating systems.
• I only have single table data. Can I still use Safarii?
– Yes. Even though you will not be using Safarii to its full power, it
still offers many features that are useful, and that cannot be
found in competing single table tools.
• Can I deploy models created by Safarii in an operational
environment?
– Yes. Safarii allows you to define different data samples, such that
you can apply your models to held-out data. Alternatively, you can
save any model as a collection of SQL statements, to be applied to
new data, or specific cases.
Frequently Asked Questions (2)
• Can we use Safarii under an academic license?
– Yes, we offer an academic license at highly reduced rates. Please
contact us for conditions and pricing.
• What database management systems can I use with Safarii?
– Safarii is able to mine most modern relational data sources.
• Do you offer consulting services or support with Safarii?
– Yes. We can help you set up a successful MRDM project with
Safarii. We also provide a data analysis service where we analyse
data for you, without the need to work with the system
yourself or to purchase a license.
Contact Information
For all information about features of Safarii,
pricing or consulting, please contact us here:
Kiminkii
P.O. Box 171
3995 DD Houten
the Netherlands
+31 6 24 61 25 60
[email protected]
www.kiminkii.com/safarii.html