A Wordification Approach to Relational
Data Mining: Early Results
Matic Perovšek, Anže Vavpetič,
Nada Lavrač
Jožef Stefan Institute, Slovenia
Overview
• Introduction
• Methodology
• Experimental results
• Conclusion
Introduction
• Relational data mining algorithms aim to induce models and/or relational patterns from multiple tables
• Individual-centered relational databases can be transformed into a single-table form (propositionalization)
Motivation
• Wordification is inspired by text mining techniques
• Large number of simple, easy-to-understand features
• Greater scalability, handling large datasets
• Can be used as a preprocessing step for propositional learners, as well as for declarative modeling / constraint solving (De Raedt et al., today’s invited talk)
Methodology
1. Transformation from relational database to a textual corpus
2. TF-IDF weight calculation
Transformation from relational
database to a textual corpus
• One individual of the initial relational database -> one text document
• Features -> the words of this document
• Words constructed as a combination:
Transformation from relational
database to a textual corpus
• For each individual, the words generated for the main table are concatenated with the words generated from the secondary (BK) tables
Example
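A minimal illustrative sketch of the transformation (not the authors' original example). The word format table_attribute_value, the function names, and the toy movie/actor rows are all assumptions made for illustration:

```python
def make_word(table, attribute, value):
    # Assumed word format: table_attribute_value, lowercased.
    return f"{table}_{attribute}_{value}".lower().replace(" ", "_")


def wordify(main_table, main_rows, secondary_tables, key="id"):
    """Return {individual id: list of words}: one 'document' per individual."""
    documents = {}
    # Words generated from the main table.
    for row in main_rows:
        documents[row[key]] = [
            make_word(main_table, attr, val)
            for attr, val in row.items()
            if attr != key
        ]
    # Words generated from secondary (BK) tables are appended to the
    # document of the individual they refer to via a foreign key.
    for table, (foreign_key, rows) in secondary_tables.items():
        for row in rows:
            owner = row[foreign_key]
            if owner in documents:
                documents[owner].extend(
                    make_word(table, attr, val)
                    for attr, val in row.items()
                    if attr != foreign_key
                )
    return documents


# Hypothetical toy database: one main table and one secondary table.
movies = [
    {"id": 1, "genre": "drama", "rank": "top"},
    {"id": 2, "genre": "comedy", "rank": "bottom"},
]
actors = [
    {"movie_id": 1, "gender": "f"},
    {"movie_id": 1, "gender": "m"},
    {"movie_id": 2, "gender": "m"},
]
corpus = wordify("movie", movies, {"actor": ("movie_id", actors)})
```

Here each movie becomes one bag of words; repeated words (for example, several actors of the same gender) are kept, so that term frequencies can be computed later.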
TF-IDF weights
• No explicit use of existential variables in our features; TF-IDF weights are used instead
• The weight of a word gives a strong indication of how relevant the feature is for the given individual
• The TF-IDF weights can then be used either to filter out words of low importance or be passed directly to a propositional learner
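As a rough sketch of the weighting step: the slides do not state which TF-IDF variant was used, so this assumes a common one, with tf as the relative frequency of a word in a document and idf = log(N / df). The function name, the toy corpus, and the filtering threshold are hypothetical:

```python
import math
from collections import Counter


def tf_idf(corpus):
    """corpus: {doc id: list of words} -> {doc id: {word: TF-IDF weight}}."""
    n_docs = len(corpus)
    # Document frequency: in how many documents each word occurs.
    doc_freq = Counter()
    for words in corpus.values():
        doc_freq.update(set(words))

    weights = {}
    for doc_id, words in corpus.items():
        counts = Counter(words)
        total = len(words)
        weights[doc_id] = {
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        }
    return weights


# Hypothetical toy corpus: one word list per individual.
toy_corpus = {
    1: ["movie_genre_drama", "actor_gender_f", "actor_gender_m"],
    2: ["movie_genre_comedy", "actor_gender_m"],
}
weights = tf_idf(toy_corpus)

# Filtering: drop words whose weight does not exceed an (assumed) threshold.
threshold = 0.0
filtered = {
    doc_id: {w: x for w, x in ws.items() if x > threshold}
    for doc_id, ws in weights.items()
}
```

A word that occurs in every document (here actor_gender_m) gets weight 0 under this variant and is filtered out, while words specific to one individual keep a positive weight.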
Experimental results
• Slovenian traffic accidents database
• IMDB database
  • Top 250 and bottom 100 movies
  • Movies, actors, movie genres, directors, director genres
• Applied the wordification methodology
• Performed association rule learning
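The slides do not say which association rule algorithm or tool was used. As a minimal sketch of the idea only, the following brute-force code finds rules with a single-word antecedent and consequent over wordified documents; it is not Apriori, and the function name, thresholds, and toy transactions are assumptions:

```python
from itertools import combinations


def pair_rules(transactions, min_support=0.5, min_confidence=0.8):
    """Return (lhs, rhs, support, confidence) for single-item rules lhs -> rhs."""
    n = len(transactions)
    sets = [set(t) for t in transactions]

    def support(items):
        # Fraction of transactions containing all the given items.
        return sum(1 for s in sets if items <= s) / n

    all_items = sorted(set().union(*sets))
    rules = []
    for a, b in combinations(all_items, 2):
        both = support({a, b})
        if both < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            confidence = both / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, both, confidence))
    return rules


# Hypothetical wordified documents, one per movie.
transactions = [
    ["movie_genre_drama", "movie_rank_top"],
    ["movie_genre_drama", "movie_rank_top"],
    ["movie_genre_comedy", "movie_rank_bottom"],
    ["movie_genre_drama", "movie_rank_bottom"],
]
rules = pair_rules(transactions)
```

On this toy data the only rule that passes both thresholds is movie_rank_top -> movie_genre_drama (support 0.5, confidence 1.0); the reverse direction has confidence 2/3 and is rejected.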
Conclusion
• Novel propositionalization technique called Wordification
• Greater scalability
• Easy to understand features
• Further work:
  • Test on larger databases
  • Experimental comparison with other propositionalization techniques
  • Combine with propositionalization-like approach to mining heterogeneous information networks (Grčar et al. 2012), applicable to CLP in data preprocessing
Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched Heterogeneous Information Networks. The Computer Journal, 2012.