A Wordification Approach to Relational
Data Mining: Early Results
Matic Perovšek, Anže Vavpetič,
Nada Lavrač
Jožef Stefan Institute, Slovenia
Overview
Introduction
Methodology
Experimental results
Conclusion
Introduction
Relational data mining algorithms aim to induce
models and/or relational patterns from multiple
tables
Individual-centered relational databases can be
transformed into a single-table form, a process
known as propositionalization
Motivation
Wordification inspired by text mining techniques
Large number of simple, easy to understand
features
Greater scalability, handling large datasets
Can be used as a preprocessing step for
propositional learners, as well as for declarative
modeling / constraint solving
(De Raedt et al., today’s invited talk)
Methodology
1. Transformation from relational database to a textual corpus
2. TF-IDF weight calculation
Transformation from relational
database to a textual corpus
One individual of the initial relational database -> one text document
Features -> the words of this document
Words constructed as a combination:
Transformation from relational
database to a textual corpus
For each individual, the words generated for
the main table are concatenated with words
generated from the secondary (BK) tables
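The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the toy tables, the function names `row_to_words` and `wordify`, and the word form `table_attribute_value` are all assumptions made for this sketch, since the slides do not show how words are composed.

```python
# Toy relational data (hypothetical; not taken from the slides).
# Main table: one row per individual (here, a movie).
movies = [
    {"id": 1, "decade": "1990s", "rating": "high"},
    {"id": 2, "decade": "2000s", "rating": "low"},
]
# Secondary (background knowledge) table, linked to the main table by movie_id.
genres = [
    {"movie_id": 1, "genre": "drama"},
    {"movie_id": 1, "genre": "crime"},
    {"movie_id": 2, "genre": "comedy"},
]


def row_to_words(table, row, skip):
    """Turn one row into words of the assumed form table_attribute_value."""
    return [f"{table}_{attr}_{value}" for attr, value in row.items() if attr not in skip]


def wordify(main_rows, main_name, key, secondary):
    """Build one document (a list of words) per individual in the main table.

    `secondary` maps a table name to a pair (rows, foreign_key_column).
    """
    documents = {}
    for row in main_rows:
        # Words generated from the main table...
        words = row_to_words(main_name, row, skip={key})
        # ...concatenated with words from every linked secondary row.
        for name, (rows, fk) in secondary.items():
            for srow in rows:
                if srow[fk] == row[key]:
                    words.extend(row_to_words(name, srow, skip={fk}))
        documents[row[key]] = words
    return documents


docs = wordify(movies, "movie", "id", {"genre": (genres, "movie_id")})
for doc_id, words in docs.items():
    print(doc_id, " ".join(words))
```

Key columns are skipped so that identifiers do not become words; whether the original method does the same is not stated in the slides.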
Example
TF-IDF weights
Features make no explicit use of existential
variables; TF-IDF weights are used instead
The weight of a word gives a strong indication
of how relevant the feature is for the given
individual.
The TF-IDF weights can then be used either to
filter out words of low importance or directly
as input to a propositional learner.
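As a rough sketch of the weighting step, the snippet below computes TF-IDF over a wordified corpus and then drops low-weight words. The slides do not give the exact formula, so this assumes one common variant (term frequency normalized by document length, inverse document frequency as log(N/df)); the toy corpus, the function names and the threshold are hypothetical.

```python
import math
from collections import Counter

# Hypothetical wordified corpus: one list of words per individual.
docs = {
    1: ["movie_rating_high", "genre_genre_drama", "genre_genre_crime"],
    2: ["movie_rating_low", "genre_genre_comedy", "genre_genre_drama"],
    3: ["movie_rating_high", "genre_genre_drama"],
}


def tfidf(documents):
    """Return {doc_id: {word: weight}} with tf = count/len(doc), idf = log(N/df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word occurs.
    df = Counter()
    for words in documents.values():
        df.update(set(words))
    weights = {}
    for doc_id, words in documents.items():
        counts = Counter(words)
        weights[doc_id] = {
            w: (c / len(words)) * math.log(n_docs / df[w]) for w, c in counts.items()
        }
    return weights


def filter_low(weights, threshold):
    """Keep only words whose weight exceeds the (assumed) threshold."""
    return {d: {w: x for w, x in ws.items() if x > threshold} for d, ws in weights.items()}


weights = tfidf(docs)
filtered = filter_low(weights, threshold=0.0)
```

In this toy corpus the word `genre_genre_drama` occurs in every document, so its weight is zero and it is filtered out, while rarer words are kept. The resulting weight dictionaries could equally be passed unfiltered to a propositional learner as a feature table.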
Experimental results
Slovenian traffic accidents database
IMDB database
Top 250 and bottom 100 movies
Movies, actors, movie genres, directors,
director genres
Applied the wordification methodology
Performed association rule learning
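To illustrate how association rule learning can run on wordified data, the sketch below treats each document as a set of words and enumerates single-item rules by support and confidence. It is a simplified stand-in for a full algorithm such as Apriori; the slides do not say which learner was used, and the transactions, thresholds and function names here are hypothetical toy values, not results from the IMDB experiment.

```python
from itertools import permutations

# Hypothetical wordified documents, treated as item sets (toy data only).
transactions = [
    {"movie_rating_high", "genre_genre_drama"},
    {"movie_rating_high", "genre_genre_drama", "genre_genre_crime"},
    {"movie_rating_low", "genre_genre_comedy"},
    {"movie_rating_low", "genre_genre_comedy", "genre_genre_drama"},
]


def support(itemset, data):
    """Fraction of transactions that contain every item of `itemset`."""
    return sum(itemset <= t for t in data) / len(data)


def single_item_rules(data, min_support, min_confidence):
    """Enumerate rules {a} -> {b} that meet both thresholds."""
    items = sorted(set().union(*data))
    rules = []
    for a, b in permutations(items, 2):
        supp = support({a, b}, data)
        if supp < min_support:
            continue
        # Confidence: support of the whole rule divided by support of the antecedent.
        conf = supp / support({a}, data)
        if conf >= min_confidence:
            rules.append((a, b, supp, conf))
    return rules


rules = single_item_rules(transactions, min_support=0.5, min_confidence=1.0)
for a, b, supp, conf in rules:
    print(f"{a} -> {b}  (support={supp:.2f}, confidence={conf:.2f})")
```

Exhaustive enumeration of pairs is only practical for small vocabularies; a real experiment would use a dedicated frequent-itemset miner.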
Conclusion
Novel propositionalization technique called Wordification
Greater scalability
Easy to understand features
Further work:
Test on larger databases
Experimental comparison with other propositionalization
techniques
Combine with the propositionalization-like approach to mining
heterogeneous information networks (Grčar et al. 2012),
applicable to CLP in data preprocessing
Grčar, Trdin, Lavrač: A Methodology for Mining Document-Enriched
Heterogeneous Information Networks, Computer Journal 2012