Transcript Slide 1

DataBases & Data Mining Joined
Specialization Project
„Data Mining Classification Tool”
By
Mateusz Żochowski
&
Jakub Strzemżalski
Agenda



General description of the problem
Functionality
Data Mining aspects


Algorithm and optimization
Database aspects

General entities scheme
2
General Description




Universal Tool
Different kinds of objects (e.g.
preprocessed photos, hospital patient
data)
Finding similar objects
Decision problems
3
Functionality

Independent system – user operated



Using sets of data already provided or
uploading new types
Influence on the way data is processed
Possible usage in bigger systems as a
processing engine

Additional module used as a supporting tool in
more complex systems
4
General Use Case
5
Data Mining

General Ideas



K-NN algorithm


Description of an object
Definition of a distance
Brief explanation of the algorithm
Optimization


Problem of comparing a large number of objects
Optimized solution – using the idea of grouping
6
Definitions

Objects
7
K-NN

K – Nearest Neighbors



The idea behind k-NN
Aim - finding the k objects most similar to
the one we are analyzing and then assigning
it the appropriate decision
Method - calculating the distance from the
analyzed object to the others in our
database and finding the closest ones
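
The aim and method above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: objects are tuples of numeric attributes, the decision is chosen by majority vote among the k neighbors, and plain Euclidean distance stands in for the project's own distance definition. The names `knn_classify` and `database` are hypothetical.

```python
import math
from collections import Counter


def euclidean(a, b):
    """Euclidean distance between two attribute tuples (an assumed stand-in)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def knn_classify(query, objects, k=3, distance=euclidean):
    """Return (decision, neighbors) for `query`.

    `objects` is a list of (attributes, decision) pairs. The distance from
    the query to every stored object is computed, the k closest objects are
    kept, and the most common decision among them is returned.
    """
    ranked = sorted(objects, key=lambda item: distance(query, item[0]))
    neighbors = ranked[:k]
    votes = Counter(decision for _, decision in neighbors)
    return votes.most_common(1)[0][0], neighbors


# Hypothetical toy database: two attributes per object, two decisions.
database = [
    ((1.0, 1.0), "A"),
    ((1.2, 0.8), "A"),
    ((0.9, 1.1), "A"),
    ((5.0, 5.0), "B"),
    ((5.2, 4.9), "B"),
]

decision, neighbors = knn_classify((1.1, 1.0), database, k=3)
print(decision)
```

Note that every query compares the analyzed object with all stored objects, which is the cost the optimization slides address.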
8
K-NN Graphical representation
9
Definitions

Distance


Calculations in multidimensional space
Coefficients



Alpha
wi – weights – expressing the importance of
particular attributes
n – number of all the attributes
\[ D(O_1, O_2) = \sum_{i=1}^{n} w_i \cdot |a_i(o_1) - a_i(o_2)| \]
10
Optimization



The reason – the cost of computing
multidimensional distances from one object
to all the others
Solution – an improved k-NN
Result – better efficiency, because
narrowing the set of possibly similar
objects reduces the number of distance
computations
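
One way to narrow the candidate set can be sketched as follows. The individual steps are shown only as pictures on the slides, so this is an assumption about the general idea rather than the authors' exact procedure: objects are stored in groups, each group has a center, and a new object is compared only with members of the nearest group or groups. The names `grouped_knn` and `groups_to_search` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two attribute tuples (an assumed stand-in)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def grouped_knn(query, groups, k=2, groups_to_search=1):
    """Return (neighbors, number_of_object_comparisons).

    `groups` is a list of (center, members) pairs, where members are
    (attributes, decision) pairs. Only members of the groups whose centers
    are closest to the query are compared with it.
    """
    nearest_groups = sorted(groups, key=lambda g: euclidean(query, g[0]))
    nearest_groups = nearest_groups[:groups_to_search]
    candidates = [m for _, members in nearest_groups for m in members]
    candidates.sort(key=lambda m: euclidean(query, m[0]))
    return candidates[:k], len(candidates)


# Hypothetical pre-computed groups: 7 objects in 3 groups.
groups = [
    ((1.0, 1.0), [((0.9, 1.1), "A"), ((1.2, 0.8), "A"), ((1.0, 1.3), "A")]),
    ((5.0, 5.0), [((5.1, 4.9), "B"), ((4.8, 5.2), "B")]),
    ((9.0, 1.0), [((9.2, 0.9), "C"), ((8.8, 1.1), "C")]),
]

neighbors, compared = grouped_knn((1.1, 1.0), groups, k=2)
print(compared)  # objects actually compared, out of 7 stored
```

The result is approximate: a true nearest neighbor lying just across a group boundary can be missed, which searching more than one group reduces at the price of more distance computations.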
11
Step 1 - Group-oriented plane
division
12
Step 2 – a new object appears
13
Step 3
14
Step 4
15
Step 5
16
Grouping problem



The problem – assigning objects to
appropriate groups according to the chosen
distance definition
Solution – a clustering algorithm
Brief example – the k-means algorithm
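
A brief k-means sketch, assuming numeric attribute tuples and Euclidean distance in place of the project's own distance definition. To keep the example deterministic, the first k points are used as the initial centers; real implementations usually choose them randomly or with a smarter seeding scheme. The names `k_means` and `points` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two attribute tuples (an assumed stand-in)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Coordinate-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def k_means(points, k, max_iterations=100):
    """Return (centers, assignment), where assignment[i] is the group of points[i]."""
    centers = list(points[:k])  # deterministic initialization (an assumption)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point joins the group with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: euclidean(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point changed group, so the grouping is stable
        assignment = new_assignment
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = mean(members)
    return centers, assignment


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8), (0.8, 1.2), (4.8, 5.2)]
centers, assignment = k_means(points, k=2)
print(assignment)
```

The resulting centers and group memberships are the kind of information a group-based k-NN search needs in order to pick the nearest groups for a new object.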
17
DataBase – entities
18
DataBase




The general structure of the database results
from the optimization approach
Because the system is universal, the
database may contain many different
tables of objects
System tables are needed for defining
experiments
Group Member as a temporary table?
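
The points above could look roughly like this in SQLite. The entity diagram is not part of the transcript, so every table and column name here (`patient`, `experiment`, `group_member` and their columns) is hypothetical; the sketch only illustrates one object table, one system table describing an experiment, and group membership kept in a temporary table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- One of possibly many object tables (hypothetical example data set).
    CREATE TABLE patient (
        id INTEGER PRIMARY KEY,
        age REAL,
        blood_pressure REAL,
        decision TEXT
    );
    -- System table: which object table an experiment uses and with which k.
    CREATE TABLE experiment (
        id INTEGER PRIMARY KEY,
        object_table TEXT NOT NULL,
        k INTEGER NOT NULL
    );
    -- Group membership as a temporary table, rebuilt for each experiment.
    CREATE TEMP TABLE group_member (
        experiment_id INTEGER NOT NULL,
        group_id INTEGER NOT NULL,
        object_id INTEGER NOT NULL
    );
    """
)

conn.executemany(
    "INSERT INTO patient (id, age, blood_pressure, decision) VALUES (?, ?, ?, ?)",
    [(1, 34.0, 118.0, "healthy"), (2, 36.0, 121.0, "healthy"), (3, 71.0, 165.0, "at risk")],
)
conn.execute("INSERT INTO experiment (id, object_table, k) VALUES (1, 'patient', 3)")
conn.executemany(
    "INSERT INTO group_member (experiment_id, group_id, object_id) VALUES (?, ?, ?)",
    [(1, 0, 1), (1, 0, 2), (1, 1, 3)],
)

# Fetch only the objects of one group, i.e. the narrowed candidate set.
rows = conn.execute(
    """
    SELECT p.id, p.decision
    FROM group_member AS gm
    JOIN patient AS p ON p.id = gm.object_id
    WHERE gm.experiment_id = ? AND gm.group_id = ?
    ORDER BY p.id
    """,
    (1, 0),
).fetchall()
print(rows)
```

A temporary table disappears when the connection closes, so the grouping would have to be recomputed per session; whether that is acceptable is exactly the open question raised on the slide.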
19
Summary
There is still a lot of work to do... 
20