GUHA method in Data Mining
Esko Turunen
Tampere University of Technology
Tampere, Finland
Data Mining in a Nutshell
Knowledge discovery in databases (KDD) was initially defined as the ‘non-trivial extraction of
implicit, previously unknown, and potentially useful information from data’ [Frawley, Piatetsky-Shapiro, Matheus, 1991]. A revised version of this definition states that ‘KDD is the non-trivial
process of identifying valid, novel, potentially useful, and ultimately understandable patterns in
data’ [Fayyad, Piatetsky-Shapiro, Smyth, 1996].
According to this definition, data mining is a step in the KDD process concerned with applying
computational techniques (i.e., data mining algorithms implemented as computer programs) to
actually find patterns in the data. In a sense, data mining is the central step in the KDD process.
The other steps in the KDD process are concerned with preparing data for data mining, as well as
evaluating the discovered patterns, the results of data mining.
I Data. The input to a data mining algorithm is most commonly a single flat table comprising a
number of fields (columns) and records (rows). In general, each row represents an object and
columns represent properties of objects.
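As a minimal sketch of such a flat table, the Python snippet below stores each record (row) as a dictionary whose keys are the fields (columns). The field names and all values are hypothetical and serve only to illustrate the layout; the helper `column` is likewise an illustrative name.

```python
# Hypothetical flat table: one dict per record (row), one key per field (column).
records = [
    {"Age": 34, "Gender": "Male", "Income": 42000, "BigSpender": "No"},
    {"Age": 61, "Gender": "Female", "Income": 120000, "BigSpender": "Yes"},
    {"Age": 55, "Gender": "Female", "Income": 87000, "BigSpender": "Yes"},
    {"Age": 47, "Gender": "Male", "Income": 95000, "BigSpender": "No"},
]

# The fields (columns) are the keys shared by every record.
fields = list(records[0].keys())


def column(table, field):
    """Return the values of one field (column) across all records (rows)."""
    return [row[field] for row in table]


ages = column(records, "Age")
```

Here each row describes one object (a customer) and each column one property of that object, which is the shape most data mining algorithms expect as input.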
II Typical data mining tasks.
- Classification and regression: the task is to predict the value of one field (the class) from the
other fields. If the class is continuous, the task is called regression. If the class is discrete, the
task is called classification.
- Clustering is concerned with grouping objects into classes of similar objects. A cluster is a
collection of objects that are similar to each other and are dissimilar to objects in other clusters.
- Association analysis is the discovery of association rules. Association rules specify correlation
between frequent item sets.
- Data characterisation sums up the general characteristics or features of the target class of data:
this class is typically collected by a database query.
- Outlier detection is concerned with finding data objects that do not fit the general behaviour or
model of the data: these are called outliers.
- Evolution analysis describes and models regularities or trends for objects whose behaviour
changes over time.
III Outputs of data mining procedures can be
- Equations, e.g. TotalSpent = 189.5275 x Age + 7146.89 [€]
- Decision trees, e.g.
[Figure: decision tree with an Income node (branches ≤ 100.000 € and > 100.000 €), an Age node (branches ≤ 58 and > 58), and leaves Yes / No]
- Predictive rules of the form
IF Conjunction of conditions
THEN Conclusion, e.g.
IF income is ≤ 100.000 € and Gender = Male
THEN not a Big Spender
- Association rules
e.g. {Gender = ‘Female’, Age = ‘>52’} → {Big Spender = ‘Yes’}
- Distance and similarity measures, e.g. $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$, where $x = (x_1, \ldots, x_n)$, $y = (y_1, \ldots, y_n)$
- Probabilistic models e.g. Bayesian networks
(For more details see Saso Dzeroski’s Relational Data Mining)
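Three of the output types listed above can be sketched in a few lines of Python. The regression coefficients are those given in the text; the function names are illustrative. Two readings are assumptions: the rule's income threshold is taken as "at most 100.000 €", and the distance measure is taken to be the usual Euclidean distance.

```python
import math


def total_spent(age):
    """Equation from the text: TotalSpent = 189.5275 * Age + 7146.89 (euros)."""
    return 189.5275 * age + 7146.89


def big_spender_rule(income, gender):
    """Predictive rule: IF income <= 100000 and Gender = Male THEN not a Big Spender.

    Returns False when the rule fires, and None when its conditions do not
    hold (the rule then makes no prediction). The '<=' is an assumption.
    """
    if income <= 100_000 and gender == "Male":
        return False
    return None


def euclidean(x, y):
    """Euclidean distance between two equally long numeric vectors (assumed measure)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
```

Note that a single IF-THEN rule only covers the cases where its conditions hold, which is why the sketch returns `None` rather than a prediction otherwise.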
-------------------------------------------------------------------------------------------------------------------
Our aim is to study in detail a particular data mining method called GUHA and its computer
implementation called LISp Miner. This approach is essentially association analysis; however,
classification, clustering and outlier detection tasks can also be carried out by this method.
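To make the idea of association analysis concrete, the sketch below computes two standard measures of an association rule, support and confidence, for a rule of the kind {Gender = ‘Female’, Age = ‘>52’} → {Big Spender = ‘Yes’}. These measures are not taken from the slides and are not claimed to be what GUHA or LISp Miner computes; the toy table and all names are hypothetical.

```python
def matches(row, conditions):
    """True if the row has the required value for every field in conditions."""
    return all(row.get(field) == value for field, value in conditions.items())


def support_confidence(table, antecedent, consequent):
    """Return (support, confidence) of the rule antecedent -> consequent.

    support    = share of all rows satisfying antecedent and consequent
    confidence = share of antecedent rows that also satisfy the consequent
    """
    n = len(table)
    n_ante = sum(1 for r in table if matches(r, antecedent))
    n_both = sum(
        1 for r in table if matches(r, antecedent) and matches(r, consequent)
    )
    support = n_both / n if n else 0.0
    confidence = n_both / n_ante if n_ante else 0.0
    return support, confidence


# Hypothetical toy data, for illustration only.
toy = [
    {"Gender": "Female", "Age": ">52", "BigSpender": "Yes"},
    {"Gender": "Female", "Age": ">52", "BigSpender": "Yes"},
    {"Gender": "Female", "Age": ">52", "BigSpender": "No"},
    {"Gender": "Male", "Age": ">52", "BigSpender": "Yes"},
    {"Gender": "Female", "Age": "<=52", "BigSpender": "No"},
]

support, confidence = support_confidence(
    toy, {"Gender": "Female", "Age": ">52"}, {"BigSpender": "Yes"}
)
```

In this toy table three rows satisfy the antecedent and two of them also satisfy the consequent, so the rule holds for two of the five rows overall.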