Transcript lect5_web

Lecture 5
TIES445 Data mining
Nov-Dec 2007
Sami Äyrämö
These slides are additional material for TIES445
‹#›
A data mining algorithm

”A data mining algorithm is a well-defined procedure that takes
data as input and produces output in the form of models and
patterns”
–
”Well-defined” indicate that the procedure can be precisely encoded
as a finite set of rules
–
”Algorithm”, a procedure that always terminates after some finite
number of of steps and produces an output
–
”Computational method” has all the properties of an algorithm except
a method for guaranteeing that the procedure will terminate in a finite
number of steps

(Computational method is usually described more abstactly than
algorithm, e.g., steepest descent is a computational method)
These slides are additional material for TIES445
‹#›
Data mining tasks
Explorative (visualization)
 Descriptive (clustering, rule finding,…)
 Predictive (classification, regression,…)

These slides are additional material for TIES445
‹#›
Elements of data mining algorithms

Data mining task

Structure of the model or pattern

Score function

Search/optimization method

Data management technique
These slides are additional material for TIES445
‹#›
Structure




Structure (functional form) of the model or pattern that will
be fitted to the data
Defines the boundaries of what can be approximated or
learned
Within these boundaries, the data guide us to a particular
model or pattern
E.g., hierarchical clustering model, linear regression
model, mixture model
These slides are additional material for TIES445
‹#›
Structure: decision tree
Figure from the book ”Tan,Steinbach, Kumar, Introduction to Data Mining, Addision Wesley, 2006.”
These slides are additional material for TIES445
‹#›
Structure: MLP
Figures by Tommi Kärkkäinen
These slides are additional material for TIES445
‹#›
Score function
Judge the quality of the fitted models or patterns
based on observed data
 Minimized/maximized when fitting parameters to
our models and patterns
 Critical for learning and generalization
– Goodness-of-fitness vs. generalization



e.g., the number of neurons in neural network
E.g., misclassification error, squared
error,support/accuracy
These slides are additional material for TIES445
‹#›
Score functions: Prototype-based clustering

K
n
min
J
(
I
,
{
c
}
,
{
x
}
q
k k 1
i i 1 ),
K
{c k }k 1 ,I i

for J q (I, {c k }
K
k 1
n

i 1
q
, {x i } )   Pi (x i  c ( I )i ) ,
n
i 1
where, (I ) i  {1,..., K }, c k  R p , Pi  diag( p i ) and x i  R p .
α = 2, q=2 → K-means
α = 1, q=2 → K-spatialmedians
α = 1, q=1 → K-coord.medians
• Different staticical properties of the cluster models
• Different algorithms and computational methods for solving
These slides are additional material for TIES445
‹#›
Score function: Overfitting vs. generalization
Figures by Tommi Kärkkäinen
These slides are additional material for TIES445
‹#›
Search/optimization method
Used to search over parameters and structures
 Computational procedures and algorithms used
to find the maximum/minimum of the score
function for particular models or patterns
– Includes:

Computational
methods used to optimize the score
function, e.g., steepest descent
Search-related parameters, e.g., the maximum
number of iterations or convergence specification for
an iterative algorithm

Single-fixed structure (e.g., kth order polynomial
function of the inputs) or family of different
structures (i.e., search over both structures and
their associated parameters spaces)
These slides are additional material for TIES445
‹#›
Search/optimization: K-means-like clustering
1.
2.
3.
4.
Initialize the cluster prototypes
Assign each data point to the closest cluster
prototype
Compute the new estimates (may require
another iterative algorithm) for the cluster
prototypes
Termination: stop if termination criteria are
satisfied (usually no changes in I)
These slides are additional material for TIES445
‹#›
Data management technique
Storing, indexing, and retrieving data
 Not usually specified by statistical or machine
learning algorithms
– A common assumption is that the data set is
small enough to reside in the main memory so
that random access of any data point is free
relative to actual computational costs
 Massive data sets may exceed the capacity of
available main memory
– The physical location of the data and the
manner in which data it is accessed can be
critically important in terms of algorithm
efficiency

These slides are additional material for TIES445
‹#›
Data management technique: memory

A general categorization of different memory structures
1. Registers of processors: direct acces, no slowdown
2. On-processor or on-board cache: fast semiconductor
memory on the same chip as the processor
3. Main memory: Normal semiconductor memory (up to
several gigabytes)
4. Disk cache: intermediate storage between main
memory and disks
5. Disk memory: Terabytes. Access time milliseconds.
6. Magnetic tape: Access time even minutes.
These slides are additional material for TIES445
‹#›
Data management: index structures





B-trees
Hash indices
Kd-trees
Multidimensional indexing
Relational datatables
These slides are additional material for TIES445
‹#›
Examples
CART
Backpropagation
APriori
Task
Classification
and regression
Regression
Rule pattern
discovery
Structure
Decision tree
Neural network
(non-linear function)
Association rules
Score function
Cross-validated
loss function
Squared error
Support/accuracy
Search method
Greedy search
over structures
Gradient descent on
parameters
Breadth-First
search
Data
management
technique
Unspecified
Unspecified
Linear scans
These slides are additional material for TIES445
‹#›