Machine Learning and Statistical Analysis
Jong Youl Choi
Computer Science Department
([email protected])
Social Bookmarking
Socialized bookmarks
Tags
1
Motivations
Social indexing or collaborative annotation
Collect knowledge from people
Extract information
Challenges
Vast amount of data
Very dynamic
Unsupervised data
2
Principles of Machine Learning
Bayes’ theorem and maximum likelihood
Machine Learning Algorithms
Clustering analysis
Dimension reduction
Classification
Parallel Computing
General parallel computing architecture
Parallel algorithms
3
Definition
Algorithms or techniques that enable a computer (machine) to
“learn” from data. Related to many areas such as data
mining, statistics, information theory, etc.
Algorithm Types
Unsupervised learning
Supervised learning
Reinforcement learning
Topics
Models
▪ Artificial Neural Network (ANN)
▪ Support Vector Machine (SVM)
Optimization
▪ Expectation-Maximization (EM)
▪ Deterministic Annealing (DA)
4
Posterior probability of θi, given X
θi ∈ Θ : Parameter
X : Observations
P(θi) : Prior (or marginal) probability
P(X | θi) : likelihood
Maximum Likelihood (ML)
Used to find the most plausible θi ∈ Θ, given X
Computing maximum likelihood (ML) or log-likelihood
Optimization problem
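Written out for reference, the Bayes rule and maximum-likelihood objective behind these definitions are (standard statements, not transcribed from the slide's equation images):

```latex
P(\theta_i \mid X) = \frac{P(X \mid \theta_i)\, P(\theta_i)}{\sum_j P(X \mid \theta_j)\, P(\theta_j)},
\qquad
\hat{\theta}_{\mathrm{ML}} = \arg\max_{\theta \in \Theta} P(X \mid \theta)
                           = \arg\max_{\theta \in \Theta} \sum_{n} \ln P(x_n \mid \theta)
```

(the last equality assumes i.i.d. observations x1, …, xn).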
5
Problem
Estimate hidden parameters θ = {μ, σ}
from the given data drawn from
k Gaussian distributions
Gaussian distribution
Maximum Likelihood
(Mitchell, 1997)
With a Gaussian likelihood (P = N),
solve by brute force or by a numerical method
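For reference, the Gaussian density and its maximum-likelihood estimates for a single component are (standard results, not copied from the slide's equation images):

```latex
N(x \mid \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}
  \exp\!\left(-\frac{(x-\mu)^2}{2\sigma^2}\right),
\qquad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i,
\qquad
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \hat{\mu})^2
```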
6
Problems in ML estimation
Observation X is often not complete
Latent (hidden) variable Z exists
Hard to explore whole parameter space
Expectation-Maximization algorithm
Objective : to find the ML estimate over the latent distribution P(Z | X, θ)
Steps (see the code sketch below)
0. Init – choose a random θold
1. E-step – compute the expectation over P(Z | X, θold)
2. M-step – find θnew which maximizes the expected likelihood
3. Go to step 1 after updating θold ← θnew
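A minimal sketch of this EM loop for a 1-D mixture of k Gaussians (illustrative only; the initialization, the fixed iteration count, and all variable names are my own choices, not the slide's):

```python
import numpy as np

def em_gmm_1d(x, k, n_iter=100, seed=0):
    """EM for a 1-D Gaussian mixture: alternates the E-step (responsibilities)
    and the M-step (weights, means, variances) for n_iter iterations."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Init: random means, data variance, uniform mixing weights
    mu = rng.choice(x, size=k, replace=False)
    var = np.ones(k) * np.var(x)
    pi = np.ones(k) / k
    for _ in range(n_iter):
        # E-step: responsibility r[i, j] = P(z_i = j | x_i, theta_old)
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        r = pi * dens
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters that maximize the expected log-likelihood
        nk = r.sum(axis=0)
        pi = nk / n
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
    return pi, mu, var
```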
7
Definition
Grouping unlabeled data into clusters in order to infer
hidden structure or information
Dissimilarity measurement
Distance : Euclidean(L2), Manhattan(L1), …
Angle : Inner product, …
Non-metric : Rank, Intensity, …
Types of Clustering
Hierarchical
▪ Agglomerative or divisive
Partitioning
▪ K-means, VQ, MDS, …
(MATLAB help page)
8
Find K partitions with the total
intra-cluster variance minimized
Iterative method (see the sketch below)
Initialization : randomize the centers yi
Assignment step : assign each x to its nearest yi (yi fixed)
Update step : recompute each yi from its assigned x (x fixed)
Problem?
Can get trapped in local minima
(MacKay, 2003)
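A compact K-means sketch in NumPy (illustrative; the random initialization and the simple stopping rule are my own choices):

```python
import numpy as np

def kmeans(x, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and center-update steps."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Initialization: pick k random points as centers y_i
    y = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment: each point goes to the nearest center (y fixed)
        d = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update: each center becomes the mean of its assigned points (x fixed)
        new_y = np.array([x[labels == j].mean(axis=0) if np.any(labels == j) else y[j]
                          for j in range(k)])
        if np.allclose(new_y, y):
            break
        y = new_y
    return y, labels
```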
9
Deterministically avoid local minima
No stochastic process (random walk)
Tracing the global solution by changing the
level of randomness
Statistical Mechanics
(Maxima and Minima, Wikipedia)
Gibbs distribution
Helmholtz free energy F = D – TS
▪ Average energy D = ⟨Ex⟩
▪ Entropy S = − Σx P(Ex) ln P(Ex)
▪ F = −T ln Z
In DA, we minimize F
10
Analogy to physical annealing process
Control energy (randomness) by temperature (high → low)
Starting with a high temperature (T = ∞)
▪ Soft (or fuzzy) association probability
▪ Smooth cost function with one global minimum
Lowering the temperature (T → 0)
▪ Hard association
▪ Revealing full complexity; clusters emerge
Minimization of F, using E(x, yj) = ||x − yj||²
Iteratively,
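At a fixed temperature T, the standard deterministic-annealing updates (my transcription of the usual formulation, not the slide's exact equations) are:

```latex
P(y_j \mid x) = \frac{\exp\!\big(-\lVert x - y_j \rVert^2 / T\big)}
                     {\sum_{j'} \exp\!\big(-\lVert x - y_{j'} \rVert^2 / T\big)},
\qquad
y_j \leftarrow \frac{\sum_{x} P(y_j \mid x)\, x}{\sum_{x} P(y_j \mid x)}
```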
11
Definition
Process of transforming high-dimensional data into low-dimensional
data to improve accuracy, aid understanding, or remove noise
Curse of dimensionality
Complexity grows exponentially
in volume as extra dimensions
are added
Types
(Koppen, 2000)
Feature selection : Choose representatives (e.g., filter,…)
Feature extraction : Map to lower dim. (e.g., PCA, MDS, … )
12
Finding a map of principal components
(PCs) of data into an orthogonal space,
such that
y = Wᵀx, where W ∈ R^(d×h) (h ≪ d)
[Figure: 2-D data (axes x1, x2) with the principal component directions]
PCs – Variables with the largest variances
Orthogonality
Linearity – Optimal least mean-square error
Limitations?
Strict linearity
Specific distribution
Large variance assumption
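A small PCA sketch via an SVD of the centered data (illustrative code of mine, not the slide's; h is the target dimension):

```python
import numpy as np

def pca(x, h):
    """Project d-dimensional rows of x onto the top-h principal components."""
    x = np.asarray(x, dtype=float)
    xc = x - x.mean(axis=0)                  # center the data
    # Rows of vt are orthonormal directions of decreasing variance
    u, s, vt = np.linalg.svd(xc, full_matrices=False)
    w = vt[:h].T                             # W in R^(d x h), orthonormal columns
    y = xc @ w                               # y = W^T x for each (centered) row
    return y, w
```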
13
Like PCA, reduction of dimension by y = Rᵀx, where R is a
random matrix with i.i.d. columns and R ∈ R^(d×p) (p ≪ d)
Johnson-Lindenstrauss lemma
When projecting onto a randomly selected subspace, distances
are approximately preserved
Generating R
Hard to obtain orthogonalized R
Gaussian R
Simple approach
choose rij ∈ {+√3, 0, −√3} with probabilities 1/6, 4/6, 1/6,
respectively
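A sketch of this sparse random projection (illustrative; the 1/√p scaling is a standard convention I am adding, not stated on the slide):

```python
import numpy as np

def random_projection(x, p, seed=0):
    """Project d-dimensional rows of x to p dimensions with a sparse random matrix."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    d = x.shape[1]
    # Entries are +sqrt(3), 0, -sqrt(3) with probabilities 1/6, 4/6, 1/6
    r = rng.choice([np.sqrt(3), 0.0, -np.sqrt(3)], size=(d, p), p=[1/6, 4/6, 1/6])
    return x @ r / np.sqrt(p)   # scale so pairwise distances are preserved in expectation
```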
14
Dimension reduction preserving the pairwise proximities (distances)
observed in the original data set
Loss functions
Inner product
Distance
Squared distance
Classical MDS: minimizing STRAIN, given the dissimilarity matrix Δ
From Δ, find the inner-product matrix B (double centering)
From B, recover the coordinates X′ (i.e., B = X′X′ᵀ)
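A sketch of classical MDS via double centering and an eigendecomposition (illustrative; assumes delta is a symmetric matrix of pairwise dissimilarities):

```python
import numpy as np

def classical_mds(delta, h=2):
    """Classical (Torgerson) MDS: double-center the squared dissimilarities,
    then recover coordinates from the top-h eigenpairs of B."""
    delta = np.asarray(delta, dtype=float)
    n = delta.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (delta ** 2) @ j            # double centering: B = X'X'^T
    vals, vecs = np.linalg.eigh(b)             # eigenvalues in ascending order
    idx = np.argsort(vals)[::-1][:h]           # keep the h largest
    lam = np.clip(vals[idx], 0, None)
    return vecs[:, idx] * np.sqrt(lam)         # coordinates X'
```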
15
SMACOF : minimizing STRESS
Majorization – for a complex f(x),
find an auxiliary, simpler g(x, y) s.t. f(x) ≤ g(x, y) for all x, with f(y) = g(y, y)
Majorization for STRESS
(Cox, 2001)
Minimize tr(Xᵀ B(Y) Y), known as the Guttman transform
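For reference, the unweighted STRESS objective and the Guttman-transform update used by SMACOF can be written as (standard SMACOF formulation following Cox 2001, not transcribed from the slide):

```latex
\sigma(X) = \sum_{i<j} \big( d_{ij}(X) - \delta_{ij} \big)^2,
\qquad
X^{(k+1)} = \frac{1}{n}\, B\!\left(X^{(k)}\right) X^{(k)}
```

where B(Y) has off-diagonal entries b_ij = −δ_ij / d_ij(Y) (taken as 0 when d_ij(Y) = 0) and diagonal entries b_ii = −Σ_{j≠i} b_ij.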
16
Competitive and unsupervised learning process for
clustering and visualization
Learning
Choose the model vector mj most similar to the input vector xi
Update the winner and its
neighbors by
mk ← mk + α(t) h(t) (xi – mk)
α(t) : learning rate
h(t) : neighborhood size
Result : similar data getting closer in the model space
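A minimal 1-D SOM training sketch (illustrative; the exponential decay schedules for the learning rate and the neighborhood size are my own choices, not specified on the slide):

```python
import numpy as np

def train_som(x, n_models=10, n_iter=1000, seed=0):
    """1-D SOM: pick the winning model for each input, then pull the winner
    and its grid neighbors toward that input."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    m = x[rng.choice(len(x), size=n_models, replace=False)]    # model vectors m_j
    grid = np.arange(n_models)                                 # 1-D model grid
    for t in range(n_iter):
        xi = x[rng.integers(len(x))]
        alpha = 0.5 * np.exp(-t / n_iter)                      # learning rate alpha(t)
        sigma = n_models / 2 * np.exp(-t / n_iter)             # neighborhood size h(t)
        winner = np.argmin(np.linalg.norm(m - xi, axis=1))     # best-matching model
        h = np.exp(-((grid - winner) ** 2) / (2 * sigma ** 2)) # neighborhood weights
        m += alpha * h[:, None] * (xi - m)                     # m_k += a(t) h(t) (x_i - m_k)
    return m
```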
17
Definition
A procedure that divides data into a given set of categories,
based on the training set, in a supervised way
Generalization vs. specification
Hard to achieve both
Underfitting
Overfitting
Avoid overfitting (overtraining), e.g. by validation (see the k-fold sketch below)
▪ Early stopping
▪ Holdout validation
▪ K-fold cross-validation
▪ Leave-one-out cross-validation
[Figure: training error vs. validation error curves (Overfitting, Wikipedia)]
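A bare-bones k-fold cross-validation loop (illustrative; train_fn and error_fn are hypothetical placeholders for whatever model and error measure are being validated):

```python
import numpy as np

def k_fold_cv(x, y, train_fn, error_fn, k=5, seed=0):
    """Split the data into k folds; train on k-1 folds, validate on the held-out fold."""
    rng = np.random.default_rng(seed)
    x, y = np.asarray(x), np.asarray(y)
    idx = rng.permutation(len(x))
    folds = np.array_split(idx, k)
    errors = []
    for i in range(k):
        val = folds[i]                                    # held-out validation fold
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(x[train], y[train])              # fit on the remaining folds
        errors.append(error_fn(model, x[val], y[val]))    # validation error for this fold
    return float(np.mean(errors))                         # average validation error
```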
18
Perceptron : A computational unit with binary threshold
Weighted Sum
Activation Function
Abilities
Linearly separable decision surfaces
Represent Boolean functions (AND, OR, NOT)
Network (Multilayer) of perceptrons
Various network architectures and capabilities
(Jain, 1996)
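A tiny perceptron training sketch with a binary threshold unit, plus the AND example mentioned above (illustrative; the learning rate and epoch count are arbitrary choices of mine):

```python
import numpy as np

def train_perceptron(x, y, lr=0.1, epochs=100):
    """Perceptron rule: w += lr * (target - output) * input, with a bias term.
    y is expected to contain 0/1 labels."""
    x = np.asarray(x, dtype=float)
    w = np.zeros(x.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(x, y):
            out = 1.0 if xi @ w + b > 0 else 0.0     # weighted sum + binary threshold
            w += lr * (yi - out) * xi                # error-correction update
            b += lr * (yi - out)
    return w, b

# Example: the perceptron can represent AND
x = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])
w, b = train_perceptron(x, y)
```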
19
Learning weights – random initialization and updating
Error-correction training rules
Difference between training data and output: E(t,o)
Gradient descent (Batch learning)
▪ With E = Σi Ei, update weights using the gradient of the total error
Stochastic approach (On-line learning)
▪ Update gradient for each result
Various error functions
Adding a weight-regularization term (Σi wi²) to avoid overfitting
Adding momentum (α Δwi(n−1)) to expedite convergence
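Putting these pieces together, one common form of the weight update is (standard formulation; the symbols η, λ, and α for the learning rate, regularization weight, and momentum are mine, the slide does not name them):

```latex
\Delta w_i(n) = -\eta\, \frac{\partial E}{\partial w_i}
               \; - \; \eta \lambda\, w_i
               \; + \; \alpha\, \Delta w_i(n-1)
```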
20
Q: How to draw the optimal linear
separating hyperplane?
A: Maximizing margin
Margin maximization
The distance between H+1 and H−1 is 2/||w||
Thus, ||w|| should be minimized
Margin
21
Constrained optimization problem
Given a training set {xi, yi} (yi ∈ {+1, −1}):
Minimize : ½||w||² subject to yi(w · xi + b) ≥ 1 for all i
Lagrangian equation with saddle points
Minimized w.r.t. the primal variables w and b:
Maximized w.r.t. the dual variables αi (all αi ≥ 0)
xi with αi > 0 (not αi = 0) is called a support vector (SV)
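For reference, the Lagrangian and the resulting dual problem referred to here are (standard linear SVM derivation, not transcribed from the slide):

```latex
L(w, b, \alpha) = \tfrac{1}{2}\lVert w \rVert^2
  - \sum_i \alpha_i \big[ y_i (w \cdot x_i + b) - 1 \big],
\qquad
\max_{\alpha \ge 0}\ \sum_i \alpha_i
  - \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
\quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0
```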
22
Soft Margin (Non-separable case)
Slack variables ξi (with penalty parameter C)
Optimization with additional constraint
Non-linear SVM
Map the input non-linearly into a feature space
Kernel function k(x, y) = ⟨Φ(x), Φ(y)⟩
Kernel classifier with support vectors si
[Figure: mapping Φ from the input space to the feature space]
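The kernel classifier referred to here, in its standard form (my transcription, with si the support vectors and αi, yi, b as on the previous slides):

```latex
f(x) = \operatorname{sgn}\!\left( \sum_{i} \alpha_i\, y_i\, k(s_i, x) + b \right)
```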
23
Memory Architecture
Shared Memory
Symmetric Multiprocessor (SMP)
OpenMP, POSIX threads (pthreads), MPI
Easy to manage but expensive
Distributed Memory
Commodity, off-the-shelf processors
MPI
Cost effective but hard to maintain
(Barney, 2007)
Decomposition Strategy
Task – E.g., Word, IE, …
Data – scientific problem
Pipelining – Task + Data
(Barney, 2007)
24
Shrinking
Recall : only support vectors (αi > 0) are
used in the SVM optimization
Predict whether each data point is an SV or a non-SV
Remove non-SVs from the problem space
Parallel SVM
Partition the problem
Merge data hierarchically
Each unit finds support vectors
Loop until convergence (see the sketch below)
(Graf, 2005)
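A rough sketch of the cascade idea described by Graf (2005), using scikit-learn's SVC purely as a stand-in local solver (the partitioning, the pairwise merging, and the single outer pass are simplifications of mine, not the paper's exact scheme):

```python
import numpy as np
from sklearn.svm import SVC

def cascade_svm_pass(x, y, n_parts=4):
    """One cascade pass: train an SVM on each partition, keep only its
    support vectors, then merge the surviving points pairwise up the tree."""
    x, y = np.asarray(x, dtype=float), np.asarray(y)
    idx = np.array_split(np.random.permutation(len(x)), n_parts)
    subsets = [(x[i], y[i]) for i in idx]
    while len(subsets) > 1:
        merged = []
        for j in range(0, len(subsets), 2):
            xa, ya = subsets[j]
            if j + 1 < len(subsets):                      # merge two partitions
                xb, yb = subsets[j + 1]
                xa, ya = np.vstack([xa, xb]), np.concatenate([ya, yb])
            svm = SVC(kernel="linear").fit(xa, ya)        # each unit finds its SVs
            sv = svm.support_                             # indices of support vectors
            merged.append((xa[sv], ya[sv]))               # non-SVs are dropped
        subsets = merged
    # Final model trained on the surviving support vectors
    xf, yf = subsets[0]
    return SVC(kernel="linear").fit(xf, yf)
```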
25
26