Brief discussion on
DSMGA-II,
Optimization,
and Clustering
Peng Chun-Jen
Taiwan Evolutionary Intelligence Laboratory
2016/09/21 Group Meeting Presentation
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
– MIMIC : A distribution approach of MBGA
– Kernel K-means : A clustering approach
• Feature Selection and Feature Transformation
– Selection methods : Filtering and Wrapping
– Transformation methods: PCA and ICA
Disclaimer
• I’m sharing things I wish I knew when I was a noob
• Mostly qualitative descriptions, since I’m still a noob
• This is a “subjective” review of the Udacity ML lessons
Semi-supervised vs. Unsupervised
• Definition by Wiki
– Semi-supervised learning is a class of supervised learning
tasks and techniques that also make use of unlabeled data
for training – typically a small amount of labeled data
with a large amount of unlabeled data.
– Unsupervised learning is the ML task of inferring a
function to describe hidden structure from unlabeled data.
Semi-supervised vs. Unsupervised
Reference: https://upload.wikimedia.org/wikipedia/commons/d/d0/Example_of_unlabeled_data_in_semisupervised_learning.png
Reference: http://insidebigdata.com/wp-content/uploads/2014/11/customer_segmentation_vert.jpg
(Semi-)Unsupervised Learning
• Definition by Wiki
– Unsupervised learning is the ML task of inferring a
function to describe hidden structure from unlabeled data.
• Optimization : GA!!
• Clustering : MBGA??
• Blind signal separation : MBGA??
• Anomaly detection, NN, EM, …
Optimization
• Small input space: Generate and test
• Derivative : Calculus, Newton’s Method
• Real-world:
– Large input space
– No derivative
– Complex function
– Possibly multiple local optima
Reference: https://www.rmiguides.com/international/_images/slideshows/everest/5.jpg
Optimization
• Definition :
– Find $x_{\max} \in X$ such that $f(x_{\max}) = \max_{x \in X} f(x)$
• Hill Climbing
– Stuck at local max
• Random Restart
– Might not have
points in max hill
Reference:
https://classconnection.s3.amazonaws.com/608/flashcards/1910608/jpg/fig_4_1_hillclimbing1349062649219.jpg
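To make the hill-climbing and random-restart ideas above concrete, here is a minimal sketch on a toy bit-string objective; the objective, the bit-flip neighborhood, and the restart count are illustrative assumptions, not anything prescribed by the slides.

```python
import random

def hill_climb(objective, x):
    """Greedy bit-flip hill climbing: accept any single-bit flip that improves f."""
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            y = x[:]
            y[i] ^= 1                      # flip one bit = move to a neighbor
            if objective(y) > objective(x):
                x, improved = y, True
    return x

def random_restart_hill_climb(objective, n_bits, restarts=30):
    """Run hill climbing from several random points; keep the best local optimum found."""
    best = None
    for _ in range(restarts):
        start = [random.randint(0, 1) for _ in range(n_bits)]
        local_opt = hill_climb(objective, start)
        if best is None or objective(local_opt) > objective(best):
            best = local_opt
    return best

if __name__ == "__main__":
    # Toy objective: mostly one-max, with a hidden bonus for the all-zero string.
    def f(x):
        ones = sum(x)
        return len(x) + 1 if ones == 0 else ones

    best = random_restart_hill_climb(f, n_bits=20)
    print(best, f(best))
```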
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
– Parameter tuning
– Need random restart
Reference: http://i.makeagif.com/media/3-09-2014/XT0sjr.gif
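A minimal simulated-annealing sketch for the same kind of bit-string objective; the geometric cooling schedule and the temperature values are illustrative assumptions, and they are exactly the kind of parameters the slide flags as needing tuning.

```python
import math
import random

def simulated_annealing(objective, n_bits, t_start=5.0, t_end=0.01, alpha=0.98, steps_per_t=50):
    """Bit-flip simulated annealing with geometric cooling (illustrative parameters)."""
    x = [random.randint(0, 1) for _ in range(n_bits)]
    fx = objective(x)
    best, fbest = x[:], fx
    t = t_start
    while t > t_end:
        for _ in range(steps_per_t):
            y = x[:]
            y[random.randrange(n_bits)] ^= 1   # random neighbor: flip one bit
            fy = objective(y)
            # Always accept improvements; accept worse moves with probability exp(delta / t).
            if fy >= fx or random.random() < math.exp((fy - fx) / t):
                x, fx = y, fy
                if fx > fbest:
                    best, fbest = x[:], fx
        t *= alpha                              # geometric cooling
    return best, fbest

if __name__ == "__main__":
    print(simulated_annealing(lambda x: sum(x), n_bits=30))
```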
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
– Parameter tuning
– Need random restart
Reference: http://www.itm.uni-stuttgart.de/research/alpso/bilder/pso.gif
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• Particle Swarm Optimization
Reference: http://www.frontiersin.org/files/Articles/126819/fendo-05-00239HTML/image_m/fendo-05-00239-g001.jpg
– Only compares itself with its nearest neighbor (and the global best)
– The population converges easily
and gets stuck at a misleading local optimum (see the sketch below)
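A minimal particle-swarm sketch on a continuous toy function. The inertia and attraction coefficients are common textbook-style defaults rather than values from the slides, and for brevity this version uses only each particle's own best and the global best (no nearest-neighbor topology).

```python
import random

def pso(objective, dim, n_particles=30, iters=200, w=0.7, c1=1.5, c2=1.5, bound=5.0):
    """Global-best PSO minimizing `objective` (illustrative coefficients)."""
    pos = [[random.uniform(-bound, bound) for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]                       # each particle's own best position
    pbest_val = [objective(p) for p in pos]
    g = min(range(n_particles), key=lambda i: pbest_val[i])
    gbest, gbest_val = pbest[g][:], pbest_val[g]      # swarm's global best

    for _ in range(iters):
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = random.random(), random.random()
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])   # pull toward own best
                             + c2 * r2 * (gbest[d] - pos[i][d]))     # pull toward global best
                pos[i][d] += vel[i][d]
            val = objective(pos[i])
            if val < pbest_val[i]:
                pbest[i], pbest_val[i] = pos[i][:], val
                if val < gbest_val:
                    gbest, gbest_val = pos[i][:], val
    return gbest, gbest_val

if __name__ == "__main__":
    sphere = lambda x: sum(v * v for v in x)
    print(pso(sphere, dim=5))
```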
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• PSO
Reference: http://giphy.com/gifs/9Mld29SbBA9xe
• Genetic Algorithm
– PSO: crossover evenly spreads information in the population
– SA: uses XO and mutation to generate better neighbors
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• PSO
• Genetic Algorithm
Reference:
https://classconnection.s3.amazonaws.com/608/flashcards/1910608/jpg/fig_4_1_hillclimbing1349062649219.jpg
– Parameter tuning (population size)
– NFE (number of function evaluations) wasted on ineffective points
Optimization
• Hill Climbing
• Random Restart
• Simulated Annealing
• PSO
• Genetic Algorithm
Lessons:
• Multiple points (population)
• Delayed reward
• Neighbors’ info.
• Info. spread in population
Optimization
• Genetic Algorithm
– How to spread information among population?
• Need to learn structure before XO
• Lots of work done in MBGA field
– How to place new neighbors in a promising neighborhood?
• Need to capture history and distribution
• (A more ES-like approach in a population-less GA?
Use a TABU distribution to generate new chromosomes)
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
– MIMIC : A distribution approach of MBGA
– Kernel K-means : A clustering approach
• Feature Selection and Feature Transformation
– Selection methods : Filtering and Wrapping
– Transformation methods: PCA and ICA
Improve DSMGA-II
• Mask order selection of DSMGA-II
– Use mask(s) to generate N clusters,
and estimate which clustering is better for choosing masks
• How to generate clusters with only a distance matrix?
– PAM, Kernel K-means, or MDS : A clustering approach
• How to evaluate the clusters?
– Kullback-Leibler Divergence : A distribution approach
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
– MIMIC : A distribution approach of MBGA
– Kernel K-means : A clustering approach
• Feature Selection and Feature Transformation
– Selection methods : Filtering and Wrapping
– Transformation methods: PCA and ICA
EDA (Estimation of Distribution Algorithms)
• Learn Structure with Distribution
– Generates new individuals from a probability distribution
instead of crossover or mutation
Reference: Prof. Yu’s GA Lecture Slides, Lecture 07 Road to Competence slide 27
EDA (Estimation of Distribution Algorithms)
• Learn Structure with Distribution
– Generates new individuals
from a probability distribution
instead of crossover or mutation
Reference: https://www.wikiwand.com/en/Estimation_of_distribution_algorithm
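As a deliberately simple instance of "generate new individuals from a probability distribution instead of crossover or mutation", here is a univariate-marginal EDA sketch (UMDA-style, with truncation selection chosen for illustration). It models each bit independently, which is exactly the limitation MIMIC's pairwise model addresses next.

```python
import random

def umda(objective, n_bits, pop_size=100, n_select=50, generations=50):
    """Univariate EDA: estimate per-bit marginals from selected parents, then resample."""
    pop = [[random.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=objective, reverse=True)
        parents = pop[:n_select]                       # truncation selection
        # Model: an independent Bernoulli probability per bit.
        p = [sum(ind[i] for ind in parents) / n_select for i in range(n_bits)]
        # Sample a new population from the model (no crossover, no mutation).
        pop = [[1 if random.random() < p[i] else 0 for i in range(n_bits)]
               for _ in range(pop_size)]
    return max(pop, key=objective)

if __name__ == "__main__":
    best = umda(lambda x: sum(x), n_bits=40)
    print(sum(best))
```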
MIMIC
• Mutual Information Maximization for Input Clustering
• MIMIC is an EDA proposed in [de Bonet et al., 1997]
MIMIC
• Assumption : Variables are dependent
– Generates new individuals from a probability distribution
– Find distribution with dependency tree
$X = x_1 x_2 x_3 \dots x_n$
$P(X) = P(x_1 \mid x_2 x_3 \dots x_n)\, P(x_2 \mid x_3 \dots x_n) \cdots P(x_{n-1} \mid x_n)\, P(x_n) \;\Rightarrow\; \text{too difficult}$
$P_\pi(X) = P(x_{i_1} \mid x_{i_2})\, P(x_{i_2} \mid x_{i_3}) \cdots P(x_{i_{n-1}} \mid x_{i_n})\, P(x_{i_n}) \;\Rightarrow\; \text{easier to generate samples}$
– Need to estimate how well $P_\pi(X)$ describes $P(X)$
Kullback-Leibler Divergence
• Also known as discrimination info., relative entropy
– The amount of information lost
when Q is used to approximate P
– Discrete: $D_{KL}(P \,\|\, Q) = \sum_i P(i)\, \log \frac{P(i)}{Q(i)}$
– Continuous: $D_{KL}(P \,\|\, Q) = \int p(x)\, \log \frac{p(x)}{q(x)}\, dx$
– Properties
$D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$
Reference: https://www.wikiwand.com/en/Kullback%E2%80%93Leibler_divergence
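A small sketch of the discrete definition above, which also shows the asymmetry $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$; the two example distributions are arbitrary.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) = sum_i p_i * log(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

if __name__ == "__main__":
    P = [0.5, 0.3, 0.2]
    Q = [0.4, 0.4, 0.2]
    print(kl_divergence(P, Q))   # D_KL(P || Q)
    print(kl_divergence(Q, P))   # D_KL(Q || P), generally a different value
```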
Kullback-Leibler Divergence
For $P_\pi(X)$ to describe $P(X)$ best, we have to minimize $D_{KL}(P \,\|\, P_\pi)$
Need to minimize:
$J'_\pi(X) = -\sum_i h(x_i) + \sum_i h\big(x_i \mid \pi(x_i)\big) = -\sum_i I\big(x_i;\, \pi(x_i)\big)$
Minimizing $-\sum_i I(x_i;\, \pi(x_i))$ is maximizing the mutual information $\sum_i I(x_i;\, \pi(x_i))$
Kullback-Leibler Divergence
We need maximum mutual information between all dependent variables:
maximizing the mutual information $I(x_i;\, \pi(x_i))$ makes $P_\pi(X)$ describe $P(X)$ best, i.e. minimizes $D_{KL}(P \,\|\, P_\pi)$.
So given an MI matrix,
we need to find a
Maximum Spanning Tree!
For a dense matrix, use Prim's algorithm.
Reference: Prof. Yu’s GA Lecture Slides, Lecture 07a-LTGA DSMGA2
Prim’s Algorithm
Given an MI matrix, we need to find a Maximum Spanning Tree!
Start with a random node; the resulting maximum spanning tree is similar to the linkage structure LTGA builds (see the sketch below)
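A minimal Prim-style sketch that grows a maximum spanning tree from a dense symmetric mutual-information matrix, starting from an arbitrary node; the small MI matrix is made up for illustration.

```python
def max_spanning_tree(mi):
    """Prim's algorithm on a dense symmetric matrix, maximizing total edge weight.
    Returns a list of (parent, child) edges."""
    n = len(mi)
    in_tree = {0}                                  # start from an arbitrary node (node 0)
    best_weight = [mi[0][j] for j in range(n)]     # strongest known link into the tree
    best_parent = [0] * n
    edges = []
    while len(in_tree) < n:
        # Pick the outside node with the strongest connection to the current tree.
        j = max((j for j in range(n) if j not in in_tree), key=lambda j: best_weight[j])
        edges.append((best_parent[j], j))
        in_tree.add(j)
        for k in range(n):
            if k not in in_tree and mi[j][k] > best_weight[k]:
                best_weight[k], best_parent[k] = mi[j][k], j
    return edges

if __name__ == "__main__":
    # Toy symmetric MI matrix for 4 variables (values are illustrative only).
    mi = [[0.0, 0.9, 0.1, 0.2],
          [0.9, 0.0, 0.3, 0.1],
          [0.1, 0.3, 0.0, 0.8],
          [0.2, 0.1, 0.8, 0.0]]
    print(max_spanning_tree(mi))
```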
MIMIC Lessons
• Assumption : Variables are dependent
– $P_\pi(X)$ uses pair-wise mutual information
• Find distribution with dependency tree
– The structure MIMIC generates is similar to LTGA's
– What about ILS?
• Use KLD to optimize the distribution
– Might be a good tool to estimate models in DSMGA-II
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
– MIMIC : A distribution approach of MBGA
– Kernel K-means : A clustering approach
• Feature Selection and Feature Transformation
– Selection methods : Filtering and Wrapping
– Transformation methods: PCA and ICA
Clustering
• Connectivity-based clustering (Hierarchical clustering)
– Single Linkage Clustering
• Centroid-based clustering
– K-means
• Soft clustering
– EM
Clustering
• Connectivity-based clustering (Hierarchical clustering)
– Single Linkage Clustering (LTGA, ILS)
Reference: http://www.multid.se/genex/onlinehelp/clustering_distances.png
Reference: http://image.slidesharecdn.com/log6kntt4i4dgwfwbpxw-signature75c4ed0a4b22d2fef90396cdcdae85b38911f9dce0924a5c52076dfcf2a19db1-poli151222094044/95/hierarchical-clustering-15-638.jpg?cb=1450777272
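Since the slide ties single-linkage clustering to LTGA/ILS-style linkage, here is a short SciPy sketch of single-linkage clustering driven only by a precomputed pairwise distance matrix; the toy matrix and the choice of two clusters are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

# Toy symmetric distance matrix for 5 items (illustrative values only).
D = np.array([[0, 1, 6, 7, 8],
              [1, 0, 6, 7, 8],
              [6, 6, 0, 1, 2],
              [7, 7, 1, 0, 1],
              [8, 8, 2, 1, 0]], dtype=float)

# Single-linkage hierarchical clustering needs only pairwise distances,
# passed here in SciPy's condensed (upper-triangle) form.
Z = linkage(squareform(D), method="single")
print(fcluster(Z, t=2, criterion="maxclust"))   # cut the tree into 2 clusters
```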
Clustering
• Centroid-based clustering
– K-means
– Assumes the structure is
centroid-based?
– Needs Euclidean distances,
so it cannot be applied directly
Reference: http://simplystatistics.org/2014/02/18/k-means-clustering-in-a-gif/
Clustering
• Soft clustering
– Expectation Maximization
– Solve overlapping?
– Solve Cross-competition?
Reference: https://upload.wikimedia.org/wikipedia/commons/6/69/EM_Clustering_of_Old_Faithful_data.gif
DSMGA-II Clustering
• Given a distance matrix, how to find clusters?
– K-medoids, aka PAM (Partitioning Around Medoids)
– Kernel K-means
– MDS (Multidimensional scaling)
K-medoids
• Fewer requirements on the data than k-means
• K-means needs Euclidean distances to find means
• K-medoids only needs pairwise dissimilarities between data points
• Use (1 / Mutual Information) as the dissimilarity (see the sketch below)
Reference: http://cfile5.uf.tistory.com/image/033D5E35511F467D29C80E
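A minimal PAM-style sketch that works directly on a dissimilarity matrix, such as one built from 1 / mutual information; the alternating assign/update loop and the toy matrix are illustrative, not the exact PAM variant cited in the references.

```python
import random

def k_medoids(dist, k, iters=100, seed=0):
    """Simple PAM-style clustering driven only by a pairwise dissimilarity matrix."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[min(medoids, key=lambda m: dist[i][m])].append(i)
        # Update step: inside each cluster, pick the point with minimal total dissimilarity.
        new_medoids = [min(members, key=lambda c: sum(dist[c][o] for o in members))
                       for members in clusters.values()]
        if sorted(new_medoids) == sorted(medoids):
            break                                  # medoids stopped moving: converged
        medoids = new_medoids
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels

if __name__ == "__main__":
    # Toy symmetric dissimilarity matrix for 5 items (illustrative values only).
    D = [[0, 1, 9, 8, 9],
         [1, 0, 8, 9, 9],
         [9, 8, 0, 1, 2],
         [8, 9, 1, 0, 1],
         [9, 9, 2, 1, 0]]
    print(k_medoids(D, k=2))
```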
The Kernel Trick Revisited
Credit: Radha Chitta’s slides about Kernel K-means, April 16, 2013
The Kernel Trick Revisited
• Map points to feature space using a basis function $\varphi(x)$
• Replace the dot product $\varphi(x) \cdot \varphi(y)$ with the kernel entry $K(x, y)$
• Mercer's condition:
To expand a kernel function $K(x,y)$ into a dot product, i.e. $K(x,y) = \varphi(x) \cdot \varphi(y)$, $K(x,y)$ has to be a positive semi-definite function, i.e., for any function $f(x)$ whose $\int f(x)^2\,dx$ is finite, the following inequality holds:
$\iint f(x)\, K(x,y)\, f(y)\, dx\, dy \geq 0$
Credit: Radha Chitta’s slides about Kernel K-means, April 16, 2013
Kernel k-means
Minimize the sum of squared error:
$\min \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}\, \| x_i - c_j \|^2$
Kernel k-means: replace $x_i$ with $\varphi(x_i)$:
$\min \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}\, \| \varphi(x_i) - \tilde c_j \|^2$,
where $u_{ij} \in \{0, 1\}$ and $\sum_{j=1}^{m} u_{ij} = 1$
Credit: Radha Chitta’s slides about Kernel K-means, April 16, 2013
Kernel k-means
• Cluster centers:
$\tilde c_j = \frac{1}{n_j} \sum_{i=1}^{n} u_{ij}\, \varphi(x_i)$
• Substitute for the centers:
$\sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}\, \big\| \varphi(x_i) - \tilde c_j \big\|^2 = \sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}\, \Big\| \varphi(x_i) - \frac{1}{n_j} \sum_{l=1}^{n} u_{lj}\, \varphi(x_l) \Big\|^2$
Credit: Radha Chitta’s slides about Kernel K-means, April 16, 2013
Kernel k-means
• Use the kernel trick:
$\sum_{i=1}^{n} \sum_{j=1}^{m} u_{ij}\, \| \varphi(x_i) - \tilde c_j \|^2 = \mathrm{trace}(K) - \mathrm{trace}(\tilde U^{\top} K \tilde U)$
• Optimization problem:
$\min\ \mathrm{trace}(K) - \mathrm{trace}(\tilde U^{\top} K \tilde U) \;\Leftrightarrow\; \max\ \mathrm{trace}(\tilde U^{\top} K \tilde U)$
• $K$ is the $n \times n$ kernel matrix, $\tilde U$ is the optimal
normalized cluster membership matrix
Credit: Radha Chitta’s slides about Kernel K-means, April 16, 2013
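A minimal kernel k-means sketch that works purely from a precomputed n x n kernel matrix K, expanding $\|\varphi(x_i) - \tilde c_j\|^2$ in terms of kernel entries instead of the trace form above; the RBF kernel and the toy data are illustrative.

```python
import numpy as np

def kernel_kmeans(K, k, iters=100, seed=0):
    """Kernel k-means using only the n x n kernel matrix K (no explicit feature map)."""
    rng = np.random.default_rng(seed)
    n = K.shape[0]
    labels = rng.integers(0, k, size=n)          # random initial assignment
    for _ in range(iters):
        dist = np.zeros((n, k))
        for j in range(k):
            members = np.where(labels == j)[0]
            if len(members) == 0:                # re-seed an empty cluster
                members = rng.integers(0, n, size=1)
            Kj = K[np.ix_(members, members)]
            # ||phi(x_i) - c_j||^2 = K_ii - 2/|C| * sum_l K_il + 1/|C|^2 * sum_lm K_lm
            dist[:, j] = (np.diag(K)
                          - 2.0 * K[:, members].mean(axis=1)
                          + Kj.mean())
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels

if __name__ == "__main__":
    # Toy data: two concentric rings (a classic case where plain k-means struggles).
    rng = np.random.default_rng(1)
    angles = rng.uniform(0, 2 * np.pi, 100)
    radii = np.r_[np.full(50, 1.0), np.full(50, 4.0)] + rng.normal(0, 0.1, 100)
    X = np.c_[radii * np.cos(angles), radii * np.sin(angles)]
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / 2.0)                        # RBF kernel matrix
    print(kernel_kmeans(K, k=2)[:10])
```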
MDS
• Multidimensional scaling
– Place each object in N-dimensional space so that the
between-object distances are preserved as well as possible
– Then run K-means on the projected 2-D Euclidean coordinates (see the sketch below)
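A short scikit-learn sketch of this route, assuming a precomputed dissimilarity matrix D (again, e.g. 1 / MI): `MDS(dissimilarity='precomputed')` embeds the items in 2-D, and plain `KMeans` then runs on those coordinates.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.manifold import MDS

def mds_then_kmeans(D, k, seed=0):
    """Embed a dissimilarity matrix into 2-D with metric MDS, then run ordinary k-means."""
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=seed).fit_transform(D)
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(coords)
    return coords, labels

if __name__ == "__main__":
    # Toy symmetric dissimilarity matrix for 6 items (illustrative values only).
    D = np.array([[0, 1, 1, 9, 9, 9],
                  [1, 0, 1, 9, 9, 9],
                  [1, 1, 0, 9, 9, 9],
                  [9, 9, 9, 0, 1, 1],
                  [9, 9, 9, 1, 0, 1],
                  [9, 9, 9, 1, 1, 0]], dtype=float)
    coords, labels = mds_then_kmeans(D, k=2)
    print(labels)
```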
Outline
• Brief on Optimization
• Mask order selection of DSMGA-II
– MIMIC : A distribution approach of MBGA
– Kernel K-means : A clustering approach of MBGA
• Feature Selection and Feature Transformation
– Selection methods : Filtering and Wrapping
– Transformation methods: PCA and ICA
Feature extractor/generator
[Diagram: Data → feature selection / feature transformation → Features → Agent → Classifications, guided by a Goal]
PCA
• Principal Component Analysis
– first principal component has the largest possible variance
Reference: http://weigend.com/files/teaching/stanford/2008/stanford2008.wikispaces.com/file/view/pca_example.gif
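A tiny NumPy sketch of this idea: center the data, take the SVD, and confirm that the first principal component carries the largest share of the variance; the correlated toy data are arbitrary.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA via SVD of the centered data; components are ordered by explained variance."""
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    explained_var = S ** 2 / (len(X) - 1)           # variance along each component
    return Vt[:n_components], explained_var[:n_components]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy data: strongly correlated 2-D points, so one direction dominates the variance.
    t = rng.normal(size=500)
    X = np.c_[t, 2 * t + rng.normal(scale=0.3, size=500)]
    components, var = pca(X)
    print(components[0], var)    # first PC has the largest possible variance
```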
ICA
• Independent Component Analysis
– separate independent sources linearly mixed in sensors
Reference: https://onionesquereality.files.wordpress.com/2010/01/cocktail-party-problem.jpg
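A short scikit-learn sketch of the cocktail-party idea: mix two independent sources with a made-up mixing matrix, then let `FastICA` recover them (up to scale and order).

```python
import numpy as np
from sklearn.decomposition import FastICA

# Two independent sources: a sine wave and a square wave (illustrative signals).
t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]

# Linearly mix the sources, as if recorded by two "sensors".
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])
X = S @ A.T

# FastICA tries to recover statistically independent components (up to scale/order).
ica = FastICA(n_components=2, random_state=0)
S_est = ica.fit_transform(X)
print(S_est.shape, ica.mixing_.shape)
```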
Reference
1. De Bonet, J. S., Isbell, C. L., & Viola, P. (1997). MIMIC: Finding optima by estimating probability densities. Advances in Neural Information Processing Systems, 424-430.
2. Dhillon, I. S., Guan, Y., & Kulis, B. (2004, August). Kernel k-means: spectral clustering and normalized cuts. In Proceedings of the tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 551-556). ACM.
3. Van der Laan, M., Pollard, K., & Bryan, J. (2003). A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation, 73(8), 575-584.
4. Kleinberg, J. (2003). An impossibility theorem for clustering. Advances in Neural Information Processing Systems, 463-470.
5. http://alexhwilliams.info/itsneuronalblog/2015/10/01/clustering2/