Supervised Learning Recap
Machine Learning
Last Time
• Support Vector Machines
• Kernel Methods
Today
• Review of Supervised Learning
• Unsupervised Learning
– (Soft) K-means clustering
– Expectation Maximization
– Spectral Clustering
– Principal Components Analysis
– Latent Semantic Analysis
Supervised Learning
• Linear Regression
• Logistic Regression
• Graphical Models
– Hidden Markov Models
• Neural Networks
• Support Vector Machines
– Kernel Methods
Major concepts
• Gaussian, Multinomial, Bernoulli Distributions
• Joint vs. Conditional Distributions
• Marginalization
• Maximum Likelihood
• Risk Minimization
• Gradient Descent
• Feature Extraction, Kernel Methods
Some favorite distributions
• Bernoulli
• Multinomial
• Gaussian
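As a quick illustration (not taken from the slides), a minimal NumPy sketch of drawing samples from each of these three distributions; the parameter values are arbitrary:

```python
# Sampling sketch for the three distributions; parameters are illustrative.
import numpy as np

rng = np.random.default_rng(0)

# Bernoulli(p): a single binary outcome with success probability p.
bernoulli_draws = rng.binomial(n=1, p=0.3, size=5)

# Multinomial(n, p): counts over k categories from n independent trials.
multinomial_draw = rng.multinomial(n=10, pvals=[0.2, 0.5, 0.3])

# Gaussian(mu, sigma): continuous, bell-shaped density.
gaussian_draws = rng.normal(loc=0.0, scale=1.0, size=5)

print(bernoulli_draws, multinomial_draw, gaussian_draws)
```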
Maximum Likelihood
• Identify the parameter values that yield the
maximum likelihood of generating the
observed data.
• Take the partial derivative of the likelihood function
• Set to zero
• Solve
• NB: maximum likelihood parameters are the
same as maximum log likelihood parameters
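As a worked example of this recipe (a standard derivation, added here for concreteness), the maximum-likelihood estimate of a Bernoulli parameter from observations x_1, ..., x_N in {0, 1}:

```latex
\begin{aligned}
\log L(\theta) &= \sum_{i=1}^{N} \log\big[\theta^{x_i}(1-\theta)^{1-x_i}\big]
  = m\log\theta + (N-m)\log(1-\theta), \qquad m = \sum_{i=1}^{N} x_i \\
\frac{\partial \log L}{\partial \theta} &= \frac{m}{\theta} - \frac{N-m}{1-\theta} = 0
  \quad\Longrightarrow\quad \hat{\theta}_{\mathrm{ML}} = \frac{m}{N}
\end{aligned}
```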
Maximum Log Likelihood
• Why do we like the log function?
• It turns products (difficult to differentiate) into sums (easy to differentiate)
• log(xy) = log(x) + log(y)
• log(x^c) = c log(x)
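Applied to an i.i.d. Gaussian likelihood, for example, these identities turn the product into a sum and cancel the exponential:

```latex
\log \prod_{i=1}^{N} \mathcal{N}(x_i \mid \mu, \sigma^2)
  = \sum_{i=1}^{N}\Big[-\tfrac{1}{2}\log(2\pi\sigma^2) - \frac{(x_i-\mu)^2}{2\sigma^2}\Big]
```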
Risk Minimization
• Pick a loss function
– Squared loss
– Linear loss
– Perceptron (classification) loss
• Identify the parameters that minimize the loss
function.
– Take the partial derivative of the loss function
– Set to zero
– Solve
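A minimal sketch of this procedure for squared loss on a linear model, using gradient descent rather than the closed-form solution; the synthetic data, step size, and iteration count are illustrative assumptions:

```python
# Empirical risk minimization with squared loss on a linear model y ~ w.x,
# solved by gradient descent on synthetic data.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)

w = np.zeros(3)
lr = 0.1
for _ in range(500):
    residual = X @ w - y              # prediction error per example
    grad = X.T @ residual / len(y)    # gradient of (1/2) * mean squared loss
    w -= lr * grad                    # descend the gradient

print(w)  # should be close to true_w
```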
Frequentists v. Bayesians
• Point estimates vs. Posteriors
• Risk Minimization vs. Maximum Likelihood
• L2-Regularization
– Frequentists: Add a constraint on the size of the
weight vector
– Bayesians: Introduce a zero-mean prior on the
weight vector
– Result is the same!
L2-Regularization
• Frequentists:
– Introduce a cost on the size of the weights
• Bayesians:
– Introduce a prior on the weights
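One standard way to see why the two views coincide, assuming a Gaussian likelihood with noise variance σ² and a zero-mean Gaussian prior on w with variance τ² (these symbols are not on the slide):

```latex
\begin{aligned}
\text{Frequentist (ridge):}\quad & \hat{w} = \arg\min_{w}\;\sum_{i=1}^{N}\big(y_i - w^{\top}x_i\big)^2 + \lambda\lVert w\rVert_2^2 \\
\text{Bayesian (MAP):}\quad & \hat{w} = \arg\min_{w}\;\frac{1}{2\sigma^2}\sum_{i=1}^{N}\big(y_i - w^{\top}x_i\big)^2 + \frac{1}{2\tau^2}\lVert w\rVert_2^2
\end{aligned}
```

The two objectives coincide, up to an overall scale, when λ = σ² / τ², so the regularization weight plays the role of the prior precision.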
Types of Classifiers
• Generative Models
– Highest resource requirements.
– Need to approximate the joint probability
• Discriminative Models
– Moderate resource requirements.
– Typically fewer parameters to approximate than
generative models
• Discriminant Functions
– Can be trained probabilistically, but the output does not
include confidence information
Linear Regression
• Fit a line to a set of points
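A minimal sketch of fitting the line by least squares with the normal-equation solver in NumPy; the synthetic data below is only for illustration:

```python
# Fit y ~ w1*x + w0 to noisy points by least squares.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=50)

A = np.column_stack([x, np.ones_like(x)])         # design matrix [x, 1]
w, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print(w)                                          # approximately [2.0, 1.0]
```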
Linear Regression
• Extension to higher dimensions
– Polynomial fitting
– Arbitrary function fitting
• Wavelets
• Radial basis functions
• Classifier output
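A hedged sketch of the same idea with a nonlinear feature map: Gaussian radial basis functions stand in for the raw input, and the fit stays linear in the weights. The centers, width, and target function are arbitrary choices:

```python
# Least squares on a radial-basis-function expansion of a scalar input.
import numpy as np

def rbf_features(x, centers, width=1.0):
    """Map scalar inputs to Gaussian RBF activations, one per center."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * width ** 2))

rng = np.random.default_rng(0)
x = np.linspace(0, 2 * np.pi, 100)
y = np.sin(x) + 0.1 * rng.normal(size=x.shape)

centers = np.linspace(0, 2 * np.pi, 10)
Phi = rbf_features(x, centers)                    # (100, 10) design matrix
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)       # still linear in the weights
y_hat = Phi @ w                                   # fitted curve
```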
Logistic Regression
• Fit Gaussians to the data for each class
• The decision boundary is where the PDFs cross
• No closed-form solution when the gradient is set to zero
• Gradient Descent
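A minimal sketch of training binary logistic regression by gradient descent on the negative log likelihood; the two-class synthetic data, learning rate, and iteration count are illustrative assumptions:

```python
# Binary logistic regression trained by gradient descent.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, size=(50, 2)),
               rng.normal(+1, 1, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(1000):
    p = sigmoid(X @ w + b)            # predicted P(y = 1 | x)
    grad_w = X.T @ (p - y) / len(y)   # gradient of mean negative log likelihood
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b
```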
Graphical Models
• General way to describe the dependence
relationships between variables.
• Junction Tree Algorithm allows us to efficiently
calculate marginals over any variable.
Junction Tree Algorithm
• Moralization
– “Marry the parents”
– Make undirected
• Triangulation
– Add chords so that no chordless cycle of length four or more remains
• Junction Tree Construction
– Identify separators such that the running intersection
property holds
• Introduction of Evidence
– Pass slices around the junction tree to generate
marginals
Hidden Markov Models
• Sequential Modeling
– Generative Model
• Relationship between observations and state
(class) sequences
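For concreteness, a small sketch of the standard forward algorithm, which scores an observation sequence under an HMM by summing over all hidden state paths; the toy parameters below are made up:

```python
# Forward algorithm for an HMM: pi is the initial state distribution,
# A the state transition matrix, B[s, o] = P(observation o | state s).
import numpy as np

def forward_likelihood(obs, pi, A, B):
    """Return P(obs sequence) by summing over all hidden state paths."""
    alpha = pi * B[:, obs[0]]              # joint of first state and first obs
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]      # propagate through A, then emit
    return alpha.sum()

# Toy 2-state, 2-symbol example with made-up parameters.
pi = np.array([0.6, 0.4])
A  = np.array([[0.7, 0.3],
               [0.4, 0.6]])
B  = np.array([[0.9, 0.1],
               [0.2, 0.8]])
print(forward_likelihood([0, 1, 1], pi, A, B))
```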
Perceptron
• Step function used for squashing.
• Classifier as Neuron metaphor.
Perceptron Loss
• Classification Error vs. Sigmoid Error
– Loss is only calculated on mistakes
– Perceptrons use strictly classification error
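A minimal sketch of the perceptron update rule, which changes the weights only on misclassified examples; labels are assumed to be in {-1, +1} and the data is synthetic:

```python
# Perceptron training: update only when the current example is misclassified.
import numpy as np

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, size=(50, 2)),
               rng.normal(+2, 1, size=(50, 2))])
y = np.concatenate([-np.ones(50), np.ones(50)])

w, b = np.zeros(2), 0.0
for _ in range(10):                       # a few passes over the data
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:        # mistake: step function disagreed
            w += yi * xi                  # nudge the boundary toward xi
            b += yi
```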
Neural Networks
• Interconnected Layers of Perceptrons or
Logistic Regression “neurons”
Neural Networks
• There are many possible configurations of
neural networks
– Vary the number of layers
– Vary the size of each layer
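As a toy illustration of the layered structure, a forward pass through a two-layer network of logistic "neurons" with random, untrained weights:

```python
# Forward pass through a 3-input, 4-hidden-unit, 1-output network.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 3)), np.zeros(4)   # hidden layer: 3 -> 4
W2, b2 = rng.normal(size=(1, 4)), np.zeros(1)   # output layer: 4 -> 1

x = rng.normal(size=3)                 # one input vector
h = sigmoid(W1 @ x + b1)               # hidden activations
y_hat = sigmoid(W2 @ h + b2)           # network output in (0, 1)
```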
Support Vector Machines
• Maximum Margin Classification
[Figures: a small-margin separator vs. the large-margin separator]
Support Vector Machines
• Optimization Function
• Decision Function
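For reference, the usual textbook soft-margin optimization and kernelized decision function (written here generically, not copied from the slide's figures):

```latex
\begin{aligned}
\text{Optimization:}\quad & \min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{N}\xi_i
  \quad\text{s.t.}\;\; y_i\big(w^{\top}\phi(x_i)+b\big) \ge 1-\xi_i,\;\; \xi_i \ge 0 \\
\text{Decision:}\quad & f(x) = \operatorname{sign}\Big(\sum_{i\in\mathrm{SV}} \alpha_i\, y_i\, k(x_i, x) + b\Big)
\end{aligned}
```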
Visualization of Support Vectors
Questions?
• Now would be a good time to ask questions
about Supervised Techniques.
Clustering
• Identify discrete groups of similar data points
• Data points are unlabeled
Recall K-Means
• Algorithm
– Select K – the desired number of clusters
– Initialize K cluster centroids
– For each point in the data set, assign it to the
cluster with the closest centroid
– Update the centroid based on the points assigned
to each cluster
– If any data point has changed
clusters, repeat
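A compact sketch of this loop; the value of K, the random initialization, and the assumption that no cluster goes empty are all simplifications:

```python
# Hard k-means: assign points to nearest centroid, then recompute centroids.
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]   # initialize
    for _ in range(iters):
        # Assign each point to the cluster with the closest centroid.
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # Update each centroid to the mean of its assigned points
        # (assumes every cluster keeps at least one point).
        new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(K)])
        if np.allclose(new_centroids, centroids):               # no change: stop
            break
        centroids = new_centroids
    return centroids, labels
```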
k-means output
Soft K-means
• In k-means, we force every data point to exist
in exactly one cluster.
• This constraint can be relaxed.
Minimizes the entropy of cluster assignment
Soft k-means example
Soft k-means
• We still define a cluster by a centroid, but we
calculate the centroid as the weighted mean
of all the data points
• Convergence is based on a stopping threshold
rather than changed assignments
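A minimal sketch of the soft assignment and weighted-centroid update; the stiffness parameter beta below is an assumed knob, and large beta recovers hard k-means:

```python
# Soft k-means: responsibilities instead of hard labels, weighted centroids.
import numpy as np

def soft_assignments(X, centroids, beta=1.0):
    """Return an (N, K) matrix of responsibilities that sum to 1 per row."""
    d2 = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) ** 2
    logits = -beta * d2
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    r = np.exp(logits)
    return r / r.sum(axis=1, keepdims=True)

def update_centroids(X, r):
    """Weighted mean of all data points, weights given by responsibilities."""
    return (r.T @ X) / r.sum(axis=0)[:, None]
```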
Gaussian Mixture Models
• Rather than identifying clusters by the “nearest” centroid, fit a set of k Gaussians to the data.
GMM example
Gaussian Mixture Models
• Formally, a mixture model is a weighted sum of a number of pdfs, where the weights are determined by a mixing distribution:
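In the Gaussian case this takes the usual textbook form:

```latex
p(x) \;=\; \sum_{k=1}^{K} \pi_k\, \mathcal{N}\big(x \mid \mu_k, \Sigma_k\big),
\qquad \pi_k \ge 0,\quad \sum_{k=1}^{K}\pi_k = 1
```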
Graphical Models
with unobserved variables
• What if you have variables in a Graphical
model that are never observed?
– Latent Variables
• Training latent variable models is an
unsupervised learning application
[Figure: example graphical model with unobserved states (uncomfortable, amused) and observed evidence (sweating, laughing)]
Latent Variable HMMs
• We can cluster sequences using an HMM with
unobserved state variables
• We will train the latent variable models using
Expectation Maximization
Expectation Maximization
• Both GMMs and Gaussian models with latent variables are trained using Expectation Maximization
– Step 1: Expectation (E-step)
• Evaluate the “responsibilities” of each cluster with the
current parameters
– Step 2: Maximization (M-step)
• Re-estimate parameters using the existing
“responsibilities”
• Related to k-means
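In the GMM case, the two steps take the standard form (written here for reference, not copied from the slides):

```latex
\begin{aligned}
\text{E-step:}\quad & \gamma_{ik} = \frac{\pi_k\,\mathcal{N}(x_i \mid \mu_k, \Sigma_k)}
  {\sum_{j=1}^{K}\pi_j\,\mathcal{N}(x_i \mid \mu_j, \Sigma_j)} \\
\text{M-step:}\quad & N_k = \sum_{i=1}^{N}\gamma_{ik},\qquad
  \mu_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}\,x_i, \\
 & \Sigma_k = \frac{1}{N_k}\sum_{i=1}^{N}\gamma_{ik}\,(x_i-\mu_k)(x_i-\mu_k)^{\top},\qquad
  \pi_k = \frac{N_k}{N}
\end{aligned}
```

With hard 0/1 responsibilities and shared spherical covariances, the same two steps reduce to the k-means assignment and centroid updates.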
Questions
• One more time for questions on supervised
learning…
Next Time
• Gaussian Mixture Models (GMMs)
• Expectation Maximization