John Shawe-Taylor (UCL CS): Statistical modelling & computational learning

Statistical Modelling and Computational
Learning
John Shawe-Taylor
Department of Computer Science
UCL
Detecting patterns in data
• The aim of statistical modelling and computational
learning is to identify some stable pattern from a
finite sample of data.
• Based on the detected pattern, we can then
process future data more accurately, more
efficiently, etc.
• The approach is useful when we are unable to
specify the precise pattern a priori – the sample
must be used to identify it.
Statistical Modelling vs Computational
Learning
• Statistical Modelling:
– Interested in inferring underlying structure, eg clusters,
parametric model, underpinning network
– Frequently uses a prior distribution and inference to estimate a posterior over possible models
• Computational Learning:
– Interested in ability to predict/process future data
accurately, eg classifiers, regression, mappings
– Aims to give assurances of performance on unseen
data
Types of analysis
• Classification: which of two or more classes does
an example belong to? Eg document filtering.
• Regression: what is the real valued function of the
input that matches the output values? Eg function
approximation.
• Novelty detection: which new examples are
unusual or anomalous? Eg engine monitoring.
• Ranking: what is the rank that should be assigned
to a given data item? Eg recommender systems.
• Clustering: what natural structure is present in the data?
• Network: which interactions will persist?
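As an illustrative sketch (not from the slides), the first two of these task types can be set up in a few lines; the synthetic data, scikit-learn models and thresholds below are assumptions chosen only to make the example run.

```python
# Illustrative sketch: classification and novelty detection on synthetic data.
# The data, models and parameters are placeholders, not the slides' examples.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)

# Classification: which of two classes does an example belong to?
X = rng.normal(size=(200, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)      # synthetic labels
clf = LogisticRegression().fit(X, y)
print("predicted class:", clf.predict(X[:1]))

# Novelty detection: which new examples are unusual or anomalous?
normal_data = rng.normal(size=(200, 5))
detector = OneClassSVM(nu=0.05).fit(normal_data)
outlier = np.full((1, 5), 6.0)               # far from the training cloud
print("novelty flag (-1 = anomalous):", detector.predict(outlier))
```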
Principled methods
• The key theoretical question is whether detected patterns are spurious or stable – are they present in this sample by chance, or are they implicit in the distribution/system generating the data?
• This is a probabilistic or statistical question:
estimate the stability of any patterns detected in
the finite sample.
• What is the chance that we have been fooled by
the sample?
Uniform convergence
• Typically we are attempting to assess an infinite
(or very large) class of potential patterns.
• The question is whether the pattern's behaviour on the finite sample will match its presence in future data.
• For one function this needs a bound on the tail of
the distribution, but for a class we need
convergence uniformly over the class, cf. multiple
hypothesis testing.
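A minimal worked version of this distinction, assuming losses bounded in [0, 1] and, for simplicity, a finite pattern class F (the slides' setting is more general):

```latex
% One fixed function f, sample of size m: Hoeffding's inequality bounds the tail
\Pr\left[\, \mathbb{E}[\ell_f] - \hat{\mathbb{E}}_m[\ell_f] > \epsilon \,\right] \le \exp\left(-2 m \epsilon^2\right)

% Whole class F: a union bound gives convergence uniformly over the class,
% the analogue of a multiple-hypothesis-testing correction
\Pr\left[\, \exists f \in F:\ \mathbb{E}[\ell_f] - \hat{\mathbb{E}}_m[\ell_f] > \epsilon \,\right] \le |F| \exp\left(-2 m \epsilon^2\right)
```
For infinite classes, |F| is replaced by a capacity measure such as the VC dimension or covering numbers.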
High dimensional spaces
• There is a further problem that modern
applications typically involve high dimensional
data.
• Kernel methods also project data into high
dimensional spaces in order to ensure that linear
methods are powerful enough.
• Provides a general toolkit of methods, but it kills more traditional statistical methods of analysis – the curse of dimensionality.
Luckiness framework
(Shawe-Taylor, Bartlett, Williamson and Anthony, 1998)
• This allows you to use evidence of effective low dimensionality even when working in high dimensions.
• Examples are:
– size of the margin in a support vector machine
classifier,
– norm of the weight vector in (ridge) regression
– sparsity of a solution
– evidence in the Bayesian posterior
– residual in principal components analysis
Luckiness Framework
• The framework wins if the distribution of examples aligns with the pattern
• Hence it can work with a far richer set of hypotheses
• The best-known example is the support vector machine, which uses the margin as a measure of luckiness
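A small sketch of the margin as a luckiness measure, assuming synthetic linearly separable data and a hard margin approximated by a large C in scikit-learn (these choices are illustrative, not the slides'): for a linear SVM the geometric margin is 1/||w||, so a larger margin signals a luckier sample and a tighter bound.

```python
# Sketch: the margin of a linear SVM as a luckiness measure.
# Synthetic, well-separated data; a large C approximates a hard margin.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=+2.0, size=(50, 2)),
               rng.normal(loc=-2.0, size=(50, 2))])
y = np.array([1] * 50 + [-1] * 50)

svm = SVC(kernel="linear", C=1e6).fit(X, y)
w = svm.coef_.ravel()
margin = 1.0 / np.linalg.norm(w)   # geometric margin of the separating hyperplane
print(f"margin = {margin:.3f}  (larger margin => luckier sample)")
```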
Other luckiness approaches
• Large margins are just one way of detecting
fortuitous distributions
• Sparsity of representation appears to be a
more fundamental measure
• Gives rise to different algorithms, but the
luckiness of large margins can also be
justified by a sparsity argument
• Brings us full circle back to Ockham’s razor:
parsimonious description
Kernel-based learning
• The idea that we look for patterns in richer
sets of hypotheses has given rise to kernel
methods
• Data is projected into a rich feature space
where linear patterns are sought using an
implicit kernel representation
• The luckiness is typically measured by the
norm of the linear function required for the
task – equivalent to the margin in the case of
classification
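A minimal numpy sketch of this idea, using kernel ridge regression with a Gaussian kernel (the kernel, data and regularisation value are assumptions for illustration): the algorithm only ever touches the kernel matrix K, never explicit high-dimensional features, and the norm of the learned function, the luckiness measure above, is αᵀKα.

```python
# Sketch: kernel ridge regression using only the kernel matrix (implicit features).
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """K[i, j] = exp(-||A[i] - B[j]||^2 / (2 sigma^2))."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(100, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)

lam = 0.1                                              # regularisation (assumed)
K = gaussian_kernel(X, X)
alpha = np.linalg.solve(K + lam * np.eye(len(X)), y)   # dual coefficients

X_test = np.array([[0.5]])
y_pred = gaussian_kernel(X_test, X) @ alpha            # prediction via kernel only
f_norm_sq = alpha @ K @ alpha                          # ||f||^2 in feature space
print(y_pred, f_norm_sq)
```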
Modularity of kernel methods
• Algorithms can be applied to different types of
data by simply defining an appropriate kernel
function to compare data items
• The range of algorithms has been extended to include regression, ranking, novelty detection, etc.
• Can be used for statistical tests of the null hypothesis that two samples are drawn from the same distribution
• Also enables links with statistical modelling
techniques
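One widely used kernel two-sample test of this kind is based on the maximum mean discrepancy (MMD); the sketch below, with a Gaussian kernel and a permutation estimate of the null distribution, is an assumed illustration rather than the specific test the slides refer to.

```python
# Sketch: kernel two-sample test via maximum mean discrepancy (MMD),
# with a permutation test of the null hypothesis of equal distributions.
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2 * sigma ** 2))

def mmd2(X, Y, sigma=1.0):
    """Biased estimate of the squared MMD between samples X and Y."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2 * gaussian_kernel(X, Y, sigma).mean())

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(100, 2))     # shifted mean: different distribution

observed = mmd2(X, Y)
pooled = np.vstack([X, Y])
perm_stats = []
for _ in range(200):                        # re-split the pooled sample at random
    idx = rng.permutation(len(pooled))
    perm_stats.append(mmd2(pooled[idx[:100]], pooled[idx[100:]]))
p_value = np.mean(np.array(perm_stats) >= observed)
print(f"MMD^2 = {observed:.4f}, permutation p-value = {p_value:.3f}")
```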
Kernel design methods
• Structure kernels use dynamic programming
to compute sums over possible matches, eg
String kernels
• Kernels from probabilistic models – eg
Fisher kernels.
• Kernel PCA to perform feature selection: eg
Latent Semantic kernels.
• Kernel alignment/PLS: tuning features to the
task.
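To make the structure-kernel idea concrete, here is a sketch of the simple p-spectrum string kernel, which compares two strings by counting their common length-p substrings; the dynamic-programming subsequence kernels mentioned above are more elaborate, so treat this only as the simplest member of the family.

```python
# Sketch: p-spectrum string kernel, the inner product of substring-count features.
from collections import Counter

def p_spectrum(s: str, p: int) -> Counter:
    """Counts of all contiguous substrings of length p in s."""
    return Counter(s[i:i + p] for i in range(len(s) - p + 1))

def string_kernel(s: str, t: str, p: int = 3) -> int:
    """k(s, t) = sum over length-p substrings u of count_s(u) * count_t(u)."""
    cs, ct = p_spectrum(s, p), p_spectrum(t, p)
    return sum(cs[u] * ct[u] for u in cs.keys() & ct.keys())

print(string_kernel("statistical modelling", "computational learning", p=3))
print(string_kernel("statistical modelling", "statistical learning", p=3))
```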
Cross-modal analysis
• Not limited to a single representation.
• Consider web images and their associated text
• The kernel for images includes wavelets and colour histograms – for text it is a bag of words.
• Creates a content-based image retrieval system
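One way such a system can be built is with regularised canonical correlation analysis between the two feature spaces; the sketch below uses plain (non-kernelised) CCA on explicit features with synthetic data, so the feature extraction and regularisation value are assumptions: directions that correlate image features with text features give a shared semantic space in which retrieval can be done.

```python
# Sketch: regularised CCA between image features and text features,
# yielding a shared space for cross-modal (content-based) retrieval.
import numpy as np

def regularised_cca(X, Y, reg=1e-2):
    """Canonical correlations and projection directions for two views X, Y."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    n = len(X)
    Cxx = X.T @ X / n + reg * np.eye(X.shape[1])
    Cyy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Cxy = X.T @ Y / n
    # Whiten each view with a Cholesky factor, then SVD the cross-covariance.
    Wx = np.linalg.inv(np.linalg.cholesky(Cxx)).T
    Wy = np.linalg.inv(np.linalg.cholesky(Cyy)).T
    U, corrs, Vt = np.linalg.svd(Wx.T @ Cxy @ Wy)
    return corrs, Wx @ U, Wy @ Vt.T

rng = np.random.default_rng(0)
latent = rng.normal(size=(300, 4))              # shared "semantic" content
img_feats = latent @ rng.normal(size=(4, 20)) + 0.1 * rng.normal(size=(300, 20))
txt_feats = latent @ rng.normal(size=(4, 50)) + 0.1 * rng.normal(size=(300, 50))

corrs, A, B = regularised_cca(img_feats, txt_feats)
print("top canonical correlations:", np.round(corrs[:4], 3))
```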
Conclusion
• Statistical Modelling and Computational Learning aim to find patterns in data
– SM interested in reliability of pattern, CL in quality of prediction
– Using bounds to guide algorithm design can overcome problems with high dimensions
– Combined with kernels, this allows linear methods to be used efficiently in high dimensional spaces
– Methods provide a general toolkit for the practitioner making use of adaptive systems
– Particular applications require specific tasks and representations/kernels that define new research challenges
On-line optimisation of web pages
• Have to decide what content to place on a web
page: eg advertising banners
• Have some measure of response: eg click-through, purchases made, etc.
• Can be viewed as an instance of the multi-armed bandit problem – need to trade off exploration and exploitation on-line
• Touch Clarity was a company providing this
technology for high street banks and others
Upper confidence bounds
• Basic algorithm maintains estimates of the response rates
(RR) for each arm (piece of content) together with
standard deviations
• Serve the arm for which mean + S.D. is maximum: guaranteed either to be a good arm or to reduce the S.D.
• Provable bounds on the loss compared to playing the best arm all along
• Can be generalised to situation where additional
information linearly determines response rate (RR)
• Kernel methods enable handling of non-linear RRs
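A sketch of the basic upper-confidence-bound recipe for content serving, with a simulated Bernoulli click model and UCB1-style confidence widths (these specifics are assumptions for illustration, not Touch Clarity's actual system):

```python
# Sketch: upper-confidence-bound arm selection for content serving.
# Each arm is a piece of content with an unknown click-through rate.
import math
import random

true_rates = [0.04, 0.05, 0.07]        # hidden response rates (simulation only)
counts = [0] * len(true_rates)         # times each arm has been served
rewards = [0.0] * len(true_rates)      # total clicks per arm
random.seed(0)

for t in range(1, 20001):
    if 0 in counts:
        arm = counts.index(0)          # serve each arm once to initialise
    else:
        # Serve the arm with the highest mean + confidence width
        # (the "mean + S.D." score in the slide's terms).
        ucb = [rewards[i] / counts[i] + math.sqrt(2 * math.log(t) / counts[i])
               for i in range(len(true_rates))]
        arm = max(range(len(true_rates)), key=lambda i: ucb[i])

    click = 1.0 if random.random() < true_rates[arm] else 0.0
    counts[arm] += 1
    rewards[arm] += click

print("serves per arm:", counts)       # the best arm should dominate
print("estimated rates:", [round(rewards[i] / counts[i], 3) for i in range(3)])
```
Serving the optimistic arm either pays off (it really is a good arm) or drives its count up and its confidence width down, which is what yields the provable regret bounds mentioned above.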