Statistical Modelling and Computational Learning
John Shawe-Taylor
Department of Computer Science
UCL
Detecting patterns in data
• The aim of statistical modelling and computational learning is to identify some stable pattern from a finite sample of data.
• Based on the detected pattern, we can then process future data more accurately, more efficiently, etc.
• The approach is useful when we are unable to specify the precise pattern a priori – the sample must be used to identify it.
Statistical Modelling vs Computational Learning
• Statistical Modelling:
– Interested in inferring underlying structure, e.g. clusters, a parametric model, an underpinning network
– Frequently uses a prior distribution and inference to estimate a posterior over possible models
• Computational Learning:
– Interested in the ability to predict/process future data accurately, e.g. classifiers, regression, mappings
– Aims to give assurances of performance on unseen data
Types of analysis
• Classification: which of two or more classes does an example belong to? E.g. document filtering.
• Regression: what real-valued function of the input matches the output values? E.g. function approximation.
• Novelty detection: which new examples are unusual or anomalous? E.g. engine monitoring.
• Ranking: what rank should be assigned to a given data item? E.g. recommender systems.
• Clustering: what natural structure is there in the data?
• Network inference: which interactions will persist?
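To make the first three task types concrete, here is a minimal sketch on toy data; scikit-learn is assumed purely for illustration, since the slides name no particular library.

```python
# Minimal sketch of three of the task types above on toy data.
import numpy as np
from sklearn.svm import SVC, SVR, OneClassSVM

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))

# Classification: which of two classes does an example belong to?
y_class = (X[:, 0] + X[:, 1] > 0).astype(int)
clf = SVC().fit(X, y_class)

# Regression: fit a real-valued function of the input.
y_real = np.sin(X[:, 0]) + 0.1 * rng.normal(size=100)
reg = SVR().fit(X, y_real)

# Novelty detection: flag unusual new examples.
nov = OneClassSVM(nu=0.05).fit(X)

print(clf.predict(X[:3]), reg.predict(X[:3]), nov.predict(X[:3]))
```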
Principled methods
• The key theoretical question is whether detected patterns are spurious or stable: are they in this sample by chance, or are they implicit in the distribution/system generating the data?
• This is a probabilistic or statistical question: estimate the stability of any patterns detected in the finite sample.
• What is the chance that we have been fooled by the sample?
Uniform convergence
• Typically we are attempting to assess an infinite (or very large) class of potential patterns.
• The question is whether the finite-sample behaviour of a pattern will match its future presence.
• For one function this needs a bound on the tail of the distribution, but for a class we need convergence uniformly over the class, cf. multiple hypothesis testing.
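The standard formalisation of this step, which the slide states only informally, is Hoeffding's inequality for a single function combined with a union bound over the class:

```latex
% For a single function f with true risk R(f) and empirical risk
% \hat{R}_m(f) on m samples, Hoeffding's inequality bounds the tail:
\[
  \Pr\bigl(\,|\hat{R}_m(f) - R(f)| > \epsilon\,\bigr) \le 2e^{-2m\epsilon^2}.
\]
% A union bound over a finite class H gives uniform convergence,
% the analogue of a multiple-testing correction:
\[
  \Pr\Bigl(\sup_{f \in H} |\hat{R}_m(f) - R(f)| > \epsilon\Bigr)
  \le 2|H|\,e^{-2m\epsilon^2},
\]
% so with probability at least 1-\delta, simultaneously for all f in H:
\[
  |\hat{R}_m(f) - R(f)| \le \sqrt{\tfrac{1}{2m}\ln\tfrac{2|H|}{\delta}}.
\]
```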
High dimensional spaces
• There is a further problem: modern applications typically involve high dimensional data.
• Kernel methods also project data into high dimensional spaces in order to ensure that linear methods are powerful enough.
• This provides a general toolkit of methods, but…
• it kills more traditional statistical methods of analysis – the curse of dimensionality.
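A quick numerical illustration of the curse of dimensionality (my own sketch, not from the talk): as the dimension grows, pairwise distances concentrate, so the contrast that distance-based statistics rely on vanishes.

```python
# Sketch: as dimension d grows, pairwise distances concentrate, so the
# contrast between near and far points that statistics relies on vanishes.
import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    X = rng.uniform(size=(200, d))
    dists = np.linalg.norm(X[1:] - X[0], axis=1)  # distances from point 0
    spread = (dists.max() - dists.min()) / dists.min()
    print(f"d = {d:4d}   relative contrast (max-min)/min = {spread:.3f}")
```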
Luckiness framework
(Shawe-Taylor, Bartlett, Williamson and Anthony, 1998)
• This allows you to use evidence of effective low dimensionality even when working in high dimensions.
• Examples are:
– the size of the margin in a support vector machine classifier,
– the norm of the weight vector in (ridge) regression,
– sparsity of a solution,
– evidence in the Bayesian posterior,
– the residual in principal components analysis.
Luckiness Framework
• Wins if the distribution of examples aligns with the pattern
• Hence it can work with a far richer set of hypotheses
• The best-known example is the support vector machine, which uses the margin as its measure of luckiness
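A minimal sketch of the margin as a luckiness measure, assuming scikit-learn for illustration: for a (near) hard-margin linear SVM the geometric margin is 1/||w||, so a large margin certifies low effective complexity whatever the ambient dimension.

```python
# Sketch: the margin of a linear SVM as a luckiness measure.
# For a (near) hard-margin linear SVM the geometric margin is 1/||w||.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)),   # class 0
               rng.normal(+2, 0.5, (50, 2))])  # class 1
y = np.array([0] * 50 + [1] * 50)

svm = SVC(kernel="linear", C=1e6).fit(X, y)    # large C ~ hard margin
margin = 1.0 / np.linalg.norm(svm.coef_)
print(f"margin = {margin:.3f}, support vectors = {svm.n_support_.sum()}")
```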
Other luckiness approaches
• Large margins are just one way of detecting fortuitous distributions
• Sparsity of representation appears to be a more fundamental measure
• Gives rise to different algorithms, but the luckiness of large margins can also be justified by a sparsity argument
• Brings us full circle back to Ockham's razor: parsimonious description
Kernel-based learning
• The idea that we look for patterns in richer sets of hypotheses has given rise to kernel methods
• Data is projected into a rich feature space where linear patterns are sought using an implicit kernel representation
• The luckiness is typically measured by the norm of the linear function required for the task – equivalent to the margin in the case of classification
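A sketch of the implicit representation and of the norm as the luckiness measure, using dual-form kernel ridge regression (my own illustration; NumPy and scikit-learn's RBF kernel are assumed):

```python
# Sketch: learning with only the Gram matrix K[i, j] = k(x_i, x_j),
# never the explicit feature map. Dual-form kernel ridge regression:
#   alpha = (K + lam*I)^{-1} y,   f(x) = sum_i alpha_i k(x, x_i).
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(60, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=60)

K = rbf_kernel(X, X, gamma=0.5)                    # Gram matrix
alpha = np.linalg.solve(K + 0.1 * np.eye(60), y)   # dual coefficients

# Norm of the learned function in feature space: ||f||^2 = alpha^T K alpha.
# This is the luckiness quantity that the bounds measure.
print("||f||^2 =", alpha @ K @ alpha)

# Prediction at new points needs only kernel evaluations.
X_new = np.array([[0.0], [1.5]])
print(rbf_kernel(X_new, X, gamma=0.5) @ alpha)
```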
Modularity of kernel methods
• Algorithms can be applied to different types of data by simply defining an appropriate kernel function to compare data items
• The range of algorithms has been extended to include regression, ranking, novelty detection, etc.
• Can be used for statistical tests of the null hypothesis that two samples are drawn from the same distribution
• Also enables links with statistical modelling techniques
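The two-sample test referred to here is, in the kernel literature, based on the maximum mean discrepancy (MMD) of Gretton et al.; a minimal sketch of the biased statistic, with an RBF kernel assumed for illustration:

```python
# Sketch of the kernel two-sample statistic (the MMD of Gretton et al.):
# large values are evidence against the null that the samples share a
# distribution. Biased estimator; RBF kernel assumed for illustration.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

def mmd2(X, Y, gamma=1.0):
    """Biased estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(X, X, gamma).mean()
            + rbf_kernel(Y, Y, gamma).mean()
            - 2 * rbf_kernel(X, Y, gamma).mean())

rng = np.random.default_rng(0)
same = mmd2(rng.normal(size=(200, 2)), rng.normal(size=(200, 2)))
diff = mmd2(rng.normal(size=(200, 2)), rng.normal(1.0, 1.0, (200, 2)))
print(f"same distribution: {same:.4f}   shifted distribution: {diff:.4f}")
```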
Kernel design methods
• Structure kernels use dynamic programming to compute sums over possible matches, e.g. string kernels.
• Kernels from probabilistic models, e.g. Fisher kernels.
• Kernel PCA to perform feature selection, e.g. Latent Semantic kernels.
• Kernel alignment/PLS: tuning features to the task.
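The dynamic-programming kernels on the slide handle gapped subsequences; as the simplest member of the same family, here is a sketch of the p-spectrum kernel, which counts shared substrings of length p and conveys the idea of comparing structured data through matches:

```python
# Sketch of the simplest string kernel: the p-spectrum kernel, which
# counts substrings of length p shared by two strings. (The slides'
# dynamic-programming kernels additionally handle gapped subsequences.)
from collections import Counter

def spectrum_kernel(s: str, t: str, p: int = 3) -> int:
    cs = Counter(s[i:i + p] for i in range(len(s) - p + 1))
    ct = Counter(t[i:i + p] for i in range(len(t) - p + 1))
    return sum(count * ct[sub] for sub, count in cs.items())

print(spectrum_kernel("statistical modelling", "computational learning"))
print(spectrum_kernel("statistical modelling", "statistical learning"))
```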
Cross-modal analysis
• Not limited to the same representation for each modality.
• Consider web images and their associated text.
• The kernel for images includes wavelets and colour histograms; for text it is a bag of words.
• This creates a content-based image retrieval system.
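The machinery usually behind such a retrieval system is (kernel) canonical correlation analysis, which finds maximally correlated projections of the two modalities. A linear CCA toy, with invented feature dimensions purely for illustration:

```python
# Sketch of cross-modal matching via canonical correlation analysis.
# (KCCA, the kernelised version, underlies the retrieval system
# described; this linear toy invents feature dimensions purely for
# illustration: 64 "image" features, 300 "text" features.)
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 5))                            # shared content
img = latent @ rng.normal(size=(5, 64)) + 0.1 * rng.normal(size=(n, 64))
txt = latent @ rng.normal(size=(5, 300)) + 0.1 * rng.normal(size=(n, 300))

cca = CCA(n_components=5).fit(img, txt)
img_c, txt_c = cca.transform(img, txt)

# Retrieve the image whose projection is nearest the text query's.
query = 7
nearest = np.linalg.norm(img_c - txt_c[query], axis=1).argmin()
print(f"text query {query} -> retrieved image {nearest}")
```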
Conclusion
• Statistical Modelling and Computational Learning aim to find patterns in data: SM interested in reliability of the pattern, CL in quality of prediction
• Using bounds to guide algorithm design can overcome problems with high dimensions
• Combined with kernels, this allows the efficient use of linear methods in high dimensional spaces
• The methods provide a general toolkit for the practitioner making use of adaptive systems
• Particular applications require specific tasks and representations/kernels that define new research challenges
On-line optimisation of web pages
• Have to decide what content to place on a web page, e.g. advertising banners
• Have some measure of response, e.g. click-through, purchases made, etc.
• Can be viewed as an instance of the multi-armed bandit problem – need to trade off exploration against exploitation on-line
• Touch Clarity was a company providing this technology for high street banks and others
Upper confidence bounds
• The basic algorithm maintains an estimate of the response rate (RR) for each arm (piece of content), together with its standard deviation
• Serve the arm for which mean + SD is maximal: this is guaranteed either to be a good arm or to reduce the SD
• Provable bounds on the loss compared with playing the best arm all along
• Can be generalised to the situation where additional information linearly determines the response rate
• Kernel methods enable handling of non-linear response rates
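A minimal simulation of the serving rule just described; the slide does not give the exact confidence width, so the standard UCB1 form is assumed:

```python
# Sketch of the serving rule above: play the arm whose mean + uncertainty
# is largest. The slide does not give the exact confidence width, so the
# standard UCB1 form sqrt(2 ln t / n_pulls) is assumed here.
import numpy as np

rng = np.random.default_rng(0)
true_rr = np.array([0.05, 0.08, 0.11])    # hidden click-through rates
pulls = np.zeros(len(true_rr))
clicks = np.zeros(len(true_rr))

for t in range(1, 10001):
    if t <= len(true_rr):                 # serve every arm once to start
        arm = t - 1
    else:
        ucb = clicks / pulls + np.sqrt(2 * np.log(t) / pulls)
        arm = int(ucb.argmax())
    pulls[arm] += 1
    clicks[arm] += rng.random() < true_rr[arm]

print("pulls per arm:", pulls)            # the best arm should dominate
print("regret vs best arm:", true_rr.max() * pulls.sum() - clicks.sum())
```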