“Pattern Recognition and Machine Learning” by Christopher Bishop


PATTERN RECOGNITION AND MACHINE LEARNING
CHAPTER 1: INTRODUCTION
• Christopher M. Bishop
• Lecturer: Xiaopeng Hong
These slides follow closely the course textbook "Pattern Recognition and Machine Learning" by Christopher Bishop and the slides "Machine Learning and Music" by Prof. Douglas Eck.
Disclaimer
Contents
[Diagram: relationships among related fields: computer vision (CV), image processing (IP), signal processing (SP), pattern recognition (PR) and machine learning (ML); feature extraction and pattern classification; statistical and symbolic approaches resting on probability theory, information theory and mathematical logic, …]
PR & ML
• Pattern recognition has its origins in engineering, whereas machine learning grew out of computer science. However, these activities can be viewed as two facets of the same field, and together they have undergone substantial development over the past ten years.
Learning
• Learning denotes changes in the system that are adaptive in the sense that they enable the system to do the same task, or tasks drawn from the same population, more effectively the next time.
by H. Simon
• If a system can improve its performance by carrying out some process, then it is learning.
by Prof. 陆汝钤 (Lu Ruqian)
Related Publications
• Conferences
– ICML
– KDD
– NIPS
– IJCNN
– AIML
– IJCAI
– COLT
– CVPR
– ICCV
– ECCV
– …
• Journals
– Machine Learning (ML)
– Journal of Machine Learning Research
– Annals of Statistics
– Data Mining and Knowledge Discovery
– IEEE-KDE
– IEEE-PAMI
– Artificial Intelligence
– Journal of Artificial Intelligence Research
– Computational Intelligence
– Neural Computation
– IEEE-NN
– Research, Information and Computation
– …
History
• Classical statistical methods
• PAC (probably approximately correct) learning
• Division in feature space
• Generalization
• Ensemble learning
Milestones: F. Rosenblatt's Perceptron; M. Minsky's "Perceptrons"; the BP (back-propagation) network
Leslie Gabriel Valiant
• He introduced the "probably approximately correct" (PAC) model of machine learning that has helped the field of computational learning theory grow.
by Wikipedia
• Computational complexity is treated as a factor that must be considered: the complexity of the algorithm must be polynomial, and to achieve this, model accuracy is sacrificed if necessary.
• "For any ε > 0 and 0 ≤ δ < 1, |F(x) − f(x)| ≤ ε holds with probability greater than 1 − δ."
• Traditional statisticians found this idea hard to accept.
by Prof. 王珏 (Wang Jue)
Vladimir N. Vapnik
• "Do not solve the harder problem of estimating a probability density as an intermediate step toward solving a machine-learning classification or regression problem." He therefore turned the problem directly into one of linear discrimination, which in essence gives up the interpretability of the learned model with respect to the underlying natural model.
• "Generalization" and "finite-sample statistics"
– Generalization as the core problem of machine learning
– Design algorithms in a linear feature space
– Generalization via the maximum margin
• "Rather than be left with nothing, settle for second best." This is something the statistical tradition could not accept.
by Prof. 王珏 (Wang Jue)
Robert Schapire
• "For any ε > 0 and 0 ≤ δ < 1, |F(x) − f(x)| ≤ ε holds with probability greater than 1/2 + δ."
• Gave a constructive proof that weak PAC learnability is equivalent to strong PAC learnability.
• Ensemble learning has two important characteristics:
– multiple weak models are used in place of a single strong model;
– decisions are made by having the weak models vote, with the majority determining the answer.
by Prof. 王珏 (Wang Jue)
Example
Handwritten Digit Recognition
Each digit is a 28 × 28 pixel image, so the input dimensionality is d = 784.
Pre-processing (also called feature extraction):
1. reduce variability; 2. speed up computation
Polynomial Curve Fitting
Sum-of-Squares Error Function
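The equations on this slide are not preserved in the transcript; in the notation of Bishop's Section 1.1, the M-th order polynomial model and the sum-of-squares error are

y(x, \mathbf{w}) = \sum_{j=0}^{M} w_j x^j
E(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2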
0th Order Polynomial
1st Order Polynomial
3rd Order Polynomial
9th Order Polynomial
Over-fitting
Root-Mean-Square (RMS) Error:
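The RMS error referred to here is, in Bishop's notation, E_{\mathrm{RMS}} = \sqrt{2 E(\mathbf{w}^{*})/N}; dividing by N allows data sets of different size to be compared, and the square root puts the error on the same scale as the target variable t.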
Polynomial Coefficients
Data Set Size: 9th order polynomial fits for increasing data set sizes (plots)
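To make the over-fitting behaviour concrete, here is a minimal NumPy sketch (not from the original slides) that fits polynomials of increasing order to noisy samples of sin(2πx), as in Bishop's running example; the data size N, noise level, and helper names are illustrative choices.

```python
# Minimal sketch: polynomial curve fitting on noisy sin(2*pi*x) data.
import numpy as np

rng = np.random.default_rng(0)

N = 10                                                      # training set size (assumed)
x = np.linspace(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=N)   # noisy targets

def fit_polynomial(x, t, M):
    """Least-squares fit of an M-th order polynomial y(x, w) = sum_j w_j x^j."""
    X = np.vander(x, M + 1, increasing=True)                # design matrix [1, x, x^2, ...]
    w, *_ = np.linalg.lstsq(X, t, rcond=None)
    return w

def rms_error(x, t, w):
    """Root-mean-square error of the fitted polynomial on (x, t)."""
    X = np.vander(x, len(w), increasing=True)
    return np.sqrt(np.mean((X @ w - t) ** 2))

for M in (0, 1, 3, 9):
    w = fit_polynomial(x, t, M)
    print(f"M = {M}: training RMS error = {rms_error(x, t, w):.3f}")

# With only N = 10 points, M = 9 drives the training error to ~0 while the
# coefficients blow up: the over-fitting behaviour discussed above.
```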
Regularization
• Penalize large coefficient values
Regularization: fits for different values of the regularization coefficient ln λ, and E_RMS vs. ln λ (plots)
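In Bishop's notation, the penalized error being minimized here is

\widetilde{E}(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\lambda}{2} \lVert \mathbf{w} \rVert^2

where the coefficient λ controls the trade-off between fitting the data and keeping the coefficients small (ridge regression / weight decay).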
Polynomial Coefficients
Probability Theory
Statistical machine learning / pattern classification problems can be formulated within the Bayesian framework.
We now seek a more principled approach to solving problems in pattern recognition by turning to a discussion of probability theory, which also provides the foundation for nearly all of the subsequent developments in this book.
ML → MAP → Bayesian
Probability Theory
Apples and Oranges
Probability Theory
• Marginal Probability
• Joint Probability
• Conditional Probability
Probability Theory
• Sum Rule
• Product Rule
The Rules of Probability
• Sum Rule
• Product Rule
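For reference, the two rules in Bishop's notation are

Sum rule:     p(X) = \sum_{Y} p(X, Y)
Product rule: p(X, Y) = p(Y \mid X)\, p(X)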
Bayes’ Theorem
posterior ∝ likelihood × prior
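Combining the sum and product rules gives Bayes' theorem,

p(Y \mid X) = \frac{p(X \mid Y)\, p(Y)}{p(X)}, \qquad p(X) = \sum_{Y} p(X \mid Y)\, p(Y)

which is what the slogan "posterior ∝ likelihood × prior" abbreviates.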
Probability Densities
Transformed Densities
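The densities slides correspond to the standard definitions: p(x \in (a, b)) = \int_a^b p(x)\, dx with p(x) \ge 0 and \int p(x)\, dx = 1, and under a change of variables x = g(y) a density picks up the Jacobian factor, p_y(y) = p_x(g(y))\, \lvert g'(y) \rvert.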
Expectations
Conditional Expectation
(discrete)
Approximate Expectation
(discrete and continuous)
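The expectation formulas these slides refer to are, in Bishop's notation,

\mathbb{E}[f] = \sum_x p(x)\, f(x) \ \ \text{(discrete)}, \qquad \mathbb{E}[f] = \int p(x)\, f(x)\, dx \ \ \text{(continuous)}
\mathbb{E}_x[f \mid y] = \sum_x p(x \mid y)\, f(x) \ \ \text{(conditional expectation)}
\mathbb{E}[f] \simeq \frac{1}{N} \sum_{n=1}^{N} f(x_n) \ \ \text{(approximation from N samples)}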
Variances and
Covariances
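Likewise,

\mathrm{var}[f] = \mathbb{E}\big[(f(x) - \mathbb{E}[f(x)])^2\big] = \mathbb{E}[f(x)^2] - \mathbb{E}[f(x)]^2
\mathrm{cov}[x, y] = \mathbb{E}_{x,y}\big[(x - \mathbb{E}[x])(y - \mathbb{E}[y])\big] = \mathbb{E}_{x,y}[x y] - \mathbb{E}[x]\,\mathbb{E}[y]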
The Gaussian
Distribution
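The univariate Gaussian referred to here is

\mathcal{N}(x \mid \mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{1/2}} \exp\!\left\{ -\frac{(x - \mu)^2}{2\sigma^2} \right\}

with mean \mathbb{E}[x] = \mu and variance \mathrm{var}[x] = \sigma^2.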
Gaussian Mean and
Variance
The Multivariate
Gaussian
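In D dimensions,

\mathcal{N}(\mathbf{x} \mid \boldsymbol{\mu}, \boldsymbol{\Sigma}) = \frac{1}{(2\pi)^{D/2} \lvert \boldsymbol{\Sigma} \rvert^{1/2}} \exp\!\left\{ -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^{\mathrm{T}} \boldsymbol{\Sigma}^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right\}

with mean vector \boldsymbol{\mu} and covariance matrix \boldsymbol{\Sigma}.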
Gaussian Parameter
Estimation
Likelihood function
Maximum (Log)
Likelihood
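For an i.i.d. data set x = (x_1, …, x_N)^T, the likelihood and the resulting maximum-likelihood estimates are

p(\mathbf{x} \mid \mu, \sigma^2) = \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \sigma^2)
\ln p(\mathbf{x} \mid \mu, \sigma^2) = -\frac{1}{2\sigma^2} \sum_{n=1}^{N} (x_n - \mu)^2 - \frac{N}{2} \ln \sigma^2 - \frac{N}{2} \ln(2\pi)
\mu_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} x_n, \qquad \sigma^2_{\mathrm{ML}} = \frac{1}{N} \sum_{n=1}^{N} (x_n - \mu_{\mathrm{ML}})^2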
Properties of μ_ML and σ²_ML
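The relevant properties are \mathbb{E}[\mu_{\mathrm{ML}}] = \mu but \mathbb{E}[\sigma^2_{\mathrm{ML}}] = \frac{N-1}{N}\, \sigma^2, i.e. maximum likelihood systematically underestimates the variance; this bias is the simplest example of the over-fitting behaviour seen in curve fitting.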
Curve Fitting Re-visited
β: the precision parameter (inverse variance) of the Gaussian noise model
Maximum Likelihood
Determine w_ML by minimizing the sum-of-squares error, E(w).
1. mean: y(x, w_ML)
2. precision: β_ML
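The probabilistic model behind these steps, in the notation of Bishop's Section 1.2.5, is

p(t \mid x, \mathbf{w}, \beta) = \mathcal{N}\big(t \mid y(x, \mathbf{w}),\ \beta^{-1}\big)

Maximizing the likelihood over w is equivalent to minimizing the sum-of-squares error E(w), giving w_ML, and the precision then follows from

\frac{1}{\beta_{\mathrm{ML}}} = \frac{1}{N} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}_{\mathrm{ML}}) - t_n \}^2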
Predictive Distribution
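Substituting the maximum-likelihood parameters gives the predictive distribution

p(t \mid x, \mathbf{w}_{\mathrm{ML}}, \beta_{\mathrm{ML}}) = \mathcal{N}\big(t \mid y(x, \mathbf{w}_{\mathrm{ML}}),\ \beta_{\mathrm{ML}}^{-1}\big)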
MAP: A Step towards Bayes
Determine w_MAP by minimizing the regularized sum-of-squares error, Ẽ(w).
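With a Gaussian prior p(\mathbf{w} \mid \alpha) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}), maximizing the posterior p(\mathbf{w} \mid \mathbf{x}, \mathbf{t}) \propto p(\mathbf{t} \mid \mathbf{x}, \mathbf{w}, \beta)\, p(\mathbf{w} \mid \alpha) is equivalent to minimizing

\frac{\beta}{2} \sum_{n=1}^{N} \{ y(x_n, \mathbf{w}) - t_n \}^2 + \frac{\alpha}{2}\, \mathbf{w}^{\mathrm{T}} \mathbf{w}

i.e. the regularized sum-of-squares error with \lambda = \alpha / \beta.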
A fully Bayesian approach, by contrast, integrates over all values of w rather than using a point estimate.
Bayesian Curve Fitting
Section 3.3
Bayesian Predictive
Distribution
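In the fully Bayesian treatment the prediction is obtained by marginalizing over w:

p(t \mid x, \mathbf{x}, \mathbf{t}) = \int p(t \mid x, \mathbf{w})\, p(\mathbf{w} \mid \mathbf{x}, \mathbf{t})\, d\mathbf{w} = \mathcal{N}\big(t \mid m(x),\ s^2(x)\big)

where the mean m(x) and variance s²(x) depend on x, and the variance combines the noise β⁻¹ with the remaining uncertainty in w (Bishop, Section 1.2.6).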
1.3 Model Selection
Many parameters…
We need to determine the values of such parameters, and the principal objective in doing so is usually to achieve the best predictive performance on new data.
We may wish to consider a range of different types of model in order to find the best one for our particular application.
Section 3.3
Model Selection
• Cross-Validation (a small code sketch follows below)
• Information criteria that penalize model complexity (e.g., AIC) tend to favour overly simple models.
Section 3.4 & 4.4.1
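As a concrete companion to the cross-validation bullet, here is a small NumPy sketch (not from the slides) of S-fold cross-validation used to choose the polynomial order M; the synthetic data, fold count, and the function name cv_rms_error are illustrative assumptions.

```python
# Minimal sketch: choosing the polynomial order M by S-fold cross-validation.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 30)
t = np.sin(2 * np.pi * x) + rng.normal(scale=0.3, size=x.size)

def cv_rms_error(x, t, M, S=5):
    """Average held-out RMS error of an M-th order polynomial over S folds."""
    idx = rng.permutation(len(x))                 # shuffle before splitting into folds
    errors = []
    for val in np.array_split(idx, S):
        train = np.setdiff1d(idx, val)
        X_tr = np.vander(x[train], M + 1, increasing=True)
        w, *_ = np.linalg.lstsq(X_tr, t[train], rcond=None)
        X_val = np.vander(x[val], M + 1, increasing=True)
        errors.append(np.sqrt(np.mean((X_val @ w - t[val]) ** 2)))
    return np.mean(errors)

best_M = min(range(10), key=lambda M: cv_rms_error(x, t, M))
print("Polynomial order selected by cross-validation:", best_M)

# Unlike the training error, the held-out error does not keep decreasing with
# model complexity, so it can be used to select M (or a regularization strength).
```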
1.4 Curse of Dimensionality
We will have to deal with spaces of high dimensionality comprising many input variables.
This poses some serious challenges and is an important factor influencing the design of pattern recognition techniques.
Curse of Dimensionality
Curse of Dimensionality
Polynomial curve fitting, M = 3
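The reason the caption above mentions M = 3 is that, for a polynomial of order M in D input dimensions, the number of independent coefficients grows roughly as D^M, so this approach quickly becomes unwieldy as the dimensionality increases (Bishop, Section 1.4).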
1.5 Decision Theory
Determination of p(x, t) from a set of training data is an
example of inference and is typically a very difficult problem
whose solution forms the subject of much of this book.
In a practical application, we often make a specific
prediction for the value of t, or more generally take a
specific action based on our understanding of the values t is
likely to take, and this aspect is the subject of decision theory.
Decision Theory
• Inference step
Determine either p(x, C_k) or p(C_k | x).
• Decision step
For given x, determine the optimal t.
Minimum Misclassification Rate
If our aim is to minimize the chance of assigning x to the wrong class, then intuitively we would choose the class having the higher posterior probability.
The input space is divided into decision regions R_k, one region per class.
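The quantity being minimized, for two classes with decision regions R_1 and R_2, is

p(\text{mistake}) = p(x \in \mathcal{R}_1, \mathcal{C}_2) + p(x \in \mathcal{R}_2, \mathcal{C}_1) = \int_{\mathcal{R}_1} p(x, \mathcal{C}_2)\, dx + \int_{\mathcal{R}_2} p(x, \mathcal{C}_1)\, dx

which is minimized by assigning each x to the class with the larger posterior p(\mathcal{C}_k \mid x).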
Minimum Expected Loss
• Example: classify medical images as 'cancer' or 'normal'
Loss function / cost function: a loss matrix L_kj whose rows are indexed by the true class ("Truth") and whose columns by the chosen class ("Decision").
Minimum Expected Loss
Regions R_j are chosen to minimize the expected loss E[L].
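In Bishop's notation, the expected loss being minimized is

\mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(x, \mathcal{C}_k)\, dx

and each x should be assigned to the class j that minimizes \sum_k L_{kj}\, p(\mathcal{C}_k \mid x). In the book's cancer-screening example the loss matrix penalizes missing a cancer (loss 1000) far more heavily than a false alarm (loss 1), so the decision boundary shifts toward fewer missed cancers.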
Reject Option
Why Separate Inference and Decision?
• Minimizing risk (loss matrix may change over time)
• Reject option
• Unbalanced class priors
• Combining models
Decision Theory for Regression
• Inference step
Determine the joint distribution p(x, t).
• Decision step
For given x, make an optimal prediction, y(x), for t.
• Loss function:
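With the commonly used squared loss, the expected loss is

\mathbb{E}[L] = \iint \{ y(x) - t \}^2\, p(x, t)\, dx\, dt

which is minimized by the conditional mean, y(x) = \mathbb{E}_t[t \mid x] (the regression function).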
Generative vs Discriminative
• Generative approach:
Model p(x | C_k) and p(C_k) (equivalently the joint p(x, C_k)), then use Bayes' theorem to obtain p(C_k | x).
• Discriminative approach:
Model p(C_k | x) directly.
• Discriminant function:
Learn a function f(x) that maps each input x directly onto a class label.
1.6 Information Theory
Information theory will also prove useful in our development of pattern recognition and machine learning techniques.
Consider a discrete random variable x: how much information is received when we observe a specific value of this variable? The amount of information can be viewed as the 'degree of surprise' on learning the value of x.
Entropy
Important quantity in
• coding theory
• statistical physics
• machine learning
Entropy
• Coding theory: x discrete with 8 possible states; how
many bits to transmit the state of x?
• All states equally likely
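With h(x) = -\log_2 p(x) as the information content of an observation, the entropy is the expected information,

H[x] = -\sum_x p(x) \log_2 p(x)

and for 8 equally likely states H[x] = -8 \times \frac{1}{8} \log_2 \frac{1}{8} = 3 bits, matching the 3-bit code needed to transmit the state of x.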
Entropy
Conditional Entropy
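Conditional entropy and the chain rule, in Bishop's notation:

H[\mathbf{y} \mid \mathbf{x}] = -\iint p(\mathbf{x}, \mathbf{y}) \ln p(\mathbf{y} \mid \mathbf{x})\, d\mathbf{y}\, d\mathbf{x}, \qquad H[\mathbf{x}, \mathbf{y}] = H[\mathbf{y} \mid \mathbf{x}] + H[\mathbf{x}]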
The Kullback-Leibler
Divergence
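The divergence referred to here is

\mathrm{KL}(p \,\|\, q) = -\int p(\mathbf{x}) \ln \frac{q(\mathbf{x})}{p(\mathbf{x})}\, d\mathbf{x} \ \ge\ 0

with equality if and only if p = q; it measures the additional information required when q is used to approximate p, and it is not symmetric in its arguments.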
Mutual Information
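Mutual information is the KL divergence between the joint distribution and the product of its marginals,

I[\mathbf{x}, \mathbf{y}] = \mathrm{KL}\big(p(\mathbf{x}, \mathbf{y}) \,\|\, p(\mathbf{x})\, p(\mathbf{y})\big) = H[\mathbf{x}] - H[\mathbf{x} \mid \mathbf{y}] = H[\mathbf{y}] - H[\mathbf{y} \mid \mathbf{x}]

so it vanishes exactly when x and y are independent.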
Conclusion
• Machine Learning
• Generalization
• Classification / Regression
• Fitting / over-fitting
• Regularization
• Bayes' Theorem
• Bayes' Decision
• Entropy / KLD / MI
Q&A