
Introduction
Machine Learning
Feb. 10.
Machine Learning
How can we design a computer system
whose performance improves by
learning from experience?
Exams
• Oral exam
• Task solving exam
• ML software project
Spam filtering
Face/person recognition
demo
Recommendation systems
Robotics
Natural Language Processing
other application areas
– Biometrics
– Object recognition on images
– DNA sequencing
– Financial data mining/prediction
– Process mining and optimisation
Big Data
Rule-based systems vs.
Machine learning
• Domain expert is needed for
– writing rules OR
– providing training samples
• Which one is better?
– Can the expert design rule-based
systems?
– Is the problem specific or general?
Machine learning:
present and future
• present in more and more applications
– "we are drowning in data while
starving for information"
– technological maturity and widespread adoption
– demand for an ever greater degree of
automation and personalisation
• Some problems are solved, but
many open research questions remain!
http://www.ml-class.org/course
Definition
Machine Learning (Mitchell): "a computer
program is said to learn from experience E
with respect to some class of tasks T and
performance measure P, if its performance
at tasks in T, as measured by P, improves
with experience E."
Most of the materials in these slides
were taken from
Pattern Classification (2nd ed) by R.
O. Duda, P. E. Hart and D. G. Stork,
John Wiley & Sons, 2000
with the permission of the authors
and the publisher
Example
Classify fish
Classes: sea bass, salmon
Goal: to learn a model from training data which
can categorise fish
(e.g. salmon are shorter)
Classification (T)
– Supervised learning: based on
training examples (E), learn a model
which performs well on previously
unseen examples.
– Classification: a supervised learning
task of categorising entities into a
predefined set of classes
Basic definitions
• Instance (or entity, sample): one row of the table
• Feature (or attribute): one column, e.g. Length or Lightness
• Class label: the Type column

ID  Length (cm)  Lightness  Type
1   28           0.5        salmon
2   23           0.7        salmon
3   17           0.5        sea bass
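In code, such a training table is usually stored as a feature matrix X (one row per instance, one column per feature) and a label vector y. A minimal sketch in Python/NumPy with the values from the table above:

```python
import numpy as np

# Feature matrix: one row per instance, columns = (length in cm, lightness)
X = np.array([[28.0, 0.5],
              [23.0, 0.7],
              [17.0, 0.5]])

# Class labels, one per instance
y = np.array(["salmon", "salmon", "sea bass"])

print(X.shape)  # (3, 2): 3 instances, 2 features
print(y[2])     # sea bass
```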
Example: Preprocessing
– Image processing steps
• E.g. segmentation of the fish contour from the background
– Feature extraction
• Extraction of features/attributes (atomic variables)
from the images
• Typically numerical or categorical
Example features
• length
• lightness
• width
• number of fins
• position of mouth
Length is a weak discriminator of fish types.
Lightness is a better discriminator.
Performance evaluation (P)
– false positive/negative errors
– E.g. if the decision threshold is decreased, the number
of sea bass falsely classified as salmon
decreases (see the sketch below)
– Decision theory
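As an illustrative sketch of such a threshold rule (the lightness values, labels and thresholds below are made up, not taken from the slides): darker fish are called salmon, and lowering the threshold trades one error type for the other.

```python
import numpy as np

# Hypothetical lightness readings and true labels, for illustration only
lightness = np.array([0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
label     = np.array(["salmon", "salmon", "sea bass", "salmon", "sea bass", "sea bass"])

def classify(lightness, threshold):
    # Decision rule: darker fish (below the threshold) are called salmon
    return np.where(lightness < threshold, "salmon", "sea bass")

for threshold in (0.45, 0.65):
    pred = classify(lightness, threshold)
    # Sea bass falsely classified as salmon
    fp = np.sum((pred == "salmon") & (label == "sea bass"))
    # Salmon falsely classified as sea bass
    fn = np.sum((pred == "sea bass") & (label == "salmon"))
    print(threshold, fp, fn)   # lowering the threshold reduces fp but raises fn
```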
Feature vector
A vector of features describing a
particular instance.
E.g. instance A with features lightness and width:
xᵀ = [x1, x2]
Feature space
Be careful when adding too many features:
– noisy features (e.g. measurement errors)
– redundant features (e.g. information content is similar
to another feature)
We need features which have
discriminative power.
Feature set engineering is highly task-specific!
This (overly complex decision boundary) is not ideal. Remember the supervised learning principle!
Model selection
• Number of features?
• Complexity of the task?
• Classifier speed?
• Task and data-dependent!
The machine learning
lifecycle
• Data preparation
• Feature engineering
• Model selection
• Model training
• Performance evaluation
Data preparation
Do we know whether we have collected a
large enough and representative sample
for training a system?
Model selection and training
– These topics are the focus of this
course
– Investigate the data for model
selection!
No free lunch!
Performance evaluation
• There are various evaluation
metrics
• Simulation of supervised learning
(see the sketch below):
1. split your data into two parts
2. train your model on the training set
3. predict and evaluate your model on
the test set (unseen during training)
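A minimal sketch of this train/test protocol, assuming scikit-learn is available; the bundled iris dataset and the decision-tree classifier are placeholders, not part of the slides:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# 1. split the data into a training and a test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# 2. train the model on the training set only
model = DecisionTreeClassifier().fit(X_train, y_train)

# 3. predict and evaluate on the test set (unseen during training)
print(accuracy_score(y_test, model.predict(X_test)))
```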
Topics of the course
• Classification
• Regression
• Clustering
• Recommendation systems
• Learning to rank
• Structured prediction
• Reinforcement learning
Probability theory
recap
Probability
(atomic) events (A) and probability space (Ω)
Axioms:
- 0 ≤ P(A) ≤ 1
- P(Ω) = 1
- If A1, A2, … are mutually exclusive events (Ai ∩ Aj = ∅ for i ≠ j), then
P(∪_k Ak) = Σ_k P(Ak)
- P(∅) = 0
- P(¬A) = 1 − P(A)
- P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
- P(A) = P(A ∩ B) + P(A ∩ ¬B)
- If A ⊆ B, then P(A) ≤ P(B) and
P(B − A) = P(B) − P(A)
Conditional probability
Conditional probability is the probability
of some event A, given the occurrence
of some other event B.
P(A|B) = P(A∩B)/P(B)
Chain rule:
P(A∩B) = P(A|B)·P(B)
Example:
A: headache, B: influenza
P(A) = 1/10, P(B) = 1/40, P(A|B)=?
Independence of events
A and B are independent iff
P(A|B) = P(A)
Corollary:
P(A∩B) = P(A)·P(B)
P(B|A) = P(B)
Product rule
A1, A2, …, An arbitrary events:
P(A1 ∩ A2 ∩ … ∩ An) = P(An | A1 ∩ … ∩ An−1) ·
P(An−1 | A1 ∩ … ∩ An−2) · … · P(A2 | A1) · P(A1)
Law of total probability
If the events A1, A2, …, An form a partition of the
probability space (a complete system of events)
and P(Ai) > 0 for each i, then
P(B) = Σ_{i=1}^{n} P(B | Ai) · P(Ai)
Bayes rule
P(A|B) = P(A∩B)/P(B) = P(B|A)·P(A)/P(B)
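A small worked sketch of Bayes rule on the headache/influenza example from the conditional-probability slide; P(A) = 1/10 and P(B) = 1/40 come from that slide, while P(A|B) = 1/2 is an assumed value used only for illustration:

```python
# A: headache, B: influenza
p_a = 1 / 10          # P(A), from the example above
p_b = 1 / 40          # P(B), from the example above
p_a_given_b = 1 / 2   # P(A|B): assumed value, for illustration only

# Bayes rule: P(B|A) = P(A|B) * P(B) / P(A)
p_b_given_a = p_a_given_b * p_b / p_a
print(p_b_given_a)    # 0.125, i.e. P(influenza | headache) = 1/8 under these numbers
```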
Random variable
ξ: Ω → ℝ
Random variable vectors…
Cumulative distribution
function (CDF)
F(x) = P(ξ ≤ x)
F(x1) ≤ F(x2), if x1 < x2
lim_{x→−∞} F(x) = 0, lim_{x→∞} F(x) = 1
F(x) is non-decreasing and right-continuous
Discrete vs continuous
random variables
Discrete:
its value set forms a finite or countably
infinite sequence
Continuous: we assume that a density f(x)
exists on the interval (a, b)
Probability density function
(pdf)
F(b) − F(a) = P(a < ξ < b) = ∫_a^b f(x)dx
f(x) = F′(x) and F(x) = ∫_{−∞}^x f(t)dt
Histogram
Empirical estimation of a density
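A minimal sketch of the histogram as an empirical density estimate, assuming NumPy; the exponential distribution is used only because its true pdf f(x) = e^(−x) is easy to compare against:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.exponential(scale=1.0, size=100_000)

# Normalised histogram: an empirical estimate of the density
counts, edges = np.histogram(samples, bins=50, range=(0.0, 5.0), density=True)
centers = (edges[:-1] + edges[1:]) / 2

# Compare with the true pdf f(x) = exp(-x)
true_pdf = np.exp(-centers)
print(np.max(np.abs(counts - true_pdf)))  # small: the normalised histogram tracks the true density
```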
Independence of random
variables
ξ and η are independent iff for any
a ≤ b, c ≤ d:
P(a ≤ ξ ≤ b, c ≤ η ≤ d) =
P(a ≤ ξ ≤ b) · P(c ≤ η ≤ d)
Composition of random
variables
Discrete case:
ζ = ξ + η
If ξ and η are independent:
r_n = P(ζ = n) = Σ_{k=−∞}^{∞} P(ξ = n − k, η = k)
= Σ_k P(ξ = n − k) · P(η = k)
Expected value
If ξ can take values x1, x2, … with probabilities p1, p2, …, then
M(ξ) = Σ_i x_i·p_i
Continuous case:
M(ξ) = ∫_{−∞}^{∞} x·f(x)dx
Properties of expected value
M(c·ξ) = c·M(ξ)
M(ξ + η) = M(ξ) + M(η)
If ξ and η are independent random
variables, then M(ξ·η) = M(ξ)·M(η)
Standard deviation
D(ξ) = (M[(ξ − M(ξ))²])^{1/2}
D²(ξ) = M(ξ²) − M²(ξ)
Properties of standard
deviation
- D²(a·ξ + b) = a²·D²(ξ)
- if ξ1, ξ2, …, ξn are independent random
variables, then
D²(ξ1 + ξ2 + … + ξn) = D²(ξ1) + D²(ξ2) + … +
D²(ξn)
Correlation
Covariance:
c = M[(ξ − M(ξ))·(η − M(η))]
c is 0 if ξ and η are independent
Correlation coefficient:
r = c / (D(ξ)·D(η)),
the covariance normalised into [−1, 1]
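A minimal numerical sketch with NumPy; the linear relationship between ξ and η and the noise level are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
xi = rng.normal(size=10_000)
eta = 2.0 * xi + rng.normal(size=10_000)   # eta depends linearly on xi, plus noise

# Covariance c = M[(xi - M(xi)) * (eta - M(eta))]
c = np.mean((xi - xi.mean()) * (eta - eta.mean()))

# Correlation coefficient r = c / (D(xi) * D(eta)), always in [-1, 1]
r = c / (xi.std() * eta.std())
print(r)                           # close to 2 / sqrt(5) ≈ 0.894
print(np.corrcoef(xi, eta)[0, 1])  # the same value via NumPy's built-in
```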
Well-known distributions
Normal/Gaussian
Binomial:
ξ ~ B(n, p)
M(ξ) = n·p
D²(ξ) = n·p·(1 − p), so D(ξ) = √(n·p·(1 − p))