Machine Learning
Saarland University, SS 2007
Lecture 10, Friday June 22nd, 2007
(Everything you always wanted to know
about statistics … but were afraid to ask)
Holger Bast
[with input from Ingmar Weber]
Max-Planck-Institut für Informatik
Saarbrücken, Germany
Overview of this lecture

Maximum likelihood vs. unbiased estimators
– Example: normal distribution
– Example: drawing numbers from a box

Things you keep on reading in the ML literature
– marginal distribution
– prior
– posterior

Statistical tests
– hypothesis testing
– discussion of its (non)sense
[example]
Maximum likelihood vs. unbiased estimators

Example: maximum likelihood estimator from Lecture 8,
Example 2
– μ(x1,…,xn) = 1/n ∙ Σi xi
– σ²(x1,…,xn) = 1/n ∙ Σi (xi – μ(x1,…,xn))²
– X1,…,Xn independent, identically distributed random variables
with mean μ and variance σ²
– E μ(X1,…,Xn) = μ
[blackboard]
– E σ²(X1,…,Xn) = (n–1)/n ∙ σ² ≠ σ²
[blackboard]
– unbiased variance estimator = 1/(n–1) ∙ Σi (xi – μ(x1,…,xn))²
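A quick simulation makes the bias visible (my own sketch, not from the lecture; the normal distribution and all parameter values are arbitrary choices): averaged over many samples, dividing by n lands near (n–1)/n ∙ σ², while dividing by n–1 lands near σ².

import numpy as np

rng = np.random.default_rng(0)
mu, sigma2, n, trials = 5.0, 4.0, 10, 100_000

# 'trials' samples of size n from N(mu, sigma2)
x = rng.normal(mu, np.sqrt(sigma2), size=(trials, n))
m = x.mean(axis=1, keepdims=True)      # ML estimate of the mean, per sample
ss = ((x - m) ** 2).sum(axis=1)        # sum of squared deviations, per sample

print((ss / n).mean())        # ML estimator: close to (n-1)/n * sigma2 = 3.6
print((ss / (n - 1)).mean())  # unbiased estimator: close to sigma2 = 4.0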

Example: a number x is drawn from a box with the numbers 1..n,
for unknown n
– maximum likelihood estimator: n̂ = x
[blackboard]
– unbiased estimator: n̂ = 2x – 1
[since E X = (n+1)/2, we get E (2X – 1) = n]
[blackboard]
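The same kind of empirical check works here (again my own sketch; n and the number of trials are arbitrary): averaged over many draws, x itself concentrates around (n+1)/2, while 2x – 1 concentrates around n.

import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000

x = rng.integers(1, n + 1, size=trials)  # uniform draws from {1, ..., n}
print(x.mean())            # ML estimator: E X = (n+1)/2, i.e. about 50.5
print((2 * x - 1).mean())  # unbiased estimator: E (2X - 1) = n, i.e. about 100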
Marginal distribution

Joint probability distribution, for example
– pick a random MPII staff member
– random variables X = department, Y = gender
– for example, Pr(X = D3, Y = female) = 0.04

          D1     D2     D3     D4     D5     Σ
male      0.24   0.09   0.13   0.25   0.11   0.82
female    0.03   0.03   0.04   0.04   0.04   0.18
Σ         0.27   0.12   0.17   0.29   0.15

– row sums = marginal distribution of Y, e.g. Pr(female) = 0.18
– column sums = marginal distribution of X, e.g. Pr(D3) = 0.17
Note:
– matrix entries sum to 1
– in general, Pr(X = x, Y = y) ≠ Pr(X = x) ∙ Pr(Y = y)
[holds if and only if X and Y are independent]
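As a check on the table, the following sketch (my own illustration, using NumPy) recomputes both marginal distributions as row and column sums and compares the joint entries with the product of the marginals; the two differ, so X and Y are not independent here.

import numpy as np

# rows: male, female; columns: D1..D5 (the joint table from above)
joint = np.array([[0.24, 0.09, 0.13, 0.25, 0.11],
                  [0.03, 0.03, 0.04, 0.04, 0.04]])

pr_y = joint.sum(axis=1)   # marginal of Y: [0.82, 0.18]
pr_x = joint.sum(axis=0)   # marginal of X: [0.27, 0.12, 0.17, 0.29, 0.15]
print(joint.sum())         # the entries sum to 1

# independence would require joint == outer product of the marginals
print(np.outer(pr_y, pr_x))  # e.g. 0.82 * 0.27 = 0.2214 ≠ 0.24, so dependent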
Frequentism vs. Bayesianism

Frequentism
– probability = relative frequency in a large number of trials
– associated with a random (physical) system
– applied only to well-defined events in a well-defined space
for example: the probability of a die showing a 6

Bayesianism
– probability = degree of belief
– no random process at all needs to be involved
– applied to arbitrary statements
for example: probability that I will like a new movie
Prior / Posterior probability

Prior
– a guess about the data, with no random experiment behind it
– one then computes with the guess as if it were a probability
– for example: the Z1,…,Zn from the E-step of the EM algorithm

Posterior
– probability related to an event that has already happened
– for example: all our likelihoods from Lectures 8 and 9

Note: these are not well-defined technical terms
– but they are often used as if they were, which is confusing
– this is the Bayesian way of speaking …
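To make the Bayesian usage concrete, here is a minimal sketch with made-up numbers: a prior belief over two coin hypotheses is updated into a posterior by multiplying with the likelihood of the observed tosses and renormalizing (Bayes' rule).

import numpy as np

prior = np.array([0.5, 0.5])    # belief: coin is fair vs. biased with p = 0.8
p_heads = np.array([0.5, 0.8])  # Pr(heads | hypothesis)

k, n = 8, 10                    # observe 8 heads in 10 tosses
likelihood = p_heads**k * (1 - p_heads)**(n - k)

posterior = prior * likelihood
posterior /= posterior.sum()    # normalize by the marginal Pr(data)
print(posterior)                # roughly [0.13, 0.87]: belief shifts to "biased"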
Hypothesis testing



Example: do two samples have the same mean?
– e.g., two groups of patients in a medical experiment,
one group with medication and one group without
– for example, 8.6 4.3 3.2 5.1
and 2.1 4.2 7.6 3.2 2.9

Test
– formulate a null hypothesis, e.g. equal means
– compute the probability p of the given (or more extreme) data,
assuming that the null hypothesis is true
[blackboard]

Outcome
– p ≤ α = 0.05 ⇒ the hypothesis is rejected at significance level 5%
(i.e., with 95% confidence)
one says: the difference of the means is statistically significant
– p > α = 0.05 ⇒ the hypothesis cannot be rejected
one says: the difference of the means is statistically insignificant
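The computation behind the blackboard step can be reproduced in a few lines; the sketch below uses SciPy's two-sample t-test on the example data above (SciPy and the Welch variant, which does not assume equal variances, are my choices here, not part of the lecture).

from scipy import stats

group_a = [8.6, 4.3, 3.2, 5.1]
group_b = [2.1, 4.2, 7.6, 3.2, 2.9]

# Welch's t-test; null hypothesis: the two means are equal
t, p = stats.ttest_ind(group_a, group_b, equal_var=False)
print(f"t = {t:.3f}, p = {p:.3f}")
if p <= 0.05:
    print("the difference of the means is statistically significant")
else:
    print("the null hypothesis cannot be rejected")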
Hypothesis testing — BEWARE!

What one would ideally like:
– given this data, what is the probability that my hypothesis is true?
– formally: Pr(H | D)

What one gets from hypothesis testing
– given that my hypothesis is true, what is the probability of this (or
more extreme) data
– formally: Pr(D | H)
– but Pr(D | H) can be low for reasons other than the hypothesis being false!
[blackboard example]

Useful at all?
– OK: challenge a theory by attempting to reject it
– NO: confirm a theory by rejecting the corresponding null hypothesis
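To make the warning concrete, here is a numeric sketch with made-up numbers (not the blackboard example): even when Pr(D | H) falls below 0.05, Bayes' rule can leave Pr(H | D) very high, because the prior Pr(H) and the likelihood of the data under the alternative also enter.

pr_h = 0.99              # prior: the (null) hypothesis is almost surely true
pr_d_given_h = 0.04      # the data looks "significant": Pr(D | H) < 0.05
pr_d_given_not_h = 0.10  # but the data is barely likelier under the alternative

# Bayes' rule: Pr(H | D) = Pr(D | H) * Pr(H) / Pr(D)
pr_d = pr_d_given_h * pr_h + pr_d_given_not_h * (1 - pr_h)
pr_h_given_d = pr_d_given_h * pr_h / pr_d
print(pr_h_given_d)      # about 0.975: H stays very probable despite p < 0.05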
Literature

Read the wonderful articles by Jacob Cohen
– Things I have learned (so far)
American Psychologist, 45(12):1304–1312, 1990
– The earth is round (p < .05)
American Psychologist, 49(12):997–1003, 1994