lecture8 - University of Arizona


LING 696B:
Midterm review: parametric and non-parametric inductive inference
1
Big question:

How do people generalize?
Examples related to language:

Categorizing a new stimulus
Assigning structure to a signal
Telling whether a form is grammatical

What is the nature of inductive inference?
What role does statistics play?
5
Two paradigms of statistical
learning (I)

Fisher’s paradigm: inductive inference
through likelihood -- p(X|θ)

X: observed set of data
θ: parameters of the probability density
function p, or an interpretation of X
We expect X to come from an infinite
population following p(X|θ)

Representational bias: the form of p(X|θ)
constrains what kinds of things you can learn
6
Learning in Fisher’s paradigm

Philosophy: finding the infinite
population so that the chance of seeing
X is large (idea from Bayes)




Knowing the universe by seeing individuals
Randomness is due to the finiteness of X
Maximum likelihood: find θ so that p(X|θ)
reaches its maximum
Natural consequence: the more X you
see, the better you learn about p(X|θ)
7
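As a minimal sketch of maximum likelihood in this paradigm (not from the lecture; the Gaussian model and all numbers are my illustrative assumptions): the sample average maximizes p(X|θ) for the mean of a Gaussian, and larger samples land closer to the truth.

```python
import random

random.seed(0)

def mle_gaussian_mean(xs):
    """ML estimate of a Gaussian mean (known variance):
    the sample average maximizes p(X|theta)."""
    return sum(xs) / len(xs)

true_mean, true_sd = 2.0, 1.0

# Small and large samples drawn from the same "infinite population"
small = [random.gauss(true_mean, true_sd) for _ in range(10)]
large = [random.gauss(true_mean, true_sd) for _ in range(10000)]

est_small = mle_gaussian_mean(small)
est_large = mle_gaussian_mean(large)

print(est_small, est_large)
```

The large-sample estimate is reliably much closer to the true mean, matching the "more X, better learning" point.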
Extending Fisher’s paradigm to
complex situations

Statisticians cannot specify p(X|θ) for you!

Must come from an understanding of the structure
that generates X, e.g. grammar
Needs a supporting theory that guides the
construction of p(X|θ) -- “language is special”

Extending p(X|θ) to include hidden variables

The EM algorithm

Making bigger models from smaller models

Iterative learning through coordinate-wise ascent
8
Example: unsupervised
learning of categories



X: instances of pre-segmented speech sounds
θ: mixture of a fixed number of category
models
Representational bias:




Discreteness
Distribution of each category (bias from mixture
components)
Hidden variable: category membership
Learning: EM algorithm
9
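The slide’s setup can be sketched in the simplest case: a two-category, one-dimensional mixture with unit variances and fixed equal weights (these simplifications and the data are my illustrative assumptions, not the lecture’s model). The E-step fills in the hidden category memberships; the M-step re-estimates each category mean.

```python
import math, random

random.seed(1)

# Toy "speech sound" data: two hidden categories in one dimension
data = [random.gauss(0.0, 1.0) for _ in range(200)] + \
       [random.gauss(5.0, 1.0) for _ in range(200)]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

# EM for a 2-component mixture, unit variances, equal weights held fixed
mu = [1.0, 4.0]  # initial guesses for the category means
for _ in range(50):
    # E-step: posterior probability of category membership (hidden variable)
    resp = [(normal_pdf(x, mu[0]), normal_pdf(x, mu[1])) for x in data]
    resp = [(a / (a + b), b / (a + b)) for a, b in resp]
    # M-step: re-estimate each category mean from its soft members
    for k in range(2):
        w = sum(r[k] for r in resp)
        mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / w

print(mu)
```

With well-separated categories, the estimated means converge close to the generating values.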
Example: unsupervised
learning of phonological words



X: instances of word-level signals
θ: mixture model + phonotactic model +
word segmentation
Representational bias:




Discreteness
Distribution of each category (bias from mixture
components)
Combinatorial structure of phonological words
Learning: coordinate-wise ascent
10
From Fisher’s paradigm to
Bayesian learning


Bayesian: wants to learn the posterior
distribution p(θ|X)
Bayes’ formula: p(θ|X) ∝ p(X|θ) p(θ)
= p(X, θ)

Same as ML when p(θ) is uniform
Still needs a theory guiding the
construction of p(θ) and p(X|θ)

More on this later
11
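A small illustrative sketch (the coin-flip data and the grid are my assumptions, not from the lecture) of computing p(θ|X) ∝ p(X|θ)p(θ) on a grid; with a uniform prior the posterior peaks exactly at the ML estimate.

```python
# Posterior over theta (a coin's bias) on a grid, via Bayes' formula:
# p(theta|X) ∝ p(X|theta) * p(theta).  With a uniform prior, the
# posterior peaks at the maximum-likelihood estimate k/n.
k, n = 7, 10  # observed data X: 7 successes in 10 trials (illustrative)

grid = [i / 1000 for i in range(1, 1000)]

def likelihood(theta):
    return theta ** k * (1 - theta) ** (n - k)

def prior(theta):
    return 1.0  # uniform p(theta)

unnorm = [likelihood(t) * prior(t) for t in grid]
z = sum(unnorm)
posterior = [u / z for u in unnorm]

map_estimate = grid[max(range(len(grid)), key=lambda i: unnorm[i])]
post_mean = sum(t * p for t, p in zip(grid, posterior))
mle = k / n
print(map_estimate, post_mean, mle)
```

The posterior mode coincides with the MLE here; a non-uniform prior would pull the estimate away from it.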
Attractions of generative
modeling

Has clear semantics

p(X|θ) -- prediction/production/synthesis
p(θ) -- belief/prior knowledge/initial bias
p(θ|X) -- perception/interpretation

Can make “infinite generalizations”

Synthesizing from p(X, θ) can tell us
something about the generalization

A very general framework

Theory of everything?
14
Challenges to generative
modeling

The representational bias can be wrong

But “all models are wrong”

Unclear how to choose from different
classes of models

E.g. the destiny of K (the number of mixture components)
Simplicity is relative, e.g. f(x) = a*sin(bx) + c

Computing max{p(X|θ)} can be very hard

Bayesian computation may help
20
Challenges to generative
modeling

Even finding X can be hard for language

Probability distribution over what?
Example: X for statistical syntax?

Strings of words
Parse trees
Semantic interpretations
Social interactions

Hope: staying at low levels of language
will make the choice of X easier
23
Two paradigms of statistical
learning (II)


Vapnik’s critique of generative
modeling:
“Why solve a more general problem
before solving a specific one?”
Example: generative approach to two-class
classification (supervised)
Likelihood ratio test:
log[p(x|A)/p(x|B)]
A, B are parametric models
24
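A hedged sketch of the generative route to two-class classification (the 1-D Gaussian class models and the toy data are my illustrative assumptions): fit p(x|A) and p(x|B) separately, then classify by the sign of the log-likelihood ratio.

```python
import math

def fit_gaussian(xs):
    """Fit a 1-D Gaussian by maximum likelihood (parametric class model)."""
    mu = sum(xs) / len(xs)
    var = sum((x - mu) ** 2 for x in xs) / len(xs)
    return mu, var

def log_pdf(x, mu, var):
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

class_a = [0.9, 1.1, 1.0, 0.8, 1.2]   # toy labelled data for class A
class_b = [3.9, 4.1, 4.0, 3.8, 4.2]   # toy labelled data for class B

mu_a, var_a = fit_gaussian(class_a)
mu_b, var_b = fit_gaussian(class_b)

def classify(x):
    # Likelihood ratio test: log[p(x|A)/p(x|B)] > 0 means "A"
    llr = log_pdf(x, mu_a, var_a) - log_pdf(x, mu_b, var_b)
    return "A" if llr > 0 else "B"

print(classify(1.5), classify(3.5))
```

Vapnik’s point is that this solves the harder problem of modeling each class’s full distribution when only the decision boundary is needed.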
Non-parametric approach to
inductive inference

Main idea: don’t want to know the universe
first, then generalize



Universe is complicated, representational bias
often inappropriate
Very few data to learn from, compared to
dimensionality of space
Instead, want to generalize directly from old
data to new data

Rules vs. analogy?
25
Examples of non-parametric
learning (I):

Nearest neighbor classification:


Analogy-based learning by dictionary
lookup
Generalize to K-nearest neighbors
26
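A minimal sketch of nearest-neighbor classification (the 1-D exemplars are my illustrative assumptions): k=1 is pure dictionary lookup; larger k votes among the k nearest stored examples.

```python
def knn_classify(x, examples, k=3):
    """Vote among the k stored (point, label) pairs closest to x."""
    nearest = sorted(examples, key=lambda e: abs(e[0] - x))[:k]
    labels = [label for _, label in nearest]
    return max(set(labels), key=labels.count)

# Memorized exemplars: two categories on a line
examples = [(0.8, "A"), (1.0, "A"), (1.3, "A"),
            (3.7, "B"), (4.0, "B"), (4.2, "B")]

# k=1 is pure dictionary lookup; larger k smooths the vote
print(knn_classify(2.0, examples, k=1))
print(knn_classify(3.0, examples, k=3))
```

No parametric model of the categories is ever built; generalization goes directly from old data to the new point.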
Examples of non-parametric
learning (II)

Radial basis networks for supervised
learning: F(x) = Σ_i a_i K(x, x_i)

K(x, x_i): a non-linear similarity function
centered at x_i, with tunable parameters
Interpretation: “soft/smooth” dictionary
lookup/analogy within a population
Learning: find the a_i from (x_i, y_i) pairs -- a
regularized regression problem
min_F Σ_i [F(x_i) - y_i]^2 + λ||F||^2
27
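A sketch of the regularized regression above for a Gaussian choice of K(x, x_i) (the kernel width, data, and regularization weight λ are my illustrative assumptions): minimizing Σ_i [F(x_i) - y_i]^2 + λ||F||^2 over F(x) = Σ_i a_i K(x, x_i) reduces to solving the linear system (K + λI)a = y.

```python
import math

def kernel(x, xi, width=1.0):
    # Gaussian radial basis function centered at xi
    return math.exp(-((x - xi) ** 2) / (2 * width ** 2))

def solve(A, b):
    """Solve A a = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    a = [0.0] * n
    for r in reversed(range(n)):
        a[r] = (M[r][n] - sum(M[r][c] * a[c] for c in range(r + 1, n))) / M[r][r]
    return a

xs = [0.0, 1.0, 2.0, 3.0, 4.0]       # training inputs x_i
ys = [math.sin(x) for x in xs]        # training targets y_i
lam = 1e-6                            # regularization weight (assumed)

n = len(xs)
K = [[kernel(x, xi) for xi in xs] for x in xs]
Kreg = [[K[i][j] + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
a = solve(Kreg, ys)

def F(x):
    # "Soft dictionary lookup": every stored x_i contributes via a_i K(x, x_i)
    return sum(ai * kernel(x, xi) for ai, xi in zip(a, xs))

print([round(F(x), 3) for x in xs])
```

With λ small, F interpolates the training pairs; larger λ trades fit for smoothness.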
Radial basis
functions/networks


Each data point x_i is associated with a
K(x, x_i) -- a radial basis function
Linear combinations of enough K(x, x_i)
can approximate any smooth function
from R^n to R



Universal approximation property
Network interpretation
(see demo)
28
How is this different from
generative modeling?


Do not assume a fixed space to search
for the best hypothesis
Instead, this space grows with the
amount of data



Basis of the space: K(x, x_i)
Interpretation: local generalization from old
data x_i to new data x

F(x) = Σ_i a_i K(x, x_i) represents an
ensemble generalization from {x_i} to x
29
Examples of non-parametric
learning (III)

Support Vector Machines (last time):
linear separation f(x) = sign(<w,x>+b)
30
Max margin classification

The solution is also a direct
generalization from old data, but sparse:
in f(x) = sign(Σ_i a_i <x_i, x> + b),
the coefficients a_i are mostly zero
31
Interpretation of support
vectors

Support vectors have non-zero
contribution to the generalization

“prototypes” for analogical learning:
in f(x) = sign(Σ_i a_i <x_i, x> + b),
only the support vectors have non-zero a_i
32
Kernel generalization of SVM

The solution looks very much like RBF
networks:



RBF net: F(x) = Σ_i a_i K(x, x_i)
Many old data contribute to generalization
SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b)
Relatively few old data contribute
Dense/sparse solution is due to
different goals (see demo)
33
Transductive inference with
support vectors

One more wrinkle: now I’m putting two
points there, but won’t tell you their color
34
Transductive SVM

Not only do the old data affect generalization;
the new data affect each other too
35
A general view of non-parametric inductive inference

A function approximation problem:
knowing that (x1, y1), …, (xN, yN) are
input and output of some unknown
function F, how can we approximate F
and generalize to new values of x?



Linguistics: find the universe for F
Psychology: find the best model that
“behaves” like F
In realistic terms, non-parametric
methods often win
36
Who’s got the answer?

Parametric approach can also
approximate functions

Model the joint distribution p(x, y|θ)
But the model is often difficult to build

E.g. a realistic experimental task

Before reaching a conclusion, we need
to know how people learn

They may be doing both
39
Where does neural net fit?

Clearly not generative: does not reason
with probability
Somewhat different from analogy-type
non-parametric methods: the network does
not directly reason from old data

Difficult to interpret the generalization
Some results available for limiting cases

Similar to non-parametric methods when
the number of hidden units is infinite
42
A point that nobody gets right

Small-sample dilemma: people learn
from very few examples (compared to the
dimensionality of the data), yet any statistical
machinery needs many



Parametric: the ML estimate approaches the
true distribution only with an infinite sample
Non-parametric: universal approximation
requires an infinite sample
The limit is taken in the wrong direction
43