LING 696B
Midterm review: parametric and non-parametric inductive inference
Big question:
How do people generalize?
Examples related to language:
Categorizing a new stimulus
Assigning structure to a signal
Telling whether a form is grammatical
What is the nature of inductive inference?
What role does statistics play?
Two paradigms of statistical learning (I)
Fisher's paradigm: inductive inference through likelihood -- p(X|θ)
X: observed set of data
θ: parameters of the probability density function p, or an interpretation of X
We expect X to come from an infinite population obeying p(X|θ)
Representational bias: the form of p(X|θ) constrains what kinds of things you can learn
Learning in Fisher's paradigm
Philosophy: find the infinite population so that the chance of seeing X is large (idea from Bayes)
Knowing the universe by seeing individuals
Randomness is due to the finiteness of X
Maximum likelihood: find θ so that p(X|θ) reaches its maximum
Natural consequence: the more X you see, the better you learn about p(X|θ)
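To make the "more X, better estimate" point concrete, here is a minimal sketch (my own, not from the lecture) that fits a Gaussian p(X|θ) with θ = (μ, σ) by maximum likelihood; the ML estimates are just the sample mean and standard deviation, and they approach the true parameters as the sample grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mu, true_sigma = 2.0, 0.5            # the "infinite population" p(X|theta)

for n in [10, 100, 10000]:
    X = rng.normal(true_mu, true_sigma, size=n)   # observed data X
    # ML for a Gaussian has a closed form: sample mean and sample std
    mu_hat, sigma_hat = X.mean(), X.std()
    # average log-likelihood of the data under the fitted theta
    ll = np.mean(-0.5 * np.log(2 * np.pi * sigma_hat**2)
                 - (X - mu_hat)**2 / (2 * sigma_hat**2))
    print(f"n={n:6d}  mu_hat={mu_hat:.3f}  sigma_hat={sigma_hat:.3f}  avg loglik={ll:.3f}")
```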
Extending Fisher's paradigm to complex situations
Statisticians cannot specify p(X|θ) for you!
Extending p(X|θ) to include hidden variables
Must come from an understanding of the structure that generates X, e.g. grammar
Needs a supporting theory that guides the construction of p(X|θ) -- "language is special"
The EM algorithm
Building bigger models from smaller models
Iterative learning through coordinate-wise ascent
Example: unsupervised learning of categories
X: instances of pre-segmented speech sounds
θ: mixture of a fixed number of category models
Representational bias:
Discreteness
Distribution of each category (bias from mixture components)
Hidden variable: category membership
Learning: EM algorithm
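A minimal sketch (mine, not from the slides) of EM for a 1-D mixture of K Gaussian category models: the E-step fills in the hidden category memberships, the M-step re-estimates the category models from those soft assignments.

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy "speech sound" measurements: two overlapping 1-D categories
X = np.concatenate([rng.normal(300, 30, 200), rng.normal(600, 50, 200)])

K = 2                                    # fixed number of categories
pi = np.full(K, 1.0 / K)                 # mixing weights
mu = rng.choice(X, K)                    # random initial means
sigma = np.full(K, X.std())              # initial spreads

for _ in range(50):
    # E-step: posterior probability of each hidden category membership
    dens = np.exp(-(X[:, None] - mu)**2 / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
    resp = pi * dens
    resp /= resp.sum(axis=1, keepdims=True)
    # M-step: re-estimate each category model from the soft assignments
    Nk = resp.sum(axis=0)
    pi = Nk / len(X)
    mu = (resp * X[:, None]).sum(axis=0) / Nk
    sigma = np.sqrt((resp * (X[:, None] - mu)**2).sum(axis=0) / Nk)

print("weights:", np.round(pi, 2), "means:", np.round(mu, 1), "sds:", np.round(sigma, 1))
```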
Example: unsupervised learning of phonological words
X: instances of word-level signals
θ: mixture model + phonotactic model + word segmentation
Representational bias:
Discreteness
Distribution of each category (bias from mixture components)
Combinatorial structure of phonological words
Learning: coordinate-wise ascent
From Fisher's paradigm to Bayesian learning
Bayesian: wants to learn the posterior distribution p(θ|X)
Bayes' formula: p(θ|X) ∝ p(X|θ) p(θ) = p(X, θ)
Same as ML when p(θ) is uniform
Still needs a theory guiding the construction of p(θ) and p(X|θ)
More on this later
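A small sketch (not from the slides) of Bayes' formula computed on a grid for a coin with unknown heads probability θ: the posterior is proportional to likelihood times prior, and with a uniform prior its mode coincides with the ML estimate.

```python
import numpy as np

heads, tails = 7, 3                      # observed data X
theta = np.linspace(0.001, 0.999, 999)   # grid over the parameter

likelihood = theta**heads * (1 - theta)**tails    # p(X | theta)
prior = np.ones_like(theta)                       # uniform p(theta)
posterior = likelihood * prior                    # p(theta | X) up to a constant
posterior /= posterior.sum()                      # normalize over the grid

print("ML estimate:         ", theta[np.argmax(likelihood)])   # 0.7
print("Posterior mode (MAP):", theta[np.argmax(posterior)])    # same, since the prior is flat
```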
Attractions of generative modeling
Has clear semantics
p(X|θ) -- prediction/production/synthesis
p(θ) -- belief/prior knowledge/initial bias
p(θ|X) -- perception/interpretation
Can make "infinite generalizations"
Synthesizing from p(X, θ) can tell us something about the generalization
A very general framework
Theory of everything?
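To illustrate the synthesis point, here is a sketch (assuming scikit-learn is available; not the lecture's own demo) that fits a generative mixture model and then samples new data from it; inspecting the synthesized points is one way to see what the model has generalized.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Two toy 2-D categories, e.g. formant-like (F1, F2) measurements
X = np.vstack([rng.normal([300, 2300], 40, (100, 2)),
               rng.normal([700, 1200], 60, (100, 2))])

gm = GaussianMixture(n_components=2, random_state=0).fit(X)  # learn theta
X_new, labels = gm.sample(5)       # synthesize from the fitted model
print(np.round(X_new), labels)     # new "productions" and their hidden categories
```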
Challenges to generative modeling
The representational bias can be wrong
But "all models are wrong"
Unclear how to choose from different classes of models
E.g. the destiny of K (the number of mixture components)
Simplicity is relative, e.g. f(x) = a*sin(bx) + c
Computing max{p(X|θ)} can be very hard
Bayesian computation may help
Challenges to generative modeling
Even finding X can be hard for language
Probability distribution over what?
Example: X for statistical syntax?
Strings of words
Parse trees
Semantic interpretations
Social interactions
Hope: staying at low levels of language will make the choice of X easier
Two paradigms of statistical learning (II)
Vapnik's critique of generative modeling:
"Why solve a more general problem before solving a specific one?"
Example: generative approach to 2-class classification (supervised)
Likelihood ratio test:
log[p(x|A) / p(x|B)]
A, B are parametric models
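A minimal sketch (mine, assuming scipy is available) of the generative route Vapnik is questioning: fit a parametric model for each class, then classify a new x by the sign of the log likelihood ratio log[p(x|A)/p(x|B)].

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(3)
A = rng.normal(0.0, 1.0, 100)        # training data for class A
B = rng.normal(2.0, 1.5, 100)        # training data for class B

# "Solve the more general problem": estimate p(x|A) and p(x|B) by ML
muA, sdA = A.mean(), A.std()
muB, sdB = B.mean(), B.std()

def classify(x):
    # log likelihood ratio test: positive -> A, negative -> B
    llr = norm.logpdf(x, muA, sdA) - norm.logpdf(x, muB, sdB)
    return np.where(llr > 0, "A", "B")

print(classify(np.array([-1.0, 1.0, 3.0])))
```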
Non-parametric approach to inductive inference
Main idea: don't want to know the universe first, then generalize
The universe is complicated; a representational bias is often inappropriate
Very few data to learn from, compared to the dimensionality of the space
Instead, want to generalize directly from old data to new data
Rules vs. analogy?
Examples of non-parametric learning (I)
Nearest neighbor classification:
Analogy-based learning by dictionary lookup
Generalizes to K-nearest neighbors
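A short sketch of K-nearest-neighbor classification as "dictionary lookup", written directly in numpy rather than with a library.

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    """Classify x_new by majority vote among its k nearest stored examples."""
    dists = np.linalg.norm(X_train - x_new, axis=1)   # look up the whole dictionary
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = y_train[nearest]
    return np.bincount(votes).argmax()                # majority label

# Tiny example: two labeled clusters in 2-D
X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array([0, 0, 0, 1, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.5, 0.8])))   # -> 0
print(knn_predict(X_train, y_train, np.array([5.5, 5.2])))   # -> 1
```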
Examples of non-parametric learning (II)
Radial basis networks for supervised learning: F(x) = Σ_i a_i K(x, x_i)
K(x, x_i) is a non-linear similarity function centered at x_i, with tunable parameters
Interpretation: "soft/smooth" dictionary lookup/analogy within a population
Learning: find the a_i from (x_i, y_i) pairs -- a regularized regression problem
min_f Σ_i [f(x_i) - y_i]^2 + λ ||f||^2
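A sketch (my own) of fitting F(x) = Σ_i a_i K(x, x_i) with Gaussian basis functions: minimizing the regularized squared error above has the closed-form solution a = (K + λI)⁻¹ y, which is what the code solves.

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(-3, 3, 30)                       # training inputs x_i
y = np.sin(x) + rng.normal(0, 0.1, 30)           # noisy outputs y_i

def K(a, b, width=0.5):
    """Gaussian radial basis function centered at each data point b_j."""
    return np.exp(-(a[:, None] - b[None, :])**2 / (2 * width**2))

lam = 1e-3                                        # regularization weight lambda
G = K(x, x)                                       # Gram matrix K(x_i, x_j)
a = np.linalg.solve(G + lam * np.eye(len(x)), y)  # closed-form regularized fit

x_new = np.linspace(-3, 3, 7)
F = K(x_new, x) @ a                               # F(x) = sum_i a_i K(x, x_i)
print(np.round(F, 2))                             # close to sin(x_new)
print(np.round(np.sin(x_new), 2))
```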
Radial basis functions/networks
Each data point x_i is associated with a K(x, x_i) -- a radial basis function
Linear combinations of enough K(x, x_i) can approximate any smooth function from R^n to R
Universal approximation property
Network interpretation
(see demo)
How is this different from generative modeling?
Do not assume a fixed space in which to search for the best hypothesis
Instead, this space grows with the amount of data
Basis of the space: K(x, x_i)
Interpretation: local generalization from old data x_i to new data x
F(x) = Σ_i a_i K(x, x_i) represents an ensemble generalization from {x_i} to x
Examples of non-parametric
learning (III)
Support Vector Machines (last time):
linear separation f(x) = sign(<w,x>+b)
Max margin classification
The solution is also a direct generalization from old data, but sparse: most coefficients are zero
f(x) = sign(<w, x> + b)
Interpretation of support vectors
Support vectors have a non-zero contribution to the generalization
"prototypes" for analogical learning
f(x) = sign(<w, x> + b), with the remaining coefficients mostly zero
Kernel generalization of SVM
The solution looks very much like RBF networks:
RBF net: F(x) = Σ_i a_i K(x, x_i)
Many old data points contribute to the generalization
SVM: F(x) = sign(Σ_i a_i K(x, x_i) + b)
Relatively few old data points contribute
The dense/sparse solution is due to different goals (see demo)
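A sketch (assuming scikit-learn; not the lecture's demo) of the contrast: a kernel SVM trained on toy data keeps only a small subset of the training points as support vectors, whereas the RBF-network fit above assigns a coefficient to every training point.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(5)
# Two 2-D classes with some overlap
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

clf = SVC(kernel="rbf", C=1.0).fit(X, y)   # F(x) = sign(sum_i a_i K(x, x_i) + b)
print("training points:", len(X))
print("support vectors:", clf.n_support_.sum())   # typically far fewer than 200
print(clf.predict([[0, 0], [3, 3]]))
```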
Transductive inference with
support vectors
One more wrinkle: now I’m putting two
points there, but don’t tell you the color
Transductive SVM
Not only do old data affect the generalization; the new data affect each other too
A general view of non-parametric inductive inference
A function approximation problem: knowing that (x_1, y_1), ..., (x_N, y_N) are inputs and outputs of some unknown function F, how can we approximate F and generalize to new values of x?
Linguistics: find the universe for F
Psychology: find the best model that "behaves" like F
In realistic terms, non-parametric methods often win
Who's got the answer?
The parametric approach can also approximate functions
Model the joint distribution p(x, y|θ)
But the model is often difficult to build
E.g. a realistic experimental task
Before reaching a conclusion, we need to know how people learn
They may be doing both
Where do neural networks fit?
Clearly not generative: they do not reason with probability
Somewhat different from analogy-type non-parametric methods: the network does not directly reason from old data
Difficult to interpret the generalization
Some results are available for limiting cases
Similar to non-parametric methods when the number of hidden units is infinite
A point that nobody gets right
Small sample dilemma: people learn from very few examples (compared to the dimension of the data), yet any statistical machinery needs many
Parametric: the ML estimate approaches the true distribution only with an infinite sample
Non-parametric: universal approximation requires an infinite sample
The limit is taken in the wrong direction