
Statistical Approaches to Inverse Problems
DIIG seminars on inverse problems
Insight and Algorithms
Niels Bohr Institute
Copenhagen, Denmark
27-29 May 2002
(Revised 13 May 2003)
P.B. Stark
Department of Statistics
University of California
Berkeley, CA 94720-3860
www.stat.berkeley.edu/~stark
Abstract
It is useful to distinguish between the intrinsic uncertainty of an
inverse problem and the uncertainty of applying any particular
technique to “solve” the inverse problem. The intrinsic
uncertainty depends crucially on the prior constraints on the
unknown (including prior probability distributions in the case of
Bayesian analyses), on the forward operator, on the statistics of
the observational errors, and on the nature of the properties of the
unknown one wishes to estimate. I will try to convey some
geometrical intuition for uncertainty, and the relationship
between the intrinsic uncertainty of linear inverse problems and
the uncertainty of some common techniques applied to them.
References & Acknowledgements
Donoho, D.L., 1994. Statistical Estimation and Optimal Recovery, Ann.
Stat., 22, 238-270.
Donoho, D.L., 1995. Nonlinear solution of linear inverse problems by
wavelet-vaguelette decomposition, Appl. Comput. Harm. Anal., 2, 101-126.
Evans, S.N. and Stark, P.B., 2002. Inverse Problems as Statistics, Inverse
Problems, 18, R1-R43 (in press).
Stark, P.B., 1992. Inference in infinite-dimensional inverse problems:
Discretization and duality, J. Geophys. Res., 97, 14,055-14,082.
Stark, P.B., 1992. Minimax confidence intervals in geomagnetism,
Geophys. J. Intl., 108, 329-338.
Created using TexPoint by G. Necula, http://raw.cs.berkeley.edu/texpoint
Outline
• Inverse Problems as Statistics
– Ingredients; Models
– Forward and Inverse Problems—applied perspective
– Statistical point of view
– Some connections
• Notation; linear problems; illustration
– Example: geomagnetism from satellite observations
– Example: seismic velocity from t(p) and x(p)
– Example: differential rotation of the Sun from normal mode splitting
• Identifiability and uniqueness
– Sketch of identifiability and extremal modeling
– Backus-Gilbert theory
– Example: solar differential rotation
– Example: seismic velocity in Earth’s core
Outline, contd.
• Decision Theory
– Decision rules and estimators
– Comparing decision rules: Loss and Risk
– Strategies; Bayes/Minimax duality
– Mean distance error and bias
– Illustration: Regularization
– Illustration: Minimax estimation of linear functionals
– Example: Gauss coefficients of the magnetic field
• Distinguishing Models: metrics and consistency
Inverse Problems as Statistics
• Measurable space X of possible data.
• Set Θ of possible descriptions of the world (models).
• Family P = {P_θ : θ ∈ Θ} of probability distributions on X, indexed by models θ.
• Forward operator θ ↦ P_θ maps a model θ into a probability measure on X.
Data X are a sample from P_θ.
P_θ is the whole story: stochastic variability in the “truth,” contamination by measurement error, systematic error, censoring, etc.
Models
• The set Θ usually has special structure.
• Θ could be a convex subset of a separable Banach space T (geomag, seismo, grav, MT, …).
• The physical significance of θ generally gives θ ↦ P_θ reasonable analytic properties, e.g., continuity.
Forward Problems in Geophysics
Composition of steps:
– transform idealized description of Earth into perfect,
noise-free, infinite-dimensional data (“approximate
physics”)
– censor the perfect data to retain only a finite list of numbers, because we can only measure, record, and compute with such lists
– possibly corrupt the list with measurement error.
This is equivalent to a single-step procedure in which the corruption is on a par with the physics and the mapping incorporates the censoring.
Inverse Problems
Observe data X drawn from the distribution P_θ for some unknown θ ∈ Θ. (Assume Θ contains at least two points; otherwise, the data are superfluous.)
Use the data X and the knowledge that θ ∈ Θ to learn about θ; for example, to estimate a parameter g(θ) (the value at θ of a continuous G-valued function g defined on Θ).
Example: Geomagnetism
Geomagnetic model parametrization
Geomagnetic inverse problem
Geophysical Inverse Problems
• Inverse problems in geophysics are often “solved” using applied-math methods for ill-posed problems (e.g., Tikhonov regularization, analytic inversions).
• Those methods are designed to answer different questions; they can behave poorly with data (e.g., bad bias & variance).
• Inference ≠ construction: a statistical viewpoint is more appropriate for interpreting geophysical data.
Elements of the Statistical View
Distinguish between characteristics of the problem, and
characteristics of methods used to draw inferences.
One fundamental property of a parameter:
g is identifiable if, for all η, ζ ∈ Θ,
{g(η) ≠ g(ζ)} ⇒ {P_η ≠ P_ζ}.
In most inverse problems, g(θ) = θ is not identifiable, and few linear functionals of θ are identifiable.
Deterministic and Statistical Connections
Identifiability (distinct parameter values yield distinct probability distributions for the observables) is similar to uniqueness (the forward operator maps at most one model into the observed data).
Consistency (the parameter can be estimated with arbitrary accuracy as the number of data grows) is related to stability of a recovery algorithm (small changes in the data produce small changes in the recovered model).
There are quantitative connections, too.
More Notation
Let T be a separable Banach space, T* its normed dual.
Write the pairing between T* and T as
⟨•, •⟩: T* × T → ℝ.
Linear Forward Problems
A forward problem is linear if
• Θ is a subset of a separable Banach space T
• X = ℝ^n, X = (X_j)_{j=1}^n
• For some fixed sequence (κ_j)_{j=1}^n of elements of T*,
X_j = ⟨κ_j, θ⟩ + ε_j, θ ∈ Θ,
where ε = (ε_j)_{j=1}^n is a vector of stochastic errors whose distribution does not depend on θ.
Linear Forward Problems, contd.
• The linear functionals {κ_j} are the “representers.”
• The distribution P_θ is the probability distribution of X. Typically dim(Θ) = ∞; at least n < dim(Θ), so estimating θ is an underdetermined problem.
Define
K : T → ℝ^n
θ ↦ (⟨κ_j, θ⟩)_{j=1}^n.
Abbreviate the forward problem by X = Kθ + ε, θ ∈ Θ.
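As a concrete (if artificial) illustration, here is a minimal MATLAB sketch of a discretized linear forward problem; the model θ, the representers κ_j, and the noise level are invented for the example.

% Sketch: a discretized linear forward problem X = K*theta + epsilon.
% The model, representers, and noise level are hypothetical.
d = 100;                            % dimension of the discretized model
n = 10;                             % number of data
t = linspace(0, 1, d)';             % grid on [0, 1]
theta = sin(2*pi*t);                % a hypothetical "true" model
K = zeros(n, d);
for j = 1:n
    K(j, :) = (t.^(j-1))'/d;        % row j: discretized representer kappa_j
end
sigma = 0.05;
X = K*theta + sigma*randn(n, 1);    % data: noisy linear functionals of theta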
Linear Inverse Problems
Use X = Kθ + ε and the constraint θ ∈ Θ to estimate or draw inferences about g(θ).
The probability distribution of X depends on θ only through Kθ, so if there are two points θ_1, θ_2 ∈ Θ such that Kθ_1 = Kθ_2 but g(θ_1) ≠ g(θ_2), then g(θ) is not identifiable.
Ex: Sampling w/ systematic and random error
Observe
X_j = f(t_j) + r_j + ε_j, j = 1, 2, …, n, where
• f ∈ C, a set of smooth functions on [0, 1]
• t_j ∈ [0, 1]
• |r_j| ≤ 1, j = 1, 2, …, n
• ε_j iid N(0, 1).
Take Θ = C × [-1, 1]^n, X = ℝ^n, and θ = (f, r_1, …, r_n).
Then P_θ has density
(2π)^{-n/2} exp{-(1/2) Σ_{j=1}^n (x_j − f(t_j) − r_j)²}.
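A MATLAB sketch of this sampling model; the particular smooth f and the systematic errors r_j are arbitrary choices for illustration.

% Sketch: X_j = f(t_j) + r_j + e_j with |r_j| <= 1 and e_j iid N(0, 1).
n = 50;
t = sort(rand(n, 1));                % sample points t_j in [0, 1]
f = @(s) exp(-3*s).*cos(4*pi*s);     % a hypothetical smooth f in C
r = 2*rand(n, 1) - 1;                % systematic errors, |r_j| <= 1
e = randn(n, 1);                     % random errors, iid N(0, 1)
X = f(t) + r + e;
% log of the density of P_theta at the data, up to the constant -(n/2)*log(2*pi):
logDens = -0.5*sum((X - f(t) - r).^2);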
Sketch: Identifiability
[Diagram: models η, ζ, θ in Θ map under K to the data space X = ℝ^n, where X = Kθ + ε; the parameter g maps Θ to ℝ.]
{P_ζ = P_η} ⇏ {η = ζ}, so θ is not identifiable.
{P_ζ = P_η} ⇏ {g(η) = g(ζ)}, so g is not identifiable; g cannot be estimated with bounded bias.
Backus-Gilbert Theory
Let Θ = T be a Hilbert space.
Let g ∈ T = T* be a linear parameter.
Let {κ_j}_{j=1}^n ⊆ T*. Then:
g(θ) is identifiable iff g = Λ·K for some 1 × n matrix Λ.
If also E[ε] = 0, then Λ·X is unbiased for g.
If also ε has covariance matrix Σ = E[εε^T], then the MSE of Λ·X is Λ·Σ·Λ^T.
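In a discretization, the Backus-Gilbert recipe amounts to solving g = Λ·K for Λ; here is a MATLAB sketch with invented representers, target functional, and covariance.

% Sketch: Backus-Gilbert for a discretized problem (hypothetical numbers).
d = 200; n = 15;
r = linspace(0, 1, d)';                 % radius grid
K = zeros(n, d);
for j = 1:n
    K(j, :) = (r.^j)'/d;                % discretized representers kappa_j
end
g = exp(-((r - 0.7)/0.05).^2)'/d;       % target: a local average near r = 0.7
Lambda = g / K;                         % least-squares solution of Lambda*K = g
resid = norm(Lambda*K - g);             % ~0 iff g is (nearly) a combination
Sigma = 0.01*eye(n);                    % error covariance (assumed known)
mse = Lambda*Sigma*Lambda';             % MSE of the unbiased estimate Lambda*X

If resid is not negligible, g is not a linear combination of the representers, so it is not identifiable; Λ·X then estimates a nearby identifiable functional instead.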
Sketch: Backus-Gilbert
[Diagram: θ ∈ Θ maps to Kθ in the data space X = ℝ^n, with X = Kθ + ε; the estimate Λ·X targets g(θ) = Λ·Kθ ∈ ℝ.]
Example: Differential solar rotation
• Stellar oscillations have been known since the late 1700s.
• The Sun's oscillation was observed in 1960 by Leighton, Noyes, and Simon, and explained as trapped acoustic waves by Ulrich, Leibacher, and Stein, 1970-1.
• >10^7 modes predicted; >250,000 identified; 10^6 soon.
• Formal error bars inflated by 200. Hill et al., 1996. Science 272, 1292-1295.
Source: SOHO-SOI/MDI website
Pattern is Superposition of Modes
• Like vibrations of a spherical
guitar string
• 3 “quantum numbers” l, m, n
• l and m are spherical surface
wavenumbers
• n is radial wavenumber
Source: GONG website
Waves Trapped in Waveguide
• Low l modes sample more
deeply
• p-modes do not sample core
well
• Sun essentially opaque to
EM; transparent to sound &
to neutrinos
Source: forgotten!
Spectrum is very Regular
• The explanation as modes, plus stellar evolutionary theory, predicts details of the spectrum
• Details confirmed in data by
Deubner, 1975
Source: GONG
Oscillations Taste Solar Interior
• Frequencies sensitive to material properties
• Frequencies sensitive to differential rotation
• If Sun were spherically symmetric and did not rotate,
frequencies of the 2l+1 modes with the same l and n would be
equal
• Asphericity and rotation break the degeneracy
(Scheiner measured a 27-day equatorial rotation from sunspots by 1630; polar rotation is ~33 days.)
• Like ultrasound for the Sun
Linear forward problem for differential rotation
Δν_{lmn} = ∫∫ K_{lmn}(r, θ) Ω(r, θ) r dr dθ
Language change: θ is latitude; Ω is the rotation model.
The relationship assumes the eigenfunctions and radial structure are known.
Observational errors are usually assumed to be zero-mean independent normal random variables with known variances.
Different Modes sample the Sun differently
[Figure. Left: ray paths for l=100, n=8 and l=2, n=8 p-modes. Right: ray path for an l=5, n=10 g-mode. g-modes have not been observed.]
[Figure: l=20 modes. Left: m=20. Middle: m=16 (Doppler velocities). Right: section through the eigenfunction of the l=20, m=16, n=14 mode. Gough et al., 1996. Science 272, 1281-1283.]
Linear Combinations of Splitting Kernels
[Figure: cuts through kernels for rotation. A: l=15, m=8. B: l=28, m=14. C: l=28, m=24. D: two targeted combinations: 0.7R, 60°; 0.82R, 30°. Thompson et al., 1996. Science 272, 1300-1305.]
[Figure: estimated rotation rate as a function of depth at three latitudes. Source: SOHO-SOI/MDI website.]
Backus-Gilbert++: Necessary conditions
Let g be an identifiable real-valued parameter. Suppose ∃ θ_0 ∈ Θ, a symmetric convex set Ť ⊆ T, c ∈ ℝ, and ğ: Ť → ℝ such that:
1. θ_0 + Ť ⊆ Θ
2. For t ∈ Ť, g(θ_0 + t) = c + ğ(t), and ğ(-t) = -ğ(t)
3. ğ(a_1 t_1 + a_2 t_2) = a_1 ğ(t_1) + a_2 ğ(t_2) for t_1, t_2 ∈ Ť and a_1, a_2 ≥ 0 with a_1 + a_2 = 1, and
4. sup_{t ∈ Ť} |ğ(t)| < ∞.
Then ∃ a 1×n matrix Λ s.t. the restriction of ğ to Ť is the restriction of Λ·K to Ť.
Backus-Gilbert++: Sufficient Conditions
Suppose g = (g_i)_{i=1}^m is an ℝ^m-valued parameter that can be written as the restriction to Θ of Λ·K for some m×n matrix Λ. Then:
1. g is identifiable.
2. If E[ε] = 0, Λ·X is an unbiased estimator of g.
3. If, in addition, ε has covariance matrix Σ = E[εε^T], the covariance matrix of Λ·X is Λ·Σ·Λ^T, whatever be P_θ.
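A quick Monte Carlo check of points 2 and 3, for made-up K, Λ, and θ:

% Monte Carlo check: Lambda*X is unbiased for Lambda*K*theta, with
% covariance Lambda*Sigma*Lambda'. All quantities hypothetical.
n = 5; d = 50;
K = randn(n, d)/d;
theta = randn(d, 1);
Lambda = [ones(1, n)/n; (1:n)/n];       % a 2 x n matrix; g = Lambda*K restricted to Theta
g = Lambda*(K*theta);                   % true value g(theta)
sigma = 0.3; nsim = 100000;
G = zeros(2, nsim);
for i = 1:nsim
    X = K*theta + sigma*randn(n, 1);
    G(:, i) = Lambda*X;
end
biasHat = mean(G, 2) - g;               % ~ [0; 0]: unbiased
covHat = cov(G');                       % ~ sigma^2 * (Lambda*Lambda')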
Decision Rules
A (randomized) decision rule
δ: X → M_1(A)
x ↦ δ_x(·),
is a measurable mapping from the space X of possible data to the collection M_1(A) of probability distributions on a separable metric space A of actions.
A non-randomized decision rule is a randomized decision rule that, to each x ∈ X, assigns a unit point mass at some value a = a(x) ∈ A.
Why randomized rules?
• In some problems, randomized rules have better behavior.
• Allowing randomized rules can make the set of decisions
convex (by allowing mixtures of different decisions), which
makes the math easier.
• If the loss is convex, the Rao-Blackwell theorem says that an optimal decision rule need not be randomized. (More on this later.)
Example: randomization natural
A coin has chance 1/3 of landing with one side showing and chance 2/3 of the other, but we don't know which side is which. We want to decide whether P(heads) = 1/3 or 2/3.
Toss the coin 10 times; let X = # heads.
Toss a fair coin once; let U = # heads.
Use the data to pick the more likely scenario, but if the data don't help, decide by tossing the fair coin (see the sketch below).
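A MATLAB sketch of this rule: the likelihood ratio favors P(heads) = 2/3 exactly when X > 5, and X = 5 is equally likely under both hypotheses, so there the rule defers to the auxiliary fair coin.

% Randomized decision rule for the coin example.
p = 1/3;                          % the (secret) truth, for simulation
X = sum(rand(10, 1) < p);         % # heads in 10 tosses
if X > 5
    decision = 2/3;               % data favor P(heads) = 2/3
elseif X < 5
    decision = 1/3;               % data favor P(heads) = 1/3
elseif rand < 0.5                 % X = 5: data uninformative; toss a fair coin (U)
    decision = 2/3;
else
    decision = 1/3;
end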
Estimators
An estimator of a parameter g(θ) is a decision rule d for which
the space A of possible actions is the space G of possible
parameter values.
ĝ=ĝ(X) is common notation for an estimator of g(θ).
Usually one writes a non-randomized estimator as a G-valued function of x instead of an M_1(G)-valued function.
Comparing Decision Rules
There are infinitely many decision rules and estimators.
Which one to use?
The best one!
But what does best mean?
Loss and Risk
• 2-player game: Nature v. Statistician.
• Nature picks θ from Θ. θ is secret, but the statistician knows Θ.
• Statistician picks δ from a set D of rules. δ is secret.
• Generate data X from P_θ; apply δ.
• Statistician pays loss L(θ, δ(X)). L should be dictated by scientific context, but…
• Risk is expected loss: r(θ, δ) = E_θ L(θ, δ(X)).
• A good rule δ has small risk, but what does small mean?
Strategy
It is rare that one δ has the smallest risk for every θ.
• δ is admissible if it is not dominated (if no rule does at least as well for every θ, and better for some θ).
• A minimax decision minimizes r(δ) ≡ sup_{θ∈Θ} r(θ, δ) over δ ∈ D.
• The minimax risk is r* ≡ inf_{δ∈D} r(δ).
• A Bayes decision minimizes r_π(δ) ≡ ∫ r(θ, δ) π(dθ) over δ ∈ D, for a given prior probability distribution π on Θ.
• The Bayes risk is r_π* ≡ inf_{δ∈D} r_π(δ).
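A toy finite game makes the definitions concrete; the risk table below is made up, with two states of nature (rows) and three rules (columns).

% Toy illustration: minimax and Bayes rules from a made-up risk table.
r = [1.0 0.4 0.7;                   % r(theta_1, delta_j)
     0.2 0.9 0.6];                  % r(theta_2, delta_j)
worst = max(r, [], 1);              % sup over theta of r(theta, delta), per rule
[rStar, jMinimax] = min(worst);     % minimax risk 0.7, attained by rule 3
piPrior = [0.5; 0.5];               % a prior on the two states
[rPi, jBayes] = min(piPrior' * r);  % Bayes risk 0.6, attained by rule 1

Note that r_π* ≤ r*: the Bayes risk for any prior is no larger than the minimax risk.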
Minimax is Bayes for the least favorable prior
Pretty generally, for convex Θ and D and concave-convexlike risk r,
sup_π inf_{δ∈D} r_π(δ) = inf_{δ∈D} sup_{θ∈Θ} r(θ, δ).
If the minimax risk >> the Bayes risk, the prior π controls the apparent uncertainty of the Bayes estimate.
Common Risk: Mean Distance Error (MDE)
Let d_G denote the metric on G, and let ĝ be an estimator of g.
The MDE at θ of ĝ is
MDE_θ(ĝ, g) = E_θ d_G(ĝ(X), g(θ)).
When the metric derives from a norm, MDE is called mean norm error (MNE).
When the norm is Hilbertian, MNE² is called mean squared error (MSE).
Shrinkage
Suppose X ~ N(θ, I) with dim(θ) = d ≥ 3.
X is not admissible for θ under squared-error loss (Stein, 1956): it is dominated by δ_S(X) = X(1 − a/(b + ‖X‖²)) for small a and big b.
James-Stein is better: δ_JS(X) = X(1 − a/‖X‖²), for 0 < a ≤ 2(d−2).
Better still, take the positive part of the shrinkage factor:
δ_JS+(X) = X(1 − a/‖X‖²)_+, for 0 < a ≤ 2(d−2).
Not minimax, but close.
This has implications for Backus-Gilbert estimates of d ≥ 3 linear functionals.
∃ extensions to other distributions; see Evans & Stark (1996).
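A Monte Carlo sketch of the dominance, using the classical choice a = d − 2 (inside the stated range), at the θ where shrinkage helps most:

% MSE of X vs. positive-part James-Stein, X ~ N(theta, I), d = 10.
d = 10; a = d - 2;                       % a in (0, 2(d-2))
theta = zeros(d, 1);                     % shrinkage helps most near the origin
nsim = 100000; mseX = 0; mseJS = 0;
for i = 1:nsim
    X = theta + randn(d, 1);
    shrink = max(0, 1 - a/(X'*X));       % positive-part shrinkage factor
    mseX  = mseX  + sum((X - theta).^2)/nsim;        % ~ d = 10
    mseJS = mseJS + sum((shrink*X - theta).^2)/nsim; % much smaller at theta = 0
end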
Bias
When G is a Banach space, one can define the bias at θ of ĝ:
bias_θ(ĝ, g) = E_θ [ĝ − g(θ)]
(when the expectation is well-defined).
• If bias_θ(ĝ, g) = 0, say ĝ is unbiased at θ (for g).
• If ĝ is unbiased at θ for g for every θ, say ĝ is unbiased for g. If such a ĝ exists, say g is unbiasedly estimable.
• If g is unbiasedly estimable, then g is identifiable.
Example: Bounded Normal Mean
Observe X ~ N(θ, 1). Know a priori that θ ∈ Θ = [−τ, τ].
Want to estimate g(θ) = θ.
Let φ(·) be the standard normal density and Φ(·) the standard normal cumulative distribution function.
Suppose we elect to use squared-error loss:
L(θ, d) = (θ − d)²
r(θ, δ) = E_θ L(θ, δ(X)) = E_θ (θ − δ(X))²
r(δ) = sup_{θ ∈ Θ} r(θ, δ) = sup_{θ ∈ Θ} E_θ (θ − δ(X))²
r* = inf_{δ ∈ D} sup_{θ ∈ Θ} E_θ (θ − δ(X))²
Risk of X for bounded normal mean
Consider the simple estimator δ(X) = X.
E_θ X = θ, so X is unbiased for θ, and θ is unbiasedly estimable.
r(θ, X) = E_θ (θ − X)² = Var(X) = 1.
Consider a Bayesian prior to capture the constraint θ ∈ [−τ, τ]:
θ ~ π = U[−τ, τ], the uniform distribution on the interval [−τ, τ].
r_π(X) = ∫_{−τ}^{τ} r(θ, X) π(dθ) = ∫_{−τ}^{τ} 1 × (2τ)^{−1} dθ = 1.
In this example, the frequentist risk of X equals the Bayes risk of X for the uniform prior π.
X is not the best: Truncation
It is easy to find an estimator better than X from both the frequentist and Bayes perspectives: the truncation estimate
δ_T(X) = −τ if X < −τ; X if −τ ≤ X ≤ τ; τ if X > τ.
δ_T is biased, but it has smaller MSE than X, whatever be θ ∈ Θ.
(δ_T is the constrained maximum likelihood estimate.)
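A simulation sketch of that claim, comparing the risk of δ_T with the constant risk 1 of X over a grid of θ (τ = 3 is an arbitrary choice):

% Risk of the truncation estimate delta_T versus X, by simulation.
tau = 3; nsim = 50000;
thetas = linspace(-tau, tau, 61);
riskT = zeros(size(thetas));
for i = 1:numel(thetas)
    X = thetas(i) + randn(nsim, 1);
    dT = min(tau, max(-tau, X));         % truncate X to [-tau, tau]
    riskT(i) = mean((dT - thetas(i)).^2);
end
max(riskT)                               % < 1 = risk of X, for every theta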
Risk of δ_T
[Figure: the sampling density f(x|θ) over x, with −τ, 0, τ, and θ marked; δ_T maps X into [−τ, τ], moving the tail mass P_θ(X < −τ) to the point −τ (and similarly at τ).]
Minimax Estimation of BNM
The truncation estimate is better than X, but it is not minimax.
Clearly r* ≤ min(1, τ²): the MSE of X is 1, and the sup risk of the trivial estimator δ ≡ 0 is τ².
The minimax MSE estimator is a nonlinear shrinkage estimator; the minimax MSE risk is τ²/(1+τ²).
Bayes estimation of BNM
The posterior density of θ given x is
π(θ | x) = φ(θ − x) / (Φ(τ − x) − Φ(−τ − x)) for θ ∈ [−τ, τ], and 0 otherwise.
Posterior Mean
• The mean of the posterior density minimizes the Bayes risk when the loss is squared error:
E[θ | x] = x − (φ(τ − x) − φ(−τ − x)) / (Φ(τ − x) − Φ(−τ − x)).
Bayes Estimator is Nonlinear Shrinkage
function f = bayesUnif(x, tau)
% Posterior mean of theta given x, when theta ~ U[-tau, tau] and X ~ N(theta, 1).
f = x - (normpdf(tau - x) - normpdf(-tau - x)) ./ (normcdf(tau - x) - normcdf(-tau - x));
return
[Figure: the Bayes estimator δ_π* for τ = 3, plotted with X and δ_T for x in [−6, 6]; δ_π* shrinks nonlinearly toward 0 and approaches ±τ for large |x|.]
For τ = 3, the Bayes risk is r_π* ≈ 0.7 (by simulation).
The minimax risk is r* = 0.75.
Bayes/Minimax Risks
[Speaker note: MATLAB simulation of the Bayes risk, averaging the risk of the Bayes estimator over a fine grid of θ:]
nsim = 20000;
risk = 0;
tau = 5;
for theta = -tau:.01:tau
    pred = bayesUnif(theta + randn(nsim, 1), tau);
    risk = risk + (pred - theta)'*(pred - theta)/nsim;
end
risk/length(-tau:.01:tau)

τ        r_π* (simulation)   r*
0.5      0.08                0.33
1        0.25                0.50
2        0.55                0.67
3        0.70                0.75
4        0.77                0.80
5        0.82                0.83
τ → ∞    1                   1

The gap between r_π* and r* is the difference between knowing θ ∈ [−τ, τ] and knowing θ ~ U[−τ, τ].
Sketch: Regularization
[Diagram: regularization pulls the estimate toward a simpler model ζ near 0 rather than the true θ; the difference between Kζ and Kθ contributes bias, the scatter of X = Kθ + ε about Kθ contributes error, and g(ζ) ≠ g(θ).]
Consistency of “Occam’s Inversion”
• A common approach: minimize the norm (or another regularization functional) subject to mean data misfit ≤ 1.
• Sometimes called Occam's Inversion (Constable, Parker and Constable): the simplest hypothesis consistent with the data.
• In many circumstances, this estimator is inconsistent: as the number of data grows, there is a greater and greater chance that the estimator is 0. The allowable misfit grows faster than the norm of the noise-free data image.
• In common situations, consistency of the general approach requires data redundancy and averaging.
Singular Value Decomposition, Linear Problems
• Assume Θ ⊆ T, a separable Hilbert space;
{ε_j}_{j=1}^n iid N(0, σ²);
{κ_j}_{j=1}^n linearly independent.
• K is compact: it has an infinite-dimensional null space.
Let K*: ℝ^n → T be the adjoint operator to K.
• ∃ n triples {(ν_j, ξ_j, λ_j)}_{j=1}^n, with ν_j ∈ T, ξ_j ∈ X and λ_j ∈ ℝ_+, such that
– Kν_j = λ_j ξ_j,
– K* ξ_j = λ_j ν_j.
• {ν_j}_{j=1}^n can be chosen to be orthonormal in T; {ξ_j}_{j=1}^n can be chosen to be orthonormal in X.
• λ_j > 0 for all j. Order so that λ_1 ≥ λ_2 ≥ ··· ≥ λ_n > 0.
• {(ν_j, ξ_j, λ_j)}_{j=1}^n is the singular value decomposition of K.
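A MATLAB sketch with a hypothetical smoothing-kernel operator: compute the SVD and check the defining relations Kν_j = λ_j ξ_j and K*ξ_j = λ_j ν_j.

% SVD of a discretized forward operator (hypothetical smoothing kernels).
d = 200; n = 12;
t = linspace(0, 1, d);
K = zeros(n, d);
for j = 1:n
    K(j, :) = exp(-((t - j/(n+1)).^2)/0.01)/d;   % row j: local average near j/(n+1)
end
[U, S, V] = svd(K, 'econ');
lambda = diag(S);                % singular values, lambda_1 >= ... >= lambda_n
% columns of V play the role of nu_j; columns of U play the role of xi_j
chk1 = norm(K*V(:, 1) - lambda(1)*U(:, 1));    % ~ 0: K nu_1 = lambda_1 xi_1
chk2 = norm(K'*U(:, 1) - lambda(1)*V(:, 1));   % ~ 0: K* xi_1 = lambda_1 nu_1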
Singular Value Weighting
• The minimum-norm model that fits the data exactly can be written
δ_MN(X) = Σ_{j=1}^n λ_j^{−1} (ξ_j · X) ν_j.
Write θ = θ_∥ + θ_⊥ (components in the span of {ν_j} and in its orthocomplement).
Bias_θ(δ_MN) = E_θ δ_MN(X) − θ = −θ_⊥.
Var_θ δ_MN = E_θ ‖Σ_{j=1}^n λ_j^{−1} (ξ_j · ε) ν_j‖² = σ² Σ_{j=1}^n λ_j^{−2}.
Components associated with small λ_j make the variance big: noise components are multiplied by λ_j^{−1}.
Singular value truncation: reconstruct θ using only the {ν_j} with λ_j ≥ t:
δ_SVT = Σ_{j=1}^m λ_j^{−1} (ξ_j · X) ν_j, where m = max {k : λ_k ≥ t}.
This mollifies the noise magnification but increases the bias.
Bias of SVT
The bias of SVT is bigger than that of the minimum-norm estimate by the projection of θ onto span{ν_j}_{j=m+1}^n; the variance is smaller by σ² Σ_{j=m+1}^n λ_j^{−2}.
With adequate prior information about θ (to control the bias), one can exploit the bias-variance tradeoff to reduce the MSE.
SVT is in the family of estimators that re-weight the singular functions in the reconstruction:
δ_w = Σ_{j=1}^n w(λ_j) (ξ_j · X) ν_j.
Regularization using a norm penalty, with regularization parameter α, corresponds to
w(u) = u/(u² + α).
These tend to have smaller norm than the maximum likelihood estimate: they can be viewed as shrinkage.
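A sketch contrasting SVT with the norm-penalty weights w(u) = u/(u² + α), reusing the hypothetical operator from the SVD sketch above (the true model, noise level, truncation level, and α are all invented):

% Singular value truncation vs. norm-penalty weighting (hypothetical K).
d = 200; n = 12; t = linspace(0, 1, d);
K = zeros(n, d);
for j = 1:n
    K(j, :) = exp(-((t - j/(n+1)).^2)/0.01)/d;
end
[U, S, V] = svd(K, 'econ'); lambda = diag(S);
theta = sin(2*pi*t)';                    % hypothetical true model
X = K*theta + 1e-4*randn(n, 1);          % noisy data
c = U'*X;                                % coefficients xi_j . X
m = sum(lambda >= max(lambda)/100);      % truncation level t = max(lambda)/100
thetaSVT = V(:, 1:m)*(c(1:m)./lambda(1:m));          % delta_SVT
alpha = 1e-6;                            % regularization parameter
thetaPen = V*((lambda./(lambda.^2 + alpha)).*c);     % delta_w with w(u) = u/(u^2+alpha)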
Examples of Singular Functions
• Linear, time-invariant filters: complex sinusoids
• Circular convolution: sinusoids
• Band and time-limiting: prolate spheroidal wavefunctions
• Main-field geomagnetism: spherical harmonics, times radial polynomials in r^{−1}
• Abel transform: Jacobi polynomials and Chebyshev polynomials
See Donoho (1995) for more examples and references.
Minimax Estimation of Linear Parameters
• Observe X = Kθ + ε ∈ ℝ^n, with
– θ ∈ Θ ⊆ T, T a separable Hilbert space
– Θ convex
– {ε_i}_{i=1}^n iid N(0, σ²).
• Seek to learn about g(θ): Θ → ℝ, linear, bounded on Θ.
For a variety of risks (MSE, MAD, length of a fixed-length confidence interval), the minimax risk is controlled by the modulus of continuity of g, calibrated to the noise level.
The full problem is no harder than the hardest 1-dimensional subproblem; it reduces to a bounded normal mean problem (Donoho, 1994).
Example: Geomagnetism
Θ = {θ ∈ ℓ²(w) : Σ_{l=1}^∞ w_l Σ_{m=−l}^{l} |θ_{lm}|² ≤ q}.
Estimate g(θ) = θ_{lm}.
The symmetry of Θ and the linearity of K and g let us characterize the modulus: the problem is to maximize a linear functional of a vector in the intersection of two ellipsoids. In the main-field geomagnetism problem, as the data sampling becomes more uniform over the spherical idealization of a satellite orbit, both the norm (prior information) and the operator K are diagonalized by spherical harmonics.
Modulus of Continuity
ω(ε) = sup {|g(η) − g(ζ)| : η, ζ ∈ Θ, ‖Kη − Kζ‖ ≤ ε}.
[Diagram: two models η, ζ ∈ Θ whose images Kη, Kζ in X = ℝ^n are within ε of each other, but whose parameter values g(η), g(ζ) ∈ ℝ differ.]
Distinguishing two models
The data tell the difference between two models ζ and η if the L1 distance between P_ζ and P_η is large:
‖P_ζ − P_η‖_1 = ∫ |p_ζ − p_η| dμ, for densities p_ζ, p_η with respect to a dominating measure μ.
L1 and Hellinger distances
The Hellinger distance is H(P_ζ, P_η) = (∫ (√p_ζ − √p_η)² dμ)^{1/2}; the two are comparable: H² ≤ ‖P_ζ − P_η‖_1 ≤ 2H.
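A numerical sketch of the two distances for a pair of unit-variance normals, by quadrature; the unit shift between them is an arbitrary choice.

% L1 and Hellinger distances between N(0,1) and N(1,1), by quadrature.
x = linspace(-10, 11, 20001);
dx = x(2) - x(1);
p = normpdf(x, 0, 1);
q = normpdf(x, 1, 1);
L1 = sum(abs(p - q))*dx;                     % integral of |p - q|
H = sqrt(sum((sqrt(p) - sqrt(q)).^2)*dx);    % Hellinger distance
% Check the standard comparison: H^2 <= L1 <= 2*H.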
Consistency in Linear Inverse Problems
• X_i = ⟨κ_i, θ⟩ + ε_i, i = 1, 2, 3, …, with
θ ∈ Θ, a subset of a separable Banach space T;
{κ_i} ⊆ T* linear, bounded on Θ;
{ε_i} iid μ.
• θ is consistently estimable w.r.t. the weak topology iff ∃ {T_k}, with T_k a Borel function of X_1, …, X_k, s.t. for all θ ∈ Θ, all η > 0, and all κ ∈ T*,
lim_k P_θ {|⟨κ, T_k⟩ − ⟨κ, θ⟩| > η} = 0.
Importance of the Error Distribution
• Let μ be a probability measure on ℝ; μ_a(B) = μ(B − a), a ∈ ℝ.
• Pseudo-metric on T**: [formula lost in transcription; see Evans & Stark, 2002].
• If its restriction to Θ converges to a metric compatible with the weak topology, one can estimate θ consistently in the weak topology.
• For a given sequence of functionals {κ_i}, a rougher μ makes consistent estimation easier.
Summary
• “Solving” an inverse problem means different things to different audiences.
• The statistical viewpoint is a useful abstraction.
The physics is in the mapping θ ↦ P_θ.
Prior information is in the constraint θ ∈ Θ.
• There is more information in the assertion θ ~ π, with π supported on Θ, than there is in the constraint θ ∈ Θ.
• Separating the “model” θ from the parameters g(θ) of interest is useful: Sabatier’s “well-posed questions.” Many interesting questions can be answered without knowing the entire model.
• Thinking about measures of performance is useful.
• Difficulty of the problem ≠ performance of a specific method.