Data Modeling
Patrice Koehl
Department of Biological Sciences
National University of Singapore
http://www.cs.ucdavis.edu/~koehl/Teaching/BL5229
[email protected]
Data Modeling
Data Modeling: least squares
Data Modeling: Non-linear least squares
Data Modeling: robust estimation
Data Modeling
Data Modeling: least squares
Linear least squares
Least squares
Suppose that we are fitting N data points (x_i, y_i) (with errors σ_i on each data point) to a model Y defined with M parameters a_j:

$$Y(x; a_1, a_2, \ldots, a_M)$$

The standard procedure is least squares: the fitted values for the parameters a_j are those that minimize:
$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - Y(x_i; a_1, \ldots, a_M)}{\sigma_i} \right)^2$$
Where does this come from?
Least squares
Let us suppose that:
The data points are independent of each other.
Each data point has a measurement error that is random, distributed as a Gaussian distribution around the "true" value Y(x_i).
The probability of the data points, given the model Y, is then:
$$P(\mathrm{Data} \mid \mathrm{Model}) \propto \prod_{i=1}^{N} \exp\left[ -\frac{1}{2} \left( \frac{y_i - Y(x_i)}{\sigma_i} \right)^2 \right]$$
Least squares
Application of Bayes's theorem:

$$P(\mathrm{Model} \mid \mathrm{Data}) \propto P(\mathrm{Data} \mid \mathrm{Model}) \, P(\mathrm{Model})$$
With no information on the models, we can assume that the prior probability
P(Model) is constant.
Finding the coefficients a_1, …, a_M that maximize P(Model | Data) is then equivalent to finding the coefficients that maximize P(Data | Model).
This is equivalent to maximizing its logarithm, or minimizing the negative of its
logarithm, namely:
$$\sum_{i=1}^{N} \frac{1}{2} \left( \frac{y_i - Y(x_i)}{\sigma_i} \right)^2$$
Fitting data to a straight line
This is the simplest case:
Y(x) = ax + b
Then:
$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - a x_i - b}{\sigma_i} \right)^2$$
The parameters a and b are obtained from the two equations:
$$\frac{\partial \chi^2}{\partial a} = 0 = -2 \sum_{i=1}^{N} \frac{x_i \left( y_i - a x_i - b \right)}{\sigma_i^2}$$

$$\frac{\partial \chi^2}{\partial b} = 0 = -2 \sum_{i=1}^{N} \frac{y_i - a x_i - b}{\sigma_i^2}$$
Fitting data to a straight line
Let us define:
$$S = \sum_{i=1}^{N} \frac{1}{\sigma_i^2}, \quad S_x = \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2}, \quad S_y = \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2}, \quad S_{xx} = \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2}, \quad S_{xy} = \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2}$$

then the two equations become:

$$a S_{xx} + b S_x = S_{xy}$$
$$a S_x + b S = S_y$$

a and b are given by:

$$a = \frac{S_{xy} S - S_x S_y}{S_{xx} S - S_x^2}, \qquad b = \frac{S_{xx} S_y - S_x S_{xy}}{S_{xx} S - S_x^2}$$
Fitting data to a straight line
We are not done!
Uncertainty on the values of a and b:
$$\sigma_a^2 = \frac{S}{S S_{xx} - S_x^2}, \qquad \sigma_b^2 = \frac{S_{xx}}{S S_{xx} - S_x^2}$$
Evaluate goodness of fit:
- Compute χ² and compare to N − M (here N − 2)
- Compute the residual error on each data point: Y(x_i) − y_i
- Compute the correlation coefficient R²
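As a concrete illustration, here is a minimal Python/NumPy sketch of the weighted straight-line fit described above, built from the sums S, S_x, S_y, S_xx, S_xy; the function name and the synthetic data are illustrative only, not part of the original slides.

```python
import numpy as np

def fit_line_weighted(x, y, sigma):
    """Weighted least-squares fit of y = a*x + b.

    Uses the closed-form solution built from the sums S, Sx, Sy, Sxx, Sxy
    and returns the fitted parameters, their uncertainties, and chi^2.
    """
    w = 1.0 / sigma**2                     # weights 1/sigma_i^2
    S   = np.sum(w)
    Sx  = np.sum(w * x)
    Sy  = np.sum(w * y)
    Sxx = np.sum(w * x * x)
    Sxy = np.sum(w * x * y)

    delta = S * Sxx - Sx**2                # common denominator
    a = (S * Sxy - Sx * Sy) / delta        # slope
    b = (Sxx * Sy - Sx * Sxy) / delta      # intercept

    sigma_a = np.sqrt(S / delta)           # uncertainty on the slope
    sigma_b = np.sqrt(Sxx / delta)         # uncertainty on the intercept

    chi2 = np.sum(((y - a * x - b) / sigma)**2)
    return a, b, sigma_a, sigma_b, chi2

# Illustrative data: 10 points with constant error bars
x = np.linspace(0.0, 9.0, 10)
y = 2.0 * x + 1.0 + np.random.normal(0.0, 0.5, size=x.size)
sigma = np.full(x.size, 0.5)
a, b, sa, sb, chi2 = fit_line_weighted(x, y, sigma)
print(a, b, sa, sb, chi2)                  # chi2 should be comparable to N - 2
```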
General Least Squares
$$Y(x) = a_1 X_1(x) + a_2 X_2(x) + \ldots + a_M X_M(x)$$
Then:
$$\chi^2 = \sum_{i=1}^{N} \left( \frac{y_i - a_1 X_1(x_i) - \ldots - a_M X_M(x_i)}{\sigma_i} \right)^2$$
The minimization of χ² occurs when the derivatives of χ² with respect to the parameters a_1, …, a_M are 0. This leads to M equations:

$$\frac{\partial \chi^2}{\partial a_k} = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left( y_i - a_1 X_1(x_i) - \ldots - a_M X_M(x_i) \right) X_k(x_i) = 0$$
General Least Squares
Define design matrix A such that
$$A_{ij} = \frac{X_j(x_i)}{\sigma_i}$$
General Least Squares
Define two vectors b and a such that

$$b_i = \frac{y_i}{\sigma_i}$$

and a contains the parameters a_1, …, a_M.

Note that χ² can be rewritten as:

$$\chi^2 = \left\| A a - b \right\|^2$$

The parameters a that minimize χ² satisfy:

$$\left( A^T A \right) a = A^T b$$

These are the normal equations for the linear least squares problem.
General Least Squares
How to solve a general least squares problem:

1) Build the design matrix A and the vector b
2) Find the parameters a_1, …, a_M that minimize

$$\chi^2 = \left\| A a - b \right\|^2$$

(usually by solving the normal equations)

3) Compute the uncertainty on each parameter a_j: if C = A^T A, then

$$\sigma(a_j)^2 = C^{-1}(j, j)$$
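As an illustration of steps 1)-3), here is a minimal NumPy sketch; the polynomial basis functions and the synthetic data are assumptions, not from the slides. In practice an SVD- or QR-based solver (e.g. numpy.linalg.lstsq) is numerically safer, but the normal equations are used here to mirror the slides.

```python
import numpy as np

def general_least_squares(x, y, sigma, basis_funcs):
    """Linear least squares for Y(x) = sum_j a_j * X_j(x).

    Builds the design matrix A_ij = X_j(x_i)/sigma_i and the vector
    b_i = y_i/sigma_i, solves the normal equations (A^T A) a = A^T b,
    and returns the parameters with uncertainties from diag(C^{-1}).
    """
    A = np.column_stack([f(x) / sigma for f in basis_funcs])
    b = y / sigma
    C = A.T @ A
    a = np.linalg.solve(C, A.T @ b)        # normal equations
    Cinv = np.linalg.inv(C)                # covariance matrix of the parameters
    a_err = np.sqrt(np.diag(Cinv))
    return a, a_err

# Illustrative example: quadratic basis X_1 = 1, X_2 = x, X_3 = x^2
x = np.linspace(-1.0, 1.0, 20)
y = 0.5 + 2.0 * x - 1.5 * x**2 + np.random.normal(0.0, 0.1, x.size)
sigma = np.full(x.size, 0.1)
basis = [lambda t: np.ones_like(t), lambda t: t, lambda t: t**2]
a, a_err = general_least_squares(x, y, sigma, basis)
print(a, a_err)
```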
Data Modeling
Non-linear least squares
In the general case, g(X_1, …, X_n) is a non-linear function of the parameters X_1, …, X_n; χ² is then also a non-linear function of these parameters:

$$\chi^2 = f(X_1, \ldots, X_n) = \sum_{i=1}^{N} \left( \frac{Y_i - g(X_1, \ldots, X_n; t_i)}{\sigma_i} \right)^2$$

Finding the parameters X_1, …, X_n is then treated as finding the X_1, …, X_n that minimize χ².
Minimizing χ²
Some definitions:
Gradient:
The gradient of a smooth function f with continuous first and
second derivatives is defined as:
$$\nabla f(X) = \left[ \frac{\partial f}{\partial x_1} \;\; \ldots \;\; \frac{\partial f}{\partial x_i} \;\; \ldots \;\; \frac{\partial f}{\partial x_N} \right]$$

Hessian:
The N × N symmetric matrix of second derivatives, H(x), is called the Hessian:

$$H(x) = \begin{pmatrix}
\frac{\partial^2 f}{\partial x_1^2} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_j} & \ldots & \frac{\partial^2 f}{\partial x_1 \partial x_N} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial^2 f}{\partial x_i \partial x_1} & \ldots & \frac{\partial^2 f}{\partial x_i \partial x_j} & \ldots & \frac{\partial^2 f}{\partial x_i \partial x_N} \\
\vdots & & \vdots & & \vdots \\
\frac{\partial^2 f}{\partial x_N \partial x_1} & \ldots & \frac{\partial^2 f}{\partial x_N \partial x_j} & \ldots & \frac{\partial^2 f}{\partial x_N^2}
\end{pmatrix}$$
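When analytical derivatives are not available, the gradient and Hessian can be approximated numerically. Below is a minimal central finite-difference sketch in Python; the step sizes and the test function are assumptions, not part of the slides.

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Central-difference approximation of the gradient of f at x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2.0 * h)
    return g

def numerical_hessian(f, x, h=1e-4):
    """Central-difference approximation of the Hessian of f at x."""
    n = x.size
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = h
            ej = np.zeros(n); ej[j] = h
            H[i, j] = (f(x + ei + ej) - f(x + ei - ej)
                       - f(x - ei + ej) + f(x - ei - ej)) / (4.0 * h**2)
    return H

# Example: f(x, y) = (x - 1)^2 + 10*(y - 2)^2
f = lambda v: (v[0] - 1.0)**2 + 10.0 * (v[1] - 2.0)**2
x0 = np.array([0.0, 0.0])
print(numerical_gradient(f, x0))   # approximately [-2, -40]
print(numerical_hessian(f, x0))    # approximately [[2, 0], [0, 20]]
```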
Minimizing χ²
Minimization of a multi-variable function is usually an iterative process,
in which updates of the state variable x are computed using the
gradient and in some (favorable) cases the Hessian.
Steepest descent (SD):
The simplest iteration scheme consists of following the “steepest
descent” direction:
$$x^{k+1} = x^k - \alpha \nabla f(x^k)$$

(α sets the minimum along the line defined by the gradient)
Usually, SD methods lead to rapid improvement at first, but then exhibit slow progress toward a solution.
They are commonly recommended for initial minimization iterations,
when the starting function and gradient-norm values are very large.
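A minimal Python sketch of the steepest-descent iteration above, using a fixed step size α instead of a line minimization; the test function and stopping criterion are assumptions, not from the slides.

```python
import numpy as np

def steepest_descent(f, grad, x0, alpha=0.1, tol=1e-6, max_iter=10000):
    """Minimize f by following the steepest-descent direction.

    alpha is a fixed step size here; on the slides alpha is chosen by
    line minimization along the gradient direction.
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:        # gradient small enough: stop
            break
        x = x - alpha * g                  # x_{k+1} = x_k - alpha * grad f(x_k)
    return x

# Example: quadratic bowl f(x, y) = x^2 + 5*y^2
f = lambda v: v[0]**2 + 5.0 * v[1]**2
grad = lambda v: np.array([2.0 * v[0], 10.0 * v[1]])
print(steepest_descent(f, grad, [3.0, 1.0], alpha=0.05))  # approximately [0, 0]
```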
Minimizing χ²
Conjugate gradients (CG):
In each step of a conjugate gradient method, a search vector p_k is defined by a recursive formula:

$$p_{k+1} = -\nabla f(x_k) + \beta_{k+1} p_k$$

The corresponding new position is found by line minimization along p_k:

$$x_{k+1} = x_k + \lambda_k p_k$$

The CG methods differ in their definition of β.
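For reference, two standard choices of β (these particular formulas are not shown on the slide) are the Fletcher-Reeves and Polak-Ribière updates:

$$\beta_{k+1}^{FR} = \frac{\nabla f(x_{k+1})^T \nabla f(x_{k+1})}{\nabla f(x_k)^T \nabla f(x_k)}, \qquad \beta_{k+1}^{PR} = \frac{\nabla f(x_{k+1})^T \left( \nabla f(x_{k+1}) - \nabla f(x_k) \right)}{\nabla f(x_k)^T \nabla f(x_k)}$$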
Minimizing χ²
Newton's methods:
Newton's method is a popular iterative method for finding a zero of a one-dimensional function g:

$$x_{k+1} = x_k - \frac{g(x_k)}{g'(x_k)}$$

(Figure: successive iterates x_0, x_1, x_2, x_3 converging toward the zero.)

It can be adapted to the minimization of a one-dimensional function, in which case the iteration formula is:

$$x_{k+1} = x_k - \frac{g'(x_k)}{g''(x_k)}$$
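A minimal Python sketch of this one-dimensional Newton iteration for minimization; the example function and the stopping tolerance are assumptions, not from the slides.

```python
def newton_minimize_1d(g1, g2, x0, tol=1e-10, max_iter=100):
    """One-dimensional Newton iteration for a minimum:
    x_{k+1} = x_k - g'(x_k) / g''(x_k),
    where g1 and g2 are the first and second derivatives of g."""
    x = x0
    for _ in range(max_iter):
        step = g1(x) / g2(x)
        x = x - step
        if abs(step) < tol:
            break
    return x

# Example: g(x) = (x - 3)^2 + 1, so g'(x) = 2*(x - 3) and g''(x) = 2
print(newton_minimize_1d(lambda x: 2.0 * (x - 3.0), lambda x: 2.0, x0=0.0))  # 3.0
```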
Minimizing χ²
The equivalent iterative scheme for multivariate functions is based on:

$$x_{k+1} = x_k - H^{-1}(x_k) \nabla f(x_k)$$

Several implementations of Newton's method exist that avoid computing the full Hessian matrix: quasi-Newton, truncated Newton, "adopted-basis Newton-Raphson" (ABNR), …
Data analysis and Data Modeling
Data Modeling: robust estimation
Robust estimation of parameters
Least squares modeling assumes Gaussian statistics for the experimental data points; however, this may not always be true. There are other possible distributions that may lead to better models in some cases.
One of the most popular alternatives is to use a distribution of the form:

$$\rho(x) = e^{-|x|}$$
Let us look again at the simple case of fitting a straight line to a set of data points (t_i, Y_i), which is now written as finding a and b that minimize:

$$Z(a, b) = \sum_{i=1}^{N} \left| Y_i - a t_i - b \right|$$

b = median(Y − a t), and a is found by non-linear minimization.
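A minimal Python/SciPy sketch of this robust fitting scheme, assuming scipy.optimize.minimize_scalar for the one-dimensional minimization over a; the synthetic data with an outlier are illustrative only.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def robust_line_fit(t, Y):
    """Fit Y = a*t + b by minimizing sum_i |Y_i - a*t_i - b|.

    For a fixed slope a, the optimal intercept is b = median(Y - a*t);
    the remaining one-dimensional problem in a is solved numerically.
    """
    def cost(a):
        b = np.median(Y - a * t)
        return np.sum(np.abs(Y - a * t - b))

    res = minimize_scalar(cost)          # 1D minimization over the slope a
    a = res.x
    b = np.median(Y - a * t)
    return a, b

# Illustrative data with one gross outlier
t = np.arange(10.0)
Y = 2.0 * t + 1.0
Y[7] += 30.0                             # outlier that would pull a least-squares fit
print(robust_line_fit(t, Y))             # close to (2.0, 1.0)
```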