
Crash Course in Scientific Computing

XVIII. Fitting

We have seen that interpolation works great in theory but can lead to unpleasant results. Leaving aside Runge's phenomenon, we certainly do not want to constrain our model to pass through all the points, thereby taking into account irrelevant and meaningless fluctuations, noise, etc. While this is theoretically possible if we provide enough degrees of freedom, it is rarely meaningful. What we typically want, instead, is the best overall agreement, without stretching our underlying model too far, which usually comes with its own fixed number of degrees of freedom anyway. This problem is that of fitting, which is of extreme importance in most quantitative science.

[Figure: Weierstrass Approximation Theorem]

Say that we have $N$ data points $y_1$, ..., $y_N$ obtained for the values $x_1$, ..., $x_N$, which we expect to be accounted for by a "model" $f$, i.e., a function $f(x,\vec\beta)$ of $\nu$ parameters $\beta_1$, ..., $\beta_\nu$ kept in a vector $\vec\beta$. The model will not, in general, pass exactly through the data, and the mismatch at each point defines the residual:

$$r_i=f(x_i,\vec\beta)-y_i\,.$$

A good way to fit the data with the model is to minimize the sum of the squared residuals (which measure how much of the actual data is left unaccounted for by the model):

$$S=\sum_{i=1}^N r_i^2\,.$$

We take the square because, otherwise, positive and negative residuals would cancel each other even for a poor fit. To find the optimum, that is, the smallest sum of squares, we vary the $\beta$s until $S$ is minimum, which is achieved when all the derivatives $\partial S/\partial\beta_j$ for $1\le j\le\nu$ vanish simultaneously:

$$\frac{\partial S}{\partial\beta_j}=2\sum_{i=1}^N(f(x_i,\vec\beta)-y_i)\frac{\partial f(x_i,\vec\beta)}{\partial\beta_j}=0\,,\quad\text{for $j=1,\dots,\nu$}\,.\tag{1}$$

This is a complicated problem in general, which typically requires an iterative, trial-and-error numerical solution.
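As an illustration of such an iterative approach (this is not the method used in the rest of this section), here is a minimal sketch in Julia that minimizes $S$ by plain gradient descent with a finite-difference gradient; the model, the data, the learning rate and the number of steps are all assumptions made for the example:

f(x, β) = β[1] * exp(β[2] * x)                     # hypothetical model
xdata = 0:0.1:2                                    # assumed data
ydata = 2.0 .* exp.(-1.5 .* xdata) .+ 0.05 .* randn(length(xdata))

S(β) = sum((f.(xdata, Ref(β)) .- ydata).^2)        # sum of squared residuals

function gradientdescent(S, β0; η = 1e-3, steps = 20_000, h = 1e-6)
    β = copy(β0)
    for _ in 1:steps
        # forward finite-difference estimate of ∂S/∂β_j
        g = [(S(β .+ h .* (eachindex(β) .== j)) - S(β)) / h for j in eachindex(β)]
        β .-= η .* g                               # step against the gradient
    end
    return β
end

gradientdescent(S, [1.0, -1.0])                    # should approach [2.0, -1.5], up to noise

In practice, dedicated routines (such as the Levenberg-Marquardt algorithm behind the LsqFit package used at the end of this section) do this far more efficiently and robustly.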

There is a case where Eq. (1) simplifies considerably, namely, when the partial derivatives do not themselves depend on the parameters, which occurs for so-called linear least squares, where the parameters $\beta$ are merely linear coefficients of functions (which can be, of course, nonlinear). That is to say, our test function reads:

$$f(x;\beta)=\sum_{k=1}^\nu\beta_k\phi_k(x)\,,$$

in which case:

$$\frac{\partial f(x_i;\beta)}{\partial\beta_j}=\phi_j(x_i)\,.$$

We put this back into each term of the derivative with respect to $\beta_j$:

$$\frac{\partial S}{\partial\beta_j}=2\sum_{i=1}^N\left(\left[\sum_{k=1}^\nu\beta_k\phi_k(x_i)\right]-y_i\right)\phi_j(x_i)=0$$

and now, it remains to rearrange this equality so that we can write it in matrix form. To do so, we have to remember these properties (and notations):

Matrix multiplication:

$$\sum_k A_{ik}B_{kj}=(AB)_{ij}$$

Transposition:

$$A^T_{ij}=A_{ji}$$

so that our equation above becomes:

$$\sum_{i=1}^N\sum_{k=1}^\nu\beta_k\phi_k(x_i)\phi_j(x_i)=\sum_{i=1}^N y_i\phi_j(x_i)\,,$$

We introduce the matrix $\Phi$ such that:

$$\Phi_{ij}=\phi_j(x_i)\tag{2}\,,$$

in which case, with some further grouping and reassembling:

$$\sum_{k=1}^\nu\sum_{i=1}^N\Phi^T_{ji}\Phi_{ik}\beta_k=\sum_{i=1}^N \Phi_{ij}y_i\,,$$

So that we can get rid of the $i$ indices:

$$\sum_{k=1}^\nu(\Phi^T\Phi)_{jk}\beta_k=(\Phi^T y)_j\,,$$

where $y=(y_1,\dots,y_N)^T$, and now getting rid of $k$:

$$[(\Phi^T\Phi)\beta]_j=(\Phi^T y)_j\,,$$

with $\beta$ our vector of unknown parameters. As the equality holds for all $1\le j\le\nu$, it holds for the full vector, providing us with the final equation in its matrix form:

$$(\Phi^T\Phi)\vec\beta=\Phi^T y\,.$$

Assuming $(\Phi^T\Phi)$ is well conditioned (in particular, not singular), we get the formal solution by brute force inversion:

$$\vec\beta=(\Phi^T\Phi)^{-1}\Phi^T y\,.$$

The term $(\Phi^T\Phi)^{-1}\Phi^T$ could be problematic if it is large (many fitting functions and/or many data points) and is actually computed in this form, in which case one can see it instead as the notation for a so-called "pseudo-inverse" (the Moore-Penrose pseudo-inverse), which deals with such an object in an efficient and stable way.
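To see how this works in practice, here is a minimal sketch in Julia with an assumed basis of functions and synthetic data (all choices below are illustrative); it compares the explicit normal-equations formula with the pseudo-inverse and with Julia's backslash operator, which solves the least-squares problem through a QR factorization:

using LinearAlgebra                                 # for pinv

ϕ = [x -> 1.0, x -> sin(x), x -> cos(x)]            # ν = 3 assumed basis functions
xdata = range(0, 2π, length = 50)                   # synthetic data
ydata = 0.5 .+ 2 .* sin.(xdata) .- cos.(xdata) .+ 0.1 .* randn(50)

Φ = [ϕk(x) for x in xdata, ϕk in ϕ]                 # N×ν matrix with Φ[i,k] = ϕ_k(x_i)

β_normal = (Φ' * Φ) \ (Φ' * ydata)                  # the normal equations of the text
β_pinv   = pinv(Φ) * ydata                          # pseudo-inverse
β_qr     = Φ \ ydata                                # backslash (QR-based least squares)
# the three results agree up to numerical precision

For well-conditioned problems the three agree; for ill-conditioned ones, the last two are the safer choices.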

Let us now turn from the general theory to particular cases of interest.

By far the most famous and most widely used type of fitting is with a linear function, of the type:

$$f(x;\beta)=\beta_0+x\beta_1\,.$$

This is known as linear regression. In this case, our matrix becomes simply:

$$\Phi^T\Phi=\pmatrix{ 1 & 1 & 1 & \dots\\ x_1 & x_2 & x_3 & \dots }\pmatrix{ 1 & x_1 \\ 1 & x_2 \\ 1 & x_3 \\ \vdots & \vdots } = \pmatrix{ N & \sum x_i\\ \sum x_i & \sum x_i^2 }\,,$$

The inverse of a $2\times2$ matrix is the inverse of its determinant times the matrix in which the diagonal elements are swapped and the off-diagonal ones change sign, so:

$$(\Phi^T\Phi)^{-1} = \frac{1}{N\sum x_i^2-(\sum x_i)^2} \pmatrix{ \sum x_i^2 & -\sum x_i\\ -\sum x_i & N }\,, $$

and since:

$$\Phi^T y=\pmatrix{ \sum y_i \\ \sum x_iy_i }$$

this brings us to the final, so-called closed-form result:

$$\vec\beta=(\Phi^T\Phi)^{-1}\Phi^T y= \frac{1}{N\sum x_i^2-(\sum x_i)^2} \pmatrix{ \sum x_i^2\sum y_i-\sum x_i\sum x_iy_i \\ -\sum x_i\sum y_i + N\sum x_iy_i }\,, $$

which is also exact, and remains tractable even without a computer. These formulas are used by most people as if coming out of a theoretical black box; now you know where they come from.
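As a quick check of these formulas, here is a minimal sketch in Julia with made-up data (the slope, intercept and noise level are assumptions), comparing the closed-form result with the matrix solution:

xdata = collect(1.0:10.0)                     # made-up data
ydata = 3 .+ 2 .* xdata .+ 0.5 .* randn(10)

N = length(xdata)
Sx, Sy   = sum(xdata), sum(ydata)
Sxx, Sxy = sum(xdata .^ 2), sum(xdata .* ydata)
Δ = N * Sxx - Sx^2                            # the determinant in the denominator

β0 = (Sxx * Sy - Sx * Sxy) / Δ                # intercept
β1 = (N * Sxy - Sx * Sy) / Δ                  # slope

[ones(N) xdata] \ ydata                       # same result, ≈ [β0, β1]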

As we did for interpolation, let us now move from the linear case to polynomials, now of some degree $n$ (to fit rather than interpolate, we need $n+1<N$). In this case:

$$f(x;\beta)\equiv\sum_{k=0}^{n}\beta_k x^k$$

and our $\Phi$ matrix (a Vandermonde matrix for this polynomial basis; the product $\Phi^T\Phi$ is what goes by the name of a Gramian) reads:

$$\Phi=\pmatrix{ 1 & x_1 & x_1^2 & x_1^3 & \dots \\ 1 & x_2 & x_2^2 & x_2^3& \dots \\ \vdots & & & & \vdots \\ 1 & x_N & x_N^2 & x_N^3 & \dots }$$

Let us try with Julia:


XXX
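As a minimal sketch of what such a fit could look like (the data and the degree below are assumptions for illustration), one can build the matrix above explicitly and let Julia's backslash operator solve the least-squares problem:

xdata = range(-1, 1, length = 30)                 # assumed data
ydata = cos.(2 .* xdata) .+ 0.05 .* randn(30)

degree = 3
Φ = [x^k for x in xdata, k in 0:degree]           # N×(degree+1) matrix of the text
β = Φ \ ydata                                     # least-squares coefficients β_0, ..., β_degree

fitted(x) = sum(β[k+1] * x^k for k in 0:degree)   # the fitted polynomial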

We conclude with fitting with functions other than polynomials. In computer science, polynomials typically provide enough generality, but in physics many functional dependencies are of a different type. For instance, it is customary to encounter exponential behaviour (which would require an infinite number of polynomial terms).

One can use linear least squares rigorously as long as the coefficients $\beta$ enter linearly. Sometimes it may appear that a clever trick can extend its reach. Say we have data distributed according to (the decay of a radioactive sample of initial size $N(0)$):

$$N(t)=N(0)\exp(-\gamma t)$$

Then we could do the trick:

$$\ln(N(t))=\ln(N(0))-\gamma t$$

which is a linear function of the coefficients $(\ln N(0), \gamma)$ and we could even use the analytical results above.

Tricks can be clever or they can be foolish and deceiving.

In this case, this is considered bad practice (although it can come in handy for a quick, good-enough job). The reason is that the noise is not transformed linearly, due to the nonlinear nature of the function, so the regression analysis (the linear least squares) will bias the estimated parameters. Explore this case as an exercise, by fitting nonlinearly and also by using the exact closed-form result above on the log-transformed data. For each time, generate a cloud of data so that you also get access to the noise, and fit the mean as a function of time. (You can also use the exprnd() function to simulate decay times.)
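A possible sketch of this exercise in Julia follows; the sample size, noise level and time grid are all assumptions, and the nonlinear fit relies on the LsqFit package introduced just below:

using LsqFit, Statistics

γ, N0 = 1.0, 1000.0                        # assumed decay rate and initial population
tdata = collect(range(0, 4, length = 20))
# a cloud of noisy measurements at each time; we fit its mean as a function of time
Ndata = [mean(N0 * exp(-γ * t) + 10 * randn() for _ in 1:50) for t in tdata]

# (a) the log-trick: linear regression on ln N(t)
Φ = [ones(length(tdata)) tdata]
a, b = Φ \ log.(Ndata)                     # estimates ln N(0) ≈ a and γ ≈ -b

# (b) direct nonlinear fit of the exponential itself
model(t, p) = p[1] .* exp.(-p[2] .* t)
fitnl = curve_fit(model, tdata, Ndata, [800.0, 0.8])
fitnl.param                                # estimates of [N(0), γ]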

YYY

Clearly, seen as a line, this does not look like a good fit! It is in fact excellent, as seen on the exponential-fit figure: the long-time events are rare, and thus noisier, and the logarithm exaggerates their scatter.

Nonlinear fit with Julia:

using LsqFit                            # provides curve_fit
m(t, p) = p[1] * exp.(p[2] * t)         # model: p[1] e^{p[2] t}
p0 = [0.5, 0.5]                         # initial guess for the parameters
myfit = curve_fit(m, xdata, ydata, p0)  # xdata, ydata: the data to be fitted

This will return a composite type, LsqFitResult, from which we can extract:

myfit.dof: degrees of freedom
myfit.param: best fit parameters
myfit.resid: vector of residuals
myfit.jacobian: estimated Jacobian at solution
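To make the snippet above self-contained, here is a possible minimal example with assumed synthetic data; LsqFit also provides stderror for the parameter uncertainties:

using LsqFit

xdata = collect(range(0, 4, length = 30))
ydata = 2.0 .* exp.(-1.2 .* xdata) .+ 0.03 .* randn(30)   # assumed noisy decay

m(t, p) = p[1] * exp.(p[2] * t)
myfit = curve_fit(m, xdata, ydata, [1.0, -0.5])

myfit.param          # best-fit parameters, expected close to [2.0, -1.2]
myfit.dof            # degrees of freedom: 30 points - 2 parameters = 28
myfit.resid          # residuals at the solution
stderror(myfit)      # standard errors on the parameters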