Used to solve non-linear least squares problems of the form
$$J(\mathbf{x}) = \frac{1}{2}\sum_{i=1}^{n} u_i(\mathbf{x})^2 = \frac{1}{2}\mathbf{u}(\mathbf{x})^T \mathbf{u}(\mathbf{x})$$
where:
- $\mathbf{x}$ is the vector of parameters we want to optimize
- $\mathbf{u}(\mathbf{x})$ is the residual vector
- Each $u_i(\mathbf{x})$ is a non-linear function measuring the error in the $i$th observation
- The $\frac{1}{2}$ is for convenience when computing the derivative

What constitutes a non-linear least squares problem? The residuals $u_i(\mathbf{x})$ depend non-linearly on the parameters $\mathbf{x}$; if they were linear in $\mathbf{x}$, the problem would reduce to ordinary (linear) least squares.
Under Gaussian assumptions, minimizing this non-linear least squares cost is equivalent to computing the maximum likelihood estimate.
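As a concrete (hypothetical) instance, fitting $y \approx a\,e^{bt}$ to a handful of measurements is a non-linear least squares problem, since the parameters $(a, b)$ enter the model non-linearly. A minimal sketch of the residuals and cost, with made-up data:

```python
import numpy as np

# Hypothetical problem: fit y ≈ a * exp(b * t). The data below is made up.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = np.array([2.1, 2.9, 4.6, 6.5, 9.8])

def residuals(x):
    """u(x): one error term per observation, non-linear in x = (a, b)."""
    a, b = x
    return a * np.exp(b * t) - y

def cost(x):
    """J(x) = 1/2 * u(x)^T u(x)."""
    u = residuals(x)
    return 0.5 * np.dot(u, u)
```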
# Key Assumption
Gauss-Newton assumes that our initial guess is relatively near the optimum. At its core, Gauss-Newton is an extension of Newton’s Method that sidesteps the difficulty of deriving the full Hessian.
From our example in Newton’s Method, if we try to actually derive the Jacobian and Hessian, we get:
$$\text{Jacobian: }\frac{\partial J(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}} = \mathbf{u}(\mathbf{x}_{\text{op}})^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)$$
$$\text{Hessian: }\frac{\partial^2 J(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}} = \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right) + \sum_{i=1}^{n} u_i(\mathbf{x}_{\text{op}}) \left(\frac{\partial^2 u_i(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}}\right)$$
The Hessian here is tough to compute, especially the second-derivative term coming from the non-linearity of each $u_i$.
>[!error]
>To get around this, we assume that we are already near the local minimum and thus our error $u_{i}(\mathbf{x}_{op})$ is small (ideally 0). So we drop the second-derivative term altogether and approximate the Hessian.
>$$\text{Hessian: }\frac{\partial^2 J(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}} = \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right) + \cancelto{ 0 }{ \sum_{i=1}^{n} u_i(\mathbf{x}_{\text{op}}) \left(\frac{\partial^2 u_i(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}}\right) }$$
$$\text{Hessian: }\frac{\partial^2 J(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}} \approx \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)$$
Plugging this into Newton’s update gives
$$\Rightarrow \quad \left(\frac{\partial^2 J(\mathbf{x})}{\partial \mathbf{x} \partial \mathbf{x}^T}\bigg|_{\mathbf{x}_{\text{op}}}\right) \delta\mathbf{x}^* = -\left(\frac{\partial J(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T$$
$$\left(\left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)\right) \delta\mathbf{x}^* = -\left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \mathbf{u}(\mathbf{x}_{\text{op}})$$
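To get a feel for the term being dropped, here is a quick numerical check on a toy problem (the exponential model and all numbers are assumptions for illustration): it compares the Gauss-Newton term $\left(\partial\mathbf{u}/\partial\mathbf{x}\right)^T\left(\partial\mathbf{u}/\partial\mathbf{x}\right)$ with the dropped term $\sum_i u_i\,\partial^2 u_i/\partial\mathbf{x}\partial\mathbf{x}^T$, built by finite differences. Since the dropped term is linear in the residuals, it vanishes as the residuals go to zero.

```python
import numpy as np

# Toy noise-free model y = 2 * exp(0.8 * t), so residuals are exactly zero at the optimum.
t = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
y = 2.0 * np.exp(0.8 * t)

def u(x):
    a, b = x
    return a * np.exp(b * t) - y          # residual vector u(x)

def jac(x, eps=1e-6):
    """Numerical Jacobian du/dx via central differences."""
    n, m = len(u(x)), len(x)
    J = np.zeros((n, m))
    for j in range(m):
        step = np.zeros(m); step[j] = eps
        J[:, j] = (u(x + step) - u(x - step)) / (2 * eps)
    return J

def dropped_term(x, eps=1e-4):
    """sum_i u_i(x) * Hessian(u_i)(x): the term Gauss-Newton throws away."""
    m = len(x)
    H = np.zeros((m, m))
    for j in range(m):
        step = np.zeros(m); step[j] = eps
        dJ = (jac(x + step) - jac(x - step)) / (2 * eps)   # d^2 u_i / dx_j dx_k
        H[j, :] = u(x) @ dJ
    return H

for x_op in [np.array([2.0, 0.8]), np.array([1.0, 0.3])]:  # near vs. far from the optimum
    gn_term = jac(x_op).T @ jac(x_op)
    print(np.linalg.norm(dropped_term(x_op)), np.linalg.norm(gn_term))
```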
These normal equations lead to the same derivation as in the **Example** below.
# How to Gauss-Newton
At its core, we linearize the residuals, solve the linear least squares problem, update our estimate, rinse and repeat.
1. Start with an initial guess $\mathbf{x}_0$; subsequent guesses are denoted $\mathbf{x}_{k}$
2. Linearize the residuals (using a first order [[Taylor Series]])
$$r_i(\mathbf{x}_k + \Delta\mathbf{x}) \approx r_i(\mathbf{x}_k) + \nabla r_i(\mathbf{x}_k)^T \Delta\mathbf{x}$$
Where $\nabla r_{i}$ is the gradient of the residual. In matrix (vector) form:
$$\mathbf{r}(\mathbf{x}_k + \Delta\mathbf{x}) \approx \mathbf{r}(\mathbf{x}_k) + \mathbf{J}_k \Delta\mathbf{x}$$
Where $\mathbf{J}_{k}$ is the [[Jacobian]] of $\mathbf{r}$ evaluated at $\mathbf{x}_{k}$
3. Formulate the linear least squares problem
$$\min_{\Delta\mathbf{x}} \frac{1}{2}\|\mathbf{r}(\mathbf{x}_k + \Delta\mathbf{x})\|^2 \approx \min_{\Delta\mathbf{x}} \frac{1}{2}\|\mathbf{r}(\mathbf{x}_k) + \mathbf{J}_k \Delta\mathbf{x}\|^2$$
Expanding the right-hand side:
$$f(\Delta\mathbf{x}) = \frac{1}{2}(\mathbf{r}_k + \mathbf{J}_k \Delta\mathbf{x})^T (\mathbf{r}_k + \mathbf{J}_k \Delta\mathbf{x})\;\;\text{where}\;\;\mathbf{r}_{k}=\mathbf{r}(\mathbf{x}_{k})$$
$$f(\Delta\mathbf{x}) = \frac{1}{2}\mathbf{r}_k^T\mathbf{r}_k + \mathbf{r}_k^T\mathbf{J}_k\Delta\mathbf{x} + \frac{1}{2}\Delta\mathbf{x}^T\mathbf{J}_k^T\mathbf{J}_k\Delta\mathbf{x}$$
4. Solve by taking the derivative and setting it to zero
$$\frac{\partial f}{\partial \Delta\mathbf{x}} = \mathbf{J}_k^T\mathbf{r}_k + \mathbf{J}_k^T\mathbf{J}_k\Delta\mathbf{x}$$
$$\mathbf{J}_k^T\mathbf{J}_k\Delta\mathbf{x} +\mathbf{J}_k^T\mathbf{r}_k= 0$$
$$\Delta\mathbf{x}_k = -(\mathbf{J}_k^T\mathbf{J}_k)^{-1}\mathbf{J}_k^T\mathbf{r}_k$$
In practice we solve the normal equations rather than explicitly inverting $\mathbf{J}_k^T\mathbf{J}_k$:
$$\mathbf{J}_k^T\mathbf{J}_k\Delta\mathbf{x}_k = -\mathbf{J}_k^T\mathbf{r}_k$$
5. Update the estimate (the full loop is sketched below)
$$\mathbf{x}_{k+1} = \mathbf{x}_k + \Delta\mathbf{x}_k$$
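Putting steps 1–5 together, a minimal sketch of the loop in Python. The exponential model, the data, the initial guess, and the `gauss_newton` helper are all hypothetical choices for illustration, not something prescribed by the method itself:

```python
import numpy as np

# Hypothetical fitting problem: y ≈ a * exp(b * t), with lightly noisy synthetic data.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 20)
y = 2.0 * np.exp(0.8 * t) + 0.05 * rng.standard_normal(t.size)

def residual(x):
    a, b = x
    return a * np.exp(b * t) - y                     # r(x), shape (n,)

def jacobian(x):
    a, b = x
    # Analytic Jacobian dr/dx: columns are dr_i/da and dr_i/db.
    return np.column_stack([np.exp(b * t), a * t * np.exp(b * t)])

def gauss_newton(x0, tol=1e-8, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        r = residual(x)                              # step 2: evaluate residuals
        J = jacobian(x)                              # ...and the Jacobian at x_k
        # Steps 3-4: solve the normal equations J^T J dx = -J^T r (no explicit inverse).
        dx = np.linalg.solve(J.T @ J, -J.T @ r)
        x = x + dx                                   # step 5: x_{k+1} = x_k + dx_k
        if np.linalg.norm(dx) < tol:                 # stop once the step is tiny
            break
    return x

print(gauss_newton([1.0, 0.5]))                      # should land near (a, b) = (2.0, 0.8)
```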
# Example
Consider the cost
$$J(\mathbf{x}) = \frac{1}{2}\mathbf{u}(\mathbf{x})^T \mathbf{u}(\mathbf{x})$$
where the error vector is just the residual vector,
$$\mathbf{u}(\mathbf{x})=\mathbf{r}(\mathbf{x})$$
We **linearize $\mathbf{u}(\mathbf{x}_{op}+\delta \mathbf{x})$**
$$\mathbf{u}(\mathbf{x}_{op}+\delta \mathbf{x})\simeq \mathbf{u}(\mathbf{x}_{op})+\left( \frac{ \partial \mathbf{u}(\mathbf{x}) }{ \partial \mathbf{x} } \bigg|_{\mathbf{x}_{op}}\right)\delta \mathbf{x}$$
Substituting into $J(\mathbf{x}_{op}+\delta \mathbf{x})$:
$$J(\mathbf{x}_{op}+\delta \mathbf{x})\simeq \frac{1}{2} \left(\mathbf{u}(\mathbf{x}_{op})+\left( \frac{ \partial \mathbf{u}(\mathbf{x}) }{ \partial \mathbf{x} } \bigg|_{\mathbf{x}_{op}}\right)\delta \mathbf{x}\right)^{T}\left(\mathbf{u}(\mathbf{x}_{op})+\left( \frac{ \partial \mathbf{u}(\mathbf{x}) }{ \partial \mathbf{x} } \bigg|_{\mathbf{x}_{op}}\right)\delta \mathbf{x}\right)$$
Taking the derivative with respect to $\delta\mathbf{x}$ and setting it to zero:
$$\frac{\partial J(\mathbf{x}_{\text{op}} + \delta\mathbf{x})}{\partial \delta\mathbf{x}} = \left(\mathbf{u}(\mathbf{x}_{\text{op}}) + \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)\delta\mathbf{x}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right) = 0$$
$$\Rightarrow \quad \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right) \delta\mathbf{x}^* = -\left(\frac{\partial \mathbf{u}(\mathbf{x})}{\partial \mathbf{x}}\bigg|_{\mathbf{x}_{\text{op}}}\right)^T \mathbf{u}(\mathbf{x}_{\text{op}})$$
We solve for $\delta \mathbf{x}^{*}$ and update our operating point, repeating until convergence:
$$\mathbf{x}_{op}\leftarrow \mathbf{x}_{op}+\delta \mathbf{x}^{*}$$
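In code, one update of the operating point is just a solve of those normal equations. A minimal sketch, where `u` and `A` stand for $\mathbf{u}(\mathbf{x}_{op})$ and $\frac{\partial \mathbf{u}}{\partial \mathbf{x}}\big|_{\mathbf{x}_{op}}$; the numbers are made up purely to show the shapes involved:

```python
import numpy as np

def gauss_newton_step(u, A):
    """Solve (A^T A) dx = -A^T u for the update dx*."""
    return np.linalg.solve(A.T @ A, -A.T @ u)

# Made-up residuals and Jacobian at the current operating point (3 observations, 2 parameters).
u = np.array([0.10, -0.20, 0.05])
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])

x_op = np.array([1.0, 1.0])
x_op = x_op + gauss_newton_step(u, A)    # x_op <- x_op + dx*
```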
# Patches for Practicality
1. Adding a step size
$$\mathbf{x}_{op}\leftarrow \mathbf{x}_{op}+\alpha\;\delta \mathbf{x}^{*}$$
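One simple way to choose $\alpha$ (an assumed convention here, not the only option) is backtracking: start at $\alpha = 1$ and halve it until the cost $J$ actually decreases. A minimal sketch:

```python
import numpy as np

def damped_update(x_op, dx, u, alpha0=1.0, beta=0.5, max_halvings=10):
    """Scale the Gauss-Newton step by alpha, shrinking it until the cost decreases.

    u : callable returning the residual vector u(x); cost is J(x) = 0.5 * ||u(x)||^2.
    """
    cost = lambda x: 0.5 * np.dot(u(x), u(x))
    J0 = cost(x_op)
    alpha = alpha0
    for _ in range(max_halvings):
        candidate = x_op + alpha * dx
        if cost(candidate) < J0:          # accept the first step that reduces the cost
            return candidate
        alpha *= beta                     # otherwise shrink the step and try again
    return x_op                           # no improvement found; keep the old estimate

# Tiny usage example with a hypothetical 1-D residual u(x) = x^2 - 2.
u = lambda x: np.array([x[0] ** 2 - 2.0])
x_new = damped_update(np.array([2.0]), np.array([-0.5]), u)
```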
