Derivation of variational Bayesian for nonlinear models

Scope

The derivation procedure for the formulation of the Variational Bayesian (VB) method proposed in the references listed below is addressed here in details together with an improvement in the noise model as compared to the one used in these works; namely, the noise model can adopt a prescribed non-uniform covariance matrix.

Main criterion (maximization of free energy)

With \(\boldsymbol{y}\): measured data, \(\boldsymbol{w}=[\boldsymbol{w}_i]\): latent parameters, \(P(\boldsymbol{w})\): priors and \(q(\boldsymbol{w})\): approximated posteriors, the following free energy must be maximized.

(1)\[\require{cancel} F = \int q(\boldsymbol{w}) \log \left[ \frac{P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})}{q(\boldsymbol{w})} \right] d\boldsymbol{w} \, .\]

We use the mean field approximation for posteriors \(q(\boldsymbol{w})\):

(2)\[q(\boldsymbol{w}) = \prod_{i=1}^m q_{i}(\boldsymbol{w}_i) \, ,\]

and apply the calculus of variations to fulfil the maximization of the free energy. For this purpose, we regroup the free energy defined above for every and each index \(i\).

(3)\[\begin{split}F &= \int q_{i} \, q_{\cancel{i}} \, \log\left[P(\boldsymbol{y}|\boldsymbol{w})\, P(\boldsymbol{w})\right] - q_{i} \,q_{\cancel{i}} \, \log[q_{i}] - q_{i} \, q_{\cancel{i}} \, \log[q_{\cancel{i}}] \;d \boldsymbol{w} ; \; \mbox{for each } i \, .\\\end{split}\]

where \(_{\cancel{i}}\) indicates all indices except \(i\), and \(q_\cancel{i}\) represents the product of all distributions except \(q_i\).

We consider the above equation for every and each \(_i\) and shift integrants corresponding to \(\boldsymbol{w}_\cancel{i}\) to a separate term (\(g\)), resulting in the following expression.

(4)\[\begin{split}F &= \int g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i)\right) \;d\boldsymbol{w}_i \, ,\\\end{split}\]

with

\[\begin{split}g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i)\right) &= \int f\left(\boldsymbol{w}, q(\boldsymbol{w})\right) \;d\boldsymbol{w}_\cancel{i} \, ,\\\end{split}\]

where \(f\left(\boldsymbol{w}, q(\boldsymbol{w})\right)\) is the whole integrant in (3).

From variational calculus, the maximum of F represented in (4) is the solution of the Euler-Lagrange equation.

\[\begin{split}\frac{\partial}{\partial q_i(\boldsymbol{w}_i)} \left[ g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i), q'_{i}(\boldsymbol{w}_i)\right) \right]- \frac{d}{d\boldsymbol{w}_i}\left\{ \frac{\partial}{\partial q'_i(\boldsymbol{w}_i)}\left[g( \boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i), q'_{i}(\boldsymbol{w}_i)) \right] \right\}&=0.\\\end{split}\]

The second term vanishes, since \(g\) does not depend on \(q'_i(\boldsymbol{w}_i)\). Substituting the first term using the mean field approximation of F yields

\[\begin{split}0&= \frac{\partial }{\partial q_i} \left[ \int q_{i} \, q_{\cancel{i}} \, \log\left[P(\boldsymbol{y}|\boldsymbol{w})\, P(\boldsymbol{w})\right] - q_{i} \,q_{\cancel{i}} \, \log[q_{i}] - q_{i} \, q_{\cancel{i}} \, \log[q_{\cancel{i}}] \;d \boldsymbol{w}_{\!\cancel{i}} \right]\\ &= \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})] - (q_{\cancel{i}} \, \log[q_{i}] + \frac{q_{i} \, q_{\cancel{i}}}{q_{i}}) - q_{\cancel{i}} \,\log[q_{\cancel{i}}] \; d\boldsymbol{w}_{\!\cancel{i}}.\\\end{split}\]

With the property of a density function to integrate to one, it follows:

\[\begin{split}0&= \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w})]d\boldsymbol{w}_{\cancel{i}} - \log[q_{i}] - 1 - \int q_{\cancel{i}}\log[q_{\cancel{i}}]d\boldsymbol{w}_{\!\cancel{i}}.\\\end{split}\]

Moving the second term to the other side of the equation and realizing that the latter two terms do not depend on \(q_{i}\), we obtain the following set of governing equations for the maximization of the free energy.

(5)\[\begin{split}\log[q_{i}] = \int q_{\cancel{i}}\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\boldsymbol{w}_{\!\cancel{i}} + \mbox{const.} \quad ; \; \mbox{for each } i \, .\\\end{split}\]

Note that, (5) agrees with eq.(5) of Chappell paper:

\[\begin{split}\log[q_{i}] & \propto \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w})] d\boldsymbol{w}_{\!\cancel{i}} \, .\\\end{split}\]

Model of noise (error)

We define the noise (error) \(\boldsymbol{e}\) as the gap between the data \(\boldsymbol{y}\) and the model response \(g(\theta)\):

(6)\[\begin{split}\boldsymbol{e}=\boldsymbol{e}(\theta)=:\boldsymbol{y}-g(\theta) \, ,\\\end{split}\]

and model it through a multivariate normal distribution, i.e.:

(7)\[\boldsymbol{e} \sim \mathcal{MVN}(\boldsymbol{0}, \Phi^{-1} \color{red}{C_e} ) \quad \rightarrow \quad P(\boldsymbol{e}|\Phi)=\frac{\Phi^{N/2}}{\sqrt{(2\pi)^{N}\color{red}{|C_e|}}}e^{-0.5\,\Phi\,\boldsymbol{e}^T. \color{red}{C_e^{-1}} .\boldsymbol{e}} \; ,\]

where \(N\) is the length of model error signal and \(\color{red}{C_e}\) is a known covariance matrix scaled by means of the latent parameter \(\Phi\) to represent the covariance of the distribution.

With such a setup, the latent parameters will become:

(8)\[\boldsymbol{w} = [\boldsymbol{w}_{\theta}, \boldsymbol{w}_{\Phi}] \; ,\]

which also implies \(i \in \left\{ \theta, \Phi \right\}\) in (2) and (5), resulting in the following governing equations for the inference.

(9)\[\log[q_{\theta}] = \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi + \mbox{const.} \, .\]
(10)\[\log[q_{\Phi}] = \int q_{\theta} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\theta + \mbox{const.} \, .\]

Log likelihood

The likelihood is by definition:

(11)\[\mbox{Likelihood} =: P(\boldsymbol{y}|\boldsymbol{w}) = P(\boldsymbol{y}|\theta, \Phi) = \underbrace{P(\boldsymbol{y}|\boldsymbol{e}, \theta, \Phi)}_{=1} \, P(\boldsymbol{e}(\theta)|\Phi)=P(\boldsymbol{e}|\Phi) \; .\]

According to (7), the log likelihood then reads:

(12)\[\log\left[ \mbox{Likelihood} \right] =: \log\left[ P(\boldsymbol{y}|\boldsymbol{w}) \right] = \frac{N}{2}\log\left[\Phi\right]-\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + \mbox{const.} \; .\]

Priors

We consider an \(\mathcal{MVN}\) distribution for the prior of \(\theta\) (\(N_{\theta}\) is the length of parameter vector \(\theta\)):

(13)\[\theta \sim \mathcal{MVN}(\boldsymbol{m}_0, \Lambda_0^{-1}) \quad \rightarrow \quad P(\theta|\boldsymbol{m}_0,\Lambda_0)=\sqrt{ \frac{|\Lambda_0|}{(2\pi)^{N_{\theta}}} } \; e^{-0.5\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0)} \; ,\]

and a Gamma distribution for the prior of the noise parameter \(\Phi\):

(14)\[\Phi \sim \Gamma(c_0,s_0) \quad \rightarrow \quad P(\Phi|c_0,s_0)=\frac{1}{\Gamma(c_0)}\frac{\Phi^{c_0-1}}{s_0^{c_0}}e^{-\frac{\Phi}{s_0}} \; .\]

One assumption in the derivations that follow is, that the priors of the latent parameters \(\theta\) and \(\Phi\) are independent, i.e.:

(15)\[P(\boldsymbol{w}) = P(\theta, \Phi) = P(\theta)\,P(\Phi)\]

\(\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right]\)

We need to expand the term \(\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right]\) as a required step for solving (9) and (10). It is important to recognize, that this term is log of the posterior plus log of the constant evidence: \(\int P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) d\boldsymbol{w}\). Using (12), (15), (13) and (14), and adding all constant terms to one term, we obtain:

(16)\[\begin{split}\begin{split} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] &= \log \left[ P(\boldsymbol{y}|\theta,\Phi) \right] + \log \left[ P(\theta) \right] + \log \left[ P(\Phi) \right] \\ &=\frac{N}{2}\log\left[\Phi\right]-\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \\ &-\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \\ &+(c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \color{blue}{\mathrm{const}\lbrace \boldsymbol{\theta},\Phi \rbrace} \, . \end{split}\end{split}\]

This equation can be regrouped in the following way that is more handy for the next steps.

(17)\[\begin{split}\begin{split} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] &= -\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \quad ; \\ &f_{\theta}=: -\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \\ &f_{\Phi}=:(c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] \, . \end{split}\end{split}\]

The constant term that is expressed below will not influence the solution of (9) and (10), however, it will be used for the computation of the free energy.

\[\color{blue}{\mathrm{const} \lbrace \boldsymbol{\theta},\Phi \rbrace = -\frac{N+N_{\theta}}{2}\log[2\pi] -\frac{1}{2}\log[\color{red}{|C_e|}] + \frac{1}{2} \log[\mathrm{det}(\Lambda_0)] + \log[1/\Gamma(c_0)]-c_0\log[s_0]} \, .\]

Approximated posteriors

To become able to derive suitable update equations, which will follow up, we select the approximated posteriors in exactly the same form as the priors, i.e.:

(18)\[q_{\theta} = \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) \quad \rightarrow \quad q(\theta)=\sqrt{ \frac{|\Lambda|}{(2\pi)^N_{\theta}} } \; e^{-0.5\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m})} \; ,\]
(19)\[q_{\Phi} = \Gamma(\Phi;\,c,s) \quad \rightarrow \quad q(\Phi)=\frac{1}{\Gamma(c)}\frac{\Phi^{c-1}}{s^{c}}e^{-\frac{\Phi}{s}} \; ,\]

which introduces \(\boldsymbol{m},\,\Lambda,\,c,\,s\) as the deterministic latent parameters to be identified.

Taylor expansion of the model error

Another simplification made in the VB is to approximate the model error defined in (6) by means of a first-order Taylor expansion around the posterior mean \(\boldsymbol{m}\):

(20)\[\boldsymbol{e} = \boldsymbol{e}(\theta) =: \boldsymbol{y} - g(\theta) \approx \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \, ,\]

with:

(21)\[\boldsymbol{k} =: \boldsymbol{e}(\boldsymbol{m}) = \boldsymbol{y} - g(\boldsymbol{m}) \, ,\]

and

(22)\[\boldsymbol{J} =: \frac{dg(\theta)}{d\theta}|_{\theta=\boldsymbol{m}} = -\frac{d\boldsymbol{e}(\theta)}{d\theta}|_{\theta=\boldsymbol{m}}\, .\]

Update equations

Update of \(\theta\)

Considering (17) and (18), we start with separately expanding both sides of (9).

(23)\[\begin{split}\begin{split} \log \left[ q_{\theta} \right] &= -\frac{1}{2}\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m}) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} \\ &=-\frac{1}{2} \left( \theta^T\Lambda\theta + \theta^T\Lambda\boldsymbol{m} + \boldsymbol{m}^T\Lambda\theta \right) + \mathrm{const_2}\left\{\boldsymbol{\theta} \right\} \\ &; \quad \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\} = \frac{1}{2}\log\left[\,|\Lambda|\,\right] -\frac{N_{\theta}}{2}\log[2\pi]} \, . \end{split}\end{split}\]
(24)\[\begin{split}\begin{split} \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi &= -\frac{1}{2}\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \underbrace{\int \Phi q_{\Phi} d\Phi}_{=sc} + f_{\theta} \underbrace{\int q_{\Phi} d\Phi}_{=1} + \int \left( f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) q_{\Phi} d\Phi \\ &= -\frac{1}{2}sc\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + \mbox{const}\left\{\theta\right\} \, , \end{split}\end{split}\]

where the two identities \(\int \Phi q_{\Phi} d\Phi=sc\) and \(\int q_{\Phi} d\Phi=1\) have been used. Equation (24) can be further simplified according to (20) and \(f_{\theta}\) defined in (17):

(25)\[\begin{split}\begin{split} \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi =&-\frac{1}{2}sc\,\left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) ^T.\color{red}{C_e^{-1}}. \left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) \\ &-\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) + \mbox{const}\left\{\theta\right\} \, . \\ =& -\frac{1}{2} \left( \theta^T.h_1.\theta - \theta^T.h_2 - h_3.\theta \right) + \mbox{const}\left\{\theta\right\} \quad ; \\ h_1 =& \Lambda_0+sc\,\boldsymbol{J}^T\color{red}{C_e^{-1}}\boldsymbol{J} \\ h_2=& \Lambda_0\boldsymbol{m}_0 + sc\boldsymbol{J}^T\color{red}{C_e^{-1}} \left( \boldsymbol{k} + \boldsymbol{J}\boldsymbol{m} \right) \\ h_3=& \boldsymbol{m}_0^T\Lambda_0 + sc \left( \boldsymbol{k}^T+\boldsymbol{m}^T\boldsymbol{J}^T \right) \color{red}{C_e^{-1}}\boldsymbol{J} \end{split}\end{split}\]

We assume that the matrices \(\Lambda_0\) and \(\color{red}{C_e^{-1}}\) are symmetric, i.e. \(h_2^T=h_3\). Then, by comparing both sides of (9) from (23) and (25), we obtain the following equations for the parameters (\(\boldsymbol{m}, \,\Lambda\)) of the posterior \(\theta\).

(26)\[\begin{split}\boxed{ \begin{split} \Lambda =& \Lambda_0+sc\,\boldsymbol{J}^T\color{red}{C_e^{-1}}\boldsymbol{J} \\ \Lambda\boldsymbol{m} =& \Lambda_0\boldsymbol{m}_0 + sc\boldsymbol{J}^T\color{red}{C_e^{-1}} \left( \boldsymbol{k} + \boldsymbol{J}\boldsymbol{m} \right) \end{split} }\end{split}\]

Update of \(\Phi\)

A similar procedure is performed for expanding both sides of (10) using (17), (19) and (20).

(27)\[\begin{split}\begin{split} \log \left[ q_{\Phi} \right] &= \log \left[ \frac{1}{\Gamma(c)}\frac{\Phi^{c-1}}{s^{c}}e^{-\frac{\Phi}{s}} \right] = -\log\left[ \Gamma(c)\right] -c\log\left[s\right] + (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} \\ &= (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} + \color{blue}{\mathrm{const}\left\{\Phi \right\}} \\ &; \quad \color{blue}{\mathrm{const}\left\{\Phi \right\} = -\log\left[ \Gamma(c)\right] -c\log\left[s\right]} \; . \end{split}\end{split}\]
(28)\[\begin{split}\begin{split} \int q_{\theta} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\theta =& -\frac{1}{2}\Phi \int \boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \, q_{\theta} d\theta + f_{\Phi} \underbrace{\int q_{\theta} d\theta}_{=1} + \int \left( f_{\theta} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) q_{\theta} d\theta \, . \\ =& -\frac{1}{2}\Phi \int \left ( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) ^T.\color{red}{C_e^{-1}}. \left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) \, q_{\theta} d\theta + f_{\Phi} + \mbox{const}\left\{\Phi\right\} \\ =& -\frac{1}{2}\Phi \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} \underbrace{\int q_{\theta} d\theta}_{=1} + \int \left( \theta - \boldsymbol{m} \right)^T.\left( \boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right).\left( \theta - \boldsymbol{m} \right) q_{\theta} d\theta \right) \\ & -\frac{1}{2}\Phi \left( -\boldsymbol{k}^T.\color{red}{C_e^{-1}}.J. \cancelto{0}{\int \left( \theta - \boldsymbol{m} \right) q_{\theta} d\theta} - \cancelto{0}{\int \left( \theta - \boldsymbol{m} \right)^T q_{\theta} d\theta} \, .\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} \right) \\ &+ (c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] + \mbox{const}\left\{\Phi\right\} \\ =& -\frac{1}{2}\Phi \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) - \frac{\Phi}{s_0} \\ &+ \log\left[\Phi\right]\left(c_0-1+\frac{N}{2}\right) + \mbox{const}\left\{\Phi\right\} \, . \end{split}\end{split}\]

Note that, in the last step of (28) the following identity (trace trick) has been used in view of \(q_{\theta}\) defined in (18).

(29)\[\begin{split}\begin{split} \boldsymbol{E}_{\mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1})}\left( \left( \theta - \boldsymbol{m} \right)^TA\left( \theta - \boldsymbol{m} \right) \right) &=: \int \left( \theta - \boldsymbol{m} \right)^TA\left( \theta - \boldsymbol{m} \right) \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) d\theta \\ &= \mbox{tr}\left( \Lambda^{-1}A\right) \quad ; \; \mbox{for any constant matrix } A \end{split}\end{split}\]

Comparing both sides of (10) simplified in (27) and (28), we obtain the following equations for the posterior noise parameters (\(c, \,s\)).

(30)\[\begin{split}\boxed{ \begin{split} c =& c_0+\frac{N}{2} \\ \frac{1}{s} =& \frac{1}{s_0} + \frac{1}{2}\left ( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) \end{split} }\end{split}\]

Simplification of the free energy

We look into maximizing the free energy defined in (1) via solving the set of algebraic equations obtained in the previous section. This optimization problem is likely to be ill-posed for highly nonlinear forward simulators (\(g\)); i.e. to have several local maximums. To cope with this situation, it is useful to compute and keep monitoring the free energy. Considering \(q(\boldsymbol{w})=q_{\theta}\,q_{\Phi}\) and \(d\boldsymbol{w}=d\theta\,d\Phi\), we start expanding (1):

(31)\[\begin{split}\begin{split} F =& \int q_{\theta}\,q_{\Phi} \log \left[ \frac{P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})}{q_{\theta}\,q_{\Phi}} \right] d\theta\,d\Phi \\ =& \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w}) \right] d\theta\,d\Phi}_{=:F_1} - \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ q_{\theta} \right] d\theta\,d\Phi}_{=:F_2} - \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ q_{\Phi} \right] d\theta\,d\Phi}_{=:F_3} \,. \end{split}\end{split}\]

The expansion of the terms \(F_1\), \(F_2\), \(F_3\) follow.

(32)\[\begin{split}\begin{split} F_1 =& \int q_{\theta}\,q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w}) \right] d\theta\,d\Phi \\ =& \int q_{\theta}\,q_{\Phi} \left ( -\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) d\theta\,d\Phi \\ =& -\frac{1}{2} \int \boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \, q_{\theta}d\theta \underbrace{\int \Phi\,q_{\Phi}d\Phi}_{=sc} -\frac{1}{2} \int \left( (\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \right) q_{\theta}d\theta \underbrace{\int q_{\Phi}d\Phi}_{=1} \\ & + \underbrace{\int q_{\theta}d\theta}_{=1} \int \left( (c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] \right)d\Phi + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \\ =& -\frac{1}{2} sc \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) -\frac{sc}{s_0} +\left( \frac{N}{2} + c_0 -1 \right) \left( \log \left[s\right] + \psi(c) \right) \\ & -\frac{1}{2}\left( (\boldsymbol{m}-\boldsymbol{m}_0)^T.\Lambda_0.(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}\Lambda_0 \right) \right) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \, . \end{split}\end{split}\]

The first term in the above simplified expression has been rewritten according to the simplication already done through the derivation of (28). We have furthermore applied the identity \(\int \log\left[ \Phi\right] d\Phi=\log\left[s\right] + \psi(c)\) with \(\psi\) being the di-gamma function defined as \(\psi(x)=:\frac{d}{dx}\ln\Gamma(x)=\frac{\Gamma'(x)}{\Gamma(x)}\), as well as the following identity that is similar to (29).

(33)\[\begin{split}\begin{split} \boldsymbol{E}_{\mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1})}\left[ \left( \theta - \boldsymbol{m}_0 \right)^TA\left( \theta - \boldsymbol{m}_0 \right) \right] &=: \int \left( \theta - \boldsymbol{m}_0 \right)^TA\left( \theta - \boldsymbol{m}_0 \right) \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) d\theta \\ &= (\boldsymbol{m}-\boldsymbol{m}_0)^TA(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}A\right) \\ & \quad ; \; \mbox{for any constant matrix } A \end{split}\end{split}\]

We continue with expanding \(F_2\) and \(F_3\) by the help of (23), (29) and (27):

(34)\[\begin{split}\begin{split} F_2 &= \int q_{\theta}\,q_{\Phi} \log \left[ q_{\theta} \right] d\theta\,d\Phi = \int \log \left[ q_{\theta} \right] q_{\theta}d\theta \int q_{\Phi} d\Phi \\ &= \int \left( -\frac{1}{2}\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m}) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} \right) q_{\theta}d\theta \\ &= -\frac{1}{2}\mathrm{tr}(\Lambda^{-1}\Lambda) +\color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} = -\frac{N_{\theta}}{2} + \color{blue}{\frac{1}{2}\log\left[\,|\Lambda|\,\right] -\frac{N_{\theta}}{2}\log[2\pi]} \,. \end{split}\end{split}\]
(35)\[\begin{split}\begin{split} F_3 &= \int q_{\theta}\,q_{\Phi} \log \left[ q_{\Phi} \right] d\theta\,d\Phi = \int q_{\theta} d\theta \int \log \left[ q_{\Phi} \right] q_{\Phi}d\Phi \\ &= \int \left( (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} + \color{blue}{\mathrm{const}\left\{\Phi \right\}} \right) q_{\Phi}d\Phi \\ &= (c-1)\left( \log \left[s\right] + \psi(c) \right) - c \color{blue}{-\log\left[ \Gamma(c)\right] -c\log\left[s\right]} \,. \end{split}\end{split}\]

Substituting (32), (34) and (35) into (31) results in:

(36)\[\begin{split}\boxed{ \begin{split} F =& -\frac{1}{2} sc \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) -\frac{sc}{s_0} -\frac{1}{2}\left( (\boldsymbol{m}-\boldsymbol{m}_0)^T.\Lambda_0.(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}\Lambda_0 \right) \right) \\ & + \cancelto{0}{\left( \frac{N}{2} + c_0 -c \right)} \left( \log \left[s\right] + \psi(c) \right) + c + \frac{N_{\theta}}{2} \color{blue}{ - \frac{1}{2}\log\left[\,|\Lambda|\,\right] + \cancel{\frac{N_{\theta}}{2}\log[2\pi]} + \log\left[ \Gamma(c)\right] + c\log\left[s\right] } \\ & \color{blue}{ -\frac{N+\cancel{N_{\theta}}}{2}\log[2\pi] -\frac{1}{2}\log[\color{red}{|C_e|}] + \frac{1}{2} \log[\mathrm{det}(\Lambda_0)] + \log[1/\Gamma(c_0)]-c_0\log[s_0] } \, . \end{split} }\end{split}\]

Notice the vanishing term provided that the noise parameter \(c\) is already updated by the first identity in (30).