Derivation of variational Bayesian for nonlinear models

Scope

The derivation procedure for the formulation of the Variational Bayesian (VB) method proposed in the references listed below is addressed here in details together with an improvement in the noise model as compared to the one used in these works; namely, the noise model can adopt a prescribed non-uniform covariance matrix.

Chappell, M. A.; Groves, A. R.; Whitcher, B. & Woolrich, M. W. Variational Bayesian Inference for a Nonlinear Forward Model, IEEE Transactions on Signal Processing, 2009, 57, 223-236 https://ieeexplore.ieee.org/document/4625948
Chappell, M. A.; Groves, A. R. & Woolrich, M. W. The FMRIB Variational Bayes Tutorial: Variational Bayesian inference for non-linear forward model, https://users.fmrib.ox.ac.uk/~chappell/papers/TR07MC1.pdf

Main criterion (maximization of free energy)

With \(\boldsymbol{y}\): measured data, \(\boldsymbol{w}=[\boldsymbol{w}_i]\): latent parameters, \(P(\boldsymbol{w})\): priors and \(q(\boldsymbol{w})\): approximated posteriors, the following free energy must be maximized.

(1)\[\require{cancel} F = \int q(\boldsymbol{w}) \log \left[ \frac{P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})}{q(\boldsymbol{w})} \right] d\boldsymbol{w} \, .\]

We use the mean field approximation for posteriors \(q(\boldsymbol{w})\):

(2)\[q(\boldsymbol{w}) = \prod_{i=1}^m q_{i}(\boldsymbol{w}_i) \, ,\]

and apply the calculus of variations to fulfil the maximization of the free energy. For this purpose, we regroup the free energy defined above for every and each index \(i\).

(3)\[\begin{split}F &= \int q_{i} \, q_{\cancel{i}} \, \log\left[P(\boldsymbol{y}|\boldsymbol{w})\, P(\boldsymbol{w})\right] - q_{i} \,q_{\cancel{i}} \, \log[q_{i}] - q_{i} \, q_{\cancel{i}} \, \log[q_{\cancel{i}}] \;d \boldsymbol{w} ; \; \mbox{for each } i \, .\\\end{split}\]

where \(_{\cancel{i}}\) indicates all indices except \(i\), and \(q_\cancel{i}\) represents the product of all distributions except \(q_i\).

We consider the above equation for every and each \(_i\) and shift integrants corresponding to \(\boldsymbol{w}_\cancel{i}\) to a separate term (\(g\)), resulting in the following expression.

(4)\[\begin{split}F &= \int g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i)\right) \;d\boldsymbol{w}_i \, ,\\\end{split}\]

with

\[\begin{split}g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i)\right) &= \int f\left(\boldsymbol{w}, q(\boldsymbol{w})\right) \;d\boldsymbol{w}_\cancel{i} \, ,\\\end{split}\]

where \(f\left(\boldsymbol{w}, q(\boldsymbol{w})\right)\) is the whole integrant in (3).

From variational calculus, the maximum of F represented in (4) is the solution of the Euler-Lagrange equation.

\[\begin{split}\frac{\partial}{\partial q_i(\boldsymbol{w}_i)} \left[ g\left(\boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i), q'_{i}(\boldsymbol{w}_i)\right) \right]- \frac{d}{d\boldsymbol{w}_i}\left\{ \frac{\partial}{\partial q'_i(\boldsymbol{w}_i)}\left[g( \boldsymbol{w}_i, q_{i}(\boldsymbol{w}_i), q'_{i}(\boldsymbol{w}_i)) \right] \right\}&=0.\\\end{split}\]

The second term vanishes, since \(g\) does not depend on \(q'_i(\boldsymbol{w}_i)\). Substituting the first term using the mean field approximation of F yields

\[\begin{split}0&= \frac{\partial }{\partial q_i} \left[ \int q_{i} \, q_{\cancel{i}} \, \log\left[P(\boldsymbol{y}|\boldsymbol{w})\, P(\boldsymbol{w})\right] - q_{i} \,q_{\cancel{i}} \, \log[q_{i}] - q_{i} \, q_{\cancel{i}} \, \log[q_{\cancel{i}}] \;d \boldsymbol{w}_{\!\cancel{i}} \right]\\ &= \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})] - (q_{\cancel{i}} \, \log[q_{i}] + \frac{q_{i} \, q_{\cancel{i}}}{q_{i}}) - q_{\cancel{i}} \,\log[q_{\cancel{i}}] \; d\boldsymbol{w}_{\!\cancel{i}}.\\\end{split}\]

With the property of a density function to integrate to one, it follows:

\[\begin{split}0&= \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w})]d\boldsymbol{w}_{\cancel{i}} - \log[q_{i}] - 1 - \int q_{\cancel{i}}\log[q_{\cancel{i}}]d\boldsymbol{w}_{\!\cancel{i}}.\\\end{split}\]

Moving the second term to the other side of the equation and realizing that the latter two terms do not depend on \(q_{i}\), we obtain the following set of governing equations for the maximization of the free energy.

(5)\[\begin{split}\log[q_{i}] = \int q_{\cancel{i}}\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\boldsymbol{w}_{\!\cancel{i}} + \mbox{const.} \quad ; \; \mbox{for each } i \, .\\\end{split}\]

Note that, (5) agrees with eq.(5) of Chappell paper:

\[\begin{split}\log[q_{i}] & \propto \int q_{\cancel{i}}\log[P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w})] d\boldsymbol{w}_{\!\cancel{i}} \, .\\\end{split}\]

Model of noise (error)

We define the noise (error) \(\boldsymbol{e}\) as the gap between the data \(\boldsymbol{y}\) and the model response \(g(\theta)\):

(6)\[\begin{split}\boldsymbol{e}=\boldsymbol{e}(\theta)=:\boldsymbol{y}-g(\theta) \, ,\\\end{split}\]

and model it through a multivariate normal distribution, i.e.:

(7)\[\boldsymbol{e} \sim \mathcal{MVN}(\boldsymbol{0}, \Phi^{-1} \color{red}{C_e} ) \quad \rightarrow \quad P(\boldsymbol{e}|\Phi)=\frac{\Phi^{N/2}}{\sqrt{(2\pi)^{N}\color{red}{|C_e|}}}e^{-0.5\,\Phi\,\boldsymbol{e}^T. \color{red}{C_e^{-1}} .\boldsymbol{e}} \; ,\]

where \(N\) is the length of model error signal and \(\color{red}{C_e}\) is a known covariance matrix scaled by means of the latent parameter \(\Phi\) to represent the covariance of the distribution.

With such a setup, the latent parameters will become:

(8)\[\boldsymbol{w} = [\boldsymbol{w}_{\theta}, \boldsymbol{w}_{\Phi}] \; ,\]

which also implies \(i \in \left\{ \theta, \Phi \right\}\) in (2) and (5), resulting in the following governing equations for the inference.

(9)\[\log[q_{\theta}] = \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi + \mbox{const.} \, .\]

(10)\[\log[q_{\Phi}] = \int q_{\theta} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\theta + \mbox{const.} \, .\]

Log likelihood

The likelihood is by definition:

(11)\[\mbox{Likelihood} =: P(\boldsymbol{y}|\boldsymbol{w}) = P(\boldsymbol{y}|\theta, \Phi) = \underbrace{P(\boldsymbol{y}|\boldsymbol{e}, \theta, \Phi)}_{=1} \, P(\boldsymbol{e}(\theta)|\Phi)=P(\boldsymbol{e}|\Phi) \; .\]

According to (7), the log likelihood then reads:

(12)\[\log\left[ \mbox{Likelihood} \right] =: \log\left[ P(\boldsymbol{y}|\boldsymbol{w}) \right] = \frac{N}{2}\log\left[\Phi\right]-\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + \mbox{const.} \; .\]

Priors

We consider an \(\mathcal{MVN}\) distribution for the prior of \(\theta\) (\(N_{\theta}\) is the length of parameter vector \(\theta\)):

(13)\[\theta \sim \mathcal{MVN}(\boldsymbol{m}_0, \Lambda_0^{-1}) \quad \rightarrow \quad P(\theta|\boldsymbol{m}_0,\Lambda_0)=\sqrt{ \frac{|\Lambda_0|}{(2\pi)^{N_{\theta}}} } \; e^{-0.5\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0)} \; ,\]

and a Gamma distribution for the prior of the noise parameter \(\Phi\):

(14)\[\Phi \sim \Gamma(c_0,s_0) \quad \rightarrow \quad P(\Phi|c_0,s_0)=\frac{1}{\Gamma(c_0)}\frac{\Phi^{c_0-1}}{s_0^{c_0}}e^{-\frac{\Phi}{s_0}} \; .\]

One assumption in the derivations that follow is, that the priors of the latent parameters \(\theta\) and \(\Phi\) are independent, i.e.:

(15)\[P(\boldsymbol{w}) = P(\theta, \Phi) = P(\theta)\,P(\Phi)\]

\(\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right]\)

We need to expand the term \(\log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right]\) as a required step for solving (9) and (10). It is important to recognize, that this term is log of the posterior plus log of the constant evidence: \(\int P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) d\boldsymbol{w}\). Using (12), (15), (13) and (14), and adding all constant terms to one term, we obtain:

(16)\[\begin{split}\begin{split} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] &= \log \left[ P(\boldsymbol{y}|\theta,\Phi) \right] + \log \left[ P(\theta) \right] + \log \left[ P(\Phi) \right] \\ &=\frac{N}{2}\log\left[\Phi\right]-\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \\ &-\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \\ &+(c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \color{blue}{\mathrm{const}\lbrace \boldsymbol{\theta},\Phi \rbrace} \, . \end{split}\end{split}\]

This equation can be regrouped in the following way that is more handy for the next steps.

(17)\[\begin{split}\begin{split} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] &= -\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \quad ; \\ &f_{\theta}=: -\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \\ &f_{\Phi}=:(c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] \, . \end{split}\end{split}\]

The constant term that is expressed below will not influence the solution of (9) and (10), however, it will be used for the computation of the free energy.

\[\color{blue}{\mathrm{const} \lbrace \boldsymbol{\theta},\Phi \rbrace = -\frac{N+N_{\theta}}{2}\log[2\pi] -\frac{1}{2}\log[\color{red}{|C_e|}] + \frac{1}{2} \log[\mathrm{det}(\Lambda_0)] + \log[1/\Gamma(c_0)]-c_0\log[s_0]} \, .\]

Approximated posteriors

To become able to derive suitable update equations, which will follow up, we select the approximated posteriors in exactly the same form as the priors, i.e.:

(18)\[q_{\theta} = \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) \quad \rightarrow \quad q(\theta)=\sqrt{ \frac{|\Lambda|}{(2\pi)^N_{\theta}} } \; e^{-0.5\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m})} \; ,\]

(19)\[q_{\Phi} = \Gamma(\Phi;\,c,s) \quad \rightarrow \quad q(\Phi)=\frac{1}{\Gamma(c)}\frac{\Phi^{c-1}}{s^{c}}e^{-\frac{\Phi}{s}} \; ,\]

which introduces \(\boldsymbol{m},\,\Lambda,\,c,\,s\) as the deterministic latent parameters to be identified.

Taylor expansion of the model error

Another simplification made in the VB is to approximate the model error defined in (6) by means of a first-order Taylor expansion around the posterior mean \(\boldsymbol{m}\):

(20)\[\boldsymbol{e} = \boldsymbol{e}(\theta) =: \boldsymbol{y} - g(\theta) \approx \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \, ,\]

with:

(21)\[\boldsymbol{k} =: \boldsymbol{e}(\boldsymbol{m}) = \boldsymbol{y} - g(\boldsymbol{m}) \, ,\]

and

(22)\[\boldsymbol{J} =: \frac{dg(\theta)}{d\theta}|_{\theta=\boldsymbol{m}} = -\frac{d\boldsymbol{e}(\theta)}{d\theta}|_{\theta=\boldsymbol{m}}\, .\]

Update equations

Update of \(\theta\)

Considering (17) and (18), we start with separately expanding both sides of (9).

(23)\[\begin{split}\begin{split} \log \left[ q_{\theta} \right] &= -\frac{1}{2}\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m}) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} \\ &=-\frac{1}{2} \left( \theta^T\Lambda\theta + \theta^T\Lambda\boldsymbol{m} + \boldsymbol{m}^T\Lambda\theta \right) + \mathrm{const_2}\left\{\boldsymbol{\theta} \right\} \\ &; \quad \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\} = \frac{1}{2}\log\left[\,|\Lambda|\,\right] -\frac{N_{\theta}}{2}\log[2\pi]} \, . \end{split}\end{split}\]

(24)\[\begin{split}\begin{split} \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi &= -\frac{1}{2}\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \underbrace{\int \Phi q_{\Phi} d\Phi}_{=sc} + f_{\theta} \underbrace{\int q_{\Phi} d\Phi}_{=1} + \int \left( f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) q_{\Phi} d\Phi \\ &= -\frac{1}{2}sc\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + \mbox{const}\left\{\theta\right\} \, , \end{split}\end{split}\]

where the two identities \(\int \Phi q_{\Phi} d\Phi=sc\) and \(\int q_{\Phi} d\Phi=1\) have been used. Equation (24) can be further simplified according to (20) and \(f_{\theta}\) defined in (17):

(25)\[\begin{split}\begin{split} \int q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\Phi =&-\frac{1}{2}sc\,\left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) ^T.\color{red}{C_e^{-1}}. \left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) \\ &-\frac{1}{2}\,(\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) + \mbox{const}\left\{\theta\right\} \, . \\ =& -\frac{1}{2} \left( \theta^T.h_1.\theta - \theta^T.h_2 - h_3.\theta \right) + \mbox{const}\left\{\theta\right\} \quad ; \\ h_1 =& \Lambda_0+sc\,\boldsymbol{J}^T\color{red}{C_e^{-1}}\boldsymbol{J} \\ h_2=& \Lambda_0\boldsymbol{m}_0 + sc\boldsymbol{J}^T\color{red}{C_e^{-1}} \left( \boldsymbol{k} + \boldsymbol{J}\boldsymbol{m} \right) \\ h_3=& \boldsymbol{m}_0^T\Lambda_0 + sc \left( \boldsymbol{k}^T+\boldsymbol{m}^T\boldsymbol{J}^T \right) \color{red}{C_e^{-1}}\boldsymbol{J} \end{split}\end{split}\]

We assume that the matrices \(\Lambda_0\) and \(\color{red}{C_e^{-1}}\) are symmetric, i.e. \(h_2^T=h_3\). Then, by comparing both sides of (9) from (23) and (25), we obtain the following equations for the parameters (\(\boldsymbol{m}, \,\Lambda\)) of the posterior \(\theta\).

(26)\[\begin{split}\boxed{ \begin{split} \Lambda =& \Lambda_0+sc\,\boldsymbol{J}^T\color{red}{C_e^{-1}}\boldsymbol{J} \\ \Lambda\boldsymbol{m} =& \Lambda_0\boldsymbol{m}_0 + sc\boldsymbol{J}^T\color{red}{C_e^{-1}} \left( \boldsymbol{k} + \boldsymbol{J}\boldsymbol{m} \right) \end{split} }\end{split}\]

Update of \(\Phi\)

A similar procedure is performed for expanding both sides of (10) using (17), (19) and (20).

(27)\[\begin{split}\begin{split} \log \left[ q_{\Phi} \right] &= \log \left[ \frac{1}{\Gamma(c)}\frac{\Phi^{c-1}}{s^{c}}e^{-\frac{\Phi}{s}} \right] = -\log\left[ \Gamma(c)\right] -c\log\left[s\right] + (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} \\ &= (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} + \color{blue}{\mathrm{const}\left\{\Phi \right\}} \\ &; \quad \color{blue}{\mathrm{const}\left\{\Phi \right\} = -\log\left[ \Gamma(c)\right] -c\log\left[s\right]} \; . \end{split}\end{split}\]

(28)\[\begin{split}\begin{split} \int q_{\theta} \log \left[ P(\boldsymbol{y}|\boldsymbol{w}) P(\boldsymbol{w}) \right] d\theta =& -\frac{1}{2}\Phi \int \boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \, q_{\theta} d\theta + f_{\Phi} \underbrace{\int q_{\theta} d\theta}_{=1} + \int \left( f_{\theta} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) q_{\theta} d\theta \, . \\ =& -\frac{1}{2}\Phi \int \left ( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) ^T.\color{red}{C_e^{-1}}. \left( \boldsymbol{k} - \boldsymbol{J} \left( \theta - \boldsymbol{m} \right) \right) \, q_{\theta} d\theta + f_{\Phi} + \mbox{const}\left\{\Phi\right\} \\ =& -\frac{1}{2}\Phi \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} \underbrace{\int q_{\theta} d\theta}_{=1} + \int \left( \theta - \boldsymbol{m} \right)^T.\left( \boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right).\left( \theta - \boldsymbol{m} \right) q_{\theta} d\theta \right) \\ & -\frac{1}{2}\Phi \left( -\boldsymbol{k}^T.\color{red}{C_e^{-1}}.J. \cancelto{0}{\int \left( \theta - \boldsymbol{m} \right) q_{\theta} d\theta} - \cancelto{0}{\int \left( \theta - \boldsymbol{m} \right)^T q_{\theta} d\theta} \, .\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} \right) \\ &+ (c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] + \mbox{const}\left\{\Phi\right\} \\ =& -\frac{1}{2}\Phi \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) - \frac{\Phi}{s_0} \\ &+ \log\left[\Phi\right]\left(c_0-1+\frac{N}{2}\right) + \mbox{const}\left\{\Phi\right\} \, . \end{split}\end{split}\]

Note that, in the last step of (28) the following identity (trace trick) has been used in view of \(q_{\theta}\) defined in (18).

(29)\[\begin{split}\begin{split} \boldsymbol{E}_{\mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1})}\left( \left( \theta - \boldsymbol{m} \right)^TA\left( \theta - \boldsymbol{m} \right) \right) &=: \int \left( \theta - \boldsymbol{m} \right)^TA\left( \theta - \boldsymbol{m} \right) \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) d\theta \\ &= \mbox{tr}\left( \Lambda^{-1}A\right) \quad ; \; \mbox{for any constant matrix } A \end{split}\end{split}\]

Comparing both sides of (10) simplified in (27) and (28), we obtain the following equations for the posterior noise parameters (\(c, \,s\)).

(30)\[\begin{split}\boxed{ \begin{split} c =& c_0+\frac{N}{2} \\ \frac{1}{s} =& \frac{1}{s_0} + \frac{1}{2}\left ( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) \end{split} }\end{split}\]

Simplification of the free energy

We look into maximizing the free energy defined in (1) via solving the set of algebraic equations obtained in the previous section. This optimization problem is likely to be ill-posed for highly nonlinear forward simulators (\(g\)); i.e. to have several local maximums. To cope with this situation, it is useful to compute and keep monitoring the free energy. Considering \(q(\boldsymbol{w})=q_{\theta}\,q_{\Phi}\) and \(d\boldsymbol{w}=d\theta\,d\Phi\), we start expanding (1):

(31)\[\begin{split}\begin{split} F =& \int q_{\theta}\,q_{\Phi} \log \left[ \frac{P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w})}{q_{\theta}\,q_{\Phi}} \right] d\theta\,d\Phi \\ =& \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w}) \right] d\theta\,d\Phi}_{=:F_1} - \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ q_{\theta} \right] d\theta\,d\Phi}_{=:F_2} - \underbrace{\int q_{\theta}\,q_{\Phi} \log \left[ q_{\Phi} \right] d\theta\,d\Phi}_{=:F_3} \,. \end{split}\end{split}\]

The expansion of the terms \(F_1\), \(F_2\), \(F_3\) follow.

(32)\[\begin{split}\begin{split} F_1 =& \int q_{\theta}\,q_{\Phi} \log \left[ P(\boldsymbol{y}|\boldsymbol{w})\,P(\boldsymbol{w}) \right] d\theta\,d\Phi \\ =& \int q_{\theta}\,q_{\Phi} \left ( -\frac{1}{2}\Phi\,\boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} + f_{\theta} + f_{\Phi} + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \right) d\theta\,d\Phi \\ =& -\frac{1}{2} \int \boldsymbol{e}^T.\color{red}{C_e^{-1}}.\boldsymbol{e} \, q_{\theta}d\theta \underbrace{\int \Phi\,q_{\Phi}d\Phi}_{=sc} -\frac{1}{2} \int \left( (\theta-\boldsymbol{m}_0)^T.\Lambda_0.(\theta-\boldsymbol{m}_0) \right) q_{\theta}d\theta \underbrace{\int q_{\Phi}d\Phi}_{=1} \\ & + \underbrace{\int q_{\theta}d\theta}_{=1} \int \left( (c_0-1) \log \left[ \Phi \right] - \frac{\Phi}{s_0} + \frac{N}{2}\log\left[\Phi\right] \right)d\Phi + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \\ =& -\frac{1}{2} sc \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) -\frac{sc}{s_0} +\left( \frac{N}{2} + c_0 -1 \right) \left( \log \left[s\right] + \psi(c) \right) \\ & -\frac{1}{2}\left( (\boldsymbol{m}-\boldsymbol{m}_0)^T.\Lambda_0.(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}\Lambda_0 \right) \right) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta},\Phi \right\}} \, . \end{split}\end{split}\]

The first term in the above simplified expression has been rewritten according to the simplication already done through the derivation of (28). We have furthermore applied the identity \(\int \log\left[ \Phi\right] d\Phi=\log\left[s\right] + \psi(c)\) with \(\psi\) being the di-gamma function defined as \(\psi(x)=:\frac{d}{dx}\ln\Gamma(x)=\frac{\Gamma'(x)}{\Gamma(x)}\), as well as the following identity that is similar to (29).

(33)\[\begin{split}\begin{split} \boldsymbol{E}_{\mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1})}\left[ \left( \theta - \boldsymbol{m}_0 \right)^TA\left( \theta - \boldsymbol{m}_0 \right) \right] &=: \int \left( \theta - \boldsymbol{m}_0 \right)^TA\left( \theta - \boldsymbol{m}_0 \right) \mathcal{MVN}(\theta;\,\boldsymbol{m}, \Lambda^{-1}) d\theta \\ &= (\boldsymbol{m}-\boldsymbol{m}_0)^TA(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}A\right) \\ & \quad ; \; \mbox{for any constant matrix } A \end{split}\end{split}\]

We continue with expanding \(F_2\) and \(F_3\) by the help of (23), (29) and (27):

(34)\[\begin{split}\begin{split} F_2 &= \int q_{\theta}\,q_{\Phi} \log \left[ q_{\theta} \right] d\theta\,d\Phi = \int \log \left[ q_{\theta} \right] q_{\theta}d\theta \int q_{\Phi} d\Phi \\ &= \int \left( -\frac{1}{2}\,(\theta-\boldsymbol{m})^T.\Lambda.(\theta-\boldsymbol{m}) + \color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} \right) q_{\theta}d\theta \\ &= -\frac{1}{2}\mathrm{tr}(\Lambda^{-1}\Lambda) +\color{blue}{\mathrm{const}\left\{\boldsymbol{\theta} \right\}} = -\frac{N_{\theta}}{2} + \color{blue}{\frac{1}{2}\log\left[\,|\Lambda|\,\right] -\frac{N_{\theta}}{2}\log[2\pi]} \,. \end{split}\end{split}\]

(35)\[\begin{split}\begin{split} F_3 &= \int q_{\theta}\,q_{\Phi} \log \left[ q_{\Phi} \right] d\theta\,d\Phi = \int q_{\theta} d\theta \int \log \left[ q_{\Phi} \right] q_{\Phi}d\Phi \\ &= \int \left( (c-1)\log\left[ \Phi \right]\ -\frac{\Phi}{s} + \color{blue}{\mathrm{const}\left\{\Phi \right\}} \right) q_{\Phi}d\Phi \\ &= (c-1)\left( \log \left[s\right] + \psi(c) \right) - c \color{blue}{-\log\left[ \Gamma(c)\right] -c\log\left[s\right]} \,. \end{split}\end{split}\]

Substituting (32), (34) and (35) into (31) results in:

(36)\[\begin{split}\boxed{ \begin{split} F =& -\frac{1}{2} sc \left( \boldsymbol{k}^T.\color{red}{C_e^{-1}}.\boldsymbol{k} +\mbox{tr}\left( \Lambda^{-1}\boldsymbol{J}^T.\color{red}{C_e^{-1}}.\boldsymbol{J} \right) \right) -\frac{sc}{s_0} -\frac{1}{2}\left( (\boldsymbol{m}-\boldsymbol{m}_0)^T.\Lambda_0.(\boldsymbol{m}-\boldsymbol{m}_0) + \mbox{tr}\left( \Lambda^{-1}\Lambda_0 \right) \right) \\ & + \cancelto{0}{\left( \frac{N}{2} + c_0 -c \right)} \left( \log \left[s\right] + \psi(c) \right) + c + \frac{N_{\theta}}{2} \color{blue}{ - \frac{1}{2}\log\left[\,|\Lambda|\,\right] + \cancel{\frac{N_{\theta}}{2}\log[2\pi]} + \log\left[ \Gamma(c)\right] + c\log\left[s\right] } \\ & \color{blue}{ -\frac{N+\cancel{N_{\theta}}}{2}\log[2\pi] -\frac{1}{2}\log[\color{red}{|C_e|}] + \frac{1}{2} \log[\mathrm{det}(\Lambda_0)] + \log[1/\Gamma(c_0)]-c_0\log[s_0] } \, . \end{split} }\end{split}\]

Notice the vanishing term provided that the noise parameter \(c\) is already updated by the first identity in (30).