{% raw %}
Title: Create a Markdown Blog Post Integrating Research Details and a Featured Paper
====================================================================================
This task involves generating a Markdown file (ready for a GitHub-served Jekyll site) that integrates our research details with a featured research paper. The output must follow the exact format and conventions described below.
====================================================================================
Output Format (Markdown):
------------------------------------------------------------------------------------
---
layout: post
title: "Compromise-free Bayesian neural networks"
date: 2020-04-25
categories: papers
---




Content generated by [gemini-2.5-pro](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/2020-04-25-2004.12211.txt).
Image generated by [imagen-3.0-generate-002](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/2020-04-25-2004.12211.txt).
------------------------------------------------------------------------------------
====================================================================================
Please adhere strictly to the following instructions:
====================================================================================
Section 1: Content Creation Instructions
====================================================================================
1. **Generate the Page Body:**
- Write a well-composed, engaging narrative that is suitable for a scholarly audience interested in advanced AI and astrophysics.
- Ensure the narrative is original and reflects the tone, style, and content of the "Homepage Content" block (provided below), but do not reuse its text verbatim.
- Use bullet points, subheadings, or other formatting to enhance readability.
2. **Highlight Key Research Details:**
- Emphasize the contributions and impact of the paper, focusing on its methodology, significance, and context within current research.
- Specifically highlight the lead author, Kamran Javid. When referencing any author, use Markdown links from the Author Information block (prefer academic or GitHub links over social media).
3. **Integrate Data from Multiple Sources:**
- Seamlessly weave information from the following:
- **Paper Metadata (YAML):** Essential details including the title and authors.
- **Paper Source (TeX):** Technical content from the paper.
- **Bibliographic Information (bbl):** Extract bibliographic references.
- **Author Information (YAML):** Profile details for constructing Markdown links.
- Merge insights from the Paper Metadata, TeX source, Bibliographic Information, and Author Information blocks into a coherent narrative—do not treat these as separate or isolated pieces.
- Insert the generated narrative between the opening and closing HTML comments in the output template above (between the YAML front matter and the attribution lines).
4. **Generate Bibliographic References** (an illustrative conversion sketch follows this instruction list):
- Review the Bibliographic Information block carefully.
- For each reference that includes a DOI or arXiv identifier:
- For DOIs, generate a link formatted as:
[10.1234/xyz](https://doi.org/10.1234/xyz)
- For arXiv entries, generate a link formatted as:
[2103.12345](https://arxiv.org/abs/2103.12345)
- **Important:** Do not use any LaTeX citation commands (e.g., `\cite{...}`). Every reference must be rendered directly as a Markdown link. For example, instead of `\cite{mycitation}`, output `[mycitation](https://doi.org/mycitation)`
- **Incorrect:** `\cite{10.1234/xyz}`
- **Correct:** `[10.1234/xyz](https://doi.org/10.1234/xyz)`
- Ensure that at least three (3) of the most relevant references are naturally integrated into the narrative.
- Ensure that the link to the Featured paper [2004.12211](https://arxiv.org/abs/2004.12211) is included in the first sentence.
5. **Final Formatting Requirements:**
- The output must be plain Markdown; do not wrap it in Markdown code fences.
- Preserve the YAML front matter exactly as provided.
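For reference, here is a minimal, illustrative sketch of the DOI/arXiv link conversion described in instruction 4. It is not part of the required output; the function name and regular expressions are hypothetical, and real `.bbl` entries may need more careful parsing:

```python
import re

def identifiers_to_markdown(entry: str) -> list[str]:
    """Turn DOI / arXiv identifiers found in a .bbl entry into Markdown links."""
    links = []
    # DOIs look like 10.<registrant>/<suffix>, e.g. 10.1234/xyz
    for doi in re.findall(r"\b(10\.\d{4,9}/[^\s,;]+)", entry):
        links.append(f"[{doi}](https://doi.org/{doi})")
    # Modern arXiv identifiers look like 2103.12345, optionally with a version
    for arxiv_id in re.findall(r"\b(\d{4}\.\d{4,5})(?:v\d+)?\b", entry):
        links.append(f"[{arxiv_id}](https://arxiv.org/abs/{arxiv_id})")
    return links

# identifiers_to_markdown("arXiv:2004.12211v3")
#   -> ['[2004.12211](https://arxiv.org/abs/2004.12211)']
```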
====================================================================================
Section 2: Provided Data for Integration
====================================================================================
1. **Homepage Content (Tone and Style Reference):**
```markdown
---
layout: home
---

The Handley Research Group stands at the forefront of cosmological exploration, pioneering novel approaches that fuse fundamental physics with the transformative power of artificial intelligence. We are a dynamic team of researchers, including PhD students, postdoctoral fellows, and project students, based at the University of Cambridge. Our mission is to unravel the mysteries of the Universe, from its earliest moments to its present-day structure and ultimate fate. We tackle fundamental questions in cosmology and astrophysics, with a particular focus on leveraging advanced Bayesian statistical methods and AI to push the frontiers of scientific discovery. Our research spans a wide array of topics, including the [primordial Universe](https://arxiv.org/abs/1907.08524), [inflation](https://arxiv.org/abs/1807.06211), the nature of [dark energy](https://arxiv.org/abs/2503.08658) and [dark matter](https://arxiv.org/abs/2405.17548), [21-cm cosmology](https://arxiv.org/abs/2210.07409), the [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1807.06209), and [gravitational wave astrophysics](https://arxiv.org/abs/2411.17663).
### Our Research Approach: Innovation at the Intersection of Physics and AI
At The Handley Research Group, we develop and apply cutting-edge computational techniques to analyze complex astronomical datasets. Our work is characterized by a deep commitment to principled [Bayesian inference](https://arxiv.org/abs/2205.15570) and the innovative application of [artificial intelligence (AI) and machine learning (ML)](https://arxiv.org/abs/2504.10230).
**Key Research Themes:**
* **Cosmology:** We investigate the early Universe, including [quantum initial conditions for inflation](https://arxiv.org/abs/2002.07042) and the generation of [primordial power spectra](https://arxiv.org/abs/2112.07547). We explore the enigmatic nature of [dark energy, using methods like non-parametric reconstructions](https://arxiv.org/abs/2503.08658), and search for new insights into [dark matter](https://arxiv.org/abs/2405.17548). A significant portion of our efforts is dedicated to [21-cm cosmology](https://arxiv.org/abs/2104.04336), aiming to detect faint signals from the Cosmic Dawn and the Epoch of Reionization.
* **Gravitational Wave Astrophysics:** We develop methods for [analyzing gravitational wave signals](https://arxiv.org/abs/2411.17663), extracting information about extreme astrophysical events and fundamental physics.
* **Bayesian Methods & AI for Physical Sciences:** A core component of our research is the development of novel statistical and AI-driven methodologies. This includes advancing [nested sampling techniques](https://arxiv.org/abs/1506.00171) (e.g., [PolyChord](https://arxiv.org/abs/1506.00171), [dynamic nested sampling](https://arxiv.org/abs/1704.03459), and [accelerated nested sampling with $\beta$-flows](https://arxiv.org/abs/2411.17663)), creating powerful [simulation-based inference (SBI) frameworks](https://arxiv.org/abs/2504.10230), and employing [machine learning for tasks such as radiometer calibration](https://arxiv.org/abs/2504.16791), [cosmological emulation](https://arxiv.org/abs/2503.13263), and [mitigating radio frequency interference](https://arxiv.org/abs/2211.15448). We also explore the potential of [foundation models for scientific discovery](https://arxiv.org/abs/2401.00096).
**Technical Contributions:**
Our group has a strong track record of developing widely-used scientific software. Notable examples include:
* [**PolyChord**](https://arxiv.org/abs/1506.00171): A next-generation nested sampling algorithm for Bayesian computation.
* [**anesthetic**](https://arxiv.org/abs/1905.04768): A Python package for processing and visualizing nested sampling runs.
* [**GLOBALEMU**](https://arxiv.org/abs/2104.04336): An emulator for the sky-averaged 21-cm signal.
* [**maxsmooth**](https://arxiv.org/abs/2007.14970): A tool for rapid maximally smooth function fitting.
* [**margarine**](https://arxiv.org/abs/2205.12841): For marginal Bayesian statistics using normalizing flows and KDEs.
* [**fgivenx**](https://arxiv.org/abs/1908.01711): A package for functional posterior plotting.
* [**nestcheck**](https://arxiv.org/abs/1804.06406): Diagnostic tests for nested sampling calculations.
### Impact and Discoveries
Our research has led to significant advancements in cosmological data analysis and yielded new insights into the Universe. Key achievements include:
* Pioneering the development and application of advanced Bayesian inference tools, such as [PolyChord](https://arxiv.org/abs/1506.00171), which has become a cornerstone for cosmological parameter estimation and model comparison globally.
* Making significant contributions to the analysis of major cosmological datasets, including the [Planck mission](https://arxiv.org/abs/1807.06209), providing some of the tightest constraints on cosmological parameters and models of [inflation](https://arxiv.org/abs/1807.06211).
* Developing novel AI-driven approaches for astrophysical challenges, such as using [machine learning for radiometer calibration in 21-cm experiments](https://arxiv.org/abs/2504.16791) and [simulation-based inference for extracting cosmological information from galaxy clusters](https://arxiv.org/abs/2504.10230).
* Probing the nature of dark energy through innovative [non-parametric reconstructions of its equation of state](https://arxiv.org/abs/2503.08658) from combined datasets.
* Advancing our understanding of the early Universe through detailed studies of [21-cm signals from the Cosmic Dawn and Epoch of Reionization](https://arxiv.org/abs/2301.03298), including the development of sophisticated foreground modelling techniques and emulators like [GLOBALEMU](https://arxiv.org/abs/2104.04336).
* Developing new statistical methods for quantifying tensions between cosmological datasets ([Quantifying tensions in cosmological parameters: Interpreting the DES evidence ratio](https://arxiv.org/abs/1902.04029)) and for robust Bayesian model selection ([Bayesian model selection without evidences: application to the dark energy equation-of-state](https://arxiv.org/abs/1506.09024)).
* Exploring fundamental physics questions such as potential [parity violation in the Large-Scale Structure using machine learning](https://arxiv.org/abs/2410.16030).
### Charting the Future: AI-Powered Cosmological Discovery
The Handley Research Group is poised to lead a new era of cosmological analysis, driven by the explosive growth in data from next-generation observatories and transformative advances in artificial intelligence. Our future ambitions are centred on harnessing these capabilities to address the most pressing questions in fundamental physics.
**Strategic Research Pillars:**
* **Next-Generation Simulation-Based Inference (SBI):** We are developing advanced SBI frameworks to move beyond traditional likelihood-based analyses. This involves creating sophisticated codes for simulating [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1908.00906) and [Baryon Acoustic Oscillation (BAO)](https://arxiv.org/abs/1607.00270) datasets from surveys like DESI and 4MOST, incorporating realistic astrophysical effects and systematic uncertainties. Our AI initiatives in this area focus on developing and implementing cutting-edge SBI algorithms, particularly [neural ratio estimation (NRE) methods](https://arxiv.org/abs/2407.15478), to enable robust and scalable inference from these complex simulations.
* **Probing Fundamental Physics:** Our enhanced analytical toolkit will be deployed to test the standard cosmological model ($\Lambda$CDM) with unprecedented precision and to explore [extensions to Einstein's General Relativity](https://arxiv.org/abs/2006.03581). We aim to constrain a wide range of theoretical models, from modified gravity to the nature of [dark matter](https://arxiv.org/abs/2106.02056) and [dark energy](https://arxiv.org/abs/1701.08165). This includes leveraging data from upcoming [gravitational wave observatories](https://arxiv.org/abs/1803.10210) like LISA, alongside CMB and large-scale structure surveys from facilities such as Euclid and JWST.
* **Synergies with Particle Physics:** We will continue to strengthen the connection between cosmology and particle physics by expanding the [GAMBIT framework](https://arxiv.org/abs/2009.03286) to interface with our new SBI tools. This will facilitate joint analyses of cosmological and particle physics data, providing a holistic approach to understanding the Universe's fundamental constituents.
* **AI-Driven Theoretical Exploration:** We are pioneering the use of AI, including [large language models and symbolic computation](https://arxiv.org/abs/2401.00096), to automate and accelerate the process of theoretical model building and testing. This innovative approach will allow us to explore a broader landscape of physical theories and derive new constraints from diverse astrophysical datasets, such as those from GAIA.
Our overarching goal is to remain at the forefront of scientific discovery by integrating the latest AI advancements into every stage of our research, from theoretical modeling to data analysis and interpretation. We are excited by the prospect of using these powerful new tools to unlock the secrets of the cosmos.
Content generated by [gemini-2.5-pro-preview-05-06](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/index.txt).
Image generated by [imagen-3.0-generate-002](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/index.txt).
```
2. **Paper Metadata:**
```yaml
!!python/object/new:feedparser.util.FeedParserDict
dictitems:
  id: http://arxiv.org/abs/2004.12211v3
  guidislink: true
  link: http://arxiv.org/abs/2004.12211v3
  updated: '2020-06-13T12:03:28Z'
  updated_parsed: !!python/object/apply:time.struct_time
    - !!python/tuple
      - 2020
      - 6
      - 13
      - 12
      - 3
      - 28
      - 5
      - 165
      - 0
    - tm_zone: null
      tm_gmtoff: null
  published: '2020-04-25T19:12:56Z'
  published_parsed: !!python/object/apply:time.struct_time
    - !!python/tuple
      - 2020
      - 4
      - 25
      - 19
      - 12
      - 56
      - 5
      - 116
      - 0
    - tm_zone: null
      tm_gmtoff: null
  title: Compromise-free Bayesian neural networks
  title_detail: !!python/object/new:feedparser.util.FeedParserDict
    dictitems:
      type: text/plain
      language: null
      base: ''
      value: Compromise-free Bayesian neural networks
  summary: 'We conduct a thorough analysis of the relationship between the out-of-sample
    performance and the Bayesian evidence (marginal likelihood) of Bayesian neural
    networks (BNNs), as well as looking at the performance of ensembles of BNNs,
    both using the Boston housing dataset. Using the state-of-the-art in nested
    sampling, we numerically sample the full (non-Gaussian and multimodal) network
    posterior and obtain numerical estimates of the Bayesian evidence, considering
    network models with up to 156 trainable parameters. The networks have between
    zero and four hidden layers, either $\tanh$ or $ReLU$ activation functions, and
    with and without hierarchical priors. The ensembles of BNNs are obtained by
    determining the posterior distribution over networks, from the posterior
    samples of individual BNNs re-weighted by the associated Bayesian evidence
    values. There is good correlation between out-of-sample performance and
    evidence, as well as a remarkable symmetry between the evidence versus model
    size and out-of-sample performance versus model size planes. Networks with
    $ReLU$ activation functions have consistently higher evidences than those with
    $\tanh$ functions, and this is reflected in their out-of-sample performance.
    Ensembling over architectures acts to further improve performance relative to
    the individual BNNs.'
  summary_detail: !!python/object/new:feedparser.util.FeedParserDict
    dictitems:
      type: text/plain
      language: null
      base: ''
      value: 'We conduct a thorough analysis of the relationship between the out-of-sample
        performance and the Bayesian evidence (marginal likelihood) of Bayesian neural
        networks (BNNs), as well as looking at the performance of ensembles of BNNs,
        both using the Boston housing dataset. Using the state-of-the-art in nested
        sampling, we numerically sample the full (non-Gaussian and multimodal) network
        posterior and obtain numerical estimates of the Bayesian evidence, considering
        network models with up to 156 trainable parameters. The networks have between
        zero and four hidden layers, either $\tanh$ or $ReLU$ activation functions,
        and with and without hierarchical priors. The ensembles of BNNs are obtained
        by determining the posterior distribution over networks, from the posterior
        samples of individual BNNs re-weighted by the associated Bayesian evidence
        values. There is good correlation between out-of-sample performance and
        evidence, as well as a remarkable symmetry between the evidence versus model
        size and out-of-sample performance versus model size planes. Networks with
        $ReLU$ activation functions have consistently higher evidences than those
        with $\tanh$ functions, and this is reflected in their out-of-sample performance.
        Ensembling over architectures acts to further improve performance relative
        to the individual BNNs.'
  authors:
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        name: Kamran Javid
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        name: Will Handley
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        name: Mike Hobson
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        name: Anthony Lasenby
  author_detail: !!python/object/new:feedparser.util.FeedParserDict
    dictitems:
      name: Anthony Lasenby
  author: Anthony Lasenby
  arxiv_comment: "https://github.com/PolyChord/PolyChordLite;\n https://github.com/SuperKam91/bnn"
  links:
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        href: http://arxiv.org/abs/2004.12211v3
        rel: alternate
        type: text/html
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        title: pdf
        href: http://arxiv.org/pdf/2004.12211v3
        rel: related
        type: application/pdf
  arxiv_primary_category:
    term: stat.ML
    scheme: http://arxiv.org/schemas/atom
  tags:
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        term: stat.ML
        scheme: http://arxiv.org/schemas/atom
        label: null
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        term: cs.LG
        scheme: http://arxiv.org/schemas/atom
        label: null
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        term: stat.AP
        scheme: http://arxiv.org/schemas/atom
        label: null
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        term: stat.CO
        scheme: http://arxiv.org/schemas/atom
        label: null
    - !!python/object/new:feedparser.util.FeedParserDict
      dictitems:
        term: stat.ME
        scheme: http://arxiv.org/schemas/atom
        label: null
```
3. **Paper Source (TeX):**
```tex
\documentclass{article}
\PassOptionsToPackage{numbers,sort,compress}{natbib}
% \usepackage{neurips_2020} %submission
%\usepackage[final]{neurips_2020} %camera ready
\usepackage[preprint]{neurips_2020} %arXiv
\usepackage[utf8]{inputenc} % allow utf-8 input
\usepackage[T1]{fontenc} % use 8-bit T1 fonts
\usepackage{hyperref} % hyperlinks
\usepackage{url} % simple URL typesetting
\usepackage{booktabs} % professional-quality tables
\usepackage{amsfonts} % blackboard math symbols
\usepackage{nicefrac} % compact symbols for 1/2, etc.
\usepackage{microtype} % microtypography
% Our packages
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{longtable}
% \usepackage{xr}
% \externaldocument{supplementary_materials}
\usepackage{cleveref}
\Crefname{equation}{Equation}{Equations}
\crefname{equation}{Equation}{Equations}
\Crefname{figure}{Figure}{Figures}
\crefname{figure}{Figure}{Figures}
\Crefname{table}{Table}{Tables}
\crefname{table}{Table}{Tables}
\Crefname{section}{Section}{Sections}
\crefname{section}{Section}{Sections}
\Crefname{appendix}{Appendix}{Appendices}
\crefname{appendix}{Appendix}{Appendices}
% Our commands
\renewcommand{\vec}[1]{\boldsymbol{#1}}
\newcommand{\xtr}{\vec{x}_{\mathrm{tr}}}
\newcommand{\xte}{\vec{x}_{\mathrm{te}}}
\newcommand{\ytr}{\vec{y}_{\mathrm{tr}}}
\newcommand{\yte}{\vec{y}_{\mathrm{te}}}
\newcommand{\x}{x}
\newcommand{\f}{f}
\newcommand{\y}{y}
\newcommand{\ypred}{y_{\mathrm{pred}}}
\newcommand{\fofx}{\f(\x)}
\newcommand{\vt}{\vec{\theta}}
\newcommand{\mm}{\mathcal{M}}
\newcommand{\mz}{\mathcal{Z}}
\newcommand{\post}{\mathrm{P}\left(\vt|\xtr, \ytr,\mm \right)}
\newcommand{\postf}{\mathrm{P}\left(\fofx|\xtr, \ytr,\mm \right)}
\newcommand{\p}{\mathcal{P}\left(\vt\right)}
\newcommand{\pf}{\mathcal{P}\left(\fofx\right)}
\newcommand{\like}{\mathcal{L}\left(\vec{\theta}\right)}
\newcommand{\pr}{\pi\left(\vec{\theta}\right)}
\newcommand{\z}{\mathcal{Z}}
\newcommand{\ps}{\mathcal{P}_{i}}
\newcommand{\ns}{n_{\mathrm{s}}}
\newcommand{\nl}{L}
\newcommand{\pcs}{\mathcal{P}_{i,j}}
\newcommand{\ncs}{n_{\mathrm{s}j}}
\newcommand{\yh}{\widehat{\y}}
\newcommand{\yhte}{\widehat{\vec{\y}_{\mathrm{te}}}}
\newcommand{\sigy}{\sigma_{\y}}
\newcommand{\np}{n_{\mathrm{pars}}}
\newcommand{\sigp}{\sigma_{\mathrm{p}}}
\newcommand{\precp}{\tau_{\mathrm{p}}}
\newcommand{\sigl}{\sigma_{\mathrm{l}}}
\newcommand{\precl}{\tau_{\mathrm{l}}}
\newcommand{\nd}{n_{\mathrm{dim}}}
\newcommand{\e}{E}
\newcommand{\ga}{\alpha_i}
\newcommand{\gb}{\beta_i}
\newcommand{\PolyChord}{\texttt{PolyChord}}
\newcommand{\MultiNest}{\texttt{MultiNest}}
\DeclareMathOperator{\relu}{ReLU}
\title{Compromise-free Bayesian neural networks}
\author{%
Kamran Javid, \hfill Will Handley, \hfill Mike Hobson \& \hfill Anthony Lasenby \\
Cavendish Laboratory (Astrophysics Group) \\
University of Cambridge, UK, CB3 9HE \\
\texttt{kj316@cam.ac.uk}, \texttt{wh260@cam.ac.uk}
}
\begin{document}
\maketitle
\begin{abstract}
We conduct a thorough analysis of the relationship between the out-of-sample performance and the Bayesian evidence (marginal likelihood) of Bayesian neural networks (BNNs), as well as looking at the performance of ensembles of BNNs, both using the Boston housing dataset. Using the state-of-the-art in nested sampling, we numerically sample the full (non-Gaussian and multimodal) network posterior and obtain numerical estimates of the Bayesian evidence, considering network models with up to 156 trainable parameters. The networks have between zero and four hidden layers, either $\tanh$ or $\relu$ activation functions, and with and without hierarchical priors. The ensembles of BNNs are obtained by determining the posterior distribution over networks, from the posterior samples of individual BNNs re-weighted by the associated Bayesian evidence values. There is good correlation between out-of-sample performance and evidence, as well as a remarkable symmetry between the evidence versus model size and out-of-sample performance versus model size planes. Networks with $\relu$ activation functions have consistently higher evidences than those with $\tanh$ functions, and this is reflected in their out-of-sample performance. Ensembling over architectures acts to further improve performance relative to the individual BNNs.
\end{abstract}
\section{Introduction}
%TODO can possibly drop this for space
In an age where machine learning models are being applied to scenarios where the associated decisions can have significant consequences, quantifying model uncertainty is becoming more and more crucial \citep{krzywinski2013points, ghahramani2015probabilistic}. Bayesian neural networks (BNNs) are one example of a model which provides its own uncertainty quantification, and have gained popularity in part due to the successes of the conventional backward propagation trained neural networks \citep{rumelhart1985learning}.
BNNs have a history stretching back almost as far as research on traditional neural networks (TNNs), with \citep{denker1987large, tishby1989consistent, buntine1991bayesian, denker1991transforming} laying the foundations for the application of BNNs. Major breakthroughs occurred thanks to the work of \citet{mackay1994bayesian} and his success with BNNs in prediction competitions. Prior to this, \citeauthor{mackay1992bayesian} published several papers \citep{mackay1992bayesian, mackay1992practical, mackay1992evidence} detailing his methods and highlighting several important aspects of BNNs. He trained the networks using quadratic approximations of the posteriors, type II maximum likelihood estimation \citep{rasmussen2003gaussian}, and incorporated low-level hierarchical Bayesian inference into his modelling to learn the prior distribution variances. From this he obtained estimates for the Bayesian evidence (also known as the marginal likelihood), and found a correlation between the evidence and the BNNs' ability to make good predictions on out-of-sample data.
\citet{neal1992bayesian, neal1993bayesian} focused on improving the predictive power of BNNs, by relaxing the Gaussian approximation by instead sampling the posterior using Hamiltonian Markov chain Monte Carlo (MCMC) methods. He also incorporated more complex hierarchical Bayesian modelling into his methods by using Gibbs sampling to sample the variances associated with both the priors and likelihood. He found that predictions with these BNNs consistently outperformed the equivalently sized networks trained using backward propagation techniques. More recently \citep{freitas2000sequential, de2003bayesian, andrieu2003introduction} used reversible jump MCMC and sequential MC methods to train BNNs and do model selection without calculating Bayesian evidences, by parameterising the different networks as a random variable, and sampling from the resultant posterior. In general they found that their methods produce networks which perform well, but were much slower to train than expectation maximisation-based networks. Indeed, methods which are more efficient in the training of networks have gained popularity in the recent years, due to the success of deep learning. Two of the most commonly used methods for training scalable networks are variational inference \citep{hinton1993keeping, barber1998ensemble, graves2011practical} and dropout training \citep{gal2016dropout}. The latter has proved to be particularly popular, due to the fact that dropout BNNs can be trained using standard backward propagation techniques, and can be applied to recurrent and convolutional networks \citep{gal2016theoretically, kendall2017uncertainties}.
In this paper we present what we refer to as {\em compromise-free\/} BNNs as a proof-of-concept idea for training BNNs, conditional on computational resources. As in \citet{higson2018bayesian}, no assumption is made about the functional form of the posteriors, and we numerically sample the full posterior distributions using the nested sampling algorithm \PolyChord{} \citep{handley2015polychord1, handley2015polychord2}\footnote{\url{https://github.com/PolyChord/PolyChordLite}} to train the BNNs. Furthermore we obtain numerical estimates for the Bayesian evidences associated with each BNN, with which we analyse the relationship between the evidence and out-of-sample performance as originally performed approximately in \citet{mackay1992practical}. We consider a wide array of different networks, with between zero and four hidden layers, $\tanh$ or $\relu$ activation functions, and varying complexities of hierarchical priors following \citep{mackay1992practical, neal2012bayesian}, to obtain a thorough understanding of the evidence and out-of-sample performance through various cross sections of the BNN architecture space.
%TODO Could make this shorter
Similarly to \citet{de2003bayesian} we look at the posterior over networks as a form of model selection. However in our case, the posteriors are obtained from the samples of the individual network posteriors re-weighted according to the evidences associated with a given run. We then marginalise over these network posteriors to obtain predictions from ensembles of BNNs. A preliminary analysis along these lines was conducted by \citet{higson2018bayesian}, who also explored using an adaptive method akin to \citet{de2003bayesian}. The adaptive approach has the feature of severely undersampling less preferred models and taking (often substantially) less time than sampling models individually, and will be explored in a future work.
\section{Background}\label{s:background}
\subsection{Neural networks}\label{s:nn}
A fully connected multi-layer perceptron (MLP) neural network is parameterised by network weights $w$ and biases $b$, and can be represented recursively using intermediate variables $z$ as
\begin{equation}
f = z^{(\nl)}, \qquad
z^{(\ell)}_i = g_i^{(\ell)}(b_i^{(\ell)} + \textstyle\sum_{k=1}^{l_i} w_{ik}^{(\ell)} z_{k}^{(\ell-1)}),\qquad
z^{(0)} = x,
\label{e:nn}
\end{equation}
where $g_i^{(\ell)}$ are {\em activation functions}, which we take to be either $\tanh$ ($g(x) = \tanh x $) or $\relu$ ($g(x) = \max\{x,0\}$) in the hidden layers, with a linear activation in the output layer $\ell=L$. Such networks are represented graphically in \cref{f:nn}. The activation functions $g$, number of layers $\nl$, and the number of nodes within each layer $l_i$ together determine the {\em architecture\/} of the network.
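% [Editorial aside, not part of the original source] A quick parameter count
% makes the dimensionalities quoted later concrete: with the 13 Boston housing
% inputs, the zero-hidden-layer network (Bayesian linear regression) has
% $13$ weights $+\,1$ bias $= 14$ parameters, the smallest dimensionality
% considered, while a single hidden layer of 8 nodes has
% $13\times8 + 8 + 8\times1 + 1 = 121$ weights and biases; the noise level and
% any hierarchical prior hyperparameters raise the totals towards the
% maximum of 156.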
The network parameters $\theta=(b,w)$ are trained in a supervised fashion on a set of example input data paired with the corresponding outputs via a misfit function and regularisation term
\begin{equation}\label{e:chi2}
\theta_* = \min\limits_{\theta} \lambda_{m} \chi^2_\mathrm{train}(\theta) + \lambda_{r} R(\theta), \qquad \chi^2_\mathrm{train}(\theta) = \textstyle\sum_{i\in\mathrm{train}} |y^{(i)} - f(x^{(i)};\theta)|^2,
\end{equation}
where $\lambda_{m}$ and $\lambda_{r}$ are hyperparameters dictating the relative weighting of the misfit versus the regularisation in the optimisation.\footnote{It should be noted that in a minimisation context, there is a redundancy in including two regularisation parameters $\lambda_m$ and $\lambda_r$, so one usually sets $\lambda_m=1$ without loss of generality. In a Bayesian context this redundancy is removed, as the posterior is composed of likelihood and prior terms whose widths are each controlled by a separate regularisation parameter. We thus follow \citet{mackay1994bayesian}, retaining both $\lambda_m$ and $\lambda_r$.}
There are two key issues that immediately arise from the traditional approach. First, the network gives no indication of the confidence in its prediction $y_\mathrm{pred} \equiv f(x_\mathrm{new};\theta_*)$ on unseen inputs $x_\mathrm{new}$. It would be preferable if the network were to provide an error bar $y_\mathrm{pred}\pm\sigma_{y, \mathrm{pred}}$ for its confidence, and that this error bar should become larger as the network extrapolates beyond the domain of the original training data. Second, there is little guidance from the formalism above as to how to choose the architecture of the network in \cref{e:nn} and the hyperparameters in \cref{e:chi2}. Networks that are too large may overfit the data, while networks too small/simple will likely underfit. The most common method of finding a `happy medium' is through the use of cross validation \citep{stone1977asymptotics, efron1979computers, li1985stein, snoek2012practical}, but searching the associated hyperparameter space can be time consuming and ad-hoc.%\footnote{One also runs the risk that if one tries too many different networks one brings the testing data ``back into sample'' manually by performing what is effectively an optimisation over networks and testing data.}
\subsection{Bayesian neural networks} \label{s:bnn}
The Bayesian approach to neural networks aims to ameliorate the two difficulties discussed above in \cref{s:nn} by {\em sampling\/} the parameter space rather than {\em optimising\/} over it. Here we consider the misfit function $\chi^2_\mathrm{train}(\theta)$ as part of an independent Gaussian likelihood
\begin{equation}
P(D_\mathrm{train}|\theta,\mathcal{M}) = \prod_{i\in\mathrm{train}} \frac{1}{\sqrt{2\pi}\sigma}\exp{\left(-\frac{(y^{(i)}-f(x^{(i)};\theta))^2}{2\sigma^2}\right)} = \frac{\exp\left(-\frac{\chi^2_\mathrm{train}(\theta)}{2\sigma^2}\right) }{{(\sqrt{2\pi}\sigma)}^{N_\mathrm{train}}},
\label{e:like}
\end{equation}
where $D_\mathrm{train}$ are the training data, $\sigma^2$ is a misfit variance (which plays a similar role to the parameter $\lambda_m$ in \cref{e:chi2}) and $\mathcal{M}$ is the network architecture (or Bayesian model). The likelihood $\mathcal{L}\equiv P(D_\mathrm{train}|\theta,\mathcal{M})$ can be related to a posterior $\mathcal{P}$ on the parameters $\theta$ using a prior measure on the network parameter space $P(\theta|\mathcal{M})\equiv\pi$ (which draws parallels with $\lambda_r R(\theta)$~\citep{higson2018bayesian}) via Bayes theorem
\begin{equation}
\mathcal{P}\equiv P(\theta|D_\mathrm{train},\mathcal{M}) = \frac{P(D_\mathrm{train}|\theta,\mathcal{M}) P(\theta|\mathcal{M})}{P(D_\mathrm{train}|\mathcal{M})} \equiv \frac{\mathcal{L}\pi}{\mathcal{Z}}, \qquad
\mathcal{Z} = \int \mathcal{L}\pi \mathrm{d}\theta.
\label{e:bayes}
\end{equation}
The process of determining the posterior is termed {\em parameter estimation}.
Instead of having a single best-fit set of network parameters $\theta_*$, one now has a distribution over $\theta$ (\cref{f:posteriors}).
This quantification of error in the posterior may be forwarded onto a distribution on the predictions $y_\mathrm{pred}$, from unseen inputs $x_\mathrm{new}$, which may be summarised by a mean $\hat{y}_\mathrm{pred}$ and an error bar $\sigma_{y,\mathrm{pred}}^2$.
The other critical quantity $\mathcal{Z}$ in \cref{e:bayes} is termed the Bayesian evidence (also known as the marginal likelihood), and is computed from the likelihood and prior as a normalisation constant. The evidence is critical in the upper level of Bayesian inference, termed {\em model comparison}, whereby one's confidence in the network as supported by the data is given by
\begin{equation}
P(\mathcal{M}|D_\mathrm{train}) = \frac{P(D_\mathrm{train}|\mathcal{M})P(\mathcal{M})}{P(D_\mathrm{train})} = \frac{\mathcal{Z}_\mathcal{M}P(\mathcal{M})}{\sum_m \mathcal{Z}_m P(m)}.
\end{equation}
In the above posterior over models, $m$ is a categorical variable ranging over all architectures considered, and $P(m)$ is the assigned prior probability to each network (typically taken to be uniform over all $m$). The evidence is therefore a measure of the quality of a network as viewed by the data and can be used to compare architectures and marginalise over models.
The Bayesian evidence $\mathcal{Z}$ in \cref{e:bayes} is the average of the likelihood function over the parameter space, weighted by the prior distribution. A larger parameter space, either in the form of higher dimensionality or a larger domain results in a lower evidence value, all other things being equal. Thus the evidence automatically implements Occam's razor: when you have two competing theories that make similar predictions, the one with fewer active (i.e.\ constrained) parameters should be preferred.
This naturally has useful implications in the context of machine learning: for two models which fit the training data equally well, one would expect the simpler model (i.e. the one with the higher Bayesian evidence) to generalise to out-of-sample data better, due to it overfitting the training data less. Thus one can postulate that the Bayesian evidence can be used as a proxy for out-of-sample data performance of a model relative to alternative models. If shown to be true in practice more generally, this has wide-reaching implications for machine learning, as an orthogonal measure of performance generalisation from training data alone might allow less data to be sacrificed to testing sets, and a more robust test of performance on completely unseen data.
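% [Editorial aside] A one-parameter sketch of the Occam effect described
% above: for a flat prior of width $\sigma_\pi$ and a likelihood peaked in a
% region of width $\sigma_{\mathcal{P}} \ll \sigma_\pi$,
% $\mathcal{Z} = \int \mathcal{L}\pi\,\mathrm{d}\theta \approx
%   \mathcal{L}_\mathrm{max}\,\sigma_{\mathcal{P}}/\sigma_\pi$,
% so each additional constrained parameter multiplies the evidence by an
% ``Occam factor'' $\sigma_{\mathcal{P}}/\sigma_\pi < 1$ unless it buys a
% compensating increase in $\mathcal{L}_\mathrm{max}$.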
\citet{mackay1992practical} historically found a good correlation between $\z$ and generalisation ability for small neural networks applied to regression problems. Furthermore, \citeauthor{mackay1992practical} found that for models where this correlation did not exist, the models generally performed poorly on test set data, but when they were improved in some way\footnote{In the particular example presented, \citet{mackay1992practical} improved performance by increasing the granularity of the variable prior hyperparameters.} then the correlation was found to exist. Thus in this instance the correlation (or lack thereof) between the evidence and out-of-sample performance can also be interpreted as a tool for determining when a model's performance on out-of-sample data can be improved. One of the key results of this paper is the demonstration that the correlation between $\z$ and generalisation ability also holds in the compromise-free setting.
%TODO can possibly make this shorter
A further use of the evidence $\z$ is to perform Bayesian ensembling over networks/models. Taking a uniform prior over all networks, samples from the full ensemble distribution over different networks can be generated by re-weighting so that the posterior mass of each individual network is proportional to its evidence. This generates a set of samples from the full joint distribution. In the context of MLP neural networks, different models can take several forms, including different numbers of layers, different numbers of nodes within the layers, different activation functions, and in our case, even different granularities (see \cref{s:methods}) on the hierarchical priors. The beauty of using the full joint distribution is it allows the data to decide which models are most important in the fitting process, arguably putting less onus on the user as they now only have to decide on a suite of models to choose from, and the associated prior probabilities of these models. Supplementary detail may be found in \cref{s:further_bayes}.
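% [Editorial aside] The re-weighting described above can be written
% explicitly: with a uniform prior over models,
% $P(m|D_\mathrm{train}) = \mathcal{Z}_m / \sum_{m'} \mathcal{Z}_{m'}$,
% and the ensemble posterior is the mixture
% $P(\theta, m|D_\mathrm{train}) = P(\theta|D_\mathrm{train}, m)\,P(m|D_\mathrm{train})$,
% so $N_m$ equally weighted samples drawn from model $m$'s posterior enter
% the joint sample set with weight proportional to $\mathcal{Z}_m / N_m$.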
\begin{figure}
\centerline{\includegraphics[width=\textwidth]{posteriors.pdf}}
\caption{Example posterior produced by \PolyChord{}~\citep{handley2015polychord1,handley2015polychord2} for the neural network shown in the bottom right corner with ``single'' granularity hyperpriors (\cref{s:further_bayes}). The full 45-dimensional posterior is sampled numerically. In the top panels the 1-dimensional marginal distributions for the network weights $w$ and biases $b$ in the first layer are shown. The three square plots show the pairwise 2-dimensional marginal posteriors for the next three layers, with 1-d marginals on the diagonals, representative samples drawn from the posterior in the upper triangle and histograms of the posterior in the lower triangle. The bottom centre plots show the posterior on the {\em a-priori\/} unknown Gaussian noise level $\sigma$ in the likelihood and single prior hyperparameter $\sigma_1$. It should be noted that the posteriors are highly non-Gaussian and multi-modal, necessitating the use of a full compromise-free sampler. Plot created under \texttt{anesthetic}~\citep{2019JOSS....4.1414H}.
\label{f:posteriors}
}
\end{figure}
\section{Methodology}\label{s:methods}
\textbf{Data:} For this paper we focus on the Boston housing dataset\footnote{\url{https://archive.ics.uci.edu/ml/machine-learning-databases/housing/}} as its small sample size is appropriate for a compromise-free Bayesian treatment. The dataset consists of 506 continuously variable records, with 13 inputs (features) and one output. We linearly whiten the input and output variables of this regression problem so that they have zero mean and unit variance.
For each analysis we split the entire data in half so that both the training and test sets contain $n_\mathrm{test}=n_\mathrm{train}=253$ records. We evaluate the test set performance by considering the mean squared error $E=\chi^2_\mathrm{test}/{n_\mathrm{test}}$ between $y_{\mathrm{test}}$ and the mean BNN prediction $\widehat{y}_\mathrm{new}$, as well as the error on $E$ derived from $\sigma_{y, \mathrm{pred}}$.
For training purposes we repeat the analysis with ten different random splits of the training/test data, so for a given setup we train a BNN on ten different training sets and measure their performance on the ten corresponding test sets. For these ten different data splits, we look at the average value of the mean squared errors on the test sets, and also take the mean value of $\z$ obtained from the ten analyses. The values quoted in the rest of this analysis are the logarithm of the average evidence, and the average of the mean squared errors, over the ten splits.
\textbf{Models:} \cref{f:nn} details the neural network architectures we consider alongside a network with no hidden layer (i.e. Bayesian linear regression).
For each architecture, we first consider networks with either a $\tanh$ or $\relu$ activation function for all hidden layers, with Gaussian prior and likelihood widths fixed to $\sigma_i=\sigma=1$.
For the $\tanh$ activation functions, we also consider the impact of hierarchical priors as discussed in \cref{s:further_bayes}.\footnote{This is analogous to letting the data decide on the values of $\lambda_m$ and $\lambda_r$} In this case we also allow the likelihood precision $\tau \equiv \sigma^{-2}$ to vary by setting a Gamma prior with $\alpha=\beta=1$.
In the terminology of Bayesian statistics we shall refer to each combination of architecture, activation function and prior as a {\em model}.
For the hierarchical priors, three different variants were considered, corresponding to three different levels of granularity on the priors: single, layer \citep[both used in][]{mackay1992practical} and input size \citep[similar to the automatic relevance determination method in][]{neal2012bayesian}. Single granularity has one hyperparameter controlling the standard deviations on the priors of all weights and biases in the network. Layer granularity has two hyperparameters per layer (one for the bias, one for the weights in each layer). For input size granularity models, the number of hyperparameters for each layer depends on the number of inputs to that layer. Following \citet{neal2012bayesian} we used zero mean Gaussian distributions with an ordering enforced to prevent weight space degeneracy \citep{goodfellow2016deep} \citep[as discussed in][]{handley2019bayesian, buscicchio2019label} for the priors and Gamma distributions for the hyperpriors. For layer and input size random variable hyperparameter models we scaled the Gamma distribution hyperparameters so that the prior over functions converges to Gaussian processes in the limit of infinitely wide networks, as discussed in \citet{neal2012bayesian}.
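% [Editorial note] With $\alpha=\beta=1$ the Gamma hyperprior on the
% likelihood precision $\tau \equiv \sigma^{-2}$ mentioned above reduces to a
% unit exponential, $\pi(\tau) = e^{-\tau}$, a weakly informative choice that
% leaves the data to determine the noise level.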
\textbf{Training:} To obtain estimates of the BNN posterior distributions and Bayesian evidences we use the \PolyChord{} algorithm \citep{handley2015polychord1, handley2015polychord2}, a high-dimensional, high-performance implementation of nested sampling \citep{skilling2006nested}. We run with 1,000 live points $n_\mathrm{live}$ and the number of repeats $n_\mathrm{repeats}$ set to 5 $\times$ the dimensionality of the parameter space (which vary between $14$ and $156$ dimensions, see \cref{t:non_combined_averages}).
We followed the recommended procedure of checking our results are stable with respect to varying $n_\mathrm{repeats}$. An example posterior produced by \PolyChord{} is plotted under \texttt{anesthetic}~\citep{2019JOSS....4.1414H} in \cref{f:posteriors}, which shows the compromise-free posterior as highly non-Gaussian, with nonlinear correlation structure and significant multimodality. A naive optimiser applied to such a training scenario would likely be unable to reveal such information.
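% [Editorial illustration] For the largest model these settings amount to
% $n_\mathrm{live} = 1000$ live points with
% $n_\mathrm{repeats} = 5 \times 156 = 780$ slice-sampling steps taken to
% decorrelate each new live point.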
\textbf{Computing:} Our longest runs (those with the biggest networks and most complex hierarchical priors) took up to 12 hours using the multithreaded Eigen\footnote{\url{http://eigen.tuxfamily.org/index.php?title=Main_Page}} C++ library to implement these BNNs, computed on Intel Xeon Skylake 6142 processors (16-core 2.6GHz, 192GB RAM)\footnote{\url{https://www.hpc.cam.ac.uk/systems/peta-4}}. \PolyChord{} may be MPI parallelised up to the number of live points (i.e. 1,000), but we opted to trivially parallelise over separate network runs. We also experimented with serial tensorflow-gpu code on an Nvidia P100 16GB GPU\footnote{\url{https://www.hpc.cam.ac.uk/systems/wilkes-2}}, but found this to be slower than the CPU code described above: because we were running small networks on small datasets, the overhead of transferring subsequent samples in and out of GPU memory outweighed the GPU matrix-manipulation gains. The code used to conduct these experiments is publicly available on GitHub\footnote{\url{https://github.com/SuperKam91/bnn}}.
\begin{figure}
\centerline{ \includegraphics{non_comb_mean_results_colour-crop.pdf}}
\caption{Log Bayesian evidence versus test loss values averaged over the ten different data randomisations, for the BNNs with $\tanh$ activation functions (blue), $\relu$ activations (orange), and variable hyperparameter models with $\tanh$ activations (green). }
\label{f:non_comb_mean_results}
\end{figure}
\section{Results}\label{s:results}
%TODO possibly split into subsections
In total we trained 49 different model combinations of architecture, activation function and prior (\cref{t:non_combined_averages}), and for each we performed 10 different randomisations of the training--test split (in each case 50\% of the data is used for training and the remaining 50\% is used for testing), and took averages of these 10 results. We examined correlations between the test loss (mean squared error of the BNN mean estimates on the test data), the Bayesian evidences, and the size of the parameter space dimensionalities for the different models. A more detailed discussion of results with additional figures and tables may be found in \cref{s:detailed_results}.
Our headline results are shown as a plot in the test-set performance--evidence plane in \cref{f:non_comb_mean_results}. Across all of the models there are three distinct clusters: models which used fixed prior/likelihood variances split into $\tanh$ and $\relu$ clusters, and models which used hierarchical priors (which all used $\tanh$ hidden activations). In all cases for a given architecture, the latter had higher evidence values and better performance, adhering to the trend found in \citet{mackay1992practical}.
For models with fixed prior/likelihood variances, the models which used the $\relu$ activation function consistently outperformed the equivalent architectures with $\tanh$ activations.
This is a somewhat surprising result (further highlighted in \cref{f:non_comb_tanh_relu_mean_results}): since $\tanh$ is a more non-linear function, one might expect it to perform better than $\relu$ for the small networks considered here. In traditional neural network training, two of the key reasons why $\relu$ is a popular choice are that: 1) its derivative is fast and easy to calculate, which is crucial for backward propagation training; and 2) non-positive activations are shut out (their value and derivative are both zero), so a side-effect of using $\relu$ is that it provides a form of regularisation, which can be helpful in large networks. Neither of these considerations is applicable to the BNNs we consider here, however, since \PolyChord{} does not use derivative information, and only small networks are considered. One potential benefit of using $\relu$ for these BNNs is that the function does not saturate for input values large in magnitude as $\tanh$ does. Regardless of why $\relu$ so consistently outperforms the $\tanh$ models, the evidence clearly picks up on its superiority.
\begin{figure}
\centerline{\includegraphics{non_comb_sh_lh_ih_sv_mean_loss_Z_sharex_vs_dims-crop.pdf}}
\caption{Log Bayesian evidence (top plot) and test loss values (bottom plot) versus BNN model dimensionality focussing on the cases with variable hyperparameters with single (blue points), layer (orange points), or input size granularity.
Note that the mapping between architecture and dimensionality is not one-to-one, as some models by chance have the same dimensionality. In these cases, the evidence may still be used to select between architectures of the same dimensionality. For clarity, reading right-to-left and then top-to-bottom the network architectures are (2), (2,2), (2,2,2), (2,2,2,2), (4), (4,4), (4,4,4), (4,4,4,4), (8).
}
\label{f:non_comb_sh_lh_ih_sv_mean_results_mean_Z_vs_dims_mean_loss_vs_dims}
\end{figure}
We also investigated the effect of using hierarchical priors with a range of degrees of complexity/granularity (single, layer and input size: \cref{s:stochasticresults}). At single (coarse) granularity, there is a definite correlation between test set accuracy and Bayesian evidence (\cref{f:non_comb_sh_lh_ih_sv_mean_results}).
For the same subset of models, there appears to be a Bayesian cliff present when considering the evidence as a function of parameter space dimensionality \citep{mackay1992practical}, as well as a striking symmetry between the evidence--dimensionality and the test set performance--dimensionality planes (\cref{f:non_comb_sh_lh_ih_sv_mean_results_mean_Z_vs_dims_mean_loss_vs_dims}).
A stronger correlation between performance and evidence is present for the layer random variable prior hyperparameter models. The symmetry mentioned previously is also present, but in this case the Bayesian cliff is more of a plateau. The same can be said for the models with input size granularity.
For layer granularity models the test set performance was better than the equivalent models with single granularity in most cases. The evidence values were also higher for the former in general, indicating that it correctly captured the superior performance. But perhaps surprisingly, the input size granularity models consistently underperform the layer granularity models, in contrast to \citep{neal2012bayesian, javid201921cm}. Further investigation into the training and test set losses suggests that the input size models are on average shutting off nodes important for the test data, more than the layer granularity models are, while performing similarly well on the training data.
The jump in performance and evidence, and increase in correlation between the two, associated with switching from fixed to variable prior/likelihood variances was also found in \citet{mackay1992practical}, though there it arose from effectively switching from single to layer granularity models. \citeauthor{mackay1992practical} attributed this to the idea that when a correlation between evidence and performance was not present, the model could be improved in some way, and that the improved model showed this correlation more strongly.
We finally combined the predictions of an ensemble of BNNs by considering the posterior distribution over the corresponding models, parameterised by a categorical variable representing a given BNN (\cref{t:combined_averages}). The corresponding posteriors were obtained from the samples of the individual model posteriors which are re-weighted according to the evidences associated with a given run, and the ensemble prior. For simplicity we assume a uniform prior over all the models, and we consider a wide array of different ensembles of these BNNs, including: ensembles of models with the same variable hyperparameter granularity; ensembles of models with the same number of nodes per layer; and ensembles of models with the same number of layers.
As detailed in \cref{f:all_mean_results}, from the predictions and the evidences associated with the ensembles we found: 1) The evidence--test set performance relations were very much the same as those found for the individual model results; 2) The best performance on the test set data overall was obtained with an ensemble of models: the combination of all models with layer granularity random variable hyperparameters achieved a test loss of 0.1732. The best performing individual model obtained a loss of 0.1791.
\section{Conclusions}
\label{s:c}
In this paper we applied compromise-free Bayesian neural networks (BNNs) to the Boston housing dataset in order to explore the relationship between out-of-sample performance and the Bayesian evidence as first studied in \citet{mackay1992practical}, and how combining the predictions of different networks in a statistically sound way affects their performance. To numerically obtain samples of the full posterior and evidences associated with the training data and the networks, we used the nested sampling algorithm \PolyChord{} \citep{handley2015polychord1, handley2015polychord2}, to train models with up to $156$ trainable parameters.
We considered a wide variety of models in our analysis: networks with either $\tanh$ or $\relu$ activation functions; networks with between zero and four hidden layers; and models trained using hierarchical Bayesian inference, i.e. models where the prior and likelihood standard deviations are modelled as random variables, as in \citet{mackay1992practical, neal2012bayesian, javid201921cm}.
Our experiments demonstrate a proof-of-concept for two concrete principles: A) Using the Bayesian evidence one can quantify out-of-sample performance and generalisability from training data alone (and where this isn't the case, it is indicative that the model needs improvement); B) Ensembles of networks, which are obtained almost for free from previous runs, can improve predictive performance.
A compromise-free approach is only ever intended as an initial step in a wider analysis. The purpose of solving the full numerical problem without approximation is twofold:
First, to see how far one can get with current computing resources, and hence to forecast how things may scale both now and in the future with more computing resources, time or money.
Second, and more importantly, to use the results of the full solution as a foundation for an exciting area of model selection and ensembling. We hope this paper will encourage the community to apply more approximate methods, to consider larger, more practical applications, and to test how the trends and characteristics presented here are affected by these regime changes.
It is hoped that this work will be a springboard and inspiration for a profitable line of research into Bayesian neural networks in the context of model selection and model ensembling.
\vfill
\pagebreak
\section*{Broader Impact}
This work into compromise-free Bayesian neural networks is very much at a preliminary stage, but demonstrates that there are many benefits that a fully Bayesian numerical approach can bring to inference.
If the further investigations into using the Bayesian evidence $\z$ to quantify out-of-sample generalisation from in-sample data show the observations of this paper to be robust, then the potential importance of this approach cannot be overstated. Whilst we do not advocate dispensing with out-of-sample testing data, there are many machine learning domains where obtaining such data is very hard or ethically impossible, such as autonomous vehicles or medical applications. In these settings the ability to select, deselect or marginalise over approaches that are unlikely to work well beyond the confines of a training dataset could prove essential, although relying blindly on the evidence could result in overfitting of the data.
Even in fields where testing and training data are plentiful, an orthogonal measure of generalisability could still prove invaluable to both improving performance, and reducing the effect of hidden biases in training data and systematic errors.
There is also scope for this work to further encourage the use of Bayesian modelling in machine learning in general. It is likely true that the observations regarding generalisability and ensembling obtained here are not neural network specific, and the ability to have principled uncertainty quantification (or certainty about one's uncertainty) is something that would be of great use in a wide variety of fields.
We do not necessarily advocate using a compromise-free approach for widespread industrial use in its present state. The computational cost is heavy in time, money, memory and energy. Whilst this can be brought down to human-scale training times using high-performance computing and the extensive parallelisation capability of \PolyChord{}, which may be sped up linearly to $n_\mathrm{CPUs}\sim\mathcal{O}(n_\mathrm{live})$, this is arguably not a very green approach. However, in its current form it could be useful as an additional tool in the arsenal of a machine learning researcher for solving particularly stubborn supervised learning problems.
As the technology behind the research is refined, the compromise-free approach may become more competitive in comparison with traditional training techniques. In such an instance, the products of this research will have a wide impact across a broad range of research and industrial applications.
\textbf{Cui bono:} People wishing to gauge generalisation ability from training data; people concerned with uncertainty quantification; people concerned with tying practical machine learning back to traditional statistics and theory; people concerned with the approximations/assumptions used in training Bayesian models.
\textbf{Cui malo:} People concerned with the compute costs associated with training machine learning models; people who are concerned primarily with optimal performance (e.g. lowest error) of ML models.
\begin{ack}
This work was performed using resources provided by the \href{http://www.csd3.cam.ac.uk/}{Cambridge Service for Data Driven Discovery (CSD3)} operated by the University of Cambridge Research Computing Service, provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/P020259/1), and \href{https://www.dirac.ac.uk}{DiRAC funding from the Science and Technology Facilities Council}.
K.J.\ was supported by an STFC Impact Acceleration Account 2018 grant.
W.H.\ was supported by a Gonville \& Caius College Research fellowship and STFC IPS grant number G102229.
\end{ack}
\small
\setlength{\bibsep}{0.0pt}
\bibliographystyle{unsrtnat}
\bibliography{references}
\appendix
\section{Additional detail for Bayesian framework}\label{s:further_bayes}
In this appendix we collect together additional material describing our Bayesian approach which we consider too much detail for the main paper, but may be of use to readers who are new to numerical Bayesian inference.
\subsection{Propagating Bayesian errors from weights to predictions}\label{s:propagating}
In \cref{s:bnn} we briefly commented that the posterior described by \cref{e:bayes} quantifies the error in our fitted network parameters $\theta$, and that this error may be ``forwarded onto a distribution on the predictions $y_\mathrm{pred}$''. More precisely, a predictive distribution for $y_\mathrm{pred}$, from unseen inputs $x_\mathrm{new}$ by the network is induced by the posterior, and may be computed by marginalisation~\citep{2019Entrp..21..272H}
\begin{align}
P(y_\mathrm{pred}|x_\mathrm{new},D_\mathrm{train},\mathcal{M}) &= \int \delta(y_\mathrm{pred}-f(x_\mathrm{new};\theta)) P(\theta|D_\mathrm{train},\mathcal{M})\mathrm{d}{\theta}.
\label{e:pred}\\
&= \frac{\mathrm{d}}{\mathrm{d}y_\mathrm{pred}} \int\limits_{f(x_\mathrm{new};\theta) < y_\mathrm{pred}} P(\theta|D_\mathrm{train},\mathcal{M})\,\mathrm{d}\theta.
\end{align}
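% [Editorial note] In practice this marginalisation is estimated by Monte
% Carlo over the (suitably weighted) posterior samples
% $\{\theta^{(i)}\}_{i=1}^N$ returned by the sampler, e.g.
% $\widehat{y}_\mathrm{pred} \approx \tfrac{1}{N}\sum_i f(x_\mathrm{new};\theta^{(i)})$,
% with $\sigma^2_{y,\mathrm{pred}}$ the corresponding sample variance.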