{% raw %} Title: Create a Markdown Blog Post Integrating Research Details and a Featured Paper
====================================================================================
This task involves generating a Markdown file (ready for a GitHub-served Jekyll site) that integrates our research details with a featured research paper. The output must follow the exact format and conventions described below.
====================================================================================
Output Format (Markdown):
------------------------------------------------------------------------------------
---
layout: post
title: "Split personalities in Bayesian Neural Networks: the case for full marginalisation"
date: 2022-05-23
categories: papers
---

![AI generated image](/assets/images/posts/2022-05-23-2205.11151.png)

David Yallup, Will Handley, Mike Hobson, Anthony Lasenby

Content generated by [gemini-2.5-pro](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/2022-05-23-2205.11151.txt).

Image generated by [imagen-3.0-generate-002](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/2022-05-23-2205.11151.txt).

------------------------------------------------------------------------------------
====================================================================================
Please adhere strictly to the following instructions:
====================================================================================
Section 1: Content Creation Instructions
====================================================================================
1. **Generate the Page Body:**
   - Write a well-composed, engaging narrative that is suitable for a scholarly audience interested in advanced AI and astrophysics.
   - Ensure the narrative is original and reflective of the tone, style, and content of the "Homepage Content" block (provided below), but do not reuse its content.
   - Use bullet points, subheadings, or other formatting to enhance readability.
2. **Highlight Key Research Details:**
   - Emphasize the contributions and impact of the paper, focusing on its methodology, significance, and context within current research.
   - Specifically highlight the lead author (David Yallup). When referencing any author, use Markdown links from the Author Information block (choose academic or GitHub links over social media).
3. **Integrate Data from Multiple Sources:**
   - Seamlessly weave information from the following:
     - **Paper Metadata (YAML):** Essential details including the title and authors.
     - **Paper Source (TeX):** Technical content from the paper.
     - **Bibliographic Information (bbl):** Extract bibliographic references.
     - **Author Information (YAML):** Profile details for constructing Markdown links.
   - Merge insights from the Paper Metadata, TeX source, Bibliographic Information, and Author Information blocks into a coherent narrative—do not treat these as separate or isolated pieces.
   - Insert the generated narrative between the HTML comments: and
4. **Generate Bibliographic References:**
   - Review the Bibliographic Information block carefully.
   - For each reference that includes a DOI or arXiv identifier:
     - For DOIs, generate a link formatted as: [10.1234/xyz](https://doi.org/10.1234/xyz)
     - For arXiv entries, generate a link formatted as: [2103.12345](https://arxiv.org/abs/2103.12345)
   - **Important:** Do not use any LaTeX citation commands (e.g., `\cite{...}`). Every reference must be rendered directly as a Markdown link.
For example, instead of `\cite{mycitation}`, output `[mycitation](https://doi.org/mycitation)` - **Incorrect:** `\cite{10.1234/xyz}` - **Correct:** `[10.1234/xyz](https://doi.org/10.1234/xyz)` - Ensure that at least three (3) of the most relevant references are naturally integrated into the narrative. - Ensure that the link to the Featured paper [2205.11151](https://arxiv.org/abs/2205.11151) is included in the first sentence. 5. **Final Formatting Requirements:** - The output must be plain Markdown; do not wrap it in Markdown code fences. - Preserve the YAML front matter exactly as provided. ==================================================================================== Section 2: Provided Data for Integration ==================================================================================== 1. **Homepage Content (Tone and Style Reference):** ```markdown --- layout: home --- ![AI generated image](/assets/images/index.png) The Handley Research Group stands at the forefront of cosmological exploration, pioneering novel approaches that fuse fundamental physics with the transformative power of artificial intelligence. We are a dynamic team of researchers, including PhD students, postdoctoral fellows, and project students, based at the University of Cambridge. Our mission is to unravel the mysteries of the Universe, from its earliest moments to its present-day structure and ultimate fate. We tackle fundamental questions in cosmology and astrophysics, with a particular focus on leveraging advanced Bayesian statistical methods and AI to push the frontiers of scientific discovery. Our research spans a wide array of topics, including the [primordial Universe](https://arxiv.org/abs/1907.08524), [inflation](https://arxiv.org/abs/1807.06211), the nature of [dark energy](https://arxiv.org/abs/2503.08658) and [dark matter](https://arxiv.org/abs/2405.17548), [21-cm cosmology](https://arxiv.org/abs/2210.07409), the [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1807.06209), and [gravitational wave astrophysics](https://arxiv.org/abs/2411.17663). ### Our Research Approach: Innovation at the Intersection of Physics and AI At The Handley Research Group, we develop and apply cutting-edge computational techniques to analyze complex astronomical datasets. Our work is characterized by a deep commitment to principled [Bayesian inference](https://arxiv.org/abs/2205.15570) and the innovative application of [artificial intelligence (AI) and machine learning (ML)](https://arxiv.org/abs/2504.10230). **Key Research Themes:** * **Cosmology:** We investigate the early Universe, including [quantum initial conditions for inflation](https://arxiv.org/abs/2002.07042) and the generation of [primordial power spectra](https://arxiv.org/abs/2112.07547). We explore the enigmatic nature of [dark energy, using methods like non-parametric reconstructions](https://arxiv.org/abs/2503.08658), and search for new insights into [dark matter](https://arxiv.org/abs/2405.17548). A significant portion of our efforts is dedicated to [21-cm cosmology](https://arxiv.org/abs/2104.04336), aiming to detect faint signals from the Cosmic Dawn and the Epoch of Reionization. * **Gravitational Wave Astrophysics:** We develop methods for [analyzing gravitational wave signals](https://arxiv.org/abs/2411.17663), extracting information about extreme astrophysical events and fundamental physics. 
* **Bayesian Methods & AI for Physical Sciences:** A core component of our research is the development of novel statistical and AI-driven methodologies. This includes advancing [nested sampling techniques](https://arxiv.org/abs/1506.00171) (e.g., [PolyChord](https://arxiv.org/abs/1506.00171), [dynamic nested sampling](https://arxiv.org/abs/1704.03459), and [accelerated nested sampling with $\beta$-flows](https://arxiv.org/abs/2411.17663)), creating powerful [simulation-based inference (SBI) frameworks](https://arxiv.org/abs/2504.10230), and employing [machine learning for tasks such as radiometer calibration](https://arxiv.org/abs/2504.16791), [cosmological emulation](https://arxiv.org/abs/2503.13263), and [mitigating radio frequency interference](https://arxiv.org/abs/2211.15448). We also explore the potential of [foundation models for scientific discovery](https://arxiv.org/abs/2401.00096). **Technical Contributions:** Our group has a strong track record of developing widely-used scientific software. Notable examples include: * [**PolyChord**](https://arxiv.org/abs/1506.00171): A next-generation nested sampling algorithm for Bayesian computation. * [**anesthetic**](https://arxiv.org/abs/1905.04768): A Python package for processing and visualizing nested sampling runs. * [**GLOBALEMU**](https://arxiv.org/abs/2104.04336): An emulator for the sky-averaged 21-cm signal. * [**maxsmooth**](https://arxiv.org/abs/2007.14970): A tool for rapid maximally smooth function fitting. * [**margarine**](https://arxiv.org/abs/2205.12841): For marginal Bayesian statistics using normalizing flows and KDEs. * [**fgivenx**](https://arxiv.org/abs/1908.01711): A package for functional posterior plotting. * [**nestcheck**](https://arxiv.org/abs/1804.06406): Diagnostic tests for nested sampling calculations. ### Impact and Discoveries Our research has led to significant advancements in cosmological data analysis and yielded new insights into the Universe. Key achievements include: * Pioneering the development and application of advanced Bayesian inference tools, such as [PolyChord](https://arxiv.org/abs/1506.00171), which has become a cornerstone for cosmological parameter estimation and model comparison globally. * Making significant contributions to the analysis of major cosmological datasets, including the [Planck mission](https://arxiv.org/abs/1807.06209), providing some of the tightest constraints on cosmological parameters and models of [inflation](https://arxiv.org/abs/1807.06211). * Developing novel AI-driven approaches for astrophysical challenges, such as using [machine learning for radiometer calibration in 21-cm experiments](https://arxiv.org/abs/2504.16791) and [simulation-based inference for extracting cosmological information from galaxy clusters](https://arxiv.org/abs/2504.10230). * Probing the nature of dark energy through innovative [non-parametric reconstructions of its equation of state](https://arxiv.org/abs/2503.08658) from combined datasets. * Advancing our understanding of the early Universe through detailed studies of [21-cm signals from the Cosmic Dawn and Epoch of Reionization](https://arxiv.org/abs/2301.03298), including the development of sophisticated foreground modelling techniques and emulators like [GLOBALEMU](https://arxiv.org/abs/2104.04336). 
* Developing new statistical methods for quantifying tensions between cosmological datasets ([Quantifying tensions in cosmological parameters: Interpreting the DES evidence ratio](https://arxiv.org/abs/1902.04029)) and for robust Bayesian model selection ([Bayesian model selection without evidences: application to the dark energy equation-of-state](https://arxiv.org/abs/1506.09024)). * Exploring fundamental physics questions such as potential [parity violation in the Large-Scale Structure using machine learning](https://arxiv.org/abs/2410.16030). ### Charting the Future: AI-Powered Cosmological Discovery The Handley Research Group is poised to lead a new era of cosmological analysis, driven by the explosive growth in data from next-generation observatories and transformative advances in artificial intelligence. Our future ambitions are centred on harnessing these capabilities to address the most pressing questions in fundamental physics. **Strategic Research Pillars:** * **Next-Generation Simulation-Based Inference (SBI):** We are developing advanced SBI frameworks to move beyond traditional likelihood-based analyses. This involves creating sophisticated codes for simulating [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1908.00906) and [Baryon Acoustic Oscillation (BAO)](https://arxiv.org/abs/1607.00270) datasets from surveys like DESI and 4MOST, incorporating realistic astrophysical effects and systematic uncertainties. Our AI initiatives in this area focus on developing and implementing cutting-edge SBI algorithms, particularly [neural ratio estimation (NRE) methods](https://arxiv.org/abs/2407.15478), to enable robust and scalable inference from these complex simulations. * **Probing Fundamental Physics:** Our enhanced analytical toolkit will be deployed to test the standard cosmological model ($\Lambda$CDM) with unprecedented precision and to explore [extensions to Einstein's General Relativity](https://arxiv.org/abs/2006.03581). We aim to constrain a wide range of theoretical models, from modified gravity to the nature of [dark matter](https://arxiv.org/abs/2106.02056) and [dark energy](https://arxiv.org/abs/1701.08165). This includes leveraging data from upcoming [gravitational wave observatories](https://arxiv.org/abs/1803.10210) like LISA, alongside CMB and large-scale structure surveys from facilities such as Euclid and JWST. * **Synergies with Particle Physics:** We will continue to strengthen the connection between cosmology and particle physics by expanding the [GAMBIT framework](https://arxiv.org/abs/2009.03286) to interface with our new SBI tools. This will facilitate joint analyses of cosmological and particle physics data, providing a holistic approach to understanding the Universe's fundamental constituents. * **AI-Driven Theoretical Exploration:** We are pioneering the use of AI, including [large language models and symbolic computation](https://arxiv.org/abs/2401.00096), to automate and accelerate the process of theoretical model building and testing. This innovative approach will allow us to explore a broader landscape of physical theories and derive new constraints from diverse astrophysical datasets, such as those from GAIA. Our overarching goal is to remain at the forefront of scientific discovery by integrating the latest AI advancements into every stage of our research, from theoretical modeling to data analysis and interpretation. We are excited by the prospect of using these powerful new tools to unlock the secrets of the cosmos. 
Content generated by [gemini-2.5-pro-preview-05-06](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/index.txt). Image generated by [imagen-3.0-generate-002](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/index.txt). ``` 2. **Paper Metadata:** ```yaml !!python/object/new:feedparser.util.FeedParserDict dictitems: id: http://arxiv.org/abs/2205.11151v1 guidislink: true link: http://arxiv.org/abs/2205.11151v1 updated: '2022-05-23T09:24:37Z' updated_parsed: !!python/object/apply:time.struct_time - !!python/tuple - 2022 - 5 - 23 - 9 - 24 - 37 - 0 - 143 - 0 - tm_zone: null tm_gmtoff: null published: '2022-05-23T09:24:37Z' published_parsed: !!python/object/apply:time.struct_time - !!python/tuple - 2022 - 5 - 23 - 9 - 24 - 37 - 0 - 143 - 0 - tm_zone: null tm_gmtoff: null title: "Split personalities in Bayesian Neural Networks: the case for full\n marginalisation" title_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: type: text/plain language: null base: '' value: "Split personalities in Bayesian Neural Networks: the case for full\n\ \ marginalisation" summary: 'The true posterior distribution of a Bayesian neural network is massively multimodal. Whilst most of these modes are functionally equivalent, we demonstrate that there remains a level of real multimodality that manifests in even the simplest neural network setups. It is only by fully marginalising over all posterior modes, using appropriate Bayesian sampling tools, that we can capture the split personalities of the network. The ability of a network trained in this manner to reason between multiple candidate solutions dramatically improves the generalisability of the model, a feature we contend is not consistently captured by alternative approaches to the training of Bayesian neural networks. We provide a concise minimal example of this, which can provide lessons and a future path forward for correctly utilising the explainability and interpretability of Bayesian neural networks.' summary_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: type: text/plain language: null base: '' value: 'The true posterior distribution of a Bayesian neural network is massively multimodal. Whilst most of these modes are functionally equivalent, we demonstrate that there remains a level of real multimodality that manifests in even the simplest neural network setups. It is only by fully marginalising over all posterior modes, using appropriate Bayesian sampling tools, that we can capture the split personalities of the network. The ability of a network trained in this manner to reason between multiple candidate solutions dramatically improves the generalisability of the model, a feature we contend is not consistently captured by alternative approaches to the training of Bayesian neural networks. We provide a concise minimal example of this, which can provide lessons and a future path forward for correctly utilising the explainability and interpretability of Bayesian neural networks.' 
authors: - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: David Yallup - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Will Handley - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Mike Hobson - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Anthony Lasenby - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Pablo Lemos author_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Pablo Lemos author: Pablo Lemos arxiv_comment: 10 pages, 5 figures links: - !!python/object/new:feedparser.util.FeedParserDict dictitems: href: http://arxiv.org/abs/2205.11151v1 rel: alternate type: text/html - !!python/object/new:feedparser.util.FeedParserDict dictitems: title: pdf href: http://arxiv.org/pdf/2205.11151v1 rel: related type: application/pdf arxiv_primary_category: term: stat.ML scheme: http://arxiv.org/schemas/atom tags: - !!python/object/new:feedparser.util.FeedParserDict dictitems: term: stat.ML scheme: http://arxiv.org/schemas/atom label: null - !!python/object/new:feedparser.util.FeedParserDict dictitems: term: cs.LG scheme: http://arxiv.org/schemas/atom label: null ``` 3. **Paper Source (TeX):** ```tex % nested sampling macros \newcommand{\nlive}{\ensuremath{n_\text{live}}} \newcommand{\niter}{n_\text{iter}} \newcommand{\ndynamic}{n_\text{dynamic}} \newcommand{\pvalue}{\text{\textit{p-}value}\xspace} \newcommand{\pvalues}{\text{\pvalue{}s}\xspace} \newcommand{\Pvalue}{\text{\textit{P-}value}\xspace} \newcommand{\Pvalues}{\text{\Pvalue{}s}\xspace} \newcommand{\Z}{\ensuremath{\mathcal{Z}}\xspace} \newcommand{\logZ}{\ensuremath{\log\Z}\xspace} \newcommand{\like}{\ensuremath{\mathcal{L}}\xspace} \newcommand{\post}{\ensuremath{\mathcal{P}}\xspace} \newcommand{\prior}{\ensuremath{\Pi}\xspace} \newcommand{\evidence}{\ensuremath{\mathcal{Z}}\xspace} \newcommand{\nprior}{\ensuremath{n_\text{prior}}\xspace} \newcommand{\anyg}[3]{\ensuremath{#1\mathopen{}\left(#2\,\rvert\, #3\right)\mathclose{}}\xspace} \newcommand{\ofOrder}[1]{\ensuremath{\mathcal{O}\left(#1\right)}} \newcommand{\threshold}{\like^\star} \newcommand{\pg}[2]{p\mathopen{}\left(#1\,\rvert\, #2\right)\mathclose{}} \newcommand{\Pg}[2]{P\mathopen{}\left(#1\,\rvert\, #2\right)\mathclose{}} \newcommand{\p}[1]{p\mathopen{}\left(#1\right)\mathclose{}} \newcommand{\intd}{\text{d}} \newcommand{\sampleParams}{\mathbf{x}} \newcommand{\modelParams}{\boldsymbol{\theta}} \newcommand{\param}{x} \newcommand{\stoppingtol}{\epsilon} \newcommand{\efr}{\ensuremath{\code{efr}}\xspace} \newcommand{\nr}{\ensuremath{n_r}\xspace} \newcommand{\expectation}[1]{\langle #1 \rangle} \newcommand{\MN}{\textsc{MultiNest}\xspace} % PRL wants smallcaps \newcommand{\PC}{\textsc{PolyChord}\xspace} \newcommand{\sn}{\ensuremath{\mathbb{S}_n}\xspace} \newcommand{\sx}[1]{\ensuremath{\mathbb{S}_#1}\xspace} \newcommand{\W}{\ensuremath{\mathcal{W}}\xspace} \newcommand{\Wsym}{\ensuremath{\W\mathrm{sym}}\xspace} \newcommand{\Wskew}{\ensuremath{\W\mathrm{skew}}\xspace} \documentclass{article} % if you need to pass options to natbib, use, e.g.: % \PassOptionsToPackage{numbers, compress}{natbib} % before loading neurips_2022 % ready for submission % \usepackage{neurips_2022} % to compile a preprint version, e.g., for submission to arXiv, add add the % [preprint] option: % \usepackage[preprint]{neurips_2022} % to compile a camera-ready version, add the [final] option, e.g.: % \usepackage[final]{neurips_2022} % to avoid loading the natbib package, add option nonatbib: 
\usepackage[nonatbib,preprint]{neurips_2022} % \usepackage[nonatbib]{neurips_2022} \usepackage[utf8]{inputenc} % allow utf-8 input \usepackage[T1]{fontenc} % use 8-bit T1 fonts \usepackage{hyperref} % hyperlinks \usepackage{graphicx} \usepackage{url} % simple URL typesetting \usepackage{booktabs} % professional-quality tables \usepackage{amsfonts} % blackboard math symbols \usepackage{nicefrac} % compact symbols for 1/2, etc. \usepackage{microtype} % microtypography \usepackage{xcolor} % colors \usepackage{amsmath} \usepackage{subfig} \usepackage{cleveref} \Crefname{figure}{Figure}{Figures} \crefname{figure}{Figure}{Figures} \Crefname{section}{Section}{Sections} \crefname{section}{Section}{Sections} \usepackage[algoruled,lined,linesnumbered,longend]{algorithm2e} \input{macros} %\bibliographystyle{apsrev4-2} % \bibliographystyle{JHEP} \bibliographystyle{nips} % comments \newcommand{\DY}[1]{{\color{blue}\textbf{TODO DY:} \textit{#1}}} \newcommand{\PL}[1]{{\color{green}\textbf{TODO PL:} \textit{#1}}} \newcommand{\WH}[1]{{\color{orange}\textbf{W:} \textit{#1}}} \newcommand{\TODO}[1]{{\color{red}\textbf{TODO:} \textit{#1}}} % The \author macro works with any number of authors. There are two commands % used to separate the names and addresses of multiple authors: \And and \AND. % % Using \And between authors leaves it to LaTeX to determine where to break the % lines. Using \AND forces a line break at that point. So, if LaTeX puts 3 of 4 % authors names on the first line, and the last on the second line, try using % \AND instead of \And before the third author name. \title{Split personalities in Bayesian Neural Networks:\\ the case for full marginalisation} \author{% David Yallup\thanks{Corresponding Author}\\ Kavli Institute for Cosmology\\ Cavendish Laboratory \\ University of Cambridge\\ \texttt{dy297@cam.ac.uk} \\ % examples of more authors \And Will Handley\\ Kavli Institute for Cosmology\\ Cavendish Laboratory \\ University of Cambridge\\ \texttt{wh260@cam.ac.uk} \\ \And Mike Hobson \\ Astrophysics Group\\ Cavendish Laboratory \\ University of Cambridge\\ \texttt{mph@mrao.cam.ac.uk}\\ \And Anthony Lasenby \\ Astrophysics Group\\ Cavendish Laboratory \\ University of Cambridge\\ \texttt{a.n.lasenby@mrao.cam.ac.uk}\\ \And Pablo Lemos \\ Department of Physics and Astronomy\\ University of Sussex\\ \texttt{p.lemos@sussex.ac.uk}\\ \\ } \begin{document} \maketitle \begin{abstract} The true posterior distribution of a Bayesian neural network is massively multimodal. Whilst most of these modes are functionally equivalent, we demonstrate that there remains a level of real multimodality that manifests in even the simplest neural network setups. It is only by fully marginalising over all posterior modes, using appropriate Bayesian sampling tools, that we can capture the split personalities of the network. The ability of a network trained in this manner to reason between multiple candidate solutions dramatically improves the generalisability of the model, a feature we contend is not consistently captured by alternative approaches to the training of Bayesian neural networks. We provide a concise minimal example of this, which can provide lessons and a future path forward for correctly utilising the explainability and interpretability of Bayesian neural networks. 
\end{abstract} \section{Introduction}\label{sec:intro} %Failing to account for this reduces the efficacy\\ %Whilst most of this multimodality is spurious, arising from degenerate weight-space labeling choices, there remains a level of true multimodality that manifests in even the simplest neural network setups. %Given this multimodality you need a sampling scheme that can handle this.\\ %We demonstrate in a ludicrously simple example\\ %in the foothills of real bayesian neural networks\\ %In order to get the correct Bayesian solution you need to perform a full marginal likelihood calculation\\ %We demonstrate this using sampling technology capable of navigating these multiple modes\ldots\\ %Explainability\\ %Conjecture a form of dense neural networks that is \\ %Calculating the marginal likelihood guarantees that you've sampled the posterior.\\ % The emergence of the deep learning paradigm to tackle high dimensional inference problems has driven numerous striking advances in recent years. Entire fields, such as computer vision or natural language processing, have found deep learning realised with deep Neural Networks indispensable for inference. Optimizing these massively high dimensional function spaces is however a task fraught with difficulty that has led to an industry in developing approximations and techniques to make the optimization reliable. This difficulty is compounded for the much desired goal of Bayesian Neural Networks, where the problem of navigating the function space is compounded by the variety of approximations needed to make Bayesian inference tractable. In this work we make the case for a compromise free numerical marginalisation of the likelihood of a Neural Network. These calculations reveal some striking challenges for existing methods of Neural network posterior inference. % \WH{This paragraph could get to the point a bit faster} The emergence of the deep learning paradigm to tackle high dimensional inference problems has driven numerous striking advances in recent years. In most practical settings this is approached as a high dimensional Neural Network (NN) optimisation problem, with an extensive industry built up to attempt to make this reliably solvable. Bayesian Neural Networks (BNNs) are a parallel program to bring probabilistic reasoning over the same classes of models, realised through the application of Bayes theorem to the training procedure. A core attraction of performing Bayesian inference with NNs is that it provides a probabilistic interpretation of the action of a network, allowing clear expression of the increasingly important question: \emph{How certain are the predictions we make with ML?} Vital issues such as quantifying the potential for bias to occur in network predictions are expressed in a principled Bayesian framework; imbalances in input training data are fed back as reduced certainty in predictions for under-represented population members in the training sample~\cite{2021arXiv210604015N,maddox2019simple}. % Bayesian Neural Networks (BNNs) are realised through the application of Bayes theorem to the training procedure, as opposed to the typical maximum likelihood estimation based approach. 
The attraction of performing Bayesian inference with NNs is that it provides a probabilistic interpretation of the action of a network, allowing clear expression of the increasingly important question; \emph{How certain are the predictions we make with ML?} Vital issues such as quantifying the potential for bias to occur in network predictions are expressed in a principled Bayesian framework; imbalances in input training data are fed back as reduced certainty in predictions for under-represented population members in the training sample~\cite{2021arXiv210604015N,maddox2019simple}. Despite the attraction of a BNN framework, progress in this area has been marred by a looming issue: performing Bayesian inference over the vast parameter spaces a typical deep NN covers. The increased utility of a Bayesian calculation suffers a trade-off in the dimensionality of parameter space that the inference can be drawn over. When analytic approaches break down, as is the case in any real-world scenario, approximations are needed to make the Bayesian computations tractable~\cite{foong2020expressiveness}. Particular attention is generally paid to Bayesian approximations that are best suited to scaling to the very high dimensional space that maximum likelihood based approaches typically operate in. In general this results in an approximation for the posterior network probability distribution that holds under some potentially strict assumptions~\cite{2020arXiv200202405W}. In particular it is often assumed that the posterior over the free parameters of a network is suitably captured by a unimodal distribution (typically around the maximum a posteriori value). In this work we propose a minimal problem that directly challenges this viewpoint by demonstrating that even the simplest possible Neural Networks have a posterior distribution that is truly multimodal. In the example we provide in this work, we present the case for focusing on a ``compromise free'' numerical marginalisation of the likelihood as a way to guarantee that the training procedure accounts for, and accurately reflects, the posterior multimodality. We perform this explicit marginalisation using numerical sampling tools designed to directly calculate the full marginal likelihood, or \emph{Evidence}. A suitable sampling tool for such a calculation should have the ability to handle integrals over likelihoods with potentially multiple strongly peaked structures, which in turn affords formal guarantees that the posterior distribution inferred from the samples accurately captures these structures. Some preliminary work in marginalising over NNs in this manner has taken place~\cite{bsr,javid2020compromisefree}, where in general a focus has been placed on the role of the evidence for Bayesian Model comparison between networks. Whilst this is an interesting avenue to build further on the work presented here, we make the case that by closely examining the split personalities of networks emerging from these calculations, a fundamental challenge to all probabilistic NN descriptions is exposed. % In the example we provide, we contend that the only way to guarantee that the training procedure accounts for and accurately reflects this multimodality is to focus on a compromise free marginalisation of the likelihood. % In practice we propose that this is achieved by employing numerical Bayesian methods that explicitly calculate the oft-neglected marginal likelihood, or \emph{Evidence}, of a BNN. 
In common expositions of Bayesian methods for NNs, the evidence is either discarded as being little more than a normalising constant or approximated in a manner that will be demonstrated to be insufficient in this work. Some preliminary work in calculating evidences of NNs in this manner has taken place~\cite{bsr,javid2020compromisefree}, where in general a focus has been placed on the role of the evidence in Bayesian Model comparison between networks. Whilst this is an interesting avenue to pursue further, by closely examining the split personalities of networks that only emerges from such a calculation, a fundamental challenge to all probabilistic NN descriptions is exposed. % \WH{Still not entirely sure I agree with this -- we're not actually using the evidence here, only the relative posterior masses (one could discard the normalising constant), and in principle one could achieve this with any sampler that accurately sampled both posterior peaks}\WH{Consider reorganising this paragraph into other portions of the text, perhaps in \cref{sec:NSforNN}}\WH{I.e. this para should introduces what we mean by an ``appropriate bayesian sampling tool''} % In this work an orthogonal approach is taken, instead of reaching up to attempt to validate BNNs against their maximum likelihood counterparts, a first principles examination of the true nature of a BNN is explored. This is achieved not via an exhaustive comparison of proposed Bayesian methods, but rather by focusing on the oft-neglected marginal likelihood, or \emph{Evidence}, of a BNN. In common expositions of Bayesian methods for NNs, the evidence is either discarded as being little more than a normalising constant or approximated in a manner that will be demonstrated to be insufficient in this work. Previous explorations of fitting classical adaptive basis function models demonstrated multimodal posteriors that can only be numerically estimated from samples derived from a convergent evidence estimate~\cite{bsr}. In this work the case for, and an example of, explicit marginalisation over NNs is presented. In \cref{sec:bayes} a brief overview of Bayesian Neural Networks is given, particularly covering the challenges facing numerical Bayesian methods. In \cref{sec:ex} we present a minimal challenge that exposes the truly split behaviour of a NN posterior in a clear fashion, as well as discussing the implications of this observation. Lastly in \cref{sec:conc} we provide some concluding remarks, and higher-level motivation for future work. \section{Bayesian formulation of Neural Network training}\label{sec:bayes} This work focuses on a subset of the myriad of modern NN realisations: Fully Connected Feedforward Neural Networks. These networks were some of the earliest proposed forms of Artificial Neural Network, and in spite of the rise of symmetry respecting transformation layers~\cite{DBLP:journals/corr/abs-2104-13478}, these \emph{Dense} (synonymous with fully connected) layers are still abundant in modern architectures. A dense layer is composed of a linear mapping defined by a weight matrix \W taking a vector of $M$ inputs to a vector of $N$ outputs, $\W : \mathbb{R}^M \rightarrow \mathbb{R}^N$. The elements of this matrix, $w_{ij}$, connect every input element $i \in M$ to every output $j \in N$, a constant \emph{bias} vector, $b_j$, is also usually added. If the output nodes of a dense layer are internal (or hidden) a non-linear \emph{activation} function is applied to each node. 
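% An illustrative sketch (not from the original manuscript): the dense layer just described, written element-wise with an activation $a$ applied when the $N$ outputs are hidden nodes,
\begin{equation*}
    z_j = a\left(\sum_{i=1}^{M} w_{ij}\, x_i + b_j\right), \qquad j = 1,\dots,N.
\end{equation*}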
Repeated action of $K$ (indexed in this work as a superscript $k\in\{1,\dots,K\}$) parameterised linear maps with activation builds up a deep NN, mapping an input data vector, $x_i$, to a prediction vector, $y'_j$. A neural network would be considered deep if it has more than one hidden layer ($K\geq3$), in this work we examine \emph{shallow} networks ($K=2$). In this study we focus on a simple binary classification network, predicting class probabilities of a categorical distribution likelihood. The network is then trained by comparing labelled data, $D=\{x,y\}$, to predictions, $y'$, made with a choice of the free parameters of the network, $\theta=\{w,b\}$. The comparison is realised through the joint likelihood for making a set of observations conditioned on a choice of $\theta$, $\anyg{\like}{D}{\theta}$. The Bayesian approach to training is then using Bayes theorem to invert the conditional arguments of this likelihood, \begin{equation} \anyg{\post}{\theta}{D}=\frac{\anyg{\like}{D}{\theta} \prior(\theta)}{\evidence (D)}\,, \end{equation} where in addition to the likelihood, the following distributions are introduced: the prior over the network parameters \prior, the posterior distribution of the network parameters \post and the evidence for the network \evidence. The evidence is the marginal likelihood, found by marginalising over all the network parameters, \begin{equation} \evidence(D) = \int \anyg{\like}{D}{\theta} \prior(\theta) d\theta\,. \end{equation} A trained network in this sense is the posterior distribution of the weights and biases given an input set of data points. A sample from this posterior distribution can then be used to predict a sample of predicted labels for a new unseen set of data points, referred to as the model prediction through the remainder of this article. The role of the Bayesian evidence in tasks such as model selection is long established~\cite{mackay}, and has been demonstrated as a tool for tuning hyperparameters such as number of nodes in a hidden layer of a network~\cite{bsr,javid2020compromisefree}. What is less clearly established thus far in the literature are the implications of obtaining a convergent evidence calculation in relation to the resulting network posterior distribution. % \WH{Consider adjusting this sentence} \subsection{Priors for Dense Neural Networks}\label{sec:prior} Since the early days of NNs it has been understood that there are many functionally equivalent, or \emph{degenerate}, choices of $\theta$ for any given network architecture~\cite{10.1162/neco.1994.6.3.543}. For any given dense layer outputting $N$ nodes, arguments arising from combinatoric reordering of nodes and potential compensating translations across activation functions, lead to $2^N N!$ degenerate choices of weight for every posterior mode~\cite{bishop}. This presents a strict challenge for any Bayesian method. Either you sample the posterior and find at least $2^N N!$ degenerate modes per layer (noting this is clearly impossible due to the combinatorial growth of this factor and the typical value of $N$ in modern NNs), or you don't and hence have only explored some undefined patch of the total parameter space. In order to isolate and examine this issue in detail we employ a novel definition for the solution space a dense NN occupies. We propose that a constrained form of weight prior can be employed that retains the flexibility of a dense layer, without admitting functionally equivalent solutions. 
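% Illustrative arithmetic (not from the original manuscript), following the $2^N N!$ counting argument above: the number of functionally equivalent modes per dense layer grows rapidly,
\begin{equation*}
    N=2:\quad 2^{2}\,2! = 8, \qquad\qquad N=10:\quad 2^{10}\,10! \approx 3.7\times 10^{9}.
\end{equation*}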
% \TODO{something more about equivariant/geometric layers here} % \WH{We could probably make it clearer about how the posterior is the correct \textbf{weighted} ensembling strategy, unlike random-start or equal weight ensembling} % Almost all\DY{todo survey this a bit better}\PL{You could definitely say that SWA and VI do, and that HMC does not assume it, but is not good for exploring multimodal spaces} proposed Bayesian approaches rely on an assumption that the posterior is unimodal, any demonstration that this approximately holds hinges on a clearly incomplete exploration of the space. In this work we consider a potential solution to remove these degenerate modes but retain the full flexible range of linear mapping afforded by a dense NN layer. We initially restrict our analysis to mappings with $M=N$, \emph{i.e.} only admitting square weight matrices. % \PL{We should probably specify that every hidden layer must also have size M}, A square weight matrix can be trivially written as a sum of symmetric and skew symmetric pieces, \begin{equation} \W = \frac{1}{2} (\W + \W^T) + \frac{1}{2} (\W - \W^T) = \Wsym + \Wskew \,. \end{equation} We can impose a condition that \Wsym is positive definite. This implicitly removes $N!$ permutations by requiring that the weight matrix be pivot free. Further to this requiring positive definiteness imposes that the scaling part of the linear map cannot flip sign, giving consistently oriented scaling transformations. The remaining degrees of freedom in \Wskew can also be constrained, \Wskew forms the Lie algebra of $SO(N)$, with the structure constant being the totally anti-symmetric tensor $\epsilon$. By convention we can take the normal ordering, $\epsilon_{1,\dots,N} = 1$, which in practice can be enforced by appropriate sign choices in \Wskew. For simplicity in this work we use a uniform prior over weights and biases in the range $(-5,5)$, rejecting any samples where \Wsym is not positive definite, and \Wskew is not consistently oriented. It would perhaps be more familiar to choose a Gaussian prior, or equivalently L2 regularisation, to promote sparsity. This is less important as we are not operating in the over-parameterized regime with the examples we study, although implications of non flat priors are briefly discussed alongside our results. Starting from any typical symmetric prior and rejecting to a particular subset of mappings is useful for visualising this proposal on the concise problem in this work, however these constraints can be more efficiently encoded as a prior that has an appropriate support. The $N\times N$ matrix can be effectively split into a lower triangle (including diagonal), $L$, and upper triangle, $U$. \Wsym can then be generated as $\Wsym = L\cdot L^T$ with a prior over $L$ to reflect the desired sparsity promotion (typically employed in the form of regularisation terms for optimized NN approaches). The upper triangle provides the skew piece as, $\Wskew = U -U^T$, with appropriate signed priors over $U$ to respect a choice of structure constant for $SO(N)$. The generalisation of this to cases where $M\neq N$ is not trivial, particularly to construct in the rejection sampling format used in this work. %When $M>N$, similar arguments follow constructing pivot free matrices (where the pivot matrix spans the $N$ dimension) and forcing positive weights along the diagonal spanning $N$ can be motivated. A potential construction of a prior with this support could come by considering the reverse of a singular value decomposition. 
It is typical for such a decomposition of \W to implicitly perform the required pivoting to return a sorted and positive set of singular values, as well as orthogonal rotations in the $M$ and $N$ planes. To remove spurious degeneracy that emerges due to the flexibility of these rotations a similar construction to the one motivated for appropriate \Wskew in square matrices is likely necessary. Additionally we note that the symmetry respecting equivariant layers being increasingly deployed in modern NN applications are naturally better suited than dense layers to the marginalisation we attempt in this work~\cite{equiv}. It is quite natural to incorporate such layers into the picture we build here. % (with similar appropriate orientation constraints motivated for \Wskew) could be used to generically construct weight space priors suitable for sampling. % Recent work phrasing deep networks as spline operators~\cite{pmlr-v80-balestriero18b} has employed orthogonal regularization conditions on weight matrices to improve performance. In a sampling approach we could apply strict orthogonality requirements on the space of weight matrices, but in practice this would be overly restrictive. % The symmetric piece, \Wsym, can be expressed as a diagonal \emph{scaling} transform in some basis (via the spectral theorem). The skew piece, \Wskew, can be considered as the Lie algebra of $SO(N)$ \emph{rotations}. Starting from any typical even prior (Uniform priors are used in this work), the space of possible choices for an unconstrained \W can be reduced by rejecting configurations that do not meet the following criteria: % \begin{itemize} % \item \Wsym is positive definite \TODO{why based on some argument of scaling} % \item \Wskew only admits rotations in one direction about the skew axis (if $\psi$ is the angle of rotation this matrix generates, we restrict $\psi>0$) \TODO{why} % \end{itemize} \subsection{Tools for Neural Network marginalisation}\label{sec:NSforNN} In \cref{sec:prior} we considered the potentially troublesome sampling problem a dense NN layer poses. In order to sample these spaces we can consider which numerical Bayesian tools will be sufficient for adequately exploring the parameter space. Limited exploration is acceptable if every mode is functionally equivalent, however if there is genuine multimodality to the solution, there is no guarantee that the sampled posterior will be representative of the true posterior unless these degeneracies can be resolved before sampling. Bayesian methods that arise from an optimized point estimate (Variational Inference being a common example of this~\cite{tomczak2020efficient,rudnertractable,2018arXiv181003958W}), by construction cannot reflect any split behaviour. Sampling methods such as Hamiltonian Monte Carlo do give formal guarantees to sample a representation of the full density~\cite{cobb2019semi,cobb2021scaling}, however in practice it is well known that multimodal targets are a common failure mode for Markov Chain approaches. Ensemble approaches (either of Markov chains, or optimized point estimate solutions) are a potential method to extend the previously mentioned tools to work with multimodal targets. Whilst there is promise in these approaches, in general they lack formal probabilistic guarantees to obtain the true posterior~\cite{2020arXiv200110995W}. In practice then, it is often considered sufficient to either implicitly (or explicitly) assume the posterior distribution is unimodal. 
A correctly weighted ensemble of unimodal methods would adequately define the correct posterior, however the challenge is in how to construct this weighting. In this work we contend that an effective approach can be found by employing Nested Sampling~\cite{Skilling:2006gxv} (NS). NS is a sampling algorithm uniquely designed to target numerical estimation of the evidence integral. It is often bracketed with other Markov Chain sampling algorithms, but its ability to navigate strongly peaked multimodal distributions with little to no tuning makes it uniquely able to perform a full marginalisation of a Neural Network, in a manner necessary for this work. As a full numerical integration algorithm, the dimensionality of space that the sampling can cover is typically limited to \ofOrder{1000} parameters. Whilst this is much smaller than desired for full networks, probabilistic techniques with similar dimensionality limitations, such as Gaussian Processes, have been successfully embedded in scalable frameworks~\cite{DBLP:journals/corr/WilsonHSX15,2017arXiv171100165L}. %\TODO{Might be worth mentioning other partition function/evidence/marginal likelihood tools and why NS over these, e.g.\cite{pmlr-v51-frellsen16}} In this work we use the \PC implementation of NS~\cite{Handley:2015vkr}. The resolution of the algorithm is defined by the number of live points used and is set to $\nlive=1000$, the number of initial prior samples is also boosted to ensure exploration of the space and set to $\nprior=\nlive\times 10$. All other hyperparameters are set to the defaults. The chain files are analysed using the \texttt{anesthetic} package~\cite{anesthetic}, which is used to create the corner plots. The NNs were built on the \texttt{flax}~\cite{flax2020github} NN library within the \texttt{jax} ecosystem~\cite{jax2018github}. \section{A concise example of multimodality}\label{sec:ex} In this section we propose a minimal classification problem, that can be fit with a minimal NN. We choose to examine a modified noisy version of an XOR logic gate problem. Samples forming the training set are drawn from cardinal signal points $(x_1,x_2) =\{(1,1), (-1,-1), (1,-1), (-1,1) \}$, with diagonal points being assigned a class $y=1$ and anti-diagonal points classified as $y=0$. The samples have two sources of noise added to remove any actual symmetry from the problem (with the intention of isolating degenerate symmetries in the weight space): samples are drawn unevenly from each cardinal point with $\{30, 100, 10, 80\}$ draws respectively, and to each sampled point Gaussian noise with a mean of 0 and a variance of 0.5 is added. Arbitrary test datasets can be made either by repeating the sampling with different cardinal populations or alternative noise seeds. A minimal neural network is chosen to fit this composed of a single hidden layer with $N=2$ nodes, passed through sigmoid activations with an added bias vector. Such a set up is typically identified as a logistic regression problem. This introduces a network with 9 free parameters, illustrated in \cref{fig:nn}. This network is trained with a standard binary cross entropy loss function\footnote{~{The binary cross entropy being the log of a categorical likelihood with two classes}}. Both the proposed problem and trained network solution are shown in \cref{fig:target}. 
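% An illustrative summary (not from the original manuscript) of the data-generating setup and parameter count just described: each training point is a zero-mean Gaussian perturbation (variance 0.5 per component) about a cardinal point,
\begin{equation*}
    (x_1, x_2) = c + \epsilon, \qquad c \in \{(1,1),\,(-1,-1),\,(1,-1),\,(-1,1)\}, \qquad \epsilon \sim \mathcal{N}\left(0,\, 0.5\,\mathbb{I}_2\right),
\end{equation*}
% with $\{30, 100, 10, 80\}$ draws from the four cardinal points respectively, and labels $y=1$ (diagonal) or $y=0$ (anti-diagonal).
% The 9 free parameters decompose as $2\times2+2=6$ weights and biases into the hidden layer plus $2\times1+1=3$ into the output node.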
% \begin{figure}
% \centering
% \includegraphics[width=0.6\textwidth]{figures/nn.pdf}
% \caption{The minimal Neural Network used in this example.\label{fig:nn}}
% \end{figure}
\begin{figure}
\centering
\subfloat[The minimal Neural Network used in this example.\label{fig:nn}]{\includegraphics[width=0.6\textwidth]{figures/nn.pdf}} \\
\subfloat[The target training data, and trained model on the noisy XOR problem.\label{fig:target}]{\includegraphics[width=1.0\textwidth]{figures/samples_map_targ.pdf}}
\caption{The chosen network architecture (\cref{fig:nn}) and target function to fit in this challenge (\cref{fig:target}). The left panel of \cref{fig:target} shows the initial cardinal signal points, with the noisy, unevenly sampled data alongside. The right panel of the same figure retains the same noisy training data overlaid on a prediction from the trained NN across the whole parameter space.}
\end{figure}
The network predictions shown in \cref{fig:target} are built by taking the mean prediction of the binary classifier $y'$ over 1000 posterior samples of $\theta$. These samples are drawn from the posterior calculated as the result of using Nested Sampling to fully marginalise the network likelihood, as detailed in \cref{sec:NSforNN}. In the model prediction figures throughout this work we display the mean of the vector of predicted $y'$ over a regular grid of input data $(x_1,x_2)$ with spacing 0.1 covering the full set of training points. We can further interrogate the structure of these solutions by examining how the posterior samples of network parameters, $\theta$, are distributed. The corner plots of these posteriors are shown in \cref{fig:posterior}. For simplicity only the parameters of the weight matrix connecting the inputs to the hidden nodes are shown. This figure displays posterior samples for networks trained with two different choices of prior, with an important additional note that both choices of prior give functionally equivalent model predictions (\emph{e.g.} the right panel in \cref{fig:target}). The Nested Sampling algorithm realised in \PC keeps track of the number of posterior modes it finds during its evolution. We define a prominent mode as any mode whose local evidence estimate is no smaller than $10^{-3}$ times that of the mode with the highest local evidence. This definition implies that we only select modes that contribute on average at least 1 sample to the 1000 posterior samples used to infer the figures in this work. By this definition the full prior finds 14 distinct prominent modes, whereas the proposed prior constraint yields only 2 prominent posterior modes.
\begin{figure}
\centering
\subfloat[\label{fig:posta}]{\includegraphics[height=.4\textheight]{figures/full_prior_layer_0.pdf}} \\
\subfloat[\label{fig:postb}]{\includegraphics[height=.4\textheight]{figures/rejected_prior_layer_0.pdf}}
\caption{Posterior samples from two choices of prior for a dense layer, where the parameters shown are only the weight matrix parameters connecting the input data to the hidden nodes. A Uniform distribution from (-5,5) forms the base weight prior of both figures, with \cref{fig:postb} rejecting samples not following the recipe given in \cref{sec:prior}.
The distinct modes in \cref{fig:postb} have been separated and coloured differently; the sum of these densities gives a functionally equivalent network to the red posterior in \cref{fig:posta}.\label{fig:posterior}}
\end{figure}
The proposed form of prior for dense layers greatly simplified the composition of the target function, from being a weighted superposition of 14 modes to only 2, with no apparent change in the quality of the predictions. This can be further investigated by examining the individual modes found against the full solution already presented, as shown in \cref{fig:diagnostic}. The rows of this figure display three comparison diagnostics, with the columns replicating the diagnostics for each mode as well as the full solution. The top row of panels displays the average log likelihood, $\langle \log \like \rangle$, across the training and test dataset. A test dataset has been formed for this purpose by taking ten repeated samples of the same cardinal point proportions as in the training set, but with a different Gaussian noise seed for each set. The violin density estimate is again derived from an ensemble of predictions made using 1000 posterior samples (from the individual modes and the full posterior respectively), with the error bars in black showing the mean $\langle \log \like \rangle$ from each posterior set and extending to the most extreme values in the ensemble. Overlaid across all three panels are the mean values from the full solution. The maximum a posteriori mode (which also contains the maximum likelihood solution in this set) is Mode 1. Whilst this mode gives the best performing (highest average $\langle \log \like \rangle$) solution on the training set, it generalises poorly to the test set, under-performing the full solution on average. Mode 2 gives a more robust and generalisable solution, outperforming the full solution on the test set on average; however, it captures the training set less successfully. It is worth noting that this kind of ``local minimum'' in the loss landscape is deliberately suppressed in nearly all NN training schemes. The full solution, as an evidence weighted superposition of the two modes, combines the best features of both. The second row of panels displays the log evidence of each solution, $\logZ$. The log evidence of each mode effectively defines the full solution, with the ratio of evidences giving the ratio of contribution of each mode to the full solution. The bottom row of panels shows the predictions of each solution across the input data space, using the same scale and format as the corresponding panel in \cref{fig:target}.
\begin{figure}
\centering
\includegraphics[width=1.0\textwidth]{figures/samples_train_test.pdf}
\caption{Comparison diagnostics for the two posterior modes found when using the constrained prior, as well as the corresponding full solution. The $\langle \log \like \rangle$ and \logZ diagnostics for each solution are colour coded corresponding to the colours used in \cref{fig:posterior}, with the Model Prediction diagnostic corresponding to the colouring and scale used in \cref{fig:target}. \label{fig:diagnostic}}
\end{figure}
\subsection{Discussion}
The noisy XOR gate problem demonstrated in this work provides a clear illustration of multimodal behaviour in NN posteriors, present in even the most minimal network. This poses challenges both for future work using the principled marginalisation approach we proposed in this study, and more generally for robust inference with any NN.
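% An illustrative sketch (not from the original manuscript), in the notation of \cref{sec:bayes}, of the evidence weighting and posterior-mean prediction described above:
\begin{equation*}
    \anyg{\post}{\theta}{D} = \sum_{m} \frac{\mathcal{Z}_m}{\sum_{m'} \mathcal{Z}_{m'}}\, \anyg{\post}{\theta}{D, m}, \qquad \expectation{y'(x)} \approx \frac{1}{S}\sum_{s=1}^{S} y'(x;\theta_s), \quad \theta_s \sim \anyg{\post}{\theta}{D},
\end{equation*}
% where $\mathcal{Z}_m$ is the local evidence of mode $m$ and $S=1000$ posterior samples are used for the figures in this work.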
There are however some marked differences between the stylised challenge presented here and the generally accepted standard practices for NN inference, here we consider how some of these standard practices would impact the patterns demonstrated. A notable feature that only emerges from the full solution as demonstrated in \cref{fig:diagnostic}, is the complexity of the model achieved. It is not possible to represent the decision boundary that the full multimodal solution constructs with any single point estimate network weights given the fixed finite network size. Correctly posterior weighting two candidate models gives an inferred decision boundary that would only be achievable with a point estimate solution for a higher dimensional network. It is not uncommon for people to consider Bayesian inference as simply giving an uncertainty on a prediction, but in this case it is doing something grander, allowing construction of higher dimension solutions by a mixture of models. A larger model with standard regularization terms (or equivalently non-flat priors) could be proposed as a way of constructing a more unimodal solution. Reshaping the prior will influence the evidence calculation, and hence the ratio of local evidences between modes, but there is little motivation to think that this will consistently decide in favour of a particular solution. Furthermore we contend that increasing the dimensionality of the model will likely make the potential for competitive models at scale worse. Reading further into the diagnostic plot, the maximum likelihood solution (Mode 1) is demonstrated to have worse generalisability than the full solution. By allowing a weighted sum of two distinct models to contribute to our full solution, a network that is more robust to noise is constructed. Whilst this in and of itself is a useful trait for any network, there is a secondary fact about principled Bayesian inference on display here, robust performance in low data/high noise environments. Taking the typical argument that Neural Networks are data hungry, this challenge could be refuted by requiring a much larger set of data samples for training, implicitly averaging over more samples of data noise at training time. Whilst this would resolve much of the challenge in this case, this is an ad-hoc requirement for applicability of NNs and furthermore an inadequate defence if we want to make sensitive statements in the massive model and data limits employed in cutting edge large model NN inference~\cite{DBLP:journals/corr/abs-2110-09485}. Lastly, a potential refutation of this challenge is to invoke some kind of ensemble based method as a vehicle to capture a set of candidate alternative models. The correct weighting of the two proposed solutions in our fully marginalised approach is the result of a convergent calculation of the exact volume under the likelihood surface over the prior. Any method that does not guarantee that the ratio of densities is correct will have a vanishingly small chance of getting this ratio correct, the rate of attraction of a gradient boosted path (whether that be via an optimizer or an HMC chain) will not necessarily be proportional to the posterior mass of a mode. 
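% An illustrative formalisation (not from the original manuscript) of the ``volume under the likelihood surface over the prior'' referred to above, with $\mathcal{M}_m$ denoting the region of parameter space occupied by mode $m$:
\begin{equation*}
    \mathcal{Z}_m = \int_{\theta \in \mathcal{M}_m} \anyg{\like}{D}{\theta}\, \prior(\theta)\, \intd\theta,
\end{equation*}
% so a mode's weight in the full solution tracks its posterior mass, rather than how often an optimiser or a chain happens to settle in it.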
% Important features to cover/note:
% \begin{itemize}
% \item The fitted model is not possible to express with the network architecture; evidence weighting two smaller models can build more complicated dynamics that would typically require a higher dimensional network to express
% \item The maximum likelihood solution has worse generalisability, clear and definable simple benchmark for any method to pass.
% \item Other sampling approaches can benefit from a better defined prior over the space of equations a dense network represents. Even initialization of optimized networks can be guided by this knowledge.
% \end{itemize}

\section{Conclusion}\label{sec:conc}
In this work we presented a concise example of multimodal posteriors in Neural Networks and the case for capturing them with numerical marginalisation. By reducing the Neural Network to its most minimal form and choosing a minimal, simple learning problem, we posed a paradox for inference with Neural Networks. In this paradox, a learning scheme that focuses on the network parameters that maximise the likelihood on the training data is at odds with the aims of generalisability and explainability of the network. It is only by abandoning point-estimate, optimisation-based approaches and adopting a sampling scheme that can handle multimodal posterior distributions that the most robust solution to this problem can be obtained. Two particularly compelling features of the solution that full numerical marginalisation can uniquely obtain were highlighted: the ability to reason between multiple candidate models, and the ability to reason even when data are limited and noise is sizeable. Both show characteristics of what could be called intelligence. The prevailing wind of current Machine Learning research blows in the direction of increasingly large models and large datasets. There are clear advantages, and some remarkably compelling results, emerging from work at this scale. The orthogonal viewpoint we presented in this work offers features that can complement the striking successes of modern Machine Learning at scale. A more principled, marginal-likelihood-based understanding of some of the parameters in a larger model could be a path towards bringing together principled probabilistic models with inference at massive scale.

\begin{ack}
DY, WH and ANL were supported by STFC grant ST/T001054/1. DY \& WH were also supported by a Royal Society University Research Fellowship and enhancement award. \texttt{PolyChord} is licensed to \href{https://raw.githubusercontent.com/PolyChord/PolyChordLite/master/LICENCE}{PolyChord Ltd.}; \texttt{PolyChordLite} is free for academic use. This work was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (\url{www.csd3.cam.ac.uk}), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council (\url{www.dirac.ac.uk}).
%Use unnumbered first level headings for the acknowledgments. All acknowledgments %go at the end of the paper before the list of references. Moreover, you are required to declare %funding (financial activities supporting the submitted work) and competing interests (related financial activities outside the submitted work).
%More information about this disclosure can be found at: \url{https://neurips.cc/Conferences/2022/PaperInformation/FundingDisclosure}. %Do {\bf not} include this section in the anonymized submission, only in the final paper. You can use the \texttt{ack} environment provided in the style file to automatically hide this section in the anonymized submission. \end{ack} % \section*{References} % References follow the acknowledgments. Use unnumbered first-level heading for % the references. Any choice of citation style is acceptable as long as you are % consistent. It is permissible to reduce the font size to \verb+small+ (9 point) % when listing the references. % Note that the Reference section does not count towards the page limit. % \medskip \bibliography{references} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \section*{Checklist} %%% BEGIN INSTRUCTIONS %%% % The checklist follows the references. Please % read the checklist guidelines carefully for information on how to answer these % questions. For each question, change the default \answerTODO{} to \answerYes{}, % \answerNo{}, or \answerNA{}. You are strongly encouraged to include a {\bf % justification to your answer}, either by referencing the appropriate section of % your paper or providing a brief inline description. For example: % \begin{itemize} % \item Did you include the license to the code and datasets? \answerYes{See Section gen\_inst.} % \item Did you include the license to the code and datasets? \answerNo{The code and the data are proprietary.} % \item Did you include the license to the code and datasets? \answerNA{} % \end{itemize} % Please do not modify the questions and only use the provided macros for your % answers. Note that the Checklist section does not count towards the page % limit. In your paper, please delete this instructions block and only keep the % Checklist section heading above along with the questions/answers below. %%% END INSTRUCTIONS %%% % \begin{enumerate} % \item For all authors... % \begin{enumerate} % \item Do the main claims made in the abstract and introduction accurately reflect the paper's contributions and scope? % \answerYes{In \cref{sec:conc} we provided a clear and minimal example of the largely ignored failure mode of probabilistic approaches to NNs, multimodal posteriors} % \item Did you describe the limitations of your work? % \answerYes{We highlighted clearly the dimensional limitations in \cref{sec:NSforNN} as well as highlighting a more popular technique with similar scaling and a stated intention in the conclusion to pursue such last layer methods} % \item Did you discuss any potential negative societal impacts of your work? % \answerNA{} % \item Have you read the ethics review guidelines and ensured that your paper conforms to them? % \answerYes{} % \end{enumerate} % \item If you are including theoretical results... % \begin{enumerate} % \item Did you state the full set of assumptions of all theoretical results? % \answerNA{This work is somewhere between theoretical and practical, the paradox presented and exact practical calculation details (with publicly available software), are all stated} % \item Did you include complete proofs of all theoretical results? % \answerNA{} % \end{enumerate} % \item If you ran experiments... % \begin{enumerate} % \item Did you include the code, data, and instructions needed to reproduce the main experimental results (either in the supplemental material or as a URL)?
% \answerYes{The problem is minimal enough that it is considered fully detailed to the point of reproducibility in the text} % \item Did you specify all the training details (e.g., data splits, hyperparameters, how they were chosen)? % \answerYes{} % \item Did you report error bars (e.g., with respect to the random seed after running experiments multiple times)? % \answerYes{} % \item Did you include the total amount of compute and the type of resources used (e.g., type of GPUs, internal cluster, or cloud provider)? % \answerNA{Examples are so small they are run with trivial compute resources} % \end{enumerate} % \item If you are using existing assets (e.g., code, data, models) or curating/releasing new assets... % \begin{enumerate} % \item If your work uses existing assets, did you cite the creators? % \answerYes{} % \item Did you mention the license of the assets? % \answerYes{Explicitly in the acknowledgement, not in the anonymous submission} % \item Did you include any new assets either in the supplemental material or as a URL? % \answerNA{} % \item Did you discuss whether and how consent was obtained from people whose data you're using/curating? % \answerNA{} % \item Did you discuss whether the data you are using/curating contains personally identifiable information or offensive content? % \answerNA{} % \end{enumerate} % \item If you used crowdsourcing or conducted research with human subjects... % \begin{enumerate} % \item Did you include the full text of instructions given to participants and screenshots, if applicable? % \answerNA{} % \item Did you describe any potential participant risks, with links to Institutional Review Board (IRB) approvals, if applicable? % \answerNA{} % \item Did you include the estimated hourly wage paid to participants and the total amount spent on participant compensation? % \answerNA{} % \end{enumerate} % \end{enumerate} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % \appendix % \section{Appendix} % Optionally include extra information (complete proofs, additional experiments and plots) in the appendix. % This section will often be part of the supplemental material. \end{document} ``` 4. **Bibliographic Information:** ```bbl \begin{thebibliography}{10} \bibitem{2021arXiv210604015N} {Nado}, Z., N.~{Band}, M.~{Collier}, et~al. \newblock {Uncertainty Baselines: Benchmarks for Uncertainty \& Robustness in Deep Learning}. \newblock \emph{arXiv e-prints}, arXiv:2106.04015, 2021. \bibitem{maddox2019simple} Maddox, W.~J., P.~Izmailov, T.~Garipov, et~al. \newblock A simple baseline for bayesian uncertainty in deep learning. \newblock \emph{Advances in Neural Information Processing Systems}, 32, 2019. \bibitem{foong2020expressiveness} Foong, A., D.~Burt, Y.~Li, et~al. \newblock On the expressiveness of approximate inference in bayesian neural networks. \newblock \emph{Advances in Neural Information Processing Systems}, 33:15897--15908, 2020. \bibitem{2020arXiv200202405W} {Wenzel}, F., K.~{Roth}, B.~S. {Veeling}, et~al. \newblock {How Good is the Bayes Posterior in Deep Neural Networks Really?} \newblock \emph{arXiv e-prints}, arXiv:2002.02405, 2020. \bibitem{bsr} Higson, E., W.~Handley, M.~Hobson, et~al. \newblock {Bayesian sparse reconstruction: a brute-force approach to astronomical imaging and machine learning}. \newblock \emph{Monthly Notices of the Royal Astronomical Society}, 483(4):4828--4846, 2018. \bibitem{javid2020compromisefree} Javid, K., W.~Handley, M.~Hobson, et~al. \newblock Compromise-free bayesian neural networks, 2020. 
\bibitem{DBLP:journals/corr/abs-2104-13478} Bronstein, M.~M., J.~Bruna, T.~Cohen, et~al. \newblock Geometric deep learning: Grids, groups, graphs, geodesics, and gauges. \newblock \emph{CoRR}, abs/2104.13478, 2021. \bibitem{mackay} MacKay, D. J.~C. \newblock \emph{Information Theory, Inference \& Learning Algorithms}. \newblock Cambridge University Press, USA, 2002. \bibitem{10.1162/neco.1994.6.3.543} Kůrková, V., P.~C. Kainen. \newblock {Functionally Equivalent Feedforward Neural Networks}. \newblock \emph{Neural Computation}, 6(3):543--558, 1994. \bibitem{bishop} Bishop, C.~M. \newblock \emph{Pattern Recognition and Machine Learning (Information Science and Statistics)}. \newblock Springer-Verlag, Berlin, Heidelberg, 2006. \bibitem{equiv} Lim, L.-H., B.~J. Nelson. \newblock What is an equivariant neural network?, 2022. \bibitem{tomczak2020efficient} Tomczak, M., S.~Swaroop, R.~Turner. \newblock Efficient low rank gaussian variational inference for neural networks. \newblock \emph{Advances in Neural Information Processing Systems}, 33:4610--4622, 2020. \bibitem{rudnertractable} Rudner, T.~G., Z.~Chen, Y.~W. Teh, et~al. \newblock Tractable function-space variational inference in bayesian neural networks. \bibitem{2018arXiv181003958W} {Wu}, A., S.~{Nowozin}, E.~{Meeds}, et~al. \newblock {Deterministic Variational Inference for Robust Bayesian Neural Networks}. \newblock \emph{arXiv e-prints}, arXiv:1810.03958, 2018. \bibitem{cobb2019semi} Cobb, A.~D., A.~G. Baydin, I.~Kiskin, et~al. \newblock Semi-separable hamiltonian monte carlo for inference in bayesian neural networks. \newblock In \emph{Advances in Neural Information Processing Systems Workshop on Bayesian Deep Learning}. 2019. \bibitem{cobb2021scaling} Cobb, A.~D., B.~Jalaian. \newblock Scaling hamiltonian monte carlo inference for bayesian neural networks with symmetric splitting. \newblock In \emph{Uncertainty in Artificial Intelligence}, pages 675--685. PMLR, 2021. \bibitem{2020arXiv200110995W} {Wilson}, A.~G. \newblock {The Case for Bayesian Deep Learning}. \newblock \emph{arXiv e-prints}, arXiv:2001.10995, 2020. \bibitem{Skilling:2006gxv} Skilling, J. \newblock {Nested sampling for general Bayesian computation}. \newblock \emph{Bayesian Analysis}, 1(4):833--859, 2006. \bibitem{DBLP:journals/corr/WilsonHSX15} Wilson, A.~G., Z.~Hu, R.~Salakhutdinov, et~al. \newblock Deep kernel learning. \newblock \emph{CoRR}, abs/1511.02222, 2015. \bibitem{2017arXiv171100165L} {Lee}, J., Y.~{Bahri}, R.~{Novak}, et~al. \newblock {Deep Neural Networks as Gaussian Processes}. \newblock \emph{arXiv e-prints}, arXiv:1711.00165, 2017. \bibitem{Handley:2015vkr} Handley, W.~J., M.~P. Hobson, A.~N. Lasenby. \newblock {polychord: next-generation nested sampling}. \newblock \emph{Mon. Not. Roy. Astron. Soc.}, 453(4):4385--4399, 2015. \bibitem{anesthetic} Handley, W. \newblock anesthetic: nested sampling visualisation. \newblock \emph{The Journal of Open Source Software}, 4(37):1414, 2019. \bibitem{flax2020github} Heek, J., A.~Levskaya, A.~Oliver, et~al. \newblock \emph{{F}lax: A neural network library and ecosystem for {JAX}}, 2020. \bibitem{jax2018github} Bradbury, J., R.~Frostig, P.~Hawkins, et~al. \newblock \emph{{JAX}: composable transformations of {P}ython+{N}um{P}y programs}, 2018. \bibitem{DBLP:journals/corr/abs-2110-09485} Balestriero, R., J.~Pesenti, Y.~LeCun. \newblock Learning in high dimension always amounts to extrapolation. \newblock \emph{CoRR}, abs/2110.09485, 2021. \end{thebibliography} ``` 5. 
**Author Information:** - Lead Author: {'name': 'David Yallup'} - Full Authors List: ```yaml David Yallup: postdoc: start: 2021-01-10 thesis: null original_image: images/originals/david_yallup.jpg image: /assets/group/images/david_yallup.jpg links: ORCiD: https://orcid.org/0000-0003-4716-5817 linkedin: https://www.linkedin.com/in/dyallup/ Will Handley: pi: start: 2020-10-01 thesis: null postdoc: start: 2016-10-01 end: 2020-10-01 thesis: null phd: start: 2012-10-01 end: 2016-09-30 supervisors: - Anthony Lasenby - Mike Hobson thesis: 'Kinetic initial conditions for inflation: theory, observation and methods' original_image: images/originals/will_handley.jpeg image: /assets/group/images/will_handley.jpg links: Webpage: https://willhandley.co.uk Mike Hobson: coi: start: 2012-10-01 thesis: null image: https://www.phy.cam.ac.uk/wp-content/uploads/2025/04/hobson-150x150.jpg links: Department webpage: https://www.phy.cam.ac.uk/directory/hobsonm Anthony Lasenby: coi: start: 2012-10-01 thesis: null image: https://www.phy.cam.ac.uk/wp-content/uploads/2025/04/lasenby-150x150.jpg links: Department webpage: https://www.phy.cam.ac.uk/directory/lasenbya Pablo Lemos: {} ``` This YAML file provides a concise snapshot of an academic research group. It lists members by name along with their academic roles—ranging from Part III and summer projects to MPhil, PhD, and postdoctoral positions—with corresponding dates, thesis topics, and supervisor details. Supplementary metadata includes image paths and links to personal or departmental webpages. A dedicated "coi" section profiles senior researchers, highlighting the group’s collaborative mentoring network and career trajectories in cosmology, astrophysics, and Bayesian data analysis. ==================================================================================== Final Output Instructions ==================================================================================== - Combine all data sources to create a seamless, engaging narrative. - Follow the exact Markdown output format provided at the top. - Do not include any extra explanation, commentary, or wrapping beyond the specified Markdown. - Validate that every bibliographic reference with a DOI or arXiv identifier is converted into a Markdown link as per the examples. - Validate that every Markdown author link corresponds to a link in the author information block. - Before finalizing, confirm that no LaTeX citation commands or other undesired formatting remain. - Before finalizing, confirm that the link to the paper itself [2205.11151](https://arxiv.org/abs/2205.11151) is featured in the first sentence. Generate only the final Markdown output that meets all these requirements. {% endraw %}