{% raw %}
Title: Create a Markdown Blog Post Integrating Research Details and a Featured Paper
====================================================================================

This task involves generating a Markdown file (ready for a GitHub-served Jekyll site) that integrates our research details with a featured research paper. The output must follow the exact format and conventions described below.

====================================================================================
Output Format (Markdown):
------------------------------------------------------------------------------------
---
layout: post
title: "Gravitational-wave inference at GPU speed: A bilby-like nested sampling kernel within blackjax-ns"
date: 2025-09-04
categories: papers
---

![AI generated image](/assets/images/posts/2025-09-04-2509.04336.png)

Metha Prathaban, David Yallup, James Alvey, Ming Yang, Will Templeton, Will Handley

Content generated by [gemini-2.5-pro](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/2025-09-04-2509.04336.txt).

Image generated by [imagen-4.0-generate-001](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/2025-09-04-2509.04336.txt).
------------------------------------------------------------------------------------

====================================================================================
Please adhere strictly to the following instructions:
====================================================================================
Section 1: Content Creation Instructions
====================================================================================

1. **Generate the Page Body:**
   - Write a well-composed, engaging narrative that is suitable for a scholarly audience interested in advanced AI and astrophysics.
   - Ensure the narrative is original and reflective of the tone, style, and content of the "Homepage Content" block (provided below), but do not reuse its content.
   - Use bullet points, subheadings, or other formatting to enhance readability.

2. **Highlight Key Research Details:**
   - Emphasize the contributions and impact of the paper, focusing on its methodology, significance, and context within current research.
   - Specifically highlight the lead author ({'name': 'Metha Prathaban'}). When referencing any author, use Markdown links from the Author Information block (choose academic or GitHub links over social media).

3. **Integrate Data from Multiple Sources:**
   - Seamlessly weave information from the following:
     - **Paper Metadata (YAML):** Essential details including the title and authors.
     - **Paper Source (TeX):** Technical content from the paper.
     - **Bibliographic Information (bbl):** Extract bibliographic references.
     - **Author Information (YAML):** Profile details for constructing Markdown links.
   - Merge insights from the Paper Metadata, TeX source, Bibliographic Information, and Author Information blocks into a coherent narrative—do not treat these as separate or isolated pieces.
   - Insert the generated narrative between the HTML comments: and

4. **Generate Bibliographic References:**
   - Review the Bibliographic Information block carefully.
   - For each reference that includes a DOI or arXiv identifier:
     - For DOIs, generate a link formatted as: [10.1234/xyz](https://doi.org/10.1234/xyz)
     - For arXiv entries, generate a link formatted as: [2103.12345](https://arxiv.org/abs/2103.12345)
   - **Important:** Do not use any LaTeX citation commands (e.g., `\cite{...}`).
Every reference must be rendered directly as a Markdown link. For example, instead of `\cite{mycitation}`, output `[mycitation](https://doi.org/mycitation)` - **Incorrect:** `\cite{10.1234/xyz}` - **Correct:** `[10.1234/xyz](https://doi.org/10.1234/xyz)` - Ensure that at least three (3) of the most relevant references are naturally integrated into the narrative. - Ensure that the link to the Featured paper [2509.04336](https://arxiv.org/abs/2509.04336) is included in the first sentence. 5. **Final Formatting Requirements:** - The output must be plain Markdown; do not wrap it in Markdown code fences. - Preserve the YAML front matter exactly as provided. ==================================================================================== Section 2: Provided Data for Integration ==================================================================================== 1. **Homepage Content (Tone and Style Reference):** ```markdown --- layout: home --- ![AI generated image](/assets/images/index.png) The Handley Research Group stands at the forefront of cosmological exploration, pioneering novel approaches that fuse fundamental physics with the transformative power of artificial intelligence. We are a dynamic team of researchers, including PhD students, postdoctoral fellows, and project students, based at the University of Cambridge. Our mission is to unravel the mysteries of the Universe, from its earliest moments to its present-day structure and ultimate fate. We tackle fundamental questions in cosmology and astrophysics, with a particular focus on leveraging advanced Bayesian statistical methods and AI to push the frontiers of scientific discovery. Our research spans a wide array of topics, including the [primordial Universe](https://arxiv.org/abs/1907.08524), [inflation](https://arxiv.org/abs/1807.06211), the nature of [dark energy](https://arxiv.org/abs/2503.08658) and [dark matter](https://arxiv.org/abs/2405.17548), [21-cm cosmology](https://arxiv.org/abs/2210.07409), the [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1807.06209), and [gravitational wave astrophysics](https://arxiv.org/abs/2411.17663). ### Our Research Approach: Innovation at the Intersection of Physics and AI At The Handley Research Group, we develop and apply cutting-edge computational techniques to analyze complex astronomical datasets. Our work is characterized by a deep commitment to principled [Bayesian inference](https://arxiv.org/abs/2205.15570) and the innovative application of [artificial intelligence (AI) and machine learning (ML)](https://arxiv.org/abs/2504.10230). **Key Research Themes:** * **Cosmology:** We investigate the early Universe, including [quantum initial conditions for inflation](https://arxiv.org/abs/2002.07042) and the generation of [primordial power spectra](https://arxiv.org/abs/2112.07547). We explore the enigmatic nature of [dark energy, using methods like non-parametric reconstructions](https://arxiv.org/abs/2503.08658), and search for new insights into [dark matter](https://arxiv.org/abs/2405.17548). A significant portion of our efforts is dedicated to [21-cm cosmology](https://arxiv.org/abs/2104.04336), aiming to detect faint signals from the Cosmic Dawn and the Epoch of Reionization. * **Gravitational Wave Astrophysics:** We develop methods for [analyzing gravitational wave signals](https://arxiv.org/abs/2411.17663), extracting information about extreme astrophysical events and fundamental physics. 
* **Bayesian Methods & AI for Physical Sciences:** A core component of our research is the development of novel statistical and AI-driven methodologies. This includes advancing [nested sampling techniques](https://arxiv.org/abs/1506.00171) (e.g., [PolyChord](https://arxiv.org/abs/1506.00171), [dynamic nested sampling](https://arxiv.org/abs/1704.03459), and [accelerated nested sampling with $\beta$-flows](https://arxiv.org/abs/2411.17663)), creating powerful [simulation-based inference (SBI) frameworks](https://arxiv.org/abs/2504.10230), and employing [machine learning for tasks such as radiometer calibration](https://arxiv.org/abs/2504.16791), [cosmological emulation](https://arxiv.org/abs/2503.13263), and [mitigating radio frequency interference](https://arxiv.org/abs/2211.15448). We also explore the potential of [foundation models for scientific discovery](https://arxiv.org/abs/2401.00096). **Technical Contributions:** Our group has a strong track record of developing widely-used scientific software. Notable examples include: * [**PolyChord**](https://arxiv.org/abs/1506.00171): A next-generation nested sampling algorithm for Bayesian computation. * [**anesthetic**](https://arxiv.org/abs/1905.04768): A Python package for processing and visualizing nested sampling runs. * [**GLOBALEMU**](https://arxiv.org/abs/2104.04336): An emulator for the sky-averaged 21-cm signal. * [**maxsmooth**](https://arxiv.org/abs/2007.14970): A tool for rapid maximally smooth function fitting. * [**margarine**](https://arxiv.org/abs/2205.12841): For marginal Bayesian statistics using normalizing flows and KDEs. * [**fgivenx**](https://arxiv.org/abs/1908.01711): A package for functional posterior plotting. * [**nestcheck**](https://arxiv.org/abs/1804.06406): Diagnostic tests for nested sampling calculations. ### Impact and Discoveries Our research has led to significant advancements in cosmological data analysis and yielded new insights into the Universe. Key achievements include: * Pioneering the development and application of advanced Bayesian inference tools, such as [PolyChord](https://arxiv.org/abs/1506.00171), which has become a cornerstone for cosmological parameter estimation and model comparison globally. * Making significant contributions to the analysis of major cosmological datasets, including the [Planck mission](https://arxiv.org/abs/1807.06209), providing some of the tightest constraints on cosmological parameters and models of [inflation](https://arxiv.org/abs/1807.06211). * Developing novel AI-driven approaches for astrophysical challenges, such as using [machine learning for radiometer calibration in 21-cm experiments](https://arxiv.org/abs/2504.16791) and [simulation-based inference for extracting cosmological information from galaxy clusters](https://arxiv.org/abs/2504.10230). * Probing the nature of dark energy through innovative [non-parametric reconstructions of its equation of state](https://arxiv.org/abs/2503.08658) from combined datasets. * Advancing our understanding of the early Universe through detailed studies of [21-cm signals from the Cosmic Dawn and Epoch of Reionization](https://arxiv.org/abs/2301.03298), including the development of sophisticated foreground modelling techniques and emulators like [GLOBALEMU](https://arxiv.org/abs/2104.04336). 
* Developing new statistical methods for quantifying tensions between cosmological datasets ([Quantifying tensions in cosmological parameters: Interpreting the DES evidence ratio](https://arxiv.org/abs/1902.04029)) and for robust Bayesian model selection ([Bayesian model selection without evidences: application to the dark energy equation-of-state](https://arxiv.org/abs/1506.09024)). * Exploring fundamental physics questions such as potential [parity violation in the Large-Scale Structure using machine learning](https://arxiv.org/abs/2410.16030). ### Charting the Future: AI-Powered Cosmological Discovery The Handley Research Group is poised to lead a new era of cosmological analysis, driven by the explosive growth in data from next-generation observatories and transformative advances in artificial intelligence. Our future ambitions are centred on harnessing these capabilities to address the most pressing questions in fundamental physics. **Strategic Research Pillars:** * **Next-Generation Simulation-Based Inference (SBI):** We are developing advanced SBI frameworks to move beyond traditional likelihood-based analyses. This involves creating sophisticated codes for simulating [Cosmic Microwave Background (CMB)](https://arxiv.org/abs/1908.00906) and [Baryon Acoustic Oscillation (BAO)](https://arxiv.org/abs/1607.00270) datasets from surveys like DESI and 4MOST, incorporating realistic astrophysical effects and systematic uncertainties. Our AI initiatives in this area focus on developing and implementing cutting-edge SBI algorithms, particularly [neural ratio estimation (NRE) methods](https://arxiv.org/abs/2407.15478), to enable robust and scalable inference from these complex simulations. * **Probing Fundamental Physics:** Our enhanced analytical toolkit will be deployed to test the standard cosmological model ($\Lambda$CDM) with unprecedented precision and to explore [extensions to Einstein's General Relativity](https://arxiv.org/abs/2006.03581). We aim to constrain a wide range of theoretical models, from modified gravity to the nature of [dark matter](https://arxiv.org/abs/2106.02056) and [dark energy](https://arxiv.org/abs/1701.08165). This includes leveraging data from upcoming [gravitational wave observatories](https://arxiv.org/abs/1803.10210) like LISA, alongside CMB and large-scale structure surveys from facilities such as Euclid and JWST. * **Synergies with Particle Physics:** We will continue to strengthen the connection between cosmology and particle physics by expanding the [GAMBIT framework](https://arxiv.org/abs/2009.03286) to interface with our new SBI tools. This will facilitate joint analyses of cosmological and particle physics data, providing a holistic approach to understanding the Universe's fundamental constituents. * **AI-Driven Theoretical Exploration:** We are pioneering the use of AI, including [large language models and symbolic computation](https://arxiv.org/abs/2401.00096), to automate and accelerate the process of theoretical model building and testing. This innovative approach will allow us to explore a broader landscape of physical theories and derive new constraints from diverse astrophysical datasets, such as those from GAIA. Our overarching goal is to remain at the forefront of scientific discovery by integrating the latest AI advancements into every stage of our research, from theoretical modeling to data analysis and interpretation. We are excited by the prospect of using these powerful new tools to unlock the secrets of the cosmos. 
Content generated by [gemini-2.5-pro-preview-05-06](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/content/index.txt). Image generated by [imagen-3.0-generate-002](https://deepmind.google/technologies/gemini/) using [this prompt](/prompts/images/index.txt). ``` 2. **Paper Metadata:** ```yaml !!python/object/new:feedparser.util.FeedParserDict dictitems: id: http://arxiv.org/abs/2509.04336v1 guidislink: true link: https://arxiv.org/abs/2509.04336v1 title: 'Gravitational-wave inference at GPU speed: A bilby-like nested sampling kernel within blackjax-ns' title_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: type: text/plain language: null base: '' value: 'Gravitational-wave inference at GPU speed: A bilby-like nested sampling kernel within blackjax-ns' updated: '2025-09-04T15:55:42Z' updated_parsed: !!python/object/apply:time.struct_time - !!python/tuple - 2025 - 9 - 4 - 15 - 55 - 42 - 3 - 247 - 0 - tm_zone: null tm_gmtoff: null links: - !!python/object/new:feedparser.util.FeedParserDict dictitems: href: https://arxiv.org/abs/2509.04336v1 rel: alternate type: text/html - !!python/object/new:feedparser.util.FeedParserDict dictitems: href: https://arxiv.org/pdf/2509.04336v1 rel: related type: application/pdf title: pdf summary: We present a GPU-accelerated implementation of the gravitational-wave Bayesian inference pipeline for parameter estimation and model comparison. Specifically, we implement the `acceptance-walk' sampling method, a cornerstone algorithm for gravitational-wave inference within the bilby and dynesty framework. By integrating this trusted kernel with the vectorized blackjax-ns framework, we achieve typical speedups of 20-40x for aligned spin binary black hole analyses, while recovering posteriors and evidences that are statistically identical to the original CPU implementation. This faithful re-implementation of a community-standard algorithm establishes a foundational benchmark for gravitational-wave inference. It quantifies the performance gains attributable solely to the architectural shift to GPUs, creating a vital reference against which future parallel sampling algorithms can be rigorously assessed. This allows for a clear distinction between algorithmic innovation and the inherent speedup from hardware. Our work provides a validated community tool for performing GPU-accelerated nested sampling in gravitational-wave data analyses. summary_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: type: text/plain language: null base: '' value: We present a GPU-accelerated implementation of the gravitational-wave Bayesian inference pipeline for parameter estimation and model comparison. Specifically, we implement the `acceptance-walk' sampling method, a cornerstone algorithm for gravitational-wave inference within the bilby and dynesty framework. By integrating this trusted kernel with the vectorized blackjax-ns framework, we achieve typical speedups of 20-40x for aligned spin binary black hole analyses, while recovering posteriors and evidences that are statistically identical to the original CPU implementation. This faithful re-implementation of a community-standard algorithm establishes a foundational benchmark for gravitational-wave inference. It quantifies the performance gains attributable solely to the architectural shift to GPUs, creating a vital reference against which future parallel sampling algorithms can be rigorously assessed. 
This allows for a clear distinction between algorithmic innovation and the inherent speedup from hardware. Our work provides a validated community tool for performing GPU-accelerated nested sampling in gravitational-wave data analyses. tags: - !!python/object/new:feedparser.util.FeedParserDict dictitems: term: gr-qc scheme: http://arxiv.org/schemas/atom label: null - !!python/object/new:feedparser.util.FeedParserDict dictitems: term: astro-ph.HE scheme: http://arxiv.org/schemas/atom label: null - !!python/object/new:feedparser.util.FeedParserDict dictitems: term: astro-ph.IM scheme: http://arxiv.org/schemas/atom label: null published: '2025-09-04T15:55:42Z' published_parsed: !!python/object/apply:time.struct_time - !!python/tuple - 2025 - 9 - 4 - 15 - 55 - 42 - 3 - 247 - 0 - tm_zone: null tm_gmtoff: null arxiv_comment: '13 total pages, 15 total figures (main text: 11 pages, 12 figures; appendix: 2 pages, 3 figures)' arxiv_primary_category: term: gr-qc authors: - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Metha Prathaban - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: David Yallup - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: James Alvey - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Ming Yang - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Will Templeton - !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Will Handley author_detail: !!python/object/new:feedparser.util.FeedParserDict dictitems: name: Will Handley author: Will Handley ``` 3. **Paper Source (TeX):** ```tex % Basic setup. Most papers should leave these options alone. \documentclass[fleqn,usenatbib]{mnras} % MNRAS is set in Times font. If you don't have this installed (most LaTeX % installations will be fine) or prefer the old Computer Modern fonts, comment % out the following line \usepackage{newtxtext,newtxmath} % Depending on your LaTeX fonts installation, you might get better results with one of these: %\usepackage{mathptmx} %\usepackage{txfonts} % Use vector fonts, so it zooms properly in on-screen viewing software % Don't change these lines unless you know what you are doing \usepackage[T1]{fontenc} % Allow "Thomas van Noord" and "Simon de Laguarde" and alike to be sorted by "N" and "L" etc. in the bibliography. % Write the name in the bibliography as "\VAN{Noord}{Van}{van} Noord, Thomas" \DeclareRobustCommand{\VAN}[3]{#2} \let\VANthebibliography\thebibliography \def\thebibliography{\DeclareRobustCommand{\VAN}[3]{##3}\VANthebibliography} %%%%% AUTHORS - PLACE YOUR OWN PACKAGES HERE %%%%% % Only include extra packages if you really need them. Avoid using amssymb if newtxmath is enabled, as these packages can cause conflicts. newtxmatch covers the same math symbols while producing a consistent Times New Roman font. Common packages are: \usepackage{graphicx} % Including figure files \usepackage{amsmath} % Advanced maths commands %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%% AUTHORS - PLACE YOUR OWN COMMANDS HERE %%%%% % Please keep new commands to a minimum, and use \newcommand not \def to avoid % overwriting existing commands. Example: %\newcommand{\pcm}{\,cm$^{-2}$} % per cm-squared %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%%%% TITLE PAGE %%%%%%%%%%%%%%%%%%% % Title of the paper, and the short title which is used in the headers. % Keep the title short and informative. 
\title[GPU-accelerated nested sampling for gravitational waves]{Gravitational-wave inference at GPU speed: A \texttt{bilby}-like nested sampling kernel within \texttt{blackjax-ns}} % The list of authors, and the short list which is used in the headers. % If you need two or more lines of authors, add an extra line using \newauthor \author[Metha Prathaban et al.]{ Metha Prathaban,$^{2,3}$\thanks{E-mail: myp23@cam.ac.uk (MP)} David Yallup,$^{1,2}$ James Alvey,$^{1,2}$ Ming Yang,$^{1}$ Will Templeton,$^{1}$ and Will Handley$^{1,2}$ \\ % List of institutions $^{1}$Institute of Astronomy, University of Cambridge, Cambridge, CB3 0HA, UK\\ $^{2}$Kavli Institute for Cosmology, University of Cambridge, Cambridge, CB3 0EZ, UK\\ $^{3}$Department of Physics, University of Cambridge, Cambridge, CB3 0HE, UK\\ } % These dates will be filled out by the publisher \date{Accepted XXX. Received YYY; in original form ZZZ} % Prints the current year, for the copyright statements etc. To achieve a fixed year, replace the expression with a number. \pubyear{\the\year{}} % Don't change these lines \begin{document} \label{firstpage} \pagerange{\pageref{firstpage}--\pageref{lastpage}} \maketitle % Abstract of the paper \begin{abstract} We present a GPU-accelerated implementation of the gravitational-wave Bayesian inference pipeline for parameter estimation and model comparison. Specifically, we implement the `acceptance-walk' sampling method, a cornerstone algorithm for gravitational-wave inference within the \texttt{bilby} and \texttt{dynesty} framework. By integrating this trusted kernel with the vectorized \texttt{blackjax-ns} framework, we achieve typical speedups of 20-40x for aligned spin binary black hole analyses, while recovering posteriors and evidences that are statistically identical to the original CPU implementation. This faithful re-implementation of a community-standard algorithm establishes a foundational benchmark for gravitational-wave inference. It quantifies the performance gains attributable solely to the architectural shift to GPUs, creating a vital reference against which future parallel sampling algorithms can be rigorously assessed. This allows for a clear distinction between algorithmic innovation and the inherent speedup from hardware. Our work provides a validated community tool for performing GPU-accelerated nested sampling in gravitational-wave data analyses. \vspace{0.5cm} \end{abstract} % Select between one and six entries from the list of approved keywords. % Don't make up new ones. % \begin{keywords} % keyword1 -- keyword2 -- keyword3 % \end{keywords} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%% BODY OF PAPER %%%%%%%%%%%%%%%%%% \section{Introduction} The era of gravitational-wave (GW) astronomy, initiated by the landmark detections at the Laser Interferometer Gravitational-Wave Observatory (LIGO), Virgo and now KAGRA, has revolutionized our view of the cosmos~\citep{GW150914, GW170817,GWTC1,GWTC2, GWTC3,GWTC3_pop_analysis,GWTC2_GR,siren}. Extracting scientific insights from the data, from measuring the masses and spins of binary black holes to performing precision tests of general relativity, relies heavily on the framework of Bayesian inference~\citep{Thrane_2019}. This allows for the estimation of posteriors on source parameters (parameter estimation) and the comparison of competing physical models (model selection). The process of Bayesian inference in GW astronomy is, however, computationally demanding. 
Realistic theoretical models for the GW waveform can be complex, and the stochastic sampling algorithms required to explore the high-dimensional parameter space can require millions of likelihood evaluations per analysis~\citep{LIGO_guide_signalextraction}. There are several community-standard software tools for performing GW inference, such as \texttt{pycbc}~\citep{pycbc} and \texttt{lal}~\citep{lal}. Here, we focus on the inference library \texttt{bilby}~\citep{bilby_paper} paired with a custom implementation of the nested sampler \texttt{dynesty}~\citep{dynesty}, which has proven to be a robust and highly effective framework, tuned for the specific needs of GW posteriors. However, this framework is predominantly executed on central processing units (CPUs), making individual analyses costly and creating a computational bottleneck. This challenge is set to become acute with the increased data volumes from future observing runs~\citep{aLIGO, aVirgo, aLVK_prospects} and the advent of next-generation observatories, such as the Einstein Telescope~\citep{ET_science_case}, which promise unprecedented sensitivity and detection volumes~\citep{HuAccelerationReview}. In response to this challenge, the GW community has begun to leverage the immense parallel processing power of graphics processing units (GPUs). Pioneering work in this domain, such as the \texttt{jimgw} codebase~\citep{wong2023fastgravitationalwaveparameter}, has successfully implemented the GPU-accelerated Markov Chain Monte Carlo (MCMC) sampler \texttt{flowMC}~\citep{flowMC}, paired with GPU implementations of waveform models provided by the \texttt{ripple} library~\citep{ripple}. This work has demonstrated that substantial, orders-of-magnitude speedups in runtime are achievable for GW parameter estimation. This success in accelerating parameter estimation motivates the development of complementary GPU-native algorithms to accelerate other key inference tasks, such as model selection. In this paper, we introduce a GPU-accelerated nested sampling (NS) algorithm for gravitational-wave data analysis. Our method builds upon the trusted `acceptance-walk' sampling method used in the community-standard \texttt{bilby} and \texttt{dynesty} framework~\citep{bilby_paper}. We leverage the \texttt{blackjax}~\citep{jax2018github,cabezas2024blackjax} implementation of nested sampling~\citep{yallup2025nested}, a reformulation of the core nested sampling algorithm for massive GPU parallelization. In place of the default method for particle evolution recommended in~\cite{yallup2025nested}, we develop a custom kernel that implements the `acceptance-walk' method. This ensures our sampler's logic is almost identical to the \texttt{bilby} and \texttt{dynesty} implementation, with differences primarily related to parallelization. This approach offers \texttt{bilby} users a direct path to GPU-accelerated inference for the most expensive problems in GW astronomy. They can achieve significant speedups while retaining the same robust and trusted algorithm at the core of their analyses. A key motivation for this work is to establish a clear performance baseline for GPU-based nested sampling in gravitational-wave astronomy. By adapting the community-standard \texttt{bilby} `acceptance-walk' sampler for a GPU-native framework, we aim to isolate and quantify the speedup achieved from hardware parallelization alone. 
This provides a crucial reference point, enabling future work on novel sampling algorithms to be benchmarked in a way that disentangles algorithmic improvements from architectural performance gains. In the following section, we summarise the core ideas of Bayesian inference, nested sampling and GPU architectures. We then describe the implementation of the `acceptance-walk' sampling method in the \texttt{blackjax-ns} framework in Section~\ref{sec:methods}, and validate it against the \texttt{bilby} and \texttt{dynesty} implementation in Section~\ref{sec:results}, discussing our results. Finally, we present our conclusions in Section~\ref{sec:conclusions}. \begin{figure*} \centering \includegraphics{figures/bilby_blackjax_comparison_4s.pdf} \caption{Recovered posteriors for the 4s signal. The posteriors are in excellent agreement with each other, demonstrating that the \texttt{blackjax-ns} implementation with our custom kernel is functionally equivalent to the \texttt{bilby} + \texttt{dynesty} implementation.} \label{fig:4s_posteriors} \end{figure*} \section{Background} \label{sec:background} \subsection{Bayesian inference in GW astronomy} \label{sec:background_bayes} We provide a brief overview of the core concepts of Bayesian inference as applied to GW astronomy. For a more comprehensive treatment, we refer the reader to~\cite{skilling, Thrane_2019, lal, bilby_paper} and~\cite{LIGO_guide_signalextraction}. The analysis of GW signals is fundamentally a problem of statistical inference, for which the Bayesian framework is the community standard. The relationship between data, $d$, and a set of parameters, $\theta$, under a specific hypothesis, $H$, is given by Bayes' theorem~\citep{Bayes1763}: \begin{equation} \mathcal{P}(\theta|d, H) = \frac{\mathcal{L}(d|\theta, H) \pi(\theta|H)}{Z(d|H)}. \label{eq:bayes_theorem} \end{equation}% Here, the posterior, $\mathcal{P}(\theta|d, H)$, is the probability distribution of the source parameters conditioned on the observed data. It is determined by the likelihood, $\mathcal{L}(d|\theta, H)$, which is the probability of observing the data given a specific realisation of the model, and the prior, $\pi(\theta|H)$, which encodes initial beliefs about the parameter distributions. The denominator is the Bayesian evidence, \begin{equation} Z(d|H) = \int \mathcal{L}(d|\theta, H) \pi(\theta|H) d\theta, \label{eq:evidence} \end{equation} defined as the likelihood integrated over the entire volume of the prior parameter space. There are two pillars of Bayesian inference of particular interest in GW astronomy. The first, parameter estimation, seeks to infer the posterior distribution $\mathcal{P}(\theta|d, H)$ of the source parameters of a signal or population of signals. The second, model selection, evaluates two competing models, $H_1$ and $H_2$, under a fully Bayesian framework by computing the ratio of their evidences, known as the Bayes factor, $Z_1 / Z_2$. This enables principled classification of noise versus true signals, as well as the comparison of different waveform models~\citep{LIGO_guide_signalextraction}. In GW astronomy, the high dimensionality of the parameter space and the computational cost of the likelihood render the direct evaluation of Eq.~\ref{eq:bayes_theorem} and Eq.~\ref{eq:evidence} intractable~\citep{LIGO_guide_signalextraction}. Analysis therefore relies on stochastic sampling algorithms to numerically approximate the posterior and evidence. 
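% [Editor's note] For readers less familiar with model selection, the Bayes
% factor enters the posterior odds via the standard identity below (a
% textbook result added for context, not taken from this paper):
\begin{equation*}
\frac{\mathcal{P}(H_1|d)}{\mathcal{P}(H_2|d)}
  = \frac{Z(d|H_1)}{Z(d|H_2)} \times \frac{\pi(H_1)}{\pi(H_2)},
\qquad
\ln B_{12} = \ln Z_1 - \ln Z_2 .
\end{equation*}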
\subsection{GPU-accelerated nested sampling} \label{sec:background_ns_and_gpus} \subsubsection{The nested sampling algorithm} \label{sec:background_ns} Nested Sampling is a Monte Carlo algorithm designed to solve the Bayesian inference problem outlined in Sec.~\ref{sec:background_bayes}. A key strength of the NS algorithm is that it directly computes the Bayesian evidence, $Z$, while also producing posterior samples as a natural by-product of its execution~\citep{skilling, dynamic_ns}. The algorithm starts by drawing a population of $N$ `live points' from the prior distribution, $\pi(\theta|H)$. It then proceeds to iteratively evolve this population. In each iteration, the live point with the lowest likelihood value, $\mathcal{L}_{\text{min}}$, is identified. This point is deleted from the live set and stored. It is then replaced with a new point, drawn from the prior, but subject to the hard constraint that its likelihood must exceed that of the deleted point, i.e., $\mathcal{L}_{\text{new}} > \mathcal{L}_{\text{min}}$. This process systematically traverses nested shells of increasing likelihood, with the sequence of discarded points mapping the likelihood landscape. The primary computational challenge within the NS algorithm is the efficient generation of a new point from the likelihood-constrained prior~\citep{NSNature}. The specific method used for this so-called `inner kernel' is a critical determinant of the sampler's overall efficiency and robustness~\citep{NS_methods_buchner}. \subsubsection{GPU architectures for scientific computing} \label{sec:background_gpus} The distinct architectures of Central Processing Units (CPUs) and Graphics Processing Units (GPUs) offer different advantages for computational tasks. CPUs are comprised of a few powerful cores optimised for sequential task execution and low latency. In contrast, GPUs feature a massively parallel architecture, containing thousands of simpler cores designed for high-throughput computation~\citep{GPU_computing}. This architecture makes GPUs exceptionally effective for problems that can be expressed in a Single Instruction, Multiple Data (SIMD) paradigm~\citep{CUDAGuide}. In such problems, the same operation is performed simultaneously across a large number of data elements, leading to substantial performance gains over serial execution on a CPU. The primary trade-off is that algorithms must be explicitly reformulated to expose this parallelism, and not all computational problems are amenable to vectorization. There is algorithmic overlap with vectorisation and multi-threading on CPUs~\citep{Handley:2015vkr, Smith:2019ucc}; nonetheless, the GPU execution model is uniquely suited to truly massive parallelism. We seek to exploit this capability in the present work. \subsubsection{A vectorized formulation of nested sampling} \label{sec:background_vectorized_ns} The iterative, one-at-a-time nature of the traditional NS algorithm is intrinsically serial, making it a poor fit for the parallel architecture of GPUs. To overcome this limitation,~\cite{yallup2025nested} recently developed a vectorized formulation of the NS algorithm, specifically designed for highly parallel execution within the \texttt{blackjax} framework~\citep{cabezas2024blackjax}. One of the core parts of this approach is the idea of batch processing. Instead of replacing a single live point in each iteration, the algorithm removes a batch of $k$ points with the lowest likelihoods simultaneously~\citep{parallel_ns}. 
The critical step of replacing these points is then parallelized. The algorithm launches $k$ independent sampling processes on the GPU, with each process tasked with finding one new point that satisfies the likelihood constraint, $\mathcal{L} > \mathcal{L}_{\text{min}}$, where $\mathcal{L}_{\text{min}}$ is now the maximum likelihood of the discarded batch. This reformulation transforms the computationally intensive task of sample generation from a serial challenge into a massively parallel one, thereby leveraging the architectural strengths of the GPU.

Designing and implementing massively parallel MCMC kernels is a non-trivial task, and an area of active research~\citep{pmlr-v151-hoffman22a,pmlr-v130-hoffman21a}. The original work of \cite{yallup2025nested} proposed a Hit-and-Run Slice Sampling~\citep{Neal2003_slice} kernel for this task, alongside providing a general vectorized framework which we leverage in this work. Here, we implement one particular inner kernel used within the \texttt{bilby} and \texttt{dynesty} framework. Establishing this baseline demonstrates the flexibility of the \texttt{blackjax} nested sampling framework, and gives scope for future work to explore other parallel inner sampling methods.

\vspace*{-20pt}
\section{Methods}
\label{sec:methods}

\subsection{The inner sampling kernel}
\label{sec:methods_kernel}

\begin{figure}
\centering
\includegraphics{figures/nlive_comparison.pdf}
\caption{Comparison of the number of live points in the sequential CPU and batched GPU implementations. Although the nominal number of live points used in \texttt{blackjax-ns} is higher than in \texttt{bilby}, the saw-tooth pattern means that the effective number of live points is the same. Here, we use a batch size of $k = 0.5 \times n_{\text{live}}$.}
\label{fig:nlive_comparison}
\end{figure}

Several inner sampling methods are implemented within the \texttt{bilby} and \texttt{dynesty} framework~\citep{bilby_paper, dynesty}. In this work, we focus on a GPU-accelerated implementation of the `acceptance-walk' method, which is a robust and widely used choice for GW analyses.

In the standard CPU-based \texttt{dynesty} implementation, the sampler generates a new live point by initiating an MCMC walk from the position of the deleted live point. The proposal mechanism for the MCMC walk is based on Differential Evolution (DE)~\citep{DE, DE2}, which uses the distribution of existing live points to inform jump proposals. A new candidate point is generated by adding a scaled vector difference of two other randomly chosen live points to the current point in the chain. Under the default \texttt{bilby} configuration, the scaling factor for this vector is chosen stochastically: with equal probability, it is either fixed at 1 or drawn from a scaled gamma distribution\footnote{Drawn from $\frac{2.38}{\sqrt{2 n_{\textrm{dim}}}} \times \Gamma(4,0.25)$, where $n_{\textrm{dim}}$ is the parameter space dimensionality.}. This proposed point is accepted if it satisfies the likelihood constraint, $\mathcal{L} > \mathcal{L}_{\text{min}}$, where $\mathcal{L}_{\text{min}}$ is the likelihood of the discarded point being replaced. The sampling process is performed in the unit hypercube space, with prior transformations applied to evaluate the likelihood of proposed points in the physical parameter space~\citep{dynesty}.
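% [Editor's sketch] A minimal NumPy rendering of the `acceptance-walk'
% kernel just described. Function names, the toy usage, and the boundary
% handling are editorial simplifications, not the bilby or blackjax-ns API.
\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

def de_propose(x, live_u, n_dim):
    """One DE jump: x' = x + gamma * (a - b), with a, b from the live set."""
    a, b = live_u[rng.choice(len(live_u), size=2, replace=False)]
    if rng.random() < 0.5:
        gamma = 1.0  # full-size jump, useful for hopping between modes
    else:
        gamma = 2.38 / np.sqrt(2 * n_dim) * rng.gamma(4.0, 0.25)
    # bilby applies periodic/reflective boundaries; we clip for brevity.
    return np.clip(x + gamma * (a - b), 0.0, 1.0)

def acceptance_walk(x0, logl_min, live_u, n_steps, log_likelihood,
                    prior_transform):
    """Constrained MCMC walk in the unit hypercube, started at a dead point."""
    x, n_accepted = x0.copy(), 0
    for _ in range(n_steps):
        y = de_propose(x, live_u, x.size)
        if log_likelihood(prior_transform(y)) > logl_min:  # hard constraint
            x, n_accepted = y, n_accepted + 1
    return x, n_accepted

def sine_angle_prior_transform(u):
    """Example prior transform: a sine-distributed angle on [0, pi]."""
    return np.arccos(1.0 - 2.0 * u)

# Toy usage: 100 live points in 2-d, Gaussian toy likelihood, identity prior.
live = rng.random((100, 2))
toy_logl = lambda th: -0.5 * np.sum(th ** 2)
x_new, n_acc = acceptance_walk(live[0], -np.inf, live, 20, toy_logl,
                               lambda u: u)
\end{verbatim}
% On the GPU, many such walks run simultaneously with a shared, fixed
% n_steps; the walk is written sequentially here for clarity.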
The walk length is adaptive on a per-iteration basis; the algorithm adjusts the number of MCMC steps dynamically to target a pre-defined number of accepted steps (e.g., 60) for each new live point generated, up to a maximum limit. This per-iteration adaptive strategy, however, is ill-suited for GPU architectures. The variable walk length required for each parallel sampler would lead to significant thread divergence, where different cores on the GPU finish their tasks at different times, undermining the efficiency of the SIMD execution model. To leverage GPU acceleration effectively, the computational workload must be as uniform as possible across all parallel processes.

Our implementation therefore preserves the core DE proposal mechanism but modifies the walk-length adaptation to be compatible with a vectorized framework. Within the \texttt{blackjax-ns} sampler, the number of MCMC steps is fixed for all parallel processes within a single batch of live point updates. The adaptive tuning is then performed at the batch level. After a batch of $k$ new points has been generated, we compute the mean acceptance rate across all $k$ walks. The walk length for the subsequent batch is then adjusted based on this average rate, using the same logic as \texttt{bilby} to target a desired number of accepted proposals. Adaptive tuning of walk lengths, or equivalent mechanisms~\citep{dau2021wastefreesequentialmontecarlo}, particularly in light of the GPU-batched approach, is a key avenue for future work to reduce overall runtime. While this batch-adaptive approach is essential for efficient GPU vectorization, it introduces some important differences, discussed further in Section~\ref{sec:sampler_config}.

Despite this architectural modification, our kernel is designed to be functionally analogous to the trusted \texttt{bilby} sampler. This design represents the most direct translation of the \texttt{bilby} logic to GPUs, making only the minimal changes required for efficient execution.

\subsection{Sampler configuration and settings}
\label{sec:sampler_config}

Several key hyperparameters govern the behaviour of the sampler and its inner kernel. The number of live points, \mbox{$n_{\text{live}}$}, controls the resolution of the run, where a higher value results in more accurate posteriors and evidence estimates. The \mbox{\texttt{naccept}} parameter sets the target number of accepted steps per MCMC walk, affecting the decorrelation of samples, and \mbox{\texttt{maxmcmc}} sets the maximum number of attempted steps per walk.

The primary architectural difference between our GPU-based implementation and the standard CPU-based `acceptance-walk' kernel is the use of batched sampling. In our framework, a batch of $k$ new points is generated and added to the live set simultaneously. This batch size is a user-configurable parameter, \texttt{num\_delete}, and we find that a value of \mbox{$k \approx 0.5 \times \texttt{n\_live}$} provides a good balance of parallel efficiency and sampling accuracy for most problems~\citep{yallup2025nested}.
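% [Editor's sketch] The batch-level walk-length adaptation in schematic
% form. The proportional update below is illustrative only; the exact
% bilby smoothing rule (discussed below) is not reproduced here.
\begin{verbatim}
import numpy as np

def tune_walk_length(n_steps, accepted_per_walk, naccept=60, maxmcmc=5000):
    """Choose a single shared walk length for the next batch of k chains."""
    mean_rate = np.mean(accepted_per_walk) / n_steps  # mean acceptance rate
    if mean_rate <= 0.0:
        return maxmcmc  # chains are stuck: use the maximum allowed length
    target = int(np.ceil(naccept / mean_rate))
    return int(np.clip(target, 1, maxmcmc))

print(tune_walk_length(100, [55, 62, 58]))  # -> 103
\end{verbatim}
% Because every chain in a batch shares this length, the workload stays
% uniform across GPU threads, avoiding the divergence discussed above.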
This batched approach has direct consequences for the adaptive tuning of the MCMC walk length. The tuning is performed only once per \texttt{num\_delete} iterations, rather than at every iteration, and every point in a given batch is tuned to have the same walk length. This design is important for preventing thread divergence and is the most natural way to implement this algorithm on a GPU. However, this less frequent tuning renders the standard \texttt{bilby} tuning parameter `delay', a smoothing factor for the adaptation, sub-optimal. To achieve a comparable number of accepted steps per iteration, we therefore adjust this parameter for the parallel framework. The delay is modified to be a function of the batch size, recovering the standard \texttt{bilby} behaviour in the sequential limit; this is natural, since batch-level tuning is functionally similar to averaging the adaptation over batches of chains.

Another subtle but critical consequence of the architectural shift to GPUs is its effect on the evolution of the live point set. In the sequential CPU case, the number of live points is approximately constant throughout the run. In the batched GPU case, the number of live points follows a `saw-tooth' pattern, decreasing from $n_{\text{live}}$ to \mbox{$n_{\text{live}} - k$} over one cycle of $k$ deletions before being replenished (Figure~\ref{fig:nlive_comparison}). This pattern causes the effective number of live points to be lower than the nominal number, making it appear that the sampler is able to converge faster than its sequential counterpart when both are configured with the same number of live points. It is important to note that the saw-tooth pattern is common to all parallel nested sampling, including CPU-based implementations~\citep{Handley:2015vkr}, and it is only the massively parallel nature of the GPU that makes it noticeable.

\begin{figure}
\centering
\includegraphics{figures/weights_comparison.pdf}
\caption{Comparison of the sample weights for each dead point for the sequential CPU and batched GPU implementations. The weights are calculated using the prior volumes enclosed between successive dead points and the likelihoods. The shapes are similar, and both implementations enter the bulk of the posterior distribution at similar iterations, indicating that setting the number of live points in \texttt{blackjax-ns} to 1.4 times the number of live points in \texttt{bilby} does indeed result in a like-for-like comparison. The same saw-tooth pattern can be seen in the weights for the \texttt{blackjax-ns} implementation.}
\label{fig:weights_comparison}
\end{figure}

To conduct a fair and direct comparison between the two frameworks, this discrepancy must be accounted for.
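% [Editor's illustration, not from the paper] The size of this effect is
% easy to quantify with the volume-compression formula derived below. With
% n_live = 1000 and k = 500, a sequential sampler compresses the prior by
% 500/1000 = 0.5 nats over 500 deletions, whereas the batched sampler
% compresses by
\begin{equation*}
\sum_{i=0}^{499} \frac{1}{1000-i} \approx \ln\!\left(\frac{1000}{500}\right) \approx 0.69\ \text{nats},
\end{equation*}
% i.e. roughly 40 per cent more compression over the same number of
% iterations at equal nominal live-point count.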
We therefore adjust the number of live points in the GPU sampler, $n_{\text{GPU}}$, such that the expected prior volume compression matches that of the CPU sampler. Under these settings, both implementations will converge in approximately the same number of nested sampling iterations (see Figure~\ref{fig:weights_comparison}). The following derivation outlines this adjustment.

\subsubsection{Setting $n_{\text{GPU}}$}
\label{sec:setting_n_gpu}

The expected compression of the log prior volume fraction, $\log X$, after $k$ iterations of a nested sampling run is given by~\citep{skilling, dynamic_ns, aeons}
\begin{equation}
E[-\log X_k] = \sum_{i=1}^{k} \frac{1}{n_i},
\end{equation}
where $n_i$ is the number of live points at iteration $i$ (the minus sign reflects that the remaining volume fraction satisfies $X_k \le 1$). In the CPU-based implementation, $n_i = n_{\text{CPU}}$ is roughly constant. After $k$ iterations, the expected log compression is therefore
\begin{equation}
-\log X_{\text{CPU}} = \frac{k}{n_{\text{CPU}}}.
\end{equation}
In our GPU implementation, the number of live points decreases over one batch cycle of $k = \texttt{num\_delete}$ deletions. The total log volume shrinkage over one such cycle is the sum over the decreasing number of live points:
\begin{equation}
-\log X_{\text{GPU}} = \sum_{i=0}^{k-1} \frac{1}{n_{\text{GPU}}-i} \approx \ln\left(\frac{n_{\text{GPU}}}{n_{\text{GPU}}-k}\right).
\end{equation}
To ensure a fair comparison, we equate the expected shrinkage of both methods over one cycle of $k$ iterations:
\begin{equation}
\frac{k}{n_{\text{CPU}}} \approx \ln\left(\frac{n_{\text{GPU}}}{n_{\text{GPU}}-k}\right).
\end{equation}
Using our recommended setting of $k = 0.5 \times n_{\text{GPU}}$, this simplifies to $0.5 \times n_{\text{GPU}} / n_{\text{CPU}} \approx \ln(2)$. We can therefore derive the required number of live points for the GPU sampler to be
\begin{equation}
n_{\text{GPU}} \approx 2 \ln(2) \times n_{\text{CPU}} \approx 1.4 \times n_{\text{CPU}}.
\end{equation}
In all comparative analyses presented in this paper, we configure the number of live points according to this relation to ensure an equal effective number of live points and like-for-like timing comparisons.

\subsection{Likelihood}

To assess the performance of our framework, we employ a standard frequency-domain likelihood~\citep{lal, NelsonMeyerReview}. The total speedup in our analysis is achieved by accelerating the two primary computational components of the inference process: the sampler and the likelihood evaluation. The former is parallelized at the batch level as described in Sec.~\ref{sec:methods_kernel}, while the latter is accelerated using a GPU-native waveform generator. For this purpose, we generate gravitational waveforms using the \texttt{ripple} library, which provides GPU-based implementations of common models~\citep{ripple, wong2023fastgravitationalwaveparameter, Wouters_BNS}. This allows the waveform to be calculated in parallel across the frequency domain and across multiple sets of parameters, enabling massive efficiency gains by ensuring that this calculation does not become a serial bottleneck. To isolate the speedup from this combined GPU-based framework, we deliberately avoid other established acceleration methods like heterodyning~\citep{TL_relativebinning, relativebinning2, relativebinning3, relativebinning4}, though these are available in the \texttt{jimgw} library too~\citep{wong2023fastgravitationalwaveparameter}.
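% [Editor's sketch] The standard frequency-domain Gaussian log-likelihood,
% vectorized over parameter batches with jax.vmap. The toy waveform is a
% placeholder so the snippet is self-contained; in the paper's pipeline the
% waveform would come from ripple instead.
\begin{verbatim}
import jax
import jax.numpy as jnp

def make_log_likelihood(data, psd, delta_f, waveform):
    """log L(theta) = -2 df sum |d - h(theta)|^2 / S_n, up to a constant."""
    def log_likelihood(theta):
        residual = data - waveform(theta)
        return -2.0 * delta_f * jnp.sum(jnp.abs(residual) ** 2 / psd)
    return log_likelihood

freqs = jnp.linspace(20.0, 1024.0, 4097)   # analysis band used in the paper
delta_f = freqs[1] - freqs[0]
toy_waveform = lambda theta: theta[0] * jnp.exp(-1j * theta[1] * freqs)
psd = jnp.ones_like(freqs)                 # flat PSD for illustration
data = toy_waveform(jnp.array([1.0, 0.01]))

log_l = make_log_likelihood(data, psd, delta_f, toy_waveform)
batched_log_l = jax.jit(jax.vmap(log_l))   # evaluate many proposals at once
print(batched_log_l(jnp.array([[1.0, 0.01], [0.9, 0.02]])))
\end{verbatim}
% Both the sum over frequency bins and the vmap over parameter sets map
% naturally onto the SIMD execution model described in Section 2.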
For the analyses in this paper, we restrict our consideration to binary black hole systems with aligned spins, for which we use the \texttt{IMRPhenomD} waveform approximant~\citep{Khan:2015jqa}. Further details on the specific likelihood configuration for each analysis, including noise curves and data segments, are provided in Section~\ref{sec:results}.

\subsection{Priors}

For this initial study, we adopt a set of standard, separable priors on the source parameters, which are summarized in Table~\ref{tab:priors}. The specific ranges for these priors are dependent on the duration of the signal, and are also given in Section~\ref{sec:results}. As is the default within the \texttt{bilby} framework, we sample directly in chirp mass, $\mathcal{M}$, and mass ratio, $q$. We use priors that are uniform in these parameters directly, instead of uniform in the component masses. The aligned spin components, $\chi_1$ and $\chi_2$, are also taken to be uniform over their allowed range. The coalescence time, $t_c$, is assumed to be uniform within a narrow window around the signal trigger time. For the luminosity distance, $d_L$, we adopt a power-law prior of the form $p(d_L) \propto d_L^2$. This prior corresponds to a distribution of sources that is uniform in a Euclidean universe. While this is a simplification that is less accurate at higher redshifts~\citep{bilby_validation}, it is a standard choice for many analyses and serves as a robust baseline for this work.

These priors were chosen to facilitate a direct, like-for-like comparison against the CPU-based \texttt{bilby} and \texttt{dynesty} framework, and in all such comparisons identical priors were used. The implementation of more astrophysically motivated, complex prior distributions for mass, spin, and luminosity distance is left to future work and is independent of the algorithmic performance.

\begin{table}
\setlength{\tabcolsep}{3pt} % Reduce space between columns (default is 6pt)
\centering
\caption{Prior distributions for the parameters of the binary black hole system. The specific ranges for the masses and spins are dependent on the signal and are specified in Section~\ref{sec:results}.}
\label{tab:priors}
\begin{tabular}{l l l c}
\hline \hline
\textbf{Parameter} & \textbf{Description} & \textbf{Prior Distribution} & \textbf{Range}\\
\hline
$\mathcal{M}$ & Chirp Mass & Uniform & - \\
$q$ & Mass Ratio & Uniform & - \\
$\chi_1, \chi_2$ & Aligned spin components & Uniform & - \\
$d_L$ & Luminosity Distance & Power Law (2) & [100, 5000] Mpc \\
$\theta_{\textrm{JN}}$ & Inclination Angle & Sine & [0, $\pi$] rad \\
$\psi$ & Polarization Angle & Uniform & [0, $\pi$] rad \\
$\phi_c$ & Coalescence Phase & Uniform & [0, 2$\pi$] rad \\
$t_c$ & Coalescence Time & Uniform & [-0.1, 0.1] s\\
$\alpha$ & Right Ascension & Uniform & [0, 2$\pi$] rad \\
$\delta$ & Declination & Cosine & [-$\pi$/2, $\pi$/2] rad \\
\hline \hline
\end{tabular}
\end{table}

\section{Results and discussion}
\label{sec:results}

We now validate the performance of our framework through a series of analyses. In all of the results, the \texttt{bilby} analyses were executed on a 16-core CPU instance using Icelake nodes, while the \texttt{blackjax-ns} analyses were performed on a single NVIDIA L4 GPU, unless otherwise specified.

\subsection{Simulated signals}
\subsubsection{4-second simulated signal}
\label{sec:4s_simulated_signal}

We begin by analysing a 4-second simulated signal from a binary black hole (BBH) merger.
The injection parameters for this signal are detailed in Table~\ref{tab:injection_params}. To ensure a direct, like-for-like comparison, the signal was injected into simulated detector noise using the \texttt{bilby} library, and then loaded into both sets of analyses. The analysis uses a three-detector network, with the design sensitivity for the fourth LIGO-Virgo-KAGRA observing run (O4). The frequency range of data analysed is from 20 Hz to 1024 Hz, and the network matched filter SNR is 39.6.

Both the CPU and GPU-based analyses were configured with 1000 effective live points. As detailed in Sec.~\ref{sec:setting_n_gpu}, this corresponded to setting the number of live points in \texttt{blackjax-ns} to 1400, with \mbox{\texttt{num\_delete} = 700}. The termination condition in both cases was set to \mbox{\texttt{dlogZ} < 0.1} (the default in \texttt{bilby}) and we used the settings \mbox{\texttt{naccept} = 60}, \mbox{\texttt{maxmcmc} = 5000} and \mbox{\texttt{use\_ratio = True}}. In both cases, periodic boundary conditions were used for the right ascension, polarization angle, and coalescence phase parameters. The prior ranges for the chirp mass and mass ratio were set to $[25.0, 50.0]~M_{\odot}$ and $[0.25, 1.0]$, respectively, with priors on the aligned spin components over the range $[-1, 1]$.

The recovered posterior distributions, shown in Figure~\ref{fig:4s_posteriors}, demonstrate excellent statistical agreement between the two frameworks. This result validates that our custom `acceptance-walk' kernel within the vectorized \texttt{blackjax-ns} framework is functionally equivalent to the trusted sequential implementation in \texttt{bilby}. The computed log-evidence values are also in strong agreement, as shown in Figure~\ref{fig:4s_logZ_comparison}, confirming that our implementation provides a robust unified framework for both parameter estimation and model selection.

The CPU-based \texttt{bilby} run completed in 2.99 hours on 16 cores, for a total of 47.8 CPU-hours. In contrast, the GPU-based \texttt{blackjax-ns} run completed in 1.25 hours. This corresponds to a sampling time speedup factor of 38$\times$. The \texttt{bilby} and \texttt{blackjax-ns} runs performed 62.9 million and 62.5 million likelihood evaluations and had a mean number of accepted steps per iteration of 28.5 and 30.7, respectively.

Beyond sampling time, we also consider the relative financial cost. Based on commercial on-demand rates from Google Cloud\footnote{Accessed on 2025-09-03 from https://cloud.google.com/compute/gpus-pricing.}, the rental costs for a 16-core CPU instance and an L4 GPU instance were approximately equivalent at the time of this work. This equivalence in hourly cost implies a direct cost-saving factor of approximately 2.4$\times$ for the GPU-based analysis.

In the interest of benchmarking, we also performed the analysis on an NVIDIA A100 GPU configured with the same run settings as the L4 GPU. The A100 is a more powerful GPU and resulted in a lower runtime. Interestingly, however, when comparing the relative commercial hourly rates of the two GPUs, the L4 analysis was cheaper; the A100 analysis was in fact more expensive than the CPU implementation. We summarise these results in Table~\ref{tab:4s_time_comparison}.
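% [Editor's note] The quoted cost factor follows directly from the
% wall-times and the approximately equal hourly rates $r$ of the two
% instance types:
\begin{equation*}
\text{cost saving}
= \frac{2.99\,\text{h} \times r_{\text{CPU}}}{1.25\,\text{h} \times r_{\text{GPU}}}
\approx \frac{2.99}{1.25} \approx 2.4\times .
\end{equation*}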
\begin{table}
\centering
\caption{Injection parameters for the 4s signal.}
\label{tab:injection_params}
\begin{tabular}{l l}
\hline \hline
\textbf{Parameter} & \textbf{Value} \\
\hline
$\mathcal{M}$ & 35.0 M$_{\odot}$ \\
$q$ & 0.90 \\
$\chi_1$ & 0.40 \\
$\chi_2$ & -0.30 \\
$d_L$ & 1000 Mpc \\
$\theta_{\textrm{JN}}$ & 0.40 rad \\
$\psi$ & 2.66 rad \\
$\phi_c$ & 1.30 rad \\
$t_c$ & 0.0 s\\
$\alpha$ & 1.38 rad \\
$\delta$ & -1.21 rad \\
\hline \hline
\end{tabular}
\end{table}

\begin{figure}
\centering
\includegraphics{figures/4s_logZ_comparison.pdf}
\caption{Comparison of the log evidence for the 4s signal. The results are in excellent agreement, demonstrating the robustness of the \texttt{blackjax-ns} implementation in recovering the same posteriors and evidence as the \texttt{bilby} implementation. This unifies parameter estimation and evidence evaluation into a single GPU-accelerated framework.}
\label{fig:4s_logZ_comparison}
\end{figure}

\begin{table}
\setlength{\tabcolsep}{3pt}
\centering
\caption{Comparison of the sampling times and cost savings for the 4s signal. Times are total compute hours: CPU-hours for \texttt{bilby} and wall-clock GPU-hours for \texttt{blackjax-ns}.}
\label{tab:4s_time_comparison}
\begin{tabular}{l l c c}
\hline \hline
\textbf{Implementation + Hardware} & \textbf{Time (h)} & \textbf{Speedup} & \textbf{Cost Saving} \\
\hline
\texttt{bilby} (16 Icelake CPU cores) & 47.8 & - & - \\
\texttt{blackjax-ns} (NVIDIA L4) & 1.25 & 38$\times$ & 2.4$\times$ \\
\texttt{blackjax-ns} (NVIDIA A100) & 0.93 & 51$\times$ & 0.6$\times$ \\
\hline \hline
\end{tabular}
\end{table}

\subsection{Injection Study}
\label{sec:injection_study}

To systematically assess the performance and robustness of our framework across a diverse parameter space, we performed an injection study comparing our GPU-based \texttt{blackjax-ns} sampler against the CPU-based \texttt{bilby}+\texttt{dynesty} implementation. A population of 100 simulated BBH signals was generated using \texttt{bilby} and injected into noise representative of the O4 detector network sensitivity. The network signal-to-noise ratios (SNR) for this injection set span a range from 1.84 to 17.87 (Figure~\ref{fig:snr_dist}). As above, we use a three-detector network for the analysis.

\begin{figure}
\centering
\includegraphics{figures/injection_snr_hist.pdf}
\caption{Distribution of the network signal-to-noise ratios (SNR) for the injected signals.}
\label{fig:snr_dist}
\end{figure}

For this study, the prior ranges were set to $[20.0, 50.0]~M_{\odot}$ for the chirp mass and $[0.5, 1.0]$ for the mass ratio. The aligned spin components were bounded by uniform priors over the range $[-0.8, 0.8]$. All other prior distributions are as defined in Table~\ref{tab:priors}. All analyses in this study were performed using 1000 live points for \texttt{bilby} and 1400 live points for \texttt{blackjax-ns}. Given that many of the signals in the set have a low SNR, and therefore a low Bayes factor when comparing the signal hypothesis to the noise-only hypothesis, a more robust termination condition was required to ensure accurate evidence computation as well as posterior estimation. We therefore set the termination criterion based on the fractional remaining evidence, such that the analysis stops when the estimated evidence in the live points is less than 0.1\% of the accumulated evidence. Both samplers were configured with \mbox{\texttt{naccept} = 60} and \mbox{\texttt{maxmcmc} = 5000}, and the \texttt{blackjax-ns} sampler used a batch size of \mbox{\texttt{num\_delete} = 700}.
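% [Editor's note] In standard nested-sampling notation (editorial, not the
% paper's), this stopping rule reads: terminate at iteration $i$ once
\begin{equation*}
\frac{\hat{Z}_{\text{live}}}{\hat{Z}_{\text{dead}}}
\approx \frac{\langle \mathcal{L} \rangle_{\text{live}}\, X_i}{\sum_{j \le i} w_j \mathcal{L}_j}
< 10^{-3},
\end{equation*}
% where X_i is the remaining prior volume and w_j are the usual quadrature
% weights of the dead points.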
\begin{figure} \centering \includegraphics{figures/chain_length_comparison.pdf} \caption{Comparison of the number of accepted steps per iteration for the sequential CPU and batched GPU analyses of the first signal from the injection study. We adapt the `delay' parameter in the tuning formula such that the two implementations obtain a similar number of accepted steps. The \texttt{blackjax-ns} implementation can only perform tuning every \mbox{\texttt{num\_delete}} iterations, so we tune the chain length more aggressively.} \label{fig:chain_length_comparison} \end{figure} The resulting PP plot for our \texttt{blackjax-ns} sampler with the `acceptance-walk' kernel is shown in Figure~\ref{fig:pp_coverage}. This plot evaluates the self-consistency of the posteriors by checking the distribution of true injected parameter values within their recovered credible intervals. The results demonstrate that the credible intervals are well-calibrated, with the cumulative distributions for all parameters lying within the expected statistical variance of the line of perfect calibration. This confirms the robustness of our implementation. The full set of posterior and evidence results for all 100 injections is made publicly available in the data repository accompanying this paper~\citep{Prathaban_2025_zenodo}. We show the consistency of the evidence estimates in Figure~\ref{fig:evidence_vs_steps_iters}. \begin{figure} \centering \includegraphics[width=\columnwidth]{figures/pp_coverage_blackjax.pdf} \caption{Percentile-percentile (PP) coverage plot for the 100-injection study, obtained with the \texttt{blackjax-ns} sampler. The cumulative fraction of events where the true injected parameter falls below a given credible level is plotted against that credible level. The proximity of all parameter curves to the diagonal indicates that the posterior credible intervals are statistically well-calibrated. A corresponding plot for the \texttt{bilby+dynesty} analysis, which should be statistically consistent, is provided in the Appendix for reference.} \label{fig:pp_coverage} \end{figure} \begin{figure} \centering \includegraphics{figures/walltime_speedup.pdf} \caption{The runtime speedups for all 100 events in the injection study. For the CPU-based analyses, we take the wall-time and multiply by 16 to obtain the total CPU-hours; for the GPU-based analyses, we use the wall-time directly.} \label{fig:speedup_comparison} \end{figure} \begin{figure} \centering \includegraphics{figures/cost_reduction.pdf} \caption{The cost reductions for all 100 events in the injection study. Although the runtime speedups are significant, the fairer comparison is between the relative costs of the two sets of analyses, since an hour of GPU time is more expensive than an hour on a single CPU core. We use the commercial on-demand rates for a 16-core CPU instance and an L4 GPU instance at the time of this work, and compare the cost savings obtained from using our GPU-accelerated pipeline.} \label{fig:cost_reduction} \end{figure} \begin{figure} \centering \includegraphics{figures/evidence_vs_steps_iters.pdf} \caption{The log evidence difference between the \texttt{blackjax-ns} and \texttt{bilby+dynesty} samplers for each injection in the study. For each estimate, we take the mean log evidence estimated by \texttt{bilby} as the reference value, and subtract the \texttt{bilby} estimates from the log evidences calculated by \texttt{blackjax-ns}. The error bars represent 1$\sigma$ confidence intervals for these differences.
There appears to be some underestimation of the error bars, potentially because some sets of samples are not as well decorrelated as others, despite a reasonable average decorrelation. A calibration test was performed to quantify this: for the $\Delta \log \mathcal{Z}$ estimates, we calculate the percentiles at which $\Delta \log \mathcal{Z} = 0$ lies. If the error bars are correctly calibrated, the resulting distribution of percentiles should be uniform; the Kolmogorov--Smirnov test returns a p-value of $0.062$, so the hypothesis of correct calibration is not rejected at the $5\%$ significance level.} \label{fig:evidence_vs_steps_iters} \end{figure} The average number of likelihood evaluations for the \texttt{blackjax-ns} sampler was 38.4 million, consistent with the average of 37.5 million for the \texttt{bilby+dynesty} sampler. A more detailed breakdown of all run statistics is provided in Appendix~\ref{app:injection_study_stats}. In terms of runtime, the \texttt{blackjax-ns} sampler completed the analyses in an average of 0.82 hours per injection, with individual run times ranging from 32 to 76 minutes. In contrast, the CPU-based runs required an average of 26.3 total CPU-hours, with some runs taking as long as 45 hours. Figure~\ref{fig:speedup_comparison} shows the distribution of the resulting runtime speedup factors, which have a mean value of 32$\times$. As in Sec.~\ref{sec:4s_simulated_signal}, we also translate this performance gain into a cost reduction using commercial cloud computing rates. As shown in Figure~\ref{fig:cost_reduction}, the speedup corresponds to an average cost-reduction factor of 2.0$\times$ for the GPU-based analysis compared to its CPU-based counterpart. \subsection{Disentangling Sources of GPU Acceleration} \label{sec:disentangling_acceleration} The overall performance gain of our framework arises from two distinct forms of parallelisation: the parallel evaluation of a single likelihood across its frequency bins (intra-likelihood), and the parallelisation of the sampling process itself, where multiple MCMC chains are run simultaneously (inter-sample). In this section, we attempt to disentangle these two contributions. We configured our \texttt{blackjax-ns} sampler to run in a sequential mode. This was achieved by setting the batch size to one (\texttt{num\_delete} = 1) and the number of live points to 1000, identical to the CPU analysis. With a batch size of one, the `saw-tooth' pattern in the live point population described in Sec.~\ref{sec:sampler_config} is eliminated, making the effective number of live points equal to the nominal value. This removes the need for the corrective scaling of $n_{\text{live}}$ applied in our main parallel analyses, and we can also use the same `delay' for tuning the walk lengths as for the \texttt{bilby} runs. In this configuration, the algorithm proceeds analogously to the \texttt{bilby} implementation, and the only source of acceleration relative to the CPU is the GPU's ability to parallelise the likelihood calculation over the frequency domain. For this test, we used the first signal from our injection study (network SNR of 8.82). The baseline CPU-based \texttt{bilby} analysis required 30.4 total CPU-hours to complete. The sequential-GPU analysis, which benefits only from intra-likelihood parallelisation, completed in 9.1 hours, a speedup of 3.3$\times$. Finally, the fully parallel run, using its correctly scaled live point count of 1400 and a batch size of \texttt{num\_delete} = 700, completed in 0.82 hours.
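The two sources of acceleration combine multiplicatively: \begin{equation*} \underbrace{\frac{30.4~\text{h}}{9.1~\text{h}}}_{\text{intra-likelihood:}\,\approx 3.3\times} \times \underbrace{\frac{9.1~\text{h}}{0.82~\text{h}}}_{\text{inter-sample:}\,\approx 11.1\times} = \frac{30.4~\text{h}}{0.82~\text{h}} \approx 37\times. \end{equation*}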
The batched run thus achieves a further speedup of 11.1$\times$ over the sequential-GPU configuration. This demonstrates that while a GPU-native likelihood provides a significant performance benefit, the more dominant contribution to the overall speedup (a total of 37.1$\times$ in this case) originates from the massively parallel, batched sampling architecture. This result underscores the importance of the algorithmic reformulation of nested sampling pioneered in~\cite{yallup2025nested}. \subsection{8-second simulated signal} \label{sec:8s_simulated_signal} To investigate the scalability of our framework with a more computationally demanding likelihood, we now analyse a simulated 8-second BBH signal. The injection parameters are detailed in Table~\ref{tab:8s_injection_params}. For this analysis, the data are evaluated between 20 Hz and 2048 Hz, quadrupling the number of frequency bins compared to the 4-second signal analysis. As above, the signal was injected into noise from a three-detector network with O4 sensitivity, resulting in a network SNR of 11.25. The prior ranges for the chirp mass and mass ratio were set to $[10.0, 25.0]~M_{\odot}$ and $[0.25, 1.0]$, respectively. The aligned spin components were bounded by uniform priors over the range $[-0.8, 0.8]$, with all other prior distributions as defined in Table~\ref{tab:priors}. The sampler configurations were identical to those used in the injection study (Sec.~\ref{sec:injection_study}), with $n_{\text{live}}=1400$ for \texttt{blackjax-ns} and $n_{\text{live}}=1000$ for \texttt{bilby}, and an \texttt{naccept} target of 60. Again, we use the termination condition described in Sec.~\ref{sec:injection_study}. \begin{figure*} \centering \includegraphics{figures/8s_corner_comparison.pdf} \caption{Recovered posterior distributions for the 8s simulated signal, comparing our GPU-based \texttt{blackjax-ns} sampler with the CPU-based \texttt{bilby} sampler. The injected values are marked by black lines. The strong statistical agreement confirms the validity of our implementation for longer-duration signals. The GPU implementation provided a wall-time speedup of 37$\times$ and a cost reduction of 2.3$\times$.} \label{fig:8s_posteriors} \end{figure*} \begin{figure} \centering \includegraphics{figures/8s_logZ_comparison.pdf} \caption{Comparison of the recovered log-evidence ($\log \mathcal{Z}$) for the 8s signal. The results from both the \texttt{bilby} and \texttt{blackjax-ns} frameworks are consistent within their estimated numerical uncertainties.} \label{fig:8s_logZ} \end{figure} The recovered posterior distributions and log-evidence values are shown in Figure~\ref{fig:8s_posteriors} and Figure~\ref{fig:8s_logZ}. We again find excellent agreement between the two frameworks, further validating the accuracy of our implementation, despite the differences arising from parallelisation. The CPU-based \texttt{bilby} analysis required 167.2 total CPU-hours to converge, performing 58 million likelihood evaluations. Our GPU-based \texttt{blackjax-ns} sampler completed the same analysis in 4.55 hours, with 61 million likelihood evaluations, corresponding to a wall-time speedup factor of 37$\times$. The associated cost-reduction factor for this analysis was 2.3$\times$. The scaling of the runtime for this longer-duration signal provides a key insight into the practical limits of GPU parallelisation. In an ideal model where the GPU has sufficient parallel cores for every frequency bin, the likelihood evaluation time would be independent of signal duration.
In such a scenario, the total runtime would be dictated primarily by the number of likelihood evaluations needed for convergence. In this case, the analysis should have taken roughly the same time as the 4-second signal analysis, or $1.25 \times 61/62.5 \approx 1.22$ hours. However, the observed runtime increase is significantly larger than would be predicted by the relative number of likelihood evaluations alone. This discrepancy arises because the computational load of the larger frequency array exceeds the parallel processing capacity of the L4 GPU. As a result, the GPU's internal scheduler must batch the calculation across the frequency domain, re-introducing a partial serial dependency that makes the likelihood evaluation time scale with the number of bins. This result serves as an important practical clarification of a common argument for GPU acceleration. While the idealised model of parallelisation suggests that evaluation time should be independent of signal duration, our work demonstrates that this does not hold indefinitely. Once the problem size saturates the GPU's finite resources, such as its compute throughput or memory bandwidth, the runtime once again begins to scale with the number of frequency bins, a behaviour analogous to the scaling seen in the CPU-based case. % \begin{figure} % \centering % \includegraphics{figures/likelihood_scaling.pdf} % \caption{The scaling of the likelihood evaluation % time with the number of frequency bins and particles % over which it is evaluated. Beyond a certain limit, % we see a linear scaling of the evaluation times, % indicating a practical limit to the speedup that % can be achieved through GPU parallelisation with % the likelihoods considered here. Using acceleration techniques, % such as heterodyning~\citep{TL_relativebinning,relativebinning2,relativebinning3,relativebinning4}, % can alleviate this.} % \label{fig:likelihood_scaling} % \end{figure} \section{Conclusions} \label{sec:conclusions} \begin{table} \centering \caption{Injection parameters for the 8s signal.} \label{tab:8s_injection_params} \begin{tabular}{l l} \hline \hline \textbf{Parameter} & \textbf{Value} \\ \hline $\mathcal{M}$ & 10.0 M$_{\odot}$ \\ $q$ & 0.70 \\ $\chi_1$ & -0.34 \\ $\chi_2$ & 0.01 \\ $d_L$ & 500 Mpc \\ $\theta_{\textrm{JN}}$ & 1.30 rad \\ $\psi$ & 1.83 rad \\ $\phi$ & 5.21 rad \\ $t_c$ & 0.0 s\\ $\alpha$ & 1.07 rad \\ $\delta$ & -1.01 rad \\ \hline \hline \end{tabular} \end{table} In this work, we have presented the development and validation of a GPU-accelerated implementation of the `acceptance-walk' nested sampling kernel, a trusted algorithm from the community-standard \texttt{bilby} and \texttt{dynesty} framework. By integrating this kernel into the vectorized \texttt{blackjax-ns} framework, we have created a tool for rapid, unified Bayesian parameter estimation and model selection. Through systematic studies of simulated binary black hole signals, we demonstrated that our implementation is functionally analogous to its CPU-based counterpart, producing statistically consistent posterior distributions and evidence estimates with typical wall-time speedups of 20--40$\times$ and cost reductions of 1.5--2.5$\times$. It is important to contextualize the cost savings reported in this work. As the trajectory of high-performance computing is increasingly driven by large-scale investment in GPU hardware for artificial intelligence, we anticipate that the cost-per-computation on these devices will continue to fall relative to CPUs.
Therefore, the cost-saving advantage of our approach is expected to become more pronounced in the future, even if the relative runtimes remain unchanged. Furthermore, the JAX~\citep{jax2018github} ecosystem is well suited to distributing computation over multiple devices and architectures. It has been developed closely alongside the Tensor Processing Unit (TPU) architecture~\citep{jouppi2023tpu}, offering a clear avenue for integrating this work with such low-cost frontier hardware. A key conclusion of this work is that the performance gains reported here are primarily a consequence of the architectural shift to the GPU. This implies that similar relative speedups are achievable for more computationally demanding waveforms, such as those including spin precession and higher-order modes (e.g., \texttt{IMRPhenomXPHM}~\citep{IMRPhenomXPHM}), provided they are also implemented in a GPU-native framework. Our analysis of a longer-duration signal confirmed, however, that this scalability is ultimately limited by the finite resources of the hardware, a practical constraint that can be addressed with more powerful GPUs or likelihood-acceleration techniques like heterodyning~\citep{TL_relativebinning,relativebinning2,relativebinning3,relativebinning4}. Importantly, by significantly reducing the computational cost of nested sampling, our framework makes tackling previously prohibitive next-generation analyses~\citep{HuAccelerationReview, makaibaker2025}, such as those involving long-duration or high-SNR signals, a more viable prospect. Beyond future-proofing the community's core analysis tools, this work establishes an important performance baseline. As the community develops novel, GPU-accelerated sampling algorithms, it is essential to disentangle performance gains originating from hardware parallelisation itself from those arising from genuine algorithmic innovation. By developing a functionally equivalent, GPU-native version of one of the community-standard algorithms, we have isolated and quantified the speedup attributable solely to the architectural shift, which allows for batched sampling and parallelised likelihood evaluations on a GPU. This provides a robust reference against which future, more advanced GPU-based samplers can be rigorously benchmarked, in order to guide the development of next-generation Bayesian inference tools. There are several future avenues of research and development. Having established this performance baseline, we can now rigorously evaluate novel inner sampling kernels designed specifically for the GPU architecture. While the speedups reported in this work are significant, the `acceptance-walk' sampling method was never designed with GPU architectures in mind. This motivates the exploration of alternative inner sampling kernels, such as those based on the Hit-and-Run Slice Sampling (HRSS) algorithm introduced in~\cite{yallup2025nested}, which may offer more efficient performance on this architecture. GPU-accelerated sampling algorithms such as the one presented in this work can only run when paired with a GPU-native likelihood. Our work therefore provides additional motivation for the continued development and porting of more complex and physically comprehensive waveform models to GPU-based libraries like \texttt{ripple}~\citep{ripple}.
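To make the required coupling concrete, the following is a minimal, illustrative sketch (not the interface of our released code) of the pattern a GPU-native likelihood must support; the function names, array shapes, and toy model here are placeholders, with \texttt{model} standing in for a frequency-domain waveform such as one generated by \texttt{ripple}:
\begin{verbatim}
import jax
import jax.numpy as jnp

# Illustrative stand-ins: a real analysis would use detector
# data, a PSD, and a waveform model (e.g. from ripple) instead.
freqs = jnp.linspace(20.0, 1024.0, 4016)
data = jnp.zeros_like(freqs)
psd = jnp.ones_like(freqs)

def model(theta):
    # Toy "waveform" with amplitude and phase-slope parameters.
    return theta[0] * jnp.cos(theta[1] * freqs)

def log_likelihood(theta):
    # Intra-likelihood parallelism: the sum over frequency
    # bins is a single vectorised reduction on the GPU.
    residual = data - model(theta)
    return -0.5 * jnp.sum(residual**2 / psd)

# Inter-sample parallelism: evaluate a whole batch of MCMC
# proposals in one fused, jit-compiled GPU call.
batched_loglik = jax.jit(jax.vmap(log_likelihood))
thetas = jnp.ones((700, 2))      # e.g. num_delete = 700
log_ls = batched_loglik(thetas)  # shape: (700,)
\end{verbatim}
Any waveform model exposed as a pure JAX function in this way automatically inherits both forms of parallelism discussed in Sec.~\ref{sec:disentangling_acceleration}.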
On the software side, future work will involve integrating additional features from \texttt{bilby}~\citep{bilby_paper} into our framework, such as the astrophysically motivated prior distributions for masses, spins, and luminosity distance, to enable more realistic analyses of interest to the GW community. Finally, the batched architecture of our sampler is highly compatible with machine learning techniques used for accelerating Bayesian inference. Methods that use normalizing flows to speed up the core nested sampling algorithm, such as those in~\cite{Williams2021Nessai} and~\cite{Prathaban}, are most computationally efficient when they can evaluate a large number of samples in parallel. The serial nature of a conventional CPU-based sampler represents a significant bottleneck for these models. In contrast, our parallel framework is a natural fit for these operational requirements. This compatibility enables the development of highly efficient, hybrid inference pipelines that combine GPU-accelerated sampling with GPU-native machine learning models, further lowering the computational cost of future gravitational-wave analyses. \section*{Acknowledgements} This work was supported by the research environment and infrastructure of the Handley Lab at the University of Cambridge. MP is supported by the Harding Distinguished Postgraduate Scholars Programme (HDPSP). JA is supported by a fellowship from the Kavli Foundation. This work was performed using the Cambridge Service for Data Driven Discovery (CSD3), part of which is operated by the University of Cambridge Research Computing on behalf of the STFC DiRAC HPC Facility (\url{www.dirac.ac.uk}). The DiRAC component of CSD3 was funded by BEIS capital funding via STFC capital grants ST/P002307/1 and ST/R002452/1 and STFC operations grant ST/R00689X/1. DiRAC is part of the National e-Infrastructure. This material is based upon work supported by the Google Cloud research credits program, with the award GCP397499138. \section*{Data Availability} All the data used in this analysis, including the relevant nested sampling chains, can be obtained from~\cite{Prathaban_2025_zenodo}. We include a file with all the code needed to reproduce the plots in this paper. The \texttt{blackjax-ns} code is available from~\cite{blackjax_ns_github}, and the custom kernel implemented in this paper can be found at~\cite{Prathaban_2025_github}. The latter link also contains instructions and example files showing how to use the kernel to run GW analyses. \section*{Software and Tools} The authors acknowledge the use of AI tools in preparing this manuscript and the accompanying software. Generative models, specifically Gemini 2.5 Pro and Claude 4.0 Sonnet, were used to assist with drafting and refining the text. Additionally, the AI coding agent Claude Code was used during software development. All AI-generated outputs, including text and code, were carefully reviewed, edited, and validated by the authors to ensure their accuracy.
%%%%%%%%%%%%%%%%%%%% REFERENCES %%%%%%%%%%%%%%%%%% \bibliographystyle{mnras} \bibliography{references} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %%%%%%%%%%%%%%%%% APPENDICES %%%%%%%%%%%%%%%%%%%%% \appendix \section{PP coverage plot for \texttt{bilby} and \texttt{dynesty}} \label{app:bilby_pp_plot} For completeness, we present the PP coverage plot for the CPU-based \texttt{bilby}+\texttt{dynesty} analysis in Figure~\ref{fig:pp_coverage_bilby}. This serves as a reference for the corresponding plot for our \texttt{blackjax-ns} implementation shown in the main text. As expected for two functionally equivalent samplers, the resulting PP plot and p-values are consistent with those from our GPU-based framework. We note, however, that while the two samplers are algorithmically analogous, apart from differences in the parallelisation strategy, they will not produce identical PP plots, owing to their stochastic nature and the inherent uncertainties in posterior reconstruction from nested sampling. The primary sources of this variance include the statistical uncertainty in the weights assigned to the dead points, which are themselves estimates~\citep{skilling, Higson_Errors}, and an additional uncertainty inherent to parameter estimation with nested sampling that is not captured by standard error propagation methods~\citep{Prathaban_PE_errors}. Consequently, the PP curves and their associated p-values are themselves subject to statistical fluctuations. The results from the two frameworks are therefore only expected to be consistent within these intrinsic uncertainties. We do not explicitly account for these uncertainties here, instead leaving this analysis for future work. \begin{figure} \centering \includegraphics[width=\columnwidth]{figures/pp_coverage_bilby.pdf} \caption{PP coverage plot for the 100-injection study, obtained with the CPU-based \texttt{bilby+dynesty} sampler. This plot is provided for direct comparison with the results from our \texttt{blackjax-ns} implementation shown in Figure~\ref{fig:pp_coverage}. The results confirm that the CPU-based and GPU-based implementations are functionally similar.} \label{fig:pp_coverage_bilby} \end{figure} \section{Full run statistics for injection study} \label{app:injection_study_stats} We present a detailed comparison of the internal sampling statistics for the 100-injection study in Figure~\ref{fig:injection_study_stats}. We tuned the `delay' parameter in the walk-length tuning formula to target a similar level of sample decorrelation to \texttt{bilby}. The results show that this was successful: both samplers performed similar numbers of likelihood evaluations, with similar acceptance rates. Figure~\ref{fig:injection_study_stats} shows the full statistics for each injection.
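Schematically (the exact update implemented in \texttt{bilby} and in our kernel may differ in detail), the walk length $L$ is nudged at each tuning step towards the length implied by the observed acceptance rate $\hat{\alpha}$ and the target number of accepted steps, with the `delay' $\delta$ damping the update: \begin{equation*} L \,\leftarrow\, \frac{\delta\, L + n_{\texttt{accept}}/\hat{\alpha}}{\delta + 1}. \end{equation*} A larger $\delta$ gives smoother but slower adaptation; because \texttt{blackjax-ns} can only retune every \texttt{num\_delete} iterations, we use a smaller $\delta$ (more aggressive tuning) than the per-iteration \texttt{bilby} strategy.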
Although we had to employ batch tuning of the chains, in contrast to the per-iteration strategy employed in \texttt{bilby}, the results show that this still produced similar behaviours for the chains, and thus our results represent a like-for-like comparison. \begin{figure*} \centering \includegraphics{figures/performance_metrics.pdf} \caption{Comparison of internal run statistics for the 100-injection study. From left to right: the mean number of likelihood evaluations per iteration, the mean number of accepted steps, and the mean acceptance rate, for both the \texttt{bilby} and \texttt{blackjax-ns} runs. The results show that despite the differences in tuning strategy, motivated by architectural differences between the CPU and GPU, the two samplers exhibit similar properties.} \label{fig:injection_study_stats} \end{figure*} \section{Likelihood-level parallelisation performance} The performance gains from a GPU sampling framework arise from the ability to exploit massive parallelism. In particular, we seek to exploit parallelism across multiple short Markov chains, as this is the central construction of NS. The potential bottlenecks for parallelism are the NS algorithm's ability to synchronise the chains effectively, and (if this condition is met) the ability to parallelise the likelihood evaluation itself across these chains. Developing algorithms that achieve the former is part of the ongoing programme of research that this work establishes for GPU-based GW inference. Regardless of effective parallelism across chains, the algorithm in this work is bottlenecked by the ability to parallelise the likelihood evaluation itself. We demonstrate the scaling of the likelihood evaluation time with the number of samples in Figure~\ref{fig:scaling_l4}. The emerging pattern of polynomial runtime scaling with the number of samples, particularly for the likelihoods with large numbers of frequency bins that we primarily investigate in this work, indicates that the parallel throughput of the GPU is being saturated. This bottleneck can be mitigated either by using a compressed likelihood (e.g., via heterodyning) or by distributing the processing over multiple GPUs. If possible, rewriting the likelihood to be more parallelism-friendly is the most efficient solution, but this is challenging for GW data analysis. \begin{figure} \centering \includegraphics[width=1.0\columnwidth]{figures/likelihood_scaling.pdf} \caption{Scaling of likelihood wallclock time on an NVIDIA L4 GPU. A typical representative likelihood for a 3-detector analysis of a 4-second signal using the \texttt{IMRPhenomD} waveform model from \texttt{ripple} is used. The likelihood is called vectorized over $N$ samples, and the wall time, averaged over many calls, is reported. Comparison is made between different numbers of frequency bins over which the likelihood is evaluated. As can be seen, reducing the number of frequency bins alleviates this bottleneck; one way to achieve this for realistic likelihoods is heterodyning.} \label{fig:scaling_l4} \end{figure} % The full run statistics for the injection study are shown in Figure~\ref{fig:injection_study_stats}. % In general, the accepted steps per iteration and the acceptance rates % for both samplers are similar. However, the \texttt{blackjax-ns} sampler % sometimes exhibits much longer chain lengths and lower acceptance rates, % skewing the mean accepted steps per iteration lower than the \texttt{bilby} % runs.
This is a consequence of the batch tuning process, where the sampler % is only tunes every 700 iterations, instead of every iteration as in \texttt{bilby}. % In both cases, at certain points in the run the samplers will suddenly enter into % regions of the parameter space that are more difficult to sample, due to % the complex geometry of the likelihood. In the sequential CPU-based implementation, % the chain lengths can immediately begin adapt to this shift. However, in the batch-based % approach, if the live points enter this region soon after a tuning, the chains must wait % for potentially hundreds more iterations before the walk lengths can be tuned. This will lead % to very low acceptance rates for that batch of walks, and then an overcorrection in the % walk lengths when the tuning does eventually occur. This is why we sometimes see % this relationship between very long chain lengths and very low acceptance rates. % Unfortunately, since this type of batch tuning is the most natural way to % implement the sampler on a GPU, it is difficult to avoid this issue. % Users should be aware of this, and make sure that the resulting accepted steps per % iteration is not too low regardless, as this will cause the evidences to be inaccurate due % to overly correlated samples. % In the case of this injection study, most of the examples where we had much % lower acceptance rates than for the \texttt{bilby} runs were cases where the % mean chain length was long enough to still result in a similar number of % accepted steps per iteration, affecting the timing and total likelihood evaluations only, % not the quality of the results. % \begin{figure*} % \centering % \includegraphics{figures/performance_metrics.pdf} % \caption{Full run statistics for the injection study. The % \texttt{blackjax-ns} sampler sometimes has much longer chain lengths and % lower acceptance rates than \texttt{bilby} due to the % nature of batch tuning.} % \label{fig:injection_study_stats} % \end{figure*} %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Don't change these lines \bsp % typesetting comment \label{lastpage} \end{document} ``` 4. **Bibliographic Information:** ```bbl \begin{thebibliography}{} \makeatletter \relax \def\mn@urlcharsother{\let\do\@makeother \do\$\do\&\do\#\do\^\do\_\do\%\do\~} \def\mn@doi{\begingroup\mn@urlcharsother \@ifnextchar [ {\mn@doi@} {\mn@doi@[]}} \def\mn@doi@[#1]#2{\def\@tempa{#1}\ifx\@tempa\@empty \href {http://dx.doi.org/#2} {doi:#2}\else \href {http://dx.doi.org/#2} {#1}\fi \endgroup} \def\mn@eprint#1#2{\mn@eprint@#1:#2::\@nil} \def\mn@eprint@arXiv#1{\href {http://arxiv.org/abs/#1} {{\tt arXiv:#1}}} \def\mn@eprint@dblp#1{\href {http://dblp.uni-trier.de/rec/bibtex/#1.xml} {dblp:#1}} \def\mn@eprint@#1:#2:#3:#4\@nil{\def\@tempa {#1}\def\@tempb {#2}\def\@tempc {#3}\ifx \@tempc \@empty \let \@tempc \@tempb \let \@tempb \@tempa \fi \ifx \@tempb \@empty \def\@tempb {arXiv}\fi \@ifundefined {mn@eprint@\@tempb}{\@tempb:\@tempc}{\expandafter \expandafter \csname mn@eprint@\@tempb\endcsname \expandafter{\@tempc}}} \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2016}]{GW150914} Abbott B.~P., et~al., 2016, \mn@doi [Phys. Rev. Lett.] {10.1103/PhysRevLett.116.061102}, 116, 061102 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2017a}]{GW170817} Abbott B.~P., et~al., 2017a, \mn@doi [Phys. Rev. Lett.] 
{10.1103/PhysRevLett.119.161101}, 119, 161101 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2017b}]{siren} Abbott B.~P., et~al., 2017b, \mn@doi [Nature] {10.1038/nature24471}, 551, 85 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2019}]{GWTC1} Abbott B.~P., et~al., 2019, \mn@doi [Phys. Rev. X] {10.1103/PhysRevX.9.031040}, 9, 031040 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2020a}]{aLVK_prospects} Abbott B.~P., et~al., 2020a, \mn@doi [Living Rev. Rel.] {10.1007/s41114-020-00026-9}, 23, 3 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2020b}]{LIGO_guide_signalextraction} Abbott B.~P., et~al., 2020b, \mn@doi [Classical and Quantum Gravity] {10.1088/1361-6382/ab685e}, 37, 055002 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2021}]{GWTC2_GR} Abbott R., et~al., 2021, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.103.122002}, 103, 122002 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2023a}]{GWTC3_pop_analysis} Abbott R., et~al., 2023a, \mn@doi [Phys. Rev. X] {10.1103/PhysRevX.13.011048}, 13, 011048 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2023b}]{GWTC3} Abbott R., et~al., 2023b, \mn@doi [Phys. Rev. X] {10.1103/PhysRevX.13.041039}, 13, 041039 \bibitem[\protect\citeauthoryear{Abbott et~al.}{Abbott et~al.}{2024}]{GWTC2} Abbott R., et~al., 2024, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.109.022001}, 109, 022001 \bibitem[\protect\citeauthoryear{Acernese et~al.,}{Acernese et~al.}{2014}]{aVirgo} Acernese F., et~al., 2014, Classical and Quantum Gravity, 32, 024001 \bibitem[\protect\citeauthoryear{Ashton et~al.}{Ashton et~al.}{2019}]{bilby_paper} Ashton G., et~al., 2019, \mn@doi [Astrophys. J. Suppl.] {10.3847/1538-4365/ab06fc}, 241, 27 \bibitem[\protect\citeauthoryear{Ashton, Bernstein, Buchner \& et al.}{Ashton et~al.}{2022}]{NSNature} Ashton G., Bernstein N., Buchner J., et al. 
2022, \mn@doi [Nature Reviews Methods Primers] {10.1038/s43586-022-00121-x}, 2, 39 \bibitem[\protect\citeauthoryear{Baker, Lasky, Thrane \& Golomb}{Baker et~al.}{2025}]{makaibaker2025} Baker A.~M., Lasky P.~D., Thrane E., Golomb J., 2025, Significant challenges for astrophysical inference with next-generation gravitational-wave observatories (\mn@eprint {arXiv} {2503.04073}) \bibitem[\protect\citeauthoryear{{Bayes}}{{Bayes}}{1763}]{Bayes1763} {Bayes} T., 1763, \mn@doi [Philosophical Transactions of the Royal Society of London] {10.1098/rstl.1763.0053}, \href {https://ui.adsabs.harvard.edu/abs/1763RSPT...53..370B} {53, 370} \bibitem[\protect\citeauthoryear{{Biwer}, {Capano}, {De}, {Cabero}, {Brown}, {Nitz} \& {Raymond}}{{Biwer} et~al.}{2019}]{pycbc} {Biwer} C.~M., {Capano} C.~D., {De} S., {Cabero} M., {Brown} D.~A., {Nitz} A.~H., {Raymond} V., 2019, \mn@doi [Publications of the Astronomical Society of the Pacific] {10.1088/1538-3873/aaef0b}, \href {https://ui.adsabs.harvard.edu/abs/2019PASP..131b4503B} {131, 024503} \bibitem[\protect\citeauthoryear{Bradbury et~al.,}{Bradbury et~al.}{2018}]{jax2018github} Bradbury J., et~al., 2018, {JAX}: composable transformations of {P}ython+{N}um{P}y programs, \url {http://github.com/google/jax} \bibitem[\protect\citeauthoryear{Branchesi et~al.}{Branchesi et~al.}{2023}]{ET_science_case} Branchesi M., et~al., 2023, \mn@doi [Journal of Cosmology and Astroparticle Physics] {10.1088/1475-7516/2023/07/068}, 2023, 068 \bibitem[\protect\citeauthoryear{Buchner}{Buchner}{2023}]{NS_methods_buchner} Buchner J., 2023, \mn@doi [Statistics Surveys] {10.1214/23-SS144}, 17, 169 \bibitem[\protect\citeauthoryear{Cabezas, Corenflos, Lao \& Louf}{Cabezas et~al.}{2024}]{cabezas2024blackjax} Cabezas A., Corenflos A., Lao J., Louf R., 2024, BlackJAX: Composable {B}ayesian inference in {JAX} (\mn@eprint {arXiv} {2402.10797}) \bibitem[\protect\citeauthoryear{Christensen \& Meyer}{Christensen \& Meyer}{2022}]{NelsonMeyerReview} Christensen N., Meyer R., 2022, \mn@doi [Rev. Mod. Phys.] {10.1103/RevModPhys.94.025001}, 94, 025001 \bibitem[\protect\citeauthoryear{Cornish}{Cornish}{2013}]{relativebinning4} Cornish N.~J., 2013, Fast Fisher Matrices and Lazy Likelihoods (\mn@eprint {arXiv} {1007.4820}), \url {https://arxiv.org/abs/1007.4820} \bibitem[\protect\citeauthoryear{Dau \& Chopin}{Dau \& Chopin}{2021}]{dau2021wastefreesequentialmontecarlo} Dau H.-D., Chopin N., 2021, Waste-free Sequential Monte Carlo (\mn@eprint {arXiv} {2011.02328}), \url {https://arxiv.org/abs/2011.02328} \bibitem[\protect\citeauthoryear{Edwards, Wong, Lam, Coogan, Foreman-Mackey, Isi \& Zimmerman}{Edwards et~al.}{2024}]{ripple} Edwards T. D.~P., Wong K. W.~K., Lam K. K.~H., Coogan A., Foreman-Mackey D., Isi M., Zimmerman A., 2024, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.110.064028}, 110, 064028 \bibitem[\protect\citeauthoryear{Handley, Hobson \& Lasenby}{Handley et~al.}{2015}]{Handley:2015vkr} Handley W.~J., Hobson M.~P., Lasenby A.~N., 2015, \mn@doi [Mon. Not. Roy. Astron. Soc.] {10.1093/mnras/stv1911}, 453, 4385 \bibitem[\protect\citeauthoryear{Henderson \& Goggans}{Henderson \& Goggans}{2014}]{parallel_ns} Henderson R.~W., Goggans P.~M., 2014, in AIP Conference Proceedings. 
pp 100--105, \mn@doi{10.1063/1.4903717} \bibitem[\protect\citeauthoryear{Higson, Handley, Hobson \& Lasenby}{Higson et~al.}{2018}]{Higson_Errors} Higson E., Handley W., Hobson M., Lasenby A., 2018, \mn@doi [Bayesian Analysis] {10.1214/17-BA1075}, 13, 873 \bibitem[\protect\citeauthoryear{Higson, Handley, Hobson \& Lasenby}{Higson et~al.}{2019}]{dynamic_ns} Higson E., Handley W., Hobson M., Lasenby A., 2019, \mn@doi [Statistics and Computing] {10.1007/s11222-018-9844-0}, 29, 891 \bibitem[\protect\citeauthoryear{Hoffman \& Sountsov}{Hoffman \& Sountsov}{2022}]{pmlr-v151-hoffman22a} Hoffman M.~D., Sountsov P., 2022, in Camps-Valls G., Ruiz F. J.~R., Valera I., eds, Proceedings of Machine Learning Research Vol. 151, Proceedings of The 25th International Conference on Artificial Intelligence and Statistics. PMLR, pp 7799--7813, \url {https://proceedings.mlr.press/v151/hoffman22a.html} \bibitem[\protect\citeauthoryear{Hoffman, Radul \& Sountsov}{Hoffman et~al.}{2021}]{pmlr-v130-hoffman21a} Hoffman M., Radul A., Sountsov P., 2021, in Banerjee A., Fukumizu K., eds, Proceedings of Machine Learning Research Vol. 130, Proceedings of The 24th International Conference on Artificial Intelligence and Statistics. PMLR, pp 3907--3915, \url {https://proceedings.mlr.press/v130/hoffman21a.html} \bibitem[\protect\citeauthoryear{Hu \& Veitch}{Hu \& Veitch}{2024}]{HuAccelerationReview} Hu Q., Veitch J., 2024, Costs of Bayesian Parameter Estimation in Third-Generation Gravitational Wave Detectors: a Review of Acceleration Methods (\mn@eprint {arXiv} {2412.02651}), \url {https://arxiv.org/abs/2412.02651} \bibitem[\protect\citeauthoryear{Hu, Baryshnikov \& Handley}{Hu et~al.}{2024}]{aeons} Hu Z., Baryshnikov A., Handley W., 2024, \mn@doi [Mon. Not. Roy. Astron. Soc.] {10.1093/mnras/stae1754}, 532, 4035 \bibitem[\protect\citeauthoryear{Jouppi et~al.,}{Jouppi et~al.}{2023}]{jouppi2023tpu} Jouppi N., et~al., 2023, in Proceedings of the 50th annual international symposium on computer architecture. pp 1--14 \bibitem[\protect\citeauthoryear{Khan, Husa, Hannam, Ohme, P\"urrer, Jim\'enez~Forteza \& Boh\'e}{Khan et~al.}{2016}]{Khan:2015jqa} Khan S., Husa S., Hannam M., Ohme F., P\"urrer M., Jim\'enez~Forteza X., Boh\'e A., 2016, \mn@doi [Phys. Rev. 
D] {10.1103/PhysRevD.93.044007}, 93, 044007 \bibitem[\protect\citeauthoryear{Krishna, Vijaykumar, Ganguly, Talbot, Biscoveanu, George, Williams \& Zimmerman}{Krishna et~al.}{2023}]{TL_relativebinning} Krishna K., Vijaykumar A., Ganguly A., Talbot C., Biscoveanu S., George R.~N., Williams N., Zimmerman A., 2023, Accelerated parameter estimation in Bilby with relative binning (\mn@eprint {arXiv} {2312.06009}), \url {https://arxiv.org/abs/2312.06009} \bibitem[\protect\citeauthoryear{Leslie, Dai \& Pratten}{Leslie et~al.}{2021}]{relativebinning3} Leslie N., Dai L., Pratten G., 2021, \mn@doi [Physical Review D] {10.1103/physrevd.104.123030}, 104 \bibitem[\protect\citeauthoryear{{NVIDIA Corporation}}{{NVIDIA Corporation}}{2025}]{CUDAGuide} {NVIDIA Corporation} 2025, {CUDA C++ Programming Guide}, \url {https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html} \bibitem[\protect\citeauthoryear{Neal}{Neal}{2003}]{Neal2003_slice} Neal R.~M., 2003, \mn@doi [The Annals of Statistics] {10.1214/aos/1056562461}, 31, 705 \bibitem[\protect\citeauthoryear{Owens, Houston, Luebke, Green, Stone \& Phillips}{Owens et~al.}{2008}]{GPU_computing} Owens J.~D., Houston M., Luebke D., Green S., Stone J.~E., Phillips J.~C., 2008, \mn@doi [Proceedings of the IEEE] {10.1109/JPROC.2008.917757}, 96, 879 \bibitem[\protect\citeauthoryear{Prathaban \& Handley}{Prathaban \& Handley}{2024}]{Prathaban_PE_errors} Prathaban M., Handley W., 2024, \mn@doi [Monthly Notices of the Royal Astronomical Society] {10.1093/mnras/stae1908}, 533, 1839 \bibitem[\protect\citeauthoryear{Prathaban et~al.}{Prathaban et~al.}{2025b}]{Prathaban_2025_github} Prathaban M., et~al., 2025b, {blackjax\_ns\_gw: blackjax-ns for Gravitational Wave Inference on GPUs}, \url{https://github.com/mrosep/blackjax_ns_gw} \bibitem[\protect\citeauthoryear{Prathaban, Yallup, Alvey, Yang, Templeton \& Handley}{Prathaban et~al.}{2025a}]{Prathaban_2025_zenodo} Prathaban M., Yallup D., Alvey J., Yang M., Templeton W., Handley W., 2025a, {Gravitational-wave inference at GPU speed: A bilby-like nested sampling kernel within blackjax-ns}, \mn@doi{10.5281/zenodo.17012011}, \url {https://doi.org/10.5281/zenodo.17012011} \bibitem[\protect\citeauthoryear{Prathaban, Bevins \& Handley}{Prathaban et~al.}{2025c}]{Prathaban} Prathaban M., Bevins H., Handley W., 2025c, \mn@doi [Monthly Notices of the Royal Astronomical Society] {10.1093/mnras/staf962}, p. staf962 \bibitem[\protect\citeauthoryear{Pratten et~al.,}{Pratten et~al.}{2021}]{IMRPhenomXPHM} Pratten G., et~al., 2021, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.103.104056}, 103, 104056 \bibitem[\protect\citeauthoryear{Romero-Shaw et~al.,}{Romero-Shaw et~al.}{2020}]{bilby_validation} Romero-Shaw I.~M., et~al., 2020, \mn@doi [Monthly Notices of the Royal Astronomical Society] {10.1093/mnras/staa2850}, 499, 3295 \bibitem[\protect\citeauthoryear{Skilling}{Skilling}{2006}]{skilling} Skilling J., 2006, \mn@doi [Bayesian Analysis] {10.1214/06-BA127}, 1, 833 \bibitem[\protect\citeauthoryear{Smith, Ashton, Vajpeyi \& Talbot}{Smith et~al.}{2020}]{Smith:2019ucc} Smith R. J.~E., Ashton G., Vajpeyi A., Talbot C., 2020, \mn@doi [Mon. Not. Roy. Astron. Soc.] {10.1093/mnras/staa2483}, 498, 4492 \bibitem[\protect\citeauthoryear{Speagle}{Speagle}{2020}]{dynesty} Speagle J.~S., 2020, \mn@doi [Monthly Notices of the Royal Astronomical Society] {10.1093/mnras/staa278}, 493, 3132–3158 \bibitem[\protect\citeauthoryear{Storn \& Price}{Storn \& Price}{1997}]{DE} Storn R., Price K., 1997, \mn@doi [J. Global Optim.] 
{10.1023/A:1008202821328}, 11, 341 \bibitem[\protect\citeauthoryear{{The LIGO Scientific Collaboration} et~al.}{{The LIGO Scientific Collaboration} et~al.}{2015}]{aLIGO} {The LIGO Scientific Collaboration} et~al., 2015, \mn@doi [Classical and Quantum Gravity] {10.1088/0264-9381/32/7/074001}, 32, 074001 \bibitem[\protect\citeauthoryear{Thrane \& Talbot}{Thrane \& Talbot}{2019}]{Thrane_2019} Thrane E., Talbot C., 2019, \mn@doi [Publications of the Astronomical Society of Australia] {10.1017/pasa.2019.2}, 36 \bibitem[\protect\citeauthoryear{Veitch et~al.,}{Veitch et~al.}{2015}]{lal} Veitch J., et~al., 2015, \mn@doi [Physical Review D] {10.1103/physrevd.91.042003}, 91 \bibitem[\protect\citeauthoryear{{Williams}, {Veitch} \& {Messenger}}{{Williams} et~al.}{2021}]{Williams2021Nessai} {Williams} M.~J., {Veitch} J., {Messenger} C., 2021, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.103.103006}, \href {https://ui.adsabs.harvard.edu/abs/2021PhRvD.103j3006W} {103, 103006} \bibitem[\protect\citeauthoryear{Wong, Isi \& Edwards}{Wong et~al.}{2023a}]{wong2023fastgravitationalwaveparameter} Wong K. W.~K., Isi M., Edwards T. D.~P., 2023a, Fast gravitational wave parameter estimation without compromises (\mn@eprint {arXiv} {2302.05333}), \url {https://arxiv.org/abs/2302.05333} \bibitem[\protect\citeauthoryear{Wong, Gabri\'e \& Foreman-Mackey}{Wong et~al.}{2023b}]{flowMC} Wong K. W.~k., Gabri\'e M., Foreman-Mackey D., 2023b, \mn@doi [J. Open Source Softw.] {10.21105/joss.05021}, 8, 5021 \bibitem[\protect\citeauthoryear{Wouters, Pang, Dietrich \& Van Den~Broeck}{Wouters et~al.}{2024}]{Wouters_BNS} Wouters T., Pang P. T.~H., Dietrich T., Van Den~Broeck C., 2024, \mn@doi [Phys. Rev. D] {10.1103/PhysRevD.110.083033}, 110, 083033 \bibitem[\protect\citeauthoryear{Yallup, Handley, Cabezas, et~al.}{Yallup et~al.}{2025b}]{blackjax_ns_github} Yallup D., Handley W., Cabezas J., et~al., 2025b, {blackjax}: A Library of Bayesian Inference Algorithms in JAX (Nested Sampling Branch), \url{https://github.com/handley-lab/blackjax/tree/nested_sampling} \bibitem[\protect\citeauthoryear{Yallup, Kroupa \& Handley}{Yallup et~al.}{2025a}]{yallup2025nested} Yallup D., Kroupa N., Handley W., 2025a, in Frontiers in Probabilistic Inference: Learning meets Sampling. \url {https://openreview.net/forum?id=ekbkMSuPo4} \bibitem[\protect\citeauthoryear{Zackay, Dai \& Venumadhav}{Zackay et~al.}{2018}]{relativebinning2} Zackay B., Dai L., Venumadhav T., 2018, Relative Binning and Fast Likelihood Evaluation for Gravitational Wave Parameter Estimation (\mn@eprint {arXiv} {1806.08792}), \url {https://arxiv.org/abs/1806.08792} \bibitem[\protect\citeauthoryear{{ter Braak}}{{ter Braak}}{2006}]{DE2} {ter Braak} C. J.~F., 2006, \mn@doi [Statistics and Computing] {10.1007/s11222-006-8769-1}, 16, 239 \makeatother \end{thebibliography} ``` 5. 
**Author Information:** - Lead Author: {'name': 'Metha Prathaban'} - Full Authors List: ```yaml Metha Prathaban: phd: start: 2022-10-01 supervisors: - Will Handley thesis: null partiii: start: 2020-10-01 end: 2021-06-01 supervisors: - Will Handley thesis: Evidence for a Palindromic Universe original_image: images/originals/metha_prathaban.png image: /assets/group/images/metha_prathaban.jpg links: Harding Scholar: https://www.hardingscholars.fund.cam.ac.uk/metha-prathaban-2022-cohort GitHub: https://github.com/mrosep David Yallup: postdoc: start: 2021-01-10 thesis: null original_image: images/originals/david_yallup.jpg image: /assets/group/images/david_yallup.jpg links: ORCiD: https://orcid.org/0000-0003-4716-5817 linkedin: https://www.linkedin.com/in/dyallup/ James Alvey: coi: start: 2024-10-01 thesis: null image: https://www.kicc.cam.ac.uk/sites/default/files/styles/inline/public/images/profile/profilepicture-min_1.jpeg?itok=ccmz2RPK links: Department webpage: https://www.kicc.cam.ac.uk/staff/dr-james-alvey Ming Yang: partiii: start: 2024-10-01 end: 2025-06-01 supervisors: - Will Handley thesis: null original_image: images/originals/ming_yang.jpeg image: /assets/group/images/ming_yang.jpg Will Templeton: partiii: start: 2024-10-01 end: 2025-06-01 supervisors: - Will Handley thesis: null original_image: images/originals/will_templeton.png image: /assets/group/images/will_templeton.jpg Will Handley: pi: start: 2020-10-01 thesis: null postdoc: start: 2016-10-01 end: 2020-10-01 thesis: null phd: start: 2012-10-01 end: 2016-09-30 supervisors: - Anthony Lasenby - Mike Hobson thesis: 'Kinetic initial conditions for inflation: theory, observation and methods' original_image: images/originals/will_handley.jpeg image: /assets/group/images/will_handley.jpg links: Webpage: https://willhandley.co.uk ``` This YAML file provides a concise snapshot of an academic research group. It lists members by name along with their academic roles—ranging from Part III and summer projects to MPhil, PhD, and postdoctoral positions—with corresponding dates, thesis topics, and supervisor details. Supplementary metadata includes image paths and links to personal or departmental webpages. A dedicated "coi" section profiles senior researchers, highlighting the group’s collaborative mentoring network and career trajectories in cosmology, astrophysics, and Bayesian data analysis. ==================================================================================== Final Output Instructions ==================================================================================== - Combine all data sources to create a seamless, engaging narrative. - Follow the exact Markdown output format provided at the top. - Do not include any extra explanation, commentary, or wrapping beyond the specified Markdown. - Validate that every bibliographic reference with a DOI or arXiv identifier is converted into a Markdown link as per the examples. - Validate that every Markdown author link corresponds to a link in the author information block. - Before finalizing, confirm that no LaTeX citation commands or other undesired formatting remain. - Before finalizing, confirm that the link to the paper itself [2509.04336](https://arxiv.org/abs/2509.04336) is featured in the first sentence. Generate only the final Markdown output that meets all these requirements. {% endraw %}