Hunting for bumps in the margins
In the search for new particles at collider experiments, accurately distinguishing a faint signal from an overwhelming background is a central challenge. Our paper, “Hunting for bumps in the margins,” by David Yallup and Will Handley, presents a principled Bayesian framework to address the critical and often-overlooked issue of modelling uncertainty in data-driven background estimation.
The Challenge of Background Modelling
At particle colliders like the LHC, many searches for new physics involve looking for a small “bump” in a smoothly falling mass spectrum. The key difficulty is that the precise shape of this background is often not known from first principles and must be modelled using flexible functions fit to the data itself. This procedure introduces two sources of uncertainty:
- Parametric Uncertainty: The uncertainty on the parameters of a given background model, arising from the statistical fluctuations in the data.
- Modelling Uncertainty: The uncertainty from the choice of the model itself. Is an exponential function better than a power law? How many terms should be in the function?
Traditionally, this modelling uncertainty is handled by fitting several candidate models and creating an envelope of the results, a process which can involve ad-hoc choices and tuning, as seen in methods like discrete profiling (1408.6865). Our work proposes a more rigorous and automated approach rooted in Bayesian inference.
A Principled Bayesian Solution
We treat the entire problem within a Bayesian framework, where the central quantity is the marginal likelihood, or evidence. The evidence for a model naturally embodies Occam’s razor: it rewards models that fit the data well but penalises unnecessary complexity. By comparing evidences, we can perform robust model selection and, even better, average over different models weighted by their evidence.
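To make the evidence-weighted averaging concrete, here is a minimal sketch of Bayesian model averaging given per-model log-evidences. All numbers (the log-evidences and the per-model background yields) are made up for illustration; in practice they would come from nested-sampling runs over each candidate model.

```python
import numpy as np

# Illustrative log-evidences (ln Z) for three candidate background models,
# e.g. from separate nested-sampling runs; the values here are made up.
log_Z = np.array([-1052.3, -1050.1, -1054.8])

# Posterior model probabilities under equal prior model odds: p(M_k | D) is
# proportional to Z_k.
weights = np.exp(log_Z - log_Z.max())
weights /= weights.sum()

# Made-up per-model posterior-mean background yields in four mass bins.
predictions = np.array([
    [120.0, 80.0, 55.0, 38.0],
    [118.0, 82.0, 54.0, 37.0],
    [125.0, 78.0, 57.0, 40.0],
])

# Model-averaged background: the evidence weights encode Occam's razor,
# down-weighting models that are more flexible than the data require.
averaged = weights @ predictions
print(weights.round(3), averaged.round(1))
```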
Our method employs a flexible background model constructed from a sum of simple basis functions (such as exponentials, power laws, and logarithmic polynomials). Crucially, we extend the inference to include discrete hyperparameters that govern the choice of function family and the number of basis functions. To navigate this complex, multimodal parameter space, we leverage the power of Nested Sampling, a computational technique specifically designed for evidence calculation, as implemented in the PolyChord sampler (1506.00171).
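As a rough illustration of what such a basis-function background might look like, the sketch below evaluates a sum of terms from a discretely chosen family with a discretely chosen number of terms. The specific functional forms and the `family`/`coeffs` parameterisation are illustrative choices, not necessarily those used in the paper.

```python
import numpy as np

def background(x, family, coeffs):
    """Toy smoothly falling background: a sum of simple basis functions.

    x      : mass values scaled into (0, 1]
    family : discrete hyperparameter choosing the basis ('exp', 'power', 'logpoly')
    coeffs : list of (amplitude, shape) pairs; its length is the (discrete)
             number of basis terms.
    """
    terms = []
    for a, b in coeffs:
        if family == "exp":
            terms.append(a * np.exp(-b * x))
        elif family == "power":
            terms.append(a * x ** (-b))
        elif family == "logpoly":
            terms.append(a * x ** (-b * np.log(x)))
    return np.sum(terms, axis=0)

# Example: a two-term exponential background evaluated on a coarse grid.
x = np.linspace(0.1, 1.0, 10)
print(background(x, "exp", [(100.0, 5.0), (10.0, 1.0)]))
```

In the full analysis, both `family` and the number of terms are sampled alongside the continuous parameters, so the evidence calculation automatically accounts for the discrete model choice.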
From Backgrounds to Bump Hunts
We first apply this machinery to a toy problem inspired by Higgs boson measurements in the diphoton channel. By marginalising over all model choices, we produce a single, robust background prediction with a principled uncertainty band that fully incorporates both parametric and modelling uncertainties. Our results show that this Bayesian approach promotes a greater diversity of functional forms compared to standard maximum-likelihood methods, preventing overconfidence in any single model.
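One way to picture how such a marginalised uncertainty band arises is to pool posterior-predictive background curves from all models in proportion to their evidence weights and read off percentiles. The snippet below is a schematic stand-in, with synthetic Gaussian "posterior samples" and assumed weights, not the actual analysis output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic posterior-predictive background curves for three models:
# one row per posterior sample, one column per mass bin (all made up).
samples_per_model = [
    rng.normal(loc=m, scale=3.0, size=(500, 4))
    for m in ([120, 80, 55, 38], [118, 82, 54, 37], [125, 78, 57, 40])
]
weights = np.array([0.3, 0.6, 0.1])  # assumed evidence-based model probabilities

# Draw curves in proportion to the model weights, then take central percentiles:
# the resulting band reflects parametric *and* modelling uncertainty.
choices = rng.choice(len(weights), size=2000, p=weights)
pooled = np.vstack([samples_per_model[k][rng.integers(500)] for k in choices])
lo, med, hi = np.percentile(pooled, [16, 50, 84], axis=0)
print(med.round(1), (hi - lo).round(1))
```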
The framework is then extended to a full-fledged anomaly detection task: a generic “bump hunt.” We introduce a signal hypothesis (a Gaussian bump with unknown amplitude, position, and width) and compare its evidence ($\mathcal{Z}_\psi$) to the evidence for the background-only null hypothesis ($\mathcal{Z}_0$). The ratio of these evidences, $\mathcal{Z}_\psi/\mathcal{Z}_0$, provides a direct measure of how much the data favour the presence of a signal.
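A minimal sketch of the signal template and the evidence-ratio comparison is shown below. The log-evidence values are invented purely to show how the Bayes factor is formed; in the actual pipeline each would come from a nested-sampling run under the corresponding hypothesis.

```python
import numpy as np

def signal(x, amplitude, position, width):
    # Gaussian bump template with free amplitude, location, and width.
    return amplitude * np.exp(-0.5 * ((x - position) / width) ** 2)

# Made-up log-evidences from two hypothetical nested-sampling runs.
log_Z_psi = -1048.2   # background + signal hypothesis
log_Z_0   = -1050.1   # background-only null hypothesis

ln_B = log_Z_psi - log_Z_0  # log Bayes factor, ln(Z_psi / Z_0)
print(f"Odds of roughly {np.exp(ln_B):.1f} : 1 in favour of a signal")
```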
Our key findings demonstrate the power of this end-to-end Bayesian pipeline:
- True Positives: When a small signal is injected into the data, our method successfully identifies it, with the evidence ratio providing clear “betting odds” in its favour. The posterior distributions on the signal parameters correctly locate the injected bump.
- False Positives: When run on a null dataset with no signal, the evidence ratio correctly disfavours the signal hypothesis, demonstrating that the Bayesian Occam’s razor effectively guards against mistaking statistical fluctuations for new physics. This addresses a notoriously difficult aspect of resonance searches, as discussed in works like 1902.03243.
By constructing a well-calibrated, data-driven background model and seamlessly integrating it into a signal search, this work provides a powerful and statistically justifiable pipeline for anomaly detection, with the potential to uncover subtle signals that might be missed by more conventional techniques.
Content generated by gemini-2.5-pro using this prompt.
Image generated by imagen-3.0-generate-002 using this prompt.