
Type Ia supernovae (SNe Ia) are cornerstones of modern cosmology, providing the crucial standardisable candles that led to the discovery of the Universe’s accelerating expansion. In a new paper, 2509.13394, our group introduces a sophisticated Bayesian framework to automate the critical process of data curation for SNe Ia light curves, a vital step for ensuring the precision needed for next-generation cosmological analyses. As we enter an era of unprecedented data volumes from surveys like the Vera C. Rubin Observatory’s Legacy Survey of Space and Time (LSST), traditional methods of manually inspecting and cleaning datasets are becoming entirely impractical. The work, led by S. A. K. Leeney along with W. J. Handley, H. T. J. Bevins, and E. de Lera Acedo, tackles this challenge head-on by embedding anomaly detection directly into the statistical heart of the analysis.

A Principled Approach to Imperfect Data

Current SNe Ia analyses often rely on subjective, manual decisions to reject anomalous photometric measurements or entire bandpasses. This not only introduces potential biases but also lacks the scalability required for the millions of supernovae LSST is expected to observe (10.3847/1538-4357/ab042c). Our framework provides an automated, objective, and reproducible alternative.

The core of our method, adapted from techniques developed for 21cm cosmology (2211.15448), is to treat the quality of each data point as a latent variable to be inferred. Instead of making a hard decision to keep or reject a measurement, our model calculates the posterior probability that each point is an anomaly. This is achieved by reformulating the likelihood to include two competing models for each data point (a schematic form is given below):

  • A “good” model, where the data point is described by the physical supernova model (in this case, the widely-used SALT3 model).
  • A “bad” model, which treats the data point as a contaminant, described by a broad, uninformative distribution.
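Schematically, and with illustrative notation (the exact priors and normalisations follow the paper and 2211.15448), the per-point likelihood becomes a mixture of these two models:

$$
\mathcal{L}_i(\theta) \;=\; p\,\mathcal{N}\!\big(d_i \,\big|\, m_i(\theta),\, \sigma_i^2\big) \;+\; (1 - p)\,\frac{1}{\Delta},
\qquad
\mathcal{L}(\theta) \;=\; \prod_i \mathcal{L}_i(\theta),
$$

where $m_i(\theta)$ is the physical-model prediction for measurement $d_i$ with uncertainty $\sigma_i$, $p$ is the prior probability that a point is “good”, and $1/\Delta$ is the broad, uninformative density assigned to contaminants. The posterior probability that point $i$ is anomalous then follows from the relative weight of the second term.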

By marginalising over all possible combinations of good and bad data points, the framework simultaneously infers the supernova’s physical parameters while probabilistically identifying and down-weighting outliers. To make this computationally intensive process feasible for large datasets, the entire framework is implemented on GPUs using the JAX-bandflux library.
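Because the likelihood factorises over data points, marginalising over every good/bad combination reduces to a two-component mixture evaluated independently for each measurement, which is exactly the kind of operation that vectorises well on a GPU. The snippet below is a minimal, illustrative JAX sketch of that structure; it is not the jax-bandflux API, and the default values of the “good” prior probability `p_good` and the contaminant width `delta` are placeholders:

```python
import jax.numpy as jnp
from jax.scipy.stats import norm

def marginalised_loglike(model_flux, flux, flux_err, p_good=0.9, delta=1e4):
    """Per-point good/bad mixture log-likelihood, with each point's binary
    quality flag marginalised out analytically."""
    # "Good" component: measurement described by the physical (e.g. SALT3) model.
    log_good = jnp.log(p_good) + norm.logpdf(flux, loc=model_flux, scale=flux_err)
    # "Bad" component: broad, uninformative density of width `delta` for contaminants.
    log_bad = jnp.log(1.0 - p_good) - jnp.log(delta)
    # Marginalise each point's flag: log[p * N(...) + (1 - p) / delta].
    return jnp.sum(jnp.logaddexp(log_good, log_bad))

def anomaly_probability(model_flux, flux, flux_err, p_good=0.9, delta=1e4):
    """Posterior probability that each point is drawn from the 'bad' model."""
    log_good = jnp.log(p_good) + norm.logpdf(flux, loc=model_flux, scale=flux_err)
    log_bad = jnp.log(1.0 - p_good) - jnp.log(delta)
    return jnp.exp(log_bad - jnp.logaddexp(log_good, log_bad))

# Toy usage: three measurements, the last one a gross outlier.
model = jnp.array([10.0, 12.0, 11.0])
data = jnp.array([10.2, 11.8, 55.0])
err = jnp.array([0.5, 0.5, 0.5])
print(marginalised_loglike(model, data, err))
print(anomaly_probability(model, data, err))  # last point ~1, i.e. flagged as anomalous
```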

Key Capabilities and Findings

Applying this framework to the Hawaii Supernova Flows dataset, we demonstrated three principal advantages over traditional data curation:

  1. Robust Outlier Mitigation: The model effectively identifies and mitigates the influence of isolated, sporadic outliers that could otherwise skew parameter estimates.
  2. Automated Filter Rejection: In cases where an entire photometric bandpass is corrupted, the framework automatically identifies every data point within it as anomalous, replicating the decision an expert analyst would make without any manual intervention.
  3. Enhanced Data Preservation: Perhaps most significantly, the method can flag specific anomalous points within an otherwise useful filter. This “filter preservation” retains valid data that would be discarded in an all-or-nothing rejection scheme, leading to more precise and robust parameter constraints.

Beyond simply cleaning the data, our analysis quantified the systematic biases introduced by these anomalies. We found that contaminating data points are systematically brighter and bluer than the underlying supernova model would predict. If left uncorrected, this bias would lead to an underestimation of dust extinction. Through the well-known Tripp relation, this would systematically lower the inferred distance moduli, making supernovae appear closer than they are and potentially biasing key cosmological measurements, including the dark energy equation of state and the Hubble constant.
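For reference, the Tripp standardisation takes its conventional form (the notation here is the standard one, not necessarily the exact symbols used in the paper):

$$
\mu = m_B - M + \alpha x_1 - \beta c,
$$

where $m_B$, $x_1$ and $c$ are the fitted peak magnitude, stretch and colour, and $\alpha$, $\beta$ and $M$ are global standardisation parameters. Any contamination-induced bias in the fitted light-curve parameters therefore propagates directly into the distance modulus $\mu$, which is how the brighter-and-bluer bias described above shifts the inferred distances.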

This work represents a critical step towards building fully autonomous, end-to-end analysis pipelines for precision cosmology. While demonstrated with the SALT3 model (10.3847/1538-4357/ac30d8), the framework’s model-agnostic design makes it a versatile tool for any likelihood-based analysis. By embedding data quality assessment within the core statistical model, we can build a more robust, efficient, and powerful foundation for the coming era of data-driven cosmological discovery.

S. A. K. Leeney, W. J. Handley, H. T. J. Bevins, E. de Lera Acedo

Content generated by gemini-2.5-pro using this prompt.

Image generated by imagen-4.0-generate-001 using this prompt.