
Our new work, 2511.08438, introduces “Galactification,” a framework that uses a transformer-based generative model to rapidly produce realistic mock galaxy catalogs. This research, led by Shivam Pandey along with collaborators Christopher C. Lovell, Chirag Modi, and Benjamin D. Wandelt, tackles a fundamental challenge in modern cosmology: the immense computational cost of simulating the Universe at the scale and detail required by next-generation astronomical surveys.

The Cosmological Simulation Challenge

To interpret the vast datasets from surveys like Euclid, we need to compare observations with theoretical predictions. The most accurate theoretical models come from hydrodynamical simulations, which self-consistently evolve dark matter, gas, stars, and black holes. However, they are extremely expensive: a single high-fidelity run can consume hundreds of millions of CPU hours. This makes it prohibitive to generate the large ensembles of simulations needed for robust statistical analyses, such as those used in simulation-based inference (SBI) [10.1073/pnas.1912789117].

On the other hand, dark-matter-only (N-body) simulations are over 100 times faster, as they only model gravitational interactions. The key challenge is to bridge the gap: how can we accurately populate these fast N-body simulations with galaxies that have realistic properties and spatial distributions, mirroring what a full hydrodynamical simulation would produce?

Galactification: An AI-Powered Solution

Our approach is to learn the complex, conditional relationship between the underlying dark matter scaffolding and the galaxies that inhabit it. We developed a multi-modal, transformer-based model that effectively “paints” galaxies onto N-body simulations.

Model Architecture and Data

The model is trained on the CAMELS simulation suite [10.3847/1538-4357/abf7ba], which provides paired N-body and hydrodynamical runs across a range of cosmological and astrophysical parameters. Our model’s architecture, which builds upon the framework developed in 2409.11401, is designed to capture information at multiple scales:

  • Input: The model takes 3D dark matter density and velocity fields from five different snapshots in time. This provides crucial information about the local environment and the growth of large-scale structure.
  • Encoder: A Convolutional Block Attention Module (CBAM) and a Vision Transformer (ViT) work in tandem to extract both local features and long-range correlations from the input fields.
  • Decoder: A cross-attention decoder then generates a sequence of “tokens” that represent the full galaxy catalog. Each galaxy is described by a “word” of six tokens: three for its 3D position and one each for its line-of-sight velocity, stellar mass, and g-band magnitude.
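
To make the six-token “word” concrete, here is a minimal Python sketch of one way a galaxy catalog could be turned into integer tokens and back. It is an illustration only: the uniform binning, the bin count N_BINS, the property ranges in RANGES, and the function names tokenize and detokenize are all assumptions made for this example, not the discretization used in the paper.

```python
import numpy as np

# Hypothetical settings: the bin count and property ranges below are
# assumptions for illustration, not values taken from the paper.
N_BINS = 256
RANGES = {
    "x": (0.0, 25.0),            # Mpc/h (assumed box size)
    "y": (0.0, 25.0),
    "z": (0.0, 25.0),
    "v_los": (-1500.0, 1500.0),  # km/s
    "log_mstar": (8.0, 12.5),    # log10 of stellar mass in solar masses
    "mag_g": (-25.0, -12.0),     # g-band magnitude
}
FIELDS = list(RANGES)
LO = np.array([RANGES[f][0] for f in FIELDS])
HI = np.array([RANGES[f][1] for f in FIELDS])


def tokenize(catalog):
    """Map an (N, 6) float catalog to an (N, 6) array of integer bin indices."""
    frac = (np.asarray(catalog, dtype=float) - LO) / (HI - LO)
    tokens = np.floor(frac * N_BINS).astype(np.int64)
    # Values outside the assumed ranges are clipped into the edge bins.
    return np.clip(tokens, 0, N_BINS - 1)


def detokenize(tokens):
    """Map bin indices back to the centre of each bin."""
    return LO + (np.asarray(tokens) + 0.5) / N_BINS * (HI - LO)


# One galaxy: position, line-of-sight velocity, log stellar mass, g magnitude.
catalog = np.array([[12.5, 0.0, 24.99, 0.0, 10.0, -20.0]])
tokens = tokenize(catalog)       # one six-token "word" per galaxy
sequence = tokens.reshape(-1)    # flat sequence of length 6 * N
recovered = detokenize(tokens)
```

In this sketch the round trip loses at most half a bin width per property, so the bin count sets the resolution of the generated catalog.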

This generative approach allows the model to capture the inherent stochasticity of galaxy formation, producing varied and realistic catalogs from a single dark matter input. Generation is also fast: a full galaxy catalog takes approximately 30 seconds on a single GPU, compared with the thousands of CPU-hours needed for the equivalent hydrodynamical simulation.

Validating the Results

We compared our model’s output with the ground-truth hydrodynamical simulations from a held-out test set. The generated catalogs agree closely with the truth across the following statistical measures.

  • Multi-Dimensional Distribution: Using the PQMass methodology [2402.04355], we performed a quantitative comparison of the full six-dimensional distribution of galaxy properties (three position coordinates and three physical properties). Under this test, our generated catalogs are statistically indistinguishable from the true ones.
  • One-Point Statistics: As shown in the paper, the model accurately reproduces the histograms of stellar mass, g-band magnitude, and line-of-sight velocity. Crucially, it also captures how these distributions vary as the underlying cosmological parameters (like $\Omega_{\rm m}$) change.
  • Two-Point Statistics: We validated the spatial clustering of the generated galaxies using the redshift-space power spectrum. The model not only matches the unweighted power spectrum but also correctly reproduces spectra weighted by stellar mass and magnitude. This provides a powerful test of its ability to learn the joint distribution of galaxy positions and their physical properties.
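
As a rough illustration of the two-point test, the Python sketch below measures a spherically averaged power spectrum of a weighted galaxy catalog in a periodic box, after moving the galaxies to redshift space along one axis. It is a simplified sketch under stated assumptions, not the measurement pipeline of the paper: the function names, the nearest-grid-point mass assignment, the grid size, and the choice of the z axis as the line of sight are all assumptions, and no shot-noise subtraction or window correction is applied.

```python
import numpy as np


def to_redshift_space(pos, v_los, box_size, a=1.0, hubble=100.0):
    """Shift positions along z (assumed line of sight) by v_los / (a * H).

    pos is in Mpc/h, v_los in km/s, hubble in km/s per (Mpc/h).
    """
    s = np.array(pos, dtype=float)  # copy so the input is not modified
    s[:, 2] = (s[:, 2] + v_los / (a * hubble)) % box_size
    return s


def weighted_power_spectrum(pos, weights, box_size, n_grid=32, n_kbins=8):
    """Spherically averaged P(k) of a weighted point set in a periodic box."""
    # Nearest-grid-point deposit of the weights onto a regular mesh.
    idx = np.floor(pos / box_size * n_grid).astype(int) % n_grid
    mesh = np.zeros((n_grid, n_grid, n_grid))
    np.add.at(mesh, (idx[:, 0], idx[:, 1], idx[:, 2]), weights)
    delta = mesh / mesh.mean() - 1.0

    # P(k) = V * |DFT(delta)|^2 / N^6 for numpy's unnormalised forward FFT.
    delta_k = np.fft.rfftn(delta)
    power = np.abs(delta_k) ** 2 * box_size ** 3 / n_grid ** 6

    # Wavenumber magnitude of every mode on the half-complex grid.
    k_fund = 2.0 * np.pi / box_size
    k_full = np.fft.fftfreq(n_grid, d=1.0 / n_grid) * k_fund
    k_half = np.fft.rfftfreq(n_grid, d=1.0 / n_grid) * k_fund
    k_mag = np.sqrt(
        k_full[:, None, None] ** 2
        + k_full[None, :, None] ** 2
        + k_half[None, None, :] ** 2
    )

    # Average the power in linear shells from the fundamental to Nyquist.
    edges = np.linspace(k_fund, k_fund * n_grid / 2, n_kbins + 1)
    counts, _ = np.histogram(k_mag, bins=edges)
    p_sum, _ = np.histogram(k_mag, bins=edges, weights=power)
    k_centers = 0.5 * (edges[:-1] + edges[1:])
    return k_centers, p_sum / np.maximum(counts, 1)


# Toy usage with random (unclustered) points standing in for galaxies.
rng = np.random.default_rng(0)
box = 25.0
pos = rng.uniform(0.0, box, size=(20000, 3))
v_los = rng.normal(0.0, 200.0, size=len(pos))
s_pos = to_redshift_space(pos, v_los, box)
k, pk = weighted_power_spectrum(s_pos, np.ones(len(pos)), box)
```

Passing stellar masses (or another property) as weights instead of ones gives a weighted spectrum of the kind described above; comparing it between generated and true catalogs probes whether properties are assigned to the right places, not just whether the positions cluster correctly.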

By successfully capturing the complex galaxy-halo connection, its stochasticity, and its dependence on cosmological parameters, Galactification provides a powerful and accelerated forward model for cosmological inference. Future work will focus on scaling this framework to larger simulation volumes and incorporating a wider range of observable galaxy properties.

Chris Lovell

Content generated by gemini-2.5-pro using this prompt.

Image generated by imagen-4.0-generate-001 using this prompt.