Performance¶
The forward model is pure JAX, so every backend (MAP, NUTS, geoVI, …) runs against the same compiled computation graph. That makes “how fast is tengri?” a small set of numbers that travel together.
This page summarises what the existing benchmark suite measures, the
headline numbers from the last full run, and how to reproduce them on
your hardware. The full benchmark suite ships under
`bench/scripts/benchmark_*.py`
and is consolidated behind one entry point — see
Health check & dispatcher below.
Warning
The headline numbers below were measured in April–May 2026. Several have not been re-run after recent forward-model changes and may be stale. Treat them as ballpark, not authoritative; re-run the relevant script (see Reproducing the headline numbers) before quoting in a paper or PR. The bench/reports/ directory carries the date of every measurement event.
Headline numbers (Apple M-series CPU, x64, JAX 0.9, last run May 2026)¶
Forward photometric prediction on SDSS ugriz at z = 0.1, 5 bands, running on a single CPU core:
| Configuration | Exact | Compositional | Hybrid (precomputed) |
|---|---|---|---|
| Stellar only | 23.9 ms | 1.5 ms | 59 µs (408×) |
| + nebular (BakedIn) | 24.7 ms | 1.5 ms | 58 µs (424×) |
| + nebular (Cue emulator) | 61.6 ms | 2.4 ms | 567 µs (109×) |
| + dust IR (THEMIS) | 27.3 ms | 2.0 ms | 158 µs (173×) |
| + radio + X-ray + AGN | 76.4 ms | 4.6 ms | 2.44 ms (31×) |
| Kitchen sink (all emitters) | 76.1 ms | 4.6 ms | 2.45 ms (31×) |
— full table at `bench/reports/2026-05-06_forward_model_speedup.md`
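A caveat on methodology: JAX dispatches asynchronously, so a naive wall clock around a jitted call measures dispatch, not compute. Below is a minimal sketch of a safe timing loop in the spirit of these benchmarks (the `bench` helper is hypothetical, not tengri API; `block_until_ready()` is the standard JAX synchronization call), shown here on a plain stand-in function:

```python
import time

def bench(fn, *args, n=100):
    """Median wall time of fn(*args) over n calls, forcing JAX to finish.

    `block_until_ready()` flushes JAX's async dispatch queue; for plain
    Python return values the getattr fallback makes it a no-op.
    """
    fn(*args)  # warm-up call absorbs any one-time JIT compilation
    times = []
    for _ in range(n):
        t0 = time.perf_counter()
        out = fn(*args)
        getattr(out, "block_until_ready", lambda: None)()
        times.append(time.perf_counter() - t0)
    return sorted(times)[n // 2]

# Stand-in for a jitted forward model:
t = bench(lambda x: x * 2.0, 3.0, n=50)
```

Reporting the median rather than the mean keeps a single GC pause or OS hiccup from skewing the number.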
Inference backends on a 7-parameter mock fit (compile + sample wall):
| Backend | First call | Steady-state |
|---|---|---|
| MAP (L-BFGS) | ~5 s | < 1 s |
| Laplace | ~5 s | < 1 s |
| Pathfinder | ~10 s | ~2 s |
| NUTS (1k samples) | ~30 s | ~5 s |
| MGVI | ~10 s | 2.3 s |
| geoVI | ~75 s | 43.7 s |
— full breakdowns: `2026-04-17_native_vs_nifty.md`, `2026-04-22_pathfinder_vs_window_nuts.md`, `2026-05-06_compile_vs_sampling_breakdown.md`
`vi_native` is 19–25× faster than the NIFTy path on smooth-SFH fits
but is not drop-in posterior-equivalent: PSD-timescale parameters
differ by an order of magnitude on stochastic fits. Validate per problem
before swapping.
Persistent compile cache¶
JAX recompiles XLA programs on every cold start. Tengri auto-enables a
persistent on-disk cache at ~/.cache/tengri_jax_cache so notebook
restarts, slurm tasks, and benchmark runs all skip the expensive first
compile (geoVI ~75 s, MGVI ~10 s, NUTS warmup tens of seconds).
```shell
export TENGRI_JAX_CACHE_DIR=/scratch/$USER/jax_cache  # custom location
export TENGRI_DISABLE_JAX_CACHE=1                      # opt out
```
After upgrading JAX, wipe stale entries:
```python
import tengri
tengri.clear_cache()
```
The default `min_compile_time_secs=5.0` keeps small SSP/dust kernels out of
the cache. See `compilation_cache.md`
and `compilation_diagnostics.md`
for full details.
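These settings map onto JAX's own persistent-cache knobs. A sketch of setting them directly (the path is illustrative; whether tengri's wrapper uses exactly these options is an assumption), in case you need behaviour the wrapper doesn't expose:

```python
import jax

# Same effect as TENGRI_JAX_CACHE_DIR, but set per-process:
jax.config.update("jax_compilation_cache_dir", "/scratch/me/jax_cache")
# Only cache programs whose compile takes >= 5 s (tengri's default cutoff):
jax.config.update("jax_persistent_cache_min_compile_time_secs", 5.0)
```

Both options must be set before the first jitted call compiles, or that program skips the cache.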
Health check and dispatcher¶
A one-command quick read of your install:

```shell
python -m tengri.bench
```

This prints the JAX backend, default device, persistent compile-cache size, and a 1-galaxy + 100-galaxy timing on SDSS ugriz. It takes ~30 s on CPU after the cache is warm.
Every comprehensive benchmark script under scripts/ is also reachable
through one entry point:
```shell
python -m tengri.bench list                # show all
python -m tengri.bench help forward_model  # what does it measure?
python -m tengri.bench forward_model       # run it
```
Available benchmarks (`bench list` prints the exact script names), by what each one measures:

- Forward photometry: exact / compositional / hybrid across all emitters
- Per-component (stellar, dust, nebular, AGN, …) wall-clock timing
- Population-scale JIT compile time vs N galaxies
- Compile time on the production forward-model path
- MAP / Laplace / NUTS / VI / NSS at D = 7, 12, 20
- geoVI: pure-JAX
- VI scaling on stochastic-SFH problems with D >> 100
- Hierarchical PopulationFitter: per-iteration cost vs N galaxies
- MAP optimizers head-to-head
- Cue (Li+2025) nebular emulator timing in isolation
- Per-call loss / negative-log-posterior timing
- End-to-end timing for joint photometry + spectral indices
- Analytic precompute lookup vs full-spectrum integration
- Quadrature precompute: accuracy vs grid resolution
- Metallicity-table interpolation kernel timing
Reproducing the headline numbers¶
```shell
JAX_PLATFORMS=cpu python -m tengri.bench forward_model
JAX_PLATFORMS=cpu python -m tengri.bench inference_engines
```
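For scripted re-runs (CI jobs, parameter sweeps), the same invocation can be built programmatically. A small sketch (the `bench_cmd` helper is hypothetical; the module path is the real entry point):

```python
import os
import sys

def bench_cmd(script, platform="cpu"):
    """Command + environment for one pinned-platform benchmark run."""
    env = dict(os.environ, JAX_PLATFORMS=platform)
    return [sys.executable, "-m", "tengri.bench", script], env

cmd, env = bench_cmd("forward_model")
# Launch with e.g. subprocess.run(cmd, env=env, check=True)
```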
Each script writes its dated report to bench/reports/ (or to
stdout, depending on the script). The reports there are the source of
truth for every number quoted on this page.
`bench/RERUN.md` tracks which scripts are due for a re-run.
Hardware notes¶
All numbers above are single CPU core on Apple M-series hardware. Tengri runs on JAX, so the same code executes on GPU/TPU without modification — but those platforms have not been benchmarked. See Getting Started → GPU for setup.
JAX Metal (Apple GPU) is experimental and causes test failures; CPU is the supported reference platform for benchmarks. Set `JAX_PLATFORMS=cpu` to be explicit.

Memory: smooth D = 7 fits run in ~100 MB; stochastic D = 137 fits in ~1.5 GB. NUTS warmup with `dense_mass=True` peaks at 3–6× steady state on small models and can hit 20+ GB at D ≥ 8 with `dense_basis` SFHs; multi-fit notebooks need `dense_mass=False`. See Memory expectations for the full table and the two recurring OOM patterns.
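The dense-mass advice can be encoded as a guard in fitting scripts. A hypothetical helper (not tengri API) expressing the rule of thumb from the memory notes above:

```python
def use_dense_mass(n_params, n_fits=1):
    """Dense mass matrices only pay off on small single fits.

    Per the memory notes: warmup with dense_mass=True can hit 20+ GB
    at D >= 8, and multi-fit sessions should always use diagonal mass.
    """
    return n_params < 8 and n_fits == 1

use_dense_mass(7)             # small smooth-SFH fit: dense mass is fine
use_dense_mass(12)            # dense mass risks OOM
use_dense_mass(7, n_fits=10)  # multi-fit session: use diagonal mass
```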
When numbers look wrong¶
If `python -m tengri.bench` shows a much slower 1-galaxy timing than
the table above:

1. Confirm `x64: True` (some downstream behaviour assumes 64-bit).
2. Confirm `default device: cpu` — Metal sometimes silently picks itself up and slows things down. Force CPU with `JAX_PLATFORMS=cpu`.
3. Check the cache size — if it's in the GB range with hundreds of files, `tengri.clear_cache()` after a JAX upgrade is sometimes the fix.
4. The default `bench` SSP grid is whatever first matches `data/ssp_*.h5`; a multi-Z, full-α/Fe grid is meaningfully slower than the `prsc_miles` grid used in `notebooks/00_quickstart`. The relative numbers (vmap speedup, exact-vs-hybrid ratio) are what matter.
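This checklist can be automated in a session. A pure-Python sketch (the helper name is hypothetical; in practice feed it `jax.devices()[0].platform`, `jax.config.jax_enable_x64`, and the cache-directory file count):

```python
def bench_env_warnings(platform, x64_enabled, n_cache_files):
    """Return human-readable warnings mirroring the checklist above."""
    warnings = []
    if not x64_enabled:
        warnings.append("x64 disabled: enable 64-bit before benchmarking")
    if platform != "cpu":
        warnings.append(
            f"default device is {platform!r}: force CPU with JAX_PLATFORMS=cpu")
    if n_cache_files > 500:
        warnings.append("compile cache is huge: consider tengri.clear_cache()")
    return warnings

bench_env_warnings("metal", x64_enabled=True, n_cache_files=12)
# -> ["default device is 'metal': force CPU with JAX_PLATFORMS=cpu"]
```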