Benchmarks

All benchmarks use the same Lindblad master equation for $M$ coupled cavities with Fock truncation $d=3$:

\[\dot{\rho} = -i[H, \rho] + \gamma \sum_m \left( a_m \rho a_m^\dagger - \frac{1}{2}\{a_m^\dagger a_m, \rho\} \right)\]

with $H = \sum_m \chi\, n_m(n_m - 1) + \sum_{n \ne m} \kappa\, a_n^\dagger a_m$, $\chi = 2\pi \times 0.2$, $\kappa = 2\pi \times 0.1$, $\gamma = 0.01$. Initial state: $|1,0,\ldots,0\rangle\langle 1,0,\ldots,0|$.

When to Use cuDensityMat

cuDensityMat uses tensor network contraction — it never materializes the full Liouvillian superoperator. For small systems ($M \le 6$), the per-action overhead makes it slower than sparse matrix approaches. For large systems ($M \ge 8$), it is the only viable option because the sparse Liouvillian exceeds GPU memory.

Regime	Best approach
$M \le 4$ (D ≤ 81)	CPU sparse (QuantumToolbox.jl)
$M = 5\text{--}6$ (D ≤ 729)	GPU cuSPARSE SpMV
$M \ge 8$ (D ≥ 6,561)	cuDensityMat (only option)

A100: CuQuantum.jl vs QuantumToolbox.jl

Same hardware (NVIDIA A100-SXM4-40GB, 12 vCPUs), same RK4 integrator, same time steps.

Full 100-step RK4 Simulation

M	D	CuQuantum.jl	QT.jl GPU cuSPARSE	QT.jl CPU
4	81	0.46 s	0.018 s	0.021 s
6	729	2.62 s	0.32 s	5.89 s
8	6,561	250 s	infeasible	infeasible

At $M=6$, cuSPARSE is 8× faster than cuDensityMat. At $M=8$, the Liouvillian is $43\text{M} \times 43\text{M}$ sparse (~77 GB) — only cuDensityMat can run.

Single $L[\rho]$ Action

M	D	CuQuantum.jl	QT.jl GPU cuSPARSE
2	9	0.267 ms	0.039 ms
4	81	1.22 ms	0.048 ms
6	729	6.45 ms	0.90 ms
8	6,561	620 ms	infeasible

Batched Evolution (A100)

Many density matrices evolved with different coupling strengths in one kernel launch — the parameter sweep use case for quantum optimal control.

M	D	Batch	cuDM batched	cuDM sequential	Speedup
2	9	64	0.45 ms	20.6 ms	46×
2	9	256	0.38 ms	70.1 ms	186×
4	81	64	2.79 ms	70.1 ms	25×
4	81	256	8.05 ms	280.7 ms	35×
6	729	64	297 ms	497 ms	1.7×
6	729	256	1,146 ms	1,990 ms	1.7×

At small system sizes ($M=2\text{--}4$), native batching amortizes kernel launch overhead by up to 186×. At $M=6$, the per-action cost dominates and batching gives ~1.7× speedup.

Cross-Framework Comparison (Tesla T4)

Five frameworks, same problem. cuDensityMat benchmarks use RK4; QuantumToolbox.jl and QuTiP use adaptive solvers.

Full 100-step Simulation

M	D	CuQuantum.jl	Python CuPy	JAX ext	QT.jl CPU	QuTiP
2	9	0.22 s	1.36 s	0.38 s	0.0003 s	0.005 s
4	81	1.48 s	1.01 s	1.08 s	0.021 s	0.043 s
6	729	60.0 s	64.0 s	64.4 s	6.67 s	12.3 s

On T4, CPU sparse wins at all sizes. The T4 has only 0.25 TFLOPS FP64 — cuDensityMat's tensor network contraction needs A100-class throughput to be competitive.

Single $L[\rho]$ Action (T4)

M	D	CuQuantum.jl	Python CuPy	JAX ext	QT.jl GPU cuSPARSE	QuTiP
2	9	0.48 ms	0.74 ms	0.88 ms	0.046 ms	0.005 ms
4	81	4.93 ms	5.79 ms	6.00 ms	0.072 ms	14.0 ms
6	729	149 ms	159 ms	161 ms	4.06 ms	OOM

Wrapper Overhead

All three cuDensityMat wrappers call the same C library. The overhead at $M=6$ on T4:

Wrapper	Single $L[\rho]$	vs Julia
CuQuantum.jl	149 ms	—
Python CuPy	159 ms	+7%
JAX extension	161 ms	+8%

Julia is faster due to direct ccall without Python GIL overhead. The gap narrows at large sizes where the cuDensityMat kernel dominates.

Validation

CPU and GPU produce bitwise-identical results for the static Liouvillian at $t=0$. All frameworks agree on final populations to within solver tolerances across 100 time steps.

Environment

	A100 benchmarks	T4 benchmarks
GPU	A100-SXM4-40GB	Tesla T4 (16 GB)
CPU	12 vCPUs, 85 GB	8 vCPUs, 30 GB
Julia	1.12.5	1.12.5
CUDA	12.8	12.8
Python	—	3.10 / 3.11 (JAX)
QuTiP	—	5.2.3
JAX	—	0.6.2

Reproduction

Benchmark scripts are in benchmark/comparison/ and benchmark/batched/. See the comparison README for setup and run instructions.