Benchmarks

All benchmarks use the same Lindblad master equation for $M$ coupled cavities with Fock truncation $d=3$:

\[\dot{\rho} = -i[H, \rho] + \gamma \sum_m \left( a_m \rho a_m^\dagger - \frac{1}{2}\{a_m^\dagger a_m, \rho\} \right)\]

with $H = \sum_m \chi\, n_m(n_m - 1) + \sum_{n \ne m} \kappa\, a_n^\dagger a_m$, $\chi = 2\pi \times 0.2$, $\kappa = 2\pi \times 0.1$, $\gamma = 0.01$. Initial state: $|1,0,\ldots,0\rangle\langle 1,0,\ldots,0|$.
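As a concrete reference, the model above can be assembled in a few lines of plain NumPy (an illustrative sketch, not code from any of the benchmarked packages; the helper names are ours):

```python
# Sketch of the M-cavity Kerr/hopping Lindbladian defined above (d = 3),
# in plain NumPy. Not the benchmarked implementations.
import numpy as np

d, M = 3, 2                                   # Fock truncation, cavity count
chi, kappa, gamma = 2 * np.pi * 0.2, 2 * np.pi * 0.1, 0.01

a_local = np.diag(np.sqrt(np.arange(1, d)), k=1)   # single-mode annihilator

def embed(op, m):
    """Place a single-mode operator on cavity m via Kronecker products."""
    ops = [np.eye(d)] * M
    ops[m] = op
    out = ops[0]
    for o in ops[1:]:
        out = np.kron(out, o)
    return out

a = [embed(a_local, m) for m in range(M)]
n = [am.conj().T @ am for am in a]

# H = sum_m chi n_m (n_m - 1) + sum_{n != m} kappa a_n^dag a_m
H = sum(chi * (nm @ nm - nm) for nm in n)
H = H + sum(kappa * a[i].conj().T @ a[j]
            for i in range(M) for j in range(M) if i != j)

def lindblad_rhs(rho):
    """dρ/dt = -i[H, ρ] + γ Σ_m (a_m ρ a_m† - ½{a_m† a_m, ρ})."""
    out = -1j * (H @ rho - rho @ H)
    for am, nm in zip(a, n):
        out += gamma * (am @ rho @ am.conj().T - 0.5 * (nm @ rho + rho @ nm))
    return out
```

The right-hand side is trace-preserving and Hermiticity-preserving, which makes for a quick self-check before timing anything.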

When to Use cuDensityMat

cuDensityMat uses tensor network contraction — it never materializes the full Liouvillian superoperator. For small systems ($M \le 6$), the per-action overhead makes it slower than sparse matrix approaches. For large systems ($M \ge 8$), it is the only viable option because the sparse Liouvillian exceeds GPU memory.

| Regime | Best approach |
|---|---|
| $M \le 4$ (D ≤ 81) | CPU sparse (QuantumToolbox.jl) |
| $M = 5\text{--}6$ (D ≤ 729) | GPU cuSPARSE SpMV |
| $M \ge 8$ (D ≥ 6,561) | cuDensityMat (only option) |
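The decision table condenses to a tiny dispatch rule (hypothetical helper: `pick_backend` and the returned labels are illustrative, not APIs of the packages mentioned; the uncovered $M = 7$ case is routed to cuDensityMat here):

```python
# Hypothetical backend chooser encoding the table above; labels are
# illustrative, not real package APIs.
def pick_backend(M: int) -> str:
    if M <= 4:
        return "cpu_sparse"     # QuantumToolbox.jl on CPU
    if M <= 6:
        return "gpu_cusparse"   # cuSPARSE SpMV on GPU
    return "cudensitymat"       # tensor-network contraction, no full Liouvillian
```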

A100: CuQuantum.jl vs QuantumToolbox.jl

Same hardware (NVIDIA A100-SXM4-40GB, 12 vCPUs), same RK4 integrator, same time steps.
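For reference, the shared fixed-step RK4 update amounts to the standard scheme below (a generic sketch of classical RK4 applied to $\dot{\rho} = L[\rho]$, not the packages' internal code):

```python
# Classical 4th-order Runge-Kutta for dρ/dt = L[ρ]; L is any callable
# returning the right-hand side for a density matrix rho.
import numpy as np

def rk4_step(L, rho, dt):
    """One fixed-size RK4 step of the master equation."""
    k1 = L(rho)
    k2 = L(rho + 0.5 * dt * k1)
    k3 = L(rho + 0.5 * dt * k2)
    k4 = L(rho + dt * k3)
    return rho + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

def evolve(L, rho0, dt, steps):
    """Apply `steps` RK4 steps, as in the 100-step benchmark runs."""
    rho = rho0
    for _ in range(steps):
        rho = rk4_step(L, rho, dt)
    return rho
```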

Full 100-step RK4 Simulation

| M | D | CuQuantum.jl | QT.jl GPU cuSPARSE | QT.jl CPU |
|---|---|---|---|---|
| 4 | 81 | 0.46 s | 0.018 s | 0.021 s |
| 6 | 729 | 2.62 s | 0.32 s | 5.89 s |
| 8 | 6,561 | 250 s | infeasible | infeasible |

At $M=6$, cuSPARSE is 8× faster than cuDensityMat. At $M=8$, the Liouvillian is $43\text{M} \times 43\text{M}$ sparse (~77 GB) — only cuDensityMat can run.
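The memory figure is easy to sanity-check with back-of-the-envelope arithmetic (the average nonzeros-per-row value below is an assumption chosen to reproduce the quoted ~77 GB; the true fill depends on the Hamiltonian's sparsity pattern):

```python
# Rough sparse-Liouvillian size for M = 8, d = 3.
d, M = 3, 8
D = d**M                        # Hilbert-space dimension: 6,561
L_dim = D**2                    # Liouvillian acts on vectorized ρ: ~43M
bytes_per_nnz = 16 + 8          # complex128 value + 64-bit column index (CSR)
nnz_per_row = 75                # assumed average fill
mem_gb = L_dim * nnz_per_row * bytes_per_nnz / 1e9

print(D, L_dim, round(mem_gb))  # → 6561 43046721 77
```

Even generous variations in the assumed fill leave the operator far beyond the A100's 40 GB, which is why only the matrix-free cuDensityMat path can run this size.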

Single $L[\rho]$ Action

| M | D | CuQuantum.jl | QT.jl GPU cuSPARSE |
|---|---|---|---|
| 2 | 9 | 0.267 ms | 0.039 ms |
| 4 | 81 | 1.22 ms | 0.048 ms |
| 6 | 729 | 6.45 ms | 0.90 ms |
| 8 | 6,561 | 620 ms | infeasible |

Batched Evolution (A100)

Many density matrices are evolved with different coupling strengths in a single kernel launch: the parameter-sweep use case for quantum optimal control.

| M | D | Batch | cuDM batched | cuDM sequential | Speedup |
|---|---|---|---|---|---|
| 2 | 9 | 64 | 0.45 ms | 20.6 ms | 46× |
| 2 | 9 | 256 | 0.38 ms | 70.1 ms | 186× |
| 4 | 81 | 64 | 2.79 ms | 70.1 ms | 25× |
| 4 | 81 | 256 | 8.05 ms | 280.7 ms | 35× |
| 6 | 729 | 64 | 297 ms | 497 ms | 1.7× |
| 6 | 729 | 256 | 1,146 ms | 1,990 ms | 1.7× |

At small system sizes ($M=2\text{--}4$), native batching amortizes kernel-launch overhead, yielding up to a 186× speedup. At $M=6$, the per-action compute cost dominates and batching gives only ~1.7×.
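The batching idea itself can be sketched in plain NumPy (a toy illustration of broadcasting one right-hand side over a stack of states, not cuDensityMat's batched API): each density matrix $\rho_b$ evolves under its own $H_b = H_0 + \kappa_b V$, and the whole batch is updated with one set of broadcasted matrix products instead of a Python loop.

```python
# Toy batched Lindblad RHS: one vectorized pass over a (B, D, D) stack of
# density matrices, each with its own coupling strength kappa_b.
import numpy as np

def batched_rhs(H0, V, kappas, rhos, gamma, a_ops):
    """dρ_b/dt for the whole batch, with H_b = H0 + κ_b V.

    rhos: (B, D, D) stack of density matrices; kappas: (B,) couplings.
    """
    # Per-batch Hamiltonians via broadcasting: (B, D, D)
    H = H0[None, :, :] + kappas[:, None, None] * V[None, :, :]
    out = -1j * (H @ rhos - rhos @ H)           # batched commutator
    for a in a_ops:                             # shared dissipators
        n = a.conj().T @ a
        out += gamma * (a @ rhos @ a.conj().T - 0.5 * (n @ rhos + rhos @ n))
    return out
```

The (D, D) operators broadcast against the (B, D, D) stack inside `@`, which is the CPU analogue of the single-kernel-launch amortization measured above.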

Cross-Framework Comparison (Tesla T4)

Five frameworks, same problem. cuDensityMat benchmarks use RK4; QuantumToolbox.jl and QuTiP use adaptive solvers.

Full 100-step Simulation

| M | D | CuQuantum.jl | Python CuPy | JAX ext | QT.jl CPU | QuTiP |
|---|---|---|---|---|---|---|
| 2 | 9 | 0.22 s | 1.36 s | 0.38 s | 0.0003 s | 0.005 s |
| 4 | 81 | 1.48 s | 1.01 s | 1.08 s | 0.021 s | 0.043 s |
| 6 | 729 | 60.0 s | 64.0 s | 64.4 s | 6.67 s | 12.3 s |

On T4, CPU sparse wins at all sizes. The T4 delivers only ~0.25 TFLOPS of FP64 throughput, roughly 40× less than the A100's 9.7 TFLOPS, and cuDensityMat's tensor network contraction needs A100-class throughput to be competitive.

Single $L[\rho]$ Action (T4)

| M | D | CuQuantum.jl | Python CuPy | JAX ext | QT.jl GPU cuSPARSE | QuTiP |
|---|---|---|---|---|---|---|
| 2 | 9 | 0.48 ms | 0.74 ms | 0.88 ms | 0.046 ms | 0.005 ms |
| 4 | 81 | 4.93 ms | 5.79 ms | 6.00 ms | 0.072 ms | 14.0 ms |
| 6 | 729 | 149 ms | 159 ms | 161 ms | 4.06 ms | OOM |

Wrapper Overhead

All three cuDensityMat wrappers call the same C library. The overhead at $M=6$ on T4:

| Wrapper | Single $L[\rho]$ | vs Julia |
|---|---|---|
| CuQuantum.jl | 149 ms | (baseline) |
| Python CuPy | 159 ms | +7% |
| JAX extension | 161 ms | +8% |

CuQuantum.jl is fastest because it reaches the C library through a direct `ccall`, with no Python interpreter or GIL overhead. The gap narrows at larger sizes, where the cuDensityMat kernel itself dominates the runtime.

Validation

CPU and GPU produce bitwise-identical results for the static Liouvillian at $t=0$. All frameworks agree on final populations to within solver tolerances across 100 time steps.

Environment

| | A100 benchmarks | T4 benchmarks |
|---|---|---|
| GPU | A100-SXM4-40GB | Tesla T4 (16 GB) |
| CPU | 12 vCPUs, 85 GB | 8 vCPUs, 30 GB |
| Julia | 1.12.5 | 1.12.5 |
| CUDA | 12.8 | 12.8 |
| Python | | 3.10 / 3.11 (JAX) |
| QuTiP | | 5.2.3 |
| JAX | | 0.6.2 |

Reproduction

Benchmark scripts are in `benchmark/comparison/` and `benchmark/batched/`. See the comparison README for setup and run instructions.