CI and Benchmarks
CuQuantum.jl uses GitHub Actions for continuous integration. Tests run on both CPU-only runners (GitHub-hosted) and GPU runners (on-demand EC2 instances via machulav/ec2-github-runner).
Workflows
CI (CI.yml)
Triggers on every push to main and on pull requests.
CPU matrix – runs on ubuntu-latest across Julia 1.10, 1.11, and 1.12. Tests that require a GPU are automatically skipped via the @gpu_test macro (see test/setup.jl). This validates module loading, the test infrastructure itself, and any non-GPU logic.
GPU tests – a three-job pattern that spins up a g4dn.xlarge (NVIDIA T4) EC2 instance on demand:
- Start GPU Runner – authenticates with AWS via OIDC, launches the instance using a multi-AZ configuration for capacity resilience, and installs Julia via juliaup.
- GPU Tests – runs the full test suite (test/runtests.jl) on the EC2 instance. All @gpu_test blocks execute here since CUDA.functional() returns true.
- Stop GPU Runner – terminates the EC2 instance. This job runs with if: always() so the instance is cleaned up even when tests fail.
For pull requests from forks, GPU tests are skipped (security constraint – self-hosted runners should not execute untrusted code).
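The cleanup guarantee of the start/work/stop pattern behaves much like a try/finally block. The following is a minimal Python sketch of that idea only, not the workflow itself: the function names and the instance ID are hypothetical stand-ins for the three jobs.

```python
from typing import Callable, List


def run_with_cleanup(start: Callable[[], str],
                     work: Callable[[str], None],
                     stop: Callable[[str], None]) -> bool:
    """Run start -> work -> stop; stop runs even if work raises.

    Returns True when work succeeded, False when it raised.
    """
    instance_id = start()
    try:
        work(instance_id)
        return True
    except Exception:
        return False
    finally:
        # Plays the role of `if: always()` on the stop job.
        stop(instance_id)


events: List[str] = []


def fake_start() -> str:
    events.append("start")
    return "i-hypothetical"  # placeholder, not a real instance ID


def failing_tests(instance_id: str) -> None:
    events.append(f"test on {instance_id}")
    raise RuntimeError("simulated test failure")


def fake_stop(instance_id: str) -> None:
    events.append(f"stop {instance_id}")


ok = run_with_cleanup(fake_start, failing_tests, fake_stop)
print(ok, events)
```

Even though the simulated tests fail, the stop step is still recorded. That is the property the Stop GPU Runner job relies on.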
GPU Benchmark (gpu-benchmark.yml)
Triggered manually via workflow_dispatch from the Actions tab.
Inputs:
| Input | Description |
|---|---|
| instance-type | EC2 instance type: g4dn.xlarge (T4), g5.xlarge (A10G), p4d.24xlarge (8x A100), p5.48xlarge (8x H100) |
| benchmark-script | Julia script to run (default: benchmark/run_benchmarks.jl) |
Uses the same three-job start/work/stop pattern as CI. Results are uploaded as a workflow artifact named benchmark-results-<instance-type>-<run-number>.
The default benchmark (benchmark/run_benchmarks.jl) compares GPU (cuDensityMat) and CPU (dense and sparse) performance for the Liouvillian action $L[\rho]$ on a dual-rail cavity system with an increasing number of cavities. Larger systems (M=8+) can only be evaluated on the GPU, because the CPU methods exceed memory limits.
Running locally
# Run the full test suite (GPU tests skip automatically without a GPU)
julia --project -e 'using Pkg; Pkg.test()'
# Run benchmarks (requires a CUDA-capable GPU)
julia --project=benchmark benchmark/run_benchmarks.jl
Infrastructure
The EC2 runner infrastructure is managed via Terraform in the aws-infra repo (terraform/modules/gpu_runner/). It provisions:
- An OIDC role for GitHub Actions (no stored AWS credentials)
- A security group (egress-only)
- An EC2 instance profile with SSM access for debugging
- The Deep Learning Base AMI (Ubuntu 22.04, pre-installed NVIDIA drivers)
Required secrets
| Secret | Purpose |
|---|---|
| GH_RUNNER_PAT | Classic PAT with repo scope (required by ec2-github-runner) |
| AWS_GPU_RUNNER_ROLE_ARN | IAM role ARN for OIDC authentication |
| AWS_GPU_RUNNER_AZ_CONFIG | JSON array of availability zone configs (imageId, subnetId, securityGroupId per AZ) |
| AWS_GPU_RUNNER_INSTANCE_PROFILE | EC2 instance profile name attached to launched instances |
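As an illustration of the shape AWS_GPU_RUNNER_AZ_CONFIG is described as having, the sketch below parses a JSON array and checks that every entry carries the three per-AZ keys named in the table. The helper name parse_az_config and all IDs are hypothetical placeholders. The real secret may contain additional fields not covered here.

```python
import json

# Keys listed in the secrets table above.
REQUIRED_KEYS = {"imageId", "subnetId", "securityGroupId"}


def parse_az_config(raw: str) -> list:
    """Parse the JSON array and verify each entry has the per-AZ keys."""
    entries = json.loads(raw)
    if not isinstance(entries, list) or not entries:
        raise ValueError("expected a non-empty JSON array")
    for i, entry in enumerate(entries):
        if not isinstance(entry, dict):
            raise ValueError(f"entry {i} is not a JSON object")
        missing = REQUIRED_KEYS - set(entry)
        if missing:
            raise ValueError(f"entry {i} is missing: {sorted(missing)}")
    return entries


# Placeholder values only; these are not real AWS resource IDs.
example = json.dumps([
    {"imageId": "ami-PLACEHOLDER-A",
     "subnetId": "subnet-PLACEHOLDER-A",
     "securityGroupId": "sg-PLACEHOLDER"},
    {"imageId": "ami-PLACEHOLDER-B",
     "subnetId": "subnet-PLACEHOLDER-B",
     "securityGroupId": "sg-PLACEHOLDER"},
])

config = parse_az_config(example)
print(f"{len(config)} availability zone entries look well-formed")
```

A check like this can be run locally before updating the secret, since a malformed value would otherwise only surface when the start job tries to launch an instance.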
Known constraints
- vCPU quota: The AWS account has a limited on-demand G-family vCPU quota. A g4dn.xlarge uses 4 vCPUs, so only one GPU runner can be active at a time. Concurrent CI and benchmark runs will fail with capacity errors.
- No spot fallback: Instances launch as on-demand. If an AZ lacks capacity, the action tries the next AZ in the config before failing.
- Instance cleanup: If a start job succeeds but the stop job fails to run (e.g., workflow cancelled), check the AWS console for orphaned instances tagged gpu-runner=true.
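The vCPU quota constraint comes down to simple arithmetic, sketched below. The quota value of 4 is an assumption chosen to match the "one runner at a time" statement, since the actual account quota is not given here. The g5.xlarge vCPU count comes from the published AWS instance specifications, not from this document.

```python
# vCPUs per instance type. g4dn.xlarge is stated above; g5.xlarge is taken
# from AWS's published instance specifications.
VCPUS_PER_INSTANCE = {"g4dn.xlarge": 4, "g5.xlarge": 4}

# Assumption: the real G-family quota is not stated in this document.
ASSUMED_QUOTA_VCPUS = 4


def max_concurrent_runners(quota_vcpus: int, instance_type: str) -> int:
    """How many instances of one type fit within the quota."""
    return quota_vcpus // VCPUS_PER_INSTANCE[instance_type]


def can_launch(quota_vcpus: int, running: list, requested: str) -> bool:
    """Whether one more instance fits alongside those already running."""
    used = sum(VCPUS_PER_INSTANCE[t] for t in running)
    return used + VCPUS_PER_INSTANCE[requested] <= quota_vcpus


print(max_concurrent_runners(ASSUMED_QUOTA_VCPUS, "g4dn.xlarge"))
print(can_launch(ASSUMED_QUOTA_VCPUS, [], "g4dn.xlarge"))
print(can_launch(ASSUMED_QUOTA_VCPUS, ["g4dn.xlarge"], "g4dn.xlarge"))
```

Under the assumed quota, a second launch is rejected while the first instance is still running, which is why a CI run and a benchmark run cannot overlap.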