GPU Support (CUDA + KernelAbstractions)
Julia has outstanding support for GPUs as it compiles machine code for the particular devices. Importantly, all modern GPUs are supported, which implies that it is quite straightforward to write GPU kernels in Julia, for example using packages such as KernelAbstractions with no need to write new code when you use AMD instead of NVIDIA GPU's
PETSc has also added GPU support in recent years, and PETSc vector and matrix objects, along with many of the solvers, can be moved to the GPU.
GPU support in PETSc.jl requires a locally built PETSc with CUDA or HIP enabled — the precompiled PETSc_jll binaries do not include GPU support. See Installation for instructions on pointing PETSc.jl to a local library. The example below are given for CUDA. Doing this on AMD machines (HIP) will likely work the same, but will require a specific extension to be added.
Prerequisites
PETSc built with CUDA — configure with
--with-cuda=1(and the matching--with-cuda-dir). Confirm with:grep -i cuda $PETSC_DIR/$PETSC_ARCH/include/petscconf.h # should show: #define PETSC_HAVE_CUDA 1CUDA.jl in your Julia environment:
pkg> add CUDAKernelAbstractions.jl if you want to write portable GPU/CPU kernels:
pkg> add KernelAbstractions
How it works
When CUDA.jl is loaded alongside PETSc.jl, the PETScCUDAExt extension is activated automatically. It registers CUDA-aware implementations for the functions below via Julia's package extension mechanism — no extra configuration is needed.
PETSc manages where vector data lives (host or device). The extension inspects the PetscMemType returned by VecGetArray*AndMemType calls and either wraps the device pointer as a CuArray (zero-copy) or allocates a bounce buffer if the data needs to move between host and device.
Public API
withlocalarray!
withlocalarray! gives callback-based access to the underlying array of one or more Vecs. When CUDA.jl is loaded, it automatically returns a CuArray for device-resident Vecs and a plain Vector for host-resident Vecs:
using PETSc, CUDA, KernelAbstractions
# single Vec
withlocalarray!(my_vec; write=true) do arr
# arr is CuArray on GPU, Vector on CPU
fill!(arr, 42)
end
# two Vecs — backend selected from the array type at runtime
withlocalarray!(g_fx, l_x; read=(true, true), write=(true, false)) do fx, lx
kern = KernelAbstractions.get_backend(fx)
my_kernel!(kern, 256)(fx, lx; ndrange = length(fx))
KernelAbstractions.synchronize(kern)
endThe array type (CuArray or Vector) is determined automatically from the PetscMemType of each Vec, so the same code works on both CPU and GPU. On coarser multigrid levels where Vecs may be host-resident even in a CUDA run, the CPU path is taken transparently.
Writing portable kernels with KernelAbstractions
using CUDA is sufficient — loading it activates PETScCUDAExt automatically and no extra imports are needed. Write kernels with @kernel and select the backend at runtime via KernelAbstractions.get_backend:
using PETSc, CUDA, KernelAbstractions
@kernel function my_kernel!(out, inp)
i = @index(Global)
out[i] = inp[i] * 2
end
withlocalarray!(out_vec, inp_vec; read=(true, true), write=(true, false)) do out, inp
kern = KernelAbstractions.get_backend(out)
my_kernel!(kern, 256)(out, inp; ndrange = length(out))
KernelAbstractions.synchronize(kern)
endThe same kernel runs on CPU (when Vecs are host-resident) and GPU (when Vecs are device-resident) without any code changes.
Example
examples/ex19.jl is a full 2D driven-cavity example (velocity–vorticity–temperature) that demonstrates:
- Switching between CPU and GPU with a single
useCUDAflag. - FD coloring-based Jacobian assembly running entirely on-device.
withlocalarray!in the residual callback for transparent CPU/GPU dispatch.- Using KernelAbstractions to run kernels on various flavors of GPUs or CPUs.
- Multigrid preconditioning with coarser levels falling back to a CPU Jacobian.
To run it on a GPU:
- Set
const useCUDA = truenear the top ofex19.jl. - Ensure your local PETSc build has CUDA support and is linked via
PETSc.set_library!. - Launch:
julia --project ex19.jl -da_grid_x 256 -da_grid_y 256 \ -pc_type mg -pc_mg_levels 4 \ -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi \ -snes_monitor -ksp_monitor
[!NOTE] The
PetscInttype in yourLocalPreferences.tomlmust match the PETSc build. Check withgrep sizeof_PetscInt $PETSC_DIR/$PETSC_ARCH/include/petscconf.h.
Profiling
To profile GPU activity, wrap the solve call with CUDA.@profile:
using CUDA
CUDA.@profile PETSc.solve!(x, snes)This records a trace compatible with NSight Systems (nsys profile) and NVTX. For a quick check of kernel timing, CUDA.@profile external=true launches the process under nsys automatically.
Performance
We have checked the performance of examples/ex19.jl by running it on GPU and on 1 or 32 CPU's of a Grace-Hopper 200 machine, using the following options:
$mpiexec -n 1 julia --project ex19.jl -snes_monitor -snes_converged_reason -snes_monitor -pc_type mg -mg_levels_ksp_type chebyshev -mg_levels_pc_type jacobi -ksp_monitor -log_view -mg_levels_esteig_ksp_max_it 20 -mg_levels_esteig_ksp_max_it 20 -mg_levels_ksp_chebyshev_esteig 0,0.1,0,1.1 -mg_levels_ksp_max_it 3 -da_grid_x 513 -da_grid_y 513 -pc_mg_levels 6 We performed tests in which we doubled the resolution in x and y while increasing the number of multigrid levels such that the coarse grid level has the same size in all cases.
The results of the full solve include booting up Julia and computing the coloring pattern (which doesn't scale well).
Below we report results for the inner solve itself (KSPSolve) on a GPU (with 1 MPI rank, which is a requirement of PETSc), and compare that with the results on CPU on 1 core and on 32 cores
KSPSolve — time and throughput:
| Resolution | GPU (s) | GPU GFlop/s | 1 core (s) | 1 core GFlop/s |
|---|---|---|---|---|
| 513² | 0.116 | 144.3 | 4.337 | 4.1 |
| 1025² | 0.299 | 249.5 | 19.10 | 3.9 |
| 2049² | 1.118 | 295.5 | 89.76 | 3.6 |
| 4097² | 4.540 | 312.4 | 422.2 | 3.3 |
| Resolution | GPU (s) | GPU GFlop/s | 32 cores (s) | 32 cores GFlop/s |
|---|---|---|---|---|
| 513² | 0.116 | 144.3 | 0.297 | 61.0 |
| 1025² | 0.299 | 249.5 | 1.196 | 61.7 |
| 2049² | 1.118 | 295.5 | 5.757 | 56.6 |
| 4097² | 4.540 | 312.4 | 28.30 | 49.4 |
SNESSolve — time and throughput:
| Resolution | GPU (s) | GPU GFlop/s | 1 core (s) | 1 core GFlop/s |
|---|---|---|---|---|
| 513² | 3.645 | 6.7 | 8.041 | 3.2 |
| 1025² | 5.127 | 20.5 | 32.75 | 3.2 |
| 2049² | 11.37 | 39.7 | 144.1 | 3.1 |
| 4097² | 36.88 | 47.7 | 658.3 | 2.9 |
| Resolution | GPU (s) | GPU GFlop/s | 32 cores (s) | 32 cores GFlop/s |
|---|---|---|---|---|
| 513² | 3.645 | 6.7 | 2.118 | 12.3 |
| 1025² | 5.127 | 20.5 | 3.698 | 28.2 |
| 2049² | 11.37 | 39.7 | 10.57 | 42.3 |
| 4097² | 36.88 | 47.7 | 43.58 | 43.2 |
From this it is clear that the KSPSolve itself is very efficient on the GPU (and clearly beats the CPU), but that there is quite some overhead when we compare it with SNESSolve where this difference is not so large anymore.
***GPU Utilisation Evidence — ex19.jl on GH200 (SM90)***
Lets have a look in detail on whether the example actually runs on the GPU:
All GPU runs use a CUDA-enabled PETSc build (cuSPARSE, cuBLAS) with KernelAbstractions.jl providing the residual kernel. The PETSc GPU %F column (fraction of flops executed on GPU) and the host↔device transfer logs provide direct evidence of efficient GPU utilisation.
1. Flop fraction on GPU
Nearly all floating point work executes on the GPU across all resolutions and all major solver phases:
| Event | GPU %F |
|---|---|
| KSPSolve | 100% |
| PCApply | 100% |
| PCSetUp | 98–99% |
| SNESSolve | 99–100% |
At the kernel level, all key GMRES and multigrid operations run fully on the GPU: MatMult, MatResidual, VecMAXPY, VecMDot, VecAXPBYCZ, VecAYPX, VecPointwiseMult, VecNormalize all report 100% GPU %F.
2. Sparse matrix storage and transfers
System matrices are assembled on the host and uploaded once per Newton iteration via MatCUSPARSECopyTo, then all SpMV operations run in cuSPARSE format. Only 3 GpuToCpu copies occur per solve (one per Newton iteration), confirming matrices remain GPU-resident throughout.
| Resolution | Matrix upload (MB) |
|---|---|
| 513² | 495 |
| 1025² | 1,990 |
| 2049² | 7,950 |
| 4097² | 31,800 |
3. Vector transfers
Vectors stay GPU-resident. The number of host↔device copies is small relative to the hundreds of thousands of vector operations performed:
| Resolution | CpuToGpu count | CpuToGpu (MB) | GpuToCpu count | GpuToCpu (MB) |
|---|---|---|---|---|
| 513² | 37 | 109 | 24 | 42 |
| 1025² | 55 | 437 | 36 | 168 |
| 2049² | 64 | 1,750 | 42 | 672 |
| 4097² | 73 | 6,980 | 48 | 2,690 |
4. Residual evaluation (KernelAbstractions)
MatFDColorApply performs residual evaluations for the finite-difference Jacobian — reports 0% GPU %F in PETSc's profiler. This is expected: the residual kernel is launched by Julia's CUDA.jl runtime (via KernelAbstractions) and its flops are invisible to PETSc's event system. GPU execution is confirmed indirectly by the GpuToCpu transfer pattern in MatFDColorApply: PETSc hands off perturbed vectors, the KA kernel evaluates the residual on the GPU, and the result is returned.