GPU Support

Dagger supports GPU acceleration for CUDA, ROCm (AMD), Intel oneAPI, Metal (Apple), and OpenCL devices. GPU support enables automatic data movement between CPU and GPU memory, distributed GPU computing across multiple devices, and seamless integration with Julia's GPU ecosystem.

Dagger's GPU support is built on top of the KernelAbstractions.jl package, as well as the GPU-specific package for each backend (e.g. CUDA.jl, AMDGPU.jl, oneAPI.jl, Metal.jl, and OpenCL.jl). It is designed to be fully interoperable with the Julia GPU ecosystem, allowing you to use Dagger to distribute your GPU computations across multiple devices.

There are a few ways to use Dagger's GPU support:

  1. KernelAbstractions: Use the KernelAbstractions.jl interface to write GPU kernels, and then use Dagger.Kernel and Dagger.@spawn to execute them.
  2. DArray: Use the DArray interface to create distributed GPU arrays, and then call regular array operations on them, which will be automatically executed on the GPU.
  3. Datadeps: Use the Datadeps interface (Dagger.spawn_datadeps) to create GPU-compatible algorithms, within which you can call kernels or array operations.
  4. Manual: Use Dagger.gpu_kernel_backend() to get the appropriate backend for the current processor, and use that to execute kernels.

In all cases, you need to ensure that the right GPU-specific package is loaded.

Package Loading

Dagger's GPU support requires loading one of the following backend packages:

  • CUDA (NVIDIA GPUs): using CUDA
  • ROCm (AMD GPUs): using AMDGPU
  • oneAPI (Intel GPUs): using oneAPI
  • Metal (Apple GPUs): using Metal
  • OpenCL: using OpenCL

Backend Detection

You can check if a given kind of GPU is supported by calling:

  • CUDA: Dagger.gpu_can_compute(:CUDA)
  • AMDGPU: Dagger.gpu_can_compute(:ROC)
  • oneAPI: Dagger.gpu_can_compute(:oneAPI)
  • Metal: Dagger.gpu_can_compute(:Metal)
  • OpenCL: Dagger.gpu_can_compute(:OpenCL)
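
For example, you can use these checks to pick a scope at runtime, falling back to the CPU when no GPU is available (a minimal sketch; the CPU fallback choice is illustrative):

# Prefer the first CUDA GPU if one is usable, otherwise stay on the CPU
scope = if Dagger.gpu_can_compute(:CUDA)
    Dagger.scope(cuda_gpu=1)
else
    Dagger.scope(worker=1)
end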

Backend-Specific Scopes

Once you've loaded the appropriate package, you can create a scope for that backend by calling:

# First GPU of different GPU types
cuda_scope = Dagger.scope(cuda_gpu=1)
rocm_scope = Dagger.scope(rocm_gpu=1)  
intel_scope = Dagger.scope(intel_gpu=1)
metal_scope = Dagger.scope(metal_gpu=1)
opencl_scope = Dagger.scope(cl_device=1)

These kinds of scopes can be passed to Dagger.@spawn or Dagger.with_options to enable GPU acceleration on the given backend. Note that by default, Dagger will not use any GPU if a compatible scope isn't provided through one of these mechanisms.
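
For example, assuming CUDA.jl is loaded (the scope option on Dagger.@spawn applies to just that task):

cuda_scope = Dagger.scope(cuda_gpu=1)

# Scope a single task
t = Dagger.@spawn scope=cuda_scope sum(rand(Float32, 1024))

# Scope every task launched within the block
Dagger.with_options(;scope=cuda_scope) do
    fetch(Dagger.@spawn sum(rand(Float32, 1024)))
end

fetch(t)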

KernelAbstractions

The most direct way to use GPU acceleration in Dagger is through the KernelAbstractions.jl interface. Dagger provides seamless integration with KernelAbstractions, automatically selecting the appropriate backend for the current processor.

Basic Kernel Usage

Write your kernels using the standard KernelAbstractions syntax:

using KernelAbstractions

@kernel function vector_add!(c, a, b)
    i = @index(Global, Linear)
    c[i] = a[i] + b[i]
end

@kernel function fill_kernel!(arr, value)
    i = @index(Global, Linear)
    arr[i] = value
end

Using Dagger.Kernel for Automatic Backend Selection

Dagger.Kernel wraps your kernel functions and automatically selects the correct backend based on the current processor:

# Allocate mutable arrays in CPU and GPU memory (assumes CUDA.jl is loaded)
cpu_array = Dagger.@mutable zeros(1000)
gpu_array = Dagger.@mutable CUDA.zeros(1000)

# Runs on CPU
fetch(Dagger.@spawn Dagger.Kernel(fill_kernel!)(cpu_array, 42.0; ndrange=length(cpu_array)))

# Runs on GPU when scoped appropriately
Dagger.with_options(;scope=Dagger.scope(cuda_gpu=1)) do
    fetch(Dagger.@spawn Dagger.Kernel(fill_kernel!)(gpu_array, 42.0; ndrange=length(gpu_array)))

    # Synchronize the GPU
    Dagger.gpu_synchronize(:CUDA)
end

Notice the use of Dagger.@mutable to create mutable arrays, including in GPU memory. This is required when mutating arrays in-place with Dagger-launched kernels (outside of Datadeps regions, covered below), since Dagger may otherwise copy task arguments when moving data between processors.
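
To read the results back on the CPU, you can copy the data out through another task; a minimal sketch, where Array performs the GPU-to-CPU copy:

Dagger.with_options(;scope=Dagger.scope(cuda_gpu=1)) do
    host_data = fetch(Dagger.@spawn Array(gpu_array))
end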

Manual Backend Selection with gpu_kernel_backend

For more control, use Dagger.gpu_kernel_backend() to get the backend for the current processor:

function manual_kernel_execution(arr, value)
    # Get the backend for the current processor
    backend = Dagger.gpu_kernel_backend()

    # Create kernel with specific backend
    kernel = fill_kernel!(backend)

    # Execute kernel
    kernel(arr, value; ndrange=length(arr))

    return arr
end

# Use within a Dagger task
arr = Dagger.@mutable CUDA.zeros(1000)
Dagger.with_options(;scope=Dagger.scope(cuda_gpu=1)) do
    fetch(Dagger.@spawn manual_kernel_execution(arr, 42.0))

    Dagger.gpu_synchronize(:CUDA)
end

Kernel Synchronization

Dagger handles synchronization automatically within Dagger tasks, but if you mix Dagger-launched and non-Dagger-launched kernels, you can synchronize the GPU manually:

# Launch kernel as a task - Dagger.Kernel handles backend selection automatically
arr = Dagger.@mutable CUDA.zeros(1000)
Dagger.with_options(;scope=Dagger.scope(cuda_gpu=1)) do
    result = fetch(Dagger.@spawn Dagger.Kernel(fill_kernel!)(arr, 42.0; ndrange=length(arr)))

    # Synchronize kernels launched by Dagger tasks
    Dagger.gpu_synchronize()

    # With Dagger's kernels synchronized, it's now safe to launch
    # a kernel outside of Dagger on the same data
    raw_arr = fetch(arr)
    fill_kernel!(CUDABackend())(raw_arr, 42.0; ndrange=length(raw_arr))

    return result
end

DArray: Distributed GPU Arrays

Dagger's DArray type seamlessly supports GPU acceleration, allowing you to create distributed arrays that are automatically allocated in GPU memory when using appropriate scopes.

GPU Array Allocation

Allocate DArrays directly on GPU devices:

using CUDA  # or AMDGPU, oneAPI, Metal

# Single GPU allocation
gpu_scope = Dagger.scope(cuda_gpu=1)
Dagger.with_options(;scope=gpu_scope) do
    # All standard allocation functions work
    DA_rand = rand(Blocks(32, 32), Float32, 128, 128)
    DA_ones = ones(Blocks(32, 32), Float32, 128, 128)
    DA_zeros = zeros(Blocks(32, 32), Float32, 128, 128)
    DA_randn = randn(Blocks(32, 32), Float32, 128, 128)
end

Multi-GPU Distribution

Distribute arrays across multiple GPUs:

# Use all available CUDA GPUs
all_gpu_scope = Dagger.scope(cuda_gpus=:)
Dagger.with_options(;scope=all_gpu_scope) do
    DA = rand(Blocks(64, 64), Float32, 256, 256)
    # Each chunk may be allocated on a different GPU
end

# Use specific GPUs
multi_gpu_scope = Dagger.scope(cuda_gpus=[1, 2, 3])
Dagger.with_options(;scope=multi_gpu_scope) do
    DA = ones(Blocks(32, 32), Float32, 192, 192)
end

Converting Between CPU and GPU Arrays

Move existing arrays to GPU:

# Create CPU DArray
cpu_array = rand(Blocks(32, 32), 128, 128)

# Convert to GPU
gpu_scope = Dagger.scope(cuda_gpu=1)
gpu_array = Dagger.with_options(;scope=gpu_scope) do
    # Same structure as cpu_array, but allocated on the GPU
    similar(cpu_array)
end

# Convert back to CPU
cpu_result = collect(gpu_array)  # Brings all data back to CPU

(Advanced) Verifying GPU Allocation

If necessary for testing or debugging, you can check that your DArray chunks are actually living on the GPU:

using Distributed  # provides remotecall_fetch

gpu_scope = Dagger.scope(cuda_gpu=1)
Dagger.with_options(;scope=gpu_scope) do
    DA = rand(Blocks(4, 4), Float32, 8, 8)

    # Check each chunk
    for chunk in DA.chunks
        raw_chunk = fetch(chunk; raw=true)
        @assert raw_chunk isa Dagger.Chunk{<:CuArray}

        # Verify it's on the correct GPU device
        @assert remotecall_fetch(raw_chunk.handle.owner, raw_chunk) do chunk
            arr = Dagger.MemPool.poolget(chunk.handle)
            return CUDA.device(arr) == collect(CUDA.devices())[1]  # GPU 1
        end
    end
end

Datadeps: GPU-Compatible Algorithms

Datadeps regions work seamlessly with GPU arrays, enabling complex GPU algorithms with automatic dependency management. Unlike with standalone tasks, you don't need to use Dagger.@mutable, as Datadeps ensures that in-place array mutation is handled correctly.

In-Place GPU Operations

using LinearAlgebra

# Create GPU arrays
gpu_scope = Dagger.scope(cuda_gpu=1)
Dagger.with_options(;scope=gpu_scope) do
    DA = rand(Blocks(4, 4), Float32, 8, 8)
    DB = rand(Blocks(4, 4), Float32, 8, 8)
    DC = zeros(Blocks(4, 4), Float32, 8, 8)

    # In-place matrix multiplication on GPU
    Dagger.spawn_datadeps() do
        Dagger.@spawn mul!(Out(DC), In(DA), In(DB))
    end

    # Verify result
    @assert collect(DC) ≈ collect(DA) * collect(DB)
end

Notice that we didn't need to call Dagger.gpu_synchronize() here, because the DArray is automatically synchronized when collected.

Out-of-Place GPU Operations

Because Dagger options propagate into function calls, you can call algorithms that use Datadeps (such as DArray matrix multiplication) on GPUs without any extra work:

gpu_scope = Dagger.scope(cuda_gpu=1)
Dagger.with_options(;scope=gpu_scope) do
    DA = rand(Blocks(4, 4), Float32, 8, 8)
    DB = rand(Blocks(4, 4), Float32, 8, 8)

    # Out-of-place operations
    DC = DA * DB  # Automatically runs on GPU

    @assert collect(DC) ≈ collect(DA) * collect(DB)
end

Complex GPU Algorithms

using LinearAlgebra

gpu_scope = Dagger.scope(cuda_gpu=1)
Dagger.with_options(;scope=gpu_scope) do
    # Create a positive definite matrix for Cholesky decomposition
    A = rand(Float32, 8, 8)
    A = A * A'
    A[diagind(A)] .+= size(A, 1)
    DA = DArray(A, Blocks(4, 4))
    
    # Cholesky decomposition on GPU
    chol_result = cholesky(DA)
    @assert collect(chol_result.U) ≈ cholesky(collect(DA)).U
end

Cross-Backend Operations

# You can even mix different GPU types in a single computation
# (though data movement between different GPU types goes through CPU)

cuda_data = Dagger.with_options(;scope=Dagger.scope(cuda_gpu=1)) do
    rand(Blocks(32, 32), Float32, 64, 64)
end

rocm_result = Dagger.with_options(;scope=Dagger.scope(rocm_gpu=1)) do
    # Data automatically moved: CUDA GPU -> CPU -> ROCm GPU
    fetch(Dagger.@spawn sum(cuda_data))
end

Distributed GPU Computing

You can easily combine GPU acceleration with distributed computing across multiple workers.

Multi-Worker GPU Setup

using Distributed
addprocs(4)  # Add 4 workers

@everywhere using Dagger, CUDA

# Use GPU 1 on worker 2
distributed_gpu_scope = Dagger.scope(worker=2, cuda_gpu=1)

Dagger.with_options(;scope=distributed_gpu_scope) do
    # Create a GPU array and sum it on worker 2, GPU 1
    DA = rand(Blocks(32, 32), Float32, 128, 128)
    result = fetch(Dagger.@spawn sum(DA))
end

Load Balancing Across GPUs and Workers

# Distribute work across multiple workers and their GPUs
workers_with_gpus = [
    Dagger.scope(worker=2, cuda_gpu=1),
    Dagger.scope(worker=3, cuda_gpu=1),
    Dagger.scope(worker=4, rocm_gpu=1)  # Mix of GPU types
]

# Run one computation on each worker/GPU pair
results = map(workers_with_gpus) do scope
    Dagger.with_options(;scope) do
        DA = rand(Blocks(16, 16), Float32, 64, 64)
        fetch(Dagger.@spawn sum(DA))
    end
end
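
Each task returns the sum computed on its assigned worker and GPU, so the per-device partial results can be combined on the CPU afterwards:

# Combine per-device partial results
total = sum(results)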