If multiple MPI ranks trigger Julia's module precompilation, then a race condition can result in an error such as:
ERROR: LoadError: IOError: mkdir: file already exists (EEXIST)
Stacktrace:
 uv_error at ./libuv.jl:97 [inlined]
 mkdir(::String; mode::UInt16) at ./file.jl:177
 mkpath(::String; mode::UInt16) at ./file.jl:227
 mkpath at ./file.jl:222 [inlined]
 compilecache_path(::Base.PkgId) at ./loading.jl:1210
 compilecache(::Base.PkgId, ::String) at ./loading.jl:1240
 _require(::Base.PkgId) at ./loading.jl:1029
 require(::Base.PkgId) at ./loading.jl:927
 require(::Module, ::Symbol) at ./loading.jl:922
 include(::Module, ::String) at ./Base.jl:377
 exec_options(::Base.JLOptions) at ./client.jl:288
 _start() at ./client.jl:484
This can be worked around by either:
Triggering precompilation before launching MPI processes, for example:
julia --project -e 'using Pkg; pkg"instantiate"'
julia --project -e 'using Pkg; pkg"precompile"'
mpiexec julia --project script.jl
Launching Julia with the --compiled-modules=no option. This can result in much longer package load times.
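The first workaround can be wrapped in a small launcher so that the MPI step only starts once the serial steps have succeeded. The following is a minimal sketch, not part of MPI.jl: the function name launch_after_precompile and the JULIA and MPIEXEC override variables are hypothetical, and script.jl stands in for your own program.

```shell
#!/bin/sh
# Hypothetical wrapper: precompile serially, then launch the MPI ranks.
# JULIA and MPIEXEC are illustrative overrides; they default to the
# commands used in the text above.
launch_after_precompile() {
    script=$1
    julia_cmd=${JULIA:-julia}
    mpiexec_cmd=${MPIEXEC:-mpiexec}

    # Serial steps: a single process writes the compile cache.
    "$julia_cmd" --project -e 'using Pkg; pkg"instantiate"' || return 1
    "$julia_cmd" --project -e 'using Pkg; pkg"precompile"' || return 1

    # Parallel step: reached only if both serial steps succeeded.
    "$mpiexec_cmd" "$julia_cmd" --project "$script"
}
```

A job script would then call launch_after_precompile script.jl in place of the three separate commands. Stopping on the first failure avoids starting many ranks against a project that did not instantiate or precompile cleanly.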
UCX is a communication framework used by several MPI implementations.
When used with CUDA, UCX intercepts cudaMalloc so it can determine whether the pointer passed to MPI is on the host (main memory) or the device (GPU). Unfortunately, there are several known issues with how this works with Julia.
By default, MPI.jl disables this by setting
ENV["UCX_MEMTYPE_CACHE"] = "no"
in __init__, which may result in reduced performance, especially for smaller messages.
Julia uses the SIGSEGV signal internally as part of its garbage collection. By default, UCX will error if this signal is raised (#337), resulting in a message such as:
Caught signal 11 (Segmentation fault: invalid permissions for mapped object at address 0xXXXXXXXX)
This signal interception can be controlled by setting the environment variable
UCX_ERROR_SIGNALS: if not already defined, MPI.jl will set it as:
ENV["UCX_ERROR_SIGNALS"] = "SIGILL,SIGBUS,SIGFPE"
in __init__. If set externally, it should be modified to exclude
SIGSEGV from the list.
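When UCX_ERROR_SIGNALS is set externally, for example by a site-wide module or job template, the adjustment described above can be done in the launch script. The following is a rough sketch under stated assumptions: strip_sigsegv is a hypothetical helper, not part of UCX or MPI.jl, and filtering the short form SEGV as well is a precaution that the text above does not mention.

```shell
#!/bin/sh
# Hypothetical helper: print a comma-separated signal list with the
# segmentation-fault signal removed. The body runs in a subshell, so the
# IFS change does not leak into the calling script.
strip_sigsegv() (
    IFS=','
    result=""
    for sig in $1; do
        case $sig in
            SIGSEGV|SEGV) ;;                        # drop the segfault signal
            *) result="${result:+$result,}$sig" ;;  # keep everything else
        esac
    done
    printf '%s\n' "$result"
)

# Only rewrite the variable if it was set externally; when it is unset,
# leave it alone so that MPI.jl can apply its own default.
if [ -n "${UCX_ERROR_SIGNALS+set}" ]; then
    UCX_ERROR_SIGNALS=$(strip_sigsegv "$UCX_ERROR_SIGNALS")
    export UCX_ERROR_SIGNALS
fi
```

Placing these lines before the mpiexec call means each rank inherits the filtered list, provided the MPI launcher forwards the environment to its ranks.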