llama.cpp/ggml
uaruss 5d9f64c54e ggml-cuda: fix ROCm multi-GPU illegal memory access in recurrent state restore
Remove early-return optimization in ggml_cuda_set_device() that caused
hipErrorIllegalAddress on ROCm multi-GPU setups with hybrid recurrent
models (Mamba/SSM architectures).

On ROCm, hipGetDevice() can return an unexpected value on threads that
have never explicitly called hipSetDevice(). If this value matches
ctx->device, the early-return fires and hipSetDevice() is never called,
causing the subsequent hipMemcpyAsync to fail with "current device: -1".

cudaSetDevice() with the already-active device is a near no-op in
modern CUDA/ROCm drivers, so removing the optimization has negligible
performance impact while eliminating this class of thread context bugs.

Also add missing ggml_cuda_set_device() call in
ggml_backend_cuda_set_tensor_async() for consistency with all other
cudaMemcpyAsync call sites in this file.
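The call-site convention can be sketched as below. This is an illustrative sketch only: the function name and its signature follow the ggml backend interface, but the body is an assumption about what the fixed code roughly looks like, not the actual diff.

```cuda
// Sketch: bind the backend's device on the calling thread before issuing
// the async copy, matching the other cudaMemcpyAsync call sites.
static void ggml_backend_cuda_set_tensor_async(ggml_backend_t backend,
        ggml_tensor * tensor, const void * data, size_t offset, size_t size) {
    ggml_backend_cuda_context * cuda_ctx = (ggml_backend_cuda_context *) backend->context;

    ggml_cuda_set_device(cuda_ctx->device); // added: establish the CUDA/HIP context
    CUDA_CHECK(cudaMemcpyAsync((char *) tensor->data + offset, data, size,
                               cudaMemcpyHostToDevice, cuda_ctx->stream()));
}
```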

Fixes #21140
Tested on: 2x AMD Radeon AI Pro R9700 (gfx1201), ROCm 7.2.0
2026-03-29 23:31:27 -04:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include llama : enable chunked fused GDN path (#20340) 2026-03-11 22:46:40 +02:00
src ggml-cuda: fix ROCm multi-GPU illegal memory access in recurrent state restore 2026-03-29 23:31:27 -04:00
.gitignore
CMakeLists.txt ggml : fix typo gmml (#20512) 2026-03-13 14:36:13 +01:00