Remove early-return optimization in ggml_cuda_set_device() that caused hipErrorIllegalAddress on ROCm multi-GPU setups with hybrid recurrent models (Mamba/SSM architectures).

On ROCm, hipGetDevice() can return an unexpected value on threads that have never explicitly called hipSetDevice(). If this value matches ctx->device, the early-return fires and hipSetDevice() is never called, causing the subsequent hipMemcpyAsync to fail with current device: -1.

cudaSetDevice() with the already-active device is a near no-op in modern CUDA/ROCm drivers, so removing the optimization has negligible performance impact while eliminating this class of thread context bugs.

Also add missing ggml_cuda_set_device() call in ggml_backend_cuda_set_tensor_async() for consistency with all other cudaMemcpyAsync call sites in this file.

Fixes #21140

Tested on: 2x AMD Radeon AI Pro R9700 (gfx1201), ROCm 7.2.0
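
The failure mode described in the commit message can be illustrated with a small simulation. This is only a sketch: `MockRuntime` and both helper functions are hypothetical names invented for illustration, not part of ggml, CUDA, or HIP, and the mock assumes that a fresh thread reports device 0 while no device is actually bound.

```cpp
// Simulation only: MockRuntime is a hypothetical stand-in for the GPU runtime,
// not a real API. It models a thread whose reported device differs from the
// device actually bound to that thread.
#include <cassert>

struct MockRuntime {
    int bound_device    = -1; // device actually bound to this thread (-1: none)
    int reported_device = 0;  // assumed value a device query returns on a fresh thread

    // Reports the bound device, or the assumed default if nothing is bound.
    int get_device() const {
        return bound_device >= 0 ? bound_device : reported_device;
    }

    // Binds the given device to this thread.
    void set_device(int device) {
        assert(device >= 0);
        bound_device = device;
    }

    // An async copy succeeds only if the target device is really bound.
    bool memcpy_async(int device) const {
        return bound_device == device;
    }
};

// Old behaviour: skip the set call when the query already matches.
inline void set_device_with_early_return(MockRuntime & rt, int device) {
    if (rt.get_device() == device) {
        return; // fires on a fresh thread, so the device is never bound
    }
    rt.set_device(device);
}

// Behaviour after the change: always bind the device.
inline void set_device_always(MockRuntime & rt, int device) {
    rt.set_device(device);
}
```

In this model, calling `set_device_with_early_return(rt, 0)` on a fresh `MockRuntime` leaves `bound_device` at -1, so the following copy fails, whereas `set_device_always(rt, 0)` binds the device and the copy succeeds. The real runtime behaviour is more involved; the sketch only shows why trusting the queried device can skip a call that was needed.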

| Name |
|---|
| cmake |
| include |
| src |
| .gitignore |
| CMakeLists.txt |