llama.cpp/ggml
Oliver Simons 9f5e1edb10
CUDA: Limit DeviceSegmentedSort to immediate mode (#21718)
* CUDA: Limit DeviceSegmentedSort to immediate mode

DeviceSegmentedSort is currently not capturable in a cuda graph. Hence,
we have to go for the slower DeviceSegmentedRadixSort in that case.

Perf numbers on RTX Pro 6000 Blackwell Max-Q:
DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs)

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 12291 runs -   105.94 us/run -     8192 kB/run -   73.75 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 10245 runs -   115.08 us/run -    16384 kB/run -  135.77 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   221.22 us/run -    32768 kB/run -  141.26 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   430.98 us/run -    65536 kB/run -  145.02 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1028 runs -  1185.83 us/run -   131072 kB/run -  105.41 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  387 runs -  2748.62 us/run -   262144 kB/run -   90.95 GB/s

DeviceSegmentedSort in immediate mode

  ARGSORT(type=f32,ne=[2048,512,1,1],order=1):                 16388 runs -    71.17 us/run -     8192 kB/run -  109.78 GB/s
  ARGSORT(type=f32,ne=[4096,512,1,1],order=1):                 12294 runs -    81.38 us/run -    16384 kB/run -  192.00 GB/s
  ARGSORT(type=f32,ne=[8192,512,1,1],order=1):                  5125 runs -   240.81 us/run -    32768 kB/run -  129.77 GB/s
  ARGSORT(type=f32,ne=[16384,512,1,1],order=1):                 2565 runs -   406.60 us/run -    65536 kB/run -  153.71 GB/s
  ARGSORT(type=f32,ne=[32768,512,1,1],order=1):                 1285 runs -   873.23 us/run -   131072 kB/run -  143.15 GB/s
  ARGSORT(type=f32,ne=[65536,512,1,1],order=1):                  516 runs -  2288.46 us/run -   262144 kB/run -  109.24 GB/s

* Add test case for dispatch to DeviceSegmentedRadixSort

We currently lack a way to force graph mode in CUDA, patch callback to
invoke ggml_backend_compare_graph_backend twice to enforce each test to
run in graph mode
2026-04-13 11:14:06 +02:00
..
cmake ggml: backend-agnostic tensor parallelism (experimental) (#19378) 2026-04-09 16:42:19 +02:00
include ggml: backend-agnostic tensor parallelism (experimental) (#19378) 2026-04-09 16:42:19 +02:00
src CUDA: Limit DeviceSegmentedSort to immediate mode (#21718) 2026-04-13 11:14:06 +02:00
.gitignore
CMakeLists.txt ggml: backend-agnostic tensor parallelism (experimental) (#19378) 2026-04-09 16:42:19 +02:00