Some changes were made in 5ea3be265b
which were incomplete. In the case of non-CUB, bitonic sort and its
limitations of ncols < 1024 have to apply, similar to argsort.cu
This commit adds a macro guard around argsort_f32_i32_cuda_cub usage
in the top-k fallback path, falling back to bitonic sort when
GGML_CUDA_USE_CUB is not defined.
The motivation for this is that some environments like AMD HIP
do not have CUB available, causing compilation failure.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/19728226426/job/56523606840#step:6:208
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4
* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
* first commit naive test to enable mmq for RDNA4
* adding appropriate WMMA instructions
* git rebase on top of master: fixing the correctness of the mat mul operations, updating layout mappings for RDNA4
* clean up merge conflicts
* add comments and code clean up
* PR clean up, addressed comments
* enable MMQ fallback on RDNA4
* addressed comments: add guards in load generic, separate wmma branch for use_mmq function
* Revert build-xcframework.sh
* Formating: remove trailing whitespace
* revert CMake files
* clean up after rebase: remove duplicated change, revert cmake files
* clean up after rebase: revert changes from build-xcframework.sh
* clean up: remove extra space line in mma.cuh
* Revert "clean up: remove extra space line in mma.cuh"
This reverts commit b39ed57c45.
* mmf for rdna4
* align the padding for rdna4
* forbit mul_mat_f for rdna4
* fix as comment
* remove device kernels
* add constexpr for early return
* update based on review comment
* change based on the review comment
* pass compile error
* keep code consistency
---------
Co-authored-by: zhang hui <you@example.com>
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
Argsort is used for top-k currently. WE optimize argsort by 2 things:
1. Use `DeviceRadixSort` for single-row/sequence to parallelize it
across our SMs
2. Use `DeviceSegmentedSort` for multi-row/sequence as this is the
correct entrypoint (the function chooses different execution paths,
it contains `DeviceSegmentedRadixSort` as one of the paths and will
choose the best one according to heuristics.
https://nvidia.github.io/cccl/cub/api/structcub_1_1DeviceSegmentedSort.html#overview
Some perf numbers for a RTX PRO 6000:
On the kernel level, tested with
`GGML_CUDA_DISABLE_GRAPHS=1 ./test-backend-ops -o ARGSORT perf`
Before:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 359.24 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 8192 runs - 861.34 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 1020.01 us/run
```
After:
```
ARGSORT(type=f32,ne=[65000,16,1,1],order=0): 4130 runs - 312.41 us/run
ARGSORT(type=f32,ne=[200000,1,1,1],order=0): 16384 runs - 63.48 us/run
ARGSORT(type=f32,ne=[200000,16,1,1],order=0): 1343 runs - 874.36 us/run
```
---
On the model level, tested with
`llama-cli -m gpt-oss-20b-mxfp4.gguf -n 200 -p "What is
the Capital of Sweden?" -no-cnv -fa 1 --backend-sampling`
Before:
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 824701.20 tokens per second)
llama_perf_context_print: load time = 18215.58 ms
llama_perf_context_print: prompt eval time = 28.20 ms / 7 tokens ( 4.03 ms per token, 248.19 tokens per second)
llama_perf_context_print: eval time = 714.79 ms / 199 runs ( 3.59 ms per token, 278.40 tokens per second)
llama_perf_context_print: total time = 857.62 ms / 206 tokens
```
After
```
llama_perf_sampler_print: sampling time = 0.25 ms / 207 runs ( 0.00 ms per token, 828000.00 tokens per second)
llama_perf_context_print: load time = 18366.92 ms
llama_perf_context_print: prompt eval time = 35.92 ms / 7 tokens ( 5.13 ms per token, 194.87 tokens per second)
llama_perf_context_print: eval time = 532.79 ms / 199 runs ( 2.68 ms per token, 373.50 tokens per second)
llama_perf_context_print: total time = 683.65 ms / 206 tokens
```
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* tranpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true tranposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
* CUDA: Remove unneded bias/gate dims in fused mmvq
Pointed out
[here](https://github.com/ggml-org/llama.cpp/pull/16847#discussion_r2476798989)
that only a single value is needed per target col per thread
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Fix "Error 991-D: extra braces are nonstandard" during compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Volta tensor core support for MMF
* more generic checks for hardware support
* Update ggml/src/ggml-cuda/mmf.cuh
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
This is realised by loading them into registers before computation of
the dot-product, effectively batching them together with said
dot-product. As a lot of threads are alive here, the warp scheduler has
enough threads available to effectively hide the cost of additionally
loading those two floats.
* CUDA: Fix bug in topk-moe for gpt-oss
When using ggml_can_fuse_subgraph, the output nodes which are passed are wrong. This causes `test-backend-ops` to still fuse ndoes (because the nodes are not used elsewhere in the graph),
but it actually doesn't fuse in the actual gpt-oss
* fix for qwen3 too
* change ifndef to ifdef
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning