This commit updates the include directive in cumsum.cu to use
cub/cub.cuh instead of cub/block/block_scan.cuh.
The motivation of this change is that without it compilation fails
with the following error:
```console
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(196): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum(nullptr,
^
/llama.cpp/ggml/src/ggml-cuda/cumsum.cu(207): error: name followed by "::" must be a class or namespace name
cub::DeviceScan::InclusiveSum((void *) tmp_alloc.get(), tmp_size, src, dst, ne, stream);
^
2 errors detected in the compilation of "/llama.cpp/ggml/src/ggml-cuda/cumsum.cu".
gmake[2]: *** [ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/build.make:317: ggml/src/ggml-cuda/CMakeFiles/ggml-cuda.dir/cumsum.cu.o] Error 2
```
Commit 83b3b1c271 ("cuda: optimize
cumsum cub path (#18362)") updated the include directive replacing
device_scan.cuh which is causing this issue.
This commit uses cub/cub.cuh umbrella header which is consistent with
other files in the ggml-cuda directory like mean.cu, sum.cu, etc.
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.
* ggml-cuda: fix blackwell native builds
Replace 12x in native architectures by 12xa
* replace for GGML_NATIVE=OFF too
* only replace for native
* remove 120f-virtual for default compilation
---------
Co-authored-by: Aman Gupta <aman>
* CONV_TRANSPOSE_1D kernel_size>255
* remove condition check
* fix the bug of type conversion
* removing trailing whitespaces
* fix: return true in the switch case
* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility
* refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility
* refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity
* add comment
* refactor: remove redundant buffer checks in hexagon supported operations
* wip
* add missing include to fix weak symbol warning
* add ggml_hexagon_op_generic
* refactor: simplify tensor operation initialization and buffer management in hexagon implementation
* refactor: streamline hexagon operation initialization and buffer management
* refactor: update function signatures and streamline request handling in hexagon operations
* wip
* ggml-hexagon: clean up code formatting and improve unary operation handling
* wip
* rename
* fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations
* refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility
refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility
refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity
refactor: remove redundant buffer checks in hexagon supported operations
add missing include to fix weak symbol warning
add ggml_hexagon_op_generic
refactor: simplify tensor operation initialization and buffer management in hexagon implementation
refactor: streamline hexagon operation initialization and buffer management
refactor: update function signatures and streamline request handling in hexagon operations
ggml-hexagon: clean up code formatting and improve unary operation handling
fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations
# Conflicts:
# ggml/src/ggml-hexagon/ggml-hexagon.cpp
* hexagon: fix merge conflicts
* hexagon: minor cleanup for buffer support checks
* hexagon: factor out op_desc and the overal op logging
* hexagon: further simplify and cleanup op dispatch logic
* snapdragon: update adb scripts to use llama-cli and llama-completion
* fix pipeline failure
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* feat: working gelu with src0 put on vtcm
* feat: gelu ping-pong for both in and out
* fix: fixu compile error
* break: distinguish dma ddr->vtcm and vtcm->ddr operation
* fix: fix dma queue size
* break: update dma api to either pop src or dst ptr
* fix: fix activation vtcm allocation issue for src1 when swapperd
* refactor: ping-pong gelu logic to avoid unnecessary if else
* dma: improved queue interface and prefetch handling
* gelu: fix N+2 block prefetch
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
The goal is to enable the async loading code paths in
llama_model_loader::load_all_data, originally from #7896. This works and the
loads themselves are faster, but with host visible vidmem I think the cost of
allocating/mapping vidmem moves and becomes more expensive, and I don't see a
benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a
significant improvement in model loading time.
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
* Some improvement on mul_mat_iq2_xs
Refactor calculations for db values and grid data to optimize performance and reduce redundancy.
* Fix trailing whitespace