This shader uses coopmat1 to do the Q*K^T multiply. The P*V multiply is more
difficult for various reasons so I haven't done it. Performance for this
shader is around 2.5x better than for the scalar shader when doing prompt
processing. Some of the benefit may be from other optimizations like staging
through shared memory, or splitting by rows.
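As a rough, purely illustrative sketch of the staging idea (plain C++, not the actual GLSL/coopmat shader): a tile of K is copied into a small local buffer standing in for shared memory and reused for every row of Q. The function and buffer names are hypothetical and the cooperative-matrix instructions themselves are not shown.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// S = Q * K^T, computed one K-tile at a time. Q is rows x d, K is cols x d (row-major).
static void qkT_tiled(const std::vector<float> &Q, const std::vector<float> &K,
                      std::vector<float> &S, std::size_t rows, std::size_t cols,
                      std::size_t d, std::size_t tile) {
    std::vector<float> k_tile(tile * d); // stand-in for the shared-memory staging buffer
    for (std::size_t c0 = 0; c0 < cols; c0 += tile) {
        const std::size_t n = std::min(tile, cols - c0);
        // stage the tile once ...
        std::copy(K.begin() + c0 * d, K.begin() + (c0 + n) * d, k_tile.begin());
        // ... and reuse it for every row of Q
        for (std::size_t r = 0; r < rows; ++r) {
            for (std::size_t c = 0; c < n; ++c) {
                float acc = 0.0f;
                for (std::size_t k = 0; k < d; ++k) {
                    acc += Q[r * d + k] * k_tile[c * d + k];
                }
                S[r * cols + c0 + c] = acc;
            }
        }
    }
}
```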
* batched-bench : fix pp batch contents
* metal : optimize multi-sequence FA vec kernel
ggml-ci
* metal : use FA-vec kernel up to batch size 20
ggml-ci
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
* ggml-cpu: Integrate fp32=bf16xbf16 SME KleidiAI kernel
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* code review fixes
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
* adds a comment that clarifies barrier usage
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
---------
Signed-off-by: Dan Johansson <dan.johansson@arm.com>
Co-authored-by: Charles Xu <charles.xu@arm.com>
* vulkan: scalar flash attention implementation
* vulkan: always use fp32 for scalar flash attention
* vulkan: use vector loads in scalar flash attention shader
* vulkan: remove PV matrix, helps with register usage
* vulkan: reduce register usage in scalar FA, but perf may be slightly worse
* vulkan: load each Q value once. optimize O reduction. more tuning
* vulkan: support q4_0/q8_0 KV in scalar FA
* CI: increase timeout to accommodate newly-supported tests
* vulkan: for scalar FA, select between 1 and 8 rows
* vulkan: avoid using Float16 capability in scalar FA
* sycl : Implemented reorder Q4_0 mmvq
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* sycl : Fixed mmvq being called when reorder is disabled
* sycl : Improved comments in the quants header
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
* Use static_assert
* safe_div -> ceil_div
* Clarify qi comment
* change the reorder tensor from init to execute OP
* dbg
* Undo changes to test-backend-ops
* Refactor changes on top of q4_0 reorder fix
* Missing Reverts
* Refactored opt_for_reorder logic to simplify code path
* Explicit inlining and unroll
* Renamed mul_mat_algo enum for consistency
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@codeplay.com>
Co-authored-by: romain.biessy <romain.biessy@codeplay.com>
This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf:
GGML_ASSERT(nei0 * nei1 <= 3072);
The tensor is 8 x 512. Increase this array size to accommodate.
* ggml : remove MSVC warnings pragmas
This commit removes the MSVC-specific pragmas as these are now handled
in ggml/CMakeLists.txt.
* whisper : remove MSVC warning pragmas
This commit removes the MSVC-specific pragmas. These are now handled in
the ggml/CMakeLists.txt file.
This patch upstreams llamafile's CPU matrix multiplication kernels for ppc64le, using MMA builtins for the BF16 data type.
This change results in 9x - 40x gains in total speed S t/s (i.e. all tokens / total time) across various batch sizes, tested using the llama-batched-bench benchmark.
The patch was tested with Meta-Llama-3-8B and Mistral-7B models (BF16 models generated by running llama-quantize on the corresponding FP32 models) on an IBM POWER10 machine.
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
The following scenario will cause an assertion failure in the graph
allocator:
- Build and allocate a graph containing a tensor with a non-NULL data
pointer
- Build and allocate a new graph where that data is NULL
Result:
ggml-alloc.c:819: GGML_ASSERT(talloc->buffer_id >= 0) failed
This happens during revalidation because we think that memory should
have been previously allocated based on the current graph but in
reality the previous graph was different. In this situation, we
should do a full reallocation pass.
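A minimal sketch of the resulting decision, in plain C++ rather than the actual ggml-alloc code; the struct and function names are hypothetical and only illustrate when revalidation must fall back to a full reallocation pass.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a graph node: data != nullptr means "already allocated elsewhere".
struct node_info {
    void  *data;
    size_t size;
};

// Reuse the previous allocation plan only if every node's "pre-allocated vs. needs a
// buffer" status matches what the previous graph had; otherwise do a full reallocation.
static bool needs_full_realloc(const std::vector<node_info> &prev,
                               const std::vector<node_info> &cur) {
    if (prev.size() != cur.size()) {
        return true;
    }
    for (size_t i = 0; i < prev.size(); ++i) {
        // e.g. a tensor whose data was non-NULL before but is NULL now must get a buffer
        if ((prev[i].data == nullptr) != (cur[i].data == nullptr)) {
            return true;
        }
        if (cur[i].data == nullptr && cur[i].size > prev[i].size) {
            return true; // a larger tensor would not fit the previously planned buffer
        }
    }
    return false;
}
```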
* vulkan: Add bfloat16 support
This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16.
The extension is required for coopmat multiply support, but matrix-vector
multiply trivially promotes bf16 to fp32 and doesn't require the extension.
The copy/get_rows shaders also don't require the extension.
It's probably possible to fall back to non-coopmat and promote to fp32 when
the extension isn't supported, but this change doesn't do that.
The coopmat support also requires a glslc that supports the extension, which
currently requires a custom build.
* vulkan: Support bf16 tensors without the bf16 extension or coopmat support
Compile a variant of the scalar mul_mm shader that will promote the bf16
values to float, and use that when either the bf16 extension or the coopmat
extensions aren't available.
* vulkan: bfloat16 fixes (really works without bfloat16 support now)
* vulkan: fix spirv-val failure and reenable -O
This commit adds a check to make sure that the target exists before
trying to add compile options to ignore warnings when using MSVC.
The motivation for this is that the build is currently broken depending
on the CMake options provided. With this fix it should be possible to
build even if the targets are not actually available.
Refs: https://github.com/ggml-org/whisper.cpp/pull/3090#issuecomment-2842760104
* whisper: suppress Windows compiler warnings
This commit disables compiler warnings on Windows when using MSVC.
The motivation for these changes is that some compilers, for example
MSVC on Windows, generate warnings for these conversions, and there are
quite a few of them. This makes it harder to spot newly introduced
warnings, and it is also inconvenient for users/embedders of ggml, who
have to separate these warnings from their own.
* squash! whisper: suppress Windows compiler warnings
Move ggml-related warnings into ggml. This commit also fixes the
indentation and adds missing whitespace to the if statement.
The build fails with a compilation error on PowerPC.
This patch fixes it.
Tested with unit tests run via
--build <build_dir> && cd <build_dir> && make test
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* fix(rpc): Improve input validation and error handling
The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.
This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:
- **Type Validation:** `deserialize_tensor` now checks if the
`tensor->type` is within the valid `GGML_TYPE_COUNT` range
*before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
`set_tensor_hash`, and `get_tensor` handlers with error
logging and returning `false` when data/offset parameters
are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
`graph_compute` when calculating required message sizes based
on client-provided `n_nodes` and `n_tensors`. Returns early
if the reported sizes conflict with the actual message size or
would lead to overflow.
- **Error Propagation:**
- `create_node` now checks for `nullptr` return values from
`deserialize_tensor` and its recursive calls, propagating
`nullptr` upwards on failure. Uses `find` instead of `at`
for safer map access.
- `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
and sets the response status to failure if deserialization
or bounds checks fail.
- `graph_compute` now checks for `nullptr` return from
`create_node` and returns failure status correctly. The final
return value now reflects the actual computation status.
These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.
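A condensed sketch of the validation pattern described above, with hypothetical struct/field names and a placeholder type count; it is not the actual rpc-server code.

```cpp
#include <cstdint>

constexpr uint32_t TYPE_COUNT = 39;    // placeholder standing in for GGML_TYPE_COUNT

struct rpc_tensor_msg {                // hypothetical wire representation
    uint32_t type;
    uint64_t offset;
    uint64_t size;
};

// Validate the client-supplied type before it ever reaches ggml_new_tensor_4d.
static bool valid_type(const rpc_tensor_msg &t) {
    return t.type < TYPE_COUNT;        // reject gracefully instead of asserting
}

// Bounds check with overflow-safe arithmetic: offset + size must stay inside the buffer.
static bool valid_range(const rpc_tensor_msg &t, uint64_t buffer_size) {
    if (t.offset > buffer_size) {
        return false;
    }
    return t.size <= buffer_size - t.offset;   // avoids computing offset + size directly
}
```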
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): address pr comments
removed comments and unnecessary returns
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): ambiguous nullptr from create_node
rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).
This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
`create_node` returns nullptr, correctly identifying failures
versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
nullptr unambiguously on failure during recursion.
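A small illustrative sketch of the caller-side check described above (hypothetical signatures, not the actual rpc_server code):

```cpp
#include <cstdint>

struct tensor;                                    // opaque placeholder type

// Hypothetical wrapper around create_node: nullptr is only an error when id != 0,
// since id == 0 legitimately means "no tensor linked here".
static bool build_node(uint64_t id, tensor *(*create_node)(uint64_t), tensor **out) {
    tensor *t = create_node(id);
    if (t == nullptr && id != 0) {
        return false;  // deserialization or recursion failure
    }
    *out = t;          // may be nullptr for an intentional null link
    return true;
}
```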
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): initial zero check in create_node
The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.
Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* fix(rpc): Handle get_alloc_size failure in server
Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): input size validation in graph_compute
Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove extra status code setting
Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* refactor(rpc): remove redundant check for tensor->type
The check breaks CI on ubuntu-cpu-make. The tensor type is uint32_t,
so the check is not needed.
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
---------
Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
* SYCL: Add all missing unary kernels
ggml-ci
* decouple kernel launch range from data size using strided loop
* use ceil_div helper for num_blocks
ggml-ci
* clean auto imported header files
RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.
The performance impact of this change depends on the network latency.
* tune matmul for gcn
* this one is more power efficient
* Update ggml/src/ggml-vulkan/ggml-vulkan.cpp
Co-authored-by: 0cc4m <picard12@live.de>
* disable this tune for the proprietary driver
---------
Co-authored-by: 0cc4m <picard12@live.de>
Add RPC_CMD_HELLO for getting the version of the protocol implemented by
the server. We follow the semantic versioning rules at https://semver.org.
Hopefully this brings a better user experience when we make breaking
changes at the protocol level and avoids issues like #12465.
* graph : make mla compatible with FA
* metal : add exp FA kernels for DeepSeek models
ggml-ci
* llama : minor naming updates
ggml-ci
* ggml : disable FA for DS head sizes
* tests : add FA tests for MLA shapes
ggml-ci
Submit operators using asynchronous threads to improve performance.
Use the environment variable GGML_CANN_ASYNC_MODE to control whether
asynchronous submission is enabled. It is disabled by default.
Testing shows a 10%–20% performance improvement in scenarios with
small parameter sizes, especially in quantized models.
The grouped query attention optimization doesn't require a power-of-two ratio;
the only thing relying on it was the modulo operation written as a bitwise &.
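For illustration (hypothetical helper names): `x & (ratio - 1)` only equals `x % ratio` when `ratio` is a power of two, which is the restriction being lifted here.

```cpp
#include <cassert>
#include <cstdint>

// Valid only for power-of-two ratios: relies on ratio - 1 being an all-ones bit mask.
static uint32_t mod_pow2(uint32_t x, uint32_t ratio) {
    assert(ratio != 0 && (ratio & (ratio - 1)) == 0);
    return x & (ratio - 1);
}

// Works for any non-zero ratio, at the cost of an integer division.
static uint32_t mod_any(uint32_t x, uint32_t ratio) {
    return x % ratio;
}
```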
split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
* opencl: refactor - split the kernel files
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
* opencl: split more kernels into separate files
* opencl: specify subgroup size instead of querying it
* opencl: refine Adreno cl compiler version parsing
* opencl: skip some kernels not used by Adreno on old compilers
* opencl: refine logic for selecting Adreno kernels
* opencl: refine Adreno cl compiler version
* opencl: cleanup preprocessor for kernels
* opencl: consider Adreno CL compiler on Windows
* opencl: add final newline for `mul_mv_f16_f16.cl`
---------
Co-authored-by: Shangqing Gu <quic_shawngu@quicinc.com>
Replace compile-time `GGML_HIP_UMA` with environment variable `GGML_CUDA_ENABLE_UNIFIED_MEMORY`. This unifies the usage on NVIDIA and AMD GPUs, and allows a single binary to be shared between integrated and dedicated GPUs.
Multiple optional memory pools are provided for CANN, including VMM,
priority queue-based, and traditional memory pools.
1. When the memory pool is available and GGML_CANN_DISABLE_VMM_POOL
is not defined, the VMM pool is selected by default.
2. Otherwise, if GGML_CANN_ENABLE_BUF_PRIO_POOL is defined,
the priority queue-based memory pool is used.
3. If neither condition is met, the default memory pool is used.
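A hedged sketch of the selection order described above; the enum and function are illustrative, and only the two macro names come from the text.

```cpp
enum class cann_pool_kind { vmm, buf_prio, legacy };

// Mirrors the 1/2/3 order above: VMM first (unless disabled), then the
// priority-queue pool if enabled, otherwise the default pool.
static cann_pool_kind select_pool(bool vmm_available) {
    (void) vmm_available; // may be unused depending on the build flags
#ifndef GGML_CANN_DISABLE_VMM_POOL
    if (vmm_available) {
        return cann_pool_kind::vmm;
    }
#endif
#ifdef GGML_CANN_ENABLE_BUF_PRIO_POOL
    return cann_pool_kind::buf_prio;
#else
    return cann_pool_kind::legacy;
#endif
}
```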
The current usage of the SYCL-Graph extension checks for
the `sycl_ext_oneapi_graph` device aspect. However, it is also
possible to support `sycl_ext_oneapi_limited_graph` devices that
don't support graph update.
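A sketch of the widened capability check; the aspect enumerator names follow the sycl_ext_oneapi_graph extension and may vary between DPC++ releases, so treat them as assumptions.

```cpp
#include <sycl/sycl.hpp>

// Accept either full graph support (including update) or the "limited" variant.
static bool device_supports_sycl_graph(const sycl::device &dev) {
    return dev.has(sycl::aspect::ext_oneapi_graph) ||
           dev.has(sycl::aspect::ext_oneapi_limited_graph);
}
```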
* SYCL: Add fp16 support to some elementwise OP kernels
* remove comment
ggml-ci
* Use static_cast directly
* remove not needed cast from tanh
* Use static_cast and remove unneeded casts
* Adjust device_support_op for unary OPs
* Use cast_data and typed_data struct to deduplicate casting code
* [CANN] Support ELU and CONV_TRANSPOSE_1D
* [CANN] Address review comments
* [CANN] Address review comments
* [CANN] Name adjustment
* [CANN] Remove lambda used in template
* [CANN] Use std::function instead of template
* [CANN] Modify the code according to the review comments
---------
Signed-off-by: noemotiovon <noemotiovon@gmail.com>
q4_k and q5_k had a lot of redundant global loads where the same 16B of
scale information is repeatedly loaded and decoded during each loop iteration.
This change restructures the loops to more explicitly iterate over whole
blocks in the outer loop (with unrolled inner loop) and to copy/decode the
scale data into shared memory once at the start of each outer loop. The copy
is pipelined so the scale load from global memory is relatively cheap.
This improves q4_k/q5_k model prompt processing performance by around 5-7%.
I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k
and hurt for q4_0.
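Purely illustrative (plain C++, not the real GLSL and not the real q4_k layout): the point is only that the per-block scales are decoded once, before the unrolled inner loop, instead of being re-loaded and re-decoded on every iteration.

```cpp
#include <cstdint>

// Hypothetical block layout: 8 packed scale bytes followed by 64 values per block.
struct block_stub {
    uint8_t packed_scales[8];
    float   vals[64];
};

static float dot_block(const block_stub &b, const float *x, float d) {
    float decoded[8];
    // Decode the scales once per block (the shader copies them to shared memory here).
    for (int s = 0; s < 8; ++s) {
        decoded[s] = d * (float) (b.packed_scales[s] & 0x3F);
    }
    float acc = 0.0f;
    for (int s = 0; s < 8; ++s) {
        for (int j = 0; j < 8; ++j) {   // unrolled in the real shader
            acc += decoded[s] * b.vals[s * 8 + j] * x[s * 8 + j];
        }
    }
    return acc;
}
```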
The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped
variants isn't used as often as it originally was (e.g. due to the padded_N
change), so I trimmed it down to offset some of the new complexity of the
semi-manual loop unrolling.
* ggml : FA supports F32 V
* graph : cast KV to F16 when the KV cache is not used
ggml-ci
* server : add test that exercises embeddings with FA enabled
ggml-ci
* add bf16 support
* use convert_from_bf16_cuda instead of convert_unary_cuda for f32
* revert 7ec5085
* move functionality into convert_unary with constexpr
* cpu: refactor SIMD mappings and vectorized op functions into separate files
* Fix warning for ggml_float to float
* Fix warnings
* cpu: move all the operations (except mul_mat) to a separate c++ file
* fix whitespace
* Update ggml/src/ggml-cpu/vec.h
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Fix PR comments - use GGML_UNUSED, use cassert in ops.cpp
* Reverse the order of import for ops.h and vec.h, to match what was present in ggml-cpu.c previously
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple
of the number of rows in the matrix. The KV dim is a multiple of the number of
columns for the aligned shader.
There seems to be a bubble when waking up from waitForFences, which costs a few
percent of performance and also increases variance in performance. This change
inserts an "almost_ready" fence when the graph is about 80% complete; we
waitForFences on the almost_ready fence and then spin (with _mm_pause) waiting
for the final fence to be signaled.
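A conceptual sketch of the wait strategy only, using std::atomic flags in place of the Vulkan fences:

```cpp
#include <atomic>
#include <thread>

static void wait_for_graph(const std::atomic<bool> &almost_ready,
                           const std::atomic<bool> &done) {
    // Cheap blocking-style wait until ~80% of the graph has executed
    // (stands in for vkWaitForFences on the almost_ready fence).
    while (!almost_ready.load(std::memory_order_acquire)) {
        std::this_thread::yield();
    }
    // Spin for the short remainder to avoid the wake-up bubble
    // (the real code uses _mm_pause while polling the final fence).
    while (!done.load(std::memory_order_acquire)) {
        // busy-wait
    }
}
```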
* Prefer vector flash decoding kernel for Gemma models
The vector flash decoding kernel was not being picked for models with head dimension 256; Gemma models are in this category.
Removing this limit improves e2e performance by up to 12% in generation-phase throughput for Gemma models.
* Update ggml/src/ggml-cuda/fattn.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* CUDA: Simplify and improve CUDA graphs through use of indirect copy pointers
Previously there was complexity in the CUDA graphs implementation due to
frequently changing parameters to the copy kernels associated with the K and V
cache pointers. This patch simplifies things by using indirection so that
those parameters no longer change frequently, avoiding the need for frequent
graph updates.
Fixes #12152
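A minimal host-side sketch of the indirection idea (plain C++, hypothetical names, not the actual CUDA-graphs code): the captured node stores a pointer to a pointer, so the node itself never changes when the KV cache address does.

```cpp
#include <cstddef>
#include <cstring>

// Node captured once into the "graph"; src_indirect is fixed even though
// *src_indirect (the actual KV cache address) changes between tokens.
struct copy_node {
    const char **src_indirect;
    char        *dst;
    std::size_t  size;
};

static void execute_node(const copy_node &n) {
    std::memcpy(n.dst, *n.src_indirect, n.size);   // dereference at execution time
}

static void update_source(const char **src_indirect, const char *new_src) {
    *src_indirect = new_src;   // the captured node stays valid; no graph update needed
}
```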
* Addressed comments
* fix HIP builds
* properly sync to stream
* removed ggml_cuda_cpy_fn_ptrs
* move stream sync before free
* guard to only use indirection with graphs
* style fixes
* check for errors
---------
Co-authored-by: slaren <slarengh@gmail.com>
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
When adjacent batches of Q share the same batches of K/V, batch them into
the same workgroup. For example, when:
dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1))
previously we would run 32 workgroups computing 1 result each, now we will
run 8 workgroups computing 4 results each.
This doesn't directly translate to better performance (at least when you have
>=32 SMs), but in a subsequent change I'll enable split_k which will scale much
better with 4x fewer workgroups.