* hexagon: introduce op request batching and rewrite buffer managment
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grain per tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
* hexagon: limit l2 flushes to 1MB which covers l2 cache
* hex-opreq: limit cache flush to 4MB
Looks like 4MB cont. vitual space should cover the 1MB cache.
* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap all buffers
* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now
Nothing new just formalizing the repack / qyn.quant types we've been using.
* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers
* hex-buf: add support for allocating shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: beter naming for opbatch related classes and code
* hex-opbatch: reuse batched tensors with same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for vmem limit for op batching
* hex-opbatch: update htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
* hex-opbatch: fixed src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
* hex-mm: minor fix vtcm and dma handling in matmul
cleaning up some left-overs from merges
* hex-opbatch: allocate extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
* hex-mm: properly handle hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond 128KB threshold
* hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex hex backend will reject it.
* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
* hexagon: improved vtcm acquision to remove inter-op overhead
Fully compatible with QNN-HTP coex
* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.
* hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
* hex-opbatch: cleanup naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
* hexagon: fixed editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
* ggml: backend-agnostic tensor parallelism
* support for GPT-OSS, Qwen 3 MoE
* partial Vulkan fix
* add support for 4/8 GPUs
* unconditional peer access
* re-use buffers + ggml contexts
* fix output pattern
* NCCL support
* GGML: HIP: add RCCL support
* Remove shfl and AllReduce from backend interface
* move allocation workaround out of ggml-alloc.c
* 2d tensor set/get support
* Fix the seg fault without NCCL
* Apply suggestion from JohannesGaessler
* support for tensor dims % n_devs != 0
* fix view_offs scaling
* arbitrary num. of GPUs/tensor split
* fix compilation
* better granularity estimate
* Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA.
Fix compilation errors.
* partial Qwen 3 Next support
* Fix qwen3 30b (#8)
* Fix crash with Qwen-30B-A3B Q4_0
Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation.
* Decide block size based on tensor quantization type
* Fix crashes due to KV cache serialization (#9)
KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset.
* metal : fix build (#7)
* static memory allocations, fix usage count
* fix tensor granularity
* more even memory distribution
* use BF16 for allreduce
* rebase fixup
* better error message for unsupported architectures
* Fix device mismatch during scatter of allReduce. (#11)
There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies
* Enable the previous allreduce implementation. It is better in both perf and stability (#12)
* delay AllReduce for Moe for less I/O
* build : clean-up compile warnings
* backend : move most of the meta backend API to ggml-backend-impl.h
* cont : hide unused public API in the implementation
* llama : use llama_device + remove ggml_backend_dev_is_meta()
* ggml-backend : remove unused alloc include
* minor : remove regex include
* ggml : introduce ggml-ext.h for staging new APIs
* rebase fixup
* fix tests
* llama : more robust logic for determining Meta devices (#16)
* llama : more robust logic for determining Meta devices
* cont : fix devs size check
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* cont : fix log type
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* disable roundtrip for meta backend
* fix arch selection
* Qwen 3.5 support
* fix Gemma 4 MoE
* fix OpenVino, SYCL
* fix test-llama-archs for CPU-only builds
* Fix Qwen 3.5 MoE
* disable meta backend tests for WebGPU
* tests : filter CPU-based devices from the Meta backend tests (#17)
* meta : formatting, naming, indentation (#18)
* formatting : llama-model.cpp
* formatting : ggml-ext.h
* formatting : ggml-backend-meta.cpp
* meta : add TODO
* add documentation
* better error messages
* fix GPT-OSS
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Gaurav Garg <gaugarg@nvidia.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* sycl : add flash-attn support for head size 512
This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512.
Changes:
- Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels.
- Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256).
- Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`.
- Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization.
- Added necessary template instances for the new 512 head size across various quantization types.
* remove defunct mxfp4 reorder from setting buffer type
* fix: free ctx_copy in ggml_opt_free to plug per-training-session leak
ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every
time the allocated graph shape changes. The last ctx_copy from the
final ggml_opt_alloc call survives until ggml_opt_free is invoked,
but ggml_opt_free was only freeing ctx_static and ctx_cpu, never
ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch
context — ~900 KB for a typical GNN training session in
sindarin-pkg-tensor, surfaced via AddressSanitizer.
ctx_copy is nullptr-initialized and ggml_free() handles NULL safely,
so the new release is guard-free.
* Update ggml/src/ggml-opt.cpp
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: realorko <realorko@nowhere.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* initial Q1_0 Metal backend
* tuning q1_0 metal kernels
* add Q1_0 to test-backend-ops
* add Q1_0<->F32 copy test
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ds_read_b128 for q4_0 and q4_1 mmq kernels
Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both.
* Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation
* Explicit for loop in mmq, renamed vec into tmp
* Fixed max_cpy usage in the loading loop
* Fixed typo in q4_1 kernel
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Renoved trailing white line 500
* Update mmq.cuh removed other whitelines
* Remove trailing whitespaces
---------
Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: iacopPBK <iacop@deneb.com>
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* Add parameters for different browsers in-flight submissions
* Update handling of batch size too
* Throttle ios as much as possible
* Increase timeout for llvm-pipe testing
Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing.
On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.
The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.
AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
Fixes: #21517
* Write an optimized flash_attn_stream_k_fixup kernel
Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst.
Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst
* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- MUL_MAT_ID op fallback to CPU backend if total experts > 32
- Point ZenDNN lib to latest bits ZenDNN-2026-WW13
* ggml-zendnn : add braces to sgemm failure condition for consistency
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.
ref: #21265
ref: #20315
* hexagon : add cumsum op support
* hexagon: enable dma for cumsum op
* Fix line-ending
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* Introduced NVFP4 generic MMQ kernel
* Added extra FP8 guard, hope to solve ci HIP failure
* Rename tiles and use HIP_FP8_AVAILABLE
* Removed remaning FP8 straggler and added const int
* Const
* Removed DECL_MMQ_CASE artifact
* Removed newline
* Removed space after else
* Changed HIP FP8 NVFP4 conversion gate
* Added new line to bottom of mmq.cu 270
* Removed extra spaces
* Removed single space in front of else on line 814
* Added NVFP4 to generate cu script so HIP can see it, further tightened logic
* Include generated mmq-instance-nvfp4.cu
* Added NVFP4 mmq to HIP Check ignore list
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Changed to Q3_K tile to read MMQ_MMA_TILE_X_K_NVFP4 in tile assert
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/mmq.cuh
Added function name ending for end if
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Added function names to closing endif
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
The conditions cc == GGML_CUDA_CC_VOLTA || cc >= GGML_CUDA_CC_ADA_LOVELACE and cc >= GGML_CUDA_CC_TURING match all non-nvidia devices. This causes us to attempt to launch the kernel for batch sizes with larger configurations than our launch bounds on HIP devices. This pr fixes the conditionals in get_mmvq_mmid_max_batch.
Fixes#21191
* flash attention support for head dimension 512 added
* FA D=512 - match 576 configs, limit ncols2, revert vec cap
* fix HIP tile kernel build for D=512
* fix HIP tile kernel occupancy for D=512 on AMD
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix tile FA compilation
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* port cpy pipeline to shader lib with JIT compilation
* port glu pipeline to shader lib with JIT compilation
* port rope pipeline to shader lib with JIT compilation
* port soft_max pipeline to shader lib with JIT compilation
* removed unused functions from embed_wgsl.py which were used for
old AOT template expansion
* CANN: fix multi-thread set_tensor race conditions
When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:
1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device.
Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.
* CANN: fix L2_NORM ignoring eps parameter
The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching
When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.
Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().
* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison
The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.
Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
so the cache can distinguish tensors with different types but
identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
SCALE/UNARY/GLU/ROPE/POOL_2D