* vulkan: change gated_delta_net to shard a column across a subgroup
This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an
LLM to port the CUDA code to Vulkan, and guided to it to make various fixes to
work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of
subgroup to invocation id, using subgroupAdd optionally, etc.).
This fixes a perf regression from the transposing of the values in memory
(!20443).
* vulkan: Spread columns across fewer lanes to reduce the number of workgroups
* CANN: add BF16 support for core operators
Add BF16 (bfloat16) type support to the CANN backend for the following
operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and
OUT_PROD. This enables BF16 models to run on Ascend NPUs.
* CANN: skip NZ weight format for BF16 and add 310P compile guards
NZ weight format conversion does not support BF16 tensors, skip it
in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID
and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P
guards for all BF16 operator support since 310P does not support BF16.
* migrate(vtcm): unify VTCM management for HMX merge
- Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled,
hmx_dma, vtcm_scratch_size, exp2_table
- Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for
entire session instead of per-op acquire/release
- Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in
session-hold mode, delegate in per-op mode
- Add VTCM tail reservation for precompute tables (256KB, 64KB aligned)
in htp_iface_start under HTP_HAS_HMX
- Add HMX init/cleanup hooks in htp_iface_start/stop
- Add precompute table recovery in vtcm_acquire after VTCM preemption
- Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation)
* migrate(repack): replace x4x2 with HMX tile-permuted super-block format
- Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants)
- Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted
- Implement inverse repack for get_tensor debug verification
- Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2
- MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy)
- Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks
- Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate
- Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op
* migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops
Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op
migration. Reuses existing dma_queue_flush() for sync points instead
of adding redundant dma_queue_drain().
* migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking
Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for
cleaner separation between HVX and HMX code. Simplify HMX hardware
locking by replacing the two-level lock design (SHARED HAP lock +
custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock
on the existing vtcm_rctx, which already has HMX capability.
Key changes:
- Create htp/hmx/ subdirectory with all HMX infrastructure and ops
- Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx)
- Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed)
- Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals)
- Update main.c includes to use hmx/ prefix
- Clean up duplicate declarations from hmx-worker-pool.h
* migrate(hmx-infra): consolidate HMX infrastructure into htp_context
- Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops
- Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func)
- Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx
- Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release
- Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue
- Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers
- Delete upstream llama.cpp AGENTS.md (not applicable to fork)
* migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table
- Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable
- Remove table duplication loop in precompute-table.c
- Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c
- Fix table_size to 65536 (single 64 KB copy) in main.c
The exp2 lookup table is read-only; concurrent VTCM reads do not cause
bank conflicts, so duplicating the table wastes 192 KB of VTCM for no
benefit.
* migrate(dsp-main): add HMX priority dispatch in packet_callback
- Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types)
- Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16)
- Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32
- Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled
- Split RMS_NORM and SCALE into separate case blocks for independent dispatch
- All HMX wrappers guarded by #ifdef HTP_HAS_HMX
* migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels
Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build
integration: compile hmx-matmul-ops, hmx-flash-attn-ops,
hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels
with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain
unchanged.
* migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility
- hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions
- hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names
- hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings
- hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable
* hmx: set Q/O element type to fp16 for flash attention
The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element
should be false to match the actual data layout.
* hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback
Remove the v73+ HMX-specific super-block/tile-permuted weight format
and unify all architectures on the HVX x4x2 packed format. The DSP now
decides at runtime whether to use the HMX or HVX matmul path based on
dimension constraints (M%32, N%32, K%256 alignment), rather than the
host rejecting ops in supports_op. This simplifies the host repack
logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL
quantization support across host and DSP.
Key changes:
- Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile
permutation (ggml-hexagon.cpp, hmx-quants.h)
- Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL
- Force is_host=false so tensor copies go through format conversion
- Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h)
- Rewrite DSP dequantizers to work directly on x4x2 layout
(hmx-matmul-ops.c)
- Fix mxclracc.hf placement: clear per output tile, not once globally
- Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c)
- Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks
- Add VTCM allocation overflow asserts
- Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt)
* Enhance HMX debugging capabilities with new tile dumping functions
- Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data.
- Updated hmx_dump_tile_rows to provide raw memory output for debugging.
- Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting.
- Refined existing macros for dumping HVX vector values to streamline debugging output.
These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations.
* OK for small mat mul
* hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers
The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix
DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing
truncated transfers and NaN results. Fix by using 2D DMA (per-row stride ×
n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both
x4x2 and fp16 weight paths.
Also includes:
- Use standard vlut16 instead of _nomatch variant for dequantization
- Add per-tile vscatter drain barrier for correctness
- Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default)
* hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline
Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over
the generic unary path. Re-enable DMA pipeline mode for QK matmul.
* hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow
All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride)
are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the
non-pipeline path by switching from 1D to 2D DMA, but the pipeline path
(3 call sites) was left unchanged and still used 1D DMA with
chunk_size = n_cols * row_stride as roiwidth, which overflows for any
practical matrix size when the pipeline is active.
Add a local hmx_dma_push_safe() helper that transparently handles overflow:
- Fast path (zero overhead): all params fit in 16 bits -> direct call.
- Contiguous block: reshapes into a single 2D descriptor with sub_width
that fits in 16 bits, preserving async DMA behavior.
- Stride overflow: row-by-row fallback for future large-k models where
per-row stride itself exceeds 65535.
Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use
the safe helper, including the 3 pipeline sites (1D -> 2D fix), the
FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the
output-stationary activation fetch.
* hexagon: multithread activation/output transfer and add HMX matmul fallback
- Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with
transfer_activation_chunk_multithread across all HMX matmul paths
- Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32
output store, following the same worker pool pattern
- Rename transfer_activation_chunk_no_prefetch back to
transfer_activation_chunk_fp32_to_fp16 and clean up stale comments
- Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error
* [todo]: dynamic alloc vtcm, cause prefill regression.
* hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults
Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges.
* hexagon: split unaligned-M HMX matmul into HMX+HVX phases
- keep HMX for the 32-aligned head rows and process tail rows with HVX
- force re-quantization for HVX tail after HMX phase to avoid stale VTCM state
- preserve fallback behavior when N is unaligned or no aligned M rows exist
* hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces
Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous
K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead.
The dequantize loop now takes the batch-4 path when 4 aligned K-tiles
are available within the same column tile, falling back to the original
single-tile path otherwise.
Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no
longer needed.
* hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag
Promote DSP response error from log to GGML_ABORT for fail-fast
behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX
matmul so the HVX path correctly re-quantizes activations.
* hexagon: support batch matmul. This fix perplexity issue
The problem comes from Grouped-Query Attention(GQA). Strides between batches are not well respected
TODO: optimize batch matmul to reuse weights between batches.
* hexagon: reuse weights in fp16 batch matmul
* hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test
* hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options
* hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL
* hexagon: remove the unnecessary changes not related to HMX
* hexagon: bypass HMX by default
* hexagon: add upstream repo link to htp-ops-lib ported file headers
* hexagon: restore host buffer support
* hexagon: add HMX=1 option for the adb scripts
* hex-hmx: improve DMA pipelining
* hex-hmx: further improvements to dma pipelining
* hex-hmx: minor cleanup
* hex-hmx: move hmx lock out of inner loops/calls
* hex-hmx: remove unnecessary state and wrappers
* hex-hmx: remove hmx dir and unify f32 to f16 conversions
* hex-hmx: further unify hvx conversions
* hex-hmx: revert f16 converter to the original for now
* hex-hmx: minor cleanup for f16 to f32 converter
* hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformated related code
* hex-dma: move chanied dma push into hex-dma.h header and update hmx-mm
* hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned
* hex-mm: use hvx_vec_splat_f16 in the hmx code
* hex-mm: use VLEN and HTP types in hmx-code
* hex-mm: remove duplicate QK and defs
* hexagon: pre-shuffle quants before vlut16
* hexagon: enable HMX by default
* hex-mm: code indent fixes for hmx-matmul
* hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm
* hex-mm: more formatting fixes
* hex-mm: minor naming updates in hmx code
* hex-mm: remove leftover from rebase conflict
* Fix the incorrect indents
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
RotaryPositionEmbedding on CANN fails when src and dst share the same
non-contiguous buffer (inplace + view), because the operator overwrites
source data before it is fully read.
Add a branch that detects this case and uses contiguous temporary
buffers: copy src to temp, run ROPE into another temp, then copy back
to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1,
inplace=1).
Signed-off-by: noemotiovon <757486878@qq.com>
- Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by
padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2,
then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp).
- Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of
sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with
48 heads); fixes buffer overflow and large numerical errors in those cases.
Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.
- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
the worker pool, supporting f32 and f16 types. The kernel parallelizes
across output rows and uses memcpy for each tile.
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* kleidiai : fix MUL_MAT support for batched (3D) inputs
The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.
This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.
Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.
Fixes#20608
* Kleidiai support_ops should only return true for 3D inputs, not also 4D
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* kleidiai: add data type check to get_tensor_traits
* Added check for F16 data type into get_tensor_traits path with input data
not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
updated kleidiai.cpp file as per suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain
On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).
Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().
Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.
Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix
---------
Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* hexagon: fix tail corruption with rows sizes not multiple of 256
* hexagon: use different stride for repacking partial blocks
* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks
Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails we can use even:odd
packing only for the last block.
This avoid performance penalty of having to shuffle to zip the elements
in the common case.
* hex-mm: update rmpy x8 for better optimizations
* hex-mm: tighten supported MUL_MAT checks to avoid spurios failures
* hex-mm: use vzero to init accumulators
* hex-mm: properly call partial rmpy_x8
* Update build doc
* Add cgraph tensor output name to OV op name
* Update openvino build instructions
* Add initial NPU support
* draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU work
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all device work
* Refactor: clean, fix warning
* Update clang-format
* Statful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* temp. changes for mark decomp
* matmul in fp32
* mulmat input conversion fix
* mulmat type conversion update
* add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: Treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant type: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simpilfy translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate nop ReshapeReshape
* STYLE: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* WA for npu 1st token acc issue
* Apply EliminateZP only for npu
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: not add attention_size_swa for non-swa model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* kvcachefusion support
* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* change graph to 4d, support multi sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp
Hitting error while compiling on windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.
Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node. Moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* fallback naive run with accuracy issue
* NPU support llma-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* minor update due to ov 2025.4
* remove unused API GgmlOvDecoder::get_output_names()
* remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU remove Q6_K requantization
* NPU fix wrong model output shape
* NPU fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* only use remote tensor for kvcache
* only use remote tensor for kvcache for GPU
* FIX: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* code cleanup
* npu perf fix
* requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcode names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcode names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accept non-ov backend buffer
* remove changes in llama-graph.cpp
* stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
* hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions.
* fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* fix editorconfig-checker
* Update docs
* updated model link to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initializing non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* unify api calling
* Update utils.cpp
* When the dim is dynamic, throw an error, need to is stastic forst
* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* use the shape from infer request of output tensor create to avoid issue
* fix dynamic output shape issue
* fix issue for the unused node in tests
* Remove unused lock
* Add comment
* Update openvino docs
* update to OV release version 2026.0
* add ci ov-gpu self hosted runner
* fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
* Rewrite the model inputs finding logistic
* Put stateful shape handle in get input shape
* Put the iteration logistic in func
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slice the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp. fix for test requant errors
* Update to OV ggml-ci to low-perf
* ci : temporary disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and Update docs
---------
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : rever fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>