* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain
On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).
Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().
Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.
Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix
---------
Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
* hexagon: fix tail corruption with rows sizes not multiple of 256
* hexagon: use different stride for repacking partial blocks
* hex-mm: update repack and kernels to avoid shuffles for full 256-element blocks
Previous commit changed the repacking to use even:odd (0:1,2:3,..) packing
instead of the original (0:128,1:129,...) packing in order to fix tail corruption.
Since the mm kernels already deal with partial tails we can use even:odd
packing only for the last block.
This avoid performance penalty of having to shuffle to zip the elements
in the common case.
* hex-mm: update rmpy x8 for better optimizations
* hex-mm: tighten supported MUL_MAT checks to avoid spurios failures
* hex-mm: use vzero to init accumulators
* hex-mm: properly call partial rmpy_x8
* Update build doc
* Add cgraph tensor output name to OV op name
* Update openvino build instructions
* Add initial NPU support
* draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU work
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all device work
* Refactor: clean, fix warning
* Update clang-format
* Statful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* temp. changes for mark decomp
* matmul in fp32
* mulmat input conversion fix
* mulmat type conversion update
* add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: Treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant type: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simpilfy translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate nop ReshapeReshape
* STYLE: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* WA for npu 1st token acc issue
* Apply EliminateZP only for npu
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: not add attention_size_swa for non-swa model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* kvcachefusion support
* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* change graph to 4d, support multi sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp
Hitting error while compiling on windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.
Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node. Moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* fallback naive run with accuracy issue
* NPU support llma-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* minor update due to ov 2025.4
* remove unused API GgmlOvDecoder::get_output_names()
* remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU remove Q6_K requantization
* NPU fix wrong model output shape
* NPU fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* only use remote tensor for kvcache
* only use remote tensor for kvcache for GPU
* FIX: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* code cleanup
* npu perf fix
* requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcode names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcode names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accept non-ov backend buffer
* remove changes in llama-graph.cpp
* stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
* hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions.
* fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* fix editorconfig-checker
* Update docs
* updated model link to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initializing non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* unify api calling
* Update utils.cpp
* When the dim is dynamic, throw an error, need to is stastic forst
* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* use the shape from infer request of output tensor create to avoid issue
* fix dynamic output shape issue
* fix issue for the unused node in tests
* Remove unused lock
* Add comment
* Update openvino docs
* update to OV release version 2026.0
* add ci ov-gpu self hosted runner
* fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
* Rewrite the model inputs finding logistic
* Put stateful shape handle in get input shape
* Put the iteration logistic in func
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slice the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp. fix for test requant errors
* Update to OV ggml-ci to low-perf
* ci : temporary disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and Update docs
---------
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande
winogrande_score : tokenizing selected tasks
winogrande_score : calculating winogrande score over selected tasks.
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 46
failed to decode the batch, n_batch = 2048, ret = 1
winogrande_score: llama_decode() failed
same for hellaswag:
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 99
failed to decode the batch, n_batch = 2048, ret = 1
hellaswag_score: llama_decode() failed
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : rever fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* server: reset kill-switch on client error
This avoids triggering a server kill switch.
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.
However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>