* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md
actions/labeler@v6 removed the `all:` / `any:` composition keys.
The `server/webui` and `server` entries used `all:` to combine
`any-glob-to-any-file` with negated `all-globs-to-all-files`,
which now errors on every PR with:
Unknown config options were under "changed-files": all
Flatten both entries to a single `any-glob-to-any-file`. PRs
touching both webui and other server files will now receive both
labels instead of only `server/webui`.
Co-authored-by: Marxist-Leninist <noreply@users.noreply.github.com>
* experimenting CI
* Experimenting CI fix for MinGW
* experimenting CI on Windows
* modified script for integration with VisualStudio
* added proxy handling
* adding python version for Windows execution
* fix iterator::end() dereference
* fixed proxy handling
* Fix errors occurring on Windows
* fixed ci script
* Reverted to master
* Stripping test items to simplify Windows test
* adjusting script for windows testing
* Changed shell
* Fixed shell
* Fixed shell
* Fix CI setting
* Fix CI setting
* Fix CI setting
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* experimenting fix for unit test error
* Changed to use BUILD_LOW_PERF to skip python tests
* Fix CI
* Added option to specify Ninja generator
* Reverted proxy related changes
* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files
* ci : rename label AMD CPU to AMD ZenDNN in labeler config
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete LLVM workaround since ROCm 7.2.1 has fix for ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
* Obtain source tag name from git tag
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* cann: update docker images to 8.5.0
- bump CANN base image from 8.3.rc2 to 8.5.0
- bump ASCEND_VERSION from 8.1.RC1.alpha001 to 8.5.0
Move to newer stable releases.
* cann: update CANN.md
* Update CANN.md to include BF16 support
Added BF16 support information to the CANN documentation and corrected formatting for the installation instructions.
* Fix formatting issues in CANN.md
Fix 234: Trailing whitespace
* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interleaved with another from a different process
* Return warning ignore
* obey PEP 8: two spaces before inline comments
* add # noqa: NP100 for other prints too
* Add script changes to cause autotrigger
* Remove make dependency
* Added option to specify Ninja generator
* use ninja-build as default for several CI
* Revert "use ninja-build as default for several CI"
This reverts commit f552c4559b.
* changed to use plain strings rather than arrays
* Enabled ninja build by default for experimentation
* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Enabled ninja build by default on self-hosted envs for experimentation
* ci: revert generator to ninja instead of ninja multi-config
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: install ninja-build for self-hosted workflows
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: revert ninja from self-hosted runners
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: missed one self-hosted step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ci: fix windows ci errors from an erroneous revert
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Added explicit build types for Ninja
Also reverted some needless changes
* ci: use ninja multi-config for vulkan-x64 build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* added time command to measure build time
* Keeping some configs to use Ninja which show improvement
* minor fix based on review
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* ci: rm `time` from custom containers
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* Update build doc
* Add cgraph tensor output name to OV op name
* Update openvino build instructions
* Add initial NPU support
* draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU work
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all device work
* Refactor: clean, fix warning
* Update clang-format
* Stateful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* temp. changes for mark decomp
* matmul in fp32
* mulmat input conversion fix
* mulmat type conversion update
* add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fail in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: Treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant type: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simplify translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate nop ReshapeReshape
* STYLE: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* WA for npu 1st token acc issue
* Apply EliminateZP only for npu
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: not add attention_size_swa for non-swa model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* kvcachefusion support
* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* change graph to 4d, support multi sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp
Hitting error while compiling on windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.
Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
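The proposed fix above can be sketched as a small wrapper. This is a minimal illustration, not the actual code in ggml-decoder.cpp: the helper names `portable_unsetenv` and `portable_setenv` are hypothetical, and the sketch assumes an MSVC-style CRT where `_putenv_s(name, "")` removes the variable.

```cpp
#include <cstdlib>

// Hypothetical helper (the name is illustrative, not taken from ggml-decoder.cpp).
// Removes `name` from the process environment; returns 0 on success.
static int portable_unsetenv(const char * name) {
#if defined(_WIN32)
    // The Windows CRT has no unsetenv(); setting a variable to an empty
    // string with _putenv_s() removes it from the environment.
    return _putenv_s(name, "");
#else
    // POSIX systems provide unsetenv() directly.
    return unsetenv(name);
#endif
}

// Hypothetical counterpart, included only so the sketch can be exercised
// on both platforms.
static int portable_setenv(const char * name, const char * value) {
#if defined(_WIN32)
    return _putenv_s(name, value);
#else
    return setenv(name, value, /*overwrite=*/1);
#endif
}
```

After `portable_setenv("SOME_VAR", "1")`, a call to `portable_unsetenv("SOME_VAR")` should make `std::getenv("SOME_VAR")` return a null pointer on either platform.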
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node. Moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* fallback naive run with accuracy issue
* NPU support llama-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* minor update due to ov 2025.4
* remove unused API GgmlOvDecoder::get_output_names()
* remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU remove Q6_K requantization
* NPU fix wrong model output shape
* NPU fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* only use remote tensor for kvcache
* only use remote tensor for kvcache for GPU
* FIX: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* code cleanup
* npu perf fix
* requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcode names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcode names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accepts non-ov backend buffer
* remove changes in llama-graph.cpp
* stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
* hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions.
* fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* fix editorconfig-checker
* Update docs
* updated model link to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initializing non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* unify api calling
* Update utils.cpp
* When the dim is dynamic, throw an error; it needs to be made static first
* Add interface compute_model_outputs(), which gets the model outputs by computing node use counts and status in the cgraph, avoiding the use of a flag
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* use the shape from the infer request when creating the output tensor to avoid issue
* fix dynamic output shape issue
* fix issue for the unused node in tests
* Remove unused lock
* Add comment
* Update openvino docs
* update to OV release version 2026.0
* add ci ov-gpu self hosted runner
* fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
* Rewrite the model inputs finding logic
* Put stateful shape handling in get input shape
* Put the iteration logic in a function
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slices the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp. fix for test requant errors
* Update OV ggml-ci to low-perf
* ci : temporarily disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and Update docs
---------
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments