Commit Graph

488 Commits

Author SHA1 Message Date
Georgi Gerganov b31b30f31d
ai : do not run bash commands in the prompt (#20810) 2026-03-20 19:06:33 +02:00
Georgi Gerganov 464fd0e71f
ai : update find-related action (#20790)
* ai : update "related issues" prompt

* cont

* cont

* cont
2026-03-20 10:28:14 +02:00
Georgi Gerganov 6c72646a61
ci : improve action for duplicate issue (#20772)
* ci : show thinking traces of the agent

* cont : increase thinking

* cont : remove agent files

* cont : move the model selection to the provider
2026-03-19 21:11:53 +02:00
Georgi Gerganov 900efd531d
ci : clarify gh command for viewing issues (#20766) 2026-03-19 18:43:54 +02:00
uvos b49d8b8757
ci : add hip quality check (#20430)
* CI: add hip quality check

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/hip-quality-check.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update scripts/hip/gcn-cdna-vgpr-check.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Revert "Update .github/workflows/hip-quality-check.yml"

This reverts commit efa0bfcdb0.

* scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs

* scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list

* Bump ccache version

* Add mssing seperators to list

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-19 17:05:44 +01:00
Georgi Gerganov f071ce67c9
ci : add action for finding duplicate issues (#20756)
* ci : add action for finding duplicates issues

* cont : gen info

* cont : formatting

* cont : fix

* cont : instructions

* cont : bump checkout action

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-19 16:17:37 +02:00
Sigbjørn Skjæret ab0bb93748
ci : bump ccache [no ci] (#20679)
* bump ccache

* forgotten

* disable for s390x

* disable also for ppc64le
2026-03-17 14:54:31 +01:00
Georgi Gerganov 45172df4d6
ci : disable AMX jobs (#20654)
[no ci]
2026-03-16 22:38:59 +02:00
Sigbjørn Skjæret 0ed992973b
ci : update labeler (#20629) 2026-03-16 20:24:20 +01:00
Sigbjørn Skjæret b91d7dfe5b
ci : only save openvino caches on github-hosted master (#20593)
* only save openvino ccache on master

* disable toolkit cache if self-hosted

* only cache on github-hosted runners

* remove toolkit cache [no ci]
2026-03-15 18:58:13 +01:00
Georgi Gerganov 9cd4ebcfb1
ci : split build.yml + server.yml (#20546)
* ci : split build.yml

* cont : split server.yml

* cont : reduce paths

* cont : split build-android.yml + update paths

* ci : make msys workflows manual (#20588)

* ci : make cross-build workflows manual (#20585)

* cont : fix release paths

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-15 15:11:17 +02:00
Georgi Gerganov b4768955c4
ci : move self-hosted workflows to separate files (#20540) 2026-03-14 23:15:35 +02:00
Eve 3a6f059909
ci : try to optimize some jobs (#20521)
* force arm version to test

* run on either x86 or arm if we can help it, this only works for runs without ccache

* readd other jobs

* remove ccache
2026-03-14 20:27:52 +01:00
Georgi Gerganov 9f774e45ee
ci : reduce webgpu tests timeout to 900s (#20538)
[no ci]
2026-03-14 17:08:26 +02:00
Zijun Yu 9789c4ecdc
ggml : add OpenVINO backend (#15307)
* Update build doc

* Add cgraph tensor output name to OV op name

* Update openvino build instructions

* Add initial NPU support

* draft NPU support version 2: prefill + kvcache

* NPU support version 2: prefill + kvcache

* Change due to ggml cgraph changes, not correct yet

* Change due to ggml cgraph changes, llama-3.2 CPU work

* Add AMD64 to CMakeLists

* Change due to ggml cgraph changes, all device work

* Refactor: clean, fix warning

* Update clang-format

* Statful transformation for CPU GPU

* Add SwiGLU

* Fuse to SDPA

* Replace Concat with Broadcast in MulMat for GQA

* Pull out indices creation for kv cache update

* Refactor: remove past_token_len from extra_inputs

* Fix Phi3 SwiGLU and SoftMax

* Pull out sin cos from rope

* Reduce memory: free ov weights node after graph conversion

* Fix CPY due to cgraph change

* Added OpenVINO CI/CD. Updated docs

* Fix llama-cli

* Fix Phi3 ROPE; Add test-backend-ops

* Fix NPU

* Fix llama-bench; Clang-format

* Fix llama-perplexity

* temp. changes for mark decomp

* matmul in fp32

* mulmat input conversion fix

* mulmat type conversion update

* add mark decomp pass

* Revert changes in fuse_to_sdpa

* Update build.md

* Fix test-backend-ops

* Skip test-thread-safety; Run ctest only in ci/run.sh

* Use CiD for NPU

* Optimize tensor conversion, improve TTFT

* Support op SET_ROWS

* Fix NPU

* Remove CPY

* Fix test-backend-ops

* Minor updates for raising PR

* Perf: RMS fused to OV internal RMS op

* Fix after rebasing

- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run

* Change openvino device_type to GPU; Enable flash_attn

* Update supports_buft and supports_op for quantized models

* Add quant weight conversion functions from genai gguf reader

* Quant models run with accuracy issue

* Fix accuracy: disable cpu_repack

* Fix CI; Disable test-backend-ops

* Fix Q4_1

* Fix test-backend-ops: Treat quantized tensors as weights

* Add NPU Q4_0 support

* NPU perf: eliminate zp

* Dequantize q4_1 q4_k q6_k for NPU

* Add custom quant type: q8_1_c, q4_0_128

* Set m_is_static=false as default in decoder

* Simpilfy translation of get_rows

* Fix after rebasing

* Improve debug util; Eliminate nop ReshapeReshape

* STYLE: make get_types_to_requant a function

* Support BF16 model

* Fix NPU compile

* WA for npu 1st token acc issue

* Apply EliminateZP only for npu

* Add GeGLU

* Fix Hunyuan

* Support iSWA

* Fix NPU accuracy

* Fix ROPE accuracy when freq_scale != 1

* Minor: not add attention_size_swa for non-swa model

* Minor refactor

* Add Q5_K to support phi-3-q4_k_m

* Requantize Q6_K (gs16) to gs32 on GPU

* Fix after rebasing

* Always apply Eliminate_ZP to fix GPU compile issue on some platforms

* kvcachefusion support

* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

* Fix for Phi3

* Fix llama-cli (need to run with --no-warmup)

* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working

* fix after rebasing

* Fix llama-3-8b and phi3-mini q4_0 NPU

* Update to OV-2025.3 and CMakeLists.txt

* Add OV CI cache

* Apply CISC review and update CI to OV2025.3

* Update CI to run OV dep install before build

* Update OV dockerfile to use OV2025.3 and update build docs

* Style: use switch in supports_ops

* Style: middle ptr and ref align, omit optional struct keyword

* NPU Unify PD (#14)

* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims

* Clean placeholders in ggml-openvino.cpp

* NPU unify PD (handled internally)

* change graph to 4d, support multi sequences

* Fix llama-bench

* Fix NPU

* Update ggml-decoder.cpp

Hitting error while compiling on windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.

Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Remove the second decoder for node. Moving the function into the model decoder

* Fix error for naive

* NPU prefill chunking

* NPU fix llama-bench

* fallback naive run with accuracy issue

* NPU support llma-perplexity -b 512 --no-warmup

* Refactor: split ov_graph_compute for dynamic and static

* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)

* minor update due to ov 2025.4

* remove unused API GgmlOvDecoder::get_output_names()

* remove unused API get_output_shape(const std::string & name)

* Modified API GgmlOvDecoder::get_output_type(const std::string & name)

* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)

* Removed API get_output_ggml_tensor(const std::string & name)

* Removed API m_outputs

* Removed m_output_names

* Removed API GgmlOvDecoder::get_input_names()

* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)

* Removed API get_input_type

* Removed API get_input_type

* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)

* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)

* Fix error for decoder cache

* Reuse cached decoder

* GPU remove Q6_K requantization

* NPU fix wrong model output shape

* NPU fix q4 perf regression

* Remove unused variable nodes

* Fix decoder can_reuse for llama-bench

* Update build.md for Windows

* backend buffer: allocate on host

* Use shared_buffer for GPU NPU; Refactor

* Add ov_backend_host_buffer; Use cached remote context

* Put kvcache on GPU

* Use ggml_aligned_malloc

* only use remote tensor for kvcache

* only use remote tensor for kvcache for GPU

* FIX: use remote tensor from singleton

* Update build.md to include OpenCL

* NPU always requant to q4_0_128

* Optimize symmetric quant weight extraction: use single zp

* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant

* Update build.md

* Support -ctk f32

* Initial stateful graph support

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* code cleanup

* npu perf fix

* requant to f16 for Q6 embed on NPU

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp

* Create OPENVINO.md in llama.cpp backend docs

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update build.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* kq_mask naming fix

* Syntax correction for workflows build file

* Change ov backend buffer is_host to false

* Fix llama-bench -p -n where p<=256

* Fix --direct-io 0

* Don't put kvcache on GPU in stateful mode

* Remove hardcode names

* Fix stateful shapes

* Simplification for stateful and update output shape processing

* Remove hardcode names

* Avoid re-compilation in llama-bench

* Extract zp directly instead of bias

* Refactor weight tensor processing

* create_weight_node accept non-ov backend buffer

* remove changes in llama-graph.cpp

* stateful masking fix (#38)

Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.

* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add

* hardcoded name handling for rope_freqs.weight

* Suppress logging and add error handling to allow test-backend-ops to complete

* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases

* Use bias instead of zp in test-backend-ops

* Update OV in CI, Add OV CI Tests in GH Actions

* Temp fix for multithreading bug

* Update OV CI, fix review suggestions.

* fix editorconfig-checker, update docs

* Fix tabs to spaces for editorconfig-checker

* fix editorconfig-checker

* Update docs

* updated model link to be GGUF model links

* Remove GGML_CPU_REPACK=OFF

* Skip permuted ADD and MUL

* Removed static variables from utils.cpp

* Removed initializing non-existing variable

* Remove unused structs

* Fix test-backend-ops for OV GPU

* unify api calling

* Update utils.cpp

* When the dim is dynamic, throw an error, need to is stastic forst

* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using

* No need to return

* Fix test-backend-ops for OV GPU LNL

* Fix test-thread-safety

* use the shape from infer request of output tensor create to avoid issue

* fix dynamic output shape  issue

* fix issue for the unused node in tests

* Remove unused lock

* Add comment

* Update openvino docs

* update to OV release version 2026.0

* add ci ov-gpu self hosted runner

* fix editorconfig

* Fix perplexity

* Rewrite the model inputs finding mechanism  (#54)

* Rewrite the model inputs finding logistic

* Put stateful shape handle in get input shape

* Put the iteration logistic in func

* Added ggml-ci-intel-openvino-gpu and doc update

* .hpp files converted to .h

* fix ggml-ci-x64-intel-openvino-gpu

* Fix for stateful execution bug in llama-bench

* Minor updates after stateful llama-bench fix

* Update ggml/src/ggml-openvino/utils.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* Remove multiple get_shape calls

* Bring back mutex into compute

* Fix VIEW op, which slice the input node

* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access

* Temp. fix for test requant errors

* Update to OV ggml-ci to low-perf

* ci : temporary disable "test-llama-archs"

* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag

* docs : update url

* Fix OV link in docker and Update docs

---------

Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-14 07:56:55 +02:00
Masato Nakasaka 4cc6eb158c
ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154) 2026-03-12 06:43:22 +01:00
Ruben Ortlam 182acfe5c5
ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294) 2026-03-11 14:12:29 +01:00
Johannes Gäßler a976ff081b
llama: end-to-end tests (#19802)
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
2026-03-08 12:30:21 +01:00
Daniel Bevenius 8d3b962f47
ci : use ubuntu-latest for gguf-publish workflow (#19951)
This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, which was updated in Commit
142cbe2ac6 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").

The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem not available in the
ubuntu-slim runner. This is currently causing an error in the workflow
and preventing the gguf-publish workflow from running successfully.
Today was the the first time since the original change (I think) that
publish task has been run which may be why the issue was not noticed
before.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
2026-02-27 14:42:24 +01:00
Slobodan Josic 3af34b9ff5
ci : update the ROCm/HIP toolchain versions [no ci] (#19891)
* [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1

* revert container version

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-25 15:54:49 +01:00
Mario Limonciello 8fdf269dad
ci : update Windows ROCm build to 26.Q1 [no ci] (#19810)
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
2026-02-25 12:30:19 +01:00
Sigbjørn Skjæret 9f0684f003
ci : fix rocm archive name [no ci] (#19808) 2026-02-22 16:14:37 +01:00
Sigbjørn Skjæret e877ad8bd9
ci : fix rocm release path [no ci] (#19784) 2026-02-22 08:07:46 +01:00
Mario Limonciello f75c4e8bf5
Add a build target to generate ROCm artifacts using ROCm 7.2 (#19433)
This builds the following targets:
 * gfx1151
 * gfx1150
 * gfx1200
 * gfx1201
 * gfx1100
 * gfx1101
 * gfx1030
 * gfx908
 * gfx90a
 * gfx942
2026-02-21 19:56:26 +01:00
Sigbjørn Skjæret e48349a49d
ci : bump komac version (#19682) 2026-02-17 09:30:31 +01:00
Mengsheng Wu 94a602db66
github : add missing backends to issue templates (#19603) 2026-02-14 00:56:53 +01:00
Georgi Gerganov 81ddc60cb3
ci : add metal server workflows (#19293)
* ci : add metal server workflows

* cont : try fix python init

* cont : move to a separate workflow that runs only on master

* cont : fix num jobs

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 15:09:30 +02:00
Sigbjørn Skjæret 9a5f57795c
ci : remove server job from webui and move slow test (#19424)
* remove server job from webui and move slow test

* use pip-install option
2026-02-08 01:20:00 +01:00
Georgi Gerganov 96441c955e
ci : use -j param correctly when building with sanitizers (#19411)
* ci : use less jobs when building with sanitizers

* cont : fix nproc

* cont : fix the fix

* cont : simplify
2026-02-07 23:50:47 +01:00
Jeff Bolz f9bd518a6b
vulkan: make FA mask/softcap enables spec constants (#19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-06 08:49:58 +01:00
Georgi Gerganov 423bee462b
ci : fix sanitize workflow to enable ggml sanitizers too (#19323) 2026-02-04 15:12:03 +02:00
Georgi Gerganov 6a9bf2f788
ci : add sanitizer runs for server (#19291) 2026-02-03 22:41:20 +02:00
Aman Gupta 15818ac44c
ci: add test-backend-ops test for CPU (#19268) 2026-02-02 22:40:28 +08:00
Zheyuan Chen bd90fc74c3
ggml-webgpu: improve flastAttention performance by software pipelining (#19151)
* webgpu : pipeline flash_attn Q/K loads in WGSL

* ggml-webgpu: unroll Q*K accumlation inner loop

* ggml-webgpu: vectorization

* ggml-webgpu: unrolling

* ggml-webgpu: remove redundant unrolling

* ggml-webgpu: restore the config

* ggml-webgpu: remove redundant comments

* ggml-webgpu: formatting

* ggml-webgpu: formatting and remove vectorization

* ggml-webgpu: remove unnecessary constants

* ggml-webgpu: change QKV buffer to read_write to pass validation

* ggml-webgpu: add explanation for the additional bracket around Q K accumulate

* Indentation and for -> if for tail

* Kick off CI on wgsl only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-29 14:05:30 -08:00
Todor Boinovski ce38a4db47
hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150)
* hexagon: updates to enable offloading to HTP on WoS

* Update windows.md

* Update windows.md

* hexagon: enable -O3 optimizations

* hexagon: move all _WINDOWS conditional compilation to _WIN32

* hexagon: updates to enable offloading to HTP on WoS

* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface

* refactor htp-drv

* hexagon: add run-bench.ps1 script

* hexagon: htdrv refactor

* hexagon: unify Android and Windows build readmes

* hexagon: update README.md

* hexagon: refactor htpdrv

* hexagon: drv refactor

* hexagon: more drv refactor

* hexagon: fixes for android builds

* hexagon: factor out dl into ggml-backend-dl

* hexagon: add run-tool.ps1 script

* hexagon: merge htp-utils in htp-drv and remove unused code

* wos: no need for getopt_custom.h

* wos: add missing CR in htpdrv

* hexagon: ndev enforecement applies only to the Android devices

* hexagon: add support for generating and signing .cat file

* hexagon: add .inf file

* hexagon: working auto-signing and improved windows builds

* hexagon: futher improve skel build

* hexagon: add rough WoS guide

* hexagon: updated windows guide

* hexagon: improve cmake handling of certs and logging

* hexagon: improve windows setup/build doc

* hexagon: more windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* Update windows.md

* Update windows.md

* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon

Also added a power shell script to simplify build env setup.

* hexagon: remove trailing whitespace and move cmake requirement to user-presets

* hexagon: fix CMakeUserPresets path in workflow yaml

* hexagon: introduce local version of libdl.h

* hexagon: fix src1 reuse logic

gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-29 12:33:21 -08:00
Sigbjørn Skjæret 50e8962f79
ci : find latest release with asset for winget (#19161) 2026-01-28 22:05:39 +01:00
Sigbjørn Skjæret c0204a0893
ci : revert slim runner for winget (#19129) 2026-01-27 11:54:25 +01:00
Sigbjørn Skjæret 142cbe2ac6
ci : use new 1vCPU runner for lightweight jobs (#19107)
* use new 1vCPU runner for lightweight jobs

* pyright is too heavy, look into ty some day

use new pip-install input
2026-01-26 15:22:49 +01:00
Georgi Gerganov 557515be1e
graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898)
* graph : avoid branches between embedding and token inputs

* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology

* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI

* cont : pad token embeddings to n_embd_inp
2026-01-23 18:22:34 +02:00
Aaron Teo 8b30840703
release: update github api (#19022) 2026-01-22 21:38:02 +08:00
Pádraic Slattery 6b99a223e3
ci : update GitHub Actions versions [no ci] (#18935) 2026-01-22 00:57:18 +01:00
Sigbjørn Skjæret 57c0beaed0
ci : add label for jinja changes (#18903) 2026-01-17 21:52:02 +01:00
Chenguang Li e20fa27a02
CANN: fix an issue where get_env was not fully renamed (#18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
Adrien Gallouët 516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
Adrien Gallouët f709c7a33f
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 07:46:27 +01:00
Adrien Gallouët 537d4240d4
ci : remove libcurl in releases (#18775)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 21:43:02 +01:00
Adrien Gallouët 36c5913c45
ci : use openssl for openEuler-latest-cmake-cann (#18779)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 17:29:00 +01:00
Reese Levine 15bff84bf5
ggml webgpu: initial flashattention implementation (#18610)
* FlashAttention (#13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (inlcluding xielu) working

* removed unnecesarry checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in somce cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-08 08:23:39 -08:00
Sigbjørn Skjæret 9dfa8ee950
ci : run cann build unconditionally [no ci] (#18659) 2026-01-07 13:07:08 +01:00
Ali Tariq 8e3a761189
ci : init git lfs in every build for RISC-V (#18590)
* Initialized git lfs in every test

* Added git-lfs in dependencies to instal
2026-01-05 02:18:33 +01:00