Commit Graph

469 Commits

Author SHA1 Message Date
jianlins 48c561fcc9 workflow: enhance library packaging by preserving symlinks and adding runtime checks 2026-03-13 11:57:51 -06:00
jianlins ee86901ed7 Merge branch 'master' of https://github.com/jianlins/llama.cpp 2026-03-12 23:02:48 -06:00
jianlins cd9d8771c2 workflow: update source checkout step and add tag creation logic in build script 2026-03-12 23:02:38 -06:00
Jianlin Shi 804e13febe
Merge branch 'ggml-org:master' into master 2026-03-12 23:06:57 -05:00
jianlins 25581cbb35 workflow: require tag_name input for manual release trigger and update asset upload process 2026-03-12 22:05:23 -06:00
jianlins a23f3fb7fd workflow: add safe.directory configuration for Git in build script 2026-03-12 20:12:53 -06:00
jianlins eb7a304948 workflow: update build steps to handle gcc-toolset-12 sourcing more safely 2026-03-12 00:08:04 -06:00
jianlins 460a535956 workflow: update build script to use gcc-toolset-12 and add linker flags 2026-03-12 00:01:18 -06:00
Masato Nakasaka 4cc6eb158c
ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154) 2026-03-12 06:43:22 +01:00
jianlins b1e28fa511 workflow: add gpu_arch input for customizable CUDA architecture in build script 2026-03-11 22:28:21 -06:00
jianlins 200905df28 workflow: remove crb repository enablement from Rocky Linux build script 2026-03-11 22:10:13 -06:00
jianlins 65f59acdde workflow: update package manager commands from yum to dnf in Rocky Linux build script 2026-03-11 22:05:15 -06:00
jianlins ddb3fcd924 workflow: comment out yum update in Rocky Linux build script 2026-03-11 22:01:02 -06:00
jianlins 38c82a6db7 workflow: install epel-release as a build dependency for Rocky Linux 2026-03-11 21:55:44 -06:00
jianlins a3cc7bf992 workflow: add manual trigger for release creation in Rocky Linux build 2026-03-11 21:46:17 -06:00
jianlins 29b3c14619 workflow: update build configurations for Linux and Windows CUDA 2026-03-11 21:44:36 -06:00
Jianlin Shi 2a1e77436c
build-linux-cuda 2026-03-11 20:02:20 -05:00
Jianlin Shi 70dde08eba
Merge branch 'ggml-org:master' into master 2026-03-11 19:39:40 -05:00
Ruben Ortlam 182acfe5c5
ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294) 2026-03-11 14:12:29 +01:00
Johannes Gäßler a976ff081b
llama: end-to-end tests (#19802)
* tests: add end-to-end tests per model architecture

* fixup for rebase

* fix use-after-free in llama-model-loader.cpp

* fix CI

* fix WebGPU

* fix CI

* disable CI for macOS-latest-cmake-arm64

* use expert_weights_scale only if != 0.0f

* comments
2026-03-08 12:30:21 +01:00
Daniel Bevenius 8d3b962f47
ci : use ubuntu-latest for gguf-publish workflow (#19951)
This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, which was updated in Commit
142cbe2ac6 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").

The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem not available in the
ubuntu-slim runner. This is currently causing an error in the workflow
and preventing the gguf-publish workflow from running successfully.
Today was the the first time since the original change (I think) that
publish task has been run which may be why the issue was not noticed
before.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
2026-02-27 14:42:24 +01:00
Slobodan Josic 3af34b9ff5
ci : update the ROCm/HIP toolchain versions [no ci] (#19891)
* [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1

* revert container version

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-25 15:54:49 +01:00
Mario Limonciello 8fdf269dad
ci : update Windows ROCm build to 26.Q1 [no ci] (#19810)
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
2026-02-25 12:30:19 +01:00
Sigbjørn Skjæret 9f0684f003
ci : fix rocm archive name [no ci] (#19808) 2026-02-22 16:14:37 +01:00
Sigbjørn Skjæret e877ad8bd9
ci : fix rocm release path [no ci] (#19784) 2026-02-22 08:07:46 +01:00
Mario Limonciello f75c4e8bf5
Add a build target to generate ROCm artifacts using ROCm 7.2 (#19433)
This builds the following targets:
 * gfx1151
 * gfx1150
 * gfx1200
 * gfx1201
 * gfx1100
 * gfx1101
 * gfx1030
 * gfx908
 * gfx90a
 * gfx942
2026-02-21 19:56:26 +01:00
Sigbjørn Skjæret e48349a49d
ci : bump komac version (#19682) 2026-02-17 09:30:31 +01:00
Georgi Gerganov 81ddc60cb3
ci : add metal server workflows (#19293)
* ci : add metal server workflows

* cont : try fix python init

* cont : move to a separate workflow that runs only on master

* cont : fix num jobs

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-02-09 15:09:30 +02:00
Sigbjørn Skjæret 9a5f57795c
ci : remove server job from webui and move slow test (#19424)
* remove server job from webui and move slow test

* use pip-install option
2026-02-08 01:20:00 +01:00
Georgi Gerganov 96441c955e
ci : use -j param correctly when building with sanitizers (#19411)
* ci : use less jobs when building with sanitizers

* cont : fix nproc

* cont : fix the fix

* cont : simplify
2026-02-07 23:50:47 +01:00
Jeff Bolz f9bd518a6b
vulkan: make FA mask/softcap enables spec constants (#19309)
* vulkan: make FA mask/softcap enables spec constants

* don't specialize for sinks

* bump timeout a little bit
2026-02-06 08:49:58 +01:00
Georgi Gerganov 423bee462b
ci : fix sanitize workflow to enable ggml sanitizers too (#19323) 2026-02-04 15:12:03 +02:00
Georgi Gerganov 6a9bf2f788
ci : add sanitizer runs for server (#19291) 2026-02-03 22:41:20 +02:00
Aman Gupta 15818ac44c
ci: add test-backend-ops test for CPU (#19268) 2026-02-02 22:40:28 +08:00
Zheyuan Chen bd90fc74c3
ggml-webgpu: improve flastAttention performance by software pipelining (#19151)
* webgpu : pipeline flash_attn Q/K loads in WGSL

* ggml-webgpu: unroll Q*K accumlation inner loop

* ggml-webgpu: vectorization

* ggml-webgpu: unrolling

* ggml-webgpu: remove redundant unrolling

* ggml-webgpu: restore the config

* ggml-webgpu: remove redundant comments

* ggml-webgpu: formatting

* ggml-webgpu: formatting and remove vectorization

* ggml-webgpu: remove unnecessary constants

* ggml-webgpu: change QKV buffer to read_write to pass validation

* ggml-webgpu: add explanation for the additional bracket around Q K accumulate

* Indentation and for -> if for tail

* Kick off CI on wgsl only commits

---------

Co-authored-by: Reese Levine <reeselevine1@gmail.com>
2026-01-29 14:05:30 -08:00
Todor Boinovski ce38a4db47
hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150)
* hexagon: updates to enable offloading to HTP on WoS

* Update windows.md

* Update windows.md

* hexagon: enable -O3 optimizations

* hexagon: move all _WINDOWS conditional compilation to _WIN32

* hexagon: updates to enable offloading to HTP on WoS

* hexagon: use run-time vs load-time dynamic linking for cdsp driver interface

* refactor htp-drv

* hexagon: add run-bench.ps1 script

* hexagon: htdrv refactor

* hexagon: unify Android and Windows build readmes

* hexagon: update README.md

* hexagon: refactor htpdrv

* hexagon: drv refactor

* hexagon: more drv refactor

* hexagon: fixes for android builds

* hexagon: factor out dl into ggml-backend-dl

* hexagon: add run-tool.ps1 script

* hexagon: merge htp-utils in htp-drv and remove unused code

* wos: no need for getopt_custom.h

* wos: add missing CR in htpdrv

* hexagon: ndev enforecement applies only to the Android devices

* hexagon: add support for generating and signing .cat file

* hexagon: add .inf file

* hexagon: working auto-signing and improved windows builds

* hexagon: futher improve skel build

* hexagon: add rough WoS guide

* hexagon: updated windows guide

* hexagon: improve cmake handling of certs and logging

* hexagon: improve windows setup/build doc

* hexagon: more windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* hexagon: windows readme updates

* Update windows.md

* Update windows.md

* snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon

Also added a power shell script to simplify build env setup.

* hexagon: remove trailing whitespace and move cmake requirement to user-presets

* hexagon: fix CMakeUserPresets path in workflow yaml

* hexagon: introduce local version of libdl.h

* hexagon: fix src1 reuse logic

gpt-oss needs a bigger lookahead window.
The check for src[1] itself being quantized was wrong.

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-01-29 12:33:21 -08:00
Sigbjørn Skjæret 50e8962f79
ci : find latest release with asset for winget (#19161) 2026-01-28 22:05:39 +01:00
Sigbjørn Skjæret c0204a0893
ci : revert slim runner for winget (#19129) 2026-01-27 11:54:25 +01:00
Sigbjørn Skjæret 142cbe2ac6
ci : use new 1vCPU runner for lightweight jobs (#19107)
* use new 1vCPU runner for lightweight jobs

* pyright is too heavy, look into ty some day

use new pip-install input
2026-01-26 15:22:49 +01:00
Georgi Gerganov 557515be1e
graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898)
* graph : avoid branches between embedding and token inputs

* models : make deepstack graphs (e.g. Qwen3 VL) have constant topology

* ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI

* cont : pad token embeddings to n_embd_inp
2026-01-23 18:22:34 +02:00
Aaron Teo 8b30840703
release: update github api (#19022) 2026-01-22 21:38:02 +08:00
Pádraic Slattery 6b99a223e3
ci : update GitHub Actions versions [no ci] (#18935) 2026-01-22 00:57:18 +01:00
Chenguang Li e20fa27a02
CANN: fix an issue where get_env was not fully renamed (#18796)
* CANN: fix an issue where get_env was not fully renamed

* ci: add cann with acl group

* ci: define use_acl_graph using GitHub Action

* ci: update cann dockerfile with acl graph
2026-01-16 16:24:04 +08:00
Adrien Gallouët 516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
Adrien Gallouët f709c7a33f
ci, tests : use cmake to download models and remove libcurl dependency (#18791)
* ci, tests : use cmake to download models and remove libcurl dependency
* llama_dl_model -> llama_download_model
* use EXPECTED_HASH for robust model downloading
* Move llama_download_model to cmake/common.cmake

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-14 07:46:27 +01:00
Adrien Gallouët 537d4240d4
ci : remove libcurl in releases (#18775)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 21:43:02 +01:00
Adrien Gallouët 36c5913c45
ci : use openssl for openEuler-latest-cmake-cann (#18779)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-12 17:29:00 +01:00
Reese Levine 15bff84bf5
ggml webgpu: initial flashattention implementation (#18610)
* FlashAttention (#13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (inlcluding xielu) working

* removed unnecesarry checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in somce cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-08 08:23:39 -08:00
Sigbjørn Skjæret 9dfa8ee950
ci : run cann build unconditionally [no ci] (#18659) 2026-01-07 13:07:08 +01:00
Ali Tariq 8e3a761189
ci : init git lfs in every build for RISC-V (#18590)
* Initialized git lfs in every test

* Added git-lfs in dependencies to instal
2026-01-05 02:18:33 +01:00