llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	b31b30f31d	ai : do not run bash commands in the prompt (#20810 )	2026-03-20 19:06:33 +02:00
Georgi Gerganov	464fd0e71f	ai : update find-related action (#20790 ) * ai : update "related issues" prompt * cont * cont * cont	2026-03-20 10:28:14 +02:00
Georgi Gerganov	6c72646a61	ci : improve action for duplicate issue (#20772 ) * ci : show thinking traces of the agent * cont : increase thinking * cont : remove agent files * cont : move the model selection to the provider	2026-03-19 21:11:53 +02:00
Georgi Gerganov	900efd531d	ci : clarify gh command for viewing issues (#20766 )	2026-03-19 18:43:54 +02:00
uvos	b49d8b8757	ci : add hip quality check (#20430 ) * CI: add hip quality check * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Revert "Update .github/workflows/hip-quality-check.yml" This reverts commit `efa0bfcdb0`. * scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs * scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list * Bump ccache version * Add mssing seperators to list --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-19 17:05:44 +01:00
Georgi Gerganov	f071ce67c9	ci : add action for finding duplicate issues (#20756 ) * ci : add action for finding duplicates issues * cont : gen info * cont : formatting * cont : fix * cont : instructions * cont : bump checkout action Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-19 16:17:37 +02:00
Sigbjørn Skjæret	ab0bb93748	ci : bump ccache [no ci] (#20679 ) * bump ccache * forgotten * disable for s390x * disable also for ppc64le	2026-03-17 14:54:31 +01:00
Georgi Gerganov	45172df4d6	ci : disable AMX jobs (#20654 ) [no ci]	2026-03-16 22:38:59 +02:00
Sigbjørn Skjæret	0ed992973b	ci : update labeler (#20629 )	2026-03-16 20:24:20 +01:00
Sigbjørn Skjæret	b91d7dfe5b	ci : only save openvino caches on github-hosted master (#20593 ) * only save openvino ccache on master * disable toolkit cache if self-hosted * only cache on github-hosted runners * remove toolkit cache [no ci]	2026-03-15 18:58:13 +01:00
Georgi Gerganov	9cd4ebcfb1	ci : split build.yml + server.yml (#20546 ) * ci : split build.yml * cont : split server.yml * cont : reduce paths * cont : split build-android.yml + update paths * ci : make msys workflows manual (#20588) * ci : make cross-build workflows manual (#20585) * cont : fix release paths Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-15 15:11:17 +02:00
Georgi Gerganov	b4768955c4	ci : move self-hosted workflows to separate files (#20540 )	2026-03-14 23:15:35 +02:00
Eve	3a6f059909	ci : try to optimize some jobs (#20521 ) * force arm version to test * run on either x86 or arm if we can help it, this only works for runs without ccache * readd other jobs * remove ccache	2026-03-14 20:27:52 +01:00
Georgi Gerganov	9f774e45ee	ci : reduce webgpu tests timeout to 900s (#20538 ) [no ci]	2026-03-14 17:08:26 +02:00
Zijun Yu	9789c4ecdc	ggml : add OpenVINO backend (#15307 ) * Update build doc * Add cgraph tensor output name to OV op name * Update openvino build instructions * Add initial NPU support * draft NPU support version 2: prefill + kvcache * NPU support version 2: prefill + kvcache * Change due to ggml cgraph changes, not correct yet * Change due to ggml cgraph changes, llama-3.2 CPU work * Add AMD64 to CMakeLists * Change due to ggml cgraph changes, all device work * Refactor: clean, fix warning * Update clang-format * Statful transformation for CPU GPU * Add SwiGLU * Fuse to SDPA * Replace Concat with Broadcast in MulMat for GQA * Pull out indices creation for kv cache update * Refactor: remove past_token_len from extra_inputs * Fix Phi3 SwiGLU and SoftMax * Pull out sin cos from rope * Reduce memory: free ov weights node after graph conversion * Fix CPY due to cgraph change * Added OpenVINO CI/CD. Updated docs * Fix llama-cli * Fix Phi3 ROPE; Add test-backend-ops * Fix NPU * Fix llama-bench; Clang-format * Fix llama-perplexity * temp. changes for mark decomp * matmul in fp32 * mulmat input conversion fix * mulmat type conversion update * add mark decomp pass * Revert changes in fuse_to_sdpa * Update build.md * Fix test-backend-ops * Skip test-thread-safety; Run ctest only in ci/run.sh * Use CiD for NPU * Optimize tensor conversion, improve TTFT * Support op SET_ROWS * Fix NPU * Remove CPY * Fix test-backend-ops * Minor updates for raising PR * Perf: RMS fused to OV internal RMS op * Fix after rebasing - Layout of cache k and cache v are unified: [seq, n_head, head_size] - Add CPY and FLASH_ATTN_EXT, flash attn is not used yet - Skip test-backend-ops due to flash attn test crash - Add mutex around graph conversion to avoid test-thread-safety fali in the future - Update NPU config - Update GPU config to disable SDPA opt to make phi-3 run * Change openvino device_type to GPU; Enable flash_attn * Update supports_buft and supports_op for quantized models * Add quant weight conversion functions from genai gguf reader * Quant models run with accuracy issue * Fix accuracy: disable cpu_repack * Fix CI; Disable test-backend-ops * Fix Q4_1 * Fix test-backend-ops: Treat quantized tensors as weights * Add NPU Q4_0 support * NPU perf: eliminate zp * Dequantize q4_1 q4_k q6_k for NPU * Add custom quant type: q8_1_c, q4_0_128 * Set m_is_static=false as default in decoder * Simpilfy translation of get_rows * Fix after rebasing * Improve debug util; Eliminate nop ReshapeReshape * STYLE: make get_types_to_requant a function * Support BF16 model * Fix NPU compile * WA for npu 1st token acc issue * Apply EliminateZP only for npu * Add GeGLU * Fix Hunyuan * Support iSWA * Fix NPU accuracy * Fix ROPE accuracy when freq_scale != 1 * Minor: not add attention_size_swa for non-swa model * Minor refactor * Add Q5_K to support phi-3-q4_k_m * Requantize Q6_K (gs16) to gs32 on GPU * Fix after rebasing * Always apply Eliminate_ZP to fix GPU compile issue on some platforms * kvcachefusion support * env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added * Fix for Phi3 * Fix llama-cli (need to run with --no-warmup) * Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working * fix after rebasing * Fix llama-3-8b and phi3-mini q4_0 NPU * Update to OV-2025.3 and CMakeLists.txt * Add OV CI cache * Apply CISC review and update CI to OV2025.3 * Update CI to run OV dep install before build * Update OV dockerfile to use OV2025.3 and update build docs * Style: use switch in supports_ops * Style: middle ptr and ref align, omit optional struct keyword * NPU Unify PD (#14) * Stateless. Fix llama-cli llama-server * Simplify broadcast op in attention * Replace get_output_tensor+memcpy with set_output_tensor * NPU unify PD. Unify dynamic and static dims * Clean placeholders in ggml-openvino.cpp * NPU unify PD (handled internally) * change graph to 4d, support multi sequences * Fix llama-bench * Fix NPU * Update ggml-decoder.cpp Hitting error while compiling on windows: error C3861: 'unsetenv': identifier not found Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it. Proposed fix: Use _putenv_s() (Windows equivalent) This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment. This keeps cross-platform compatibility. * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Remove the second decoder for node. Moving the function into the model decoder * Fix error for naive * NPU prefill chunking * NPU fix llama-bench * fallback naive run with accuracy issue * NPU support llma-perplexity -b 512 --no-warmup * Refactor: split ov_graph_compute for dynamic and static * remove unused API GgmlOvDecoder::get_output_stride(const std::string & name) * minor update due to ov 2025.4 * remove unused API GgmlOvDecoder::get_output_names() * remove unused API get_output_shape(const std::string & name) * Modified API GgmlOvDecoder::get_output_type(const std::string & name) * Removed API GgmlOvDecoder::get_output_op_params(const std::string & name) * Removed API get_output_ggml_tensor(const std::string & name) * Removed API m_outputs * Removed m_output_names * Removed API GgmlOvDecoder::get_input_names() * Removed API GgmlOvDecoder::get_input_stride(const std::string& name) * Removed API get_input_type * Removed API get_input_type * Removed API GgmlOvDecoder::get_input_shape(const std::string & name) * Removed API GgmlOvDecoder::get_input_op_params(const std::string & name) * Fix error for decoder cache * Reuse cached decoder * GPU remove Q6_K requantization * NPU fix wrong model output shape * NPU fix q4 perf regression * Remove unused variable nodes * Fix decoder can_reuse for llama-bench * Update build.md for Windows * backend buffer: allocate on host * Use shared_buffer for GPU NPU; Refactor * Add ov_backend_host_buffer; Use cached remote context * Put kvcache on GPU * Use ggml_aligned_malloc * only use remote tensor for kvcache * only use remote tensor for kvcache for GPU * FIX: use remote tensor from singleton * Update build.md to include OpenCL * NPU always requant to q4_0_128 * Optimize symmetric quant weight extraction: use single zp * Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant * Update build.md * Support -ctk f32 * Initial stateful graph support * Update ggml/src/ggml-openvino/ggml-decoder.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * code cleanup * npu perf fix * requant to f16 for Q6 embed on NPU * Update ggml/src/ggml-openvino/ggml-decoder.cpp * Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp * Create OPENVINO.md in llama.cpp backend docs * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * Update build.md * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * kq_mask naming fix * Syntax correction for workflows build file * Change ov backend buffer is_host to false * Fix llama-bench -p -n where p<=256 * Fix --direct-io 0 * Don't put kvcache on GPU in stateful mode * Remove hardcode names * Fix stateful shapes * Simplification for stateful and update output shape processing * Remove hardcode names * Avoid re-compilation in llama-bench * Extract zp directly instead of bias * Refactor weight tensor processing * create_weight_node accept non-ov backend buffer * remove changes in llama-graph.cpp * stateful masking fix (#38) Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes. * Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add * hardcoded name handling for rope_freqs.weight * Suppress logging and add error handling to allow test-backend-ops to complete * Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases * Use bias instead of zp in test-backend-ops * Update OV in CI, Add OV CI Tests in GH Actions * Temp fix for multithreading bug * Update OV CI, fix review suggestions. * fix editorconfig-checker, update docs * Fix tabs to spaces for editorconfig-checker * fix editorconfig-checker * Update docs * updated model link to be GGUF model links * Remove GGML_CPU_REPACK=OFF * Skip permuted ADD and MUL * Removed static variables from utils.cpp * Removed initializing non-existing variable * Remove unused structs * Fix test-backend-ops for OV GPU * unify api calling * Update utils.cpp * When the dim is dynamic, throw an error, need to is stastic forst * Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using * No need to return * Fix test-backend-ops for OV GPU LNL * Fix test-thread-safety * use the shape from infer request of output tensor create to avoid issue * fix dynamic output shape issue * fix issue for the unused node in tests * Remove unused lock * Add comment * Update openvino docs * update to OV release version 2026.0 * add ci ov-gpu self hosted runner * fix editorconfig * Fix perplexity * Rewrite the model inputs finding mechanism (#54) * Rewrite the model inputs finding logistic * Put stateful shape handle in get input shape * Put the iteration logistic in func * Added ggml-ci-intel-openvino-gpu and doc update * .hpp files converted to .h * fix ggml-ci-x64-intel-openvino-gpu * Fix for stateful execution bug in llama-bench * Minor updates after stateful llama-bench fix * Update ggml/src/ggml-openvino/utils.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * Remove multiple get_shape calls * Bring back mutex into compute * Fix VIEW op, which slice the input node * Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access * Temp. fix for test requant errors * Update to OV ggml-ci to low-perf * ci : temporary disable "test-llama-archs" * ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag * docs : update url * Fix OV link in docker and Update docs --------- Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com> Co-authored-by: Arshath <arshath.ramzan@intel.com> Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com> Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-14 07:56:55 +02:00
Masato Nakasaka	4cc6eb158c	ci: Setup self-hosted CI for Intel Linux Vulkan backend (#20154 )	2026-03-12 06:43:22 +01:00
Ruben Ortlam	182acfe5c5	ci: disable coopmat on ubuntu-24-cmake-vulkan job (#20294 )	2026-03-11 14:12:29 +01:00
Johannes Gäßler	a976ff081b	llama: end-to-end tests (#19802 ) * tests: add end-to-end tests per model architecture * fixup for rebase * fix use-after-free in llama-model-loader.cpp * fix CI * fix WebGPU * fix CI * disable CI for macOS-latest-cmake-arm64 * use expert_weights_scale only if != 0.0f * comments	2026-03-08 12:30:21 +01:00
Daniel Bevenius	8d3b962f47	ci : use ubuntu-latest for gguf-publish workflow (#19951 ) This commit changes the runner for the gguf-publish workflow from ubuntu-slim back to ubuntu-latest, which was updated in Commit `142cbe2ac6` ("ci : use new 1vCPU runner for lightweight jobs (#19107)"). The motivation for this is that the action used in the workflow depends on the docker daemon, which does not seem not available in the ubuntu-slim runner. This is currently causing an error in the workflow and preventing the gguf-publish workflow from running successfully. Today was the the first time since the original change (I think) that publish task has been run which may be why the issue was not noticed before. Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566	2026-02-27 14:42:24 +01:00
Slobodan Josic	3af34b9ff5	ci : update the ROCm/HIP toolchain versions [no ci] (#19891 ) * [HIP] Update ROCm build container to rocm/dev-ubuntu-22.04:7.2 and HIP_SDK to 26.Q1 * revert container version --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-25 15:54:49 +01:00
Mario Limonciello	8fdf269dad	ci : update Windows ROCm build to 26.Q1 [no ci] (#19810 ) * Update build command to build llama-* tools not just ggml-hip * Update rocWMMA headers to 7.2 * Add GFX1150 target * Correct library paths for AMD libraries in 26.Q1	2026-02-25 12:30:19 +01:00
Sigbjørn Skjæret	9f0684f003	ci : fix rocm archive name [no ci] (#19808 )	2026-02-22 16:14:37 +01:00
Sigbjørn Skjæret	e877ad8bd9	ci : fix rocm release path [no ci] (#19784 )	2026-02-22 08:07:46 +01:00
Mario Limonciello	f75c4e8bf5	Add a build target to generate ROCm artifacts using ROCm 7.2 (#19433 ) This builds the following targets: * gfx1151 * gfx1150 * gfx1200 * gfx1201 * gfx1100 * gfx1101 * gfx1030 * gfx908 * gfx90a * gfx942	2026-02-21 19:56:26 +01:00
Sigbjørn Skjæret	e48349a49d	ci : bump komac version (#19682 )	2026-02-17 09:30:31 +01:00
Mengsheng Wu	94a602db66	github : add missing backends to issue templates (#19603 )	2026-02-14 00:56:53 +01:00
Georgi Gerganov	81ddc60cb3	ci : add metal server workflows (#19293 ) * ci : add metal server workflows * cont : try fix python init * cont : move to a separate workflow that runs only on master * cont : fix num jobs Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-09 15:09:30 +02:00
Sigbjørn Skjæret	9a5f57795c	ci : remove server job from webui and move slow test (#19424 ) * remove server job from webui and move slow test * use pip-install option	2026-02-08 01:20:00 +01:00
Georgi Gerganov	96441c955e	ci : use -j param correctly when building with sanitizers (#19411 ) * ci : use less jobs when building with sanitizers * cont : fix nproc * cont : fix the fix * cont : simplify	2026-02-07 23:50:47 +01:00
Jeff Bolz	f9bd518a6b	vulkan: make FA mask/softcap enables spec constants (#19309 ) * vulkan: make FA mask/softcap enables spec constants * don't specialize for sinks * bump timeout a little bit	2026-02-06 08:49:58 +01:00
Georgi Gerganov	423bee462b	ci : fix sanitize workflow to enable ggml sanitizers too (#19323 )	2026-02-04 15:12:03 +02:00
Georgi Gerganov	6a9bf2f788	ci : add sanitizer runs for server (#19291 )	2026-02-03 22:41:20 +02:00
Aman Gupta	15818ac44c	ci: add test-backend-ops test for CPU (#19268 )	2026-02-02 22:40:28 +08:00
Zheyuan Chen	bd90fc74c3	ggml-webgpu: improve flastAttention performance by software pipelining (#19151 ) * webgpu : pipeline flash_attn Q/K loads in WGSL * ggml-webgpu: unroll QK accumlation inner loop ggml-webgpu: vectorization * ggml-webgpu: unrolling * ggml-webgpu: remove redundant unrolling * ggml-webgpu: restore the config * ggml-webgpu: remove redundant comments * ggml-webgpu: formatting * ggml-webgpu: formatting and remove vectorization * ggml-webgpu: remove unnecessary constants * ggml-webgpu: change QKV buffer to read_write to pass validation * ggml-webgpu: add explanation for the additional bracket around Q K accumulate * Indentation and for -> if for tail * Kick off CI on wgsl only commits --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-01-29 14:05:30 -08:00
Todor Boinovski	ce38a4db47	hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150 ) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-29 12:33:21 -08:00
Sigbjørn Skjæret	50e8962f79	ci : find latest release with asset for winget (#19161 )	2026-01-28 22:05:39 +01:00
Sigbjørn Skjæret	c0204a0893	ci : revert slim runner for winget (#19129 )	2026-01-27 11:54:25 +01:00
Sigbjørn Skjæret	142cbe2ac6	ci : use new 1vCPU runner for lightweight jobs (#19107 ) * use new 1vCPU runner for lightweight jobs * pyright is too heavy, look into ty some day use new pip-install input	2026-01-26 15:22:49 +01:00
Georgi Gerganov	557515be1e	graph : utilize `ggml_build_forward_select()` to avoid reallocations (#18898 ) * graph : avoid branches between embedding and token inputs * models : make deepstack graphs (e.g. Qwen3 VL) have constant topology * ci : enable -DGGML_SCHED_NO_REALLOC=ON for server CI * cont : pad token embeddings to n_embd_inp	2026-01-23 18:22:34 +02:00
Aaron Teo	8b30840703	release: update github api (#19022 )	2026-01-22 21:38:02 +08:00
Pádraic Slattery	6b99a223e3	ci : update GitHub Actions versions [no ci] (#18935 )	2026-01-22 00:57:18 +01:00
Sigbjørn Skjæret	57c0beaed0	ci : add label for jinja changes (#18903 )	2026-01-17 21:52:02 +01:00
Chenguang Li	e20fa27a02	CANN: fix an issue where get_env was not fully renamed (#18796 ) * CANN: fix an issue where get_env was not fully renamed * ci: add cann with acl group * ci: define use_acl_graph using GitHub Action * ci: update cann dockerfile with acl graph	2026-01-16 16:24:04 +08:00
Adrien Gallouët	516a4ca9b5	refactor : remove libcurl, use OpenSSL when available (#18828 )	2026-01-14 18:02:47 +01:00
Adrien Gallouët	f709c7a33f	ci, tests : use cmake to download models and remove libcurl dependency (#18791 ) * ci, tests : use cmake to download models and remove libcurl dependency * llama_dl_model -> llama_download_model * use EXPECTED_HASH for robust model downloading * Move llama_download_model to cmake/common.cmake Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-14 07:46:27 +01:00
Adrien Gallouët	537d4240d4	ci : remove libcurl in releases (#18775 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-12 21:43:02 +01:00
Adrien Gallouët	36c5913c45	ci : use openssl for openEuler-latest-cmake-cann (#18779 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-01-12 17:29:00 +01:00
Reese Levine	15bff84bf5	ggml webgpu: initial flashattention implementation (#18610 ) * FlashAttention (#13) * Add inplace softmax * Move rms_norm to split row approach * Update debug for supports_op * clean up debug statements * neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though * neg passes backend test * unary operators pass ggml tests * rms_norm double declaration bug atoned * abides by editor-config * removed vestigial files * fixed autoconfig * All operators (inlcluding xielu) working * removed unnecesarry checking if node->src[1] exists for unary operators * responded and dealt with PR comments * implemented REPL_Template support and removed bug in unary operators kernel * formatted embed wgsl and ggml-webgpu.cpp * Faster tensors (#8) Add fast matrix and matrix/vector multiplication. * Use map for shader replacements instead of pair of strings * Wasm (#9) * webgpu : fix build on emscripten * more debugging stuff * test-backend-ops: force single thread on wasm * fix single-thread case for init_tensor_uniform * use jspi * add pthread * test: remember to set n_thread for cpu backend * Add buffer label and enable dawn-specific toggles to turn off some checks * Intermediate state * Fast working f16/f32 vec4 * Working float fast mul mat * Clean up naming of mul_mat to match logical model, start work on q mul_mat * Setup for subgroup matrix mat mul * Basic working subgroup matrix * Working subgroup matrix tiling * Handle weirder sg matrix sizes (but still % sg matrix size) * Working start to gemv * working f16 accumulation with shared memory staging * Print out available subgroup matrix configurations * Vectorize dst stores for sg matrix shader * Gemv working scalar * Minor set_rows optimization (#4) * updated optimization, fixed errors * non vectorized version now dispatches one thread per element * Simplify * Change logic for set_rows pipelines --------- Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan> Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Comment on dawn toggles * Working subgroup matrix code for (semi)generic sizes * Remove some comments * Cleanup code * Update dawn version and move to portable subgroup size * Try to fix new dawn release * Update subgroup size comment * Only check for subgroup matrix configs if they are supported * Add toggles for subgroup matrix/f16 support on nvidia+vulkan * Make row/col naming consistent * Refactor shared memory loading * Move sg matrix stores to correct file * Working q4_0 * Formatting * Work with emscripten builds * Fix test-backend-ops emscripten for f16/quantized types * Use emscripten memory64 to support get_memory * Add build flags and try ci --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> * Remove extra whitespace * Move wasm single-thread logic out of test-backend-ops for cpu backend * Disable multiple threads for emscripten single-thread builds in ggml_graph_plan * Refactored pipelines and workgroup calculations (#10) * refactored pipelines * refactored workgroup calculation * removed commented out block of prior maps * Clean up ceiling division pattern --------- Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu> Co-authored-by: Reese Levine <reeselevine1@gmail.com> * Start work on flash attention * Shader structure set up (many bugs still) * debugging * Working first test * Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32 * Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling * Start work on integrating pre-wgsl * Separate structs/initial shader compilation library into separate files * Work on compilation choices for flashattention * Work on subgroup matrix/tile size portability * subgroup size agnostic online softmax * Cleanups, quantization types * more cleanup * fix wasm build * Refactor flashattention to increase parallelism, use direct loads for KV in somce cases * Checkpoint * formatting * Update to account for default kv cache padding * formatting shader * Add workflow for ggml-ci webgpu * Try passing absolute path to dawn in ggml-ci * Avoid error on device destruction, add todos for proper cleanup * Fix unused warning * Forgot one parameter unused * Move some flashattn computation to f32 for correctness	2026-01-08 08:23:39 -08:00
Sigbjørn Skjæret	9dfa8ee950	ci : run cann build unconditionally [no ci] (#18659 )	2026-01-07 13:07:08 +01:00
Ali Tariq	8e3a761189	ci : init git lfs in every build for RISC-V (#18590 ) * Initialized git lfs in every test * Added git-lfs in dependencies to instal	2026-01-05 02:18:33 +01:00

1 2 3 4 5 ...

488 Commits