llama.cpp

Commit Graph

Author	SHA1	Message	Date
Neo Zhang	a93c0ef0fa	add op gated_delta_net (#20455 )	2026-03-14 22:01:57 +08:00
Wallentri	f2c0dfb739	Use fp32 in cuBLAS V100 to avoid overflows, env variables to override cuBLAS compute type (#19959 ) * Update ggml-cuda.cu * Update ggml-cuda.cu * Update build.md * Update build.md * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml-cuda.cu * Update build.md * Update ggml/src/ggml-cuda/ggml-cuda.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update build.md * Update ggml-cuda.cu * Update ggml-cuda.cu --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-14 15:43:13 +08:00
Zijun Yu	9789c4ecdc	ggml : add OpenVINO backend (#15307 ) * Update build doc * Add cgraph tensor output name to OV op name * Update openvino build instructions * Add initial NPU support * draft NPU support version 2: prefill + kvcache * NPU support version 2: prefill + kvcache * Change due to ggml cgraph changes, not correct yet * Change due to ggml cgraph changes, llama-3.2 CPU work * Add AMD64 to CMakeLists * Change due to ggml cgraph changes, all device work * Refactor: clean, fix warning * Update clang-format * Statful transformation for CPU GPU * Add SwiGLU * Fuse to SDPA * Replace Concat with Broadcast in MulMat for GQA * Pull out indices creation for kv cache update * Refactor: remove past_token_len from extra_inputs * Fix Phi3 SwiGLU and SoftMax * Pull out sin cos from rope * Reduce memory: free ov weights node after graph conversion * Fix CPY due to cgraph change * Added OpenVINO CI/CD. Updated docs * Fix llama-cli * Fix Phi3 ROPE; Add test-backend-ops * Fix NPU * Fix llama-bench; Clang-format * Fix llama-perplexity * temp. changes for mark decomp * matmul in fp32 * mulmat input conversion fix * mulmat type conversion update * add mark decomp pass * Revert changes in fuse_to_sdpa * Update build.md * Fix test-backend-ops * Skip test-thread-safety; Run ctest only in ci/run.sh * Use CiD for NPU * Optimize tensor conversion, improve TTFT * Support op SET_ROWS * Fix NPU * Remove CPY * Fix test-backend-ops * Minor updates for raising PR * Perf: RMS fused to OV internal RMS op * Fix after rebasing - Layout of cache k and cache v are unified: [seq, n_head, head_size] - Add CPY and FLASH_ATTN_EXT, flash attn is not used yet - Skip test-backend-ops due to flash attn test crash - Add mutex around graph conversion to avoid test-thread-safety fali in the future - Update NPU config - Update GPU config to disable SDPA opt to make phi-3 run * Change openvino device_type to GPU; Enable flash_attn * Update supports_buft and supports_op for quantized models * Add quant weight conversion functions from genai gguf reader * Quant models run with accuracy issue * Fix accuracy: disable cpu_repack * Fix CI; Disable test-backend-ops * Fix Q4_1 * Fix test-backend-ops: Treat quantized tensors as weights * Add NPU Q4_0 support * NPU perf: eliminate zp * Dequantize q4_1 q4_k q6_k for NPU * Add custom quant type: q8_1_c, q4_0_128 * Set m_is_static=false as default in decoder * Simpilfy translation of get_rows * Fix after rebasing * Improve debug util; Eliminate nop ReshapeReshape * STYLE: make get_types_to_requant a function * Support BF16 model * Fix NPU compile * WA for npu 1st token acc issue * Apply EliminateZP only for npu * Add GeGLU * Fix Hunyuan * Support iSWA * Fix NPU accuracy * Fix ROPE accuracy when freq_scale != 1 * Minor: not add attention_size_swa for non-swa model * Minor refactor * Add Q5_K to support phi-3-q4_k_m * Requantize Q6_K (gs16) to gs32 on GPU * Fix after rebasing * Always apply Eliminate_ZP to fix GPU compile issue on some platforms * kvcachefusion support * env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added * Fix for Phi3 * Fix llama-cli (need to run with --no-warmup) * Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working * fix after rebasing * Fix llama-3-8b and phi3-mini q4_0 NPU * Update to OV-2025.3 and CMakeLists.txt * Add OV CI cache * Apply CISC review and update CI to OV2025.3 * Update CI to run OV dep install before build * Update OV dockerfile to use OV2025.3 and update build docs * Style: use switch in supports_ops * Style: middle ptr and ref align, omit optional struct keyword * NPU Unify PD (#14) * Stateless. Fix llama-cli llama-server * Simplify broadcast op in attention * Replace get_output_tensor+memcpy with set_output_tensor * NPU unify PD. Unify dynamic and static dims * Clean placeholders in ggml-openvino.cpp * NPU unify PD (handled internally) * change graph to 4d, support multi sequences * Fix llama-bench * Fix NPU * Update ggml-decoder.cpp Hitting error while compiling on windows: error C3861: 'unsetenv': identifier not found Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it. Proposed fix: Use _putenv_s() (Windows equivalent) This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment. This keeps cross-platform compatibility. * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Update ggml-decoder.cpp * Remove the second decoder for node. Moving the function into the model decoder * Fix error for naive * NPU prefill chunking * NPU fix llama-bench * fallback naive run with accuracy issue * NPU support llma-perplexity -b 512 --no-warmup * Refactor: split ov_graph_compute for dynamic and static * remove unused API GgmlOvDecoder::get_output_stride(const std::string & name) * minor update due to ov 2025.4 * remove unused API GgmlOvDecoder::get_output_names() * remove unused API get_output_shape(const std::string & name) * Modified API GgmlOvDecoder::get_output_type(const std::string & name) * Removed API GgmlOvDecoder::get_output_op_params(const std::string & name) * Removed API get_output_ggml_tensor(const std::string & name) * Removed API m_outputs * Removed m_output_names * Removed API GgmlOvDecoder::get_input_names() * Removed API GgmlOvDecoder::get_input_stride(const std::string& name) * Removed API get_input_type * Removed API get_input_type * Removed API GgmlOvDecoder::get_input_shape(const std::string & name) * Removed API GgmlOvDecoder::get_input_op_params(const std::string & name) * Fix error for decoder cache * Reuse cached decoder * GPU remove Q6_K requantization * NPU fix wrong model output shape * NPU fix q4 perf regression * Remove unused variable nodes * Fix decoder can_reuse for llama-bench * Update build.md for Windows * backend buffer: allocate on host * Use shared_buffer for GPU NPU; Refactor * Add ov_backend_host_buffer; Use cached remote context * Put kvcache on GPU * Use ggml_aligned_malloc * only use remote tensor for kvcache * only use remote tensor for kvcache for GPU * FIX: use remote tensor from singleton * Update build.md to include OpenCL * NPU always requant to q4_0_128 * Optimize symmetric quant weight extraction: use single zp * Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant * Update build.md * Support -ctk f32 * Initial stateful graph support * Update ggml/src/ggml-openvino/ggml-decoder.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * code cleanup * npu perf fix * requant to f16 for Q6 embed on NPU * Update ggml/src/ggml-openvino/ggml-decoder.cpp * Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp * Create OPENVINO.md in llama.cpp backend docs * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * Update build.md * Update OPENVINO.md * Update OPENVINO.md * Update OPENVINO.md * kq_mask naming fix * Syntax correction for workflows build file * Change ov backend buffer is_host to false * Fix llama-bench -p -n where p<=256 * Fix --direct-io 0 * Don't put kvcache on GPU in stateful mode * Remove hardcode names * Fix stateful shapes * Simplification for stateful and update output shape processing * Remove hardcode names * Avoid re-compilation in llama-bench * Extract zp directly instead of bias * Refactor weight tensor processing * create_weight_node accept non-ov backend buffer * remove changes in llama-graph.cpp * stateful masking fix (#38) Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes. * Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add * hardcoded name handling for rope_freqs.weight * Suppress logging and add error handling to allow test-backend-ops to complete * Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases * Use bias instead of zp in test-backend-ops * Update OV in CI, Add OV CI Tests in GH Actions * Temp fix for multithreading bug * Update OV CI, fix review suggestions. * fix editorconfig-checker, update docs * Fix tabs to spaces for editorconfig-checker * fix editorconfig-checker * Update docs * updated model link to be GGUF model links * Remove GGML_CPU_REPACK=OFF * Skip permuted ADD and MUL * Removed static variables from utils.cpp * Removed initializing non-existing variable * Remove unused structs * Fix test-backend-ops for OV GPU * unify api calling * Update utils.cpp * When the dim is dynamic, throw an error, need to is stastic forst * Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using * No need to return * Fix test-backend-ops for OV GPU LNL * Fix test-thread-safety * use the shape from infer request of output tensor create to avoid issue * fix dynamic output shape issue * fix issue for the unused node in tests * Remove unused lock * Add comment * Update openvino docs * update to OV release version 2026.0 * add ci ov-gpu self hosted runner * fix editorconfig * Fix perplexity * Rewrite the model inputs finding mechanism (#54) * Rewrite the model inputs finding logistic * Put stateful shape handle in get input shape * Put the iteration logistic in func * Added ggml-ci-intel-openvino-gpu and doc update * .hpp files converted to .h * fix ggml-ci-x64-intel-openvino-gpu * Fix for stateful execution bug in llama-bench * Minor updates after stateful llama-bench fix * Update ggml/src/ggml-openvino/utils.cpp Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> * Remove multiple get_shape calls * Bring back mutex into compute * Fix VIEW op, which slice the input node * Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access * Temp. fix for test requant errors * Update to OV ggml-ci to low-perf * ci : temporary disable "test-llama-archs" * ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag * docs : update url * Fix OV link in docker and Update docs --------- Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com> Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com> Co-authored-by: Arshath <arshath.ramzan@intel.com> Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com> Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com> Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-14 07:56:55 +02:00
Masashi Yoshimura	05039967da	ggml-virtgpu: Fix some build commands (#20341 )	2026-03-12 15:47:45 +08:00
Masashi Yoshimura	f2ab047f27	ggml-webgpu: Add supports for `GGML_OP_REPEAT` (#20230 ) * Add GGML_OP_REPEAT to webgpu backend. * Add i16 support for GGML_OP_REPEAT.	2026-03-11 14:40:36 -07:00
Neo Zhang	ecac98ee53	[SYCL] Update SYCL.md for binary package for Windows (#20401 ) * add download binary package * update prefix	2026-03-11 22:21:22 +08:00
Neo Zhang	0cec84f999	fix op rope, add rope_back (#20293 )	2026-03-11 09:53:34 +08:00
a3894281	0f1e9d14cc	docs: update CPU backend ops to mark POOL_1D as supported (#20304 )	2026-03-10 21:31:24 +08:00
Charles Xu	0cd4f4720b	kleidiai : support for concurrent sme and neon kernel execution (#20070 )	2026-03-10 09:25:25 +02:00
Bertay Eren	0beb8db3a0	ggml-vulkan: add SGN operator, auto-generate Vulkan.csv and ops.md (#20219 )	2026-03-09 07:24:16 +01:00
GiantPrince	d088d5b74f	ggml-vulkan: Add ELU op support (#20183 ) * ggml-Vulkan: add ELU support * ggml-Vulkan: remove extra spaces and variables * ggml-Vulkan: fix format issue * ggml-Vulkan: fix format issue * fix whitespace issue * Update Vulkan.csv and ops.md	2026-03-08 12:38:17 +01:00
Neo Zhang	213c4a0b81	[SYCL] supprt Flash Attention for fp32/fp16/Q4/Q5/Q8 (#20190 ) * support flash-attention for fp32/fp16/Q4/Q5/Q8 * rm warining * update for JIT	2026-03-08 12:00:07 +08:00
Piotr Wilkin (ilintar)	566059a26b	Autoparser - complete refactoring of parser architecture (#18675 ) * Autoparser - full single commit squish * Final pre-merge changes: minor fixes, Kimi 2.5 model parser	2026-03-06 21:01:00 +01:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
Masashi Yoshimura	541bf37622	Add concat op to webgpu. (#20068 )	2026-03-04 11:19:00 -08:00
Mickael Desgranges	ecd99d6a9a	docs: Fix intel documentation link (#20040 )	2026-03-03 21:50:00 +08:00
Vishal Singh	88cf781f51	ggml-zendnn: update code for latest ZenDNN API (#19923 ) - adapt ggml-zendnn.cpp to the new lowoha::matmul interface - update the ZenDNN git tag in CMake to the latest release (ZenDNN‑2026‑WW08) - add static lib support in CMake	2026-02-27 08:43:41 +08:00
Kevin Pouget	ffaafde16f	ggml-virtgpu: improve the reliability of the code (#19846 ) * ggml-virtgpu-backend: validate the consistency of the received objects This patch adds consistency checks in the ggml-virtgpu-backend (running on the host side) to ensure that the data received from the guest is consistent (valid pointers, valid sizes and offsets). * ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods ``` 1. bck->iface.synchronize(bck) 2. buft->iface.get_alloc_size(buft, op) 3. buft->iface.get_max_size(buft) ``` these three methods are optional in the GGML interface. `get_max_size` was already properly defaulted, but `backend sychronize` and `butf get_max_size` would have segfaulted the backend if not implemented. * ggml-virtgpu-backend: fix log format missing argument * ggml-virtgpu-backend: improve the abort message * ggml-virtgpu-backend: more safety checks * ggml-virtgpu-backend: new error code * ggml-virtgpu-backend: initialize all the error codes * ggml-virtgpu: add a missing comment generated by the code generator * ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names * ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message * ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR (still unmerged) * ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file * ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error * docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limed to 64 GB with libkrun * ggml-virtgpu: turn off clang-format header ordering for some of the files Compilation breaks when ordered alphabetically. * ggml-virtgpu: clang-format * ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes	2026-02-26 20:00:57 +08:00
Masashi Yoshimura	11c325c6e0	ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. (#19700 ) * ggml-webgpu: Add unary op (SQR, SQRT, SIN, COS) support. * Fix to cast the src value to f32 before sin/cos computing.	2026-02-19 09:18:30 -07:00
Maciej Lisowski	e99f1083a0	docs: Fix broken links for preparing models in Backends (#19684 )	2026-02-18 23:50:23 +08:00
Aaron Teo	6e67fd2144	docs: update s390x build docs (#19643 )	2026-02-16 00:33:34 +08:00
TriDefender	313493de53	docs : update path in snapdragon README.md (#19533 ) paths changed so original example didn't work	2026-02-12 08:13:51 +01:00
Sascha Rogmann	292f6908cd	spec : remove check rate (#19377 ) * spec: remove parameter spec-ngram-check-rate * spec : renamed statistics vars * spec : add n_call_begin, n_call_accept * spec : don't enable key-map-stats	2026-02-09 15:30:50 +02:00
Kevin Pouget	f5e7734ff2	ggml-virtgpu: add backend documentation (#19354 ) * ggml-virtgpu: add backend documentation Assisted-by-AI: Claude Code * CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget * README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md * docs/ggml-virt: add link to testing + configuration * Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget" This reverts commit `8ece8e72e2`. * drop the ggml- prefix * s/ggerganov/ggml-org * Relocate VirtGPU.md * reorganize the text * turn turn the ascii diagram into a mermaid * README.md: update the link to the main doc	2026-02-09 20:15:42 +08:00
Nechama Krashinski	537eadb1b9	sycl: add F16 support for GGML_OP_CEIL (#19306 ) * Fix SYCL CEIL operator * sycl: implement GGML_OP_CEIL	2026-02-06 23:13:44 +08:00
Gaurav Garg	41e3f02647	cuda : revert CUDA_SCALE_LAUNCH_QUEUES override until investigated (#19227 ) Hangs were reported on Jetson Orin AGX if we set CUDA_SCALE_LAUNCH_QUEUES=4x. Reverting the previous PR (#19042) and updating the document to consider setting CUDA_SCALE_LAUNCH_QUEUES=4x for faster throughput on multi-GPU systems.	2026-02-03 08:41:02 +02:00
Neo Zhang	bf38346d13	Remove support for Nvidia & AMD GPU, because the oneAPI plugin for Nvidia & AMD GPU is unavailable: download/installation channels are out of work. (#19246 ) User can't build up the software for Nvidia & AMD GPU. rm the oneMath since it is only used in NV and AMD code path.	2026-02-02 21:06:21 +08:00
Tamar	4d5e972673	sycl: implement GGML_OP_TOP_K (#19242 )	2026-02-02 21:05:51 +08:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
Sascha Rogmann	b4d05a3d2f	spec : various improvements ton ngram-map + docs (#19253 ) * spec: ngram-map and reasoning chats * spec: add t_begin and t_accept * ngram-map : add internal hash map * docs : update ngram-map, add ngram-mod * docs : fix ngram-map-k * docs : differences between implementations	2026-02-02 08:26:58 +02:00
Max Krasnyansky	3bc8d2cf23	Bump cmake max version (needed for Windows on Snapdragon builds) (#19188 ) * Bump max cmake version (needed for Windows on Snapdragon builds) * cmake: move max version setting into ggml/CMakeLists	2026-02-01 14:13:38 -08:00
Neo Zhang	2634ed207a	create test.sh to enhance the parameters for testing, update the guide, rm useless script (#19243 )	2026-02-01 18:24:00 +08:00
s8322	1025fd2c09	sycl: implement GGML_UNARY_OP_SOFTPLUS (#19114 ) * sycl: add softplus unary op implementation * sycl: add softplus unary op implementation * docs(ops): mark SYCL SOFTPLUS as supported * docs: update SYCL status for SOFTPLUS	2026-01-30 12:01:38 +08:00
RachelMantel	c7358ddf64	sycl: implement GGML_OP_TRI (#19089 ) * sycl: implement GGML_OP_TRI * docs: update ops.md for SYCL TRI * docs: regenerate ops.md * docs: update SYCL support for GGML_OP_TRI	2026-01-30 12:00:49 +08:00
DDXDB	d284baf1b5	Fix typos in SYCL documentation (#19162 ) * Fix typos in SYCL documentation * Update SYCL.md * Update SYCL.md * Update SYCL.md * Update docs/backend/SYCL.md Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> * Update SYCL.md --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-30 09:46:57 +08:00
Todor Boinovski	ce38a4db47	hexagon: enable offloading to Hexagon on Windows on Snapdragon (#19150 ) * hexagon: updates to enable offloading to HTP on WoS * Update windows.md * Update windows.md * hexagon: enable -O3 optimizations * hexagon: move all _WINDOWS conditional compilation to _WIN32 * hexagon: updates to enable offloading to HTP on WoS * hexagon: use run-time vs load-time dynamic linking for cdsp driver interface * refactor htp-drv * hexagon: add run-bench.ps1 script * hexagon: htdrv refactor * hexagon: unify Android and Windows build readmes * hexagon: update README.md * hexagon: refactor htpdrv * hexagon: drv refactor * hexagon: more drv refactor * hexagon: fixes for android builds * hexagon: factor out dl into ggml-backend-dl * hexagon: add run-tool.ps1 script * hexagon: merge htp-utils in htp-drv and remove unused code * wos: no need for getopt_custom.h * wos: add missing CR in htpdrv * hexagon: ndev enforecement applies only to the Android devices * hexagon: add support for generating and signing .cat file * hexagon: add .inf file * hexagon: working auto-signing and improved windows builds * hexagon: futher improve skel build * hexagon: add rough WoS guide * hexagon: updated windows guide * hexagon: improve cmake handling of certs and logging * hexagon: improve windows setup/build doc * hexagon: more windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * hexagon: windows readme updates * Update windows.md * Update windows.md * snapdragon: rename docs/backend/hexagon to docs/backends/snapdragon Also added a power shell script to simplify build env setup. * hexagon: remove trailing whitespace and move cmake requirement to user-presets * hexagon: fix CMakeUserPresets path in workflow yaml * hexagon: introduce local version of libdl.h * hexagon: fix src1 reuse logic gpt-oss needs a bigger lookahead window. The check for src[1] itself being quantized was wrong. --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-01-29 12:33:21 -08:00
Neo Zhang	d4964a7c66	sycl: fix norm kernels: l2_norm, group_norm, rms_norm by remove assert to support more cases (#19154 ) Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>	2026-01-29 09:20:22 +08:00
Sascha Rogmann	72d3b1898a	spec : add self‑speculative decoding (no draft model required) + refactor (#18471 ) * server: introduce self-speculative decoding * server: moved self-call into speculative.cpp * can_speculate() includes self-speculation Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * server: can_speculate() tests self-spec * server: replace can_speculate() with slot.can_speculate() Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * common: use %zu format specifier for size_t in logging Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * server: can_speculate() requires a task instance * common: ngram map, config self-speculative decoding * common: add enum common_speculative_type * common: add vector of speculative states * common: add option --spec-draftless * server: cleanup (remove slot.batch_spec, rename) * common: moved self-spec impl to ngram-map * common: cleanup (use common_speculative_state_draft) * spec : refactor * cont : naming * spec: remove --spec-config * doc: (draftless) speculative decoding * common: print performance in spec decoding * minor : cleanup * common : better names * minor : cleanup + fix build * minor: comments * CODEOWNERS: add common/ngram-map.* (#18471) * common : rename speculative.draftless_type -> speculative.type * ngram-map : fix uninitialized values * ngram-map : take into account the input can become shorter * ngram-map : revert len check for now * arg : change `--spec-draftless` -> `--spec-type` * spec : add common_speculative_state::accept() * spec : refactor + add common_speculative_begin() * spec : fix begin() call with mtmd * spec : additional refactor + remove common_speculative_params --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-01-28 19:42:42 +02:00
Ben Chen	0a95026da9	doc: add build instruction to use Vulkan backend on macos (#19029 )	2026-01-28 12:30:16 +01:00
David Lima	68ac3acb43	docs: Remove duplicated word on CUDA build section (#19136 )	2026-01-27 14:48:51 +01:00
Gaurav Garg	a83c73a18a	[CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full (#19042 ) * [CUDA] Reduce CPU-side stalls due to the CUDA command buffer being full With pipeline parallelism, during prompt processing, the CPU-side CUDA command buffer gets full, stalling the CPU. Due to this, enough work doesn't get submitted to the GPU, causing bubbles in the GPU timeline. Fix this by setting the CUDA environment variable CUDA_SCALE_LAUNCH_QUEUES to 4x to increase the command buffer size. * Set the env variable in the CUDA backend registry allocation * Add link to PR in code comment * Remove warning logs and update documentation	2026-01-27 08:52:44 +02:00
Francisco Herrera	293a1565dc	docs: add linux to index (#18907 )	2026-01-18 18:03:35 +08:00
Reese Levine	a89002f07b	ggml webgpu: support for backend sampling (#18880 ) * ggml webgpu: add SOFTPLUS unary operator Implements SOFTPLUS (log(1 + exp(x))) with f16/f32 support. Uses f32 precision for intermediate calculations to prevent f16 overflow. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * Follow Vulkan backend numerical stability pattern * ggml webgpu: add EXPM1 unary operator Implements EXPM1 (exp(x) - 1) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add FLOOR unary operator Implements FLOOR (rounds down to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add CEIL unary operator Implements CEIL (rounds up to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add ROUND unary operator Implements ROUND (rounds to nearest integer) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * ggml webgpu: add TRUNC unary operator Implements TRUNC (truncates towards zero) with f16/f32 support. * Add shader implementation and 4 variants (f32/f16, inplace/non-inplace) * Register pipelines and device support * docs : update WebGPU support for unary operators (FLOOR, CEIL, ROUND, TRUNC, EXPM1, SOFTPLUS) * Updates to webgpu get_memory * Add argmax * Add argmax,cumsum,sum,sum_rows * Add necessary CPY/GET_ROWS operators * Support for argsort using multi-pass strategy * Update set_rows for i32 indices, move to pre-wgsl * Port unary operators to pre-wgsl and support FILL * Implement PAD * Add support for top-k * clean up, scope pipeline init mutex * fix newline * Add support for log * Update LOG for better precision, and ops doc --------- Co-authored-by: Abhijit Ramesh <abhijitramesh2k@gmail.com>	2026-01-16 16:12:43 -08:00
hipudding	6ba6a3c76f	docs : update ops.md for CANN backend (#18654 )	2026-01-16 13:32:17 +01:00
Xuan-Son Nguyen	c15395f73c	common : implement new jinja template engine (#18462 ) * jinja vm * lexer * add vm types * demo * clean up * parser ok * binary_expression::execute * shadow naming * bin ops works! * fix map object * add string builtins * add more builtins * wip * use mk_val * eval with is_user_input * render gemma tmpl ok * track input string even after transformations * support binded functions * keyword arguments and slicing array * use shared_ptr for values * add mk_stmt * allow print source on exception * fix negate test * testing more templates * mostly works * add filter_statement * allow func to access ctx * add jinja-value.cpp * impl global_from_json * a lot of fixes * more tests * more fix, more tests * more fixes * rm workarounds * demo: type inferrence * add placeholder for tojson * improve function args handling * rm type inference * no more std::regex * trailing spaces * make testing more flexible * make output a bit cleaner * (wip) redirect minja calls * test: add --output * fix crash on macro kwargs * add minimal caps system * add some workarounds * rm caps_apply_workarounds * get rid of preprocessing * more fixes * fix test-chat-template * move test-chat-jinja into test-chat-template * rm test-chat-jinja from cmake * test-chat-template: use common * fix build * fix build (2) * rename vm --> interpreter * improve error reporting * correct lstrip behavior * add tojson * more fixes * disable tests for COMMON_CHAT_FORMAT_GENERIC * make sure tojson output correct order * add object.length * fully functional selectattr / rejectattr * improve error reporting * more builtins added, more fixes * create jinja rendering tests * fix testing.h path * adjust whitespace rules * more fixes * temporary disable test for ibm-granite * r/lstrip behavior matched with hf.js * minimax, glm4.5 ok * add append and pop * kimi-k2 ok * test-chat passed * fix lstrip_block * add more jinja tests * cast to unsigned char * allow dict key to be numeric * nemotron: rm windows newline * tests ok * fix test * rename interpreter --> runtime * fix build * add more checks * bring back generic format support * fix Apertus * [json.exception.out_of_range.403] key 'content' not found * rm generic test * refactor input marking * add docs * fix windows build * clarify error message * improved tests * split/rsplit with maxsplit * non-inverse maxsplit forgot to change after simplifying * implement separators for tojson and fix indent * i like to move it move it * rename null -- > none * token::eof * some nits + comments * add exception classes for lexer and parser * null -> none * rename global -> env * rm minja * update docs * docs: add input marking caveats * imlement missing jinja-tests functions * oops * support trim filter with args, remove bogus to_json reference * numerous argument fixes * updated tests * implement optional strip chars parameter * use new chars parameter * float filter also has default * always leave at least one decimal in float string * jinja : static analysis + header cleanup + minor fixes * add fuzz test * add string.cpp * fix chat_template_kwargs * nits * fix build * revert * unrevert sorry :) * add fuzz func_args, refactor to be safer * fix array.map() * loosen ensure_vals max count condition, add not impl for map(int) * hopefully fix windows * check if empty first * normalize newlines --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-01-16 11:22:06 +01:00
Max Krasnyansky	cff777f226	hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822 ) * hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars * hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32 Factore out all hvx_copy functions into hvx-copy.h header and reduced code duplication. Update HTP ops infra to support OP_CPY * hexagon: cleanup and refactor hex/hvx/htp headers and helper libs hex is basically all scalar/core platform stuff (L2, DMA, basic utils) hvx is all hvx related utils, helpers, etc htp is higher level stuff like Ops, etc hvx-utils library got a nice round of cleanup and refactoring to reduce duplication use hvx_vec_store_a where possible * hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h Moved sigmoid and tanh vector functions from hvx-utils.h to a new header hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid array processing using a macro pattern similar to hvx-copy.h. Updated act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed unused hvx-sigmoid.c. * hexagon: factor out hvx-sqrt.h * hexagon: mintor update to hvx-utils.h * hexagon: remove spurios log * hexagon: factor out and optimize hvx_add/sub/mul * hexagon: remove _opt variants of add/sub/mul as they simply fully aligned versions * hexagon: refactor reduction functions to hvx-reduce.h Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`. Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`. Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic. Updated `softmax-ops.c` to use the new functions. * hexagon: refactor the rest of arithmetic functions to hvx-arith.h Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`. Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested. * hexagon: refactor hvx_sum_of_squares_f32 - Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature. - Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`. - Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`. - Update existing reduction functions in `hvx-reduce.h` to use the updated macro. - Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature. * hexagon: use hvx_splat instead of memset * hexagon: consistent use of f32/f16 in all function names to match the rest of GGML * hexagon: fix hvx_copy_f16_f32 on v75 and older * hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL * scripts: update snapdragon/adb scripts to enable host param	2026-01-14 21:46:12 -08:00
Piotr Wilkin (ilintar)	d98b548120	Restore clip's cb() to its rightful glory - extract common debugging elements in llama (#17914 ) * Extract common debugging functions; plug eval-callback and mtmd's MTMD_DEBUG_GRAPH with same functionality * Move to common * Remove unneeded header * Unlink from common * chore: update webui build output * Cleanup; properly pass params to mtmd without depending on common; factorize debug.cpp to use common debug code. * Revert change to webapp * Post-merge adjust * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Apply code review changes * Remove changes to server-context * Remove mtmd.h include * Remove utility functions from header * Apply suggestions from code review Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Rename functions * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> * Update tools/mtmd/clip.cpp Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>	2026-01-14 20:29:35 +01:00
Adrien Gallouët	516a4ca9b5	refactor : remove libcurl, use OpenSSL when available (#18828 )	2026-01-14 18:02:47 +01:00
Xuan-Son Nguyen	23f82f2420	preset: allow named remote preset (#18728 ) * preset: allow named remote preset * nits: fix docs * cont docs	2026-01-10 15:12:29 +01:00
Aaron Teo	2656c0d265	docs(ggml): update backend ops (#18734 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-01-10 18:48:17 +08:00

1 2 3 4 5

250 Commits