llama.cpp

Commit Graph

Author	SHA1	Message	Date
Ruben Ortlam	128142fe7d	test-backend-ops: allow loading tests from file and parsing model operators into file (#19896 ) * tests: allow loading test-backend-ops tests from json * add error threshold based on op * add error when file cannot be read * add graph operator json extraction tool * add nb parameter for non-contiguous input tensors * fix view check * only use view if non-contiguous/permuted, use C++ random instead of rand() * replace internal API calls with public llama_graph_reserve call * reduce test description length * fix nb[0] not getting set for view * add name to tests * fix inplace error * use text file instead of json * move llama_graph_reserve function to new llama-ext header, move export-graph-ops to tests/ * fix missing declaration * use pragma once * fix indent * fix Windows build	2026-03-12 13:26:00 +01:00
Asbjørn Olling	0a10c34dc1	grammar: Fix grammar root symbol check (#19761 ) * grammar: fix bad check for root symbol, correct error logging * add tests to demonstrate root symbol check failure	2026-03-12 12:04:56 +01:00
ProgenyAlpha	deee23863b	vulkan: add GATED_DELTA_NET op support (#20334 ) * vulkan: add GATED_DELTA_NET op support Implements the fused gated delta net recurrence as a Vulkan compute shader with full support for scalar gate, KDA vector gate, GQA broadcast, multi-token sequences, and permuted (non-contiguous) q/k inputs. Specialization constants select head size (32/64/128) and KDA mode at pipeline creation time. Passes all 13 test-backend-ops cases on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: optimize GATED_DELTA_NET shader (Phase 1) - vec4 dot products on all inner loops (dp4 hardware intrinsic) - Cache exp(g) in shared memory for KDA path, eliminating ~32K redundant global reads and ~16K redundant exp() calls per token - vec4 fused decay + rank-1 update (3 vec4 ops vs 12 scalar ops) - Add perf benchmark cases for GATED_DELTA_NET to test-backend-ops KDA TG: +5.4% throughput. Non-KDA: no regressions. 13/13 test-backend-ops passing on AMD Radeon 890M (RADV GFX1150). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: address review feedback for GATED_DELTA_NET Pipeline array refactor [3][2], A_TYPE/D_TYPE/FLOAT_TYPE shader macros, scale in push constants, supports_op fix, dispatch restructuring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: use FLOAT_TYPE for buffer/shared declarations, align formatting Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: add explicit FLOAT_TYPE casts for buffer loads Wrap data_q, data_k, and data_g buffer reads with FLOAT_TYPE() casts to ensure correct behavior across all Vulkan configurations. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * vulkan: fix Q/K broadcast for interleaved head layout Adapt to the interleaved broadcast convention from #20340: head_id / rq1 → head_id % neq1 Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Progeny Alpha <ProgenyAlpha@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-03-12 11:32:04 +01:00
Jeff Bolz	246ffc4b05	vulkan: fix l2_norm epsilon handling (#20350 )	2026-03-12 06:39:41 +01:00
Mishusha	a8304b4d27	common/parser: add GigaChatV3/3.1 models support (#19931 ) Co-authored-by: Mishusha <pmv26021975@gmail.com>	2026-03-12 01:22:25 +01:00
Georgi Gerganov	d28961d81e	llama : enable chunked fused GDN path (#20340 ) * llama : enable chunked fused GDN path * models : avoid Q and K repeats when using fused GDA * cont : fix comment Co-authored-by: Aman Gupta <amangupta052@gmail.com> * cont : fix the fix Co-authored-by: Aman Gupta <amangupta052@gmail.com> * cont : fix * metal : add GDN kernel (#20361) * metal : add Metal backend for GGML_OP_GATED_DELTA_NET Add a fused Metal kernel for the gated delta net recurrence op (#19504), enabling GPU-accelerated inference for DeltaNet-based models (Qwen3.5, etc.) on Apple Silicon. Supports both GDA (scalar gate) and KDA (per-row gate) modes with head_size 64 and 128. Unsupported configurations (head_size 32, non-contiguous tensors) gracefully fall back to CPU. Performance: Qwen3.5-0.8B Q4_K_M on M4 Max tg128: 170 -> 213 t/s (+25%) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * metal : validate contiguity of all input tensors in supports_op Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * metal : add algorithm equivalence comment for GDA decay path Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * cont : unslop + optimize * cont : clean-up --------- Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> * CUDA: AR gated delta net improvements (#20391) * Add FastDiv to gated_delta_net_cuda * Shard columns across warps This reduces register pressure (avoids spill for S_v = 128) and gives the warp-scheduler more CTAs to schedule (thus hiding data-access latencies). * Remove unneded include in gated_delta_net.cu * Improve comments * Apply code-formating * Make sharding HIP-compatible 1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly 2. Add test with partial warp to test sum reduction on CUDA * Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t * Rename variables * Enable GDN also for prefill, move TODO for chunked_GDN * Actually remove the TODO from `2068908975` * Get warp size at runtime warp_size is not known at compile time in hip host code. * Don't expose ggml_cuda_get_physical_warp_size on host --------- Co-authored-by: uvos <devnull@uvos.xyz> * llama : refactor llm_build_delta_net_base API --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com> Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Oliver Simons <osimons@nvidia.com> Co-authored-by: uvos <devnull@uvos.xyz>	2026-03-11 22:46:40 +02:00
Richard Davison	5eae9cb1d9	ggml : add NVFP4 quantization type support (#19769 ) * WIP: add NVFP4 quantization support * tests * improve NVFP4 dot product implementation performance and fix bad super call * typo * Use nvfp4 kvalues * vulkan : fix NVFP4 shader compilation by including kvalues_mxfp4 lookup table * vulcal and perf fixes * wip * Fix metal * fix vulcan * Rename threshold & fix wrong scale * Fix MOE * Shelf backend implementations (CUDA, Metal, Vulkan, arch-specific SIMD) Remove NVFP4 support from GPU backends and architecture-specific optimized dot products. These should be added in separate PRs so backend specialists can review them independently. Reverted files: - ggml-cuda: common.cuh, convert.cu, mmq.cu/cuh, mmvq.cu, vecdotq.cuh, quantize.cu/cuh, mma.cuh, ggml-cuda.cu, fattn-tile.cuh - ggml-metal: ggml-metal.metal, ggml-metal-device.cpp, ggml-metal-impl.h, ggml-metal-ops.cpp - ggml-vulkan: ggml-vulkan.cpp, all vulkan-shaders/* - ggml-cpu arch: arm/quants.c, x86/quants.c, powerpc/quants.c, s390/quants.c Core NVFP4 support (type definition, CPU fallback dot product, quantization, dequantization, conversion) is retained. * Fix arch-fallback.h: add NVFP4 generic fallback for all platforms After shelving backend-specific SIMD implementations, the generic CPU dot product needs to be aliased on ARM, x86, PowerPC, and s390 platforms that previously relied on arch-specific versions. * quantize: add NVFP4 as a quantization type option * Fix ggml_fp32_to_ue4m3: handle subnormal values Previously, values with ue4m3_exp <= 0 were clamped to 0, causing all small scales to underflow. This made NVFP4 quantization via llama-quantize produce garbage (PPL = 5.8M) since typical transformer weights have amax/6.0 in the range 0.001-0.01, which falls in the UE4M3 subnormal range. Now subnormals are properly encoded as man * 2^-9 (exp=0, man=1..7), matching the decode path in ggml_ue4m3_to_fp32. Result: NVFP4 requantization now produces PPL = 15.25 (vs F16 = 14.33), comparable to Q4_1 (PPL = 15.81) at slightly lower BPW (4.70 vs 5.15). * Restore ARM NEON NVFP4 dot product implementation Restores the optimized ggml_vec_dot_nvfp4_q8_0 for ARM NEON using vqtbl1q_s8 lookup and ggml_vdotq_s32 dot products. tg128 performance: 4.37 t/s (generic) -> 13.66 t/s (NEON) = 3.1x speedup * Optimize ARM NEON NVFP4 dot product: LUT + vpaddq + vfmaq - Add ue4m3_scale_lut[128] to ggml-common.h replacing branch-heavy ggml_ue4m3_to_fp32() in the hot loop - Use vpaddq_s32 for pairwise int32 reduction instead of vaddvq_s32 - Accumulate with vfmaq_f32 into float32x4_t vector accumulators tg128: 8.1 -> 31.0 t/s (3.8x speedup, 77% of Q4_1 speed) * ARM NEON NVFP4: rearrange q8 to match nibble layout Alternative approach: rearrange q8 data to match the NVFP4 lo/hi nibble layout instead of rearranging the looked-up NVFP4 values. Eliminates vcombine_s8(vget_low, vget_low) shuffles. Performance is equivalent (~18.5 t/s) - the bottleneck is the 2x block overhead from QK=16 vs QK=32, not the shuffle instructions. * CPU only backend 64 super-block layout * cleanup * Remove unused LUT * int * exclude NVFP4 from unsupported ops in metal build * remove quantization for now * store scales as native UE4M3, preserve original model bits when possible * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * correct comment * format * reduce duplication and cleanup * Address comments * move detection to prepare_tensors * Use math instead of const * Move * fix comment * Shelf quantize tests * Rebase and move check * cleanup * lint * Update gguf-py/gguf/scripts/gguf_convert_endian.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Use fallback quant config * Simplify Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * organize * Refactor * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * add quantize_nvfp4 (required for test_quants.py) * fix return type --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-11 21:02:54 +01:00
Piotr Wilkin (ilintar)	acb7c79069	common/parser: handle reasoning budget (#20297 ) * v1 * Finished! * Handlie cli * Reasoning sampler * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Less explosive terminology :) * Add utf-8 case and tests * common : migrate reasoning budget sampler to common * cont : clean up * cont : expose state and allow passing as initial state * cont : remove unused imports * cont : update state machine doc string --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Alde Rojas <hello@alde.dev>	2026-03-11 10:26:12 +01:00
Ray Xu	8d880ac012	examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968 ) * Fix logic for retrieving schema items in `json_schema_to_grammar.py` If `schema['items']` is `{}` and `prefixItems not in schema', as `{}` is Falsy, the original code here will raise an error. I think if `schema['items']` is `{}`, them items should just be `{}` * Apply suggestion from @CISC Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Add tests for arrays with empty items Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case. --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-10 14:38:18 +01:00
Aldehir Rojas	c96f608d98	common: consolidate PEG string parsers (#20263 ) * common : consolidate PEG string parsers * cont : fix json_string_content()	2026-03-10 00:29:21 +01:00
Piotr Wilkin (ilintar)	f76565db92	common: map developer role to system (#20215 ) * Map developer role to system * Simplify	2026-03-09 14:25:11 +01:00
Piotr Wilkin (ilintar)	97c64fbdbd	PEG parser for LFM2 (#20251 ) * PEG parser for LFM2 * Simplify using python_value()	2026-03-09 01:11:22 +01:00
Aldehir Rojas	451ef08432	common : gracefully handle incomplete output (#20191 ) * common : handle incomplete UTF-8 at end of input in PEG parser * cont : if reached end prematurely, emit needs_more_input to propagate partial output * cont: refactor peg parse context to add lenient flag * cont : remove partial flag, keep lenient flag	2026-03-08 17:17:02 +01:00
Johannes Gäßler	a976ff081b	llama: end-to-end tests (#19802 ) * tests: add end-to-end tests per model architecture * fixup for rebase * fix use-after-free in llama-model-loader.cpp * fix CI * fix WebGPU * fix CI * disable CI for macOS-latest-cmake-arm64 * use expert_weights_scale only if != 0.0f * comments	2026-03-08 12:30:21 +01:00
Piotr Wilkin (ilintar)	b283f6d5b3	Revert to OAI-compatible args (#20213 ) * Revert to OAI-compatible args * Apply workaround::func_args_not_string	2026-03-08 11:33:03 +01:00
Aman Gupta	c5a778891b	ggml: add GATED_DELTA_NET op (#19504 ) * ggml: add GATED_DELTA_NET op * remove the transpose * add KDA * add qwen35 dense * llama : check for fused gated delta net backend support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-07 15:41:10 +08:00
Piotr Wilkin (ilintar)	2f2923f895	Autoparser: add optional argument reshuffle capability (#20171 ) * Allow reshuffled arguments in tagged argument parser format tool calls. * Remove shuffle just keep the optional parsers in any order * Remove unnecessary import	2026-03-06 22:34:15 +01:00
Piotr Wilkin (ilintar)	566059a26b	Autoparser - complete refactoring of parser architecture (#18675 ) * Autoparser - full single commit squish * Final pre-merge changes: minor fixes, Kimi 2.5 model parser	2026-03-06 21:01:00 +01:00
Aman Gupta	1e38a7a6fa	CUDA: use shared mem for ssm_conv (#20128 ) * CUDA: use shared mem for ssm_conv * fuse silu + ssm_conv * fuse unary + mul * enable for fp16 * formatting Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-06 23:09:59 +08:00
Marcel Petrick	92f7da00b4	chore : correct typos [no ci] (#20041 ) * fix(docs): correct typos found during code review Non-functional changes only: - Fixed minor spelling mistakes in comments - Corrected typos in user-facing strings - No variables, logic, or functional code was modified. Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> * Update docs/backend/CANN.md Co-authored-by: Aaron Teo <taronaeo@gmail.com> * Revert "Auxiliary commit to revert individual files from 846d1c301281178efbc6ce6060ad34c1ebe45af8" This reverts commit 02fcf0c7db661d5ff3eff96b2b2db9fdb7213256. * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update tests/test-backend-ops.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Marcel Petrick <mail@marcelpetrick.it> Co-authored-by: Aaron Teo <taronaeo@gmail.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-05 08:50:21 +01:00
Masashi Yoshimura	36a7a6589c	ggml-webgpu: Support non-contiguous `src0` and overlapping `src0/src1` in binary ops (#19850 ) * ggml-webgpu: Add binary op support for overlapping and non-contiguous. * Add newline to binary.wgsl * Append the test of binary op for src overlapping to test_bin_bcast. * Remove unnecessary newline.	2026-03-02 07:59:53 -08:00
Bartowski	d979f2b176	tests : model metadata loading from huggingface (#19796 ) * Add model metadata loading from huggingface for use with other tests * Add incremental chunking instead of full redownload, fix caching issue and add warning when it fails * Add support for split models, load metadata from each individual split file, also avoid mmproj * Code cleanup, revert incremental downloading * Only compile when cpp-httplib has SSL support * Fix formatting	2026-02-28 10:44:38 +01:00
Jan Patrick Lehr	a8b192b6ec	tests : enable test-chat out of tree build (#19558 ) The binary relies on model files that it tries to find. However, when configuring the build directory to be parallel to the source tree those heuristics fail. This sets the working directory for the test executable to be the source-tree which resolves this issue.	2026-02-27 05:37:54 +01:00
Maximilian Werk	66287bdaac	model : add Jina Embeddings v5 Nano (partial EuroBERT) support (#19826 ) * WIP: Add EuroBERT support with autoformatting changes This commit includes: - EuroBERT model implementation for GGUF conversion - C++ backend support for EuroBERT architecture - Unintended autoformatting changes to Python files Saving before reverting formatting-only changes. * feat: add back eos assert when not last token pooling * feat: removed duplicated code and cleanup * feat: removed not working architectures and unnecessary check * fix: typo * fix: dynamic pooling config * feat: added an example model for eurobert * feat: proper llama-vocab implementation for jina-v5 * fix: removed unnecessary comments	2026-02-26 12:14:09 +01:00
Georgi Gerganov	418dea39ce	ggml/gguf : prevent integer overflows (#19856 ) * gguf : prevent integer overflow for ggml_context mem size * ggml : fix int overflows in ggml_new_object() * gguf : prevent string exhaustion * gguf : prevent array elements exhaustion * ggml : fix negative tensor type oob * py : assert that alignment is non-zero power of 2 * ggml : check int overflow in ggml_new_tensor_impl and ggml_new_object * gguf-py : error on duplicate keys when reading * py : restore tensor_fields * enforce proper alignment in add_custom_alignment * gguf : better name * gguf : fix ctx size for no_alloc == true * gguf : minor print fix * ggml : print values when overflow * ggml : remove deprecated ggml_type_sizef() * ggml : relax ggml_type asserts to debug-only * gguf : add mem_size overflow test * gguf : add file size check for arrays * ggml : relax asseerts for ggml_get_type_traits() * flake8 fix --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-24 20:17:11 +02:00
Daniel Bevenius	d8aeb65cee	tests : fix typos in comments in test-backend-sampler [no ci] (#19824 ) * tests : fix typos in comments in test-backend-sampler [no ci]	2026-02-23 17:12:02 +01:00
Xuan-Son Nguyen	5452d736f8	jinja: correct stats for tojson and string filters (#19785 )	2026-02-22 21:08:23 +01:00
Aldehir Rojas	94b0200a01	common : merge qwen3-coder and nemotron nano 3 parsers (#19765 ) * common : migrate qwen3-coder to PEG parsing variant * cont : add JSON parameter test	2026-02-20 23:22:22 +01:00
Jeff Bolz	77d6ae4ac8	test: mul_mat tests with huge batch size (#19519 )	2026-02-19 20:08:25 -06:00
Jesse Posner	3dadc88b58	common : fix Step-3.5-Flash format detection and thinking support (#19635 ) * common : fix Step-3.5-Flash format detection and thinking support Step-3.5-Flash uses the same XML-style tool call format as Qwen3-Coder (<tool_call><function=...><parameter=...>) but its Jinja template lacks the bare <function> and plural <parameters> markers that the detection logic previously required. This caused it to fall through to Hermes 2 Pro, which doesn't call func_args_not_string(), so arguments stayed as JSON strings and templates using arguments\|items crashed. Additionally, the Qwen3-Coder-XML format handler had no thinking support. Models like Step-3.5-Flash that unconditionally emit <think> in their generation prompt need the same thinking_forced_open handling that Nemotron v3 and Hermes 2 Pro already have, otherwise reasoning_content is never separated from content in API responses. Changes: - Relax Qwen3-Coder XML detection to only require the 3 shared markers - Tighten Nemotron v3 branch to also require bare <function> and plural <parameters>, preventing Step-3.5-Flash from being misrouted via <think> - Add thinking_forced_open support to Qwen3-Coder-XML init function - Add <think>/</think> to preserved tokens - Fix build_grammar_xml_tool_call to handle thinking_forced_open in the grammar root rule, allowing </think> before tool calls - Add Step-3.5-Flash chat template and format detection test Builds on: https://github.com/ggml-org/llama.cpp/pull/19283 * chat : route Step-3.5-Flash to Nemotron v3 PEG parser, add tests Step-3.5-Flash uses the same XML tool call format as Qwen3-Coder and Nemotron 3 Nano (<tool_call>/<function=...>/<parameter=...>) but with unconditional <think> output. Route it to the Nemotron v3 PEG parser for streaming and schema-aware parameter parsing. Detection: templates with <think> + XML tool tags use Nemotron v3 PEG parser; templates without <think> (Qwen3-Coder) use GBNF grammar. Tests cover: basic messages, tool calls with/without thinking content, parallel tool calls, code string parameters, optional </parameter> closing tags, and JSON schema response format. * chat : remove dead thinking code from qwen3_coder_xml Remove thinking handling code that became unreachable after routing Step-3.5-Flash to the Nemotron v3 PEG parser. Qwen3-Coder has no <think> in its template, so the thinking_forced_open logic, preserved tokens, and grammar prefix were dead paths.	2026-02-19 22:40:52 +01:00
Piotr Wilkin (ilintar)	8a70973557	Add Jinja support for "indent" string filter (#19529 ) * Add partial Jinja support for "indent" string filter * Fully implement indent * Add tests for all width variants. * Update tests/test-jinja.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix getline ignoring trailing newlines * Update common/jinja/value.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix first indent condition --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-02-19 00:25:52 +01:00
Georgi Gerganov	08e6d914b8	ggml : avoid UB in gemm ukernel (#19642 )	2026-02-15 14:56:35 +02:00
Jeff Bolz	dbb023336b	vulkan: support L2_NORM with contiguous rows (#19604 )	2026-02-14 06:42:04 +01:00
ymcki	0e21991472	fix vulkan ggml_acc only works in 3d but not 4d (#19426 ) * fix vulkan ggml_acc only works in 3d but not 4d * removed clamp in test_acc_block * use the correct stride and its test case * cuda : fix "supports op" condition * change src0 to src1 in ggml_vk_acc. Update acc.comp with jeffbolznv\'s suggestion except to keep the boundary check * version without boundary check * revert back to boundary check version --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-02-13 13:31:37 +01:00
Georgi Gerganov	490eb96b88	metal : support GGML_OP_SET (#19548 )	2026-02-13 07:34:52 +02:00
Georgi Gerganov	3b3a948134	metal : update sum_rows kernel to support float4 (#19524 )	2026-02-12 11:35:28 +02:00
Georgi Gerganov	914dde72ba	ggml : unary ops support non-cont src0 + metal F16 unary ops (#19511 ) * ggml : unary ops support non-cont src0 * metal : support F16 unary ops + fix ELU	2026-02-11 18:58:43 +02:00
Georgi Gerganov	89181c0b6d	ggml : extend bin bcast for permuted src1 (#19484 ) * tests : extend bin bcast for permuted src1 * cont : extend bin support * cont : s0 is always 1 * tests : simplify	2026-02-11 07:52:00 +02:00
Georgi Gerganov	ceaa89b786	metal : consolidate unary ops (#19490 )	2026-02-11 07:51:12 +02:00
Xuan-Son Nguyen	9a96352729	test: fix IMROPE perf test case (#19465 )	2026-02-10 14:37:50 +01:00
Georgi Gerganov	a0d585537c	cuda : extend GGML_OP_PAD to work with non-cont src0 (#19429 ) * cuda : extend GGML_OP_PAD to work with non-cont src0 * tests : add permuted pad	2026-02-10 08:07:16 +02:00
Hugo	1e8924fd65	cmake : add variable to skip installing tests (#19370 ) When packaging downstream, there's usually little point in installing test. The default behaviour remains the same.	2026-02-09 07:12:02 +01:00
Jeff Bolz	db6adb3c88	tests: reduce number of FA test permutations (#19381 ) Only test non-F16 for head size 64 and 72 (one a multiple of QK, one not).	2026-02-06 08:50:30 -06:00
Jeff Bolz	449ec2ab07	vulkan: Preprocess FA mask to detect all-neg-inf and all-zero. (#19281 ) Write out a 2-bit code per block and avoid loading the mask when it matches these two common cases. Apply this optimization when the mask is relatively large (i.e. prompt processing).	2026-02-05 09:26:38 -06:00
Georgi Gerganov	eaba92c3dc	tests : add non-cont, inplace rope tests (#19296 ) * tests : add non-cont, inplace rope tests * cont : exercise dim 3 Co-authored-by: Jeff Bolz <jbolz@nvidia.com> * cont : more dim3 exercises --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2026-02-04 12:45:21 +02:00
Sid Mohan	0dfcd3b607	jinja : add missing 'in' test to template engine (#19004 ) (#19239 ) * jinja : add missing 'in' test to template engine (#19004) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> * reuse test_is_in in binary op --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-02-02 21:00:55 +01:00
Aman Gupta	9f682fb640	ggml-cpu: FA split across kv for faster TG (#19209 ) * ggml-cpu: split across kv for faster TG * simplify sinks application * add ref impl	2026-02-03 01:19:55 +08:00
Christian Kastner	7a4ca3cbd9	docs : Minor cleanups (#19252 ) * Update old URLs to github.com/ggml-org/ * Bump copyrights	2026-02-02 08:38:55 +02:00
Georgi Gerganov	c3b87cebff	tests : add GQA=20 FA test (#19095 )	2026-01-30 13:52:57 +02:00
Aldehir Rojas	7b7ae857f6	chat : add parsing for solar-open-100b (#18540 ) * chat : add parsing for solar-open-100b * add comments to rules * cont : make assistant start optional * cont : remove assistant start prefix altogether --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-01-29 16:06:15 +01:00

1 2 3 4 5 ...

683 Commits