llama.cpp

Commit Graph

Author	SHA1	Message	Date
Georgi Gerganov	0be6c7c9ce	ggml : bump version to 0.9.9 (ggml/1449)	2026-03-31 14:00:41 +03:00
Adrien Gallouët	41361c8599	common : move up common_init() and fix Windows UTF-8 logs (#21176 ) The build info is now only for debug, so we avoid the duplicate with `--version`. The UTF-8 setup at the beginning is needed to avoid logging garbage on Windows. Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-31 12:53:41 +02:00
Neo Zhang	62278cedde	sycl : enhance fattn perf (#21185 )	2026-03-31 13:31:50 +03:00
mtmcp	90aa83c6bd	common: add bounds check in common_init_result::sampler to prevent segfault on failed model load (#21082 ) * common: add bounds check in common_init_result::sampler to prevent segfault on failed model load * Revert `a308e584ca` * Add regression test * Remove regression test for init-fail sampler check	2026-03-31 13:04:42 +03:00
SATISH K C	fcc2d598c8	fix: include API key in CORS proxy requests for MCP connections (#21193 ) * fix: include API key in CORS proxy requests for MCP connections When llama-server is started with --api-key-file and --webui-mcp-proxy, the /cors-proxy endpoint requires authentication. The WebUI was not including the Authorization header in proxy requests, causing MCP connections to fail with 401. Inject getAuthHeaders() into requestInit when useProxy is true so the proxy request carries the Bearer token alongside the forwarded target headers. Fixes #21167 * fix: simplify headers assignment based on reviewer suggestion Apply buildProxiedHeaders only when useProxy is true, pass headers directly to the transport otherwise.	2026-03-31 10:52:34 +02:00
Piotr Wilkin (ilintar)	4453e77561	server/webui: cleanup dual representation approach, simplify to openai-compat (#21090 ) * server/webui: cleanup dual representation approach, simplify to openai-compat * feat: Fix regression for Agentic Loop UI * chore: update webui build output * refactor: Post-review code improvements * chore: update webui build output * refactor: Cleanup * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-31 10:42:06 +02:00
Adrien Gallouët	26dac845cc	vendor : update BoringSSL to 0.20260327.0 (#21211 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-31 09:21:54 +02:00
Galunid	5ce013cd7e	common : Disable backend sampling if reasoning budget is enabled (#21209 )	2026-03-31 10:14:01 +03:00
shaofeiqi	08f21453ae	opencl: add q4_K gemm and gemv kernels for Adreno (#20919 ) * opencl: add q4_K gemm and gemv kernels for Adreno * opencl: fix whitespace * opencl: add workarounds for compiler bugs on older devices * opencl: handle fp16 denorm on X Elite * opencl: fix kernel build error * opencl: fix whitespace * opencl: make q4_K cvt kernels signature consistent --------- Co-authored-by: Li He <lih@qti.qualcomm.com>	2026-03-30 12:19:16 -07:00
Seungmin Kim	84ae8434d0	CI : Enable CUDA and Vulkan ARM64 runners and fix CI/CD (#21122 ) * CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com> * Obtain source tag name from git tag Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-30 20:24:37 +02:00
Zhihao "Zephyr" Yao	ead417f01c	jinja : handle empty expressions correctly (#20913 ) * Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments(). * Treat empty computed member expressions with Jinja2 undefined semantics Treat empty computed member expressions like `a[]` as undefined instead of raising a parser error, to match Jinja2 behavior. - return a noop expression for empty computed member arguments - return undefined when a computed member key evaluates to undefined - add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined` * Handle undefined computed member properties Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`. * Use default undefined value in member access Initialize val and then return it when property is undefined. Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * empty statement parses to blank_expression instead of noop_statement --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-30 20:08:46 +02:00
Oliver Simons	64ac9ab66a	CUDA : Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 (#21181 ) * CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1 We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`, while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we had uninitialized values in `offset_iterator[nrows]` for the case when `nrows % block_size == 0`. Fixes #21162 * Reduce nrows in test case to 256, don't need 768	2026-03-30 16:20:00 +02:00
Radoslav Gerganov	cad2d3884c	rpc : fix misleading error log (#21184 ) When RPC is running with a remote backend which doesn't have init_tensor function (like CPU and Metal), the server log gets full with error messages saying that init_tensor is being called with null buffer which is incorrect. This patch fixes this.	2026-03-30 17:05:11 +03:00
Aleksander Grygier	389c7d4955	webui: Fix branching logic on edit message (#21175 ) * fix: Branching logic + small refactor * chore: update webui build output	2026-03-30 14:40:50 +02:00
Aman Gupta	278521c33a	llama-model-loader: print warning when using overrides with mmap (#20978 ) * llama-model-loader: use pinned memory for tensor overrides * change to warning	2026-03-30 17:40:17 +08:00
Sigbjørn Skjæret	e2eb39e81c	ci : bump ty to 0.0.26 (#21156 ) * fix incorrect type ignore comments * bump ty to 0.0.26	2026-03-30 09:29:15 +02:00
Xuan-Son Nguyen	abf9a62161	server: wrap headers for mcp proxy (#21072 ) * server: wrap headers for mcp proxy * Update tools/server/server-cors-proxy.h Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix build * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-30 08:59:16 +02:00
Sigbjørn Skjæret	7c203670f8	add missing ROPE_FACTORS_LONG/SHORT for MiniCPM (#21150 )	2026-03-29 19:45:40 +02:00
Gaurav Garg	ec16a072f0	Optimize MOE GEMV kernel for BS > 1. (#20905 ) * Optimize MOE GEMV kernel for BS > 1. The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row. New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync). This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization. * Remove em-dashes * Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8 * Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype --------- Co-authored-by: Aman Gupta <amangupta052@gmail.com>	2026-03-29 18:35:18 +02:00
Max Krasnyansky	f5d1c4179f	hexagon: dma optimizations (mostly fixing regressions) (#21137 ) * hex-fa: add simple dma cache for Mask I noticed that we were refetching the mask rows over and over. This simple cache avoids that. * hex-dma: unset in-order desc bit which caused significant perf regression We don't rely on true in order processing of the DMA descriptors anywhere. Turns out this mode caused significant regression of around 3-4 TPS during token gen. * hex-rope: update comment to clarify that we don't need in-order DMA completions	2026-03-29 06:40:13 -07:00
Davi Henrique Linhares	2405d59cb6	devops: including compute-runtime for intel.Dockerfile (#21076 )	2026-03-29 13:34:03 +08:00
Neo Zhang	afe65aa282	[SYCL] Enhance build script to use half cores to build, avoid OS hang (#21093 ) * use half cores to build, avoid OS hang * reduce the output text num to short test time * avoid to return 0	2026-03-29 09:02:45 +08:00
Sigbjørn Skjæret	65097181e4	fix **/x glob matching (#21129 )	2026-03-28 22:27:38 +01:00
Piotr Wilkin (ilintar)	98ae0a0d36	common/parser: fix handling of tool definition with missing properties key (#21128 )	2026-03-28 20:41:32 +01:00
Sigbjørn Skjæret	3a14a542f5	common : add character class support to glob_match (#21111 ) * add character class support to glob_match * remove pointless reference	2026-03-28 19:57:37 +01:00
BlueMöhre	968189729f	WebUI: Replace illegal nested button elements (#21026 ) * remove/replace nested button elements * map rest props to outer element * solve TODO * chore: update webui build output	2026-03-28 17:57:59 +01:00
Adrien	e397d3885c	common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124 ) The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV when a JSON schema "pattern" field contains a non-capturing group (?:...). Root cause: when the parser sees '(' followed by '?', it pushes a warning but does not advance past '?:'. The recursive transform() call then interprets '?' as a quantifier and calls seq.back() on an empty vector, causing undefined behavior. This commonly occurs when serving OpenAI-compatible tool calls from clients that include complex regex patterns in their JSON schemas (e.g., date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$). The fix: - Skip '?:' after '(' to treat non-capturing groups as regular groups - For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely, handling escaped characters to avoid miscounting parenthesis depth - Adjust the ')' unbalanced-parentheses check using direct char comparisons instead of substr - Add test cases for non-capturing groups (C++ only, as the JS/Python implementations do not yet support this syntax)	2026-03-28 17:55:38 +01:00
Aldehir Rojas	e6f2ec01ff	common : add reasoning_format = none support to gpt-oss (#21094 )	2026-03-28 09:33:39 -05:00
Georgi Gerganov	edfb440a2f	server : fix processing of multiple back-to-back mtmd chunks (#21107 )	2026-03-28 16:27:36 +02:00
Adrien Gallouët	3d66da1809	ci : gracefully shut down the server (#21110 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 14:49:57 +01:00
Woof Dog	82b703f8bc	Document custom default webui preferences in server README (#19771 )	2026-03-28 14:19:16 +01:00
Aleksander Grygier	51a84efc53	webui: Conversation forking + branching improvements (#21021 ) * refactor: Make `DialogConfirmation` extensible with children slot * feat: Add conversation forking logic * feat: Conversation forking UI * feat: Update delete/edit dialogs and logic for forks * refactor: Improve Chat Sidebar UX and add MCP Servers entry * refactor: Cleanup * feat: Update message in place when editing leaf nodes * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * chore: Cleanup * refactor: Post-review improvements * chore: update webui build output * test: Update Storybook test * chore: update webui build output * chore: update webui build output	2026-03-28 13:38:15 +01:00
Adrien Gallouët	b0f0dd3e51	vendor : update cpp-httplib to 0.40.0 (#21100 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 08:59:44 +01:00
Ruben Ortlam	0eb4764182	vulkan: add noncontiguous GLU support (#21081 ) * vulkan: add noncontiguous GLU support * fix compile issue	2026-03-28 08:44:56 +01:00
hipudding	cb15cdb020	CANN: add SOFTPLUS unary op support Implement GGML_UNARY_OP_SOFTPLUS using aclnnSoftplus with beta=1.0 and threshold=20.0. This enables hybrid models like Qwen3.5 to run entirely on the CANN backend without graph splitting, which fixes graph cache instability caused by the backend scheduler fragmenting the computation graph when SOFTPLUS falls back to CPU.	2026-03-28 07:16:07 +00:00
hipudding	168d05f3d5	CANN: add GGML_OP_SOLVE_TRI support Implement triangular linear system solve (AX=B) using aclnnTriangularSolve for the lower-triangular, non-unit case.	2026-03-28 06:47:56 +00:00
hipudding	871ffea262	CANN: add GGML_OP_DIAG support Create diagonal matrix from vector by filling dst with zeros then copying src onto the diagonal via a strided view with InplaceCopy.	2026-03-28 06:47:56 +00:00
hipudding	4a7bb25226	CANN: add GGML_OP_FILL support Implement FILL using aclnnInplaceFillScalar to fill a tensor with a constant scalar value from op_params.	2026-03-28 06:47:56 +00:00
hipudding	93e0c17661	CANN: add CUMSUM and TRI op support, fix graph cache op_params matching - Implement GGML_OP_CUMSUM using aclnnCumsum - Implement GGML_OP_TRI with all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG) using Tril/MaskedFillScalar approach to work around CANN sparse-zero bugs - Fix graph cache to always compare op_params for all ops, not just a whitelist	2026-03-28 06:47:56 +00:00
hipudding	11e78d8499	CANN: simplify GATED_DELTA_NET implementation - Remove dead code: _math and _naive variants are no longer needed - Rename _batched to the public entry point ggml_cann_gated_delta_net - In supports_op, return false for non-contiguous / GQA / non-F32 cases so the framework falls back to CPU instead of running the slow naive path - The single remaining implementation uses aclnnBatchMatMul over all H heads per timestep, reducing kernel launches to O(n_seqs * n_tokens)	2026-03-28 06:47:56 +00:00
hipudding	3707b58628	CANN: add GATED_DELTA_NET op support Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a batched approach that groups all attention heads into a single 3-D BatchMatMul per recurrence step, reducing kernel launches from O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens). Key design decisions: - Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch all H heads together for M×k, outer-product, and M×q steps - Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused across all time steps to avoid per-step allocations - Support both scalar gate (g shape [1,H]) and KDA per-dim gate (g shape [S_v,H]) via appropriate broadcast shapes - Fall back to naive per-head scalar loop for permuted/GQA/non-F32 inputs that don't meet batched path requirements - Relax CANN precision tolerance to 1e-6 in tests to account for different FP32 accumulation order in BatchMatMul vs scalar loops	2026-03-28 06:47:56 +00:00
hipudding	140c5a3d1b	CANN: add GATED_DELTA_NET op support	2026-03-28 06:47:56 +00:00
hipudding	c0e78773e9	CANN: implement GGML_OP_SET for CANN backend Add SET operator support using aclnnInplaceCopy, modeled after the existing ACC implementation. This enables the scheduler to assign SET ops to CANN when the output tensor resides on device memory, avoiding cross-device write issues with delta-net hybrid models. All 12 test-backend-ops SET tests pass (f32/i32, inplace/non-inplace, dim 1/2/3).	2026-03-28 06:47:56 +00:00
hipudding	be1492d21f	CANN: implement backend memset_tensor interface Add ggml_backend_cann_buffer_memset_tensor and wire it into `ggml_backend_cann_buffer_interface`. This ensures backend tensor memset operations are supported and avoids incorrect behavior when tensors need explicit zero-initialization (e.g. cache buffers).	2026-03-28 06:47:56 +00:00
Piotr Wilkin (ilintar)	1f5d15e665	common/parser: fix reasoning whitespace bugs + extra parser tests (#21085 ) * fix whitespace reasoning issues + add reconstruction tests * Proper fix * fix Nemotron autoparser test expectations to include newline in marker	2026-03-28 07:29:26 +01:00
Sigbjørn Skjæret	c46758d28f	cli : add /glob command (#21084 ) * add /glob command * output error when max files reached * support globbing outside curdir	2026-03-28 02:33:04 +01:00
Ts-sound	bf934f28db	docker : fix and enable ARM64 image build (#20929 ) * CI: fix ARM64 image build error & enable compilation * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: revert ggml/src/ggml-cpu/CMakeLists.txt * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> * CI: update runs-on to ubuntu24.04, and update ARM64 build image ( ubuntu_version: "24.04") * CI: change cpu.Dockerfile gcc to 14; * CI : cpu.Dockerfile , update pip install . * Update .github/workflows/docker.yml Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-03-28 01:45:09 +01:00
Adrien Gallouët	5c1a7b8355	server : add custom socket options to disable SO_REUSEPORT (#21056 ) * server : add custom socket options to disable SO_REUSEPORT Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Add --reuse-port $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 --reuse-port setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEPORT, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 $ strace -e trace=setsockopt,bind build/bin/llama-server -lv 2 setsockopt(3, SOL_TCP, TCP_NODELAY, [1], 4) = 0 setsockopt(3, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0 bind(3, {sa_family=AF_INET, sin_port=htons(8080), sin_addr=inet_addr("127.0.0.1")}, 16) = 0 Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Update tools/server/README.md (llama-gen-docs) Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Fix windows Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-28 01:12:43 +01:00
Aldehir Rojas	59d840209a	common : inhibit lazy grammar sampler while reasoning is active (#20970 ) * common : inhibit grammar while reasoning budget is active * cont : update force_pos in accept * cont : fix tests * cont : tweak should apply logic * cont : return early not using grammar sampler * Add tests * cont : prevent backend sampling when reasoning budget enabled * cont : fix typo --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>	2026-03-27 18:30:40 +01:00
Kusha Gharahi	ff934e29bc	server: Introduce LLAMA_BUILD_WEBUI build flag to allow disabling the embedded web ui (#20158 ) * introduce LLAMA_SERVER_NO_WEBUI * LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI * LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE * Missed this * Add useWebUi to package.nix	2026-03-27 17:25:55 +01:00
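
Note on 64ac9ab66a (CUDA argsort fix): the commit message says the grid was sized with `ceildiv(nrows, block_size)` where `ceildiv(nrows + 1, block_size)` was needed, which left `offset_iterator[nrows]` uninitialized when `nrows % block_size == 0`. The arithmetic can be checked with a small host-side sketch. It is not the CUDA code from the commit; the helper names are hypothetical, and it assumes the offsets array holds `nrows + 1` entries with one thread writing each entry.

```python
def ceildiv(a: int, b: int) -> int:
    """Integer division rounded up, for non-negative a and positive b."""
    return (a + b - 1) // b


def covered_entries(n_blocks: int, block_size: int) -> int:
    """Number of array entries a 1-D launch of n_blocks blocks can write."""
    return n_blocks * block_size


def offsets_fully_written(nrows: int, block_size: int, fixed: bool) -> bool:
    """Check whether every offsets entry, including index nrows, gets written.

    Assumption: the offsets array has nrows + 1 entries (one start offset
    per row plus a final end offset), and each thread writes one entry.
    """
    needed = nrows + 1
    # Old sizing used nrows; the fix described in the commit uses nrows + 1.
    n_blocks = ceildiv(nrows + 1 if fixed else nrows, block_size)
    return covered_entries(n_blocks, block_size) >= needed
```

With `nrows = 256` and `block_size = 256`, the old sizing launches one block and never reaches index 256, while the corrected sizing launches two. When `nrows` is not a multiple of `block_size`, the rounding slack already covers the extra entry, which is consistent with the bug only appearing in the exact-multiple case.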

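
Note on 871ffea262 (CANN GGML_OP_DIAG): the commit describes building the diagonal matrix by zero-filling `dst` and then copying `src` onto the diagonal through a strided view. A minimal sketch of that idea on a flat row-major buffer follows. It uses plain Python lists rather than the CANN `InplaceCopy` call, and the function name is hypothetical.

```python
def diag_from_vector(src: list[float]) -> list[float]:
    """Return a flat row-major n x n buffer with src on the diagonal."""
    n = len(src)
    # Step 1: fill the destination with zeros.
    dst = [0.0] * (n * n)
    # Step 2: element (i, i) sits at flat index i * n + i = i * (n + 1),
    # so a stride of n + 1 selects exactly the n diagonal entries.
    dst[0 : n * n : n + 1] = src
    return dst
```

For a 3-element vector the strided view touches flat indices 0, 4 and 8, so no per-element loop over rows is needed.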

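
Note on 5c1a7b8355 (server `--reuse-port`): the strace output in the commit message shows `SO_REUSEPORT` being set before `bind` only when `--reuse-port` is passed, while `TCP_NODELAY` and `SO_REUSEADDR` are set in both runs. The same sequence of socket options can be sketched with Python's standard `socket` module. This illustrates the options only and is not the llama-server or cpp-httplib code; `apply_listener_options` and its `reuse_port` parameter are hypothetical names.

```python
import socket


def apply_listener_options(sock: socket.socket, reuse_port: bool = False) -> None:
    """Set the options seen in the strace output, before bind() is called."""
    sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # SO_REUSEPORT is opt-in here and is not defined on every platform,
    # so guard on its presence instead of assuming it exists.
    if reuse_port and hasattr(socket, "SO_REUSEPORT"):
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)


def make_listener(host: str, port: int, reuse_port: bool = False) -> socket.socket:
    """Create a TCP socket, apply the options, then bind it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    apply_listener_options(sock, reuse_port)
    sock.bind((host, port))
    return sock
```

Keeping `SO_REUSEPORT` off unless requested means a second process that binds the same address and port fails instead of silently sharing incoming connections.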
8657 Commits