The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e584ca
* Add regression test
* Remove regression test for init-fail sampler check
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes#21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
* Obtain source tag name from git tag
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics
Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.
- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`
* Handle undefined computed member properties
Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.
* Use default undefined value in member access
Initialize val and then return it when property is undefined.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* empty statement parses to blank_expression instead of noop_statement
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1
We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
Fixes#21162
* Reduce nrows in test case to 256, don't need 768
When RPC is running with a remote backend which doesn't have init_tensor
function (like CPU and Metal), the server log gets full with error
messages saying that init_tensor is being called with null buffer which
is incorrect. This patch fixes this.
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* hex-fa: add simple dma cache for Mask
I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused signficant perf regression
We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).
Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.
This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).
The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
implementations do not yet support this syntax)
Implement GGML_UNARY_OP_SOFTPLUS using aclnnSoftplus with beta=1.0
and threshold=20.0. This enables hybrid models like Qwen3.5 to run
entirely on the CANN backend without graph splitting, which fixes
graph cache instability caused by the backend scheduler fragmenting
the computation graph when SOFTPLUS falls back to CPU.
- Implement GGML_OP_CUMSUM using aclnnCumsum
- Implement GGML_OP_TRI with all 4 tri types (LOWER, LOWER_DIAG, UPPER, UPPER_DIAG)
using Tril/MaskedFillScalar approach to work around CANN sparse-zero bugs
- Fix graph cache to always compare op_params for all ops, not just a whitelist
- Remove dead code: _math and _naive variants are no longer needed
- Rename _batched to the public entry point ggml_cann_gated_delta_net
- In supports_op, return false for non-contiguous / GQA / non-F32 cases
so the framework falls back to CPU instead of running the slow naive path
- The single remaining implementation uses aclnnBatchMatMul over all H
heads per timestep, reducing kernel launches to O(n_seqs * n_tokens)
Implement GATED_DELTA_NET for the CANN (Ascend NPU) backend using a
batched approach that groups all attention heads into a single 3-D
BatchMatMul per recurrence step, reducing kernel launches from
O(n_seqs × H × n_tokens) to O(n_seqs × n_tokens).
Key design decisions:
- Use aclnnBatchMatMul (rank-3 only) with shape [H, S_v, S_v] to batch
all H heads together for M×k, outer-product, and M×q steps
- Pre-allocate temporary buffers (g_exp, mk, delta, outer) reused
across all time steps to avoid per-step allocations
- Support both scalar gate (g shape [1,H]) and KDA per-dim gate
(g shape [S_v,H]) via appropriate broadcast shapes
- Fall back to naive per-head scalar loop for permuted/GQA/non-F32
inputs that don't meet batched path requirements
- Relax CANN precision tolerance to 1e-6 in tests to account for
different FP32 accumulation order in BatchMatMul vs scalar loops
Add SET operator support using aclnnInplaceCopy, modeled after the
existing ACC implementation. This enables the scheduler to assign
SET ops to CANN when the output tensor resides on device memory,
avoiding cross-device write issues with delta-net hybrid models.
All 12 test-backend-ops SET tests pass (f32/i32, inplace/non-inplace, dim 1/2/3).
Add ggml_backend_cann_buffer_memset_tensor and wire it into
`ggml_backend_cann_buffer_interface`.
This ensures backend tensor memset operations are supported
and avoids incorrect behavior when tensors need explicit
zero-initialization (e.g. cache buffers).
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix