* CANN: fix multi-thread set_tensor race conditions
When ollama calls ggml_backend_tensor_set from multiple threads (each
writing a different chunk of the same tensor), the CANN backend had
three concurrency issues:
1. Quantized tensors (Q4_0/Q8_0) require a full-tensor format transform
before uploading to device. Per-chunk transforms produced corrupt data.
2. ND-to-NZ weight conversion requires complete tensor data on device.
Per-chunk conversion operated on incomplete data.
3. The global g_nz_workspaces array had unprotected concurrent access.
Fix by introducing a TensorSetTracker that accumulates write progress
per tensor. For quantized tensors, raw data is staged in a host buffer
and the transform + upload is deferred until all chunks arrive. For NZ
weights, chunks are uploaded directly but conversion is deferred. The
tracker and its staging buffer are released immediately after
post-processing completes.
Add per-device mutex to g_nz_workspaces to prevent data races.
* CANN: fix L2_NORM ignoring eps parameter
The L2_NORM implementation was not using the eps parameter from
op_params, causing incorrect results when eps is large (e.g. 10.0).
The CPU reference computes scale = 1/fmaxf(norm, eps), so add a
Clamp step to clamp the norm to at least eps before dividing.
* ggml/cann: compare op_params for POOL_2D in ACL graph cache matching
When ACL graph mode is enabled, the graph LRU cache checks whether a
cached graph matches the current computation graph. Previously,
GGML_OP_POOL_2D was not included in the op_params comparison, so two
POOL_2D nodes with different pooling parameters (kernel size, stride,
padding) but identical tensor shapes and addresses could incorrectly
reuse a cached graph, leading to wrong results or aclnn errors.
Add GGML_OP_POOL_2D to the list of ops that require op_params matching
in ggml_graph_node_properties::has_matching_properties().
* cann: fix ACL graph cache matching by adding tensor type and unconditional op_params comparison
The ACL graph LRU cache was incorrectly reusing cached graphs for
operations with different tensor types or op_params, causing test
failures for CPY (f16 vs bf16), POOL_2D, L2_NORM, NORM_MUL_ADD,
RMS_NORM_MUL_ADD, and ADD_RMS_NORM.
Changes:
- Add node_type and src_type[] fields to ggml_graph_node_properties
so the cache can distinguish tensors with different types but
identical ne/nb (e.g. f16 and bf16 both have 2-byte elements)
- Compare op_params unconditionally for all ops instead of only for
SCALE/UNARY/GLU/ROPE/POOL_2D
The build info is now only for debug, so we avoid the duplicate
with `--version`.
The UTF-8 setup at the beginning is needed to avoid logging
garbage on Windows.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common: add bounds check in common_init_result::sampler to prevent segfault on failed model load
* Revert a308e584ca
* Add regression test
* Remove regression test for init-fail sampler check
* fix: include API key in CORS proxy requests for MCP connections
When llama-server is started with --api-key-file and --webui-mcp-proxy,
the /cors-proxy endpoint requires authentication. The WebUI was not
including the Authorization header in proxy requests, causing MCP
connections to fail with 401.
Inject getAuthHeaders() into requestInit when useProxy is true so the
proxy request carries the Bearer token alongside the forwarded target
headers.
Fixes#21167
* fix: simplify headers assignment based on reviewer suggestion
Apply buildProxiedHeaders only when useProxy is true, pass headers
directly to the transport otherwise.
* CI: Enable CUDA and Vulkan ARM64 runners and fix CI/CD
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
* Obtain source tag name from git tag
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Ts-sound <44093942+Ts-sound@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().
* Treat empty computed member expressions with Jinja2 undefined semantics
Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.
- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`
* Handle undefined computed member properties
Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.
* Use default undefined value in member access
Initialize val and then return it when property is undefined.
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* empty statement parses to blank_expression instead of noop_statement
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: Fix CUB's argsort when nrows % block_size == 0 CCCL < 3.1
We wrongly calculated offset_grid as `ceildiv(nrows, block_size)`,
while it must be `ceildiv(nrows + 1, block_size)`. As a consequence, we
had uninitialized values in `offset_iterator[nrows]` for the case when
`nrows % block_size == 0`.
Fixes#21162
* Reduce nrows in test case to 256, don't need 768
When RPC is running with a remote backend which doesn't have init_tensor
function (like CPU and Metal), the server log gets full with error
messages saying that init_tensor is being called with null buffer which
is incorrect. This patch fixes this.
* Optimize MOE GEMV kernel for BS > 1.
The previous MOE kernel for BS > 1 had too many thread blocks (nrows_x, nchannels_dst, ncols_dst), with very little work per block. block of (32, 4) was doing inner dot product for a single row.
New mul_mat_vec_q_moe kernel is dedicated for MoE multi-token kernel with grid (ceil(nrows_x/rpb), nchannels_dst), block (warp_size, ncols_dst). Each warp handles two rows independently with warp-level reduction only (no shared memory sync).
This change doesn't increase any compilation time as a single template instance is needed per type. This also simplifies the original GEMV kernel and gets rid of `is_multi_token_id` specialization.
* Remove em-dashes
* Cherry-pick changes from @am17an PR https://github.com/ggml-org/llama.cpp/pull/20885 to enable small_k optimization only for cases where it benefits
Increase max batch size for MMVQ kernels for MUL_MAT_ID to 8
* Make the max batch size for MOE GEMV kernel configurable based on GPU arch and datatype
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* hex-fa: add simple dma cache for Mask
I noticed that we were refetch the mask rows over and over.
This simple cache avoids that.
* hex-dma: unset in-order desc bit which caused signficant perf regression
We don't rely on true in order processing of the DMA descriptors anywhere.
Turns out this mode caused significant regression of around 3-4 TPS during token gen.
* hex-rope: update comment to clarify that we don't need in-order DMA completions
The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).
Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.
This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).
The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
implementations do not yet support this syntax)
* introduce LLAMA_SERVER_NO_WEBUI
* LLAMA_SERVER_NO_WEBUI → LLAMA_BUILD_WEBUI
* LLAMA_BUILD_WEBUI ON by default not based on LLAMA_STANDALONE
* MIssed this
* Add useWebUi to package.nix
* server: respect the verbose_prompt parameter
* Revert "server: respect the verbose_prompt parameter"
This reverts commit 8ed885cf37.
* Remove --verbose-prompt parameter from llama-server
* Using set_examples instead of set_excludes