Add element-wise unary ops needed by Qwen 3.5's DeltaNet linear
attention layers. These ops follow the existing unary-ops pattern
with VTCM DMA double-buffering.
- neg: negate via scale by -1.0
- exp: uses existing hvx_exp_f32 HVX intrinsics
- sigmoid: uses existing hvx_sigmoid_f32_aa HVX intrinsics
- softplus: log(1 + exp(x)) scalar fallback
- CONT reuses the existing CPY infrastructure since making a tensor
contiguous is equivalent to a same-type copy.
- REPEAT implements tiled memory copy with multi-threaded execution via
the worker pool, supporting f32 and f16 types. The kernel parallelizes
across output rows and uses memcpy for each tile.
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Add native MTP support for the dense Qwen 3.5 architecture (0.8B, 2B, 4B, 9B, 27B).
What works:
- MTP graph builder for dense qwen35 (build_mtp_head in qwen35.cpp)
- MTP tensor loading and registration for QWEN35 arch
- GGUF converter handles MTP tensors (mtp.fc, mtp.layers, mtp.norm, etc.)
- Public API: llama_get_mtp_logits(), llama_model_n_mtp_layers()
- Server auto-detects MTP from GGUF metadata
- Speculative state machine for MTP draft token generation
- PR #20075 applied: recurrent state checkpoint/restore for hybrid models
- M-RoPE position check relaxed for speculative re-evaluation
- Windows os.kill fix for gateway process detection
What needs work:
- Speculative verify loop conflicts with tool-calling requests (400 error)
- The recommended fix: bypass the speculative framework entirely and
implement MTP acceptance directly in the server generation loop
(no seq_rm/rollback needed since MTP drafts are produced in-graph)
- MTP attention skipped (projection + FFN path only) due to
inp_out_ids token count mismatch
Tested on: RTX 5060 8GB, Windows 11, CUDA 13.2
Model: Qwen3.5-9B with MTP tensors (Q4_K_M quantization)
Base: llama.cpp b8388
* kleidiai : fix MUL_MAT support for batched (3D) inputs
The supports_op() check incorrectly rejected MUL_MAT operations with 3D
inputs (ne[2] > 1), but the actual compute_forward_qx() implementation
handles batched inputs correctly via a loop over ne12.
This caused models with Q4_0/Q8_0 weights to crash during graph scheduling
when n_seq_max > 1, because weights were placed in KLEIDIAI buffers during
loading (tested with 2D inputs) but the runtime used 3D inputs.
Also relax the buffer check to allow supports_op() to be called during
weight loading when src[0]->buffer is NULL.
Fixes#20608
* Kleidiai support_ops should only return true for 3D inputs, not also 4D
* vulkan: avoid graphics queue on non-RADV AMD drivers
* avoid graphics queues on small GPUs
* change to only use graphics queue if overridden with env var GGML_VK_ALLOW_GRAPHICS_QUEUE
* reenable transfer queue if graphics queue is not used
* kleidiai: add data type check to get_tensor_traits
* Added check for F16 data type into get_tensor_traits path with input data
not in ggml_backend_cpu_kleidiai_buffer_type format (unsupported for Q4/8)
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Change-Id: I9aca4b9b8d669d35db6f1dbcc4e080b1919b1de7
* updated ggml/src/ggml-cpu/kleidiai/kleidiai.cpp
updated kleidiai.cpp file as per suggestion
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Martin Klacer <martin.klacer@arm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* webui: fix model selector being locked to first loaded model
When multiple models are loaded, the auto-select effect would re-fire
on every loadedModelIds change, overriding the user's manual model
selection. Guard with selectedModelId so auto-select only kicks in
when no model is chosen yet.
* chore: update webui build output
* webui: use date in exported filename
Move conversation naming and export to utils
update index.html.gz
* webui: move literals to message export constants file
* webui: move export naming and download back to the conversation store
* chore: update webui build output
* webui: add comments to some constants
* chore: update webui build output
* ggml/hip: fix APU compatibility - soft error handling for hipMemAdviseSetCoarseGrain
On AMD APU/iGPU devices (unified memory architecture), hipMemAdviseSetCoarseGrain
returns hipErrorInvalidValue because the hint is not applicable to UMA systems.
The previous CUDA_CHECK() call treated this as a fatal error, causing crashes on
APU systems such as AMD Strix Halo (gfx1151).
Fix: treat hipMemAdviseSetCoarseGrain as an optional performance hint - call it
without error checking and clear any resulting error with hipGetLastError().
Also add pre-allocation debug logging (GGML_LOG_DEBUG) to help diagnose memory
issues on APU systems, and store totalGlobalMem in device info.
Context: AMD APUs on Windows are affected by a ROCm runtime bug that limits
hipMallocManaged to ~64GB regardless of available system RAM. A fix has been
submitted upstream: https://github.com/ROCm/rocm-systems/pull/4077
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
* ggml/hip: remove unrelated changes, keep only hipMemAdviseSetCoarseGrain fix
---------
Co-authored-by: moonshadow-25 <moonshadow-25@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>