* vulkan: Programmatically add RoundingModeRTE to all shaders when the device supports it
* use FetchContent to get SPIRV-Headers
* Fetch spirv-headers unconditionally
* remove fetchcontent, rely on installed headers
* fix ubuntu job
* Update docs/build.md
* cmake: fix CMP0194 warning on Windows with MSVC
Set CMP0194 policy to NEW before project() call in ggml/CMakeLists.txt to suppress the "MSVC is not an assembler for language ASM" warning introduced in CMake 4.1.
The ggml project enables ASM globally for Metal (macOS) and KleidiAI (ARM) backends. On Windows/MSVC, no assembler sources are used, but CMake 4.1+ warns because cl.exe is not a valid ASM compiler.
This follows the same pattern used in ggml-vulkan (CMP0114, CMP0147).
Closesggml-org/llama.cpp#20311
* cmake: apply cisc's formatting suggestion
---------
Co-authored-by: texasich <texasich@users.noreply.github.com>
* Update register tiling matmul to use f32 accumulation
* fix profiling code
* Fix register tiling matmul for chrome, i'm blaming dawn
* Update batch tuning value for iOS
* compile fix
* Fix use of new load function
* common: skip reasoning budget sampler when no budget is requested
After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).
More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).
So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.
* common: preserve rbudget when grammar is lazy
Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.
* webui: add setting for first-line chat titles
Add an opt-in setting (`titleGenerationUseFirstLine`) to use the first
non-empty line of a prompt as the generated conversation title.
Previously, the complete multi-line prompt was being used, which created
long titles for complex queries. Coupled with
"Ask for confirmation before changing conversation title", the dialog
would overflow.
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/utils/text.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: Run build to update the bundle
As requested in:
https://github.com/ggml-org/llama.cpp/pull/21797#pullrequestreview-4094935065
* webui: Fix missing import for NEWLINE_SEPARATOR
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Add MCP Connection diagnostics and CORS hint to web-ui
* tidy up test
* webui: Refactor and improve MCP diagnostic logging
---------
Co-authored-by: evalstate <1936278+evalstate@users.noreply.github.com>
* add qwen3a
* wip
* vision ok
* no more deepstack for audio
* convert ASR model ok
* qwen3 asr working
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* nits
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* fix bad merge
* fix multi inheritance
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* mtmd : add MERaLiON-2 multimodal audio support
Adds support for A*STAR's MERaLiON-2 audio-language model (3B and 10B)
to the multimodal framework.
Architecture:
- Whisper large-v2 encoder for audio feature extraction
- Gated MLP adaptor: ln_speech -> frame stack (x15) -> Linear+SiLU -> GLU -> out_proj
- Gemma2 3B / 27B decoder
The mmproj GGUF is generated via convert_hf_to_gguf.py --mmproj on the full
MERaLiON-2 model directory (architecture: MERaLiON2ForConditionalGeneration).
The decoder is converted separately as a standard Gemma2 model after stripping
the text_decoder. weight prefix.
New projector type: PROJECTOR_TYPE_MERALION
Supports tasks: speech transcription (EN/ZH/MS/TA), translation, spoken QA.
Model: https://huggingface.co/MERaLiON/MERaLiON-2-3Bhttps://huggingface.co/MERaLiON/MERaLiON-2-10B
* simplify comments in meralion adaptor
* meralion: use format_tensor_name, ascii arrows in comments
* hexagon: introduce op request batching and rewrite buffer managment
The host now prepares batches of requests and dispatches them via a single dspqueue message.
Buffers are mapped explicitly by NPU while processing batches.
* hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops
* hex-utils: add explicit l2flush and l2clear helpers
* hex-opreq: use fine-grain per tensor l2 management
* hex-opreq: avoid redundant invalidates for tensors we already flushed
* hex-opreq: update debug messages
* htp-opreq: reuse ops_context
* hex-opreq: do not flush or invalidate cache lines beyond buffer boundry
* hex-opreq: fix errors in log message
* Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry"
This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d.
* hexagon: limit l2 flushes to 1MB which covers l2 cache
* hex-opreq: limit cache flush to 4MB
Looks like 4MB cont. vitual space should cover the 1MB cache.
* hexagon: drop cache flush size to 2MB
* hex-opreq: start reworking opreq packing
* hex-opreq: introduce new way of packing opbatch where tensors are stored separately
* hex-opreq: add a simple fastrpc call to force unmap all buffers
* hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size
* hex-opreq: bump opreq batch size to 256
* hex-mm: place src1 spad at the top of vtcm for easy reuse
* hex-ops: introduce internal types and disable src1 reuse for now
Nothing new just formalizing the repack / qyn.quant types we've been using.
* htp-opreq: use tensor pointers instead of copies
* hex-opreq: introduce more robust way for tracking vtcm/spad reuse
This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops.
* hex-cumsum: fix error post opreq merge
* hex-opreq: move request batch handling into the session
Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner.
* hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx
* hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers
* hex-buf: add support for allocating shared/pinned buffer for opreqs
* hex-opbatch: make opbatches configurable
* hex-naming: better name for ggml_hexagon_shared_buffer
* hex-naming: add session->c_name() helper
* hex-opbatch: start using shm but still copy for now
* hex-opbatch: use shared buffer for packing opbatch
* hex-opbatch: beter naming for opbatch related classes and code
* hex-opbatch: reuse batched tensors with same data/dims/strides
* hex-opbatch: update logging
* hex-opbatch: add support for vmem limit for op batching
* hex-opbatch: update htp side to properly support dynamic mmap/unmap
* hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing
* hex-opbatch: fixed src1 handling in act ops
* hex-act: fix empty src1 handling in swiglu and friends
Simplify preamble macro while at it
* hex-mm: minor fix vtcm and dma handling in matmul
cleaning up some left-overs from merges
* hex-opbatch: allocate extra 1KB for dspqueue overhead
* hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc
* hex-mm: properly handle hmx_disabled flag
* hex-ops: update comments
* hex-ops: add debug output for get/set-rows
* hex-mmap: optimize un/mapping of buffers
* hex-opreq: global cache flush and invalidate beyond 128KB threshold
* hex-ops: add super simple opfilter regex for debugging
If an Op matches the regex hex backend will reject it.
* hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future
* hexagon: improved vtcm acquision to remove inter-op overhead
Fully compatible with QNN-HTP coex
* hex-mm: fixed hvx fallback path
* hex-mm: lower the vmem threshold a bit further to ~3GB
* hexagon: update debug & error logs
This also fixes an issue with newer llvm merging repack and non-repack
functions. We use those pointer to distinguish between buffer types.
* hexagon: move ops context into main context
Just a cleanup. We don't need separate contexts at this point.
* hex-opbatch: cleanup naming and headers for opbatch and related descriptors
* hex-fa: it's now better to enable FA during TG to reduce graph splits
* hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var
It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops
if needed for debugging or validation.
* hexagon: fixed editorconfig check
* Update ggml/src/ggml-hexagon/ggml-hexagon.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log
* Merge with upstream
* Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants
* Update Unary wgsl EXP and EXPM1 for f16 stability
* Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization
* Fix numerical percision for unary sqrt when working with f16
* Fix NaN canonicalization for packed integers using f16
* Update err threshold for binary div ops when using f16
* backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend
* clean: uncomment existing code logs
* clean: clean the unncessary debug info
* Refactor and generalize dequant helpers
* Remove deprecated quant structs
* Refactor shader defines to reduce repetition
* Remove error override for F16 type
* fix: fix the accidential removal of the proper initialization of ctx
* clean: clean legacy and format code
* fix: did not modify tests ops
---------
Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>
I'm not sure what the purpose of keeping `--alias` was when using
`--models-preset`, but the result is really weird, as shown in the
following logs:
$ build/bin/llama-server --models-preset preset.ini --alias "Gemma 4 E4B UD Q8_K_XL"
...
init: using 31 threads for HTTP server
srv load_models: Loaded 2 cached model presets
srv load_models: Loaded 1 custom model presets from preset.ini
main: failed to initialize router models: alias 'Gemma 4 E4B UD Q8_K_XL' for model 'angt/test-split-model-stories260K:F32' conflicts with existing model name
So I propose to simply ignore `--alias` too in this case. With this
commit, the server starts in routing mode correctly.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* fix: enable reasoning budget sampler for gemma4
Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.
Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).
Add test case for empty thinking block.
Fixes#21487
* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser