Commit Graph

8126 Commits

Author SHA1 Message Date
Pascal 965655fafb chore: update webui build output 2026-02-01 20:35:35 +01:00
Pascal 7953c18967 webui: fix UI freeze at high token rates with RAF yield
The markdown coalescing loop was processing chunks back-to-back without
yielding to the browser's paint cycle. At high token rates (250+ tok/s),
this caused a complete UI freeze, as the main thread was perpetually busy.

Add a requestAnimationFrame yield between processing batches. This allows
the browser to paint at screen FPS regardless of token throughput. Chunks
arriving during the yield are coalesced and processed together, so we
skip intermediate states and jump straight to the latest content.

Before: Chunk->process->Chunk->process->... (browser never paints = freeze)
After:  Chunk->process->[RAF]->coalesced chunks->process->[RAF]->... (screen FPS)

Tested with 250 tok/s streams on 50K+ token contexts: smooth scrolling
and responsive UI throughout.
2026-02-01 20:34:08 +01:00
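A minimal sketch of the coalescing-plus-RAF-yield pattern described in the commit above, assuming a browser environment; the pending queue, onChunk, and render names are hypothetical, not the actual webui code.

```ts
// Illustrative sketch only: queue incoming chunks, drain them in batches, and yield
// one animation frame between batches so the browser can paint at screen FPS.
let pending: string[] = [];
let processing = false;

function onChunk(chunk: string): void {
  pending.push(chunk);
  if (!processing) {
    processing = true;
    void processLoop();
  }
}

async function processLoop(): Promise<void> {
  while (pending.length > 0) {
    // Coalesce everything that arrived during the last yield and jump to the latest content.
    const batch = pending.join('');
    pending = [];
    render(batch);

    // Yield one frame so the browser can paint before the next batch is processed.
    await new Promise<void>((resolve) => requestAnimationFrame(() => resolve()));
  }
  processing = false;
}

function render(text: string): void {
  // Placeholder for the markdown processing + DOM update step.
  console.log('rendered', text.length, 'chars');
}
```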
Pascal 2884ef46b3 chore: update webui build output 2026-02-01 19:45:54 +01:00
Pascal 0dbaeaf6c7 webui: incremental MDAST transform caching for streaming performance
Replace full AST re-transformation with per-block caching strategy.
Previously, each streaming chunk triggered processor.run() on the entire
document (12 rehype/remark plugins including KaTeX and highlight.js).

Now transforms individual MDAST nodes and caches the results by position hash.
In append-only streaming mode, stable blocks are reused directly from the cache;
only the unstable trailing block is re-transformed.

- Add SvelteMap FIFO cache (5000 blocks, evicts oldest 1000 on overflow)
- Add getMdastNodeHash() for MDAST node fingerprinting by position
- Add isAppendMode() to detect streaming append patterns
- Add transformMdastNode() for single-node transformation with cache lookup
- Remove stringifyProcessedNode() (dead code after refactor)

Reduces streaming complexity from O(N × transforms) to O(1) for stable blocks.
Targets 200K token contexts without UI degradation on mobile devices.
2026-02-01 19:44:16 +01:00
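A rough sketch of the per-block cache described above, using a plain Map in place of the SvelteMap; getMdastNodeHash and transformMdastNode are named after the commit's helpers, but their bodies here are assumptions made for illustration.

```ts
// Conceptual sketch only; not the actual webui implementation.
interface MdastNode {
  type: string;
  position?: { start: { offset?: number }; end: { offset?: number } };
}

const CACHE_LIMIT = 5000;
const EVICT_COUNT = 1000;
const blockCache = new Map<string, unknown>(); // position hash -> transformed block

// Fingerprint a block by type and source offsets so stable blocks hash identically
// across streaming updates.
function getMdastNodeHash(node: MdastNode): string {
  const start = node.position?.start.offset ?? -1;
  const end = node.position?.end.offset ?? -1;
  return `${node.type}:${start}:${end}`;
}

function transformMdastNode(node: MdastNode, transform: (n: MdastNode) => unknown): unknown {
  const key = getMdastNodeHash(node);
  const cached = blockCache.get(key);
  if (cached !== undefined) return cached; // stable block: reuse directly from the cache

  const result = transform(node); // only the unstable trailing block pays this cost
  if (blockCache.size >= CACHE_LIMIT) {
    // FIFO eviction: Map preserves insertion order, so drop the oldest entries.
    for (const k of Array.from(blockCache.keys()).slice(0, EVICT_COUNT)) blockCache.delete(k);
  }
  blockCache.set(key, result);
  return result;
}
```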
Pascal 1ab2e45684 chore: update webui build output 2026-02-01 12:10:06 +01:00
Pascal 82f6094aa2 feat: render images inline below attachment markers in tool results
Parse tool results line-by-line to display images immediately after their
[Attachment saved: xxx.png] markers. Fixes the previous commit, where all images
from all tool calls were shown in every section. Each tool call now displays
only its own images.

Uses Svelte derived for memoization to avoid re-parsing on every streaming
chunk. Parsing only occurs when section.toolResult or message.extra changes.
2026-02-01 12:06:25 +01:00
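A simplified sketch of the line-by-line parse described above; the [Attachment saved: ...] marker format comes from the commit message, while the segment shape and function name are assumptions for this example.

```ts
// Illustrative sketch only: split the tool result into lines and emit an image
// segment directly below each attachment marker.
type Segment = { kind: 'text'; text: string } | { kind: 'image'; name: string };

const MARKER = /^\[Attachment saved: (.+)\]$/;

function parseToolResult(result: string): Segment[] {
  const segments: Segment[] = [];
  for (const line of result.split('\n')) {
    segments.push({ kind: 'text', text: line });
    const match = line.match(MARKER);
    if (match) {
      // Render the referenced image directly below its marker line.
      segments.push({ kind: 'image', name: match[1] });
    }
  }
  return segments;
}
```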
Pascal be96423ae9 feat: render images below attachment markers in tool results 2026-02-01 04:56:21 +01:00
Pascal 5a4e4f4189 chore: update webui build output 2026-02-01 04:13:48 +01:00
Pascal 42244c0162 fix: also skip image attachments in message history for non-vision backends 2026-02-01 04:13:37 +01:00
Pascal 6b7e6f18a6 chore: update webui build output 2026-02-01 03:22:09 +01:00
Pascal 893dbb058a fix: skip sending image attachments to non-vision backends 2026-02-01 03:20:36 +01:00
Pascal 556029eee6 chore: update webui build output 2026-01-31 08:27:11 +01:00
Pascal 1384352484 fix: responsive MCP server cards, prioritize server name over version 2026-01-31 08:22:41 +01:00
Pascal 1615b1c58c fix: responsive MCP server cards for mobile viewports 2026-01-31 07:58:47 +01:00
Pascal cd8e5741f2 chore: update webui build output 2026-01-30 20:23:45 +01:00
Pascal b872838329 webui: adaptive model selector dropdown width
Make model selector dropdown responsive:
- Mobile: full width (w-full max-w-[100vw])
- Desktop: adapts to longest model name (sm:w-max)
- Replace TruncatedText with responsive span (truncate on mobile, full text on desktop via sm:overflow-visible sm:whitespace-nowrap)
- Center status icons in fixed 24px wrapper to prevent layout shifts
- Add sm:pr-2 padding between text and icon zone on desktop

Fixes the dropdown cutting off long model names on desktop while maintaining full-width display on mobile with proper text truncation.
2026-01-30 20:21:05 +01:00
Aleksander Grygier 120ada3616 chore: update webui build output 2026-01-29 16:31:07 +01:00
Aleksander Grygier e41f70bb47 refactor: Use CORS Proxy for favicons calls 2026-01-29 16:30:10 +01:00
Aleksander Grygier 46c5bca942 refactor: Proxy utility 2026-01-29 16:29:04 +01:00
Aleksander Grygier 944765138e chore: update webui build output 2026-01-29 15:03:00 +01:00
Aleksander Grygier 536c6866e3 feat: Integrate with `llama-server` proxy + improve MCP Server Edit Form 2026-01-29 14:59:28 +01:00
Aleksander Grygier 406cb1dd99 Merge remote-tracking branch 'ngxson/xsn/cors_proxy_demo' into allozaur/mcp-mvp 2026-01-29 13:34:20 +01:00
Aleksander Grygier 9d6e210a5e Merge remote-tracking branch 'ggml-org/master' into allozaur/mcp-mvp 2026-01-29 13:21:44 +01:00
Aleksander Grygier 7b00b46a6a chore: update webui build output 2026-01-29 12:55:45 +01:00
Aleksander Grygier 6793c7daac fix: Checking for capabilities from store 2026-01-29 12:45:10 +01:00
Aleksander Grygier 2aa704b821 refactor: Cleanup 2026-01-29 11:44:08 +01:00
yulo f3dd7b8e68
HIP: add mmf for CDNA (#18896)
* refactor mmf rows_per_block

* speed up compile

* pass cdna compile

* fix cuda error

* clean up mmf

* f32 mmf

* clean float mma

* fix mmf error

* faster mmf

* extend tile k

* fix compile error

* Revert "extend tile k"

This reverts commit 4d2ef3d483.

* fix smem overflow

* speed up compiling mmf

* speed up compile for hip

* 512 block for cdna

* config pad size

* fix as comment

* update select logic

* move some code to cuh

* fix as comment

* correct cdna3 config

---------

Co-authored-by: zhang hui <you@example.com>
2026-01-29 11:10:53 +01:00
Georgi Gerganov eed25bc6b0
arg : add -kvu to llama-batched-bench (#19172) 2026-01-29 08:50:47 +02:00
Vishal Singh b33df266d0
ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159) 2026-01-29 12:28:57 +08:00
Aman Gupta 3bcc990997
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126) 2026-01-29 10:31:28 +08:00
Neo Zhang d4964a7c66
sycl: fix norm kernels: l2_norm, group_norm, rms_norm by removing assert to support more cases (#19154)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-29 09:20:22 +08:00
Sigbjørn Skjæret 50e8962f79
ci : find latest release with asset for winget (#19161) 2026-01-28 22:05:39 +01:00
Aleksander Grygier c7b7fc6c15 chore: update webui build output 2026-01-28 19:57:18 +01:00
Aleksander Grygier d9e82b7c29 fix: Linter errors 2026-01-28 19:55:44 +01:00
Ruben Ortlam f6b533d898
Vulkan Flash Attention Coopmat1 Refactor (#19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00
Sascha Rogmann 72d3b1898a
spec : add self-speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Aleksander Grygier 7c9be63a74 refactor: Refine Chat Message Processing State Display 2026-01-28 18:31:37 +01:00
Aleksander Grygier 5a176d1893 feat: Chat logic improvements 2026-01-28 18:31:37 +01:00
Aleksander Grygier aa7089d598 feat: Integrate Resource Attachments into Chat Form UI 2026-01-28 18:31:37 +01:00
Aleksander Grygier 23e4ef7495 feat: MCP Resources UI
feat: Implement MCP Resource Selection Dialog
2026-01-28 18:31:37 +01:00
Aleksander Grygier 1623547e2b feat: Integrate Resource Store into Main MCP Store 2026-01-28 18:31:36 +01:00
Aleksander Grygier dc2076a77c feat: MCP Resources Svelte Store 2026-01-28 18:31:36 +01:00
Aleksander Grygier 192c920d73 refactor: Use constants 2026-01-28 18:31:35 +01:00
Aleksander Grygier 89166a79d4 feat: Introduce MCP Resource Types and Service Methods 2026-01-28 18:31:35 +01:00
Aleksander Grygier 85a61a7c96 refactor: Componentize HorizontalScrollCarousel 2026-01-28 17:32:59 +01:00
Aleksander Grygier bfbcdc7420 fix: Code Preview sandbox 2026-01-28 17:31:04 +01:00
Daniel Bevenius ebf5725870
convert : yield Mamba2Model/GraniteMoeModel modify_tensors (#19157)
* convert : yield Mamba2Model/GraniteMoeModel modify_tensors

This commit updates the `GraniteHybridModel` class' modify_tensors
function to properly delegate to `Mamba2Model.modify_tensors` and
`GraniteMoeModel.modify_tensors` using 'yield from' instead of 'return'.

The motivation for this is that modify_tensors is a generator function
(it uses 'yield from'), but the two calls above used return statements
and did not yield anything, which means the caller of this function
would not receive any yielded values from it. This caused layer tensors
to be silently dropped during conversion.
2026-01-28 16:49:36 +01:00
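A quick illustration of the generator-delegation pitfall described above, sketched with TypeScript generators (the convert script itself is Python; `yield*` plays the role of `yield from`).

```ts
// `return inner()` hands back the generator object as the return value, so nothing is
// yielded to the caller, while `yield*` forwards every value the inner generator produces.
function* inner(): Generator<string> {
  yield 'tensor.a';
  yield 'tensor.b';
}

function* brokenOuter(): Generator<string> {
  return inner(); // values are silently dropped
}

function* fixedOuter(): Generator<string> {
  yield* inner(); // proper delegation; the caller receives every value
}

console.log([...brokenOuter()]); // []
console.log([...fixedOuter()]);  // [ 'tensor.a', 'tensor.b' ]
```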
Patryk Kaminski 0cd7032ca4
ggml-sycl: remove unused syclcompat header (#19140)
The syclcompat/math.hpp header is no longer used. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
2026-01-28 23:33:54 +08:00
Sigbjørn Skjæret 60368e1d73
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
* undefined is treated as iterable (string/array) by filters

`tojson` is not a supported `undefined` filter

* add tests

* add sequence and iterable tests

keep it DRY and fix some types
2026-01-28 14:40:29 +01:00
Oleksandr Kuvshynov 88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)
Deduplication here relied on the fact that Vulkan would return a unique
UUID for each physical GPU. That is currently not always the case.
On a Mac Pro 2019 running macOS, with 2 Vega II Duo cards (so 4 GPUs total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix this in MoltenVK, but until that happens,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.

The deduplication logic here is changed to only filter out GPUs if the UUID is
the same but the driver is different.
2026-01-28 12:35:54 +01:00
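A conceptual sketch of the revised dedup rule described above, written in TypeScript for illustration; the actual change lives in the C++ Vulkan backend, and the device record shape here is an assumption.

```ts
// Illustrative sketch only, not the backend code.
interface VkDeviceInfo {
  uuid: string;
  driver: string;
  name: string;
}

function dedupDevices(devices: VkDeviceInfo[]): VkDeviceInfo[] {
  const kept: VkDeviceInfo[] = [];
  for (const dev of devices) {
    // Drop a device only when a kept device has the same UUID but a different driver
    // (the same physical GPU exposed twice). Same UUID with the same driver is kept,
    // since MoltenVK may label distinct GPUs with identical UUIDs.
    const isDuplicate = kept.some((k) => k.uuid === dev.uuid && k.driver !== dev.driver);
    if (!isDuplicate) kept.push(dev);
  }
  return kept;
}
```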