Commit Graph

8126 Commits

Author SHA1 Message Date
Pascal 965655fafb chore: update webui build output 2026-02-01 20:35:35 +01:00
Pascal 7953c18967 webui: fix UI freeze at high token rates with RAF yield
The markdown coalescing loop was processing chunks back-to-back without
yielding to the browser's paint cycle. At high token rates (250+ tok/s),
this caused a complete UI freeze, as the main thread was perpetually busy.

Add a requestAnimationFrame yield between processing batches. This allows
the browser to paint at screen FPS regardless of token throughput. Chunks
arriving during the yield are coalesced and processed together, so we
skip intermediate states and jump straight to the latest content.

Before: Chunk->process->Chunk->process->... (browser never paints = freeze)
After:  Chunk->process->[RAF]->coalesced chunks->process->[RAF]->... (screen FPS)

Tested with 250 tok/s streams on 50K+ token contexts: smooth scrolling
and responsive UI throughout.
2026-02-01 20:34:08 +01:00
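A minimal sketch of the coalescing-plus-RAF-yield pattern described in the commit above, assuming a browser environment; the pending queue, onChunk, and render names are hypothetical, not the actual webui code.

```ts
// Illustrative sketch only: queue incoming chunks, drain them in batches, and yield
// one animation frame between batches so the browser can paint at screen FPS.
let pending: string[] = [];
let processing = false;

function onChunk(chunk: string): void {
  pending.push(chunk);
  if (!processing) {
    processing = true;
    void processLoop();
  }
}

async function processLoop(): Promise<void> {
  while (pending.length > 0) {
    // Coalesce everything that arrived during the last yield and jump to the latest content.
    const batch = pending.join('');
    pending = [];
    render(batch);

    // Yield one frame so the browser can paint before the next batch is processed.
    await new Promise<void>((resolve) => requestAnimationFrame(() => resolve()));
  }
  processing = false;
}

function render(text: string): void {
  // Placeholder for the markdown processing + DOM update step.
  console.log('rendered', text.length, 'chars');
}
```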
Pascal 2884ef46b3 chore: update webui build output 2026-02-01 19:45:54 +01:00
Pascal 0dbaeaf6c7 webui: incremental MDAST transform caching for streaming performance
Replace full AST re-transformation with per-block caching strategy.
Previously, each streaming chunk triggered processor.run() on the entire
document (12 rehype/remark plugins including KaTeX and highlight.js).

Now transforms individual MDAST nodes and caches the results by position hash.
In append-only streaming mode, stable blocks are reused directly from the cache;
only the unstable trailing block is re-transformed.

- Add SvelteMap FIFO cache (5000 blocks, evicts oldest 1000 on overflow)
- Add getMdastNodeHash() for MDAST node fingerprinting by position
- Add isAppendMode() to detect streaming append patterns
- Add transformMdastNode() for single-node transformation with cache lookup
- Remove stringifyProcessedNode() (dead code after refactor)

Reduces streaming complexity from O(N × transforms) to O(1) for stable blocks.
Targets 200K token contexts without UI degradation on mobile devices.
2026-02-01 19:44:16 +01:00
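A rough sketch of the per-block cache described above, using a plain Map in place of the SvelteMap; getMdastNodeHash and transformMdastNode are named after the commit's helpers, but their bodies here are assumptions made for illustration.

```ts
// Conceptual sketch only; not the actual webui implementation.
interface MdastNode {
  type: string;
  position?: { start: { offset?: number }; end: { offset?: number } };
}

const CACHE_LIMIT = 5000;
const EVICT_COUNT = 1000;
const blockCache = new Map<string, unknown>(); // position hash -> transformed block

// Fingerprint a block by type and source offsets so stable blocks hash identically
// across streaming updates.
function getMdastNodeHash(node: MdastNode): string {
  const start = node.position?.start.offset ?? -1;
  const end = node.position?.end.offset ?? -1;
  return `${node.type}:${start}:${end}`;
}

function transformMdastNode(node: MdastNode, transform: (n: MdastNode) => unknown): unknown {
  const key = getMdastNodeHash(node);
  const cached = blockCache.get(key);
  if (cached !== undefined) return cached; // stable block: reuse directly from the cache

  const result = transform(node); // only the unstable trailing block pays this cost
  if (blockCache.size >= CACHE_LIMIT) {
    // FIFO eviction: Map preserves insertion order, so drop the oldest entries.
    for (const k of Array.from(blockCache.keys()).slice(0, EVICT_COUNT)) blockCache.delete(k);
  }
  blockCache.set(key, result);
  return result;
}
```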
Pascal 1ab2e45684 chore: update webui build output 2026-02-01 12:10:06 +01:00
Pascal 82f6094aa2 feat: render images inline below attachment markers in tool results
Parse tool results line-by-line to display images immediately after their
[Attachment saved: xxx.png] markers. Fixes the previous commit, where all images
from all tool calls were shown in every section. Each tool call now displays
only its own images.

Uses Svelte derived for memoization to avoid re-parsing on every streaming
chunk. Parsing only occurs when section.toolResult or message.extra changes.
2026-02-01 12:06:25 +01:00
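A simplified sketch of the line-by-line parse described above; the [Attachment saved: ...] marker format comes from the commit message, while the segment shape and function name are assumptions for this example.

```ts
// Illustrative sketch only: split the tool result into lines and emit an image
// segment directly below each attachment marker.
type Segment = { kind: 'text'; text: string } | { kind: 'image'; name: string };

const MARKER = /^\[Attachment saved: (.+)\]$/;

function parseToolResult(result: string): Segment[] {
  const segments: Segment[] = [];
  for (const line of result.split('\n')) {
    segments.push({ kind: 'text', text: line });
    const match = line.match(MARKER);
    if (match) {
      // Render the referenced image directly below its marker line.
      segments.push({ kind: 'image', name: match[1] });
    }
  }
  return segments;
}
```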
Pascal be96423ae9 feat: render images below attachment markers in tool results 2026-02-01 04:56:21 +01:00
Pascal 5a4e4f4189 chore: update webui build output 2026-02-01 04:13:48 +01:00
Pascal 42244c0162 fix: also skip image attachments in message history for non-vision backends 2026-02-01 04:13:37 +01:00
Pascal 6b7e6f18a6 chore: update webui build output 2026-02-01 03:22:09 +01:00
Pascal 893dbb058a fix: skip sending image attachments to non-vision backends 2026-02-01 03:20:36 +01:00
Pascal 556029eee6 chore: update webui build output 2026-01-31 08:27:11 +01:00
Pascal 1384352484 fix: responsive MCP server cards, prioritize server name over version 2026-01-31 08:22:41 +01:00
Pascal 1615b1c58c fix: responsive MCP server cards for mobile viewports 2026-01-31 07:58:47 +01:00
Pascal cd8e5741f2 chore: update webui build output 2026-01-30 20:23:45 +01:00
Pascal b872838329 webui: adaptive model selector dropdown width
Make model selector dropdown responsive:
- Mobile: full width (w-full max-w-[100vw])
- Desktop: adapts to longest model name (sm:w-max)
- Replace TruncatedText with responsive span (truncate on mobile, full text on desktop via sm:overflow-visible sm:whitespace-nowrap)
- Center status icons in fixed 24px wrapper to prevent layout shifts
- Add sm:pr-2 padding between text and icon zone on desktop

Fixes the dropdown cutting off long model names on desktop while maintaining full-width display on mobile with proper text truncation.
2026-01-30 20:21:05 +01:00
Aleksander Grygier 120ada3616 chore: update webui build output 2026-01-29 16:31:07 +01:00
Aleksander Grygier e41f70bb47 refactor: Use CORS Proxy for favicons calls 2026-01-29 16:30:10 +01:00
Aleksander Grygier 46c5bca942 refactor: Proxy utility 2026-01-29 16:29:04 +01:00
Aleksander Grygier 944765138e chore: update webui build output 2026-01-29 15:03:00 +01:00
Aleksander Grygier 536c6866e3 feat: Integrate with `llama-server` proxy + improve MCP Server Edit Form 2026-01-29 14:59:28 +01:00
Aleksander Grygier 406cb1dd99 Merge remote-tracking branch 'ngxson/xsn/cors_proxy_demo' into allozaur/mcp-mvp 2026-01-29 13:34:20 +01:00
Aleksander Grygier 9d6e210a5e Merge remote-tracking branch 'ggml-org/master' into allozaur/mcp-mvp 2026-01-29 13:21:44 +01:00
Aleksander Grygier 7b00b46a6a chore: update webui build output 2026-01-29 12:55:45 +01:00
Aleksander Grygier 6793c7daac fix: Checking for capabilities from store 2026-01-29 12:45:10 +01:00
Aleksander Grygier 2aa704b821 refactor: Cleanup 2026-01-29 11:44:08 +01:00
yulo f3dd7b8e68
HIP: add mmf for CDNA (#18896)
* refactor mmf rows_per_block

* speed up compile

* pass cdna compile

* fix cuda error

* clean up mmf

* f32 mmf

* clean float mma

* fix mmf error

* faster mmf

* extend tile k

* fix compile error

* Revert "extend tile k"

This reverts commit 4d2ef3d483.

* fix smem overflow

* speed up compiling mmf

* speed up compile for hip

* 512 block for cdna

* config pad size

* fix as comment

* update select logic

* move some code to cuh

* fix as comment

* correct cdna3 config

---------

Co-authored-by: zhang hui <you@example.com>
2026-01-29 11:10:53 +01:00
Georgi Gerganov eed25bc6b0
arg : add -kvu to llama-batched-bench (#19172) 2026-01-29 08:50:47 +02:00
Vishal Singh b33df266d0
ggml-zendnn : resolve ZenDNN backend cross-module symbol dependency (#19159) 2026-01-29 12:28:57 +08:00
Aman Gupta 3bcc990997
CUDA: refactor topk-moe to enable more models (GLM 4.7, Nemotron etc.) (#19126) 2026-01-29 10:31:28 +08:00
Neo Zhang d4964a7c66
sycl: fix norm kernels: l2_norm, group_norm, rms_norm by removing assert to support more cases (#19154)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2026-01-29 09:20:22 +08:00
Sigbjørn Skjæret 50e8962f79
ci : find latest release with asset for winget (#19161) 2026-01-28 22:05:39 +01:00
Aleksander Grygier c7b7fc6c15 chore: update webui build output 2026-01-28 19:57:18 +01:00
Aleksander Grygier d9e82b7c29 fix: Linter errors 2026-01-28 19:55:44 +01:00
Ruben Ortlam f6b533d898
Vulkan Flash Attention Coopmat1 Refactor (#19075)
* vulkan: use coopmat for flash attention p*v matrix multiplication

* fix P loading issue

* fix barrier position

* remove reduction that is no longer needed

* move max thread reduction into loop

* remove osh padding

* add bounds checks and padding

* remove unused code

* fix shmem sizes, loop duration and accesses

* don't overwrite Qf, add new shared psh buffer instead

* add missing bounds checks

* use subgroup reductions

* optimize

* move bounds check, reduce barriers

* support other Bc values and other subgroup sizes

* remove D_split

* replace Of register array with shared memory Ofsh array

* parallelize HSV across the rowgroups

* go back to Of in registers, not shmem

* vectorize sfsh

* don't store entire K tile in shmem

* fixes

* load large k tiles to shmem on Nvidia

* adapt shared memory host check function to shader changes

* remove Bc 32 case

* remove unused variable

* fix missing mask reduction tmspsh barrier

* fix mask bounds check

* fix rowmax f16 under/overflow to inf

* fix flash_attn_cm2 BLOCK_SIZE preprocessor directives
2026-01-28 18:52:45 +01:00
Sascha Rogmann 72d3b1898a
spec : add self-speculative decoding (no draft model required) + refactor (#18471)
* server: introduce self-speculative decoding

* server: moved self-call into speculative.cpp

* can_speculate() includes self-speculation

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: can_speculate() tests self-spec

* server: replace can_speculate() with slot.can_speculate()

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* common: use %zu format specifier for size_t in logging

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* server: can_speculate() requires a task instance

* common: ngram map, config self-speculative decoding

* common: add enum common_speculative_type

* common: add vector of speculative states

* common: add option --spec-draftless

* server: cleanup (remove slot.batch_spec, rename)

* common: moved self-spec impl to ngram-map

* common: cleanup (use common_speculative_state_draft)

* spec : refactor

* cont : naming

* spec: remove --spec-config

* doc: (draftless) speculative decoding

* common: print performance in spec decoding

* minor : cleanup

* common : better names

* minor : cleanup + fix build

* minor: comments

* CODEOWNERS: add common/ngram-map.* (#18471)

* common : rename speculative.draftless_type -> speculative.type

* ngram-map : fix uninitialized values

* ngram-map : take into account the input can become shorter

* ngram-map : revert len check for now

* arg : change `--spec-draftless` -> `--spec-type`

* spec : add common_speculative_state::accept()

* spec : refactor + add common_speculative_begin()

* spec : fix begin() call with mtmd

* spec : additional refactor + remove common_speculative_params

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-01-28 19:42:42 +02:00
Aleksander Grygier 7c9be63a74 refactor: Refine Chat Message Processing State Display 2026-01-28 18:31:37 +01:00
Aleksander Grygier 5a176d1893 feat: Chat logic improvements 2026-01-28 18:31:37 +01:00
Aleksander Grygier aa7089d598 feat: Integrate Resource Attachments into Chat Form UI 2026-01-28 18:31:37 +01:00
Aleksander Grygier 23e4ef7495 feat: MCP Resources UI
feat: Implement MCP Resource Selection Dialog
2026-01-28 18:31:37 +01:00
Aleksander Grygier 1623547e2b feat: Integrate Resource Store into Main MCP Store 2026-01-28 18:31:36 +01:00
Aleksander Grygier dc2076a77c feat: MCP Resources Svelte Store 2026-01-28 18:31:36 +01:00
Aleksander Grygier 192c920d73 refactor: Use constants 2026-01-28 18:31:35 +01:00
Aleksander Grygier 89166a79d4 feat: Introduce MCP Resource Types and Service Methods 2026-01-28 18:31:35 +01:00
Aleksander Grygier 85a61a7c96 refactor: Componentize HorizontalScrollCarousel 2026-01-28 17:32:59 +01:00
Aleksander Grygier bfbcdc7420 fix: Code Preview sandbox 2026-01-28 17:31:04 +01:00
Daniel Bevenius ebf5725870
convert : yield Mamba2Model/GraniteMoeModel modify_tensors (#19157)
* convert : yield Mamba2Model/GraniteMoeModel modify_tensors

This commit updates the `GraniteHybridModel` class' modify_tensors
function to properly delegate to `Mamba2Model.modify_tensors` and
`GraniteMoeModel.modify_tensors` using 'yield from' instead of 'return'.

The motivation for this is that modify_tensors is a generator function
(it uses 'yield from'), but the two calls above used return statements
and did not yield anything, which means the caller of this function
would not receive any yielded values from it. This caused layer tensors
to be silently dropped during conversion.
2026-01-28 16:49:36 +01:00
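A quick illustration of the generator-delegation pitfall described above, sketched with TypeScript generators (the convert script itself is Python; `yield*` plays the role of `yield from`).

```ts
// `return inner()` hands back the generator object as the return value, so nothing is
// yielded to the caller, while `yield*` forwards every value the inner generator produces.
function* inner(): Generator<string> {
  yield 'tensor.a';
  yield 'tensor.b';
}

function* brokenOuter(): Generator<string> {
  return inner(); // values are silently dropped
}

function* fixedOuter(): Generator<string> {
  yield* inner(); // proper delegation; the caller receives every value
}

console.log([...brokenOuter()]); // []
console.log([...fixedOuter()]);  // [ 'tensor.a', 'tensor.b' ]
```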
Patryk Kaminski 0cd7032ca4
ggml-sycl: remove unused syclcompat header (#19140)
The syclcompat/math.hpp header is no longer used. The change that introduced it was successfully reverted (https://github.com/ggml-org/llama.cpp/pull/17826).
This include path will become obsolete and be dropped in oneAPI 2026.0, effectively breaking ggml-sycl builds.
2026-01-28 23:33:54 +08:00
Sigbjørn Skjæret 60368e1d73
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
* undefined is treated as iterable (string/array) by filters

`tojson` is not a supported `undefined` filter

* add tests

* add sequence and iterable tests

keep it DRY and fix some types
2026-01-28 14:40:29 +01:00
Oleksandr Kuvshynov 88d23ad515
vulkan: handle device dedup on MacOS + Vega II Duo cards (#19058)
Deduplication here relied on the fact that Vulkan would return a unique
UUID for each physical GPU. That is currently not always the case.
On a Mac Pro 2019 running macOS, with 2 Vega II Duo cards (so 4 GPUs total),
MoltenVK assigns the same UUID to pairs of GPUs unless they
are connected with Infinity Fabric.

See more details here: KhronosGroup/MoltenVK#2683.

The right way is to fix this in MoltenVK, but until that happens,
llama.cpp would only recognize 2 of the 4 GPUs in such a configuration.

The deduplication logic here is changed to only filter out GPUs if the UUID is
the same but the driver is different.
2026-01-28 12:35:54 +01:00
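A conceptual sketch of the revised dedup rule described above, written in TypeScript for illustration; the actual change lives in the C++ Vulkan backend, and the device record shape here is an assumption.

```ts
// Illustrative sketch only, not the backend code.
interface VkDeviceInfo {
  uuid: string;
  driver: string;
  name: string;
}

function dedupDevices(devices: VkDeviceInfo[]): VkDeviceInfo[] {
  const kept: VkDeviceInfo[] = [];
  for (const dev of devices) {
    // Drop a device only when a kept device has the same UUID but a different driver
    // (the same physical GPU exposed twice). Same UUID with the same driver is kept,
    // since MoltenVK may label distinct GPUs with identical UUIDs.
    const isDuplicate = kept.some((k) => k.uuid === dev.uuid && k.driver !== dev.driver);
    if (!isDuplicate) kept.push(dev);
  }
  return kept;
}
```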