llama.cpp

Commit Graph

Author	SHA1	Message	Date
Imad Saddik	fa7d3a96c5	fix: keep the container spanning the whole width to fix scroll bar issue	2026-03-15 17:39:19 +00:00
Imad Saddik	c399ec3c46	chore: update webui build output	2026-03-15 17:36:11 +00:00
Imad Saddik	72c1928dc9	refactor: move widthClasses.class to the container div	2026-03-15 17:34:31 +00:00
Imad Saddik	0cd953eea2	chore: update webui build output	2026-03-15 17:13:14 +00:00
Imad Saddik	6074619ba4	style: restore class for checkbox labels	2026-03-15 17:11:58 +00:00
Imad Saddik	297abf8450	chore: update webui build output	2026-03-15 17:09:59 +00:00
Imad Saddik	d4034eff07	fix: update chatWidthClasses to use autoChatWidth configuration	2026-03-15 17:08:43 +00:00
Imad Saddik	b73209694d	chore: update webui build output	2026-03-15 17:03:58 +00:00
Imad Saddik	2630c27754	refactor: simplify chatWidthClasses getter logic and remove widthClasses.class	2026-03-15 17:02:41 +00:00
Imad Saddik	1a6f21f25c	chore: revert package-lock.json to match master	2026-03-15 16:56:41 +00:00
Imad Saddik	95be04617e	chore: update webui build output	2026-03-15 16:56:08 +00:00
Imad Saddik	2836834801	refactor: remove anything related to the custom chat width setting	2026-03-15 16:54:43 +00:00
Imad Saddik	20a8227933	chore: update webui build output	2026-03-15 16:44:35 +00:00
Imad Saddik	89647d5daf	chore: downgrade @lucide/svelte version and remove custom chat width component	2026-03-15 16:43:18 +00:00
Imad Saddik	c55533a706	chore: update webui build output	2026-03-14 09:00:23 +00:00
Imad Saddik	29ede762c4	refactor: don't reset custom chat width when the auto width is checked	2026-03-14 08:59:02 +00:00
Imad Saddik	e8eccf9b35	feat: update chatWidthClasses to prioritize auto chat width	2026-03-14 08:57:25 +00:00
Imad Saddik	23758f3ba8	feat: add syncable parameters for auto and custom chat width	2026-03-14 08:55:56 +00:00
Imad Saddik	b7851305df	chore: update webui build output	2026-03-14 08:29:24 +00:00
Imad Saddik	e2a6be14e7	fix: pass style to ChatMessageUser	2026-03-14 08:09:33 +00:00
Imad Saddik	bcc95c98cb	refactor: remove chatWidthClasses from ChatForm	2026-03-14 08:05:20 +00:00
Imad Saddik	16fcb29197	fix: use widthClasses in ChatScreenForm	2026-03-14 08:03:48 +00:00
Imad Saddik	5a721b5678	chore: update webui build output	2026-03-14 06:57:42 +00:00
Imad Saddik	ee944af476	style: fix indentation and formatting in ChatForm.svelte	2026-03-14 06:54:26 +00:00
Imad Saddik	8f9571a5c2	refactor: use derived on chatWidthClasses for consistency	2026-03-14 06:53:13 +00:00
Imad Saddik	0306577300	chore: update webui build output	2026-03-14 06:48:49 +00:00
Imad Saddik	c3cb3fcfcd	refactor: remove unused chat width parameters from syncable parameters	2026-03-14 06:47:28 +00:00
Imad Saddik	19986697e3	chore: update webui build output	2026-03-14 06:46:26 +00:00
Imad Saddik	8cef196854	style: fix formatting	2026-03-14 06:44:43 +00:00
Imad Saddik	4561f25021	refactor: call chatWidthClasses once and reuse it everywhere	2026-03-14 06:43:19 +00:00
Imad Saddik	165234e722	fix: correct typo in disabled message for automatic width	2026-03-14 06:37:06 +00:00
Imad Saddik	b9545a1021	chore: update webui build output	2026-03-14 06:33:38 +00:00
Imad Saddik	5dd9b7d888	Merge branch 'master' into feat/change_chat_screen_width	2026-03-14 06:32:15 +00:00
Adrien Gallouët	77e20cc107	vendor : update cpp-httplib to 0.37.2 (#20484) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-14 06:51:02 +01:00
Rail Chabdarov	5a32a9b8a5	Fix data race in CUDA's "cpy" kernel (influences GGML's DUP, CONT operations). (#20507) * Fix datarace in CUDA's "cpy" kernel. * Remove extra barrier by using more of shared memory.	2026-03-14 13:19:44 +08:00
lhez	3b439504ba	opencl: fix l2_norm (#20480)	2026-03-13 22:18:52 -07:00
Adrien Gallouët	463b6a963c	tools : enable kvu in perplexity for hellaswag, winogrande, multiple-choice (#19954) llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande winogrande_score : tokenizing selected tasks winogrande_score : calculating winogrande score over selected tasks. split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 46 failed to decode the batch, n_batch = 2048, ret = 1 winogrande_score: llama_decode() failed same for hellaswag: split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag) decode: failed to find a memory slot for batch of size 99 failed to decode the batch, n_batch = 2048, ret = 1 hellaswag_score: llama_decode() failed Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-13 21:25:57 +01:00
Georgi Gerganov	e30f1fdf74	graph : remove redundant GDN state transposes (#20443) * ggml : transpose fused GDN state access for coalesced memory reads (#20436) The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix column-wise on row-major storage, causing strided reads (stride S_v = 128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a 39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused path. Transpose the state indexing so threads read contiguously: - Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v) - CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced) - CPU: restructured loops for row-wise transposed access Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so users can control fused GDN independently of auto-detection. All GATED_DELTA_NET backend-ops tests pass. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags - Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized dot products in the CPU fused GDN kernel (delta and attention output) - Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one path lacks device support, disable both to prevent state layout mismatch between transposed (fused) and non-transposed (unfused) formats Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * llama : rever fgdn argument changes * graph : remove GDN state transposes * vulkan : adapt * cuda : remove obsolete smem code --------- Co-authored-by: Paul Flynn <paul@arkavo.com> Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Oliver Simons <osimons@nvidia.com>	2026-03-13 22:12:54 +02:00
Piotr Wilkin (ilintar)	1430c35948	common/parser: gracefully handle undetected tool parser, print error message. (#20286)	2026-03-13 20:56:10 +01:00
ZeroV0LT	f17b3be63f	llama : fix pooling assertion crash in chunked GDN detection path (#20468) * llama : fix pooling assertion crash in chunked GDN detection path The chunked fused Gated Delta Net detection in sched_reserve() calls graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs. This creates a dimension mismatch in build_pooling() for embedding models with mean/rank pooling: build_inp_mean() creates a tensor with shape [n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...] via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b). Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation, matching the pattern used by the pp/tg worst-case reservations. Regression introduced by #20340 (`d28961d`). Same class of bug as #12517, fixed by #12545. * server : add mean pooling tests to embedding test suite Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple to cover the --pooling mean codepath, which was previously untested. These tests would have caught the regression introduced by #20340 where build_pooling() crashes with a ggml_mul_mat assertion due to mismatched dimensions in the chunked GDN detection path. --------- Co-authored-by: Domenico Crupi <domenico@zerovolt.it>	2026-03-13 20:53:42 +02:00
SoftwareRenderer	d7ba99c485	server: reset counter related to kill-switch on client error (#20513) * server: reset kill-switch on client error This avoids triggering a server kill switch. If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated. However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates. * moved counter reset as per recommendation * cont : minor --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-13 19:58:09 +02:00
rehan-10xengineer	fbaa95bc29	ggml-cpu: add RVV vec dot kernels for quantization types (#18859) * ggml-cpu: add rvv quantize_row_q8_K kernel Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: add rvv vec_dot for iq4_nl, mxfp4, iq2_xxs Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: add rvv vec_dot for iq4_xs, refactor * ggml-cpu: remove ifunc for rvv vec dot * ggml-cpu: add vec_dot for iq2_xs, iq3_xxs Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> * ggml-cpu: refactor quants.c --------- Co-authored-by: taimur-10x <taimur.ahmad@10xengineers.ai> Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai> Co-authored-by: Rehan Qasim <rehanbhatti0317@gmail.com>	2026-03-13 17:36:04 +02:00
Adrien Gallouët	b5e1212063	ggml : fix typo gmml (#20512) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-13 14:36:13 +01:00
Daniel Bevenius	8f974d2392	mtmd : rename mtmd_get_audio_bitrate to mtmd_get_audio_sample_rate (#20105) This commit renames the function `mtmd_get_audio_bitrate` to `mtmd_get_audio_sample_rate` to better reflect its purpose. The motivation for this is that the function currently returns the audio sample rate, not the bitrate (sample_rate × bit_depth × channels), and that is how it is used in the code as well. This is a breaking change, but I believe mtmd is still in experimental/development phase so it might be alright to simply rename.	2026-03-13 12:30:02 +01:00
Piotr Wilkin (ilintar)	2948e6049a	general: CONTRIBUTING.md - guidelines for quantization schemes (#19762) * Guidelines for quantization schemes * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Change required precision from Q8 to FP16/BF16 * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update CONTRIBUTING.md [no ci] * Update CONTRIBUTING.md [no ci] --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-13 12:21:33 +01:00
Georgi Gerganov	73c9eb8ced	metal : fix l2 norm scale (#20493)	2026-03-13 11:43:20 +02:00
Daniel Bevenius	983df142a9	convert : fix/suppress pyright errors (#20442) * convert : fix/suppress pyright errors This commit fixes the pyright errors that are generated by pyright for convert_hf_to_gguf.py. The motivation for this is that running this locally generates errors that CI does not, and it can be difficult to spot new errors. One use case is when working on new models which cannot be run in CI due to privacy. Having the ability to run pyright locally would be helpful in these cases. In the linked issue there is the mention of switching to `ty` which I don't know anything about but in the meantime I would appreciate if we could suppress these errors for now, and later perhaps revert this commit. With this change there are no errors but there are 4 informational messages if the `mistral_common` package is installed. The `--level error` flag can be used to suppress them. Resolves: https://github.com/ggml-org/llama.cpp/issues/20417	2026-03-13 06:00:52 +01:00
Georgi Gerganov	57819b8d4b	llama : disable graph reuse with pipeline parallelism (#20463)	2026-03-12 21:04:13 +02:00
Alessandro de Oliveira Faria (A.K.A.CABELO)	557fe2d913	vendor : update cpp-httplib to 0.37.1 (#20390)	2026-03-12 13:57:06 +01:00
Piotr Wilkin (ilintar)	0e810413bb	tests : use `reasoning` instead of `reasoning_budget` in server tests (#20432)	2026-03-12 13:41:01 +01:00
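
Note on commit 8f974d2392: the commit message distinguishes the audio sample rate from the bitrate, which it gives as sample_rate × bit_depth × channels. The snippet below is only an illustrative sketch of that formula for uncompressed PCM audio. The function name `audio_bitrate` and the example values (16 kHz, 16-bit, mono) are assumptions made for illustration and are not taken from the mtmd code.

```python
def audio_bitrate(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Return the uncompressed PCM bitrate in bits per second.

    Implements the formula quoted in the commit message:
    bitrate = sample_rate * bit_depth * channels.
    """
    return sample_rate_hz * bit_depth * channels


# Assumed example values, chosen only for illustration.
sample_rate_hz = 16_000  # samples per second (what the renamed function reports)
bit_depth = 16           # bits per sample
channels = 1             # mono

bitrate_bps = audio_bitrate(sample_rate_hz, bit_depth, channels)
print(f"sample rate: {sample_rate_hz} Hz, bitrate: {bitrate_bps} bit/s")
```

The two quantities differ by the factor bit_depth × channels, so a function that returns the first should not be named after the second.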

8408 Commits

All Branches