llama-perplexity -hf unsloth/Qwen3-0.6B-GGUF:Q4_K_M -f winogrande-debiased-eval.csv --winogrande
winogrande_score : tokenizing selected tasks
winogrande_score : calculating winogrande score over selected tasks.
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 46
failed to decode the batch, n_batch = 2048, ret = 1
winogrande_score: llama_decode() failed
same for hellaswag:
split_equal: sequential split is not supported when there are coupled sequences in the input batch (you may need to use the -kvu flag)
decode: failed to find a memory slot for batch of size 99
failed to decode the batch, n_batch = 2048, ret = 1
hellaswag_score: llama_decode() failed
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : transpose fused GDN state access for coalesced memory reads (#20436)
The fused Gated Delta Net kernel accessed the [S_v, S_v] state matrix
column-wise on row-major storage, causing strided reads (stride S_v =
128 floats = 512 bytes) that waste GPU cache bandwidth. This produced a
39% regression on Qwen3.5-9B (Metal, M4 Max) compared to the unfused
path.
Transpose the state indexing so threads read contiguously:
- Metal: s_ptr[is*S_v] -> s_ptr[is] (stride 1 vs S_v)
- CUDA: curr_state[i*S_v+col] -> curr_state[col*S_v+i] (coalesced)
- CPU: restructured loops for row-wise transposed access
Also add --fused-gdn [on|off|auto] CLI flag (mirrors --flash-attn) so
users can control fused GDN independently of auto-detection.
All GATED_DELTA_NET backend-ops tests pass.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* ggml : use SIMD dot products in CPU GDN kernel, couple AR/chunked fused flags
- Replace scalar inner loops with ggml_vec_dot_f32 for SIMD-optimized
dot products in the CPU fused GDN kernel (delta and attention output)
- Couple fused_gdn_ar and fused_gdn_ch flags in auto-detection: if one
path lacks device support, disable both to prevent state layout mismatch
between transposed (fused) and non-transposed (unfused) formats
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* llama : rever fgdn argument changes
* graph : remove GDN state transposes
* vulkan : adapt
* cuda : remove obsolete smem code
---------
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
* llama : fix pooling assertion crash in chunked GDN detection path
The chunked fused Gated Delta Net detection in sched_reserve() calls
graph_reserve(16*n_seqs, n_seqs, n_outputs, ...) where n_outputs = n_seqs.
This creates a dimension mismatch in build_pooling() for embedding models
with mean/rank pooling: build_inp_mean() creates a tensor with shape
[n_tokens=16*n_seqs, ...] while t_embd is reduced to [n_outputs=n_seqs, ...]
via out_ids, causing ggml_mul_mat to assert on ggml_can_mul_mat(a, b).
Fix: pass n_tokens as n_outputs in the chunked GDN graph reservation,
matching the pattern used by the pp/tg worst-case reservations.
Regression introduced by #20340 (d28961d).
Same class of bug as #12517, fixed by #12545.
* server : add mean pooling tests to embedding test suite
Add test_embedding_pooling_mean and test_embedding_pooling_mean_multiple
to cover the --pooling mean codepath, which was previously untested.
These tests would have caught the regression introduced by #20340 where
build_pooling() crashes with a ggml_mul_mat assertion due to mismatched
dimensions in the chunked GDN detection path.
---------
Co-authored-by: Domenico Crupi <domenico@zerovolt.it>
* server: reset kill-switch on client error
This avoids triggering a server kill switch.
If the client sends a request that exceeds the configured context size, an appropriate HTTP 400 response is provided and no tokens are generated.
However since no tokens are generated, update_slots() increments n_empty_consecutive. If the client sends 3 such messages in a row, the server terminates.
* moved counter reset as per recommendation
* cont : minor
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit renames the the function `mtmd_get_audio_bitrate` to
`mtmd_get_audio_sample_rate` to better reflect its purpose.
The motivation for this is that the function currently returns the audio
sample rate, not the bitrate (sample_rate × bit_depth × channels), and
that is how it is used in the code as well.
This is a breaking change, but I believe mtmd is still in
experimental/development phase so it might be alright to simply rename.
* convert : fix/suppress pyright errors
This commit fixes the pyright errors that are generated by pyright for
convert_hf_to_gguf.py.
The motivation for this is that running this locally generates errors
that CI does not, and it can be difficult to spot new errors. One use
case is when working on new models which cannot be run in CI due to
privacy. Having the ability to run pyright locally is would be helpful
in this cases.
In the linked issue there is the mention of switching to `ty` which I
don't know anything about but in the meantime I would appreciate if we
could suppress these errors for now, and later perhaps revert this
commit.
With this change there are no errors but there are 4 informations
messages if the `mistral_common` package is installed. The
`--level error` flag can be used to suppress them.
Resolves: https://github.com/ggml-org/llama.cpp/issues/20417