Commit Graph

12 Commits

Author SHA1 Message Date
itigges22 19fdba56b5 feat: MTP support for dense Qwen 3.5 with FastMTP vocabulary trimming
Add Multi-Token Prediction (MTP) speculative decoding for Qwen3.5 dense
models (0.8B-27B). The MTP head uses a full transformer block (attention
+ FFN) to predict the next-next token, enabling ~28 tok/s on RTX 5060 Ti.

Key changes:
- Model loading: Qwen3.5 MTP layer tensors (nextn.eh_proj, attention
  weights, FFN) loaded into layers[n_layer-1]
- Graph builder: Full MTP head with self-attention, gated RoPE, FFN,
  and vocabulary projection. Unfiltered hidden state passed for proper
  KV cache population during prompt processing.
- FastMTP: Vocabulary trimming from 248K to 32K tokens via ggml_view_2d
  on the lm_head. Reduces draft generation from 22ms to 6ms (3.7x).
- Speculative framework: MTP auto-detection for hybrid models, fuzzy
  seq_rm checkpoint matching for DeltaNet rollback.
- Server: Two-phase decode option for hybrid/recurrent models to avoid
  DeltaNet state corruption from rejected drafts.
- Recurrent state: Fixed copy_cell (ggml_view_1d takes element count,
  not bytes), buffer assignment for no_alloc views.

Results on Qwen3.5-9B Q4_K_M (RTX 5060 Ti 16GB):
- 28.1 tok/s with 82% acceptance rate (temp=0)
- 92% acceptance with two-phase decode (correct output, 15 tok/s)
- Draft generation: 6.1ms with FastMTP (vs 22.4ms full vocab)
2026-03-21 14:18:40 -04:00
Michael Grau 6729d4920c
model : add control vector support where missing (#20653)
* Add control vector functions to qwen3.5 and qwen-next models

* Add missing cvec compatibility to the rest of the models

* Adjust comments and formatting

* cleanup

* whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-18 23:25:12 +01:00
Aman Gupta 3c8521c4f5
llama-graph: replace cont with reshape for alpha in qwen35 (#20640) 2026-03-16 22:07:13 +08:00
Michael Wand d23355afc3
model : wire up Qwen3.5/Qwen3.5MoE tensors for NVFP4 support (#20506) 2026-03-14 22:44:42 +01:00
Georgi Gerganov d28961d81e
llama : enable chunked fused GDN path (#20340)
* llama : enable chunked fused GDN path

* models : avoid Q and K repeats when using fused GDA

* cont : fix comment

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix the fix

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* cont : fix

* metal : add GDN kernel (#20361)

* metal : add Metal backend for GGML_OP_GATED_DELTA_NET

Add a fused Metal kernel for the gated delta net recurrence op
(#19504), enabling GPU-accelerated inference for DeltaNet-based
models (Qwen3.5, etc.) on Apple Silicon.

Supports both GDA (scalar gate) and KDA (per-row gate) modes
with head_size 64 and 128. Unsupported configurations (head_size
32, non-contiguous tensors) gracefully fall back to CPU.

Performance: Qwen3.5-0.8B Q4_K_M on M4 Max
  tg128: 170 -> 213 t/s (+25%)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : validate contiguity of all input tensors in supports_op

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* metal : add algorithm equivalence comment for GDA decay path

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* cont : unslop + optimize

* cont : clean-up

---------

Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

* CUDA: AR gated delta net improvements (#20391)

* Add FastDiv to gated_delta_net_cuda

* Shard columns across warps

This reduces register pressure (avoids spill for S_v = 128) and gives
the warp-scheduler more CTAs to schedule (thus hiding data-access
latencies).

* Remove unneded include in gated_delta_net.cu

* Improve comments

* Apply code-formating

* Make sharding HIP-compatible

1. Use ggml_cuda_get_physical_warp_size() to determine warp size flexibly
2. Add test with partial warp to test sum reduction on CUDA

* Remove fastdiv_s64, as we can treat neqk1 and rq3 as uint32_t

* Rename variables

* Enable GDN also for prefill, move TODO for chunked_GDN

* Actually remove the TODO from 2068908975

* Get warp size at runtime

warp_size is not known at compile time in hip host code.

* Don't expose ggml_cuda_get_physical_warp_size on host

---------

Co-authored-by: uvos <devnull@uvos.xyz>

* llama : refactor llm_build_delta_net_base API

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Paul Flynn <paul@arkavo.com>
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
Co-authored-by: Oliver Simons <osimons@nvidia.com>
Co-authored-by: uvos <devnull@uvos.xyz>
2026-03-11 22:46:40 +02:00
Xuan-Son Nguyen 59db9a357d
llama: dynamic head_dim and n_rot for SWA (#20301)
* llama: dynamic head_dim and n_rot for SWA

* also add gguf_writer wrappers

* fix build

* build_rope_shift arg reorder
2026-03-09 22:22:39 +01:00
Aman Gupta c5a778891b
ggml: add GATED_DELTA_NET op (#19504)
* ggml: add GATED_DELTA_NET op

* remove the transpose

* add KDA

* add qwen35 dense

* llama : check for fused gated delta net backend support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-07 15:41:10 +08:00
Georgi Gerganov 244641955f
models : fix graph splits (#19866) 2026-02-25 00:01:13 +02:00
Georgi Gerganov da348c9dfb
models : fix qwen3.5 beta/gate shapes (#19730)
* models : fix qwen3.5 beta/gate shapes

* cont : avoid extra reshapes
2026-02-19 15:19:53 +02:00
Georgi Gerganov 27326bfce1
models : dedup qwen35 graphs (#19660)
* models : dedup qwen35 graphs

* cont : add missing sigmoid
2026-02-19 08:17:49 +02:00
Georgi Gerganov cc45f2ada6
models : deduplicate delta-net graphs for Qwen family (#19597)
* models : add llm_build_delta_net_base

* cont : keep qwen35 and qwen35moe graphs intact

* cont : add comments
2026-02-16 14:35:04 +02:00
JJJYmmm fc0fe40049
models : support qwen3.5 series (#19468)
* support qwen3.5 series

* remove deepstack for now, and some code clean

* code clean

* add FULL_ATTENTION_INTERVAL metadata

* code clean

* reorder v heads for linear attention to avoid expensive interleaved repeat
2026-02-10 18:00:26 +02:00