llama.cpp

Commit Graph

Author	SHA1	Message	Date
Reese Levine	5a23695d5a	ggml-webgpu: Update register tiling matmul to use f32 accumulation (#21644 ) * Update register tiling matmul to use f32 accumulation * fix profiling code * Fix register tiling matmul for chrome, i'm blaming dawn * Update batch tuning value for iOS * compile fix * Fix use of new load function	2026-04-14 13:46:41 +03:00
Jeff Bolz	6a6780a232	vulkan: Support GGML_TYPE_NVFP4 (#21455 ) This adds nvfp4 support for get_rows, dequant, and mul_mat(_id). For mul_mat, it does not add support for the dp4/q8_1 path, it's all via fp16/fp32.	2026-04-14 11:34:23 +02:00
Ruben Ortlam	75f3bc94e6	vulkan: Flash Attention DP4A shader for quantized KV cache (#20797 ) * use integer dot product for quantized KV flash attention * small improvements * fix SHMEM_STAGING indexing * add missing KV type quants * fixes * add supported quants to FA tests * readd fast paths for <8bit quants * fix mmq gate and shmem checks	2026-04-13 14:21:31 +02:00
Oliver Simons	9f5e1edb10	CUDA: Limit DeviceSegmentedSort to immediate mode (#21718 ) * CUDA: Limit DeviceSegmentedSort to immediate mode DeviceSegmentedSort is currently not capturable in a cuda graph. Hence, we have to go for the slower DeviceSegmentedRadixSort in that case. Perf numbers on RTX Pro 6000 Blackwell Max-Q: DeviceSegmentedRadixSort in graph mode (i.e. CUDA Graphs) ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 12291 runs - 105.94 us/run - 8192 kB/run - 73.75 GB/s ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 10245 runs - 115.08 us/run - 16384 kB/run - 135.77 GB/s ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 221.22 us/run - 32768 kB/run - 141.26 GB/s ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 430.98 us/run - 65536 kB/run - 145.02 GB/s ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1028 runs - 1185.83 us/run - 131072 kB/run - 105.41 GB/s ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 387 runs - 2748.62 us/run - 262144 kB/run - 90.95 GB/s DeviceSegmentedSort in immediate mode ARGSORT(type=f32,ne=[2048,512,1,1],order=1): 16388 runs - 71.17 us/run - 8192 kB/run - 109.78 GB/s ARGSORT(type=f32,ne=[4096,512,1,1],order=1): 12294 runs - 81.38 us/run - 16384 kB/run - 192.00 GB/s ARGSORT(type=f32,ne=[8192,512,1,1],order=1): 5125 runs - 240.81 us/run - 32768 kB/run - 129.77 GB/s ARGSORT(type=f32,ne=[16384,512,1,1],order=1): 2565 runs - 406.60 us/run - 65536 kB/run - 153.71 GB/s ARGSORT(type=f32,ne=[32768,512,1,1],order=1): 1285 runs - 873.23 us/run - 131072 kB/run - 143.15 GB/s ARGSORT(type=f32,ne=[65536,512,1,1],order=1): 516 runs - 2288.46 us/run - 262144 kB/run - 109.24 GB/s * Add test case for dispatch to DeviceSegmentedRadixSort We currently lack a way to force graph mode in CUDA, patch callback to invoke ggml_backend_compare_graph_backend twice to enforce each test to run in graph mode	2026-04-13 11:14:06 +02:00
Masashi Yoshimura	bafae27654	Remove extra conditional check on debug mode. (#21798 )	2026-04-12 20:13:04 -07:00
Akarshan Biswas	873c825611	sycl: disable Q1_0 in backend and cleanup unused variables (#21807 )	2026-04-13 09:44:58 +08:00
Stephen Cox	547765a93e	mtmd: add Gemma 4 audio conformer encoder support (#21421 ) * mtmd: add Gemma 4 audio conformer encoder support Add audio processing for Gemma 4 E2B/E4B via a USM-style Conformer. Architecture: - 12-layer Conformer: FFN → Self-Attention → Causal Conv1D → FFN → Norm - Subsampling Conv Projection: 2x Conv2D(stride=2) with LayerNorm - Full self-attention with sinusoidal RPE and sliding window mask (24) - Logit softcapping at 50.0, ClippableLinear clamping - Output: 1024 → 1536 → RMSNorm → multimodal embedder Mel preprocessing (dedicated mtmd_audio_preprocessor_gemma4a): - HTK mel scale, 128 bins, magnitude STFT, mel_floor=1e-3 - Standard periodic Hann window (320 samples), zero-padded to FFT size - Semicausal left-padding (frame_length/2 samples) - Frame count matched to PyTorch (unfold formula) - No pre-emphasis, no Whisper-style normalization - Mel cosine similarity vs PyTorch: 0.9998 Key fixes: - Tensor loading dedup: prevent get_tensor() from creating duplicate entries in ctx_data. Fixed with std::set guard. - ClippableLinear clamp_info loading moved after per-layer tensors. - Sliding window mask (24 positions) matching PyTorch context_size. - Skip Whisper normalization for Gemma4 mel output. Tested on E2B and E4B with CPU and Vulkan backends. Transcribes: "Glad to see things are going well and business is starting to pick up" (matching ground truth). Ref: #21325	2026-04-12 14:15:26 +02:00
Johannes Gäßler	ff5ef82786	CUDA: skip compilation of superfluous FA kernels (#21768 )	2026-04-11 18:52:11 +02:00
shaofeiqi	af1127d3c4	opencl: add basic support for q5_k (#21593 ) * opencl: add general q5_k mv * opencl: add flattened Q5_K mv and general Q5_K mm * opencl: fix Q5_K unit tests	2026-04-11 01:46:19 -07:00
Sigbjørn Skjæret	2b2cd57de6	ggml : fix a few instances of missing GGML_TYPE_Q1_0 cases (#21716 )	2026-04-11 09:45:00 +03:00
Aman Gupta	a29e4c0b7b	CUDA: also store node->src ne/nb for graph equality (#21736 )	2026-04-11 10:30:30 +08:00
Max Krasnyansky	9aa2807769	hexagon: improved Op queuing, buffer and cache management (#21705 ) * hexagon: introduce op request batching and rewrite buffer managment The host now prepares batches of requests and dispatches them via a single dspqueue message. Buffers are mapped explicitly by NPU while processing batches. * hex-dma: disable l2 bypass since to work around new issue due to no flushes between Ops * hex-utils: add explicit l2flush and l2clear helpers * hex-opreq: use fine-grain per tensor l2 management * hex-opreq: avoid redundant invalidates for tensors we already flushed * hex-opreq: update debug messages * htp-opreq: reuse ops_context * hex-opreq: do not flush or invalidate cache lines beyond buffer boundry * hex-opreq: fix errors in log message * Revert "hex-opreq: do not flush or invalidate cache lines beyond buffer boundry" This reverts commit 8b7f0a55a750a6430ce4eb1874c7feb3d720056d. * hexagon: limit l2 flushes to 1MB which covers l2 cache * hex-opreq: limit cache flush to 4MB Looks like 4MB cont. vitual space should cover the 1MB cache. * hexagon: drop cache flush size to 2MB * hex-opreq: start reworking opreq packing * hex-opreq: introduce new way of packing opbatch where tensors are stored separately * hex-opreq: add a simple fastrpc call to force unmap all buffers * hex-l2flush: somehow 2MB does not seem robust, also cleanup step size to use line-size * hex-opreq: bump opreq batch size to 256 * hex-mm: place src1 spad at the top of vtcm for easy reuse * hex-ops: introduce internal types and disable src1 reuse for now Nothing new just formalizing the repack / qyn.quant types we've been using. * htp-opreq: use tensor pointers instead of copies * hex-opreq: introduce more robust way for tracking vtcm/spad reuse This removes the SKIP_QUANTIZE flag that became fragile with the addition of HMX and other ops. * hex-cumsum: fix error post opreq merge * hex-opreq: move request batch handling into the session Prepping everything for using dspqueue buffers and doing that inside the session is much cleaner. * hex-mm: yet another fix for src1 reuse when we're mixing hmx/hvx * hex-bufs: introduce pinned mmapings and use non-pinned ones for model buffers * hex-buf: add support for allocating shared/pinned buffer for opreqs * hex-opbatch: make opbatches configurable * hex-naming: better name for ggml_hexagon_shared_buffer * hex-naming: add session->c_name() helper * hex-opbatch: start using shm but still copy for now * hex-opbatch: use shared buffer for packing opbatch * hex-opbatch: beter naming for opbatch related classes and code * hex-opbatch: reuse batched tensors with same data/dims/strides * hex-opbatch: update logging * hex-opbatch: add support for vmem limit for op batching * hex-opbatch: update htp side to properly support dynamic mmap/unmap * hex-opbatch: add OB and OQ params for run-completion script and fix the asserts in batch processing * hex-opbatch: fixed src1 handling in act ops * hex-act: fix empty src1 handling in swiglu and friends Simplify preamble macro while at it * hex-mm: minor fix vtcm and dma handling in matmul cleaning up some left-overs from merges * hex-opbatch: allocate extra 1KB for dspqueue overhead * hexagon: fix softmax for non-aligned tensors and cleanup vtcm alloc * hex-mm: properly handle hmx_disabled flag * hex-ops: update comments * hex-ops: add debug output for get/set-rows * hex-mmap: optimize un/mapping of buffers * hex-opreq: global cache flush and invalidate beyond 128KB threshold * hex-ops: add super simple opfilter regex for debugging If an Op matches the regex hex backend will reject it. * hex-opbatch: wireup newer ops missed in merge and update main switch to detect this in future * hexagon: improved vtcm acquision to remove inter-op overhead Fully compatible with QNN-HTP coex * hex-mm: fixed hvx fallback path * hex-mm: lower the vmem threshold a bit further to ~3GB * hexagon: update debug & error logs This also fixes an issue with newer llvm merging repack and non-repack functions. We use those pointer to distinguish between buffer types. * hexagon: move ops context into main context Just a cleanup. We don't need separate contexts at this point. * hex-opbatch: cleanup naming and headers for opbatch and related descriptors * hex-fa: it's now better to enable FA during TG to reduce graph splits * hexagon: remove GGML_HEXAGON_EXPERIMENTAL env var It's no longer useful. Please use more flexible GGML_HEXAGON_OPFILTER to disable Ops if needed for debugging or validation. * hexagon: fixed editorconfig check * Update ggml/src/ggml-hexagon/ggml-hexagon.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Trivikram Reddy <tamarnat@qti.qualcomm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-10 15:47:43 -07:00
Rithik Sharma	bfd1f453cb	ggml-webgpu: support non-square subgroup matrix configs for Intel GPUs (#21669 )	2026-04-10 10:52:38 -07:00
Chen Yuan	e4fed9d08d	ggml-webgpu: address quantization precision and backend lifecycle managment (#21521 ) * ggml(webgpu): fix the busy-polls in Emscripten in the waitAny after #20618, and remove the busy webgpu log * Merge with upstream * Fix GET_ROWS packed integer NaN when using f16 as memory buffer in shader quants * Update Unary wgsl EXP and EXPM1 for f16 stability * Fix GET_ROWS IQ4_XS strcut for NaN f16 canonicalization * Fix numerical percision for unary sqrt when working with f16 * Fix NaN canonicalization for packed integers using f16 * Update err threshold for binary div ops when using f16 * backend: Keep one Dawn/WebGPU instance alive for the lifetime of the static backend * clean: uncomment existing code logs * clean: clean the unncessary debug info * Refactor and generalize dequant helpers * Remove deprecated quant structs * Refactor shader defines to reduce repetition * Remove error override for F16 type * fix: fix the accidential removal of the proper initialization of ctx * clean: clean legacy and format code * fix: did not modify tests ops --------- Co-authored-by: Jeremy J. Hartmann <jeremy@mtion.tv>	2026-04-10 10:52:01 -07:00
Jeff Bolz	7b69125331	vulkan: Support Q1_0 (#21539 ) * vulkan: Support Q1_0 * use get_dm	2026-04-10 08:35:27 +02:00
Aman Gupta	e34f042154	CUDA: fuse muls (#21665 )	2026-04-10 10:24:09 +08:00
andyluo7	d132f22fc9	HIP: add CDNA4 (gfx950) architecture support for MI350X/MI355X (#21570 ) Add AMD Instinct MI350X/MI355X (gfx950, CDNA4) support: - vendors/hip.h: Add CDNA4 preprocessor define for __gfx950__ - common.cuh: Add GGML_CUDA_CC_CDNA4 and GGML_CUDA_CC_IS_CDNA4 macros - mma.cuh: Route CDNA4 to compatible MFMA instructions: * f32 matmul: mfma_f32_16x16x4f32 (xf32 variant unavailable on gfx950) * bf16 matmul: mfma_f32_16x16x16bf16_1k (same as CDNA3) * int8 matmul: mfma_i32_16x16x32_i8/32x32x16 (same as CDNA3) - mmq.cuh: Include CDNA4 in stream-k kernel dispatch CDNA4 is largely compatible with CDNA3 except: - No xf32 MFMA (mfma_f32_16x16x8_xf32) — routes to f32 path - Different FP8 format (e4m3fn vs e4m3_fnuz) — not changed here Tested on AMD Instinct MI355X (gfx950), ROCm 7.0.1: - Build: compiles cleanly with -DAMDGPU_TARGETS=gfx950 - llama-bench (Qwen2.5-1.5B Q4_K_M, single GPU): * f16+FA: 40,013 tok/s prefill, 254 tok/s decode * q8_0+FA: functional - Flash attention: works correctly - MMQ: works correctly with stream-k dispatch Co-authored-by: Andy Luo <andyluo7@users.noreply.github.com>	2026-04-09 21:13:32 +02:00
Johannes Gäßler	d6f3030047	ggml: backend-agnostic tensor parallelism (experimental) (#19378 ) * ggml: backend-agnostic tensor parallelism * support for GPT-OSS, Qwen 3 MoE * partial Vulkan fix * add support for 4/8 GPUs * unconditional peer access * re-use buffers + ggml contexts * fix output pattern * NCCL support * GGML: HIP: add RCCL support * Remove shfl and AllReduce from backend interface * move allocation workaround out of ggml-alloc.c * 2d tensor set/get support * Fix the seg fault without NCCL * Apply suggestion from JohannesGaessler * support for tensor dims % n_devs != 0 * fix view_offs scaling * arbitrary num. of GPUs/tensor split * fix compilation * better granularity estimate * Support device-specific host buffer types if all underlying backends expose the same type. This allows using pinned memory instead of pageable memory for CUDA. Fix compilation errors. * partial Qwen 3 Next support * Fix qwen3 30b (#8) * Fix crash with Qwen-30B-A3B Q4_0 Qwen-30B-A3B Q4_0 has an intermediate dimension of 768. Using a granularity of 256 forces an uneven split between GPUs, which is not supported by the current implementation. * Decide block size based on tensor quantization type * Fix crashes due to KV cache serialization (#9) KV cache serialization requires non-zero offsets on the tensor. Add support in the meta backend to set/get a tensor with a non-zero offset. * metal : fix build (#7) * static memory allocations, fix usage count * fix tensor granularity * more even memory distribution * use BF16 for allreduce * rebase fixup * better error message for unsupported architectures * Fix device mismatch during scatter of allReduce. (#11) There is a mismatch between the dst buffer device and the backend device, causing the use of sync copies * Enable the previous allreduce implementation. It is better in both perf and stability (#12) * delay AllReduce for Moe for less I/O * build : clean-up compile warnings * backend : move most of the meta backend API to ggml-backend-impl.h * cont : hide unused public API in the implementation * llama : use llama_device + remove ggml_backend_dev_is_meta() * ggml-backend : remove unused alloc include * minor : remove regex include * ggml : introduce ggml-ext.h for staging new APIs * rebase fixup * fix tests * llama : more robust logic for determining Meta devices (#16) * llama : more robust logic for determining Meta devices * cont : fix devs size check Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cont : fix log type Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * disable roundtrip for meta backend * fix arch selection * Qwen 3.5 support * fix Gemma 4 MoE * fix OpenVino, SYCL * fix test-llama-archs for CPU-only builds * Fix Qwen 3.5 MoE * disable meta backend tests for WebGPU * tests : filter CPU-based devices from the Meta backend tests (#17) * meta : formatting, naming, indentation (#18) * formatting : llama-model.cpp * formatting : ggml-ext.h * formatting : ggml-backend-meta.cpp * meta : add TODO * add documentation * better error messages * fix GPT-OSS --------- Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz> Co-authored-by: Gaurav Garg <gaugarg@nvidia.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-09 16:42:19 +02:00
fairydreaming	009a113326	ggml : check return value of CUB calls used in argsort and top-k (they all return cudaError_t) (#21676 ) Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>	2026-04-09 21:17:11 +08:00
Georgi Gerganov	5e9c635463	metal : add missing mm-id specializations for q1_0 (#21662 )	2026-04-09 10:54:00 +03:00
Akarshan Biswas	b54cb2e3d0	sycl : add flash-attn support for head size 512 (#21654 ) * sycl : add flash-attn support for head size 512 This patch extends the SYCL Flash Attention implementation to support head sizes (DKQ/DV) of 512. Changes: - Added DKQ/DV 512 cases to both tile and vector Flash Attention kernels. - Updated kernel selection logic to allow vector kernels for head sizes up to 512 (previously 256). - Removed unused/redundant AMD and RDNA-specific configuration functions in `fattn-tile.hpp`. - Refactored `ggml_backend_sycl_buffer_init_tensor` to use a switch statement for clearer tensor extra buffer initialization. - Added necessary template instances for the new 512 head size across various quantization types. * remove defunct mxfp4 reorder from setting buffer type	2026-04-09 09:36:48 +03:00
Ruben Ortlam	8a132faaa0	vulkan: unify type macros to use Vx instead of _VECx (#21605 )	2026-04-09 07:31:51 +02:00
Aman Gupta	d12cc3d1ca	CUDA: also store `node->src->data` ptrs for equality check (#21635 ) * CUDA: also store node->src->data ptrs for equality check * address review comments	2026-04-09 01:01:56 +08:00
RealOrko	2dcb7f74ed	fix: free ctx_copy in ggml_opt_free to plug per-training-session leak (#21592 ) * fix: free ctx_copy in ggml_opt_free to plug per-training-session leak ggml_opt_alloc populates opt_ctx->ctx_copy via a free+init pair every time the allocated graph shape changes. The last ctx_copy from the final ggml_opt_alloc call survives until ggml_opt_free is invoked, but ggml_opt_free was only freeing ctx_static and ctx_cpu, never ctx_copy. Each opt_ctx lifetime therefore leaks the final per-batch context — ~900 KB for a typical GNN training session in sindarin-pkg-tensor, surfaced via AddressSanitizer. ctx_copy is nullptr-initialized and ggml_free() handles NULL safely, so the new release is guard-free. * Update ggml/src/ggml-opt.cpp Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: realorko <realorko@nowhere.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-04-08 17:40:15 +02:00
Reese Levine	5473949070	webgpu : Query for adapter support when registering WebGPU backend (#21579 )	2026-04-08 16:08:29 +03:00
Pasha Khosravi	dcdcbad42a	metal: Q1_0 backend (#21528 ) * initial Q1_0 Metal backend * tuning q1_0 metal kernels * add Q1_0 to test-backend-ops * add Q1_0<->F32 copy test * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-04-08 16:07:47 +03:00
Aman Gupta	c5ce4bc227	CUDA: make cuda graphs props check faster (#21472 ) * CUDA: compute fast hash instead of expensive props check * use seen node * use memcp	2026-04-08 09:05:51 +08:00
iacopPBK	66c4f9ded0	ggml-cuda: ds_read_b128 for q4_0 and q4_1 mmq kernels (#21168 ) * ds_read_b128 for q4_0 and q4_1 mmq kernels Current for loop generates ds_read_b32 instructions with hip compiler, the new solution generates ds_read_b128 instructions for the same operation, saving some LDS bandwidth. Tested on MI50 and RX6800XT, its faster on both. * Vectorized lds load update: used ggml_cuda_get_max_cpy_bytes and ggml_cuda_memcpy_1 functions for generic implementation * Explicit for loop in mmq, renamed vec into tmp * Fixed max_cpy usage in the loading loop * Fixed typo in q4_1 kernel * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Update ggml/src/ggml-cuda/mmq.cuh Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * Renoved trailing white line 500 * Update mmq.cuh removed other whitelines * Remove trailing whitespaces --------- Co-authored-by: iacopPBK <iacopPBK@users.noreply.github.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: iacopPBK <iacop@deneb.com>	2026-04-07 21:47:42 +02:00
Reese Levine	957d717ce5	ggml-webgpu: parameterize submission size and add iOS specific limits (#21533 ) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting * Add parameters for different browsers in-flight submissions * Update handling of batch size too * Throttle ios as much as possible * Increase timeout for llvm-pipe testing	2026-04-07 20:30:01 +03:00
Aman Gupta	de1aa6fa73	CUDA: check for buffer overlap before fusing (#21566 ) * CUDA: check for buffer overlap before fusing * use ggml_cuda_check_fusion_memory_ranges	2026-04-08 00:57:04 +08:00
Georgi Gerganov	22fc79134e	ggml : deprecate GGML_OP_ADD1 (#21363 ) * ggml : deprecate GGML_OP_ADD1 * cont : remove tests * cont : re-enable vulkan check	2026-04-07 15:28:27 +03:00
Tom Overlund	2a619f6fbc	ggml: Vulkan build, Linux -- output error string for errno on fork failure (#20868 ) (#20904 )	2026-04-07 13:54:55 +02:00
mkoker	edd4d9bca5	vulkan: add FA dequant for q4_1, q5_0, q5_1, iq4_nl (#21029 ) Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL in the flash attention base shader. Register them in the shader generator, pipeline creation, and enable in the scalar/coopmat1 FA support check.	2026-04-07 13:41:29 +02:00
Antoine Viallon	71a81f6fcc	ggml-cuda : fix CDNA2 compute capability constant for gfx90a (MI210) (#21519 ) GGML_CUDA_CC_CDNA2 was set to 0x910 Fix by setting the constant to 0x90a to match the actual gfx90a ISA.	2026-04-07 12:18:55 +02:00
PMZFX	0988accf82	[SYCL] Add Q8_0 reorder optimization (~3x tg speedup on Intel Arc) (#21527 ) Extend the existing reorder optimization to Q8_0. The reorder separates scale factors from weight data for coalesced memory access -- was implemented for Q4_0/Q4_K/Q6_K but Q8_0 was missing. On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x) on Qwen3.5-27B. BW utilization: 21% -> 66%. The key fix beyond the kernels: Q8_0 was missing from the type check in ggml_backend_sycl_buffer_init_tensor() that allocates the extra struct carrying the reorder flag -- so the optimization was silently skipped. AI (Claude) was used to assist with root cause investigation and writing the kernel code. All code was human-reviewed and tested on real hardware. Fixes: #21517	2026-04-07 16:12:49 +08:00
Masashi Yoshimura	d0a6dfeb28	ggml-webgpu: Add the support of `MUL_MAT_ID` (#21147 ) * Add mul_mat_id support to WebGPU * Apply suggestion from @reeselevine --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-04-06 13:08:46 -07:00
Pasha Khosravi	2e1f0a889e	ggml: add Q1_0 1-bit quantization support (CPU) (#21273 ) * ggml: add Q1_0 and Q1_0_g128 1-bit quantization support (CPU) * add generic fallback for x86 * remove Q1_0 (group size 32) * rename Q1_0_g128 => Q1_0 * fix Q1_0 LlamaFileType Enum * Fix trailing spaces; add generic fallback for othre backends * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * fix /r/n spacing + arch-fallback --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-04-06 20:55:21 +02:00
Gaurav Garg	15f786e658	[CUDA ] Write an optimized flash_attn_stream_k_fixup kernel (#21159 ) * Write an optimized flash_attn_stream_k_fixup kernel Write a specialized and more optimized kernel for cases where nblocks_stream_k is multiple of ntiles_dst. Make nblocks_stream_k to multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst * Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs * Address review comments * Address review comments * Revert variable names to original	2026-04-06 20:34:29 +02:00
Neo Zhang	f51fd36d79	sycl : handle other FA case (#21377 )	2026-04-06 13:28:00 +03:00
Yarden Tal	25eec6f327	hexagon: slight optimization for argosrt output init (#21463 )	2026-04-05 18:30:25 -07:00
Reese Levine	d006858316	ggml-webgpu: move from parameter buffer pool to single buffer with offsets (#21278 ) * Work towards removing bitcast * Move rest of existing types over * Add timeout back to wait and remove synchronous set_tensor/memset_tensor * move to unpackf16 for wider compatibility * cleanup * Remove deadlock condition in free_bufs * Start work on removing parameter buffer pools * Simplify and optimize further * simplify profile futures * Fix stride * Try using a single command buffer per batch * formatting	2026-04-03 11:40:14 -07:00
Vishal Singh	f1ac84119c	ggml-zendnn : add MUL_MAT_ID op support for MoE models (#21315 ) * ggml-zendnn : add MUL_MAT_ID op support for MoE models - Add MUL_MAT_ID op acceleration for Mixture-of-Experts models - MUL_MAT_ID op fallback to CPU backend if total experts > 32 - Point ZenDNN lib to latest bits ZenDNN-2026-WW13 * ggml-zendnn : add braces to sgemm failure condition for consistency Co-authored-by: Aaron Teo <taronaeo@gmail.com> --------- Co-authored-by: Aaron Teo <taronaeo@gmail.com>	2026-04-03 12:19:08 +03:00
Radoslav Gerganov	0c58ba3365	rpc : reuse compute graph buffers (#21299 ) Reuse the buffer for the ggml context which is used for creating the compute graph on the server side. This partially addresses a memory leak created by the CUDA backend due to using buffer addresses as cache keys. ref: #21265 ref: #20315	2026-04-03 10:28:09 +03:00
Zheyuan Chen	a1cfb64530	ggml-webgpu: add vectorized flash attention (#20709 ) * naive vectorized version * add vectorized flash attention * update vec version * remove unused path and shader * remove unused helper functions * add comments * remove pad path * ggml-webgpu: fix flash-attn vec nwg=1 path and tighten vec specialization * change back to vec4 * enable multi split * enable vec path when: - Q->ne[1] < 20 - Q->ne[0] % 32 == 0 - V->ne[0] % 4 == 0 - K->type == f16 * update flast_attn_vec_split.wgsl to reduce redundant workgroup barrier usage and use select * enable vec path for q4 and q8 * flash-attn vec nwg=1 fast path (skip tmp/reduce staging) * use packed f16 K loads in flash-attn vec split * use packed f16 K loads in flash-attn vec split on host side * tune flash-attn vec f16 VEC_NE by head dim * cleanup * cleanup * keep host side clean * cleanup host side * change back to original host wait/submit behavior * formatting * reverted param-buffer pool r ecfactor * add helper functions * ggml-webgpu: move flash-attn vec pipeline caching back into shader lib * ggml-webgpu: remove duplicate functions * ggml-webgpu: reserve flash-attn vec scratch in dst buffer allocation * ggml-webgpu: revert unrelated change * ggml-webgpu: revert deleted comment * disable uniformity check * remove unnecessary change * Update ggml/src/ggml-webgpu/wgsl-shaders/flash_attn_vec_split.wgsl * Update ggml/src/ggml-webgpu/ggml-webgpu.cpp --------- Co-authored-by: Reese Levine <reeselevine1@gmail.com>	2026-04-02 10:40:42 -07:00
Georgi Gerganov	bc07d55922	ggml : bump version to 0.9.11 (ggml/1456)	2026-04-02 10:39:00 +03:00
Neo Zhang	4888137b17	sycl : fix llama_kv_cache hang when kv_cache is huge: 5GB (#21283 )	2026-04-02 10:08:32 +03:00
Todor Boinovski	fbd441c379	hexagon : add cumsum op support (#21246 ) * hexagon : add cumsum op support * hexagon: enable dma for cumsum op * Fix line-ending --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-04-01 17:44:02 -07:00
lhez	95a6ebabb2	opencl: fix leak in Adreno q8_0 path (#21212 )	2026-04-01 12:54:58 -07:00
Johannes Gäßler	86221cf6da	CUDA: fix FA kernel selection logic (#21271 )	2026-04-01 22:28:19 +03:00
Aparna M P	8710e5f9b9	hexagon: improve RMS_NORM and DIV accuracy (#21251 ) * hexagon-rms_norm: fix RMS_NORM for non-aligned tensor sizes Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com> * hexagon-div: perform DIV in fp16 domain for lower dsp archs --------- Co-authored-by: Krishna Sridhar <srsr@qti.qualcomm.com>	2026-04-01 08:43:08 -07:00

1 2 3 4 5 ...

2277 Commits