llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	1fe00296f5	vulkan: fuse adds (#15252 ) * vulkan: fuse adds Fuse adds that have the same shape, which are common in MoE models. It will currently fuse up to 6 adds, because we assume no more than 8 descriptors per dispatch. But this could be changed. * check runtimeDescriptorArray feature * disable multi_add for Intel due to likely driver bug	2025-08-16 11:48:22 -05:00
Jeff Bolz	de2192794f	vulkan: Support mul_mat_id with f32 accumulators (#15337 ) * vulkan: Add missing bounds checking to scalar/coopmat1 mul_mat_id * vulkan: Support mul_mat_id with f32 accumulators, but they are not hooked up - There's no explicit way to request f32 precision for mul_mat_id, but there probably should be, and this gets the code in place for that. - A couple fixes to check_results. - Remove casts to fp16 in coopmat1 FA shader (found by inspection).	2025-08-16 11:18:31 +02:00
Georgi Gerganov	5edf1592fd	vulkan : fix out-of-bounds access in argmax kernel (#15342 ) ggml-ci	2025-08-15 16:16:36 +02:00
Georgi Gerganov	db3010bd23	vulkan : fix compile warnings on macos (#15340 ) ggml-ci	2025-08-15 15:28:28 +02:00
Jeff Bolz	863d341eeb	vulkan: perf_logger improvements (#15246 ) * vulkan: perf_logger improvements - Account for batch dimension in flops calculation. - Fix how "_VEC" is detected for mat_mul_id. - Fix "n" dimension for mat_mul_id (in case of broadcasting). - Include a->type in name. * use <=mul_mat_vec_max_cols rather than ==1	2025-08-14 08:38:10 -05:00
Jonathan Graehl	5cdb27e091	finetune: SGD optimizer, more CLI args (#13873 ) * examples/finetune -opt SGD (stochastic gradient descent) memory opt add unit tested GGML_OPT_OPTIMIZER_SGD to ggml - avoids allocating m, v tensors. support finetune.cpp arg -opt SGD (or sgd). (default adamw as before) llama 3.2-1b-F32 result: observed 11gb gpu ram (41 sec/epoch) when using SGD instead of 19gb (55 sec/epoch) using adamw. (wikipedia 100 lines finetune) ( using the same GPU memory, adamw can only do before OOM 512 batch/context, reaching: train: [███████▉] data=0000140/0000140 loss=0.02575±0.00099 acc=99.52±0.03% t=00:00:47 ETA=00:00:00 val: [███████▉] data=0000008/0000008 loss=4.76565±0.28810 acc=41.46±0.77% t=00:00:00 ETA=00:00:00 SGD is superior, though it converges slower, with max before OOM 1728 batch/context (esp see the better validation perf): train: [███████▉] data=0000039/0000039 loss=0.00371±0.00010 acc=99.96±0.01% t=00:00:41 ETA=00:00:00 val: [███████▉] data=0000003/0000003 loss=5.11406±0.76034 acc=48.01±0.69% t=00:00:01 ETA=00:00:00 ) note: when finetuning long enough (or w/ enough -lr), validation accuracy eventually drops ('catastrophic forgetting') -lr-half (halflife) option useful for SGD to avoid oscillation or super slow underdamped learning (makes setting -lr more forgiving). terminal -lr for now is set by lr-halvings i.e. if you want at most 1/8 the initial -lr you set -lr-halvings 3. note: objective loss not directly comparable between adamw, sgd? - check perplexity or accuracy or consider relative improvements for convergence new finetune args -wd 1e-9 to enable weight decay in sgd or adamw, and max -epochs N (default 2 as before) cache (1 - wdalpha) in 'adamw' opt struct - no noticeable perf benefit, disabled (still done for new SGD though) since opt. memory is pre-allocated, the ggml_opt_get_optimizer_params would probably be able to change between SGD and AdamW with each epoch but would need to use adamw for the first (unconfirmed - no cmdline arg to set such a policy yet) test-opt checks adamw as before and now sgd (except for a few disabled tests for sgd only; probably just needs logging values and adding alternate reference values); tolerance on the 'regression' test is broader for sgd (so we don't need many more epochs) Vulkan: Implement GGML_OP_OPT_STEP_SGD * tests: Fix OPT_STEP_SGD test-backend-ops * SGD op param store weight-decay and not 1-alphawd minor + cosmetic changes * fix vulkan sgd * try CI fix --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-14 12:03:57 +02:00
AN Long	cd6983d56d	ggml : fix field name when new ggml_backend (#14944 )	2025-08-08 14:37:22 +02:00
Jeff Bolz	c4f53563df	vulkan: support fattn sinks (#15126 )	2025-08-07 22:44:20 +02:00
Jeff Bolz	a0552c8bee	vulkan: Add env var to disable host visible vidmem (#15109 )	2025-08-07 22:07:11 +02:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091 ) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Jeff Bolz	5aa1105da2	vulkan: fix build when using glslang that does not support coopmat2 (#15062 )	2025-08-04 07:09:19 +02:00
Jeff Bolz	6c7a441161	vulkan: Use coopmat2 for conv2d (#14982 )	2025-08-03 14:23:57 +02:00
Jeff Bolz	4cb208c93c	vulkan: coopmat2 mul_mat optimizations (#14934 ) - Increase tile size for k-quants, to match non-k-quants - Choose more carefully between large and medium tiles, considering how it interacts with split_k - Allow larger/non-power of two split_k, and make the splits a multiple of 256 - Use split_k==3 when >1/2 and <=2/3 of the SMs would have been used	2025-08-02 11:21:37 +02:00
Jeff Bolz	ec0b18802c	vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015 )	2025-08-02 10:48:30 +02:00
Jeff Bolz	a9f7541ec2	vulkan: optimizations for direct convolution (#14933 ) * vulkan: optimizations for direct convolution - Empirically choose a better tile size. Reducing BS_K/BS_NPQ helps fill the GPU. The new size should be amenable to using coopmat, too. - Fix shmem bank conflicts. 16B padding should work with coopmat. - Some explicit loop unrolling. - Skip math/stores work for parts of the tile that are OOB. - Apply fastdiv opt. - Disable shuffles for NV. * Three tiles sizes for CONV_2D, and a heuristic to choose * reallow collectives for pre-Turing * make SHMEM_PAD a spec constant * fixes for intel perf - no shmem padding, placeholder shader core count * shader variants with/without unrolling * 0cc4m's fixes for AMD perf Co-authored-by: 0cc4m <picard12@live.de> --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-08-02 09:57:04 +02:00
Ruben Ortlam	e08a98826b	Vulkan: Fix minor debug mode issues (#14899 ) * vulkan: fix debug mode issues * vulkan: remove broken check_results GGML_OP_SET_ROWS support	2025-07-31 17:46:54 +02:00
Kai Pastor	73a8e5ca03	vulkan : fix 32-bit builds (ggml/1313) The pipeline member can be cast to VkPipeline. This is a VkPipeline_T* on 64 bit but a uint64_t on 32 bit. Cf. VK_DEFINE_NON_DISPATCHABLE_HANDLE documentation.	2025-07-30 17:33:11 +03:00
Erik Scholz	89d1029559	vulkan : add fp16 support for the conv_2d kernel (#14872 ) * add f16 to conv_2d testing * weaken conv2d test error threshold	2025-07-27 12:04:33 +02:00
Jeff Bolz	f1a4e72de5	vulkan: skip empty set_rows to avoid invalid API usage (#14860 )	2025-07-27 11:05:34 +02:00
Jeff Bolz	84712b6043	vulkan: fix rms_norm_mul to handle broadcasting dim0 (#14817 )	2025-07-22 17:35:21 +02:00
Ervin Áron Tasnádi	a979ca22db	ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316 ) * ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan * ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly with gemm (no need for im2col), * test-backend-ops: adds test_case_ref to check the validity/performance of ops against reference implementations having different graphs, adds tests * * Performance fixes: minimized branch divergence, uses collectives to eliminate redundant calculation, macros removed. * Kernel shared memory size check * Updates test-backend-ops to support graphs for performance measurement. * * Apple/Win32 compile errors fixed * Subgroup size used to determine tile size -> fixes llvmpipe errors. * Collectives disabled by default. * Intel support is disabled as the performance is poor. * Conv2d enabled for Intel with disabled collectives, disabled for Apple * test-backend-ops modifications are reverted * Trailing spaces and missing override fixed. * Triggering pipeline relaunch. * Code formatted with .clang-format.	2025-07-19 21:59:08 +02:00
Peter0x44	d4b91ea7b2	vulkan: Add logging for bf16 features to ggml_vk_print_gpu_info (#13274 ) (#14707 )	2025-07-19 17:58:03 +02:00
Jeff Bolz	ba1ceb3456	vulkan: fix noncontig check for mat_mul_id splitting (#14683 ) * vulkan: fix noncontig check for mat_mul_id splitting Remove supports_op check for > 4096 (splitting fixes this) * vulkan: fix batched matmul dequant for Q*_K	2025-07-15 21:51:09 +02:00
Jeff Bolz	10a0351a97	vulkan: add RTE variants for glu/add/sub/mul/div (#14653 )	2025-07-15 21:32:11 +02:00
Georgi Gerganov	3120413ccd	vulkan : remove unused vars (#0 ) ggml-ci	2025-07-12 14:25:44 +03:00
Acly	74bb294591	vulkan : implement bilinear interpolation (ggml/1291) ggml-ci	2025-07-12 14:25:44 +03:00
Acly	3e303b1107	vulkan : implement ggml_roll (ggml/1290) ggml-ci	2025-07-12 14:25:44 +03:00
Jeff Bolz	b3ad3a0191	vulkan: support SET_ROWS (#14587 ) * vulkan: support SET_ROWS Add variants of the copy_to_quant shader that do the SET_ROWS operation. Change these shaders to spread the work across the workgroup. The memory access pattern is probably not great (one thread per quant block), but should be fine for now. * vulkan: optimize set_rows Larger workgroups for non-quant types. Set "norepeat" (there is manual repeat logic). Use fastmod.	2025-07-12 12:12:26 +02:00
Jeff Bolz	98197e5c98	vulkan: optimizations for deepseek prompt processing (#14555 ) * vulkan: allow unclamped loads in coopmat2 mul_mat_id shader * vulkan: increase coopmat2 mul_mat_id tile size * vulkan: optimize mat_mul_id row_ids search to batch loads, and port to coopmat1 path * vulkan: use smaller FA row size when head size is large. applies to both scalar and CM2 paths (CM1 isn't used due to shared memory limits)	2025-07-12 11:51:58 +02:00
Xuan-Son Nguyen	98bab638fb	ggml : add ggml_scale_bias (#14417 ) * ggml : add ggml_scale_bias * ggml_vec_mad1_f32 * add more simd * add CUDA * sycl * vulkan * cann (placeholder) * opencl * will this fix cpu? * fix cuda * suggestions from coderabbit * fix cann compile error * vDSP_vsmsa * rm __ARM_FEATURE_SVE * use memcpy for op params * make code looks more consistent * use scalar for __ARM_FEATURE_SVE * add x param to ggml_vec_mad1_f32	2025-07-09 18:16:12 +02:00
Jeff Bolz	6efcd65945	vulkan: optimize flash attention split_k_reduce (#14554 ) * vulkan: allow FA split_k with smaller KV values * vulkan: spread split_k_reduce work across more threads k_num can get rather large. Use the whole workgroup to reduce the M/L values. Launch a thread for each element in the HSV dimension of the output. Helps a lot for large HSV (like deepseek).	2025-07-08 20:11:42 +02:00
Jeff Bolz	e592be1575	vulkan: fix rms_norm+mul fusion (#14545 ) The fused operation was grabbing the epsilon value from the wrong place. Add an env var to disable fusion. Add some missing checks for supported shapes/types. Handle fused rms_norm+mul in check_results.	2025-07-06 10:08:16 +02:00
Jeff Bolz	a0374a67e2	vulkan: Handle updated FA dim2/3 definition (#14518 ) * vulkan: Handle updated FA dim2/3 definition Pack mask boolean and n_head_log2 into a single dword to keep the push constant block under the 128B limit. * handle null mask for gqa * allow gqa with dim3>1	2025-07-05 09:26:04 +02:00
Sigbjørn Skjæret	28657a8229	ggml : implement GEGLU_ERF and GEGLU_QUICK ops (#14445 )	2025-07-03 23:07:22 +02:00
Jeff Bolz	2b72bedec1	vulkan: support mixed/deepseekR1 FA head sizes (#14509 ) * vulkan: better parameterize FA by head sizes * vulkan: support mixed/deepseekR1 FA head sizes	2025-07-03 20:21:14 +02:00
Georgi Gerganov	a70c8a0c4b	kv-cache : use ggml_set_rows (#14285 ) * kv-cache : use ggml_set_rows ggml-ci * graph : separate k and v indices ggml-ci * cont : remove redundant ifs ggml-ci * kv-cache : improve find_slot impl * kv-cache : bounds-check when accessing slot_info indices * kv-cache : add comments ggml-ci * ggml : add TODOs for adding GGML_OP_SET_ROWS support in the backends ggml-ci	2025-07-03 10:53:35 +03:00
Georgi Gerganov	9067487c44	ggml : fix FA mask dim 2 and 3 (#14505 ) * ggml : fix FA mask dim 2 and 3 ggml-ci * backends : unsupport batched FA in CUDA and Vulkan ggml-ci * vulkan : disable FA for mask->ne[2] != 1	2025-07-03 10:46:57 +03:00
Jeff Bolz	8875523eb3	vulkan: support softmax/FA batch and broadcast (#14449 )	2025-07-02 15:48:33 +03:00
Georgi Gerganov	ec68e84c32	ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435 ) ggml-ci	2025-07-02 15:48:33 +03:00
Jeff Bolz	6a746cf9c4	vulkan: Split large mul_mat_id to fit in shared memory (#14451 )	2025-07-01 10:43:08 +02:00
Sigbjørn Skjæret	eff5e45443	add GELU_ERF (#14455 )	2025-07-01 10:14:21 +02:00
Sigbjørn Skjæret	a0535ffa0d	ggml : implement REGLU/GEGLU/SWIGLU ops (#14158 ) * implement unary REGLU/GEGLU/SWIGLU cpu ops * relax constraints * duplicate shape of source * fix ggml_vec_geglu_f16 * special case gated ops * implement unary REGLU/GEGLU/SWIGLU cuda ops * tighten constraints again * refactor into GGML_GLU_OP * metal : add glu kernels ggml-ci * add CUDA_GLU_BLOCK_SIZE [no ci] * more constraints and use 64bit ints ggml-ci * 64bit multiplication [no ci] * implement swapped variants (cpu/cuda) * update comment [no ci] ggml-ci * Vulkan: Add GLU ops and shaders * SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate * ggml : implement GLU for split up/gate (#14181) * implement GLU for split up/gate * add tests for ggml_glu_split * Vulkan: Implement glu_split logic and shader support * add split to logging [no ci] * SYCL: refactor element_size ops and add split up and gate support to gated kernels * SYCL: switch GEGLU to use tanh approximation --------- Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> * GGML: increase OP count in assertion * Refactor: Optimize SYCL element-wise operations with unary function inlining This commit refactors the SYCL element-wise operations to improve performance by: - Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead. - Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic. - Replacing direct kernel calls with calls to these inlined functions. - Using `__dpct_inline__` to encourage compiler inlining. - Minor code cleanup and consistency improvements. The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices. * vulkan: Increase workgroup size for GLU, for performance (#14345) * vulkan: Increase workgroup size for GLU, for performance * vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup * merge fix * metal : add support for split and swap ggml-ci --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: 0cc4m <picard12@live.de> Co-authored-by: Akarshan <akarshan@menlo.ai> Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-06-29 11:04:10 +02:00
Jeff Bolz	bd9c981d72	vulkan: Add fusion support for RMS_NORM+MUL (#14366 ) * vulkan: Add fusion support for RMS_NORM+MUL - Add a use_count to ggml_tensor, so we can detect if an output is used more than once. - Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor. - Add detection logic and basic fusion logic in ggml-vulkan. - Add some testing support for fusion. Rather than computing one node at a time, allow for computing the whole graph and just testing one node's results. Add rms_norm_mul tests and enable a llama test. * extract some common fusion logic * fix -Winconsistent-missing-override * move ggml_can_fuse to a common function * build fix * C and C++ versions of can_fuse * move use count to the graph to avoid data races and double increments when used in multiple threads * use hash table lookup to find node index * change use_counts to be indexed by hash table slot * minimize hash lookups style fixes * last node doesn't need single use. fix type. handle mul operands being swapped. * remove redundant parameter --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-06-29 09:43:36 +02:00
Jeff Bolz	63a7bb3c7e	vulkan: handle noncontig in the final case of ggml_vk_get_cpy_pipeline (#14378 )	2025-06-28 17:36:40 +02:00
Jeff Bolz	00d5282c7f	vulkan: lock accesses of pinned_memory vector (#14333 )	2025-06-28 17:17:09 +02:00
Markus Tavenrath	bb16041cae	Add support for VK_EXT_debug_utils to add labels to Vulkan objects. (#13792 ) * Add support for VK_EXT_debug_utils to add labels to Vulkan objects. In step 1 compute pipelines are getting labeled. * remove #ifdef for debug utils and add queue marker.	2025-06-21 08:17:12 +02:00
0cc4m	10bb545c5b	Vulkan: Set device max size for host memory to avoid OOM warning and fallback to CPU buffer (#14249 )	2025-06-19 09:15:42 +02:00
Jeff Bolz	c89c2d1ab9	vulkan: mutex around vkQueueSubmit (#14127 ) This fixes the remaining crash in test-thread-safety on my system.	2025-06-16 08:21:08 +02:00
Jeff Bolz	bd248d4dc7	vulkan: Better thread-safety for command pools/buffers (#14116 ) This change moves the command pool/buffer tracking into a vk_command_pool structure. There are two instances per context (for compute+transfer) and two instances per device for operations that don't go through a context. This should prevent separate contexts from stomping on each other.	2025-06-11 09:48:52 -05:00
Jeff Bolz	1f7d50b293	vulkan: Track descriptor pools/sets per-context (#14109 ) Use the same descriptor set layout for all pipelines (MAX_PARAMETER_COUNT == 8) and move it to the vk_device. Move all the descriptor pool and set tracking to the context - none of it is specific to pipelines anymore. It has a single vector of pools and vector of sets, and a single counter to track requests and a single counter to track use.	2025-06-11 07:19:25 +02:00
0cc4m	97340b4c99	Vulkan: Don't default to CPU device (like llvmpipe), even if no other device is available, to allow fallback to CPU backend (#14099 )	2025-06-10 13:01:33 +01:00
Masato Nakasaka	669c13e0f6	vulkan: Enable VK_KHR_cooperative_matrix extension for Intel Xe2 GPUs (#14001 ) * allowing B580 and U9-288V * experimenting code to detect Xe2 * allowing coopmat only for Xe2 GPUs * fixed comment wording * fixed comment wording * removed unnecessary driver check	2025-06-05 16:00:29 +02:00
Jeff Bolz	5a8ae3053c	vulkan: automatically deduce size of push constants (#13936 )	2025-06-05 07:17:58 +02:00
Ervin Áron Tasnádi	0d3984424f	ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813 ) * * ggml-vulkan: adds op CONV_TRANSPOSE_1D * test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D * Missing barrier added to shader. Number of additional tests reduced to 108. * * Fixes typo in variable name. * Removes extra whitespaces. * Adds int64->int32 casts to prevent possible warnings. * Problem size reduced in tests to pass tests with llvmpipe. * supports_op condition moved from unintended position	2025-06-04 22:02:00 +02:00
Jeff Bolz	7e00e60ef8	vulkan: fix warnings in perf logger querypool code (#13937 )	2025-06-03 20:30:22 +02:00
Kai Pastor	108009f5c7	vulkan : Remove unexpected ; (ggml/1253)	2025-06-01 13:43:57 +03:00
Jeff Bolz	bef8176387	vulkan: use timestamp queries for GGML_VULKAN_PERF (#13817 ) Also change it to be controlled by an env var rather than cmake flag	2025-05-27 18:39:07 +02:00
Jeff Bolz	fef693dc6b	vulkan: mark IM2COL as supporting non-contig (#13783 )	2025-05-26 06:02:07 +02:00
Jeff Bolz	1dcd01960c	vulkan: support CPY from any type to itself (#13695 ) Reuse the f16/f32 copy shaders, and just scale the number of elements according to the type size.	2025-05-23 06:45:02 +02:00
Jeff Bolz	c10ed6cbcc	vulkan: Disable coopmat/coopmat2/bfloat extensions if glslc doesn't support it (#13696 )	2025-05-23 06:33:45 +02:00
Judd	a127ff1780	use LOG_WARN to replace `std::cerr` (#13657 )	2025-05-23 06:33:08 +02:00
Eve	fb1cab201c	vulkan: fix warnings (#13626 ) * small fixes * remove ifdef	2025-05-20 21:35:16 +00:00
0cc4m	8960efd0a6	Vulkan: Add f32 accumulator support to quantized mul mat to fix GLM4 32B incoherence (#13607 )	2025-05-19 17:54:08 +02:00
Jeff Bolz	4f41ee11d6	vulkan: use scalar FA rather than coopmat2 when N==1 (#13554 )	2025-05-17 08:35:47 +02:00
Jeff Bolz	24e86cae72	vulkan: KHR_coopmat flash attention (#13506 ) This shader uses coopmat1 to do the QK^T multiply. The PV multiply is more difficult for various reasons so I haven't done it. Performance for this shader is around 2.5x better than for the scalar shader when doing prompt processing. Some of the benefit may be from other optimizations like staging through shared memory, or splitting by rows.	2025-05-14 11:55:26 +02:00
Jeff Bolz	dc1d2adfc0	vulkan: scalar flash attention implementation (#13324 ) * vulkan: scalar flash attention implementation * vulkan: always use fp32 for scalar flash attention * vulkan: use vector loads in scalar flash attention shader * vulkan: remove PV matrix, helps with register usage * vulkan: reduce register usage in scalar FA, but perf may be slightly worse * vulkan: load each Q value once. optimize O reduction. more tuning * vulkan: support q4_0/q8_0 KV in scalar FA * CI: increase timeout to accommodate newly-supported tests * vulkan: for scalar FA, select between 1 and 8 rows * vulkan: avoid using Float16 capability in scalar FA	2025-05-10 08:07:07 +02:00
Jeff Bolz	02115dcd9a	vulkan: Allow up to 4096 elements for mul_mat_id row_ids (#13326 ) This assert fired running Qwen_Qwen3-30B-A3B-Q2_K.gguf: GGML_ASSERT(nei0 * nei1 <= 3072); The tensor is 8 x 512. Increase this array size to accommodate.	2025-05-09 09:23:41 +02:00
Jeff Bolz	8ae5ebcf85	vulkan: Additional type support for unary, binary, and copy (#13266 ) Support f16->f32 copy. Support f16->f16 and f32->f32 unary ops. Support all combinations of f16/f32 for src0/src1/dst for add/sub/mul/div.	2025-05-04 07:17:16 +02:00
Georgi Gerganov	b34443923c	sync : ggml (#13268 ) * vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204) * vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW) * review: remove src_x/y < 0 checks; add performance tests * sync : ggml ggml-ci * vulkan : fix lint (#0) --------- Co-authored-by: Acly <aclysia@gmail.com>	2025-05-02 20:54:30 +03:00
Jeff Bolz	79f26e9e12	vulkan: Add bfloat16 support (#12554 ) * vulkan: Add bfloat16 support This adds bfloat16 matrix multiply support based on VK_KHR_shader_bfloat16. The extension is required for coopmat multiply support, but matrix-vector multiply trivially promotes bf16 to fp32 and doesn't require the extension. The copy/get_rows shaders also don't require the extension. It's probably possible to fall back to non-coopmat and promote to fp32 when the extension isn't supported, but this change doesn't do that. The coopmat support also requires a glslc that supports the extension, which currently requires a custom build. * vulkan: Support bf16 tensors without the bf16 extension or coopmat support Compile a variant of the scalar mul_mm shader that will promote the bf16 values to float, and use that when either the bf16 extension or the coopmat extensions aren't available. * vulkan: bfloat16 fixes (really works without bfloat16 support now) * vulkan: fix spirv-val failure and reenable -O	2025-05-01 20:49:39 +02:00
Jeff Bolz	fc727bcdd5	vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader (#13191 ) * vulkan: Handle src1 batch dimension in non-contiguous mat-vec-mul shader	2025-05-01 20:19:31 +02:00
Eve	b3b6d862cf	vulkan: matmul gcn tuning (#13016 ) * tune matmul for gcn * this one is more power efficient * Update ggml/src/ggml-vulkan/ggml-vulkan.cpp Co-authored-by: 0cc4m <picard12@live.de> * disable this tune for the proprietary driver --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-04-24 09:18:33 +02:00
Jeff Bolz	66168204be	vulkan: support noncontiguous rms_norm (#13031 )	2025-04-20 10:50:02 +02:00
Georgi Gerganov	2f74c354c0	graph : make FA compatible with MLA + add initial Metal kernels (#12953 ) * graph : make mla compatible with FA * metal : add exp FA kernels for DeepSeek models ggml-ci * llama : minor naming updates ggml-ci * ggml : disable FA for DS head sizes * tests : add FA tests for MLA shapes ggml-ci	2025-04-17 18:16:36 +03:00
Jeff Bolz	015022bb53	vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931 ) The grouped query attention optimization doesn't require a power of two ratio, the only thing relying on it was the modulo operation written as bitwise &. split_k need not depend on gqa_ratio - enable it any time there's only one workgroup in the X dimension. The shader gets the split index from the x coord, and multiple workgroups in the X dimension (pre-split) indicates a larger FA operation that wouldn't need splitting.	2025-04-16 20:37:25 +02:00
Diego Devesa	fe92821ea9	ggml : add bilinear upscale support (ggml/1185)	2025-04-11 00:17:47 +03:00
Jeff Bolz	0090950f67	vulkan: In coopmat2 mmq, load q4_k/q5_k scales through shared memory (#12833 ) q4_k and q5_k had a lot of redundant global loads where the same 16B of scale information is repeatedly loaded and decoded during each loop iteration. This change restructures the loops to more explicitly iterate over whole blocks in the outer loop (with unrolled inner loop) and to copy/decode the scale data into shared memory once at the start of each outer loop. The copy is pipelined so the scale load from global memory is relatively cheap. This improves q4_k/q5_k model prompt processing performance by around 5-7%. I briefly tried applying this to q6_k and q4_0, and it didn't help for q6_k and hurt for q4_0. The big "else" path in mul_mm_cm2.comp that had all the clamped/unclamped variants isn't used as often as it originally was (e.g. due to the padded_N change), so I trimmed it down to offset some of the new complexity of the semi-manual loop unrolling.	2025-04-09 07:25:08 +02:00
Jeff Bolz	80b717d493	vulkan: Use unclamped loads for flash attention mask (#12720 ) nem1 must be a multiple of GGML_KQ_MASK_PAD, and GGML_KQ_MASK_PAD is a multiple of the number of rows in the matrix. The KV dim is a multiple of the number of columns for the aligned shader.	2025-04-06 10:47:13 +02:00
0cc4m	6bf28f0111	Vulkan: Tune Vulkan mmq int dot shader for performance (#12767 )	2025-04-05 18:04:03 +02:00
Jeff Bolz	74d4f5b041	vulkan: Hybrid waitForFences/getFenceStatus to reduce fence latency (#12630 ) There seems to be a bubble waking up from waitForFences, which costs a few percent performance and also increased variance in performance. This change inserts an "almost_ready" fence when the graph is about 80% complete and we waitForFences for the almost_ready fence and then spin (with _mm_pauses) waiting for the final fence to be signaled.	2025-04-04 07:54:35 +02:00
Jeff Bolz	f01bd02376	vulkan: Implement split_k for coopmat2 flash attention. (#12627 ) When using group query attention, we have one workgroup per KV batch and this can be very few workgroups (e.g. just 8 in some models). Enable split_k to spread the work across SMs. This helps a lot when the KV cache is large.	2025-04-02 14:25:08 -05:00
Jeff Bolz	be0a0f8cae	vulkan: Implement grouped query attention in the coopmat2 FA shader (#12559 ) When adjacent batches of Q share the same batches of K/V, batch them into the same workgroup. For example, when: dst(128,32,1,1) = FA(q(128,1,32,1), k(128,16640,8,1), v(128,16640,8,1)) previously we would run 32 workgroups computing 1 result each, now we will run 8 workgroups computing 4 results each. This doesn't directly translate to better performance (at least when you have >=32 SMs), but in a subsequent change I'll enable split_k which will scale much better with 4x fewer workgroups.	2025-04-02 19:40:32 +02:00
Wagner Bruna	2bb3597e42	vulkan: fix build when glslc doesn't support coopmat (#12683 )	2025-04-01 11:38:07 +02:00
0cc4m	a8a1f33567	Vulkan: Add DP4A MMQ and Q8_1 quantization shader (#12135 ) * Vulkan: Add DP4A MMQ and Q8_1 quantization shader * Add q4_0 x q8_1 matrix matrix multiplication support * Vulkan: Add int8 coopmat MMQ support * Vulkan: Add q4_1, q5_0 and q5_1 quants, improve integer dot code * Add GL_EXT_integer_dot_product check * Remove ggml changes, fix mmq pipeline picker * Remove ggml changes, restore Intel coopmat behaviour * Fix glsl compile attempt when integer vec dot is not supported * Remove redundant code, use non-saturating integer dot, enable all matmul sizes for mmq * Remove redundant comment * Fix integer dot check * Fix compile issue with unsupported int dot glslc * Update Windows build Vulkan SDK version	2025-03-31 14:37:01 +02:00
Georgi Gerganov	b4ae50810e	metal : improve FA + improve MoE (#12612 ) * ggml : FA with different K, V head sizes (CPU) ggml-ci * metal : add FA with HS=192 * metal : extend FA to support different K and V head sizes ggml-ci * metal : add FA vector kernels for heads K 192 and V 128 ggml-ci * ggml : restrict op on other backends to equal head sizes ggml-ci * metal : optimize FA-vec kernel ggml-ci * metal : FA remove mq registers * metal : improve MoE mul_mat_id condition ggml-ci * metal : fix comments + remove unnecessary addition ggml-ci * metal : avoid too much shared memory usage with mul_mat_id ggml-ci	2025-03-28 20:21:59 +02:00
Jeff Bolz	eddfb43850	vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505 ) * tests: add mul_mat perf/functional tests for p021/nc vulkan shaders * vulkan: Optimize mul_mat_vec p021 and nc shaders. These shaders are used in attention calculations, and when the KV cache grows large they start to dominate the run time. For the nc shader (which is called with large 'k' dimension), use unrolling and vector loads. For the p021 shader (which is called with large 'm' and small 'k' dimensions), take advantage of grouped query attention to reuse loads from the A matrix for the whole group, and reduce the number of workgroups (too much overhead from tiny dispatches). Using subgroupAdd in the p021 shader also helps, use that conditionally.	2025-03-22 09:40:11 +01:00
stduhpf	4375415b4a	Vulkan: RTE rounding for cpy to quant (#12480 ) * Vulkan: RTE rounding for cpy to quant Co-Authored-By: Jeff Bolz <jbolz@nvidia.com> * remove trailing whitespace * avoid duplicating pipeline_cpy_f32_quant * fix copypasting issue * remove duplicated code --------- Co-authored-by: Jeff Bolz <jbolz@nvidia.com>	2025-03-21 20:34:50 +01:00
Jeff Bolz	c446b2edd2	vulkan: Submit once enough matmul work has been recorded (#12406 ) I've been seeing significantly worse performance for tg with flash attention enabled vs disabled, and it seems to be related to the submit heuristic. Change the heuristic to check how many bytes worth of weight matrix are used and flush every 100MB, and ramp up after the first few submits. This seems to resolve the issue, and also increases perf for non-FA a bit.	2025-03-19 08:26:26 +01:00
0cc4m	fd123cfead	Vulkan: Default to 1GB allocations instead of 4GB to avoid fragmentation and driver issues (#12434 )	2025-03-18 07:21:40 +01:00
Molly Sophia	7dfad387e3	llama: Add support for RWKV v7 architecture (#12412 ) * ggml: Add op l2_norm Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: Add op rwkv_wkv7 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: Add support for RWKV7 and ARWKV7 models Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix inference with RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: add more (a)rwkv7 variants in size Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Apply code-format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * fix MUSA build Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * llama: fix shape error with rwkv using llama-parallel Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2025-03-18 07:27:50 +08:00
Jeff Bolz	484a8ab513	vulkan: Add N/2 and N/4 optimized paths in coopmat2 shader (#12312 )	2025-03-17 09:26:18 -05:00
Daniele	cf2270e4d3	vulkan: subgroup size tuning (#12087 ) * vulkan: subgroup size test * Vulkan: Add device architecture enum and logic to recognize AMD generations * vulkan: use new architecture logic to specify subgroup size * Initial vulkan subgroup size tuning for RDNA3 * vulkan: commonize RDNA subgroup tuning * vulkan: override subgroup size if required_subgroup_size = 0 * vulkan: disable warp 32 for RDNA3 * vulkan: fine tuned RDNA1 subgroup sizes * vulkan: adjusted subgroup size map * vulkan: fixed RDNA2 subgroup map --------- Co-authored-by: 0cc4m <picard12@live.de>	2025-03-17 12:42:33 +01:00
Jeff Bolz	891c63956d	vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking (#12273 ) * vulkan: Pad N dimension of B matrix for coopmat2 perf, to avoid bounds checking	2025-03-17 10:41:59 +01:00
Jeff Bolz	2f21123c1d	vulkan: Adjust coopmat2 tile sizes and selection heuristic (#12258 )	2025-03-17 10:35:00 +01:00
cmdr2	0cbee131ad	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) ggml-ci	2025-03-03 18:18:11 +02:00
William Tambellini	70680c48e5	ggml : upgrade init_tensor API to return a ggml_status (#11854 ) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return an OOM status), as agreed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-02-28 14:41:47 +01:00
Rémy O	438a83926a	vulkan: add specific MMV kernels for IQ2 and IQ3 quants + optimizations (#11595 ) * vulkan: implement specialized MMV kernels for IQ2 quantizations * vulkan: add MMV kernels for IQ3 quants * vulkan: Increase MMV batch size and unroll IQ LUT setup * vulkan: fix init_iq_shmem for WG sizes larger than tables * vulkan: common batch size for all I-quants	2025-02-28 09:42:52 +01:00
Jeff Bolz	a82c9e7c23	vulkan: fix assertion when qy_needs_dequant (#12068 ) Looks like a copy/paste bug from qx_needs_dequant.	2025-02-25 16:30:21 +01:00
Judd	c132239bfb	add OP sigmoid (#12056 ) Co-authored-by: Judd <foldl@boxvest.com>	2025-02-25 12:32:20 +01:00
Rémy O	61d4f39dfe	vulkan: implement more backpropagation operators (#11914 ) * vulkan: implement GGML_OP_ROPE_BACK * vulkan: implement GGML_OP_RMS_NORM_BACK * vulkan: implement GGML_OP_SILU_BACK * vulkan: implement GGML_OP_SOFTMAX_BACK	2025-02-25 12:04:45 +01:00

203 Commits