llama.cpp

Commit Graph

Author	SHA1	Message	Date
Alberto Cabrera Pérez	669696e00d	ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (#18096) * wip: skeleton for q8_0 repack * q8_0 repack GEMV implementations * GEMM implementations * Formatting * Fixed format consistency of repack gemm and gemv declarations * gemv and gemm generic location consistent with declarations * Removed non-correct unused variables statements * Cleanup, consistent style * Missing generic fallbacks for x86 and powerpc	2025-12-17 13:39:13 +02:00
Alberto Cabrera Pérez	cd8370b408	ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (#17494) * Enabled q4_K_4x8 path * Fixed generic Q4_K 8x4 implementation * wip: dotprod gemm * Working arm q4_K dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Undo acc rename Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Q4_K arm dotprod gemm Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fix: q4_qs reinterpret from uint to int Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed comments * Fixed macro guards * Fixed unused vars in generic implementation * Fixed unused vars in 8x4 repack * Fixed unused vars in generic implementation, unneeded comment * Missing arch fallback for x86 * minor : style --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-11-27 13:25:14 +02:00
Alberto Cabrera Pérez	dbb852b549	ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (#16739) * Enabled q4_K_8x8_q8_K path on ARM * wip: I8mm qs multiplication, pending bias * cpu : arm : REPACK gemm q4_K8x8 implementation Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Guard gemm with proper features, improved superblock scale and min calc Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * cpu: arm: Implemented REPACK gemv for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Removed completed TODO * Fixed missing guards when selecting optimal repack type for Q4_K Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Fixed macro guard for gemv * Fixed wrong comment in GEMV * Fixed warning for unused variable * vdotq_s32 -> ggml_vdotq_s32 Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> * Clang-format issues * Apply suggestions from code review Co-authored-by: Diego Devesa <slarengh@gmail.com> * Removed unnecessary GGML_UNUSED * Fixed guards in q4_k gemm and gemv (repack) --------- Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai> Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-11-24 13:08:11 +02:00
Aaron Teo	9b26511857	ggml-cpu: implement MXFP4 SIMD for s390x (#16193) * ggml-cpu: impl mxfp4 s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing s = sumf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect kval_mxfp4 type Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rework mxfp4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: missing delta calc Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix typo for vec_splats Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: expand to 2 blocks per loop Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: add unroll to boost perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: back to 1 block per loop to test perf Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "ggml-cpu: back to 1 block per loop to test perf" This reverts commit `1fe55724e2`. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: rm unroll from single block Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-26 13:27:25 +03:00
Aaron Teo	ad5c975c2d	ggml-cpu: Support Q5_0 and Q5_1 on s390x (#15486) * ggml-cpu: initial q5_0 impl for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: updated q5_0 code for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: use optimised hsum for better performance Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: introduce q5_1 simd + refactor q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix incorrect return type vec_hsum Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: refactor q5_1 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: q5_1 update loop unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update q5_0 unroll to 4 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update build-s390x docs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: update unused variables q5_0 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * docs: update the last update date Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-08-22 16:11:04 +08:00
Marvin Gießing	6424594c56	ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385) * Added VSX intrinsics for Power9+ systems Signed-off-by: mgiessing <marvin.giessing@gmail.com> * Manual unrolling for minor perf improvement Signed-off-by: mgiessing <marvin.giessing@gmail.com> * Update ggml/src/ggml-cpu/arch/powerpc/quants.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: mgiessing <marvin.giessing@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-08-19 11:54:31 +03:00
Georgi Gerganov	00f35d509e	ggml : repack block_iq4_nlx8 (#14904) ggml-ci	2025-08-13 11:09:39 +03:00
Georgi Gerganov	fd1234cb46	llama : add gpt-oss (#15091) * oai moe * compat with new checkpoint * add attn sink impl * add rope scaling yarn * logits match with latest transformers code * wip chat template * rm trailing space * use ggml_scale_bias * rm redundant is_swa_all * convert interleaved gate_up * graph : fix activation function to match reference (#7) * vocab : handle o200k_harmony special tokens * ggml : add attention sinks support (#1) * llama : add attn sinks * ggml : add attn sinks * cuda : add attn sinks * vulkan : add support for sinks in softmax remove unnecessary return * ggml : add fused swiglu_oai op (#11) * ggml : add fused swiglu_oai op * Update ggml/src/ggml-cpu/ops.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * update CUDA impl * cont : metal impl * add vulkan impl * test-backend-ops : more test cases, clean up * llama : remove unfused impl * remove extra lines --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: slaren <slarengh@gmail.com> * repack mxfp4 upon conversion * clean up a bit * enable thinking * add quick hack to render only some special tokens * fix bf16 conversion * remove vocab hack * webui ok * support chat parsing for gpt-oss * fix webui * direct mapping mxfp4, FINALLY * force using mxfp4 * properly use lazy tensor * ggml : add mxfp4 ggml : use e8m0 conversion instead of powf Co-authored-by: Diego Devesa <slarengh@gmail.com> change kvalues_mxfp4 table to match e2m1 (#6) metal : remove quantization for now (not used) cuda : fix disabled CUDA graphs due to ffn moe bias vulkan : add support for mxfp4 cont : add cm2 dequant * ggml : add ggml_add_id (#13) * ggml : add ggml_add_id * add cuda impl * llama : add weight support check for add_id * perf opt * add vulkan impl * rename cuda files * add metal impl * allow in-place ggml_add_id * llama : keep biases on CPU with --cpu-moe * llama : fix compile error ggml-ci * cuda : add fallback for __nv_cvt_e8m0_to_bf16raw ggml-ci * cleanup ggml-ci * sycl : fix supports_op for MXFP4 ggml-ci * fix Unknown reasoning format * ggml-cpu : fix AVX build ggml-ci * fix hip build ggml-ci * cuda : add mxfp4 dequantization support for cuBLAS ggml-ci * ggml-cpu : fix mxfp4 fallback definitions for some architectures ggml-ci * cuda : fix version required for __nv_cvt_e8m0_to_bf16raw --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: slaren <slarengh@gmail.com>	2025-08-05 22:10:36 +03:00
Srihari-mcw	baad94885d	ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373) * Initial Q2_K Block Interleaving Implementation * Addressed review comments and clean up of the code * Post rebase fixes * Initial CI/CD fixes * Update declarations in arch-fallback.h * Changes for GEMV Q2_K in arch-fallback.h * Enable repacking only on AVX-512 machines * Update comments in repack.cpp * Address q2k comments --------- Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>	2025-08-01 09:20:33 +03:00
xctan	860a9e4eef	ggml-cpu : remove the weak alias trick (#14221)	2025-06-17 12:58:32 +03:00

10 Commits