Alberto Cabrera Pérez
669696e00d
ggml-cpu: ARM64: repack version of q8_0 (dotprod and i8mm) (#18096)
* wip: skeleton for q8_0 repack
* q8_0 repack GEMV implementations
* GEMM implementations
* Formatting
* Fixed format consistency of repack gemm and gemv declarations
* gemv and gemm generic location consistent with declarations
* Removed non-correct unused variables statements
* Cleanup, consistent style
* Missing generic fallbacks for x86 and powerpc
2025-12-17 13:39:13 +02:00
Alberto Cabrera Pérez
cd8370b408
ggml-cpu: aarm64: q4_K repack gemm and gemv implementations (dotprod only) (#17494)
* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Undo acc rename
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Q4_K arm dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix: q4_qs reinterpret from uint to int
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-27 13:25:14 +02:00
Alberto Cabrera Pérez
dbb852b549
ggml-cpu: arm64: q4_K repack gemm and gemv implementations (i8mm) (#16739)
* Enabled q4_K_8x8_q8_K path on ARM
* wip: I8mm qs multiplication, pending bias
* cpu : arm : REPACK gemm q4_K8x8 implementation
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Guard gemm with proper features, improved superblock scale and min calc
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* cpu: arm: Implemented REPACK gemv for Q4_K
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Removed completed TODO
* Fixed missing guards when selecting optimal repack type for Q4_K
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fixed macro guard for gemv
* Fixed wrong comment in GEMV
* Fixed warning for unused variable
* vdotq_s32 -> ggml_vdotq_s32
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Clang-format issues
* Apply suggestions from code review
Co-authored-by: Diego Devesa <slarengh@gmail.com>
* Removed unnecessary GGML_UNUSED
* Fixed guards in q4_k gemm and gemv (repack)
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-11-24 13:08:11 +02:00
Aaron Teo
9b26511857
ggml-cpu: implement MXFP4 SIMD for s390x (#16193)
* ggml-cpu: impl mxfp4 s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: missing s = sumf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: fix incorrect kval_mxfp4 type
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: rework mxfp4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: missing delta calc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: fix typo
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: fix typo for vec_splats
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: expand to 2 blocks per loop
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: add unroll to boost perf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: back to 1 block per loop to test perf
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "ggml-cpu: back to 1 block per loop to test perf"
This reverts commit 1fe55724e2.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: rm unroll from single block
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-26 13:27:25 +03:00
Aaron Teo
ad5c975c2d
ggml-cpu: Support Q5_0 and Q5_1 on s390x (#15486)
* ggml-cpu: initial q5_0 impl for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: updated q5_0 code for better performance
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: use optimised hsum for better performance
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: introduce q5_1 simd + refactor q5_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: fix incorrect return type vec_hsum
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: q5_0 incomplete refactor + table_b2b_0 activation
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: refactor q5_1
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: q5_1 update loop unroll to 4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: update q5_0 unroll to 4
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: update build-s390x docs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* ggml-cpu: update unused variables q5_0
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* docs: update the last update date
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-08-22 16:11:04 +08:00
Marvin Gießing
6424594c56
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385)
* Added VSX intrinsics for Power9+ systems
Signed-off-by: mgiessing <marvin.giessing@gmail.com>
* Manual unrolling for minor perf improvement
Signed-off-by: mgiessing <marvin.giessing@gmail.com>
* Update ggml/src/ggml-cpu/arch/powerpc/quants.c
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-19 11:54:31 +03:00
Georgi Gerganov
00f35d509e
ggml : repack block_iq4_nlx8 (#14904)
ggml-ci
2025-08-13 11:09:39 +03:00
Georgi Gerganov
fd1234cb46
llama : add gpt-oss (#15091)
* oai moe
* compat with new checkpoint
* add attn sink impl
* add rope scaling yarn
* logits match with latest transformers code
* wip chat template
* rm trailing space
* use ggml_scale_bias
* rm redundant is_swa_all
* convert interleaved gate_up
* graph : fix activation function to match reference (#7)
* vocab : handle o200k_harmony special tokens
* ggml : add attention sinks support (#1)
* llama : add attn sinks
* ggml : add attn sinks
* cuda : add attn sinks
* vulkan : add support for sinks in softmax
remove unnecessary return
* ggml : add fused swiglu_oai op (#11)
* ggml : add fused swiglu_oai op
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* update CUDA impl
* cont : metal impl
* add vulkan impl
* test-backend-ops : more test cases, clean up
* llama : remove unfused impl
* remove extra lines
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: slaren <slarengh@gmail.com>
* repack mxfp4 upon conversion
* clean up a bit
* enable thinking
* add quick hack to render only some special tokens
* fix bf16 conversion
* remove vocab hack
* webui ok
* support chat parsing for gpt-oss
* fix webui
* direct mapping mxfp4, FINALLY
* force using mxfp4
* properly use lazy tensor
* ggml : add mxfp4
ggml : use e8m0 conversion instead of powf
Co-authored-by: Diego Devesa <slarengh@gmail.com>
change kvalues_mxfp4 table to match e2m1 (#6)
metal : remove quantization for now (not used)
cuda : fix disabled CUDA graphs due to ffn moe bias
vulkan : add support for mxfp4
cont : add cm2 dequant
* ggml : add ggml_add_id (#13)
* ggml : add ggml_add_id
* add cuda impl
* llama : add weight support check for add_id
* perf opt
* add vulkan impl
* rename cuda files
* add metal impl
* allow in-place ggml_add_id
* llama : keep biases on CPU with --cpu-moe
* llama : fix compile error
ggml-ci
* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw
ggml-ci
* cleanup
ggml-ci
* sycl : fix supports_op for MXFP4
ggml-ci
* fix Unknown reasoning format
* ggml-cpu : fix AVX build
ggml-ci
* fix hip build
ggml-ci
* cuda : add mxfp4 dequantization support for cuBLAS
ggml-ci
* ggml-cpu : fix mxfp4 fallback definitions for some architectures
ggml-ci
* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Srihari-mcw
baad94885d
ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373)
* Initial Q2_K Block Interleaving Implementation
* Addressed review comments and clean up of the code
* Post rebase fixes
* Initial CI/CD fixes
* Update declarations in arch-fallback.h
* Changes for GEMV Q2_K in arch-fallback.h
* Enable repacking only on AVX-512 machines
* Update comments in repack.cpp
* Address q2k comments
---------
Co-authored-by: Manogna-Sree <elisetti.manognasree@multicorewareinc.com>
2025-08-01 09:20:33 +03:00
xctan
860a9e4eef
ggml-cpu : remove the weak alias trick (#14221)
2025-06-17 12:58:32 +03:00