llama.cpp/ggml/src/ggml-cpu
Georgi Gerganov · commit fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn
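
For context on the YaRN item above: YaRN rescales rotary-embedding frequencies so a model trained at one context length can run at a longer one, mixing interpolated and extrapolated frequencies per dimension. A hedged sketch of wiring such scaling through ggml's extended rope call; the tensor names and all numeric values are illustrative assumptions, not gpt-oss's actual configuration:

```c
// Hedged sketch: YaRN-scaled rope via ggml_rope_ext. ctx, q (query tensor),
// inp_pos (position ids) and n_rot are assumed to exist; every numeric
// value below is an illustrative placeholder.
struct ggml_tensor * q_rot = ggml_rope_ext(
    ctx, q, inp_pos, NULL,
    n_rot, GGML_ROPE_TYPE_NEOX,
    4096,        // n_ctx_orig: training context the scaling extends from
    150000.0f,   // freq_base (rope theta)
    1.0f/32.0f,  // freq_scale: < 1 compresses positions (extension factor 32)
    1.0f,        // ext_factor: turns on the YaRN interpolation mix
    1.0f,        // attn_factor: YaRN attention magnitude correction
    32.0f,       // beta_fast: high-frequency correction ramp bound
    1.0f);       // beta_slow: low-frequency correction ramp bound
```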

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return
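
For context on the sink items above: an attention sink is a learned per-head logit that joins the softmax normalization without contributing a value vector, letting probability mass drain away from the real positions. A minimal scalar sketch of the idea, with illustrative names, not the ggml kernel:

```c
#include <math.h>

// Minimal sketch: softmax over one row of attention scores with a learned
// per-head sink logit. The sink joins the max/normalization but emits no
// output weight, so the visible probabilities sum to less than 1.
static void softmax_with_sink(float * scores, int n, float sink) {
    float max_val = sink;
    for (int i = 0; i < n; i++) {
        if (scores[i] > max_val) max_val = scores[i];
    }
    float sum = expf(sink - max_val); // the sink's share of the denominator
    for (int i = 0; i < n; i++) {
        scores[i] = expf(scores[i] - max_val);
        sum += scores[i];
    }
    for (int i = 0; i < n; i++) {
        scores[i] /= sum;
    }
}
```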

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>
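
For context on the fused op above: gpt-oss's SwiGLU variant clamps both branches, gates with sigmoid(alpha * gate), and adds 1 to the linear branch; fusing this avoids materializing the intermediates. A scalar sketch of the math, under the assumption that alpha = 1.702 and limit = 7.0 match the gpt-oss reference; names are illustrative, not the ggml API:

```c
#include <math.h>

// Scalar sketch of the gpt-oss SwiGLU variant that the fused op computes.
// gate/up are the two halves of the interleaved gate_up projection.
static float swiglu_oai_scalar(float gate, float up, float alpha, float limit) {
    gate = fminf(gate, limit);              // gate is clamped from above only
    up   = fmaxf(fminf(up, limit), -limit); // up is clamped to [-limit, limit]
    const float glu = gate / (1.0f + expf(-alpha * gate)); // gate * sigmoid(alpha*gate)
    return glu * (up + 1.0f);               // +1 bias on the linear branch
}
```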

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant
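
For context on the MXFP4 items above: MXFP4 is a microscaling format in which a block of 32 FP4 (E2M1) values shares a single E8M0 exponent byte, so every scale is an exact power of two and can be decoded with an exponent shift rather than powf. A hedged dequantization sketch; the block layout here is an assumption for illustration, not ggml's exact struct:

```c
#include <math.h>
#include <stdint.h>

// Positive and negative E2M1 values, indexed by the 4-bit code.
static const float kvalues_e2m1[16] = {
    0.0f,  0.5f,  1.0f,  1.5f,  2.0f,  3.0f,  4.0f,  6.0f,
   -0.0f, -0.5f, -1.0f, -1.5f, -2.0f, -3.0f, -4.0f, -6.0f,
};

// Assumed block layout for illustration: one shared E8M0 scale byte
// followed by 32 E2M1 values packed two per byte.
typedef struct {
    uint8_t e;      // E8M0 exponent: scale = 2^(e - 127)
    uint8_t qs[16]; // 32 x 4-bit E2M1 codes
} block_mxfp4_sketch;

static void dequant_mxfp4_sketch(const block_mxfp4_sketch * b, float * y) {
    // E8M0 decode via ldexpf: an exponent shift, no powf required
    const float d = ldexpf(1.0f, (int) b->e - 127);
    for (int i = 0; i < 16; i++) {
        y[2*i + 0] = d * kvalues_e2m1[b->qs[i] & 0x0F];
        y[2*i + 1] = d * kvalues_e2m1[b->qs[i] >> 4];
    }
}
```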

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id
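
For context on ggml_add_id above: the op adds a row of src1, selected per row by an integer id tensor, to each row of src0; in the MoE graph this applies each routed expert's FFN bias after the expert matmul. A reference-style scalar sketch of the semantics, not the actual kernel:

```c
#include <stddef.h>
#include <stdint.h>

// Sketch of the add_id semantics: each row i of src0 gets the bias row
// selected by ids[i] from src1 added to it.
static void add_id_ref(
        const float   * src0, // [n_rows][n_cols] activations
        const float   * src1, // [n_experts][n_cols] per-expert bias table
        const int32_t * ids,  // [n_rows] expert index chosen per row
        float         * dst,  // [n_rows][n_cols] output (may alias src0)
        int n_rows, int n_cols) {
    for (int i = 0; i < n_rows; i++) {
        const float * bias = src1 + (size_t) ids[i] * n_cols;
        for (int j = 0; j < n_cols; j++) {
            dst[i*n_cols + j] = src0[i*n_cols + j] + bias[j];
        }
    }
}
```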

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix "Unknown reasoning format" error

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
| Name | Last commit | Date |
|------|-------------|------|
| amx | ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317) | 2025-06-25 23:49:04 +02:00 |
| arch | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| cmake | ggml : build backends as libraries (#10256) | 2024-11-14 18:04:35 +01:00 |
| kleidiai | kleidiai: add support for get_rows (#14676) | 2025-07-21 16:49:52 +03:00 |
| llamafile | ggml : refactor llamafile_sgemm PPC code (#14673) | 2025-07-14 16:16:42 +03:00 |
| CMakeLists.txt | ggml-cpu : disable GGML_NNPA by default due to instability (#14880) | 2025-07-25 19:09:03 +02:00 |
| arch-fallback.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| binary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| binary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| common.h | ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317) | 2025-06-25 23:49:04 +02:00 |
| ggml-cpu-impl.h | ggml-cpu: enable IBM NNPA Vector Intrinsics (#14317) | 2025-06-25 23:49:04 +02:00 |
| ggml-cpu.c | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| ggml-cpu.cpp | ggml : add ggml_set_rows (#14274) | 2025-06-27 16:41:40 +03:00 |
| hbm.cpp | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| hbm.h | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| ops.cpp | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| ops.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| quants.c | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| quants.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |
| repack.cpp | ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373) | 2025-08-01 09:20:33 +03:00 |
| repack.h | ggml : Q2k interleaving implementation - x86/x64 SIMD (#14373) | 2025-08-01 09:20:33 +03:00 |
| simd-mappings.h | llama : initial Mamba-2 support (#9126) | 2025-07-02 13:10:24 -04:00 |
| traits.cpp | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| traits.h | ggml-cpu : split arch-specific implementations (#13892) | 2025-06-09 16:47:13 +02:00 |
| unary-ops.cpp | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| unary-ops.h | cpu: de-duplicate some of the operators and refactor (ggml/1144) | 2025-03-30 08:33:31 +03:00 |
| vec.cpp | ggml : add asserts (#14720) | 2025-07-16 14:43:32 +03:00 |
| vec.h | llama : add gpt-oss (#15091) | 2025-08-05 22:10:36 +03:00 |