llama.cpp/ggml
shaobo.xie 4367734ac3 ggml: add moe_sum operator for Mixture of Experts aggregation
Add a new operator GGML_OP_MOE_SUM that efficiently aggregates outputs
from multiple experts in MoE models by summing along the expert dimension.

Input shape: [hidden_dim, n_expert_used, n_tokens]
Output shape: [hidden_dim, n_tokens]
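
For reference, the reduction this computes, written as a plain loop over
contiguous F32 data (a sketch of the semantics only, not the committed
kernel; the function name is hypothetical):

    #include <stdint.h>

    // dst[i, t] = sum_e src[i, e, t], with i the fastest-varying index,
    // matching ggml's [ne0, ne1, ne2] = [hidden_dim, n_expert_used, n_tokens] order.
    void moe_sum_ref(const float * src, float * dst,
                     int hidden_dim, int n_expert_used, int n_tokens) {
        for (int t = 0; t < n_tokens; ++t) {
            for (int i = 0; i < hidden_dim; ++i) {
                float acc = 0.0f;
                for (int e = 0; e < n_expert_used; ++e) {
                    acc += src[((int64_t) t*n_expert_used + e)*hidden_dim + i];
                }
                dst[(int64_t) t*hidden_dim + i] = acc;
            }
        }
    }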

CPU implementation:
- Optimized cache-friendly loop order (expert -> token -> hidden_dim); see the sketch after this list
- Multi-threaded parallelization across tokens
- Specialized F32 implementation for better performance
- 1.28x faster than the naive add_loop approach
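
A sketch of this accumulation order and token-level threading, assuming a
contiguous F32 layout and ggml's usual ith/nth thread-split convention
(the function name and exact splitting are illustrative, not the committed
code):

    #include <stdint.h>

    // expert -> token -> hidden_dim loop order; tokens are split across threads.
    static void moe_sum_f32_sketch(const float * src, float * dst,
                                   int hidden_dim, int n_expert_used, int n_tokens,
                                   int ith, int nth) {
        const int t0 = (n_tokens*(ith + 0))/nth;  // first token owned by this thread
        const int t1 = (n_tokens*(ith + 1))/nth;  // one past the last token

        for (int e = 0; e < n_expert_used; ++e) {
            for (int t = t0; t < t1; ++t) {
                const float * s = src + ((int64_t) t*n_expert_used + e)*hidden_dim;
                float       * d = dst +  (int64_t) t*hidden_dim;
                if (e == 0) {
                    for (int i = 0; i < hidden_dim; ++i) { d[i]  = s[i]; }  // init with expert 0
                } else {
                    for (int i = 0; i < hidden_dim; ++i) { d[i] += s[i]; }  // accumulate the rest
                }
            }
        }
    }

In this sketch the e == 0 pass doubles as the output initialization, so no
separate zeroing of dst is needed.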

CUDA implementation:
- Warp-per-token kernels for large token counts; see the sketch after this list
- Specialized F16 vectorized kernel for large batches
- Small-token kernels for edge cases
- 1.50x faster than the naive add_loop approach
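
A sketch of the warp-per-token idea for the F32 path, assuming contiguous
tensors; the kernel name and launch shape are illustrative, and the F16
vectorized and small-token variants are omitted:

    #include <stdint.h>

    #define SKETCH_WARP_SIZE 32

    // One warp per token: lanes stride over hidden_dim, and each lane sums
    // its element across the n_expert_used expert outputs.
    __global__ void moe_sum_f32_warp_sketch(const float * __restrict__ src,
                                            float * __restrict__ dst,
                                            const int hidden_dim,
                                            const int n_expert_used) {
        const int token = blockIdx.x;   // gridDim.x == n_tokens
        const int lane  = threadIdx.x;  // 0 .. SKETCH_WARP_SIZE-1

        const float * s = src + (int64_t) token*n_expert_used*hidden_dim;
        float       * d = dst + (int64_t) token*hidden_dim;

        for (int i = lane; i < hidden_dim; i += SKETCH_WARP_SIZE) {
            float acc = 0.0f;
            for (int e = 0; e < n_expert_used; ++e) {
                acc += s[(int64_t) e*hidden_dim + i];
            }
            d[i] = acc;
        }
    }

    // Possible launch, one 32-thread block per token:
    //   moe_sum_f32_warp_sketch<<<n_tokens, SKETCH_WARP_SIZE, 0, stream>>>(src, dst, hidden_dim, n_expert_used);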

Tests:
- 96 test cases covering F32/F16, various expert counts (2,4,8),
  hidden dimensions (64-4096), and token counts (16-256)
- Relaxed error threshold for F16 (1e-6 vs 1e-7 for F32) due to the
  limited precision of F16 when summing multiple expert outputs; see the
  check sketched below
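
The relaxed bound can be thought of as an NMSE-style comparison against a
reference sum; the helper below is only an illustration of that check (the
names, and whether the harness uses NMSE or an absolute bound here, are
assumptions):

    // Normalized mean squared error between the backend output and the
    // reference sum; thresholds mirror the values stated above.
    static double moe_sum_nmse(const float * out, const float * ref, int n) {
        double err = 0.0, den = 0.0;
        for (int i = 0; i < n; ++i) {
            const double d = (double) out[i] - (double) ref[i];
            err += d*d;
            den += (double) ref[i]*(double) ref[i];
        }
        return den > 0.0 ? err/den : err;
    }

    static bool moe_sum_within_tol(const float * out, const float * ref, int n, bool is_f16) {
        return moe_sum_nmse(out, ref, n) <= (is_f16 ? 1e-6 : 1e-7);
    }
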
2026-02-05 15:34:24 +08:00
cmake ggml: Skip backend library linking code when GGML_BACKEND_DL=ON (#15094) 2025-08-07 13:45:41 +02:00
include ggml: add moe_sum operator for Mixture of Experts aggregation 2026-02-05 15:34:24 +08:00
src ggml: add moe_sum operator for Mixture of Experts aggregation 2026-02-05 15:34:24 +08:00
.gitignore vulkan : cmake integration (#8119) 2024-07-13 18:12:39 +02:00
CMakeLists.txt Bump cmake max version (needed for Windows on Snapdragon builds) (#19188) 2026-02-01 14:13:38 -08:00