Add the ability to reduce the number of active experts in MoE models at runtime, giving a significant speedup with minimal quality loss when running with 50% of the default experts.

Implementation:
- Add a `moe_n_expert_override` parameter to `llama_context_params`
- Add a `--moe-n-expert` CLI flag to override `n_expert_used`
- Implement a "Hard Mask" in `build_moe_ffn()` that slices the expert tensors
- Use `ggml_view_2d`/`ggml_view_3d` + `ggml_cont` to reduce the actual computation

Benchmark results (AOCL BLIS 5.0, AMD EPYC 9655):
- Qwen3-Coder-480B-A35B: 2.5 → 3.7 t/s (48% speedup)
- GLM-4.6-355B-A32B: 2.2 → 3.0 t/s (36% speedup)
- Qwen3-Coder-30B-A3B: 26.6 → 33.6 t/s (26% speedup)
- Qwen3-VL-30B-A3B: 32.2 → 38.9 t/s (21% speedup)

Quality: excellent at 50% of the default experts, degraded at 25%, gibberish at 12.5%.

Usage: `llama-cli -m model.gguf --moe-n-expert 4 -p "prompt"`

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
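For illustration, a minimal sketch of driving the new setting from the C API instead of the CLI flag. Everything here is the existing `llama.h` API except the `moe_n_expert_override` field, which this PR adds; its exact type and default value are assumptions in this sketch.

```cpp
// Sketch: load a MoE model and halve the active experts via the new
// context parameter. `moe_n_expert_override` is the field added by this PR;
// its type (int32_t) and default (0 = "use model default") are assumed here.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) {
        fprintf(stderr, "usage: %s model.gguf\n", argv[0]);
        return 1;
    }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file(argv[1], mparams);
    if (model == nullptr) {
        fprintf(stderr, "failed to load model\n");
        return 1;
    }

    llama_context_params cparams = llama_context_default_params();
    // e.g. a model trained with n_expert_used = 8 would run with 4 active experts
    cparams.moe_n_expert_override = 4; // assumed field added by this PR

    llama_context * ctx = llama_init_from_model(model, cparams);
    if (ctx == nullptr) {
        fprintf(stderr, "failed to create context\n");
        llama_model_free(model);
        return 1;
    }

    // ... run generation as usual ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}
```

This mirrors what `--moe-n-expert 4` does in `llama-cli`, since the flag simply overrides `n_expert_used` through the same context parameter.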
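The "Hard Mask" itself lives in `build_moe_ffn()`; the standalone sketch below only illustrates the slicing primitive it relies on: a `ggml_view_2d` over the leading (top-ranked) entries per token, followed by `ggml_cont` to materialize the slice. Tensor names, shapes, and the header layout are assumptions for this example, not the PR's actual graph code.

```cpp
// Sketch of the slicing idea: keep only the first n_override of the selected
// experts per token by viewing the leading rows and making the view contiguous.
#include "ggml.h"
#include "ggml-cpu.h"   // ggml_graph_compute_with_ctx (assumed recent ggml header split)
#include <cstdio>

int main() {
    const int64_t n_expert_used = 8; // experts the model was trained to activate
    const int64_t n_override    = 4; // experts we actually want to compute (50%)
    const int64_t n_tokens      = 3;

    ggml_init_params ip = {
        /*.mem_size   =*/ 16 * 1024 * 1024,
        /*.mem_buffer =*/ nullptr,
        /*.no_alloc   =*/ false,
    };
    ggml_context * ctx = ggml_init(ip);

    // Stand-in for the per-token expert weights after routing:
    // shape [n_expert_used, n_tokens], already sorted best-first along dim 0.
    ggml_tensor * weights = ggml_new_tensor_2d(ctx, GGML_TYPE_F32, n_expert_used, n_tokens);
    float * w = (float *) weights->data;
    for (int64_t t = 0; t < n_tokens; ++t) {
        for (int64_t e = 0; e < n_expert_used; ++e) {
            w[t*n_expert_used + e] = 1.0f / (float)(e + 1); // descending per token
        }
    }

    // "Hard Mask": view only the first n_override entries of each token ...
    ggml_tensor * sliced = ggml_view_2d(ctx, weights,
            n_override, n_tokens,
            weights->nb[1],  // keep the original per-token stride
            0);              // start at expert 0 (the top-ranked one)
    // ... and make the slice contiguous so downstream ops only see
    // n_override experts' worth of data.
    ggml_tensor * compact = ggml_cont(ctx, sliced);

    ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, compact);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    const float * c = (const float *) compact->data;
    for (int64_t t = 0; t < n_tokens; ++t) {
        for (int64_t e = 0; e < n_override; ++e) {
            printf("%.3f ", c[t*n_override + e]);
        }
        printf("\n");
    }

    ggml_free(ctx);
    return 0;
}
```

Because the view keeps only the top-ranked experts and is then made contiguous, the per-expert work dispatched afterwards shrinks accordingly, which is presumably where the reduced computation and the measured speedups come from.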