llama.cpp/tools/moe-pruning/README.md

98 lines
4.3 KiB
Markdown

# MoE Expert Pruning Tools for NemotronH
REAP-style expert pruning for `NVIDIA-Nemotron-3-Nano-30B-A3B` (and other
NemotronH MoE models), implemented in two complementary ways:
1. **`tools/expert-profile/`** — C++ profiler built into llama.cpp, collects
REAP scores directly from GGUF inference via the ggml eval callback.
2. **`tools/moe-pruning/`** (this directory) — Python scripts to prune the model
using the collected scores, either on a GGUF file directly or on a
HuggingFace BF16 checkpoint.
---
## Inspiration & Prior Art
This work is a direct implementation of the **REAP** saliency criterion
introduced in:
> **REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression**
> Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
> Cerebras Research, 2025
> arXiv: https://arxiv.org/abs/2510.13999
> Code: https://github.com/CerebrasResearch/reap
The REAP score for expert `j` is (Equation 9 of the paper):
```
REAP(j) = mean_{t : j ∈ topk(t)} [ g_j(t) · ‖f_j(t)‖₂ ]
```
where `g_j(t)` is the router gate weight and `f_j(t)` is the expert FFN output
(pre-weighting) for token `t`. Experts with the lowest REAP score contribute
least to the layer output and are pruned first.
The original REAP repo targets HuggingFace models via PyTorch hooks on
standard architectures (Qwen3-MoE, Mixtral, DeepSeek-V2, Llama-4, …).
**What we added / adapted:**
- `tools/expert-profile/expert-profile.cpp` — llama.cpp C++ implementation
of REAP that intercepts `ffn_moe_topk`, `ffn_moe_weights`, and `ffn_moe_down`
tensors via `ggml_backend_eval_callback`, enabling REAP profiling on any
GGUF-quantised model (Q4_K_M, Q6_K, etc.) without needing full BF16 VRAM.
- `gguf_prune.py` — prunes the GGUF file **directly**, slicing the expert axis
of the stacked weight tensors (`ffn_up_exps`, `ffn_down_exps`, `ffn_gate_inp`,
`ffn_exp_probs_b`) and patching `{arch}.expert_count` in the metadata.
Quantised blocks are preserved as raw bytes — no dequantise/requantise step.
- `nemotron_reap.py` — HuggingFace-based alternative: profiles with 4-bit NF4
on GPU (phase 1) and prunes the BF16 checkpoint on CPU (phase 2). Adds
NemotronH (`NemotronHForCausalLM`) support that the original REAP repo does
not have.
---
## Recommended Workflow (low-VRAM, e.g. RTX 4060 Ti 16 GB)
```
┌─────────────────────────────────────────────┐
│ Phase 1 — Profile (GPU, GGUF Q4, ~15 GB) │
│ │
│ llama-expert-profile │
│ -m nemotron-Q4_K_M.gguf │
│ --jsonl sample_calibration.jsonl │
│ --output expert_stats.json │
│ -ngl 99 --ctx-size 2048 │
└───────────────────┬─────────────────────────┘
│ expert_stats.json
┌───────────────────▼─────────────────────────┐
│ Phase 2 — Prune (CPU, pure Python, ~2 GB) │
│ │
│ python gguf_prune.py │
│ --input nemotron-Q4_K_M.gguf │
│ --stats expert_stats.json │
│ --output nemotron-pruned-26e.gguf │
│ --keep_ratio 0.20 # 26/128 experts │
└─────────────────────────────────────────────┘
```
At 20 % keep ratio a ~22 GB Q4_K_M becomes ~4.5 GB.
---
## Files
| File | Description |
|---|---|
| `gguf_prune.py` | GGUF-native pruner — no GPU needed, preserves quantisation |
| `nemotron_reap.py` | HF-based pruner — 4-bit GPU profile + CPU BF16 prune |
| `build_expert_profile.sh` | Build script for `llama-expert-profile` |
| `run_nemotron_profile.sh` | Example profiling run |
| `run_prune.sh` | Example pruning run |
| `run_convert_quantize.sh` | Convert HF → GGUF and quantise |
| `analyze_stats.py` | Visualise and compare expert stats JSON files |
| `sample_calibration.jsonl` | Sample calibration data (prompt+response pairs) |
| `expert_stats_reap.json` | Example stats output from expert-profile |