# MoE Expert Pruning Tools for NemotronH

REAP-style expert pruning for `NVIDIA-Nemotron-3-Nano-30B-A3B` (and other
NemotronH MoE models), split across two complementary tools:

1. **`tools/expert-profile/`** — C++ profiler built into llama.cpp, collects
   REAP scores directly from GGUF inference via the ggml eval callback.
2. **`tools/moe-pruning/`** (this directory) — Python scripts to prune the model
   using the collected scores, either on a GGUF file directly or on a
   HuggingFace BF16 checkpoint.

---

## Inspiration & Prior Art

This work is a direct implementation of the **REAP** saliency criterion
introduced in:

> **REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression**
> Mike Lasby, Ivan Lazarevich, Nish Sinnadurai, Sean Lie, Yani Ioannou, Vithursan Thangarasa
> Cerebras Research, 2025
> arXiv: https://arxiv.org/abs/2510.13999
> Code:  https://github.com/CerebrasResearch/reap

The REAP score for expert `j` is (Equation 9 of the paper):

```
REAP(j) = mean_{t : j ∈ topk(t)} [ g_j(t) · ‖f_j(t)‖₂ ]
```

where `g_j(t)` is the router gate weight and `f_j(t)` is the expert FFN output
(pre-weighting) for token `t`. The mean runs only over tokens whose router
top-k includes expert `j`. Experts with the lowest REAP score contribute
least to the layer output and are pruned first.
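
As a rough illustration of the formula above, the following pure-Python sketch
accumulates `g_j(t) · ‖f_j(t)‖₂` over the tokens routed to each expert and
divides by the routing count. The function name `reap_scores`, the input layout
(a list of `(expert_id, gate_weight, output_vector)` tuples per token), and the
choice to score never-selected experts as `0.0` are assumptions made for this
example; they are not taken from the profiler's code.

```python
import math

def reap_scores(routed_tokens, n_experts):
    """Mean of gate_weight * L2-norm(expert_output) over tokens routed to each expert.

    routed_tokens: one list per token, holding (expert_id, gate_weight, output_vector)
    for every expert in that token's top-k (hypothetical layout for this sketch).
    """
    sums = [0.0] * n_experts
    counts = [0] * n_experts
    for token in routed_tokens:
        for expert_id, gate_weight, output in token:
            norm = math.sqrt(sum(x * x for x in output))
            sums[expert_id] += gate_weight * norm
            counts[expert_id] += 1
    # Assumption: an expert that was never selected gets a score of 0.0.
    return [s / c if c else 0.0 for s, c in zip(sums, counts)]

# Two tokens, top-2 routing, four experts.
tokens = [
    [(0, 0.7, [3.0, 4.0]), (1, 0.3, [1.0, 0.0])],
    [(0, 0.5, [0.0, 2.0]), (2, 0.5, [6.0, 8.0])],
]
scores = reap_scores(tokens, n_experts=4)
```

In this toy input, expert 3 is never routed to and expert 1 has a small gated
output norm, so those two would be the first candidates for pruning.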

The original REAP repo targets HuggingFace models via PyTorch hooks on
standard architectures (Qwen3-MoE, Mixtral, DeepSeek-V2, Llama-4, …).

**What we added / adapted:**

- `tools/expert-profile/expert-profile.cpp` — llama.cpp C++ implementation
  of REAP that intercepts `ffn_moe_topk`, `ffn_moe_weights`, and `ffn_moe_down`
  tensors via `ggml_backend_sched_eval_callback`, enabling REAP profiling on any
  GGUF-quantised model (Q4_K_M, Q6_K, etc.) without needing full BF16 VRAM.

- `gguf_prune.py` — prunes the GGUF file **directly**, slicing the expert axis
  of the stacked weight tensors (`ffn_up_exps`, `ffn_down_exps`, `ffn_gate_inp`,
  `ffn_exp_probs_b`) and patching `{arch}.expert_count` in the metadata.
  Quantised blocks are preserved as raw bytes — no dequantise/requantise step.

- `nemotron_reap.py` — HuggingFace-based alternative: profiles with 4-bit NF4
  on GPU (phase 1) and prunes the BF16 checkpoint on CPU (phase 2). Adds
  NemotronH (`NemotronHForCausalLM`) support that the original REAP repo does
  not have.
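
The raw-byte slicing described for `gguf_prune.py` can be sketched roughly as
follows. This is a minimal illustration, not the script's actual code: it
assumes the expert axis is the slowest-varying dimension of each stacked
tensor, so every expert occupies one contiguous, equally sized byte range.
`slice_expert_axis` is a hypothetical helper name.

```python
def slice_expert_axis(raw, n_experts, keep):
    """Keep only the byte ranges of the experts listed in `keep`.

    raw: the tensor's data as bytes, with all experts stacked back to back
         (assumption: expert is the outermost, slowest-varying axis).
    Quantised blocks are copied untouched; nothing is dequantised.
    """
    if len(raw) % n_experts != 0:
        raise ValueError("tensor size is not a multiple of the expert count")
    stride = len(raw) // n_experts  # bytes per expert
    return b"".join(raw[j * stride:(j + 1) * stride] for j in keep)

# Toy tensor: 4 experts, 3 bytes each.
raw = bytes(range(12))
pruned = slice_expert_axis(raw, n_experts=4, keep=[0, 2])
```

Because whole per-expert byte ranges are copied verbatim, the same helper would
apply to any quantisation type whose blocks do not straddle expert boundaries.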

---

## Recommended Workflow (low-VRAM, e.g. RTX 4060 Ti 16 GB)

```
┌─────────────────────────────────────────────┐
│  Phase 1 — Profile  (GPU, GGUF Q4, ~15 GB)  │
│                                             │
│  llama-expert-profile                       │
│    -m nemotron-Q4_K_M.gguf                  │
│    --jsonl sample_calibration.jsonl         │
│    --output expert_stats.json               │
│    -ngl 99 --ctx-size 2048                  │
└───────────────────┬─────────────────────────┘
                    │ expert_stats.json
┌───────────────────▼─────────────────────────┐
│  Phase 2 — Prune  (CPU, pure Python, ~2 GB) │
│                                             │
│  python gguf_prune.py                       │
│    --input  nemotron-Q4_K_M.gguf            │
│    --stats  expert_stats.json               │
│    --output nemotron-pruned-26e.gguf        │
│    --keep_ratio 0.20   # 26/128 experts     │
└─────────────────────────────────────────────┘
```

At a 20% keep ratio (26 of 128 experts), a ~22 GB Q4_K_M file shrinks to ~4.5 GB.
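
To make the keep-ratio arithmetic concrete, here is a small sketch of how a
ratio could be turned into a set of expert indices for one layer. Rounding to
the nearest integer reproduces the 26-of-128 figure for a ratio of 0.20, but
the rounding rule, the function name `experts_to_keep`, and the flat list of
per-expert scores are assumptions for this example; `gguf_prune.py` may select
experts differently.

```python
def experts_to_keep(scores, keep_ratio):
    """Return the indices of the highest-scoring experts, in ascending order."""
    # Assumption: round to nearest, and always keep at least one expert.
    n_keep = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:n_keep])

# One layer with five experts, keeping 40% of them.
layer_scores = [0.3, 2.25, 0.0, 5.0, 1.1]
kept = experts_to_keep(layer_scores, keep_ratio=0.4)

# The README's case: 128 experts at a 0.20 keep ratio.
n_kept_128 = len(experts_to_keep([float(i) for i in range(128)], 0.20))
```

Returning the indices in ascending order keeps the surviving experts in their
original relative order, so every sliced tensor stays aligned with the others.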

---

## Files

| File | Description |
|---|---|
| `gguf_prune.py` | GGUF-native pruner — no GPU needed, preserves quantisation |
| `nemotron_reap.py` | HF-based pruner — 4-bit GPU profile + CPU BF16 prune |
| `build_expert_profile.sh` | Build script for `llama-expert-profile` |
| `run_nemotron_profile.sh` | Example profiling run |
| `run_prune.sh` | Example pruning run |
| `run_convert_quantize.sh` | Convert HF → GGUF and quantise |
| `analyze_stats.py` | Visualise and compare expert stats JSON files |
| `sample_calibration.jsonl` | Sample calibration data (prompt+response pairs) |
| `expert_stats_reap.json` | Example stats output from expert-profile |