# llama.cpp — Native QLoRA Training Native QLoRA + Reward-Weighted SFT training pipeline for quantized GGUF models. The base model weights remain **frozen** (quantized tensors are skipped by `llama_set_param` because they are not `GGML_TYPE_F32`). Only freshly-allocated F32 LoRA A/B tensors are trained. The saved adapter GGUF is directly compatible with the existing `llama_adapter_lora_init` loader and `llama-export-lora` merge tool. **Status:** Working. Phase 1 (QLoRA SFT) and Phase 2 (Reward-Weighted SFT) are implemented and functional. Training speed is currently limited by full backprop through quantized weights — see [Known Limitations](#known-limitations). --- ## Build ```bash cd /mnt/w/llm-trading-arena/unsloth-api/llama.cpp # First time (CUDA build): cmake -B build -DGGML_CUDA=ON -DBUILD_SHARED_LIBS=ON -DLLAMA_CURL=OFF cmake --build build -j$(nproc) # Incremental rebuild (after code changes): cmake --build build --target llama-finetune-qlora -j$(nproc) # If llama-adapter.cpp or llama-context.cpp changed, rebuild all: cmake --build build -j$(nproc) ``` --- ## Phase 1 — QLoRA SFT (`llama-finetune-qlora`) Trains LoRA adapters on a quantized GGUF model. ### Recommended command (1.7B model, 16 GB card) ```bash ./build/bin/llama-finetune-qlora \ --model ~/qwen3-1.7b-q4_k_m.gguf \ --train-file data/train.jsonl \ --lora-rank 16 --lora-alpha 16 \ -c 4096 -b 4096 -ub 512 \ --save-every 10 \ --lora-out ~/adapter.gguf \ --epochs 3 --seed 42 ``` ### Recommended command (15B model, 16 GB card, partial offload) ```bash ./build/bin/llama-finetune-qlora \ --model ~/nemotron-15b-q4_k_m.gguf \ --train-file data/train.jsonl \ --lora-rank 16 --lora-alpha 16 \ -ngl 13 -c 14336 -b 14336 -ub 1024 \ --save-every 8 \ --lora-out ~/nemotron-lora.gguf \ --epochs 3 --seed 42 ``` ### All flags | Flag | Default | Description | |---|---|---| | `--model` | *(required)* | Path to quantized GGUF model | | `--train-file` | *(required)* | JSONL training dataset | | `--lora-rank` | `16` | LoRA rank r | | `--lora-alpha` | `0` (= rank) | LoRA alpha; effective scale = alpha/rank | | `--lora-targets` | see below | Comma-separated internal tensor name substrings | | `--lora-out` | `adapter.gguf` | Output adapter GGUF path (supports `~`) | | `--save-every` | `0` | Save checkpoint every N dataset windows (0 = end only) | | `--freeze-layers` | `0` | Skip LoRA on first N transformer layers (blk.0..N-1); backward already pruned automatically | | `--grad-checkpoint` | `0` | Mark every Nth forward node persistent to reduce activation VRAM; good values: 32–64 | | `--train-on-prompt` | off | Compute loss on prompt tokens too (default: response-only loss) | | `--shuffle-dataset` | off | Shuffle dataset windows at the start of each epoch | | `--val-split` | `0.0` | Fraction of data to hold out for validation (e.g. `0.1` = 10%); val loss logged per epoch | | `-epochs` / `--epochs` | `3` | Training epochs | | `-c` / `--ctx-size` | `512` | Training context window (tokens) | | `-b` / `--batch-size` | `2048` | Tokens per `llama_decode` call; set equal to `-c` | | `-ub` / `--ubatch-size` | `512` | GPU micro-batch tokens; controls VRAM vs. step time | | `-ngl` | `999` | GPU layers to offload | | `-lr` / `--learning-rate` | `1e-4` | AdamW learning rate | | `--seed` | `42` | Random seed for LoRA init | ### VRAM vs. step-time tradeoff Step time and VRAM both scale linearly with `-ub`: | Model | `-ub` | VRAM | Step time (approx) | |---|---|---|---| | 1.7B Q4_K_M | 512 | ~18 GB | ~120 s (OOM on 16 GB) | | 1.7B Q4_K_M | 128 | ~6 GB | ~30 s | | 15B Q4_K_M | 1024 | ~11 GB | ~60 s | Use `-c` equal to your target sequence length. More context = more windows per sample = more steps per epoch. Reducing `-c` reduces total training time proportionally. ### Default LoRA targets llama.cpp uses **internal GGUF tensor names**, not HuggingFace names: | llama.cpp internal | HuggingFace equivalent | Status | |---|---|---| | `attn_q` | `q_proj` | ✅ default target, trainable | | `attn_output` | `o_proj` | ✅ default target, trainable | | `ffn_gate` | `gate_proj` | ✅ default target, trainable | | `ffn_up` | `up_proj` | ✅ default target, trainable | | `ffn_down` | `down_proj` | ✅ default target, trainable | | `attn_k` | `k_proj` | ❌ not in defaults — zero gradient (KV scatter via SET_ROWS) | | `attn_v` | `v_proj` | ❌ not in defaults — zero gradient (KV scatter via SET_ROWS) | | `ssm_in` | `in_proj` | ❌ not in defaults — zero gradient (SSM_SCAN no backward) | | `ssm_out` | `out_proj` | ❌ not in defaults — zero gradient (SSM_SCAN no backward) | **MoE models:** Expert tensors (`*_exps`) are excluded regardless of `--lora-targets`. The quantized expert weights are frozen (stop-gradient), but LoRA on the dense FFN layers (`ffn_gate`, `ffn_up`, `ffn_down`) works — backward via `MUL_MAT_ID` + `OUT_PROD_ID`. ### Dataset format (JSONL) **Chat format** (loss on response only; use `--train-on-prompt` for all tokens): ```json {"messages": [{"role": "user", "content": "Hello"}, {"role": "assistant", "content": "Hi!"}]} ``` **Prompt/response** (loss on response only): ```json {"prompt": "What is the capital of France?", "response": "Paris."} ``` **Plain text** (loss on all tokens): ```json {"text": "The quick brown fox."} ``` **With reward** (Phase 2 — scales gradient by reward): ```json {"prompt": "...", "response": "...", "reward": 0.85} ``` Rewards are normalized per epoch: clipped to `[-1, 1]`, then min-max scaled to `[0, 1]`. Reward 0 = sample ignored; reward 1 = full gradient. ### Verify and use the adapter ```bash # Hot-load for inference (no merge needed) ./build/bin/llama-cli --model base.gguf --lora adapter.gguf -p "Hello" # Merge into base model ./build/bin/llama-export-lora \ --model base.gguf --lora adapter.gguf --output merged.gguf ``` --- ## Phase 2 — Reward-Weighted SFT Built into `llama-finetune-qlora`. When the dataset contains a `reward` or `score` field, the cross-entropy loss for that sample is scaled by the reward before backprop. No extra flags needed — detection is automatic. --- ## Phase 3 — GRPO (Online RL via IPC) `llama-finetune-qlora --grpo-mode` implements a full GRPO training loop where the Python process owns prompt sampling and reward scoring, and the C++ process owns model state, generation, and gradient updates. ### Quick start ```bash python3 examples/qlora_training/grpo_example.py \ --model ~/qwen3-1.7b-q4_k_m.gguf \ --lora-out ~/grpo-adapter.gguf \ --rank 16 --n-steps 200 --n-gen 8 ``` For verbose output (includes IPC message trace): ```bash python3 examples/qlora_training/grpo_example.py \ --model ~/qwen3-1.7b-q4_k_m.gguf \ --lora-out ~/grpo-adapter.gguf \ --verbose ``` Resume from a checkpoint: ```bash python3 examples/qlora_training/grpo_example.py \ --model ~/qwen3-1.7b-q4_k_m.gguf \ --lora ~/grpo-adapter.ckpt50.gguf \ --lora-out ~/grpo-adapter.gguf ``` ### GRPO-specific flags | Flag | Default | Description | |---|---|---| | `--grpo-mode` | off | Enable GRPO IPC mode | | `--n-gen` | `8` | Rollouts per prompt | | `--n-steps` | `500` | Total GRPO steps | | `--grpo-temp` | `0.8` | Sampling temperature for rollouts | | `--grpo-max-tokens` | `512` | Max tokens per generation | All standard flags (`--lora-rank`, `-lr`, `-c`, `-ngl`, `--save-every`, etc.) work in GRPO mode too. `--train-file` is **not** required in GRPO mode. ### IPC protocol The protocol is line-based over stdout (C++ → Python) and stdin (Python → C++). All non-protocol C++ output (timing, debug, model logs) goes to **stderr** and never contaminates the protocol channel. **C++ → Python (stdout):** | Line | When | |---|---| | `[QLORA:READY]` | Process initialised, model loaded | | `[QLORA:PROMPT_REQ:]` | C++ requests the prompt for step N | | `[QLORA:GEN:/] ` | One generation (newlines escaped as `\n`) | | `[QLORA:REWARD_REQ:]` | C++ requests N reward scores | | `[QLORA:PROGRESS] step=X/Y loss=Z epoch=A/B` | After each weight update | | `[QLORA:CHECKPOINT] ` | After saving a checkpoint | | `[QLORA:DONE] final_loss=X` | Training complete | | `[QLORA:ERROR] ` | Fatal error | **Python → C++ (stdin):** | Line | Meaning | |---|---| | `PROMPT ` | Send prompt for the most recent `PROMPT_REQ` | | `REWARD ` | Send N advantage scores in `[0, 1]` range | | `STOP` | Request graceful shutdown after current step | **Text encoding:** newlines in generation text are escaped as the two-character sequence `\n`; backslashes are doubled. Use `unescape()` from `grpo_example.py` (or any equivalent) to recover the original text. ### Writing your own driver `grpo_example.py` contains two functions you replace with your own logic: ```python def get_prompt(step: int) -> str: """Return the training prompt for step N.""" ... def score_generations(prompt: str, generations: List[str]) -> List[float]: """Score each generation. Any numeric range — will be normalised.""" ... ``` The IPC helpers (`escape`, `unescape`, `parse_ipc`, `read_ipc`, `write_cmd`, `wait_for`, `normalise_rewards`) are standalone and have no external dependencies — copy them into your own project if needed. ### Training loop diagram ``` Python C++ (llama-finetune-qlora --grpo-mode) │ │ │◄──── [QLORA:READY] ────────────┤ model loaded │ │ │ ┌─────────────────────────────┤ │ │ for each step: │ │ │ ◄── PROMPT_REQ:N ─────────┤ │ │ ──► PROMPT ────────► generate n_gen rollouts │ │ ◄── GEN:1/n ──┤ │ │ ◄── GEN:2/n ──┤ │ │ ... │ │ │ ◄── GEN:n/n ──┤ │ │ ◄── REWARD_REQ:n ─────────┤ │ │ (score generations) │ │ │ ──► REWARD a1 a2 … an ────► one backward + AdamW step │ │ ◄── PROGRESS step=N/M … ──┤ │ └─────────────────────────────┤ │ │ │◄──── [QLORA:DONE] ─────────────┤ adapter saved ``` --- ## Known Limitations & Optimization Roadmap ### Current limitations **1. Full backprop through frozen quantized layers** Every backward step dequantizes all frozen Q4_K_M weight tensors to compute activation gradients (needed to propagate loss from the output back to each LoRA layer). For a 28-layer 1.7B model at `-ub 512`, this is ~280 dequantizing matmuls per step → step time is 3–5× slower than inference. **2. Activation VRAM** *(partially addressed by `--grad-checkpoint`)* All forward activations are kept in VRAM throughout the backward pass. VRAM ≈ `model + KV + n_layers × hidden × n_ubatch × 10 × 4B + 2 × lora_params × 4B`. Reducing `-ub` reduces VRAM linearly. Use `--grad-checkpoint 48` to prevent the allocator from reusing intermediate activation buffers during backward, which cuts peak activation VRAM at near-zero compute cost. **3. Full backprop through all layers** *(partially addressed by `--freeze-layers`)* Gradients propagate through all layers that have LoRA adapters. Use `--freeze-layers N` to skip LoRA allocation for blk.0..N-1 — those layers receive no gradient (the `grads_needed` pruner already skips their backward ops automatically). Only the top (total_layers - N) layers are trained. ### Optimization roadmap | Priority | Optimization | Expected gain | Status | |---|---|---|---| | ✅ Done | **`--freeze-layers N`** — no LoRA on first N layers; backward auto-pruned | Proportional to N/total | Implemented | | ✅ Done | **`--grad-checkpoint N`** — keep every Nth activation alive through backward | Reduces peak activation VRAM | Implemented | | ✅ Done | **`--train-on-prompt`** — compute loss on prompt tokens too | Configurable loss target | Implemented | | ✅ Done | **`--shuffle-dataset`** — shuffle windows each epoch | Better convergence | Implemented | | ✅ Done | **BOS separators** — insert BOS between concatenated samples | Correct cross-sample boundaries | Implemented | | ✅ Done | **Per-epoch loss summary** — log train/val loss after each epoch | Observability | Implemented | | ✅ Done | **`MUL_MAT_ID` backward** — LoRA on MoE dense FFN layers; `OUT_PROD_ID` for scattered outer product | Unlocks Mixtral/Nemotron-MoE | Implemented | | ✅ Done | **Quantized `OUT_PROD`** — dequantize on GPU + cuBLAS for backward matmul | Full GPU training (no CPU fallback) | Implemented | | ✅ Done | **Reuse `ctx_compute_opt`** — allocate tensor metadata context once, `ggml_reset()` across ubatches | Eliminate ~0.5 s/step overhead | Implemented | | ❌ Skip | **Static training graphs** — KV mask shape changes per ubatch (`n_kv` grows); graph topology not static | Would need KV cache redesign | Not feasible | | Low | **`SSM_SCAN/CONV` backward** — enable LoRA on Mamba SSM layers | Unlocks NemotronH SSM layers | Planned | | Low | **GELU backward** — implement `ggml_gelu_back` kernel (UNARY + GLU) | Support GPT-2/Phi-style models | Planned (needs new CUDA/CPU kernels) | --- ## Implementation notes (for developers) ### Modified llama.cpp files | File | Change | |---|---| | `ggml/src/ggml.c` | Backward graph fixes: `GET_ROWS` 3D, `SET_ROWS`, `MUL_MAT_ID`, `SSM_SCAN/CONV`, `FLASH_ATTN_EXT` all stop gradient; inplace-op assert → warn+skip | | `src/llama-context.cpp` | `opt_init`: scheduler and graph sized with inflated capacity before `ggml_opt_init`; `opt_epoch_iter`: per-ubatch timing instrumentation; reward scaling via `g_reward_weights` TLS | | `src/llama-adapter.cpp` | Repack-buft fallback for LoRA tensors: tries device-native buft before CPU | | `common/common.h` | Added `save_every`, `lora_freeze_layers`, `grad_checkpoint_interval`, `train_on_prompt`, `shuffle_dataset` fields | | `common/arg.cpp` | Added `--save-every`, `--freeze-layers`, `--grad-checkpoint`, `--train-on-prompt`, `--shuffle-dataset` arguments | | `include/llama.h` | Added `llama_opt_set_reward_weights()`; `grad_checkpoint_interval` in `llama_opt_params`; `shuffle` param in `llama_opt_epoch` | | `ggml/src/ggml-cuda/out-prod.cu` | `OUT_PROD` with quantized src0 (dequantize on GPU + cuBLAS); `OUT_PROD_ID` for MoE backward | | `ggml/src/ggml-cuda/ggml-cuda.cu` | `supports_op` for quantized `OUT_PROD` and `OUT_PROD_ID`; CPU-resident ids fix in `mul_mat_id` | | `ggml/include/ggml-opt.h` | Added `grad_checkpoint_interval` to `ggml_opt_params` | | `ggml/src/ggml-opt.cpp` | Gradient checkpointing: marks every Nth forward node `GGML_TENSOR_FLAG_OUTPUT` before backward build | ### Key invariants - `params.use_mmap = false` — forced; mmap'd tensors can't have data written back - `params.flash_attn_type = DISABLED` — no backward impl for flash attention - `params.warmup = false` — warmup runs inference with PARAM tensors → segfault - `params.cache_type_k = F32` — training requires F32 KV (or BF16 with `--cache-type-k bf16`) - LoRA A/B tensors are marked `PARAM` via `ggml_set_param` on the tensors loaded by `llama_adapter_lora_init`, not on the pre-init scratch tensors in `lt.buf` - The adapter GGUF is pre-saved and loaded via `params.lora_adapters` BEFORE `common_init_from_params` so that `sched_reserve` includes LoRA graph nodes in its sizing ### Why opt_init inflation matters `ggml_opt_init` captures `sched.get()` at construction time. The backward graph (`gb_grad`, `gb_opt`) is ~3–5× larger than the forward graph in node count. If the scheduler hash_set is sized only for the forward graph, `ggml_backend_sched_alloc_graph` on the backward graph will overflow it. We recreate `sched` with `inflated = fwd_nodes × 4` slots BEFORE calling `ggml_opt_init`. ### Reward weighting implementation `llama_opt_set_reward_weights(weights, n)` sets thread-local `g_reward_weights`. In `opt_epoch`, each window reads `g_reward_weights[idata]` and passes it as `reward_scale` to `opt_epoch_iter`. Inside the iter loop, instead of writing `1.0f` for the correct token's label position in the cross-entropy label tensor, it writes `reward_scale`. Since cross-entropy loss = `-mean(label × log(softmax(logit)))`, scaling the label scales both loss and gradient identically.