llama.cpp/examples/speculative-simple
ruixiangw 8fac4b1cc8 feat: add EAGLE3 speculative decoding support
EAGLE3 is an encoder-decoder-based speculative decoding method:
- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress the target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via the d2t tensor
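
The vocabulary-mapping step in the last bullet can be sketched as follows. This is a minimal illustration, not code from this commit: `map_draft_to_target` is a hypothetical helper, and it assumes that `d2t` stores one offset per draft token id, so that the target id is the draft id plus that offset (the convention used by the reference EAGLE-3 implementation). The layout used in this commit may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of llama.cpp.
// Assumption: d2t[i] is the offset added to draft token id i
// to obtain the corresponding token id in the target vocabulary.
int32_t map_draft_to_target(const std::vector<int64_t> & d2t, int32_t draft_id) {
    if (draft_id < 0 || static_cast<std::size_t>(draft_id) >= d2t.size()) {
        throw std::out_of_range("draft token id outside the draft vocabulary");
    }
    return static_cast<int32_t>(draft_id + d2t[static_cast<std::size_t>(draft_id)]);
}
```

With a toy table `d2t = {0, 3, 10}`, draft id 1 would map to target id 4 under this assumption. The draft vocabulary is smaller than the target vocabulary, so every draft id has a target id but not the other way around.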

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
2025-12-14 18:12:33 +00:00
| File | Last commit | Date |
| --- | --- | --- |
| CMakeLists.txt | ggml : move AMX to the CPU backend (#10570) | 2024-11-29 21:54:58 +01:00 |
| README.md | speculative : refactor and add a simpler example (#10362) | 2024-11-25 09:58:41 +02:00 |
| speculative-simple.cpp | feat: add EAGLE3 speculative decoding support | 2025-12-14 18:12:33 +00:00 |

README.md

llama.cpp/examples/speculative-simple

Demonstration of basic greedy speculative decoding: a small draft model (`-md`) proposes tokens and the larger target model (`-m`) verifies them.

./bin/llama-speculative-simple \
    -m  ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf \
    -md ../models/qwen2.5-1.5b-coder-instruct/ggml-model-q4_0.gguf \
    -f test.txt -c 0 -ngl 99 --color \
    --sampling-seq k --top-k 1 -fa --temp 0.0 \
    -ngld 99 --draft-max 16 --draft-min 5 --draft-p-min 0.9
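
The greedy draft-and-verify loop that this example demonstrates can be sketched with toy models. This is a minimal illustration under stated assumptions, not the code in `speculative-simple.cpp`: `Token`, `GreedyModel`, and `speculative_step` are hypothetical names, each model is reduced to a function that returns its greedy next token, and the target is queried once per position here, where a real implementation would verify all drafted tokens in a single batched pass. Only a `draft_max`-style limit is modeled; `--draft-min` and `--draft-p-min` are not.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

using Token = int;
// Hypothetical stand-in for a model: returns the greedy next token for a context.
using GreedyModel = std::function<Token(const std::vector<Token> &)>;

// One speculative step. Appends the accepted tokens to ctx and
// returns how many tokens were appended.
std::size_t speculative_step(const GreedyModel & target,
                             const GreedyModel & draft,
                             std::vector<Token> & ctx,
                             std::size_t draft_max) {
    // 1. The draft model proposes up to draft_max tokens autoregressively.
    std::vector<Token> scratch = ctx;
    std::vector<Token> proposal;
    for (std::size_t i = 0; i < draft_max; ++i) {
        const Token t = draft(scratch);
        proposal.push_back(t);
        scratch.push_back(t);
    }

    // 2. The target model checks each proposed token in order.
    std::size_t appended = 0;
    for (const Token guess : proposal) {
        const Token want = target(ctx);
        ctx.push_back(want);   // the target's token is always the one kept
        ++appended;
        if (want != guess) {
            return appended;   // first mismatch: discard the rest of the draft
        }
    }

    // 3. Every drafted token matched, so the target adds one more token.
    ctx.push_back(target(ctx));
    return appended + 1;
}
```

Because the token kept at each position is always the target's greedy choice, the output is the same sequence the target model would produce on its own. The draft model only affects how many tokens are accepted per step.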