Add a new llama_pre_alloc_callback that fires after graph construction but before memory allocation in llama_decode/llama_encode. This allows downstream consumers to call ggml_backend_sched_set_tensor_backend() to route specific ops (e.g. attention) to a different backend without modifying llama.cpp internals.

Changes:
- Add llama_pre_alloc_callback typedef to llama.h
- Add cb_pre_alloc + cb_pre_alloc_user_data to llama_context_params and llama_cparams
- Invoke callback in process_ubatch() between build_graph and alloc_graph
- Add test that verifies callback invocation and backend reassignment

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Files changed:
- llama-cpp.h
- llama.h