- Remove unused #include <cstring>
- Fix false positive when only one backend is available
- Clarify comment: "reassign graph nodes" instead of "reassign ops"
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add missing llama_backend_free() on model load failure path
- Only print diagnostics on failure, not on success
- Pick the target backend by finding one different from the current
backend, instead of assuming a fixed backend ordering
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add a new llama_pre_alloc_callback that fires after graph construction
but before memory allocation in llama_decode/llama_encode. This allows
downstream consumers to call ggml_backend_sched_set_tensor_backend()
to route specific graph nodes (e.g. attention) to a different backend without
modifying llama.cpp internals.
Changes:
- Add llama_pre_alloc_callback typedef to llama.h
- Add cb_pre_alloc + cb_pre_alloc_user_data to llama_context_params
and llama_cparams
- Invoke callback in process_ubatch() between build_graph and
alloc_graph
- Add test that verifies callback invocation and backend reassignment
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>