This commit changes the runner for the gguf-publish workflow from
ubuntu-slim back to ubuntu-latest, which was updated in Commit
142cbe2ac6 ("ci : use new 1vCPU runner for
lightweight jobs (#19107)").
The motivation for this is that the action used in the workflow depends
on the docker daemon, which does not seem not available in the
ubuntu-slim runner. This is currently causing an error in the workflow
and preventing the gguf-publish workflow from running successfully.
Today was the the first time since the original change (I think) that
publish task has been run which may be why the issue was not noticed
before.
Refs: https://github.com/ggml-org/llama.cpp/actions/runs/22481900566
* server : support multiple model aliases via comma-separated --alias
* server : update --alias description and regenerate docs
* server : multiple model aliases and tags
- address review feedback from ngxson
- --alias accepts comma-separated values (std::set, no duplicates)
- --tags for informational metadata (not used for routing)
- aliases resolve transparently in router via get_meta/has_model
- /v1/models exposes aliases and tags fields
* regenerate docs
* nits
* server : use first alias as model_name for backward compat
address review feedback from ngxson
* server : add single-model test for aliases and tags
The binary relies on model files that it tries to find. However, when
configuring the build directory to be parallel to the source tree those
heuristics fail.
This sets the working directory for the test executable to be the
source-tree which resolves this issue.
- adapt ggml-zendnn.cpp to the new lowoha::matmul interface
- update the ZenDNN git tag in CMake to the latest release (ZenDNN‑2026‑WW08)
- add static lib support in CMake
* ggml-virtgpu-backend: validate the consistency of the received objects
This patch adds consistency checks in the
ggml-virtgpu-backend (running on the host side) to ensure that the
data received from the guest is consistent (valid pointers, valid
sizes and offsets).
* ggml-virtgpu-backend: add fallback/skips for optional ggml backend methods
```
1. bck->iface.synchronize(bck)
2. buft->iface.get_alloc_size(buft, op)
3. buft->iface.get_max_size(buft)
```
these three methods are optional in the GGML interface. `get_max_size`
was already properly defaulted, but `backend sychronize` and `butf
get_max_size` would have segfaulted the backend if not implemented.
* ggml-virtgpu-backend: fix log format missing argument
* ggml-virtgpu-backend: improve the abort message
* ggml-virtgpu-backend: more safety checks
* ggml-virtgpu-backend: new error code
* ggml-virtgpu-backend: initialize all the error codes
* ggml-virtgpu: add a missing comment generated by the code generator
* ggml-virtgpu: add the '[virtgpu]' prefix to the device/buffer names
* ggml-virtgpu: apir_device_buffer_from_ptr: improve the error message
* ggml-virtgpu: shared: make it match the latest api_remoting.h of Virglrenderer APIR
(still unmerged)
* ggml-virtgpu: update the code generator to have dispatch_command_name in a host/guest shared file
* ggml-virtgpu: REMOTE_CALL: fail if the backend returns an error
* docs/backend/VirtGPU.md: indicate that the RAM+VRAM size is limed to 64 GB with libkrun
* ggml-virtgpu: turn off clang-format header ordering for some of the files
Compilation breaks when ordered alphabetically.
* ggml-virtgpu: clang-format
* ggml-virtgpu/backend/shared/api_remoting: better comments for the APIR return codes
* WIP: Add EuroBERT support with autoformatting changes
This commit includes:
- EuroBERT model implementation for GGUF conversion
- C++ backend support for EuroBERT architecture
- Unintended autoformatting changes to Python files
Saving before reverting formatting-only changes.
* feat: add back eos assert when not last token pooling
* feat: removed duplicated code and cleanup
* feat: removed not working architectures and unnecessary check
* fix: typo
* fix: dynamic pooling config
* feat: added an example model for eurobert
* feat: proper llama-vocab implementation for jina-v5
* fix: removed unnecessary comments
* Update build command to build llama-* tools not just ggml-hip
* Update rocWMMA headers to 7.2
* Add GFX1150 target
* Correct library paths for AMD libraries in 26.Q1
* server: fix query params lost when proxying requests in multi-model router mode
* server: re-encode query params using httplib::encode_query_component in proxy
* vulkan: allow using fp16 in scalar flash attention shader
* split rows inside of subgroups for faster synchronization
* use row_split when Br >= 4, change reductions to use shared memory if row_split == 1
* use f32 scalar FA if f16 is not supported by device
* fix amd workgroup size issue
* optimize masksh use
* add medium rows FA shader Br size
* fixes
* add padding to mask shmem buffer
* cache q values into registers for KQ
* fuse lf accumulation, pf and v accumulation into a loop
* stage K loads through shmem
* stage V loads through shmem
* only stage through shmem on Nvidia
* default to Bc 32
* also stage V through shmem when this is done for K
* dynamic subgroups for intel
* use vectorized stores
* use float_type for dequantize4 functions
* use smaller scalar rows size for smaller rows count
* relax flash attention split_k condition to allow non-gqa use
* use minimal subgroup size on Intel
* fix shmem support function
* fix rebase issues
* fixes
* Bc 4 for scalar FA is not a valid configuration
* Use wave32 on AMD RDNA for scalar FA
* add Intel shader core count lookup-table
* fix regressions
* device tuning
* tmpsh size fix
* fix editorconfig
* refactor fa tuning logic into a single place
* fix gqa opt logic
* fix block_rows with small n_rows
* amd tuning
* fix hsk=72/80 issue
* tuning
* allow condition skipping for column check
* use float16 for Of if available
* address feedback
* fix bad RDNA performance on head size <= 128 by limiting occupancy
* allow printing pipeline stats
* cleanup and fixes
* limit occupancy for GCN for small batch FA with large HSK
* disable f16 FA for GCN AMD GPUs on the proprietary driver
* hexagon: refactor set/get/sum-rows ops to use local context
* hexagon: refactor ROPE and Softmax Ops to use local context
Improves performance a bit by precomputing things and saving in the context.
* hexagon: refactor activation ops to use local context struct
* hexagon: refactor unary ops to use local context struct and DMA/VTCM
* hexagon: use aligned hvx_scale function
* hexagon: remove unused fields from op_context
* hexagon: rewrite ROPE to use DMA and VTCM scratchpad
* hex-rope: keep N rows in scratchpad (instead of just two)
* hex-rope: introduce rowidx cache
* hex-rope: remove unused fields
* hex-rope: rewrite dma prefetch logic to allow for multi-row fetch/compute
also removes the need for fastdiv.
* hex-rope: minor formatting
* hex-rope: use indices and unroll the loops
* hex-rope: more updates to cleanup rope-block handling
* hexagon: cleanup supported type/dims checks
* hexagon: all reduce funcs replicated across lanes
There is no need to explicitly replicate the first value.
* snapdragon: update adb and windows scripts to use ubatch-size 256
Updated Ops support handles larger ubatches.
This commit replaces/merges the inspect-org-model.py script with the
contents tensor-info.py script. The merged script has also been updated
to also print tensor sizes which was the only thing that was not done
before (by tensor-info.py that is).
The motivation for this is that tensor-info.py does not load the tensor
weights which can be time consuming for larger models. And also now that
both are doing almost the same thing it makes sense to just have one and
not two scripts to maintain.
* llama : remove write/read of output ids/logits/embeddings
This commit removes the write/read of output ids, logits and
embeddings from the llama context state.
Refs: https://github.com/ggml-org/llama.cpp/pull/18862#issuecomment-3756330941
* completion : add replying of session state
This commit updates the session handing in the completion tool to handle
the that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.
* common : add common_prompt_batch_decode function
This commit adds a new function which is responsible for decoding prompt
and optionally handle the saving for session data.
* update save-state.cpp to use llama_state_load_file
This commit updates the save-load-state example to utilize the new
llama_state_load_file function for loading the model state from a file.
And it also replays the last token after loading since this state is now
stored before the last token is processed.
* examples : set n_seq_max = 2 for ctx3
This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.
The motivation for this change is that using 1 as n_parallel/n_seq_max
the context only supports one sequence, but the test laster tries to
use a second sequence which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.