7317 Commits

Author SHA1 Message Date
Isaac McFadyen e0539eb6ae
webui: switch to hash-based routing (alternative of #16079) (#16157)
* Switched web UI to hash-based routing

* Added hash to missed goto function call

* Removed outdated SPA handling code

* Fixed broken sidebar home link
2025-09-26 18:36:48 +03:00
Aleksander Grygier 5d0a40f390
Always show message actions for mobile UI + improvements for user message sizing (#16076) 2025-09-26 15:59:07 +02:00
Radoslav Gerganov d12a983659
codeowners : add rgerganov as owner of RPC [no ci] (#16279) 2025-09-26 16:09:34 +03:00
Aleksei Nikiforov cc1cfa277b
mtmd : fix uninitialized variable in bicubic_resize (#16275)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-26 15:00:44 +02:00
Georgi Gerganov 54dbc37053
metal : report OOM errors (#16274) 2025-09-26 14:14:28 +03:00
Adrien Gallouët b995a10760
common : use cpp-httplib as a cURL alternative for downloads (#16185)
* vendor : update httplib

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* common : use cpp-httplib as a cURL alternative for downloads

The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* ggml : Bump to Windows 10

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-26 14:12:19 +03:00
Adrien Gallouët 4710dd31bb
build : fix build-ios-device (#16257)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-26 13:39:35 +03:00
Aaron Teo 9b26511857
ggml-cpu: implement MXFP4 SIMD for s390x (#16193)
* ggml-cpu: impl mxfp4 s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing s = sumf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix incorrect kval_mxfp4 type

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rework mxfp4

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: missing delta calc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix typo for vec_splats

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: expand to 2 blocks per loop

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: add unroll to boost perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: back to 1 block per loop to test perf

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "ggml-cpu: back to 1 block per loop to test perf"

This reverts commit 1fe55724e2.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: rm unroll from single block

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-26 13:27:25 +03:00
Radoslav Gerganov 00217cd413
ci : create git tags for released docker images (#16008)
* ci : create git tags for released docker images

When releasing a docker image for build number X, we should also create
the corresponding git tag. This allows users to easily check out the
corresponding source tree for a given docker image.

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update .github/workflows/docker.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-26 10:19:23 +00:00
Daniel Bevenius 3b337b01a1
codeowners : add danbev as owner of build-xcframework.sh [no ci] (#16268) 2025-09-26 08:53:36 +03:00
R0CKSTAR a86a580a66
musa: upgrade musa sdk to 4.3.0 (#16240)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-26 02:56:38 +02:00
R0CKSTAR 0f7c69689f
musa: fix build warnings (#15611)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-09-26 02:56:10 +02:00
Sigbjørn Skjæret 835b2b915c
model : add GroveMoE support (#15510)
* add GroveMoE support

* remove constexpr that fails on certain compilers

* revert crude scalar div implementation, use cast

* build_attn_inp_kv_unified -> build_attn_inp_kv

* fix build_attn

* re-apply ffn_exps regex changes
2025-09-25 19:50:28 +02:00
Aaron Teo b05a9d650f
vendors: update miniaudio version (#16212)
* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* vendor: update miniaudio.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 23:38:10 +08:00
rtaluyev 27052978e4
readme : update bindings (#16144)
Link to Java JNA bindings to llama.cpp native libraries
2025-09-25 18:20:34 +03:00
Aman Gupta 077c94d0ca
CUDA: add a fused top-K MoE kernel (#16130)
* CUDA: add a fused top-K MoE kernel

This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory

It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models

* Refactor into ggml_cuda_should_use_topk_moe

* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before

* Review: format + micro-optimizations

* Fix bug: fix tie breakers

* Add optional norm + clean-up code

* Use smem for final write

* Add bounds check

* Use better memory pattern for writeback
2025-09-25 16:35:05 +02:00
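As a reading aid for the three steps listed in #16130 above, here is a minimal single-token reference in plain Python. It is only a sketch of the softmax -> top-k selection order described in the commit message, not the CUDA kernel itself. The function name `topk_moe_reference` and the lowest-index tie-breaking rule are assumptions made for illustration.

```python
import math

def topk_moe_reference(logits, n_experts_used):
    """Return (weights, ids) for one token: softmax, then repeated argmax."""
    # Step 1: numerically stable softmax over all expert logits of this token.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Step 2: pick the top-k experts by taking the argmax k times.
    weights, ids = [], []
    taken = set()
    for _ in range(n_experts_used):
        # Ties go to the lowest expert index (an assumption of this sketch).
        best = max((i for i in range(len(probs)) if i not in taken),
                   key=lambda i: (probs[i], -i))
        taken.add(best)
        # Step 3: record the weight and the expert id (the "write" step).
        ids.append(best)
        weights.append(probs[best])
    return weights, ids

weights, ids = topk_moe_reference([1.0, 3.0, 2.0, 3.0], n_experts_used=2)
```

The returned weights are the raw softmax probabilities of the selected experts. The optional normalization mentioned later in the commit message is not modelled here.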
Daniel Bevenius aa3ee0eb0b
model-conversion : add embedding prompt file support (#15871)
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates logits.cpp to print out
embedding information in the same format as when running the original
embedding model.

The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.

This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
2025-09-25 12:02:36 +02:00
Daniel Bevenius d0991da39d
server : add support for external server for tests (#16243)
This commit adds support for using an externally started llama-server
instance for the server tests. This can be enabled by setting the
DEBUG_EXTERNAL environment variable.

The motivation for this is to allow debugging of the server itself
when investigating a test failure. Instructions for how to do this are
added to the README.md file in the tests directory.
2025-09-25 11:36:47 +02:00
junchao-zhao aa719c2f88
ggml : fix loongarch lsx compilation error (#15864) 2025-09-25 12:22:55 +03:00
Johannes Gäßler 4cdd0bb453
docs: fix typo [no ci] (#16244) 2025-09-25 12:12:27 +03:00
Douglas Hanley b5bd037832
llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Georgi Gerganov dfcd53f7ec
metal : fuse NORM + MUL + ADD, support non-multiples of 4 (#16220)
* metal : fuse NORM + MUL + ADD

* metal : support norms of non-multiple of 4

* cont : fix comment [no ci]
2025-09-25 11:30:16 +03:00
Georgi Gerganov 4ea00794b8
metal : relax reorder conditions (#16216) 2025-09-25 11:29:42 +03:00
Georgi Gerganov 02a6a82ae7
metal : restore im2col perf (#16219) 2025-09-25 11:29:08 +03:00
Radoslav Gerganov c498fc82fe
rpc : use ggml logging facilities
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
2025-09-25 07:20:02 +00:00
Aaron Teo e7a5130a20
codeowners: add ownership of zdnn backend [no ci] (#16232)
add @Andreas-Krebbel to owners of zDNN backend

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 08:06:30 +03:00
Eve bee378e098
ci: run the x64 and arm ci on the github machines instead (#16183)
* run the x64 ci on regular machines

* set up the same thing for arm

fix test-quantize-perf just like #12306

* try to disable sve

* add another sve run
2025-09-25 08:06:06 +03:00
Aaron Teo 5fb557653b
devops: fix s390x docker release failure (#16231) 2025-09-25 11:36:30 +08:00
Aaron Teo 4ae88d07d0
codeowners: add ownership of zdnn backend [no ci] (#16229)
add @AlekseiNikiforovIBM to owners of zDNN backend

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-25 00:25:04 +08:00
Johannes Gäßler e789095502
llama: print memory breakdown on exit (#15860)
* llama: print memory breakdown on exit
2025-09-24 16:53:48 +02:00
Acly f2a789e334
ggml : split graph allocations according to backend max buffer size (#15815)
* ggml : make gallocr respect the backend's max buffer size

* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface

* fix missing newline, apple-clang warning

* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.

* track (chunk, offset) pairs instead of "global" offsets through gallocr.

* simpler, don't need loops to map between local/global offsets
* touches more code

* fix dyn_tallocr_max_size and initialization

* fix memory leak when buffers are reused due to same buffer type appearing multiple times

* make vbuffer allocation follow the same logic as backend_buffer did before

* continue to use leftover unallocated space of previous chunks after a new one has been created

* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size

* refactor: move adding new free block and new chunk into separate functions

* allocate chunks individually with a separate free-blocks list for each one

* needs a bit more memory/allocations/indirections, but code is simpler

* fix warnings (missing static) & debug checks
2025-09-24 16:17:49 +02:00
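The chunking idea in #15815 above (track `(chunk, offset)` pairs, reuse leftover space in earlier chunks, create chunks as needed) can be sketched roughly as follows. This is a simplified illustration in Python, not the gallocr code. It ignores alignment and freeing, and the name `place_tensors` and its parameters are hypothetical.

```python
def place_tensors(sizes, max_chunk_size):
    """Assign each tensor a (chunk, offset) pair, opening chunks on demand."""
    chunk_used = []   # bytes already handed out in each chunk
    placements = []   # one (chunk index, offset) pair per tensor
    for size in sizes:
        for chunk, used in enumerate(chunk_used):
            # Reuse leftover space in an earlier chunk before opening a new one.
            if used + size <= max_chunk_size:
                placements.append((chunk, used))
                chunk_used[chunk] = used + size
                break
        else:
            # No existing chunk fits: start a new one. A tensor larger than
            # max_chunk_size still gets a chunk of its own, which therefore
            # exceeds the limit (a simplification chosen for this sketch).
            chunk_used.append(size)
            placements.append((len(chunk_used) - 1, 0))
    return placements, chunk_used

placements, chunk_used = place_tensors([60, 50, 30, 120], max_chunk_size=100)
```

With these inputs the third tensor lands in the leftover space of chunk 0, and the oversized fourth tensor gets a chunk of its own.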
nullname 3994a9b7df
feat: perf opt dma (#56)
* Add power management utilities to NPU device context and update DCVS settings

* Update DCVS settings in power_utils to use v3 API and enhance power management

* wip

* Enhance dequantization functions by adding load_dequant_table support and updating signatures for improved performance

* use lut

* wip

* fix test failure

* wip

* Refactor load_qual_block_generic to improve block handling and optimize vector operations

* Enhance load_dual_block_generic and load_qual_block_generic to accept a mask parameter for improved block handling

* Refactor flash_attn_impl to optimize mask l2 prefetch

* wip

* wip

* wip

* wip

* add log

* link against shared libraries instead of static ones

* fix swiglu

* wip

* refactor expf_fix to handle overflow for different data types

* enhance is_glu_op_supported to validate shapes for multiple sources

* wip

* refactor logging macros to use hexagon namespace and improve formatting

* fix printf format error

* wip

* refactor: update static_assert messages for block size validation and add HVX_VectorPred_x3 type alias

* rename

* feat: enhance fa with mask

* wip

* wip

* refactor: replace instances of Q6_V_vzero() with kZeroV for consistency

* wip

* wip

* wip

* fix: improve address alignment check in HVX_Vector handling

* refactor: streamline vector dot product implementations for improved readability

* refactor: q4k add hvx intrinsic impl

* refactor: enhance dequantize_row_q4_K for clarity and performance

* refactor: optimize scale mask usage in dequantization functions for improved performance

* refactor: optimize dequantize_row_q4_K for intrinsic usage and performance improvements

* refactor: move GLU operation implementation into separated file

* sync after swiglu

* wip

* wip

* wip

* feat: increase prc main thread stack size

* fix: replace hardcoded stack size with NPU_THREAD_STACK_SIZE constant

* wip

* feat: add optimized vector operations for exponential and division with overflow handling

* wip

* feat: refactor exponential function to handle overflow and underflow with improved logic

* wip

* wip

* feat: add vector loading and scaling functions for improved performance in block processing

* wip

* feat: optimize block loading by refactoring scale index handling for improved performance

* use Q6_Vb_vlut32_VbVbR_nomatch instead

* feat: enhance scale loading by adding static assertion and restructuring block handling

* wip

* feat: refactor vec_dot_product_mixed_impl for improved clarity and performance

* wip

* feat: simplify vector loading functions and improve alignment handling

* wip

* feat: enhance scale loading mask with quantization block size validation

* wip

* feat: implement make_scale_load_mask function and refactor vector handling in vec_ops

* feat: enhance load_dual_block_generic to include scale indices for improved vector loading

* revert q8 dequant

* wip

* feat: optimize dequantization functions by removing unnecessary masking and updating lookup methods

* wip

* wip

* add qurt_mutex

* Add DMA transfer class and integrate into thread pool

* Enhance DMA transfer functionality by adding support for multiple descriptors and initiating transfers in parallel

* fix dma crash

* fix failed unit tests

* wip

* use alignas

* Improve DMA transfer error handling and update descriptor completion check

* Fix VTCM cache size calculation in element-wise operations

* Add cache clean operations before DMA transfers in element-wise operations

* reduce cache clean operations

* Refactor DMA transfer functions to support 1D operations and rename for clarity

* Enhance DMA transfer functionality by adding 2D submission support and improving descriptor initialization

* Update read buffer method to support forced invalidation and remove unnecessary invalidation calls in element-wise operations

* wip

* Improve DMA transfer handling in mul_mat_gemv_impl by replacing memcpy with initiate_dma_row_transfer and adding wait_for_dma logic

* fix 2d dma

* feat: add DMA plane cache

* rename

* wip

* use memcpy for debug

* fix cache plane calc

* refactor: remove debug logging from mul_mat_impl and optimize cache handling

* rename

* fix 2d dma type

* refactor: enhance DMA transfer handling in mul_mat_gemv_impl and wait functions

* refactor: optimize DMA transfer handling in mul_mat_gemv_impl and wait functions

* wip

* wip

* move op impl into sub dir

* add log

* fix: correct pointer usage in mul_mat_gemv_impl for next plane access

* fix: improve DMA transfer error handling in mul_mat_impl and mul_mat_gemv_impl

* fix: fix crash by using the entire row bytes

* wip

* wip

* fix: prevent parallelization for scalar src1 in is_mul_mat_supported

* fix: add dimension checks for 2D DMA transfers and fallback to 1D if necessary

* wip

* fix: enable thread barrier for mul multiplication operations

* feat: add synchronization checks for tensor operations and update related functions

* wip

* fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations

* Revert "fix: remove invalidation flag from get_read_buffer calls in element-wise and matrix multiplication operations"

This reverts commit af3441e67e706b2e5122369dc160353796867dd3.

* wip

* wip

* add comment

* wip
2025-09-24 21:40:17 +08:00
Tarek Dakhran 3a59971967
model : add label for LiquidAI LFM2-2.6B model (#16204)
* model : add label for LiquidAI LFM2-2.6B model

HF link: [LiquidAI/LFM2-2.6B](https://huggingface.co/LiquidAI/LFM2-2.6B).

Support for GGUF conversion and inference is added in #14620.

However, due to similar `n_embd`, it identifies as a 1.2B model.
Fix the label by using `n_ff` to identify the model instead.

Output of `llama-bench`:
```
| model                          |       size |     params | backend    | threads |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | --------------: | -------------------: |
| lfm2 1.2B F16                  |   2.18 GiB |     1.17 B | CPU        |      10 |           pp512 |        223.97 ± 5.32 |
| lfm2 2.6B F16                  |   4.79 GiB |     2.57 B | CPU        |      10 |           pp512 |         92.53 ± 4.14 |
| lfm2 350M F16                  | 676.25 MiB |   354.48 M | CPU        |      10 |           pp512 |       725.52 ± 11.70 |
| lfm2 700M F16                  |   1.38 GiB |   742.49 M | CPU        |      10 |           pp512 |       336.22 ± 12.93 |
```

* Update src/llama-model.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-24 13:42:26 +02:00
Jie Fu (傅杰) 63b54c81a6
model-conversion : make causal-verify-logits fails with model names containing "." (#16215)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 10:25:26 +02:00
Uilian Ries 152729f884
common : add missing chrono header for common.cpp (#16211)
Signed-off-by: Uilian Ries <uilianries@gmail.com>
2025-09-24 09:53:47 +03:00
Sigbjørn Skjæret c0c59c1157
codeowners : match all requirements files (#16214) 2025-09-24 08:53:20 +02:00
Jie Fu (傅杰) 7735706b93
model-conversion : run-org-model.py fails to run on mac m1 (#16213)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 08:46:52 +02:00
Daniel Bevenius 4d9ea03d17
codeowners : use slash prefix for root files [no ci] (#16210)
This commit adds a leading slash to the paths of root-level files
in the CODEOWNERS file.

The motivation for this is that these patterns might otherwise also match
files in subdirectories that have other/additional owners and override them.

Refs: https://github.com/ggml-org/llama.cpp/pull/16209#issuecomment-3326434274
2025-09-24 08:10:09 +02:00
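To illustrate the motivation given in #16210 above, the sketch below models a heavily simplified form of CODEOWNERS matching in Python: a pattern without a leading slash is compared against the file name at any depth, a pattern with a leading slash is compared against the full path from the repository root, and the last matching rule wins. The rules, paths, and owner handles are hypothetical, and real CODEOWNERS matching supports more pattern forms than this.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath

def matches(pattern, path):
    """Simplified matching for plain file patterns (no directory rules)."""
    if pattern.startswith("/"):
        # Anchored: compare against the full path from the repository root.
        return fnmatchcase(path, pattern[1:])
    # Unanchored: compare against the file name at any depth.
    return fnmatchcase(PurePosixPath(path).name, pattern)

def owners_for(rules, path):
    """Return the owners of the last rule that matches, or None."""
    owners = None
    for pattern, rule_owners in rules:
        if matches(pattern, path):
            owners = rule_owners  # later rules override earlier ones
    return owners

# Hypothetical rules: the only difference is the leading slash on README.md.
before = [("/examples/demo/README.md", ["@demo-owner"]),
          ("README.md", ["@root-owner"])]
after = [("/examples/demo/README.md", ["@demo-owner"]),
         ("/README.md", ["@root-owner"])]

overridden = owners_for(before, "examples/demo/README.md")
preserved = owners_for(after, "examples/demo/README.md")
```

Without the slash, the root-level rule also captures the nested file and replaces its owner. With the slash, the nested file keeps its own owner.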
Jie Fu (傅杰) 8ba548dae2
model-conversion : fix the make targets in the README.md (#16209)
Fix two incorrect make targets in the readme.

Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 06:19:23 +02:00
Georgi Gerganov f505bd83ca
ci : disable AMD workflows + update NVIDIA workflows (#16200)
* ci : disable AMD workflows + update NVIDIA workflows

* cont : fixes

* cont : update nvidia vulkan workflows
2025-09-23 20:41:40 +03:00
Georgi Gerganov 0889589dbe
ci : enable Vulkan workflow on Mac (#16194) 2025-09-23 13:44:25 +03:00
Xiangyan Sun 4e29084ba4
ggml-cpu: Respect cpumask settings (#16164) 2025-09-23 11:58:12 +03:00
Sigbjørn Skjæret f6b4af3d04
ggml : fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl (#15928)
* fix uninitialized is_on_grid in quantize_row_iq3_xxs_impl

* change initialization to true
2025-09-23 10:25:20 +02:00
Aaron Teo 264f1b5187
zdnn: refactor codebase + add docs (#16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 14:53:05 +08:00
Daniel Bevenius 0bc7cc7154
codeowners : add @danbev to model-conversion example [no ci] (#16190)
This commit adds examples/model-conversion/ to the CODEOWNERS file and
assigns myself (@danbev) as the code owner for this directory.
2025-09-23 09:13:22 +03:00
Aaron Teo 4b9f4cb0f8
devops: add s390x containers (#15915)
* devops: add s390x dockerfile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing ninja

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move s390x docker into cpu docker

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: rework s390x docker

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: copy more tools

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add server build step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove apt clean steps as distroless misses it

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove apt commands from distroless

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix shared libs in distroless

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: use correct libs path

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix shared libs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add collector stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing stage ref

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix permission issue

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix unknown model loading failures

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at fixing model loading failure

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing ggml shared object

failure to load model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove move shared objects

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move libggml-cpu and blas into bin

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: finalise hardened server stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add cli target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix typos

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix missing shared libraries in base

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: update debian target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: formalise llama.cpp loc"

This reverts commit 0a7664af84.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: formalise llama.cpp loc

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at fixing missing dir

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempt at making it cache the build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix copying process

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: make build dir an argument

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: make build dir an argument"

This reverts commit 438698976b.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add build stage for gguf-py

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move gguf-py installation into build stage

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: break system packages?

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add rust compiler installer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix rustc not found

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove cache mount to allow rustc to persist

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move rustc installation to another layer

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: move gguf-py installation to full stage, fix copying

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove rustc installation in build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable full target for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: attempting static build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: merge s390x dockerfile into cpu for now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: switch to gcc image for build step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove build essentials

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: install openblas into base target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: go back to s390x dockerfile

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove libggml and libblas

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add full target

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add break system packages

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add libjpeg

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing cmake dep

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: finalise docker images for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add custom openblas patch

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: use libopenblas-dev instead of libopenblas-openmp-dev

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add s390x docker build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 13:59:34 +08:00
Daniel Bevenius 85e72271ba
ggml-cpu : fix typo in gemm comments [no ci] (#16189) 2025-09-23 05:59:03 +02:00
Gabe Goodhart 1d0125bcf1
feat: Add conversion support in GraniteHybrid for non-hybrid (all attn) (#16177)
This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (i.e. Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the C++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.

Branch: gabe-l-hart/GraniteNonHybridConversion

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-22 20:40:10 +02:00
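The coercion described in #16177 above could look roughly like the sketch below. It is an illustration only and is not taken from the conversion script: the function name, the `layer_types` and `num_experts` parameters, the `"attention"`/`"mamba"` layer labels, and the `"granitehybrid"` return value are assumptions. Only the `"granite"` and `"granitemoe"` names come from the commit message.

```python
def resolve_architecture(layer_types, num_experts):
    """Pick an architecture label for a GraniteHybrid-style config."""
    # Any non-attention layer means the model really is hybrid.
    if any(t != "attention" for t in layer_types):
        return "granitehybrid"
    # All-attention configs fall back to the non-hybrid architectures.
    return "granitemoe" if num_experts > 0 else "granite"

dense_arch = resolve_architecture(["attention", "attention"], num_experts=0)
moe_arch = resolve_architecture(["attention", "attention"], num_experts=8)
hybrid_arch = resolve_architecture(["mamba", "attention"], num_experts=0)
```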
Haiyue Wang 351f3da39c
clang-tidy : disable warning about performance enum size (#16127)
Disable 'performance-enum-size' checking:

Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
2025-09-22 19:57:46 +02:00
Sigbjørn Skjæret 3ecb2f671a
ggml : implement set_rows with i32 index (#16159)
* implement set_rows with i32 index

* template fix

* test quantized path

warnings--

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* forgotten name change

* deduplicate cuda/sycl and test-fix

* indent++

* vulkan: support set_rows with i32 index type (#16162)

* disable i32 index for webgpu for now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-09-22 19:13:00 +02:00