Commit Graph

6439 Commits

Author SHA1 Message Date
Aaron Teo a1912c7fa9
devops: fix copying process
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:07:59 +08:00
Aaron Teo 03e642a9d1
devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:05:43 +08:00
Aaron Teo 0084c88929
devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:52:43 +08:00
Aaron Teo 73679520ce
devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:51:20 +08:00
Aaron Teo bff187d717
Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af84.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:47:02 +08:00
Aaron Teo 0a7664af84
devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:40:27 +08:00
Aaron Teo 244d6cf56f
devops: update debian target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:29:00 +08:00
Aaron Teo 17a9985086
devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:24:23 +08:00
Aaron Teo 489e0ab54f
devops: fix typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:19:30 +08:00
Aaron Teo a0b22c8a29
devops: add cli target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:14:33 +08:00
Aaron Teo f6baab6be8
devops: finalise hardened server stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:59:53 +08:00
Aaron Teo 10714efb6d
devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:54:06 +08:00
Aaron Teo ab79c0bb80
devops: remove shared object move step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:45:17 +08:00
Aaron Teo 944ef7f0bc
devops: fix missing ggml shared object
the missing shared object caused a failure to load the model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:38:05 +08:00
Aaron Teo b23e72e1d0
devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:19:35 +08:00
Aaron Teo 451aceb9a0
devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:16:49 +08:00
Aaron Teo c3ab7855fd
devops: fix permission issue
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:43:59 +08:00
Aaron Teo 7027c14d3c
devops: fix missing stage ref
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:35:29 +08:00
Aaron Teo 74767bbc16
devops: add collector stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:34:47 +08:00
Aaron Teo 3a09c656a7
devops: fix shared libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:25:01 +08:00
Aaron Teo 28b41f73ed
devops: use correct libs path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 02:59:06 +08:00
Aaron Teo 2ff6694a0f
devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:31:58 +08:00
Aaron Teo a070157511
devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:16:32 +08:00
Aaron Teo 23d34f9a98
devops: remove apt clean steps as distroless does not include apt
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:57:48 +08:00
Aaron Teo e172b00445
devops: add server build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:50:10 +08:00
Aaron Teo e53e1c450c
devops: copy more tools
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:36:41 +08:00
Aaron Teo ce7bd1955d
devops: rework s390x docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:19:41 +08:00
Aaron Teo 955c426620
devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:56:07 +08:00
Aaron Teo 75846921d8
devops: add missing ninja
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:03:38 +08:00
Aaron Teo bdcbcaeead
devops: add s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 13:59:54 +08:00
Jeff Bolz d413dca003
tests: large sizes for get_rows (#15687) 2025-09-07 23:23:41 -05:00
Chenguang Li 85ca66a746
CANN: Stream sync between devices for acl_graph (#15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-08 10:03:29 +08:00
Jeff Bolz 3976dfbe00
vulkan: support im2col_3d (#15795) 2025-09-07 13:50:26 -05:00
Aaron Teo d36e61c580
ggml-cpu: clean up s390x SIMD (#15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 02:18:28 +08:00
Jeff Bolz c97b5e5854
vulkan: Support pad_ext (#15794) 2025-09-07 19:00:49 +02:00
Jeff Bolz 267e99867f
vulkan: Use larger loads in scalar/coopmat1 matmul (#15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-07 18:53:07 +02:00
Daniel Bevenius 3b15924d71
ggml WebGPU: remove userdata from request adapter callback (#15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
2025-09-07 11:19:45 +03:00
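To make the refactor above concrete, here is a minimal, self-contained C++ sketch of the same pattern; all names are hypothetical stand-ins, not the actual Dawn/WebGPU or ggml-webgpu API.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Hypothetical stand-ins for the real types and entry point; these names are
// illustrative, not the actual Dawn/WebGPU or ggml-webgpu API.
struct webgpu_context_t { bool adapter_ready = false; };
using webgpu_context = std::shared_ptr<webgpu_context_t>;
struct Adapter {};

// Hypothetical async entry point: invokes the callback once an adapter exists.
static void request_adapter(const std::function<void(Adapter)> & callback) {
    callback(Adapter{});
}

int main() {
    auto ctx = std::make_shared<webgpu_context_t>();

    // Instead of passing a free function plus a void * userdata that must be
    // cast back to the context, the lambda captures the context directly and
    // is written inline at the call site.
    request_adapter([ctx](Adapter /*adapter*/) {
        ctx->adapter_ready = true;
    });

    std::cout << "adapter ready: " << ctx->adapter_ready << std::endl;
    return 0;
}
```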
Johannes Gäßler 79bc429262
CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769) 2025-09-07 00:26:28 +02:00
Charles Xu c4df49a42d
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (#15817) 2025-09-06 22:08:43 +08:00
Xuan-Son Nguyen 3c3635d2f2
server : speed up tests (#15836)
* server : speed up tests

* clean up

* restore timeout_seconds in some places

* flake8

* explicit offline
2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen 61bdfd5298
server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-06 13:35:04 +02:00
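A hedged illustration of the client side of the feature above, assuming the server's OpenAI-compatible chat completions endpoint and nlohmann::json for building the body; only `stream` and `return_progress` come from the commit (which also reports `timings.cache_n` and `progress.time_ms` in the streamed response), the rest is a generic example.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

int main() {
    // Hypothetical request body for a streamed chat completion that opts in
    // to the progress report; apart from "stream" and "return_progress",
    // the fields are just a generic chat example.
    json req;
    req["stream"]          = true;
    req["return_progress"] = true;

    json msg;
    msg["role"]    = "user";
    msg["content"] = "Hello";
    req["messages"] = json::array();
    req["messages"].push_back(msg);

    std::cout << req.dump(2) << std::endl;
    return 0;
}
```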
Johannes Gäßler 01806e7771
ggml-cpu: document use of "free" memory [no ci] (#15834) 2025-09-06 13:28:44 +02:00
Aaron Teo 186415d595
ggml-cpu: drop support for nnpa intrinsics (#15821) 2025-09-06 11:27:28 +08:00
Gabe Goodhart fd621880f3
aLoRA Support (#15327)
* feat: Add python-side constants and conversion for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse invocation string for adapters from GGUF

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(python): Update conversion to alora_invocation_tokens

This is the preferred method in PEFT, which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(cpp): Update to alora_invocation_tokens on c++ side

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add C APIs to get alora invocation token array from lora

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Initial implementation of alora cache logic in server

This does not yet identify the invocation tokens and apply the lora
adapter only afterwards, but it does seem to produce correct results if
the invocation tokens are at the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Identify alora invocation sequences

This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Only reuse cache for tokens before the alora invocation start

This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Handle un-cached tokens that come before the alora activation

The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use || instead of 'or'

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one for limiting cached tokens to before alora start

This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove duplicate logging

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-05 17:32:39 -06:00
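The cache-reuse rule described in the commit above can be sketched in a few lines. This is a hypothetical helper for illustration only, not the server's actual data structures: find the invocation token sequence in the prompt and limit cache reuse to the tokens before it starts.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

using llama_token = int32_t;

// Hypothetical helper (not the actual server code): locate the alora
// invocation token sequence inside the prompt and cap how many cached
// tokens may be reused, so everything from the invocation start onwards
// is recomputed with the adapter enabled.
static size_t alora_cache_limit(const std::vector<llama_token> & prompt,
                                const std::vector<llama_token> & invocation,
                                size_t n_cached) {
    const auto it = std::search(prompt.begin(), prompt.end(),
                                invocation.begin(), invocation.end());
    if (it == prompt.end()) {
        return n_cached; // no invocation sequence: plain base-model prompt
    }
    const size_t invocation_start = (size_t) std::distance(prompt.begin(), it);
    // only tokens strictly before the invocation start may come from cache
    return std::min(n_cached, invocation_start);
}

int main() {
    const std::vector<llama_token> prompt     = {1, 7, 7, 9, 42, 43, 5, 6};
    const std::vector<llama_token> invocation = {42, 43};

    // 6 tokens were cached by a run without the adapter, but only the 4
    // before the invocation start are safe to reuse
    std::cout << alora_cache_limit(prompt, invocation, 6) << std::endl; // 4
    return 0;
}
```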
Sigbjørn Skjæret 4281c7b315
ci : exempt correct research label (#15825) 2025-09-06 01:21:15 +02:00
Gabe Goodhart 5fac79cbc7
Thinking model disabled assistant prefill (#15404)
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since
the code in `chat.cpp` translates inputs.enable_thinking to the right
thinking kwarg for the given model, it is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-05 14:31:24 -06:00
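A minimal sketch of the parsing rule described above for `enable_thinking` (a boolean always overrides the default, strings are rejected rather than guessed at), using nlohmann::json as the server does; the helper name and the exact error handling are assumptions.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <stdexcept>

using json = nlohmann::json;

// Hypothetical helper sketching the rule above (not the actual server code):
// a boolean "enable_thinking" in chat_template_kwargs always overrides the
// default, while strings and other types are rejected instead of guessing
// at truthy/falsy spellings.
static bool parse_enable_thinking(const json & kwargs, bool default_value) {
    if (!kwargs.contains("enable_thinking")) {
        return default_value;
    }
    const json & v = kwargs.at("enable_thinking");
    if (v.is_boolean()) {
        return v.get<bool>();
    }
    throw std::invalid_argument("\"enable_thinking\" must be a boolean");
}

int main() {
    const json kwargs = json::parse(R"({"enable_thinking": false})");
    std::cout << parse_enable_thinking(kwargs, /*default_value=*/true) << std::endl; // 0
    return 0;
}
```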
Eric Curtin 408ff524b4
Implement --log-colors with always/never/auto (#15792)
With auto by default

Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
2025-09-05 19:43:59 +01:00
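A minimal sketch of the always/never/auto policy from the commit above, assuming POSIX `isatty` for the auto case; the names are illustrative, not the actual llama.cpp logging code.

```cpp
#include <cstdio>
#include <string>
#include <unistd.h> // isatty, fileno (POSIX)

// Illustrative names only, not the actual llama.cpp logging code.
enum class log_colors { AUTO, ALWAYS, NEVER };

static log_colors parse_log_colors(const std::string & arg) {
    if (arg == "always") return log_colors::ALWAYS;
    if (arg == "never")  return log_colors::NEVER;
    return log_colors::AUTO; // default, per the commit message
}

// "auto" enables colors only when the output stream is a terminal.
static bool use_colors(log_colors mode, FILE * stream) {
    switch (mode) {
        case log_colors::ALWAYS: return true;
        case log_colors::NEVER:  return false;
        case log_colors::AUTO:   return isatty(fileno(stream)) != 0;
    }
    return false;
}

int main(int argc, char ** argv) {
    const log_colors mode = parse_log_colors(argc > 1 ? argv[1] : "auto");
    std::printf("colors enabled: %d\n", use_colors(mode, stdout));
    return 0;
}
```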
Johannes Gäßler 5143fa895e
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (#15802)
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
2025-09-05 16:07:02 +02:00
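For context, "fastdiv" replaces runtime integer division by a fixed divisor with a precomputed multiply-high and shift. The sketch below shows one common Granlund–Montgomery-style formulation as host-side C++; the exact variant used in the CUDA kernels may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// One common "fastdiv" formulation (Granlund–Montgomery style). The divisor's
// magic multiplier and shift are computed once; hot loops then replace the
// divide with a multiply-high and a shift.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

static fastdiv_vals fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t{1} << l) < d) {
        ++l;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
    return {mp, l};
}

static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32; // multiply-high
    return (uint32_t) ((hi + n) >> f.l);
}

int main() {
    const fastdiv_vals f = fastdiv_init(7);
    for (uint32_t n : {0u, 6u, 7u, 48u, 49u, 1000000u}) {
        assert(fastdiv(n, f) == n / 7);
    }
    std::printf("ok\n");
    return 0;
}
```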
Daniel Bevenius 3a550b5ca4
tests : add --list-ops and --show-coverage options (#15745)
This commit adds two new command-line options to test-backend-ops.cpp
that allow users to list all available GGML operations and to show test
coverage of these operations.

The motivation for this is that it can be useful to quickly see which
operations are currently covered by tests and which are not. It might
also be useful when using the `support` mode.
2025-09-05 13:49:21 +01:00
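A hypothetical illustration of what such a coverage summary boils down to (not the actual test-backend-ops.cpp code): compare the set of ops exercised by tests against the set of all ops and report the gap.

```cpp
#include <cstdio>
#include <set>
#include <string>

int main() {
    // Illustrative data only; the real tool enumerates GGML ops and test cases.
    const std::set<std::string> all_ops    = {"ADD", "MUL_MAT", "IM2COL_3D", "PAD_EXT", "GET_ROWS"};
    const std::set<std::string> tested_ops = {"ADD", "MUL_MAT", "GET_ROWS"};

    for (const std::string & op : all_ops) {
        if (tested_ops.count(op) == 0) {
            std::printf("not covered: %s\n", op.c_str());
        }
    }
    std::printf("coverage: %zu/%zu (%.1f%%)\n",
                tested_ops.size(), all_ops.size(),
                100.0 * (double) tested_ops.size() / (double) all_ops.size());
    return 0;
}
```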
Erik Scholz a81283820a
gguf: gguf_writer refactor (#15691)
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
2025-09-05 11:34:28 +02:00
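A hedged sketch of the base/buffer/file split described in the bullets above, with illustrative names rather than the actual gguf writer classes: one interface for emitting bytes, an in-memory implementation, and a file-backed one that streams data out instead of buffering the whole file first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative names only, not the actual gguf writer classes.
struct writer_base {
    virtual ~writer_base() = default;
    virtual void write(const void * data, size_t size) = 0;
};

// In-memory implementation: accumulates everything in a byte buffer.
struct writer_buf : writer_base {
    std::vector<uint8_t> buf;
    void write(const void * data, size_t size) override {
        const uint8_t * p = (const uint8_t *) data;
        buf.insert(buf.end(), p, p + size);
    }
};

// File-backed implementation: bytes go straight to disk, so the whole
// file never has to live in memory.
struct writer_file : writer_base {
    FILE * f;
    explicit writer_file(FILE * f) : f(f) {}
    void write(const void * data, size_t size) override {
        fwrite(data, 1, size, f);
    }
};

// Templated write-out: the same serialization code works with either writer.
template <typename T>
void write_pod(writer_base & w, const T & value) {
    w.write(&value, sizeof(value));
}

int main() {
    writer_buf wb;
    write_pod(wb, uint32_t{0x46554747}); // "GGUF" magic, little-endian
    std::printf("buffered %zu bytes\n", wb.buf.size());
    return 0;
}
```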