Commit Graph

6439 Commits

Author SHA1 Message Date
Aaron Teo a1912c7fa9
devops: fix copying process
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:07:59 +08:00
Aaron Teo 03e642a9d1
devops: attempt at making it cache the build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 18:05:43 +08:00
Aaron Teo 0084c88929
devops: attempt at fixing missing dir
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:52:43 +08:00
Aaron Teo 73679520ce
devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0a7664af84)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:51:20 +08:00
Aaron Teo bff187d717
Revert "devops: formalise llama.cpp loc"
This reverts commit 0a7664af84.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:47:02 +08:00
Aaron Teo 0a7664af84
devops: formalise llama.cpp loc
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:40:27 +08:00
Aaron Teo 244d6cf56f
devops: update debian target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:29:00 +08:00
Aaron Teo 17a9985086
devops: fix missing shared libraries in base
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:24:23 +08:00
Aaron Teo 489e0ab54f
devops: fix typos
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:19:30 +08:00
Aaron Teo a0b22c8a29
devops: add cli target
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 17:14:33 +08:00
Aaron Teo f6baab6be8
devops: finalise hardened server stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:59:53 +08:00
Aaron Teo 10714efb6d
devops: move libggml-cpu and blas into bin
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:54:06 +08:00
Aaron Teo ab79c0bb80
devops: remove shared object move step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:45:17 +08:00
Aaron Teo 944ef7f0bc
devops: fix missing ggml shared object
the missing shared object caused a failure to load the model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:38:05 +08:00
Aaron Teo b23e72e1d0
devops: attempt at fixing model loading failure
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:19:35 +08:00
Aaron Teo 451aceb9a0
devops: fix unknown model loading failures
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 16:16:49 +08:00
Aaron Teo c3ab7855fd
devops: fix permission issue
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:43:59 +08:00
Aaron Teo 7027c14d3c
devops: fix missing stage ref
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:35:29 +08:00
Aaron Teo 74767bbc16
devops: add collector stage
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:34:47 +08:00
Aaron Teo 3a09c656a7
devops: fix shared libs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 15:25:01 +08:00
Aaron Teo 28b41f73ed
devops: use correct libs path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-09 02:59:06 +08:00
Aaron Teo 2ff6694a0f
devops: fix shared libs in distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:31:58 +08:00
Aaron Teo a070157511
devops: remove apt commands from distroless
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 18:16:32 +08:00
Aaron Teo 23d34f9a98
devops: remove apt clean steps as distroless does not include apt
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:57:48 +08:00
Aaron Teo e172b00445
devops: add server build step
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 17:50:10 +08:00
Aaron Teo e53e1c450c
devops: copy more tools
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:36:41 +08:00
Aaron Teo ce7bd1955d
devops: rework s390x docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 15:19:41 +08:00
Aaron Teo 955c426620
devops: move s390x docker into cpu docker
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:56:07 +08:00
Aaron Teo 75846921d8
devops: add missing ninja
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 14:03:38 +08:00
Aaron Teo bdcbcaeead
devops: add s390x dockerfile
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 13:59:54 +08:00
Jeff Bolz d413dca003
tests: large sizes for get_rows (#15687) 2025-09-07 23:23:41 -05:00
Chenguang Li 85ca66a746
CANN: Stream sync between devices for acl_graph (#15809)
* CANN: Switch to stream synchronization

Switch to stream synchronization because events are not effective.

Co-authored-by: hipudding <huafengchun@gmail.com>

* CANN: add Comments

---------

Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-08 10:03:29 +08:00
Jeff Bolz 3976dfbe00
vulkan: support im2col_3d (#15795) 2025-09-07 13:50:26 -05:00
Aaron Teo d36e61c580
ggml-cpu: clean up s390x SIMD (#15855)
* ggml-cpu: clean up s390x simd

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
(cherry picked from commit 0da4b6aa07)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-cpu: fix hsum data types

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-08 02:18:28 +08:00
Jeff Bolz c97b5e5854
vulkan: Support pad_ext (#15794) 2025-09-07 19:00:49 +02:00
Jeff Bolz 267e99867f
vulkan: Use larger loads in scalar/coopmat1 matmul (#15729)
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...

rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
2025-09-07 18:53:07 +02:00
Daniel Bevenius 3b15924d71
ggml WebGPU: remove userdata from request adapter callback (#15527)
* ggml WebGPU: remove userdata from request adapter callback

This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.

The motivation for this change is to simplify the code and improve
readability.

* inline the callback lambda into the RequestAdapter call

This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
2025-09-07 11:19:45 +03:00
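To make the refactor above concrete, here is a minimal, self-contained C++ sketch of the same pattern; all names are hypothetical stand-ins, not the actual Dawn/WebGPU or ggml-webgpu API.

```cpp
#include <functional>
#include <iostream>
#include <memory>

// Hypothetical stand-ins for the real types and entry point; these names are
// illustrative, not the actual Dawn/WebGPU or ggml-webgpu API.
struct webgpu_context_t { bool adapter_ready = false; };
using webgpu_context = std::shared_ptr<webgpu_context_t>;
struct Adapter {};

// Hypothetical async entry point: invokes the callback once an adapter exists.
static void request_adapter(const std::function<void(Adapter)> & callback) {
    callback(Adapter{});
}

int main() {
    auto ctx = std::make_shared<webgpu_context_t>();

    // Instead of passing a free function plus a void * userdata that must be
    // cast back to the context, the lambda captures the context directly and
    // is written inline at the call site.
    request_adapter([ctx](Adapter /*adapter*/) {
        ctx->adapter_ready = true;
    });

    std::cout << "adapter ready: " << ctx->adapter_ready << std::endl;
    return 0;
}
```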
Johannes Gäßler 79bc429262
CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769) 2025-09-07 00:26:28 +02:00
Charles Xu c4df49a42d
kleidiai: generalize compute_forward_kv_cache to compute_forward_fp16 (#15817) 2025-09-06 22:08:43 +08:00
Xuan-Son Nguyen 3c3635d2f2
server : speed up tests (#15836)
* server : speed up tests

* clean up

* restore timeout_seconds in some places

* flake8

* explicit offline
2025-09-06 14:45:24 +02:00
Xuan-Son Nguyen 61bdfd5298
server : implement prompt processing progress report in stream mode (#15827)
* server : implement `return_progress`

* add timings.cache_n

* add progress.time_ms

* add test

* fix test for chat/completions

* readme: add docs on timings

* use ggml_time_us

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-09-06 13:35:04 +02:00
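A hedged illustration of the client side of the feature above, assuming the server's OpenAI-compatible chat completions endpoint and nlohmann::json for building the body; only `stream` and `return_progress` come from the commit (which also reports `timings.cache_n` and `progress.time_ms` in the streamed response), the rest is a generic example.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>

using json = nlohmann::json;

int main() {
    // Hypothetical request body for a streamed chat completion that opts in
    // to the progress report; apart from "stream" and "return_progress",
    // the fields are just a generic chat example.
    json req;
    req["stream"]          = true;
    req["return_progress"] = true;

    json msg;
    msg["role"]    = "user";
    msg["content"] = "Hello";
    req["messages"] = json::array();
    req["messages"].push_back(msg);

    std::cout << req.dump(2) << std::endl;
    return 0;
}
```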
Johannes Gäßler 01806e7771
ggml-cpu: document use of "free" memory [no ci] (#15834) 2025-09-06 13:28:44 +02:00
Aaron Teo 186415d595
ggml-cpu: drop support for nnpa intrinsics (#15821) 2025-09-06 11:27:28 +08:00
Gabe Goodhart fd621880f3
aLoRA Support (#15327)
* feat: Add python-side constants and conversion for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add c++ side constants for adapter.lora.invocation_string

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse invocation string for adapters from GGUF

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(python): Update conversion to alora_invocation_tokens

This is the preferred method in PEFT, which is the source of ground truth

https://github.com/huggingface/peft/pull/2609/files#diff-13380145401d203d5935c5189dd09879f990b81aa63e8e3aaff8ce9110333f0e

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix(cpp): Update to alora_invocation_tokens on c++ side

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add C APIs to get alora invocation token array from lora

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Initial implementation of alora cache logic in server

This does not yet identify the invocation tokens and apply the lora
adapter only afterwards, but it does seem to produce correct results if
the invocation tokens are at the beginning of the uncached input.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Identify alora invocation sequences

This currently limits to a single enabled alora per slot. Multiple aloras
with different invocation sequences would be possible, but it would require
a more complex integration of the adapter toggling and is not really a well
studied case for alora since it's unclear if one alora can reuse cache from
previous prefill computed with a different alora.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Only reuse cache for tokens before the alora invocation start

This is a bit of an edge case, but theoretically a user could try the same
query with the alora disabled (just using the base model), then retry with
the alora. The cached tokens from the first pass should be invalid.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Handle un-cached tokens that come before the alora activation

The solution is to only fill up to the token before the invocation start in
the batch if there are any tokens to be prefilled between those pulled from
cache and the invocation start. When this is detected, the alora is
temporarily disabled with a scale of 0.0, then immediately re-enabled after
it has been initialized for the internal graph. Since the batch does not
complete the prompt tokens, the remaining prompt tokens are handled in the
next task, pulling all of the non-alora tokens from cache and proceeding
with prefill for the alora tokens.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use || instead of 'or'

Too much python 🤦

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix off-by-one for limiting cached tokens to before alora start

This was the cause of the inconsistent results from the dummy test script
with and without the turn that runs the prompt without the adapter before
running it with the adapter.

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Support backwards-compatibility for "invocation_string" in adapter_config.json

While this has been replaced in the PEFT PR in favor of
alora_invocation_tokens, the existing adapters in the ibm-granite org on HF
use "invocation_string," so this will enable backwards compatibility and
enable testing now (before PEFT PR changes have percolated everywhere).

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Remove duplicate logging

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* feat: Report alora_invocation_string and alora_invocation_tokens from /lora-adapters

Branch: gabe-l-hart/alora-support

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-05 17:32:39 -06:00
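The cache-reuse rule described in the commit above can be sketched in a few lines. This is a hypothetical helper for illustration only, not the server's actual data structures: find the invocation token sequence in the prompt and limit cache reuse to the tokens before it starts.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <iterator>
#include <vector>

using llama_token = int32_t;

// Hypothetical helper (not the actual server code): locate the alora
// invocation token sequence inside the prompt and cap how many cached
// tokens may be reused, so everything from the invocation start onwards
// is recomputed with the adapter enabled.
static size_t alora_cache_limit(const std::vector<llama_token> & prompt,
                                const std::vector<llama_token> & invocation,
                                size_t n_cached) {
    const auto it = std::search(prompt.begin(), prompt.end(),
                                invocation.begin(), invocation.end());
    if (it == prompt.end()) {
        return n_cached; // no invocation sequence: plain base-model prompt
    }
    const size_t invocation_start = (size_t) std::distance(prompt.begin(), it);
    // only tokens strictly before the invocation start may come from cache
    return std::min(n_cached, invocation_start);
}

int main() {
    const std::vector<llama_token> prompt     = {1, 7, 7, 9, 42, 43, 5, 6};
    const std::vector<llama_token> invocation = {42, 43};

    // 6 tokens were cached by a run without the adapter, but only the 4
    // before the invocation start are safe to reuse
    std::cout << alora_cache_limit(prompt, invocation, 6) << std::endl; // 4
    return 0;
}
```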
Sigbjørn Skjæret 4281c7b315
ci : exempt correct research label (#15825) 2025-09-06 01:21:15 +02:00
Gabe Goodhart 5fac79cbc7
Thinking model disabled assistant prefill (#15404)
* feat: Set enable_thinking IFF not disabled and supported

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Fix inverted logic condition for prefill error

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Always parse the enable_thinking kwarg to overwrite the default value

From what I can tell, this started as a Qwen3-specific keyword, but since
the code in `chat.cpp` translates inputs.enable_thinking to the right
thinking kwarg for the given model, it is now more of a standardized
kwarg, so it should always override the default value when sent as part of
the chat_template_kwargs field in the API.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Don't limit template expansion check to jinja

With the use_jinja check, non-jinja models would enable thinking and always
fail assistant prefill

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add the error text to json type errors in json_value

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Explicitly reject string values for "enable_thinking"

There are too many possible "truthy" / "falsy" strings and too many
ambiguous strings that don't have a clear truthy/falsy value, so the
simplest thing to do here is to reject the request. Ideally, this would be
a 422 (Unprocessable Entity), but right now it's coming back as a 500.

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Move logic for detecting template enable_thinking support to common

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use raw pointer for common chat template function

Branch: gabe-l-hart/thinking-model-disabled-agent-prefill

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-09-05 14:31:24 -06:00
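A minimal sketch of the parsing rule described above for `enable_thinking` (a boolean always overrides the default, strings are rejected rather than guessed at), using nlohmann::json as the server does; the helper name and the exact error handling are assumptions.

```cpp
#include <nlohmann/json.hpp>
#include <iostream>
#include <stdexcept>

using json = nlohmann::json;

// Hypothetical helper sketching the rule above (not the actual server code):
// a boolean "enable_thinking" in chat_template_kwargs always overrides the
// default, while strings and other types are rejected instead of guessing
// at truthy/falsy spellings.
static bool parse_enable_thinking(const json & kwargs, bool default_value) {
    if (!kwargs.contains("enable_thinking")) {
        return default_value;
    }
    const json & v = kwargs.at("enable_thinking");
    if (v.is_boolean()) {
        return v.get<bool>();
    }
    throw std::invalid_argument("\"enable_thinking\" must be a boolean");
}

int main() {
    const json kwargs = json::parse(R"({"enable_thinking": false})");
    std::cout << parse_enable_thinking(kwargs, /*default_value=*/true) << std::endl; // 0
    return 0;
}
```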
Eric Curtin 408ff524b4
Implement --log-colors with always/never/auto (#15792)
With auto by default

Signed-off-by: Eric Curtin <ericcurtin17@gmail.com>
2025-09-05 19:43:59 +01:00
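A minimal sketch of the always/never/auto policy from the commit above, assuming POSIX `isatty` for the auto case; the names are illustrative, not the actual llama.cpp logging code.

```cpp
#include <cstdio>
#include <string>
#include <unistd.h> // isatty, fileno (POSIX)

// Illustrative names only, not the actual llama.cpp logging code.
enum class log_colors { AUTO, ALWAYS, NEVER };

static log_colors parse_log_colors(const std::string & arg) {
    if (arg == "always") return log_colors::ALWAYS;
    if (arg == "never")  return log_colors::NEVER;
    return log_colors::AUTO; // default, per the commit message
}

// "auto" enables colors only when the output stream is a terminal.
static bool use_colors(log_colors mode, FILE * stream) {
    switch (mode) {
        case log_colors::ALWAYS: return true;
        case log_colors::NEVER:  return false;
        case log_colors::AUTO:   return isatty(fileno(stream)) != 0;
    }
    return false;
}

int main(int argc, char ** argv) {
    const log_colors mode = parse_log_colors(argc > 1 ? argv[1] : "auto");
    std::printf("colors enabled: %d\n", use_colors(mode, stdout));
    return 0;
}
```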
Johannes Gäßler 5143fa895e
CUDA: fastdiv, launch bounds for mmvq + q8_1 quant (#15802)
* CUDA: fastdiv, launch bounds for mmvq + q8_1 quant
2025-09-05 16:07:02 +02:00
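For context, "fastdiv" replaces runtime integer division by a fixed divisor with a precomputed multiply-high and shift. The sketch below shows one common Granlund–Montgomery-style formulation as host-side C++; the exact variant used in the CUDA kernels may differ.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// One common "fastdiv" formulation (Granlund–Montgomery style). The divisor's
// magic multiplier and shift are computed once; hot loops then replace the
// divide with a multiply-high and a shift.
struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, ceil(log2(d))
};

static fastdiv_vals fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t{1} << l) < d) {
        ++l;
    }
    const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
    return {mp, l};
}

static uint32_t fastdiv(uint32_t n, fastdiv_vals f) {
    const uint64_t hi = ((uint64_t) n * f.mp) >> 32; // multiply-high
    return (uint32_t) ((hi + n) >> f.l);
}

int main() {
    const fastdiv_vals f = fastdiv_init(7);
    for (uint32_t n : {0u, 6u, 7u, 48u, 49u, 1000000u}) {
        assert(fastdiv(n, f) == n / 7);
    }
    std::printf("ok\n");
    return 0;
}
```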
Daniel Bevenius 3a550b5ca4
tests : add --list-ops and --show-coverage options (#15745)
This commit adds two new command-line options to test-backend-ops.cpp
that allow users to list all available GGML operations and to show test
coverage of these operations.

The motivation for this is that it can be useful to quickly see which
operations are currently covered by tests and which are not. It might
also be useful when using the `support` mode.
2025-09-05 13:49:21 +01:00
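A hypothetical illustration of what such a coverage summary boils down to (not the actual test-backend-ops.cpp code): compare the set of ops exercised by tests against the set of all ops and report the gap.

```cpp
#include <cstdio>
#include <set>
#include <string>

int main() {
    // Illustrative data only; the real tool enumerates GGML ops and test cases.
    const std::set<std::string> all_ops    = {"ADD", "MUL_MAT", "IM2COL_3D", "PAD_EXT", "GET_ROWS"};
    const std::set<std::string> tested_ops = {"ADD", "MUL_MAT", "GET_ROWS"};

    for (const std::string & op : all_ops) {
        if (tested_ops.count(op) == 0) {
            std::printf("not covered: %s\n", op.c_str());
        }
    }
    std::printf("coverage: %zu/%zu (%.1f%%)\n",
                tested_ops.size(), all_ops.size(),
                100.0 * (double) tested_ops.size() / (double) all_ops.size());
    return 0;
}
```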
Erik Scholz a81283820a
gguf: gguf_writer refactor (#15691)
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
2025-09-05 11:34:28 +02:00
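A hedged sketch of the base/buffer/file split described in the bullets above, with illustrative names rather than the actual gguf writer classes: one interface for emitting bytes, an in-memory implementation, and a file-backed one that streams data out instead of buffering the whole file first.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <vector>

// Illustrative names only, not the actual gguf writer classes.
struct writer_base {
    virtual ~writer_base() = default;
    virtual void write(const void * data, size_t size) = 0;
};

// In-memory implementation: accumulates everything in a byte buffer.
struct writer_buf : writer_base {
    std::vector<uint8_t> buf;
    void write(const void * data, size_t size) override {
        const uint8_t * p = (const uint8_t *) data;
        buf.insert(buf.end(), p, p + size);
    }
};

// File-backed implementation: bytes go straight to disk, so the whole
// file never has to live in memory.
struct writer_file : writer_base {
    FILE * f;
    explicit writer_file(FILE * f) : f(f) {}
    void write(const void * data, size_t size) override {
        fwrite(data, 1, size, f);
    }
};

// Templated write-out: the same serialization code works with either writer.
template <typename T>
void write_pod(writer_base & w, const T & value) {
    w.write(&value, sizeof(value));
}

int main() {
    writer_buf wb;
    write_pod(wb, uint32_t{0x46554747}); // "GGUF" magic, little-endian
    std::printf("buffered %zu bytes\n", wb.buf.size());
    return 0;
}
```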