Commit Graph

6486 Commits

Author SHA1 Message Date
Dong Won Kim 20c2dac8c6
vulkan: add exp operation (#15456)
Co-authored-by: aeseulgi <kim2h7903@gmail.com>
2025-08-21 17:00:16 +02:00
Jeff Bolz 96452a3fa4
vulkan: Reuse conversion results in prealloc_y (#15410)
* vulkan: Reuse conversion results in prealloc_y

Cache the pipeline and tensor that were most recently used to fill prealloc_y,
and skip the conversion if the current pipeline/tensor match.

* don't use shared pointer for prealloc_y_last_pipeline_used
2025-08-21 16:55:00 +02:00
Jie Fu (傅杰) 9ad5e60dba
examples : fix some typos in examples/model-conversion/README.md (#15477)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 16:53:13 +02:00
Georgi Gerganov 715a6db02c
kv-cache : drop the "unified" prefix (#15467)
* kv-cache : drop the "unified" prefix

ggml-ci

* cont : fix comment [no ci]
2025-08-21 17:00:33 +03:00
Jie Fu (傅杰) ad294df03f
examples : install torch-cpu for model conversion tool/example (#15475)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 15:42:34 +02:00
Ali Tariq 029bb39eb1
ci : enable RVV1.0 native build (#15386)
* Changed the CI file to hw

* Changed the CI file to hw

* Added to sudoers for apt

* Removed the clone command and used checkout

* Added libcurl

* Added gcc-14

* Checking gcc --version

* added gcc-14 symlink

* added CC and C++ variables

* Added the gguf weight

* Changed the weights path

* Added system specification

* Removed white spaces

* ci: Replace Jenkins riscv native build Cloud-V pipeline with GitHub Actions workflow

Removed the legacy .devops/cloud-v-pipeline Jenkins CI configuration and introduced .github/workflows/build-riscv-native.yml for native RISC-V builds using GitHub Actions.

* removed trailing whitespaces

* Added the trigger at PR creation

* Corrected OS name

* Added ccache as setup package

* Added ccache for self-hosted runner

* Added directory for ccache size storage

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Changed the build command and added ccache debug log

* Added the base dir for the ccache

* Re-trigger CI

* Cleanup and refactored ccache steps

* Cleanup and refactored ccache steps

---------

Co-authored-by: Akif Ejaz <akifejaz40@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-21 14:52:16 +02:00
Georgi Gerganov 30649cab65
ci : continue file download with wget (#15471)
ggml-ci
2025-08-21 13:42:55 +03:00
Daniel Bevenius 2758fa10da
examples : add model conversion tool/example (#15455)
* examples : add model conversion tool/example

This commit adds an "example/tool" that is intended to help in the
process of converting models to GGUF. Currently it supports normal
causal models and embedding models. The readme contains instructions and
command to guide through the process.

The motivation for this to have a structured and repeatable process for
model conversions and hopefully with time improve upon it to make the
process easier and more reliable. We have started to use this for new
model conversions internally and will continue doing so and improve it
as we go along. Perhaps with time this should be placed in a different
directory than the examples directory, but for now it seems like a good
place to keep it while we are still developing it.

* squash! examples : add model conversion tool/example

Remove dependency on scikit-learn in model conversion example.

* squash! examples : add model conversion tool/example

Update transformer dep to use non-dev version. And also import
`AutoModelForCausalLM` instead of `AutoModel` to ensure compatibility
with the latest version.

* squash! examples : add model conversion tool/example

Remove the logits requirements file from the all requirements file.
2025-08-21 12:16:54 +02:00
Michael Giba b108e42904
ci : fix -Werror=return-type in clip.cpp so ci/run.sh can run without issue (#15221)
* Fix -Werror=return-type so ci/run.sh can run

* Update tools/mtmd/clip.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* Remove false now that we have abort

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-08-21 12:06:46 +02:00
Copilot 245be739df
ci : add copilot-instructions.md (#15286)
* Initial plan

* Initialize copilot instructions exploration

* Add comprehensive .github/copilot-instructions.md file

* Update Python environment and tools directory documentation

- Add instructions for using .venv Python environment
- Include flake8 and pyright linting tools from virtual environment
- Add tools/ as core directory in project layout
- Reference existing configuration files (.flake8, pyrightconfig.json)

* add more python dependencies to .venv

* Update copilot instructions: add backend hardware note and server testing

* Apply suggestions from code review

* Apply suggestions from code review

* Replace clang-format with git clang-format to format only changed code

* Minor formatting improvements: remove extra blank line and add trailing newline

* try installing git-clang-format

* try just clang-format

* Remove --binary flag from git clang-format and add git-clang-format installation to CI

* download 18.x release

* typo--

* remove --binary flag

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-08-21 11:47:52 +02:00
Julien Denize b2caf67db1
convert : make Mistral community chat templates optional via parameter (#15420)
* Make Mistral community chat templates optional

* Change the flag arg to disable instead of enable community chat templates

* Improve error message

* Improve help message

* Tone down the logger messages
2025-08-21 11:19:50 +02:00
Jie Fu (傅杰) 2f3dbffb17
common : fix incorrect print of non-ascii characters in the logging (#15466)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-21 11:54:34 +03:00
Xuan-Son Nguyen 945e1f12a6
ggml : fix condition of im2col on Metal backend (#15460) 2025-08-21 08:32:26 +03:00
stduhpf 1b0db8f6e0
server : fix webui (#15462)
* Fix webui crash after streaming

* build webui
2025-08-21 08:19:22 +03:00
Daniel Bevenius 29f538ac63
examples : remove references to `make` in examples [no ci] (#15457)
This commit removes references to `make` in the examples, as the build
system has been updated to use CMake directly and using `make` will now
generate an error since Commit 37f10f955f
("make : remove make in favor of CMake (#15449)").
2025-08-21 06:12:28 +02:00
R0CKSTAR 8ad038c0fd
musa: add GGML_UNUSED_VARS (#15446)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-21 11:06:05 +08:00
Diego Devesa 5682a3745f
sched : copy only the used experts when offloading prompt processing (#15346) 2025-08-21 01:35:28 +02:00
teo 1bc664a26a
server: fix OpenAI API compatibility for usage statistics in chat streams (#15444) 2025-08-21 00:10:08 +02:00
Johannes Gäßler 13aeb7aef2
CUDA: refactor FA support/selection code (#15454) 2025-08-20 23:14:14 +02:00
Johannes Gäßler 7a6e91ad26
CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433) 2025-08-20 16:58:49 +02:00
Jeff Bolz fec9519802
vulkan: shorten pipeline name strings (#15431)
These detailed strings were causing increased build time on gcc.
2025-08-20 16:33:14 +02:00
Daniel Bevenius 657b8a77bd
chat: handle gpt-oss return/end token inconsistency (#15421)
This commit addresses an inconsistency during inference by adding a new
member to the `templates_params` struct to indicate whether the chat is
in inference mode. This allows the gpt-oss specific function
`common_chat_params_init_gpt_oss` to check this flag and the
`add_generation_prompt` flag to determine if it should replace the
`<|return|>` token with the `<|end|>` token in the prompt.

The motivation for this change is to ensure that the formatted prompt of
past messages in `common_chat_format_single` matches the output of the
formatted new message. The issue is that the gpt-oss template returns
different end tags: `<|return|>` when `add_generation_prompt` is false,
and `<|end|>` when `add_generation_prompt` is true. This causes the
substring function to start at an incorrect position, resulting in
tokenization starting with 'tart|>' instead of '<|start|>'.

Resolves: https://github.com/ggml-org/llama.cpp/issues/15417
2025-08-20 14:26:01 +02:00
Jie Fu (傅杰) ec5ab1a36c
common : fix context shift help message (#15448)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-08-20 13:33:30 +03:00
xiaobing318 1a99c2d948
cmake : fix target include directories (#15450)
* Update docker.yml

修改docker.yml文件中的内容使其停止周期性的运行该workflow,如果想要运行该workflow可以手动启动

* feat:Modify the header file include path

1. There's no llava directory in the tools directory.
2. Because the command `target_include_directories(mtmd PUBLIC .)` is used in the `mtmd` CMakeLists.txt file, other targets that link against `mtmd` automatically include the `mtmd` directory as a search path for header files. Therefore, you can remove `target_include_directories(${TARGET} PRIVATE ../llava`` or use `target_include_directories(${TARGET} PRIVATE ../mtmd`` to explicitly require the `llama-server` target to use header files from `mtmd`.

* Restore the docker.yml file
2025-08-20 13:32:05 +03:00
Daniel Bevenius 37f10f955f
make : remove make in favor of CMake (#15449)
This commit removes the content from the Makefile and updates the
current deprecation message to information that `make` has been
replaced by CMake instead.

The message when `make` is invoked will now be the following:
```console
$ make
Makefile:6: *** Build system changed:
 The Makefile build has been replaced by CMake.

 For build instructions see:
 https://github.com/ggml-org/llama.cpp/blob/master/docs/build.md

.  Stop.
```

The motivation for this is that many, if not all targets fail to build
now, after changes to the system, and `make` has also been deprected for
some time now.
2025-08-20 13:31:16 +03:00
Georgi Gerganov 2f37014073
lookahead : add sample command to readme (#15447)
* lookahead : add sample command to readme

* cont : build-agnostic command
2025-08-20 13:30:46 +03:00
R0CKSTAR a094f38143
musa: fix build warnings (#15258)
* musa: fix build warnings

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* fix warning: comparison of integers of different signs: 'const int' and 'unsigned int' [-Wsign-compare]

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-20 10:17:37 +08:00
lhez fb22dd07a6
opencl: mark `argsort` unsupported if cols exceed workgroup limit (#15375) 2025-08-19 11:25:51 -07:00
Georgi Gerganov 9ef6b0b835
model : add gpt-oss type strings (#15424) 2025-08-19 19:58:28 +03:00
Gian-Carlo Pascutto 1e19f5d462
common : Add top-nsigma sampler to help globally (#15428)
Fixes #15423.
2025-08-19 19:58:14 +03:00
Georgi Gerganov d2fcd91cf9
server : disable context shift by default (#15416)
* server : disable context shift by default

ggml-ci

* server : make scopr of test parameters local
2025-08-19 16:46:37 +03:00
SHUAI YANG a6d3cfe7fa
CANN: optimize rope operator (#15335)
* optimize rope ops

* amendment

* delete trailing whitespace

* change the variable name
2025-08-19 21:28:22 +08:00
R0CKSTAR 67f09a3a27
musa: handle __hgt2_mask, available starting from MUSA SDK rc4.3.0 (#15413)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-08-19 12:33:47 +02:00
Marvin Gießing 6424594c56
ggml-cpu: add mxfp4 VSX intrinsics for Power9+ (ppc64le) hardware (#15385)
* Added VSX intrinsics for Power9+ systems

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Manual unrolling for minor perf improvement

Signed-off-by: mgiessing <marvin.giessing@gmail.com>

* Update ggml/src/ggml-cpu/arch/powerpc/quants.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Signed-off-by: mgiessing <marvin.giessing@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-08-19 11:54:31 +03:00
Xuan-Son Nguyen e9288e8869
chat : clarify the meaning of reasoning_format (#15408)
* chat : clarify the meaning of reasoning_format

* add link to this PR
2025-08-19 10:29:36 +02:00
Georgi Gerganov 9d262f4bad
server : remove swa_full warning (#15399) 2025-08-19 08:45:26 +03:00
Georgi Gerganov f0d3c7405c
batched-bench : use rand tokens (#15398) 2025-08-19 08:45:12 +03:00
Xuan-Son Nguyen f08c4c0d8d
mtmd : clean up clip_n_output_tokens (#15391) 2025-08-18 22:53:52 +02:00
Georgi Gerganov 6d7f1117e3 codeowners : remove mmv.* 2025-08-18 22:06:44 +03:00
Georgi Gerganov 60212f1ead sync : ggml 2025-08-18 22:06:44 +03:00
Georgi Gerganov f0c541d315 scripts : update sync scripts 2025-08-18 22:06:44 +03:00
Sigbjørn Skjæret baa9255a45
llama : merge conts and reshapes and remove unnecessary cont (#15380)
* remove unnecessary conts and merge reshapes

* restore necessary conts

* merge more conts and reshapes

* merge even more conts and reshapes
2025-08-18 19:30:17 +02:00
Georgi Gerganov 3007baf201
readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
davidef d1d8241600
server : fix incoming tasks not process in order (#15395) 2025-08-18 17:51:42 +03:00
Dobri Danchev 618575c582
Fix broken build: require updated pip to support --break-system-packages (#15357)
* Revert "devops : fix compile bug when the BASE_CUDA_DEV_CONTAINER is based on Ubuntu 24.04 (#15005)"

This reverts commit e4e915912c.

* devops: Allow pip to modify externally-managed python environment (system installation)

- Updated pip install commands to include the --break-system-packages
  flag, ensuring compatibility when working with system-managed Python
  environments (PEP 668).

- Note: The --break-system-packages option was introduced in 2023.
  Ensure pip is updated to a recent version before using this flag.

fixes [#15004](https://github.com/danchev/llama.cpp/issues/15004)
2025-08-18 12:50:48 +02:00
compilade f44f793172
ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)
* ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors

* ggml-quants : avoid division by zero in make_q3_quants
2025-08-18 09:23:56 +02:00
Jeff Bolz ae532eac2c
vulkan: disable spirv-opt for bfloat16 shaders (#15352) 2025-08-18 07:56:29 +02:00
Oleksandr Kuvshynov e5155e6986
server : export max observed n_past value (#15361)
Add tracking for high watermark cache usage and make it available in /metrics endpoint.

Use-case: Tracking largest needed cache usage under realistic workload
to better understand memory requirements and be able to adjust
cache size/quantization for model/cache accordingly.
2025-08-18 00:28:58 +02:00
Jeff Bolz 21c17b5bef
vulkan: Use larger workgroups for mul_mat_vec when M is small (#15355)
* vulkan: Use larger workgroups for mul_mat_vec when M is small

Also use subgroup instructions for (part of) the reduction when supported.
Without this, the more expensive reductions would eat into the benefits of
the larger workgroups.

* update heuristic for amd/intel

Co-authored-by: 0cc4m <picard12@live.de>

---------

Co-authored-by: 0cc4m <picard12@live.de>
2025-08-17 18:08:57 +02:00
Dong Won Kim 19f4decae0
vulkan: support sqrt (#15370) 2025-08-17 16:03:09 +02:00