Commit Graph

6626 Commits

Author SHA1 Message Date
Ed Addario 1fbc59f867
Replace slope with cross product 2025-09-22 20:10:10 +01:00
Ed Addario c855094dff
Exit loop if no better solution found 2025-09-22 20:09:11 +01:00
Ed Addario b748a1efa7
Fix typo 2025-09-21 22:03:54 +01:00
Ed Addario 896cdc2121
Refactor potential overflow 2025-09-21 22:03:36 +01:00
Ed Addario fecc472c61
Fix typos in variable names 2025-09-21 17:26:38 +01:00
Ed Addario e92db008bc
Refactor quantisation checks into its own function 2025-09-21 17:20:48 +01:00
Ed Addario 814f6b66be
Minor general refactoring 2025-09-21 16:45:09 +01:00
Ed Addario 0d5f18303e
Refactor lagrange_penalty() 2025-09-21 16:22:00 +01:00
Ed Addario 9a1656eb97
Refactor pareto optimise and convexify 2025-09-21 16:21:35 +01:00
Ed Addario 1a3e9ea4c8
Refactor estimate_error() 2025-09-21 16:21:00 +01:00
Ed Addario a7ee915e19
Refactor trimmed_sum() 2025-09-21 16:20:06 +01:00
Ed Addario b09662f86a
Refactor estimate_lambda() 2025-09-21 16:19:49 +01:00
Ed Addario 17be7615ce
Refactor candidate types build 2025-09-21 16:19:28 +01:00
Ed Addario 08146fd67f
Refactor side_data() and copy_or_broadcast() 2025-09-21 16:19:03 +01:00
Ed Addario 7386d4eadd
Refactor row sampling 2025-09-21 16:18:26 +01:00
Ed Addario b6c008fd8a
Refactor helper lambdas 2025-09-21 16:04:13 +01:00
Ed Addario b433fd9547
Refactor last budget pass 2025-09-21 13:43:09 +01:00
Ed Addario c466c53808
Refactor pareto pruning and convexification 2025-09-21 13:42:54 +01:00
Ed Addario 6b8cedf3bc
Refactor estimate_lambda() 2025-09-21 13:42:31 +01:00
Ed Addario bdefdb673c
Refactor copy_or_broadcast() 2025-09-21 13:42:07 +01:00
Ed Addario e8e2aed17a
Refactor row sampling 2025-09-21 13:41:44 +01:00
Ed Addario 9e74f83411
Replace --bpw-bias flag with --no-bias 2025-09-20 23:06:37 +01:00
Ed Addario ab02bb1f3e
Merge branch 'master' into quantize 2025-09-20 21:41:25 +01:00
Ed Addario a36946997e
Replace fast_bias() for per slice version and remove precise_bias() 2025-09-20 21:36:54 +01:00
Ed Addario 14fae69a7b
General refactoring 2025-09-20 21:31:31 +01:00
Georgi Gerganov 7f766929ca sync : ggml 2025-09-20 13:02:14 +03:00
Daniel Bevenius 405921dcef ggml : introduce semantic versioning (ggml/1336)
* ggml : introduce semantic versioning

This commit introduces semantic versioning for the GGML library.

The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.

The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
   PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
   or patch. This script will handle incrementing the version
   (major|minor|patch), create a new commit with the version change,
   create a tag for the version, and prepare for the next development
   iteration.
3. Inspect the commits/tag and push to master. This will trigger the
   github release workflow which is triggered for new tags which will
   then publish a new release on github.

Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made

Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0

Step 2: Updating version in ggml/CMakeLists.txt...
  [dry-run] Would update GGML_VERSION_MAJOR to 1
  [dry-run] Would update GGML_VERSION_MINOR to 0
  [dry-run] Would update GGML_VERSION_PATCH to 0
  [dry-run] Would remove -dev suffix

Step 3: Committing version bump...
  [dry-run] Would commit: 'ggml : bump version to 1.0.0'

Step 4: Creating git tag...
  [dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'

Step 5: Preparing for next development cycle...
  [dry-run] Would update GGML_VERSION_MINOR to 1
  [dry-run] Would add -dev suffix back

Step 6: Committing development version...
  [dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'

[dry-run] Summary (no changes were made):
  • Would have released version: 1.0.0
  • Would have created tag: v1.0.0
  • Would have set next development version: 1.1.0-dev
```

Refs: https://github.com/ggml-org/ggml/issues/1333

* ggml: create branch for release candidate and check master

* ggml : sign the git tag
2025-09-20 13:02:14 +03:00
Gregor Jasny fa6383ca7e CUDA : conditionally add cuda architectures (ggml/1341) 2025-09-20 13:02:14 +03:00
Ruben Ortlam 803dac2e48
vulkan: use vec dot for matrix matrix multiplications (#16056)
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions

* use fma instead of dot to fix Nvidia and Apple performance issues
2025-09-20 10:42:56 +02:00
Benni 459c0c2c1a
server: fix SSE and OpenAI compatibility for error messages when streaming (#16109)
* server: fix SSE and OpenAI compatibility for error messages when streaming

* server: remove obsolete event parameter and use required data fieldname instead
2025-09-20 07:56:30 +02:00
ssweens be79d9fdd9
llama-bench: add --devices and --list-devices support (#16039)
* * llama-bench: add --devices support
- Support --devices same as llama-server
- Provide for benchmarking different device combinations
- Include --list-devices like llama-server for convenience

* fix: field display ordering restored

* fix: integrated the rpc devices
- aimed to mimic the server as much as possible

* cleanup: defaults for list-devices
- handle dup device listing with RPC

* cleanup: remove dup device load calls

* docs: update llama-bench
- added the recently added n-cpu-moe option to the docs while in there

* llama-bench: rpc device simplification
* rpc servers unify with other devices earlier, simplifying code
* --list-devices made stateless and simpler
* various cleanup
2025-09-20 00:15:21 +02:00
shun095 f432d8d83e
chat: Fix streaming parser for granite models (#15682)
* fix(chat): fix streaming parser for granite models

* tests: add test cases for Granite models chat parser
2025-09-19 09:57:30 -06:00
Aleksander Grygier 4067f07fc5
feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog

* chore: update webui build output

* fix: Linting errors

* chore: update webui build output
2025-09-19 09:52:27 +02:00
Xuan-Son Nguyen 4b8560ab56
chat : fix build on arm64 (#16101) 2025-09-19 13:02:51 +07:00
Xuan-Son Nguyen 0dd58b6877
ggml : refactor forward_dup for cpu backend (#16062)
* ggml : refactor forward_dup for cpu backend

* clean up a bit

* add quant/dequant perf test
2025-09-19 06:31:56 +02:00
Adrien Gallouët 69ffd89163
ggml-amx : fix ggml_amx_init() on generic Linux (#16049)
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.

Without this commit, the code compiles (with warnings) but fails:

    register_backend: registered backend CPU (1 devices)
    register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
    build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
    system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
    ....
    print_info: n_ctx_orig_yarn  = 262144
    print_info: rope_finetuned   = unknown
    print_info: model type       = 4B
    Illegal instruction (core dumped)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-18 23:07:26 +02:00
Adrien Gallouët 246c0d9c79
cmake : fix static linking for OpenMP on Unix-like systems (#16031)
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:

    $ cmake -B build \
            -DBUILD_SHARED_LIBS=OFF \
            -DLLAMA_CURL=OFF \
            -DGGML_CCACHE=OFF \
            -DGGML_NATIVE=OFF \
            -DGGML_STATIC=ON

    $ ldd llama-server
            linux-vdso.so.1 (0x0000e1a434e3b000)
            libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
            libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
            libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
            libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
            libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
            /lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)

This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-09-18 23:07:18 +02:00
Shawn Gu 3edd87cd05
opencl: optimize mxfp4 kernels (#16037)
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations

---------

Co-authored-by: Li He <lih@qti.qualcomm.com>
2025-09-18 12:03:34 -07:00
Jeff Bolz c0b45097c3
rename optimize_graph to graph_optimize (#16082) 2025-09-18 13:46:17 -05:00
Bowen Han 38dbdf4c05
CUDA: Optimize PAD_REFLECT_1D (#15957)
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D

* use fast_div to improve performance

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Apply suggestion from JohannesGaessler

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* optimize

* use a concise expression to further speedup the cuda kernel

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-18 20:26:03 +02:00
Johannes Gäßler 368560a1e3
CUDA: fix compilation on CC 6.0 (#16091) 2025-09-18 19:28:32 +02:00
Eric Curtin 4ca088b036
Add resumable downloads for llama-server model loading (#15963)
- Implement resumable downloads in common_download_file_single function
- Add detection of partial download files (.downloadInProgress)
- Check server support for HTTP Range requests via Accept-Ranges header
- Implement HTTP Range request with "bytes=<start>-" header
- Open files in append mode when resuming vs create mode for new downloads

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-09-18 16:22:50 +01:00
Georgi Gerganov 703f9e32c4
metal : use function constants for mul_mv_ext kernels (#16074)
* metal : use function constants for mul_mv_ext kernels

ggml-ci

* metal : remove NW template argument

ggml-ci

* metal : adjust constants

ggml-ci
2025-09-18 16:28:41 +03:00
Sigbjørn Skjæret ad6bd9083b
cuda : add missing F32<->I32 entries in ggml_cuda_cpy_fn (#16060) 2025-09-18 13:28:22 +02:00
Radoslav Gerganov 2b6b55a59f
server : include usage statistics only when user request them (#16052)
* server : include usage statistics only when user request them

When serving the OpenAI compatible API, we should check if
{"stream_options": {"include_usage": true} is set in the request when
deciding whether we should send usage statistics

closes: #16048

* add unit test
2025-09-18 10:36:57 +00:00
Georgi Gerganov e58174cecb
llama : bump max seq limit from 64 to 256 (#15916)
ggml-ci
2025-09-18 12:47:56 +03:00
Georgi Gerganov b213fce89b
metal : improve F32, F16 and BF16 mat-vec multiplication (#16057)
* metal : improve F32, F16 and BF16 mat-vec multiplication

ggml-ci

* metal : make the NSG a function constant in mul_mv kernels

ggml-ci
2025-09-18 12:33:45 +03:00
Jhen-Jie Hong e00f3fd8ff
metal : avoid call free for non-owned buffer (#16067) 2025-09-18 10:06:48 +03:00
Georgi Gerganov f2f28380ea
metal : handle nil cv during pipeline creation (#16065)
ggml-ci
2025-09-18 10:03:24 +03:00
Chenguang Li 62c3b645c5
CANN: Remove print (#16044)
Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-18 09:26:33 +08:00