This PR adds additional information to an error message when loading backend library via ld_load_library() fails. This helps spotting why backend library did not load (missing library, missing dependency or unresolved symbol etc.).
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
* fix validation error for large im2col
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16
[no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
* build fixes
The "Clamp" spec constant is already based on whether KV is a multiple of Bc,
so use that to control whether bounds checking is performed. Add bounds checking
to the scalar and coopmat1 paths. Coopmat2 didn't need any changes (the K/V
tensors are already optionally clamped, nothing else needed to be changed).
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
This commit adds mul_mat_id support for ncols_dst >= 16. It does this by
packing ncols_dst tiles into the blockDim.y.
My tests on a RTX 3090 show that this is faster than the cuBLAS fallback
for f16 till bs=64, and for f32 till bs=32
* Review: refactor if statement
The dequantize functions are copy/pasted from mul_mm_funcs.comp with very few
changes - add a_offset and divide iqs by 2. It's probably possible to call
these functions from mul_mm_funcs and avoid the duplication, but I didn't go
that far in this change.
* devops: move s390x and ppc64le ci build
we have access to ubuntu-24.04-s390x and ppc64le images now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le for now since they have compiler errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: stop warnings as errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to non-macro flag
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: going the llama macro route
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian gguf test models
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le to test s390x, check test build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.inp files for big-endian tests
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.out files for big-endian too
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add python setup and endian byteswap
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: pooring thing does not have s390x python3
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing rust compiler for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try rust actions runner
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: try rust actions runner"
This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try a different path for rust
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dump home directory and user info
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install gguf-py only
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: missed relative path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove big-endian files since local swapping is working
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: revert test-tokenizer-0 cmakelists
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix unicode flags conversion from and to uint16_t
Bitfields are allocated in different order on s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Simplify byteswap command
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix endianness detection in vocab loader
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Disable test-thread-safety on s390x
In this test a model is downloaded,
then immediately loaded to check if more downloads are needed,
and then used for test.
There is no clean way to separate all those steps
to add byteswapping between them, so just skip this test.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q8_0 test in test-quantize-fns
vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian stories260K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x test-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix test does not exist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix model not found llama-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q3_K dot product error in test-quantize-fns on s390x
Array q8bytes had only 4 elements allocated, but 8 elements accessed.
This lead to write out of bounds and later read of overwritten values out of bounds
and incorrect result.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: re-enable ppc64le for testing
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: activate test-thread-safety for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le tests
for some reason it keeps failing test-thread-safety tests and I do not
have a machine that is able to replicate the tests.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: LLAMA_FATAL_WARNINGS=ON
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Correct repository URL for s390x for test-thread-safety model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix fs_get_cache_directory
Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Re-enable CI for ppc64le
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fortify ggml_rope_impl
Only memcpy data from sections argument if it's non-NULL.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update remaining mentions of BE models to ggml-org/models repo
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vendor : update httplib
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : use cpp-httplib as a cURL alternative for downloads
The existing cURL implementation is intentionally left untouched to
prevent any regressions and to allow for safe, side-by-side testing by
toggling the `LLAMA_CURL` CMake option.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml : Bump to Windows 10
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
Use RPC_DEBUG environment variable to enable debug messages.
Add helper macro LOG_DBG() which does an early
check of the env var before calling GGML_LOG_DEBUG().
Make sure we log a debug message for every server function.
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
* ggml : introduce semantic versioning
This commit introduces semantic versioning for the GGML library.
The motivation for this is that the current versioning, using build
numbers, makes it difficult to track changes and releases for projects
that use ggml.
The release steps are the following:
1. Sync the changes from llama.cpp using sync-llama-am.sh and after the
PR has been approved and merged move to step 2.
2. Run scripts/release.sh and specify the type of release, major, minor,
or patch. This script will handle incrementing the version
(major|minor|patch), create a new commit with the version change,
create a tag for the version, and prepare for the next development
iteration.
3. Inspect the commits/tag and push to master. This will trigger the
github release workflow which is triggered for new tags which will
then publish a new release on github.
Example usage:
```console
$ ./scripts/release.sh major --dry-run
[dry-run] - No changes will be made
Step 1: Reading current version...
Current version: 0.9.0-dev
New release version: 1.0.0
Step 2: Updating version in ggml/CMakeLists.txt...
[dry-run] Would update GGML_VERSION_MAJOR to 1
[dry-run] Would update GGML_VERSION_MINOR to 0
[dry-run] Would update GGML_VERSION_PATCH to 0
[dry-run] Would remove -dev suffix
Step 3: Committing version bump...
[dry-run] Would commit: 'ggml : bump version to 1.0.0'
Step 4: Creating git tag...
[dry-run] Would create tag: v1.0.0 with message 'Release version 1.0.0'
Step 5: Preparing for next development cycle...
[dry-run] Would update GGML_VERSION_MINOR to 1
[dry-run] Would add -dev suffix back
Step 6: Committing development version...
[dry-run] Would commit: 'ggml : prepare for development of 1.1.0-dev'
[dry-run] Summary (no changes were made):
• Would have released version: 1.0.0
• Would have created tag: v1.0.0
• Would have set next development version: 1.1.0-dev
```
Refs: https://github.com/ggml-org/ggml/issues/1333
* ggml: create branch for release candidate and check master
* ggml : sign the git tag
* vulkan: Change the mul_mm shared memory and register caching system to use vec2 instead of scalars, to enable using dot2 instructions
* use fma instead of dot to fix Nvidia and Apple performance issues
Generalize Linux check to `__linux__` to support non-glibc systems (like musl).
Also, return `false` on unknown/untested OS.
Without this commit, the code compiles (with warnings) but fails:
register_backend: registered backend CPU (1 devices)
register_device: registered device CPU (Intel(R) Xeon(R) Platinum 8488C)
build: 6487 (51c4cac6) with x86_64-linux-musl-gcc (GCC) 15.1.0 for x86_64-linux-musl (debug)
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16
....
print_info: n_ctx_orig_yarn = 262144
print_info: rope_finetuned = unknown
print_info: model type = 4B
Illegal instruction (core dumped)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
When compiling with GGML_STATIC=ON, the build process would produce a
binary that was still dynamically linked to OpenMP. This defeats the
purpose of a static build:
$ cmake -B build \
-DBUILD_SHARED_LIBS=OFF \
-DLLAMA_CURL=OFF \
-DGGML_CCACHE=OFF \
-DGGML_NATIVE=OFF \
-DGGML_STATIC=ON
$ ldd llama-server
linux-vdso.so.1 (0x0000e1a434e3b000)
libgomp.so.1 => /lib/aarch64-linux-gnu/libgomp.so.1 (0x0000e1a4345a0000)
libstdc++.so.6 => /lib/aarch64-linux-gnu/libstdc++.so.6 (0x0000e1a434300000)
libm.so.6 => /lib/aarch64-linux-gnu/libm.so.6 (0x0000e1a434240000)
libgcc_s.so.1 => /lib/aarch64-linux-gnu/libgcc_s.so.1 (0x0000e1a434200000)
libc.so.6 => /lib/aarch64-linux-gnu/libc.so.6 (0x0000e1a434030000)
/lib/ld-linux-aarch64.so.1 (0x0000e1a434df0000)
This commit resolves the issue by modifying `CMAKE_FIND_LIBRARY_SUFFIXES`
to prioritize `.a` files, forcing CMake to link the static version of
the library.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
- flatten mxfp4 and packed fp4->fp16 bit-wise convert function (replace lut)
- MoE kernel optimizations
---------
Co-authored-by: Li He <lih@qti.qualcomm.com>
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
* CANN: Fix ggml_cann_set_device to avoid redundant device switches
- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.
* CANN: add device default id
* ggml : remove adding extra dim timestep embedding
This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.
The motivation for this change is that this introduces an unnecessary
dimension when the dimension is odd, which caused an issue in the
kernels which were not expecting this extra dimension and it resulted in
uninitialized memory for the second to last dimension.
* ggml-cuda : fix padding in timestep embedding kernel
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* ggml-metal : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel
* ggml-opencl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-sycl : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-vulkan : fix padding in timestep embedding kernel
This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.
* ggml-cpu : fix padding in timestep embedding function
This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
* fix im2col_3d to respect non-contiguous inputs (views)
The CUDA 3D im2col kernel computed source addresses assuming compact layout (products of dims), ignoring nb[] strides.
This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.
* use ggml_element_size() for src strides
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* SYCL: Add COUNT_EQUAL operator support (rebased on master)
* SYCL: remove duplicate op_count_equal definition
* tests: remove test_count_equal_typed and use test_count_equal for all cases
* tests: keep only I32 case for COUNT_EQUAL as suggested
* tests: keep only I32 case for COUNT_EQUAL as requested
* metal : remove mem pool usage
ggml-ci
* metal : remove mem pool implementation
ggml-ci
* metal : take into account the actual allocated memory of the tensor
ggml-ci
* cont : use ggml_backend_buft_get_alloc_size
ggml-ci
* cont : improve, comments
ggml-ci
* cont : add functions for the extra tensor sizes
* metal : add comments
ggml-ci
* metal : implement .get_alloc_size for the rest of the buffer types
ggml-ci
* metal : remove ggml_metal_heap
ggml-ci
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
* metal : refactor bin kernels loading
ggml-ci
* metal : refactor rms kernel loading
ggml-ci
* ci : try to add memory leaks check
ggml-ci
* ci : try to enable memory leak detection for Mac
* cont : seems to be working
* ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type
ggml-backend : add device id to device props
llama : only use iGPU devices if there are no GPU devices
llama : do not use multiple devices from different backends with the same device id
This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling
MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when
the target hardware actually supports them.
The motivation for this is to fix ggml CI build failures where the
feature detection correctly identifies that i8mm is not supported,
adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are
still enabled, causing the compiler to attempt to use vmmlaq_s32
intrinsics without i8mm support.
Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499
Since the prefill length is not fixed, graphs constructed for the
prefill stage cannot be reused. For this reason, ACL graph
execution is disabled by default during prefill.
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:
Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero
This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.
Resolves: https://github.com/ggml-org/ggml/issues/1324
* metal : make the backend async
ggml-ci
* cont : add comments, extend op offload, clean up
ggml-ci
* metal : fix batch size for MUL_MAT_ID
* metal : remove deprecated ggml_backend_metal_buffer_from_ptr
* metal : create only metal buffers, no wrapping of host memory
ggml-ci
* metal : restore .alloc_buffer for buffer_from_ptr_type
ggml-ci
* metal : remove broken implementation of GGML_OP_SET
ggml-ci
* metal : clean-up loose ends, ready for tests
ggml-ci
* metal : support both private and shared buffers
ggml-ci
* metal : enable private buffers + add global device queue
* metal : disable host buffer to prevent races
ggml-ci
* metal : avoid extra copy during set_tensor
ggml-ci
* metal : use separate buffer types for shread and private Metal buffers
ggml-ci
* metal : simplify synchronization logic
ggml-ci
* metal : fix build
ggml-ci
* metal : do not implement cpy_tensor
ggml-ci
* metal : separate implementations for shared and private buffers
ggml-ci
* CANN: Add ROPE sin/cos cache for reuse
Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.
- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
(ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation
This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.
Co-authored-by: hipudding <huafengchun@gmail.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
* CANN: implement LRU cache for ACL graphs in CANN backend
- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.
* fix typo
* The LRU cache capacity can be configured via an env variable
Signed-off-by: noemotiovon <757486878@qq.com>
* refactory acl graph
* refactory && fix review comments
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
* CUDA: Add mul_mat_id support the mmf
Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
* vulkan: sort graph to allow more parallel execution
Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes, doesn't change
the contents of any of them.
With #15489, this reduces the number of synchronizations needed.
* call optimize_graph per-split
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1
ggml-ci
* ggml : add comment about ggml_get_rows
ggml-ci
* cuda : add FIXME [no ci]
* cuda : update support condition
ggml-ci
* ggml: allow casting between f32 and i32
* fix cuda
* add vulkan
* fix CPU non-cont
* add non-cont test case
* add note
* extend test number range
* correct note
* add cont version for vulkan
I think glslang will translate an access like x[i][1].z to
OpAccessChain ... x, i, 1, 2
OpLoad float16_t ...
rather than loading all of x[i] in a single OpLoad. Change the
code to explicitly load the vector/matrix.
* ggml WebGPU: remove userdata from request adapter callback
This commit removes the `userdata` parameter from the WebGPU request
adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function
captures the `webgpu_context` directly.
The motivation for this change is to simplify the code and improve
readability.
* inline the callback lambda into the RequestAdapter call
This commit removes the callback lambda variable and inlines it directly
into the RequestAdapter call.
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
Fixes#15330
Adjust the allocation size of acl_rstd. The parameter `dims` is set to 3 according to the CANN documentation.
Co-authored-by: Yuchuan <yuchuan-cao@users.noreply.github.com>
* vulkan : update ggml_vk_instance_validation_ext_available
This commit updates ggml_vk_instance_validation_ext_available() to
check for VK_EXT_validation_features instead of
VK_KHR_portability_enumeration.
Based on how the returned boolean is used later in the code (to enable
both the validation layer and the VK_EXT_validation_features extension),
it appears the function may have been intended to check for the
validation layer features extension.
* remove try/catch
This was a left over from a previous iteration where I was explicitly
quering for a specific validation layer first, which would throw.
* update warning message about validation layers
* Add fastdiv, use it in modulo and use modulo in rms_norm_f32
Fastdiv is much faster way to do integer division, which was identified
as bottleneck in rms_norm_f32
* Support more `block_size` values in `rms_norm_f32`
This makes us more flexible in selecting the optimal threads w.r.t
paralellizing across a col vs. launch-overheads of threads and mio
throttles
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Replace modulo with fastmodulo in `rms_norm_f32`
* Use `BinPackArguments=true` for formating function calls
Will file a separate PR to adjust .clang-format file
* Update ggml/src/ggml-cuda/common.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Use uint3 for both `fastdiv` and `fastmodulo`
The compiler seems to reliably optimize away the unused .z component in
the fastdiv use-case, see https://godbolt.org/z/rx8KPrKr3
* More constrained type declarations
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Rename fastdiv and fastmodulo variables to shared variable name
As suggest by JohannesGaessler, this increases clarity of the intended
use
* Pack fastdiv/fastmodulo constants into uint2/uint3 objects
By packing constants to be used together into a struct, we are less
likely to make errors.
* Rename function parameter of fastmodulo
`modulo_consts` is more fitting/descriptive
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Previously, the slope tensor was set to fp16 to improve efficiency.
While this worked correctly in FA, it caused precision issues in soft_max.
This change applies different data types for different operators
to balance both accuracy and performance.
* [CANN] Support eager execution mode under ACL graph compilation
Add support for running operators in eager mode while ACL graph
compilation is enabled. This allows bypassing graph execution
and directly submitting ops, which is useful for debugging and
reducing graph build overhead in certain scenarios.
Signed-off-by: noemotiovon <757486878@qq.com>
* fix typo
Signed-off-by: noemotiovon <757486878@qq.com>
* rename to acl_graph_mode
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
* vulkan: use memory budget extension to read memory usage
* fix: formatting and names
* formatting
* fix: detect and cache memory budget extension availability on init
* fix: read `budgetprops.heapBudget` instead of `heap.size` when memory budget extension is available
* style: lints
* SVE support for exponential functions
Add const notation to variable pg
* Update ggml/src/ggml-cpu/vec.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Add Integer Dot Product mul_mat_vec shader for legacy quants
* vulkan: use subgroup operations for quantize_q8_1 shader
* vulkan: add q8_1_x4 type with 128-bit alignment, use in mul_mat_vecq shader
* vulkan: use q8_1_x4 blocks in mul_mmq shader
* vulkan: do 8 calculations per invocation instead of 32 in mul_mat_vecq, similar to mul_mat_vec
* vulkan: tune mul_mat_vecq performance for Intel
* vulkan: fix quantizing issue when tensor is not divisible by 128
* vulkan: adapt integer dot mmv to mmv small m optimization (#15355)
* vulkan: allow all subgroup modes for mmv and mmvq
* vulkan: use prealloc intermediate reuse for mmvq path
* vulkan: tune mmvq for Intel, AMD GCN and Nvidia RTX 3090
* vulkan: adapt mmv quantize_y path to conditional sync logic
* vulkan: disable q8_0 mmvq on Nvidia
* vulkan: enable q8_0 on Nvidia pre-turing
* fix prealloc sync condition
* fix llvmpipe subgroup 8 issue
* ggml : WebGPU add TRANSPOSE and RESHAPE to supported ops
This commit adds support for the TRANSPOSE and RESHAPE operations in the
ggml webgpu backend.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: fix build error from ambiguous __half conversions in conv2d
Building conv2d with half precision failed because `__half` defines
multiple implicit conversion operators (to float, int, short, etc.),
causing ambiguous overload resolution when multiplying with float.
Introduce a templated `to_float` helper that explicitly converts
`__half` via `__half2float`, while passing through float unchanged.
Use this helper in conv2d accumulation to ensure unambiguous and
correct promotion to float.
Fixes some build errors with half-precision kernels on CUDA.
ggml-ci
* CUDA: Replace custom to_float helper with unified ggml_cuda_cast and add half‑>float conversion
* CUDA: Add missing convert.cuh header
* CUDA: remove unnecessary extension in ggml_cuda_cast
* CUDA: Address review comment, remove second type template argument
* CANN: fix RoPE cache issue on multi-device
RoPE cache only needs to be computed once per token.
However, in multi-device scenarios, not every device starts
computation from layer 0, which may lead to unallocated memory
issues and precision errors.
This commit records the first layer of each device to avoid
the above issues.
* CANN: Optimize first-layer detection method
* CANN: Remove trailing whitespace
* CANN: Only cache the data that can be determined as unchanged through the parameters.
* CANN: Update function comment
Exposes ggml_backend_sched_split_graph() to allow splitting the graph without allocating compute buffers and uses it to split the graph for the automatic Flash Attention check.
* vulkan: mul_mat_id coopmat2 optimizations
Add a path for when the tile fits in BN/2, similar to what we have for mul_mat.
Only call fetch_scales/store_scales once per QUANT_K block, and once at the
beginning in case start_k is not aligned.
* Also add a path for BN/4 - worth a couple more percent
This commit removes the portability_enumeration_ext variable from the
ggml_vk_instance_portability_enumeration_ext_available function as it
is initialized to false but never modified, making it redundant.
Prior to this change, we faced undefined cublasLt references when
attempting to compile 'llama-cli' with GGML_STATIC=ON on Linux.
We add linking with CUDA::cublasLt_static when CUDA version is greater
than 10.1.
* CANN(flash-attn): refactor mask handling and improve performance
1. Refactored the mask computation in Flash Attention, unified the logic without separating prefill and decode.
2. Optimized performance in non-alibi scenarios by reducing one repeat operation.
3. Updated operator management to explicitly mark unsupported cases on 310P devices and when dim is not divisible by 16.
Signed-off-by: noemotiovon <757486878@qq.com>
* [CANN]: fix review
Signed-off-by: noemotiovon <757486878@qq.com>
* [CANN]: Optimization FA BNSD to BSND
Signed-off-by: noemotiovon <757486878@qq.com>
---------
Signed-off-by: noemotiovon <757486878@qq.com>
The original implementation unconditionally returned true for this operation, leading to a failure when the tensor's first dimension (ne[0]) was not a multiple of WARP_SIZE. This caused an GGML_ASSERT(ncols % WARP_SIZE == 0) failure in ggml-sycl/norm.cpp.
This change updates the ggml_backend_sycl_device_supports_op check to correctly return true for GGML_OP_RMS_NORM only when the first dimension of the tensor is a multiple of WARP_SIZE, ensuring the operation can be performed without error.
This patch improves GEMM for FP32 Data Type on PowerPC
Implements GEMM on large blocks with configurable block size mc, nc, kc
(default: 256, 256, 256).
Packing Function optimized to access blocks as per memory layout.
GEMM Optimized to work on larger blocks.
Isolated Packing from GEMM Operations for better MMA utilization.
Verified functionality and correctness uing llama-cli and stand alone
test case (performs matmul and compares final mattrix C result with base).
Minor code refactoring changes:
Replace macro with inline function
Code Indent made consistent with 4 spaces
Performance Testing:
Observed 50% ~ 70% improvement in Prompt Processing Speed mesured using
llama-bench with Meta-Llama3-8B FP32 Model. Similar gains observed with
Mistral-7b-Instruct-v0.3 Model.
model Size Params Backend Threads Test Patch Base
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp512 98.58 60.3
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp1024 95.88 57.36
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp2048 85.46 53.26
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp4096 68.66 45.78
llama 8B all F32 29.92 GiB 8.03 B CPU 20 pp6144 57.35 40.44
25 ~ 30% improvement in llama-batched-bench with Metla-Llama3-8B in
Prompt Processing Speed for large prompts (256, 512, 1024, 2048, 4096)tokens with various batch
sizes ( 1, 2, 4, 8, 16)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* metal : optmize FA vec for large heads and sequences
* metal : adjust small-batch mul mv kernels
ggml-ci
* batched-bench : fix total speed computation
ggml-ci
* cont : add comments
ggml-ci
* metal : mul_mm_id remove hdst
* metal : remove mul_mm_id hsrc1
* metal : mul_mm_id simplify + add test
* metal : opt mul_mm_id map0
* metal : optimize mul_mm_id id gathering
* metal : mul/div opt
* metal : optimize mul_mm_id_map0
ggml-ci
* CUDA: optimize get_int_from_table_16
* CUDA: use v_perm_b32 to replace byte_perm on AMD GPUs
* revise documentation
---------
Co-authored-by: xix <xiapc@outlook.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* vulkan: use subgroup function for mul_mat_id shader even without coopmat
* vulkan: fix compile warnings
* vulkan: properly check for subgroup size control and require full subgroups for subgroup mul_mat_id
* vulkan: disable subgroup mul_mat_id on devices with subgroups < 16
The scalar FA shader already handled multiples of 8. The coopmat1 FA
shader assumed 16x16x16 and the shared memory allocations need the HSK
dimensions padded to a multiple of 16. NVIDIA's coopmat2 implementation
requires multiples of 16 for N and K, and needs the matrix dimensions
padded and loads clamped.
Store the FA pipelines in a map, indexed by the pipeline state.
* vulkan: optimize rms_norm, and allow the work to spread across multiple SMs
There are really two parts to this change:
(1) Some optimizations similar to what we have in soft_max, to unroll with
different numbers of iterations.
(2) A fusion optimization where we detect add followed by rms_norm, and make
the add shader atomically accumulate the values^2 into memory. Then the
rms_norm shader can just load that sum. This allows the rms_norm to be
parallelized across multiple workgroups, it just becomes a simple per-element
multiply.
The fusion optimization is currently only applied when the rms_norm is on a
single vector. This previously always ran on a single SM. It could apply more
broadly, but when there are other dimensions the work can already spread across
SMs, and there would be some complexity to tracking multiple atomic sums.
* Change add+rms_norm optimization to write out an array of partial sums
rather than using atomic add, to make it deterministic. The rms_norm
shader fetches a subgroup's worth in parallel and uses subgroupAdd to
add them up.
* complete rebase against fused adds - multi_add shader can also compute partial sums
* fix validation errors
* disable add_rms_fusion for Intel due to possible driver bug
* resolve against #15489, sync after clearing partial sums
Track a list of nodes that need synchronization, and only sync if the new node
depends on them (or overwrites them). This allows some overlap which can
improve performance, and centralizes a big chunk of the synchronization logic.
The remaining synchronization logic involves writes to memory other than the
nodes, e.g. for dequantization or split_k. Each of these allocations has a bool
indicating whether they were in use and need to be synced. This should be
checked before they are written to, and set to true after they are done being
consumed.
* vulkan : support ggml_mean
* vulkan : support sum, sum_rows and mean with non-contiguous tensors
* vulkan : fix subbuffer size not accounting for misalign offset
* tests : add backend-op tests for non-contiguous sum_rows
* cuda : require contiguous src for SUM_ROWS, MEAN support
* sycl : require contiguous src for SUM, SUM_ROWS, ARGSORT support
* require ggml_contiguous_rows in supports_op and expect nb00=1 in the shader
- Spread the work across the whole workgroup. Using more threads seems to
far outweigh the synchronization overhead.
- Specialize the code for when the division is by a power of two.
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Work on templating for different types in shaders
* Work on shader type generation
* Working q4_0 mul_mat and some templating for different types
* Add q4_0_f16 matmul and fix device init
* Add matmul support for basic quantization types
* Add q2_k and q3_k quantization
* Add rest of k-quants
* Get firt i-quant working
* Closer to supporting all i-quants
* Support rest of i-quants
* Cleanup code
* Fix python formatting
* debug
* Bugfix for memset
* Add padding to end of buffers on creation
* Simplify bit-shifting
* Update usage of StringView
* Add Pad Reflect 1D CUDA support
* Update ggml/src/ggml-cuda/pad_reflect_1d.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>