This change enables the repack stage to utilize the user-specified
thread count, ensuring that both the logical thread IDs and the total
number of threads remain consistent between the repack and inference
stages.
In a NUMA architecture where the `--numa distribute` parameter is used,
logical threads are pinned to specific physical NUMA nodes. By aligning
the thread configuration across these two stages, we can fully leverage
the operating system's "first-touch" memory allocation policy:
1. Repack Stage: Logical thread i (bound to NUMA node j) is responsible
for repacking and writing the weight data. Since the "first touch"
occurs within this thread, the corresponding physical memory is
allocated on node j.
2. Inference Stage: The same logical thread i (still bound to node j)
reads these weights. Since the data already resides on the local
node, low-latency local memory access is achieved.
Without ensuring consistency in the number of threads, data may be
randomly allocated to mismatched nodes, resulting in significant
cross-node access overhead during inference.
Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
When using repack buffer type, the physical memory allocation is dictated
by the first-touch policy. Since the main thread performs the write
operations, memory is often allocated on a single NUMA node, leading to
uneven weight distribution.
Multi-threaded repack can alleviate this problem, but the threads are
not bound to NUMA nodes.
This patch applies the same thread affinity strategy (--numa distribute)
to the repacking phase. By binding the repack threads to the same nodes
as the compute threads, we ensure that weights are written (and thus
allocated) on the local NUMA node, minimizing cross-node memory access
during inference.
Performance on Intel Xeon Silver 4514Y (32 core):
qwen3 8B Q4_K: 19.39 -> 26.92 t/s (+39%)
qwen3 32B Q4_K: 4.99 -> 7.38 t/s (+48%)
Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
Some backend depends on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary file like metal backened.
Missing CMAKE_RUNTIME_OUTPUT_DIRECTORY will cause some cmake error like permission denied (try to copy file to root).
This PR wants to setup a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it does not exist.
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
* ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* using the name VLEN instead of CNT
* Update ggml/include/ggml-cpu.h
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* enable mmf for RDNA3
* disable mmf for some shape
* move some mmvf to mmf
* more mmfv to mmf
* 3 is good in mmvf
---------
Co-authored-by: zhang hui <you@example.com>
* Extended TRI
* Fix whitespace
* chore: update webui build output
* Just use cuBLAS for everything...
* Merge both versions
* Remove incorrect imports causing failures for CI
* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"
* Defines for hipBlas
* Aaaand MUSA defines...
* I hate this job...
* Stupid typo...
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* fix test failure
* fix: correct scaling calculations in rope_cache_init
* fix: optimize element copying in rope_hex_f32 using memcpy
* fix: optimize loop boundaries in rope_hex_f32 for better performance
* feat: add profiling macros for performance measurement in operations
* tests: update barrier test to check for race condition in active threads
* cpu: combine n_graph and n_threads into a single atomic update
* tests: add multi-graph test for test_barrier
* feat: Add a batched version of ssm_conv
This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Optimized SSM_SCAN kernel for metal
This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Add test-backend-ops perf tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Real representitive tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use function constant for ssm_conv batch size
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: backend op tests for ssm_scan from granite4 1b-h
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: remove commented out templates
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: float4 version of ssm_conv_batched
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing ggml_metal_cv_free
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix: Provide macos-specific backtrace printing to avoid terminal death
Branch: MacOSSafeBacktrace
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace
Branch: MacOSSafeBacktrace
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* cann: add support for partial RoPE and Vision mode
Add support for two important RoPE variants: partial rotation (rope_dims < ne0)
and Vision mode rotation.
1. Support for partial RoPE (rope_dims < ne0):
- Split tensor into head (first rope_dims dimensions) and tail portions
- Apply rotation only to head portion using RotaryPositionEmbedding operator
- Copy unrotated tail portion directly from source to destination
- Handle both contiguous and non-contiguous tensor layouts
2. Support for Vision mode (GGML_ROPE_TYPE_VISION):
- Set rope_dims = ne0 for Vision mode to rotate entire tensor
- Vision mode pairs dimension i with dimension i+n_dims (where n_dims = ne0/2)
- No tail handling needed since entire tensor is rotated
Implementation details:
- Use has_tail flag to determine execution path: head/tail splitting when
rope_dims < ne0, or full tensor rotation when rope_dims == ne0
- Support both F32 and F16 data types with intermediate F32 conversion
- Copy non-contiguous tensors to contiguous buffers before calling
RotaryPositionEmbedding operator for compatibility
- Improve cache invalidation logic to include rope_dims and indep_sects
parameters
These enhancements enable CANN backend to handle various RoPE configurations
used in modern vision-language models and models with partial rotation.
* cann: fix review comment
* ggml-cuda: optimize solve_tri_f32_fast and fix stride handling
- Switch from using shared memory for the RHS/solution matrix to a register-based approach (x_low, x_high), reducing shared memory pressure and bank conflicts.
- Implement explicit `fmaf` instructions for the reduction loop.
- Update kernel arguments to pass strides in bytes rather than elements to align with standard ggml tensor arithmetic (casting to `char *` before addition).
- Remove unused `MAX_K_FAST` definition.
* Small cleanup
* Remove comments in solve_tri.cu
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Use const for variables in solve_tri.cu
* Replace fmaf with more readable code
* remove last fmaf
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* cmake: enable RISC-V zihintpause extension for Spacemit builds
* readme : add ZIHINTPAUSE support for RISC-V
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* Optimize Vulkan shader for matrix-vector multiplication
* Revert changes on compute_outputs and main
Refactor compute_outputs to handle remaining rows correctly.
* Fix trailing whitespace
* vulkan: perf_logger improvements
- Move perf_logger from device to ctx.
- Add an env var to control the frequency we dump the stats. If you set a very
large value, it just dumps when the ctx is destroyed.
- Add a fusion info string to the tracking, only log one item per fused op.
- Fix MUL_MAT_ID flops calculation.
* fix vector sizes
* Feat: Added vulkan circular tiling support
* Feat: Added cpu circular
* Feat: Added cuda kernels
* Added tests
* Added tests
* Removed non-pad operations
* Removed unneded changes
* removed backend non pad tests
* Update test-backend-ops.cpp
* Fixed comment on pad test
* removed trailing whitespace
* Removed unneded test in test-backend-ops
* Removed removed test from calls
* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp
Co-authored-by: Ruben Ortlam <picard12@live.de>
* Fixed alignment
* Formatting
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Format pad
* Format
* Clang format
* format
* format
* don't change so much stuff
* clang format and update to bool
* fix duplicates
* don't need to fix the padding
* make circular bool
* duplicate again
* rename vulkan to wrap around
* Don't need indent
* moved to const expr
* removed unneded extra line break
* More readable method calls
* Minor wording changes
* Added final newline
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added circular pad ext tests
* Gate non circular pad devices
* Cleaned gating of non-circular pad devices
---------
Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Improve error handling for search path existence checks
Refactor existence checks for search paths using std::error_code to handle potential errors.
* Improve cache file existence check with error code
Update fs::exists to use std::error_code for error handling.
* Simplify existence check for search paths
Simplify existence check for search paths
* Fix logging path in error message for posix_stat
* Update ggml/src/ggml-backend-reg.cpp
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Adapt to the coding standard
---------
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
The MoE models have a mul_mat_vec with very small m (32, 64, 128) right before
the topk_moe selection. Running multiple rows per wg doesn't utilize the SMs
well. I think even for larger m, f32 is so bandwidth-limited that running
multiple rows doesn't help.
* Fix shader to support 2D workgroup mapping to a single subgroup
* Set required_subgroup_size
topk_moe shader requires static WARP_SIZE and actual subgroup size to match
* vulkan: Reduce temporary memory usage for TOP_K
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
* vulkan: fix top_k bug when there are ties in the input
I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).
I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
* Update tests/test-backend-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>