This change enables the repack stage to utilize the user-specified
thread count, ensuring that both the logical thread IDs and the total
number of threads remain consistent between the repack and inference
stages.
In a NUMA architecture where the `--numa distribute` parameter is used,
logical threads are pinned to specific physical NUMA nodes. By aligning
the thread configuration across these two stages, we can fully leverage
the operating system's "first-touch" memory allocation policy:
1. Repack Stage: Logical thread i (bound to NUMA node j) is responsible
for repacking and writing the weight data. Since the "first touch"
occurs within this thread, the corresponding physical memory is
allocated on node j.
2. Inference Stage: The same logical thread i (still bound to node j)
reads these weights. Since the data already resides on the local
node, low-latency local memory access is achieved.
Without ensuring consistency in the number of threads, data may be
randomly allocated to mismatched nodes, resulting in significant
cross-node access overhead during inference.
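The effect of a thread-count mismatch can be illustrated with a small model. This is only a sketch: it assumes rows are split evenly across threads and that logical threads are pinned to nodes round-robin. Both the pinning rule and the helper names are hypothetical and are not taken from the ggml sources.

```python
def owner_node(row: int, n_threads: int, n_rows: int, n_nodes: int) -> int:
    """NUMA node of the logical thread that handles `row`.

    Assumptions (not from the source): rows are split into equal
    contiguous blocks, and thread i is pinned to node i % n_nodes.
    """
    rows_per_thread = (n_rows + n_threads - 1) // n_threads  # ceil division
    thread_id = row // rows_per_thread
    return thread_id % n_nodes


def remote_rows(n_rows: int, repack_threads: int, infer_threads: int, n_nodes: int) -> int:
    """Count rows first-touched on one node during repack but read from another."""
    return sum(
        owner_node(r, repack_threads, n_rows, n_nodes)
        != owner_node(r, infer_threads, n_rows, n_nodes)
        for r in range(n_rows)
    )


# Same thread count in both stages: every row stays local.
print(remote_rows(1024, repack_threads=8, infer_threads=8, n_nodes=2))
# Different thread counts: some rows are read across nodes.
print(remote_rows(1024, repack_threads=4, infer_threads=8, n_nodes=2))
```

Under these assumptions, matching thread counts leave no remote rows, while repacking with 4 threads and running inference with 8 makes half of the rows remote.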
Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
When using the repack buffer type, the physical memory allocation is dictated
by the first-touch policy. Since the main thread performs the write
operations, memory is often allocated on a single NUMA node, leading to
uneven weight distribution.
Multi-threaded repack can alleviate this problem, but the threads are
not bound to NUMA nodes.
This patch applies the same thread affinity strategy (--numa distribute)
to the repacking phase. By binding the repack threads to the same nodes
as the compute threads, we ensure that weights are written (and thus
allocated) on the local NUMA node, minimizing cross-node memory access
during inference.
Performance on Intel Xeon Silver 4514Y (32 core):
qwen3 8B Q4_K: 19.39 -> 26.92 t/s (+39%)
qwen3 32B Q4_K: 4.99 -> 7.38 t/s (+48%)
Signed-off-by: Jianhui Zhou <jonaszhou@zhaoxin.com>
Some backends, such as the Metal backend, depend on CMAKE_RUNTIME_OUTPUT_DIRECTORY to create temporary files.
If CMAKE_RUNTIME_OUTPUT_DIRECTORY is not set, CMake can fail with errors such as "permission denied", because it tries to copy files to the root directory.
This PR sets a default path for CMAKE_RUNTIME_OUTPUT_DIRECTORY when it is not defined.
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
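The three phases above can be sketched on the CPU as follows. This is a reference model of the arithmetic only, not the shader: each "chunk" stands in for one workgroup, the lists `max_partials` and `sum_partials` stand in for the temp buffers, and the function name is hypothetical.

```python
import math


def softmax_split(row, n_chunks):
    """Softmax of one row, computed in three phases over `n_chunks` chunks."""
    if not row:
        return []
    size = (len(row) + n_chunks - 1) // n_chunks  # ceil division
    chunks = [row[i:i + size] for i in range(0, len(row), size)]

    # Phase 1: each chunk writes its partial max.
    max_partials = [max(c) for c in chunks]

    # Phase 2: reduce the max partials, then each chunk writes
    # its partial sum of exp(x - max).
    row_max = max(max_partials)
    sum_partials = [sum(math.exp(x - row_max) for x in c) for c in chunks]

    # Phase 3: reduce the sum partials, then each chunk writes
    # its scaled results.
    total = sum(sum_partials)
    return [math.exp(x - row_max) / total for c in chunks for x in c]


probs = softmax_split([1.0, 2.0, 3.0, 4.0, 5.0], n_chunks=2)
print(sum(probs))
```

Subtracting the row-wide max before exponentiating keeps the result numerically stable and makes the chunked result agree with a single-pass softmax up to rounding.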
* model-conversion : use CONVERTED_MODEL value for converted model [no ci]
This commit updates the model verification scripts to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.
The motivation for this is that currently, if the converted model file
name differs from the original model directory/name, the verification
scripts will look for the wrong .bin files that were generated when
running the models.
For example, the following steps were not possible:
```console
(venv) $ huggingface-cli download google/gemma-3-270m-it --local-dir ggml-org/gemma-3-270m
(venv) $ python3 convert_hf_to_gguf.py ggml-org/gemma-3-270m --outfile test-bf16.gguf --outtype bf16
(venv) $ cd examples/model-conversion/
(venv) $ export MODEL_PATH=../../ggml-org/gemma-3-270m
(venv) $ export CONVERTED_MODEL=../../test-bf16.gguf
(venv) $ make causal-verify-logits
...
Data saved to data/llamacpp-test-bf16.bin
Data saved to data/llamacpp-test-bf16.txt
Error: llama.cpp logits file not found: data/llamacpp-gemma-3-270m.bin
Please run scripts/run-converted-model.sh first to generate this file.
make: *** [Makefile:62: causal-verify-logits] Error 1
```
With the changes in this commit, the above steps will now work as
expected.
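The mismatch in the log above comes down to which path the file name is derived from. A minimal sketch of that derivation follows. The helper names are hypothetical and the real scripts may build the name differently. Only the `data/llamacpp-<name>.bin` pattern is taken from the output shown above.

```python
from pathlib import Path


def model_basename(path: str) -> str:
    """Last path component, with a trailing .gguf extension removed if present."""
    name = Path(path).name
    return name[: -len(".gguf")] if name.endswith(".gguf") else name


def logits_file(model_ref: str, data_dir: str = "data") -> str:
    """Logits file name derived from a model path (pattern taken from the log)."""
    return f"{data_dir}/llamacpp-{model_basename(model_ref)}.bin"


# Derived from MODEL_PATH: the name the verification step looked for.
print(logits_file("../../ggml-org/gemma-3-270m"))
# Derived from CONVERTED_MODEL: the name that was actually written.
print(logits_file("../../test-bf16.gguf"))
```

Deriving the name from CONVERTED_MODEL in both the run and verify steps keeps the two in agreement, regardless of what the original model directory is called.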
* clip: move model cgraphs into their own files
* more explicit enums
* fix linux build
* fix naming
* missing headers
* nits: add comments for contributors
* ggml-cpu: fix RISC-V Q4_0 repack select and RVV feature reporting
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* using the name VLEN instead of CNT
* Update ggml/include/ggml-cpu.h
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit removes the maximum difference check from the
compare-logits.py which would stop early if the difference between
the logits exceeded a threshold.
The motivation for removing this is that it can be useful to be able to
get the complete log for debugging/reporting purposes.
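As an illustration of reporting without an early exit, a comparison can compute all of its statistics and leave the pass/fail decision to the caller. The function and the metrics below are assumptions for the sketch. The source does not state which metrics compare-logits.py reports.

```python
def compare_logits(reference, candidate):
    """Return summary statistics for two logit vectors without aborting early.

    Hypothetical helper: the metric set is an assumption, not the script's.
    """
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    if not reference:
        raise ValueError("logit vectors must not be empty")

    diffs = [abs(a - b) for a, b in zip(reference, candidate)]
    ref_argmax = max(range(len(reference)), key=reference.__getitem__)
    cand_argmax = max(range(len(candidate)), key=candidate.__getitem__)

    # Every statistic is computed, even when the max difference is large.
    return {
        "max_diff": max(diffs),
        "mean_diff": sum(diffs) / len(diffs),
        "argmax_match": ref_argmax == cand_argmax,
    }


report = compare_logits([0.0, 1.0, 5.0], [0.0, 1.5, 2.0])
for key, value in report.items():
    print(f"{key}: {value}")
```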
* enable mmf for RDNA3
* disable mmf for some shapes
* move some mmvf to mmf
* more mmvf to mmf
* 3 is good in mmvf
---------
Co-authored-by: zhang hui <you@example.com>
* webui: add search field to model selector and fix mobile viewport overflow
* webui: simplify model search style and code
* refactor: Search Input component & consistent UI for Models Selector search
* feat: Use Popover component + improve interactions
* fix: Fetching props for only loaded models in ROUTER mode
* webui: prevent models selector popover from overflowing viewport
Use Floating UI's auto-positioning with 50dvh height limit and proper
collision detection instead of forcing top positioning. Fixes overflow
on desktop and mobile keyboard issues.
* webui: keep search field near trigger in models selector
Place search at the 'near end' (closest to trigger) by swapping layout
with CSS flexbox order based on popover direction. Prevents input from
moving during typing as the list shrinks.
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Extended TRI
* Fix whitespace
* chore: update webui build output
* Just use cuBLAS for everything...
* Merge both versions
* Remove incorrect imports causing failures for CI
* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"
* Defines for hipBlas
* Aaaand MUSA defines...
* I hate this job...
* Stupid typo...
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>