Commit Graph

7835 Commits

Author SHA1 Message Date
Ed Addario cb4777e40f
Minor cosmetic code change 2026-01-17 12:01:02 +00:00
Ed Addario b6fc86b32b
Display NaN if statistic is uninterpretable 2026-01-17 11:43:29 +00:00
Ed Addario 2fd301e02c
Don't display layer statistics if there are gaps in the sequence 2026-01-17 11:42:25 +00:00
Ed Addario 91d31bd732
Refactor variable name 2026-01-17 11:41:02 +00:00
Ed Addario 4e5bf4187a
Merge branch 'master' into imatrix 2026-01-11 18:37:33 +00:00
Ed Addario a297a158e3
Fix typo 2026-01-11 18:34:10 +00:00
Ed Addario c88f288df9
Update README.md 2026-01-11 18:25:40 +00:00
Ed Addario fdc2def79c
Refactor show_statistics() 2026-01-11 17:41:04 +00:00
Ed Addario e69058c6ec
Refactor compute_layer_statistics() 2026-01-11 17:40:29 +00:00
Ed Addario 309dc1231a
Refactor compute_tensor_statistics() 2026-01-11 17:38:47 +00:00
Ed Addario 395367f210
Refactor compute_vector_statistics() 2026-01-11 17:37:17 +00:00
Ed Addario d488bbb7c7
Refactor compute_tensor_averages() 2026-01-11 17:36:10 +00:00
Ed Addario 6d82fa825a
Add intermediate computation variables and Pearson 2026-01-11 17:34:05 +00:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
Xuan-Son Nguyen 4b060bf240
security: make it clear about subtopics in server (#18754)
* security: make it clear about subtopics in server

* exclude DoS
2026-01-11 16:51:03 +01:00
Daniel Bevenius 9789e28459
debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check (#18692)
* debug : include LLAMA_POOLING_TYPE_UNSPECIFIED in pooling check

This commit updates the pooling check in the debug example to
also include LLAMA_POOLING_TYPE_UNSPECIFIED and not just
LLAMA_POOLING_TYPE_NONE.

* debug : normalize both pooled and token embeddings

This commit updates debug.cpp to normalize embeddings for both pooled
and non-pooled outputs. For pooled embeddings, normalization is applied
to the single vector, and for non-pooled embeddings, normalization is
applied to each token embedding vector individually.

The motivation for this is to enable non-pooled embeddings to be
normalized which was not possible previously.
2026-01-11 16:34:41 +01:00
Georgi Gerganov 84ae04f163
tests : refactor test-backend-sampler (#18753)
* tests : use "auto", use std::string

* tests : refactor test-backend-sampler.cpp

* cmake : remove redundant declarations

* ci : use smaller model

* tests : add struct test_params

* tests : reduce logit bias 100.0f -> 10.0f
2026-01-11 17:31:03 +02:00
Xuan-Son Nguyen 506bb6e010
model: try to improve Qwen3 Next (#18683)
* qwen3next: simplify qkvz projection

* use ggml_swiglu_split

* revert swiglu_split, but remove redundant repeat()

* fix missing reshape

* rm 2 redundant transposes

* move mul_mat(k,q) to outside of chunking

* rm redundant cont

* improve g_cs_chunk

* add comments about no cont

* use std::pair instead of ggml_concat

* vectorize key_gdiff calculation

* rm unused tensor

* avoid ggml_concat inside loop

* bring back ggml_concat as it may not work on other backend

* nits
2026-01-11 12:53:33 +01:00
thom-dev-fr 79456a690a
readme : update UIs (#18751) 2026-01-11 13:46:50 +02:00
Xuan-Son Nguyen 28068af789
security: narrow down the scope of what we consider a vulnerability (#18752)
* security: narrow down the scope of what we consider a vulnerability

* fix typo
2026-01-11 12:23:36 +01:00
shaofeiqi 707cbafcaa
opencl: add SOFTPLUS op support (#18726) 2026-01-10 21:57:44 -08:00
Aman Gupta b137718878
test-backend-ops: fix mxfp4 tests on blackwell (#18736) 2026-01-11 01:12:57 +08:00
Johannes Gäßler d2ff4e23ac
HIP: adjust RDNA3.5 MMQ kernel selction logic (#18666) 2026-01-10 17:19:01 +01:00
Perry Naseck 657a2e644b
cmake : update blas logic (#18205) 2026-01-10 18:00:54 +02:00
Georgi Gerganov f307926482
server : adjust unified KV cache tests (#18716) 2026-01-10 17:51:56 +02:00
Sigbjørn Skjæret 7fdc8c893d
scripts : follow api redirects in pr2wt.sh (#18739) 2026-01-10 16:04:05 +01:00
Xuan-Son Nguyen 23f82f2420
preset: allow named remote preset (#18728)
* preset: allow named remote preset

* nits: fix docs

* cont docs
2026-01-10 15:12:29 +01:00
Aaron Teo 2656c0d265
docs(ggml): update backend ops (#18734)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-01-10 18:48:17 +08:00
Michael Wand 600a366478
Corrected: changed s13 = src1->nb[3] instead of nb[2] (#18724) 2026-01-10 10:16:07 +01:00
Adrien Gallouët ea23c15990
common : add --license to display embedded licenses (#18696)
This commit introduces a mechanism to embed all licenses directly
into the compiled binaries.

This eliminates the need to distribute separate LICENSE files alongside
the executable, making the binaries self-contained and simplifying
deployment.
2026-01-10 09:46:24 +01:00
Xuan-Son Nguyen 9ac2693a30
server: fix n_cmpl not skipping processing prompt (#18663)
* server: fix n_cmpl not skipping processing

* fix infinite loop on empty batch

* cont : init child samplers + modify child logic

* cont : cleanup

* cont : improve n_cmpl logic

- launch the parent task first so it finds the slot with best cache
- parent task waits for child tasks to be launched
- when a child task finishes - remove its cache

* cont : remove redundant function

* cont : reduce parent checks

* fix : nullptr task dereference

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-10 00:00:41 +01:00
Simranjeet Singh a61c8bc3bf
mtmd: Add Gemma3n multimodal support with MobileNetV5 vision encoder (#18256)
* Add Gemma3nVisionModel - MobileNetV5 vision encoder convertor to convert_hf_to_gguf.py. Add gemma3n to vision projectors in gguf-py/gguf/constants.py.

* Add mobilenetv5 impl

* Fix comments, remove unused vars

* Fix permute and remove transpose of projection weights

* Fix comments, remove debugging prints from hf_to_gguf

* 1. Hard-code image_mean = 0 and image_std = 1
2. Use available tensor mapping logic
3. Remove redundant chat template replacement of soft tokens placeholder with media placeholder

* 1. Move mobilenetv5 helpers declarations to `clip_graph_mobilenetv5` struct and definitions to mobilenetv5.cpp
2.Remove unused `clip_is_gemma3n` func declarations and definitions
3. Remove redundant `rescale_image_u8_to_f32` func and use `normalize_image_u8_to_f32` with zero mean and unit std
4. Calculate n_patches using image_size / patch_size

* Remove obsolete comments

* - convert_hf_to_gguf.py & constants.py & tensor_mapping.py: Use explicit mapping: Custom map for double indexed blocks and tensor_mapping.py for rest
- convert_hf_to_gguf.py: Unsqueeze Stem Bias and Layer scale tensors to correct shape while converting to gguf
- mobilenetv5.cpp: Remove explicit reshaping of Stem Bias and Layer scale which are now handled while converting to gguf, replace fprintf with LOG_*
- clip.cpp: Remove unused embedding and hard_emb_norm tensor loading

* - Rename tensors to v.conv..., v.blk..., v.msfa... to better align with already existing terminology

* Fix stem conv bias name

* Remove explicit handling of bias term for stem conv

* - Change order of addition in "project_per_layer_inputs" to support broadcasting of vision inp_per_layer
- Simplify the vision embeddings path of "get_per_layer_inputs" to output [n_embd_altup, n_layer, 1], broadcastable

* clean up conversion script

* fix code style

* also preserve audio tensors

* trailing space

* split arch A and V

* rm unused gemma3 func

* fix alignment

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-01-09 23:42:38 +01:00
shaofeiqi 593da7fa49
opencl: add EXPM1 op (#18704) 2026-01-09 10:13:13 -08:00
Reese Levine 9e41884dce
Updates to webgpu get_memory (#18707) 2026-01-09 08:17:18 -08:00
Pascal ec8fd7876b
Webui/file upload (#18694)
* webui: fix restrictive file type validation

* webui: simplify file processing logic

* chore: update webui build output

* webui: remove file picker extension whitelist (1/2)

* webui: remove file picker extension whitelist (2/2)

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* fix: update ChatForm storybook test after removing accept attribute

* chore: update webui build output

* refactor: more cleanup

* chore: update webui build output
2026-01-09 16:45:32 +01:00
Asbjørn Olling a180ba78c7
cmake: only build cli when server is enabled (#18670) 2026-01-09 16:43:26 +01:00
Georgi Gerganov 53eb9435da
server : fix timing of prompt/generation (#18713) 2026-01-09 12:59:50 +02:00
Georgi Gerganov d3435efc8a
scripts : pr2wt.sh reset to remote head (#18695)
* scripts : pr2wt.sh reset to remote head

* cont : cleaner

* cont : restore --set-upstream-to
2026-01-09 12:16:40 +02:00
Georgi Gerganov f5f8812f7c
server : use different seeds for child completions (#18700)
* server : use different seeds for child completions

* cont : handle default seed

* cont : note
2026-01-09 09:33:50 +02:00
Xuan-Son Nguyen 8ece3836b4
common: support remote preset (#18520)
* arg: support remote preset

* proof reading

* allow one HF repo to point to multiple HF repos

* docs: mention about multiple GGUF use case

* correct clean_file_name

* download: also return HTTP status code

* fix case with cache file used

* fix --offline option
2026-01-08 22:35:40 +01:00
Aaron Teo 046d5fd44e
llama: use host memory if device reports 0 memory (#18587) 2026-01-09 05:34:56 +08:00
Masashi Yoshimura 480160d472
ggml-webgpu: Fix GGML_MEM_ALIGN to 8 for emscripten. (#18628)
* Fix GGML_MEM_ALIGN to 8 for emscripten.

* Add a comment explaining the need for GGML_MEM_ALIGN == 8 in 64-bit wasm with emscripten
2026-01-08 08:36:42 -08:00
Reese Levine 15bff84bf5
ggml webgpu: initial flashattention implementation (#18610)
* FlashAttention (#13)

* Add inplace softmax

* Move rms_norm to split row approach

* Update debug for supports_op

* clean up debug statements

* neg f16xf32xip builds and runs, havent actually ran a model that uses neg kernel yet though

* neg passes backend test

* unary operators pass ggml tests

* rms_norm double declaration bug atoned

* abides by editor-config

* removed vestigial files

* fixed autoconfig

* All operators (inlcluding xielu) working

* removed unnecesarry checking if node->src[1] exists for unary operators

* responded and dealt with PR comments

* implemented REPL_Template support and removed bug in unary operators kernel

* formatted embed wgsl and ggml-webgpu.cpp

* Faster tensors (#8)

Add fast matrix and matrix/vector multiplication.

* Use map for shader replacements instead of pair of strings

* Wasm (#9)

* webgpu : fix build on emscripten

* more debugging stuff

* test-backend-ops: force single thread on wasm

* fix single-thread case for init_tensor_uniform

* use jspi

* add pthread

* test: remember to set n_thread for cpu backend

* Add buffer label and enable dawn-specific toggles to turn off some checks

* Intermediate state

* Fast working f16/f32 vec4

* Working float fast mul mat

* Clean up naming of mul_mat to match logical model, start work on q mul_mat

* Setup for subgroup matrix mat mul

* Basic working subgroup matrix

* Working subgroup matrix tiling

* Handle weirder sg matrix sizes (but still % sg matrix size)

* Working start to gemv

* working f16 accumulation with shared memory staging

* Print out available subgroup matrix configurations

* Vectorize dst stores for sg matrix shader

* Gemv working scalar

* Minor set_rows optimization (#4)

* updated optimization, fixed errors

* non vectorized version now dispatches one thread per element

* Simplify

* Change logic for set_rows pipelines

---------

Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Comment on dawn toggles

* Working subgroup matrix code for (semi)generic sizes

* Remove some comments

* Cleanup code

* Update dawn version and move to portable subgroup size

* Try to fix new dawn release

* Update subgroup size comment

* Only check for subgroup matrix configs if they are supported

* Add toggles for subgroup matrix/f16 support on nvidia+vulkan

* Make row/col naming consistent

* Refactor shared memory loading

* Move sg matrix stores to correct file

* Working q4_0

* Formatting

* Work with emscripten builds

* Fix test-backend-ops emscripten for f16/quantized types

* Use emscripten memory64 to support get_memory

* Add build flags and try ci

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* Remove extra whitespace

* Move wasm single-thread logic out of test-backend-ops for cpu backend

* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan

* Refactored pipelines and workgroup calculations (#10)

* refactored pipelines

* refactored workgroup calculation

* removed commented out block of prior maps

* Clean up ceiling division pattern

---------

Co-authored-by: Neha Abbas <nehaabbas@eduroam-169-233-141-223.ucsc.edu>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>

* Start work on flash attention

* Shader structure set up (many bugs still)

* debugging

* Working first test

* Working with head grouping, head sizes to 128, logit softcap, mask/sinks enabled, f32

* Generalize softmax to work with multiple subgroups, f16 accumulation, mask shared memory tiling

* Start work on integrating pre-wgsl

* Separate structs/initial shader compilation library into separate files

* Work on compilation choices for flashattention

* Work on subgroup matrix/tile size portability

* subgroup size agnostic online softmax

* Cleanups, quantization types

* more cleanup

* fix wasm build

* Refactor flashattention to increase parallelism, use direct loads for KV in somce cases

* Checkpoint

* formatting

* Update to account for default kv cache padding

* formatting shader

* Add workflow for ggml-ci webgpu

* Try passing absolute path to dawn in ggml-ci

* Avoid error on device destruction, add todos for proper cleanup

* Fix unused warning

* Forgot one parameter unused

* Move some flashattn computation to f32 for correctness
2026-01-08 08:23:39 -08:00
Jeff Bolz 2524c26164
vulkan: fix push constant size for quantize_q8_1 (#18687)
I added an assert to catch further mismatches, and it found several.
Fix those, too.
2026-01-08 15:40:58 +01:00
Jeff Bolz cb14b06995
vulkan: optimize ssm_scan (#18630)
* vulkan: optimize ssm_scan

* fix warp vs subgroup naming
2026-01-08 15:16:54 +01:00
Adrien Gallouët 55abc39355
vendor : update cpp-httplib to 0.30.0 (#18660)
* vendor : update cpp-httplib to 0.30.0
* common : allow custom headers when downloading
2026-01-08 13:53:54 +01:00
Georgi Gerganov f2f6c88067
scripts : support chaining commands in pr2wt.sh (#18671) 2026-01-08 13:40:23 +02:00
도로로도로또 945bf10627
metal : add MoE kernel specialization for ne20=5 (#18667)
Add template specialization for kernel_mul_mm_id_map0 with ne20=5
to support models using 5 active experts (e.g., VAETKI).
2026-01-08 12:37:45 +02:00
Johannes Gäßler 64848deb18
llama-fit-params: free memory target per device (#18679) 2026-01-08 10:07:58 +01:00
Doctor Shotgun 9a5724dee2
ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH (#18535)
* ggml: add env var GGML_OP_OFFLOAD_MIN_BATCH
* makes the min_batch_size for triggering op offload configurable via env var, defaulting to the prior hardcoded value of 32

* ggml: read GGML_OP_OFFLOAD_MIN_BATCH once and store to dev ctx

* cann: forward declaration of device context struct

* cann: move offload op check after device context declaration

* cuda: fix whitespace

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-08 11:03:21 +02:00