Commit Graph

8532 Commits

Author SHA1 Message Date
Yihao Wang 0a524f2404
CUDA & CPU: support F32 kernel type for `CONV_TRANSPOSE_2D` (#17094)
* Refactor CUDA 2D transpose implementation to support multiple kernel types and improve parameter handling

- Introduced a `conv2d_transpose_params` struct for better parameter management.
- Updated `conv2d_transpose_kernel` to be templated for different kernel types (float and half).
- Modified `ggml_cuda_conv_2d_transpose_p0` to handle both F16 and F32 kernel types.
- Enhanced test cases to validate functionality for both kernel types.

* Refactor test cases for 2D convolution transpose to support dynamic kernel types

- Updated `test_conv_transpose_2d` structure to improve parameter handling by reordering constructor arguments.
- Enhanced test case generation to iterate over kernel types, allowing for flexible testing of different configurations.
- Removed hardcoded kernel type instances in favor of a loop for better maintainability and scalability.

* Refactor ggml_compute_forward_conv_transpose_2d to support both F16 and F32 tensor types.

* Refactor conv2d transpose kernel to use a template for kernel type, enhancing flexibility for different data types.
Update test cases to include both F16 and F32 tensor types for comprehensive coverage.

* Update ggml/src/ggml-cuda/conv2d-transpose.cu

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Update ggml/src/ggml-cpu/ggml-cpu.c

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Refactor conv2d transpose implementation by removing the conv2d_transpose_params struct and dispatching with direct kernel launch.

* Enhance cpu conv2d transpose implementation by introducing a templated kernel type for improved flexibility with F16 and F32 data types.

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-03-26 10:19:14 +08:00
Adrien Gallouët c0159f9c1f
common : do not delete old files from the old cache when updating (#21000)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 22:28:04 +01:00
Saba Fallah a970515bdb
mtmd: Add DeepSeekOCR Support (#17400)
* mtmd: llama.cpp DeepSeekOCR support
init commit

* loading sam tensors

* mtmd: fix vision model processing

* deepseek-ocr clip-vit model impl

* mtmd: add DeepSeek-OCR LM support with standard attention

* mtmd: successfully runs DeepSeek-OCR LM in llama-cli

* mtmd: Fix RoPE type for DeepSeek-OCR LM.

* loading LM
testing Vision model loading

* sam warmup working

* sam erroneous return corrected

* clip-vit:  corrected cls_embd concat

* clip-vit: model convert  qkv_proj split

* corrected combining of image encoders' results

* fix: update callback for ffn_moe_weighted and add callback for attn_out in deepseek2 model

* concat image_newline and image_seperator tokens

* visual_model warmup (technically) works

* window partitioning using standard ggml ops

* sam implementation without using CPU only ops

* clip: fixed warnings

* Merge branch 'sf/deepseek-ocr' of github.com:sfallah/llama.cpp into sf/deepseek-ocr

* mtmd: fix get_rel_pos

* mtmd: fixed the wrong scaler for get_rel_pos

* image encoding technically works but the output can't be checked singe image decoding fails

* mtmd: minor changed

* mtmd: add native resolution support

* - image encoding debugged
- issues fixed mainly related wrong config like n_patches etc.
- configs need to be corrected in the converter

* mtmd: correct token order

* - dynamic resizing
- changes are concerning PR https://github.com/sfallah/llama.cpp/pull/4

* mtmd: quick fix token order

* mtmd: fix danling pointer

* mtmd: SAM numerically works

* mtmd: debug CLIP-L (vit_pre_ln)

* mtmd: debug CLIP-L & first working DeepSeek-OCR model

* mtmd : add --dsocr-mode CLI argument for DeepSeek-OCR resolution control & all native resolution modes work

* mtmd: simplify SAM patch embedding

* mtmd: adapt Pillow image resizing function

* mtmd:  simplify DeepSeek-OCR dynamic resolution preprocessing

* mtmd: remove --dsocr-mode argument

* mtmd: refactor code & remove unused helper functions

* mtmd: fix tensor names for image newlines and view separator

* clean up

* reverting automatically removed spaces

* reverting automatically removed spaces

* mtmd: fixed bad ocr check in Deepseek2 (LM)

* mtmd: support combined QKV projection in buid_vit

* using common build_attn in sam

* corrected code-branch when flash-attn disabled
enabling usage of --flash-attn option

* mtmd: minor fix

* minor formatting and style

* fixed flake8 lint issues

* minor editorconfig-check fixes

* minor editorconfig-check fixes

* mtmd: simplify get_rel_pos

* mtmd: make sam hparams configurable

* mtmd: add detailed comments for resize_bicubic_pillow

* mtmd: fixed wrong input setting

* mtmd: convert model in FP16

* mtmd: minor fix

* mtmd: remove tweak to llama-mtmd-cli & deepseek-ocr template

* fix: test-1.jpg ORC issue with small (640) resolution
setting min-resolution base (1024) max large (1280) for dynamic-resolution

* minor: editconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/17909
added new opt to tests.sh to disable flash-attn

* minor: editconfig-check fix

* testing deepseek-ocr
quick and dirty test script comparing results of Qwen2.5-VL vs DeepSeek-OCR

* quick and (potential) dirty merge with https://github.com/ggml-org/llama.cpp/pull/17909

* refactoring, one single builder function and static helpers

* added deepseek-ocr test to tests.sh

* minor formatting fixes

* check with fixed expected resutls

* minor formatting

* editorconfig-check fix

* merge with changes from https://github.com/ggml-org/llama.cpp/pull/18042

* minor
- added GLM-4.6V to big tests
- added missing deps for python test

* convert: minor fix

* mtmd: format code

* convert: quick fix

* convert: quick fix

* minor python formatting

* fixed merge build issue

* merge resolved
- fixed issues in convert
- tested several deepseek models

* minor fix

* minor

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* - removed clip_is_deepseekocr
- removed redundant RESIZE_ALGO_BICUBIC_PILLOW resize-algo
- simplified image-preprocessing
- removed/simplified debug functions

* - cleaning commented out code

* fixing instabilities issues reintroducing resize_bicubic_pillow

* - use f16 model for deepseek-ocr test
- ignore llama-arch test for deepseek-ocr

* rename fc_w --> mm_fc_w

* add links to OCR discussion

* cleaner loading code

* add missing .weight to some tensors

* add default jinja template (to be used by server)

* move test model to ggml-org

* rolling back upscale change

* Update convert_hf_to_gguf.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: bluebread <hotbread70127@gmail.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2026-03-25 19:57:40 +01:00
Adrien Gallouët 056b50c319
common : fix verbosity setup (#20989)
The verbosity threshold was set at the end of common_params_parse_ex(),
after doing many things (like downloading files..)

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:41:01 +01:00
Adrien Gallouët f2c72b8f1f
common : fix gguf selection in common_list_cached_models (#20996)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-25 19:18:06 +01:00
uvos ec54ac13a8
ci : fix parsing of vgpr counts in hip-quality-check (#20987)
* scripts: hip: gcn-cdna-vgpr-check: fix parsing of vgpr counts when an amdclang Remark block is interlieved with another from a different process

* Return warning ignore

* obay pep8 inline double space before inline commets

* add # noqa: NP100 for other prints too

* Add script changes to cause autotrigger
2026-03-25 19:00:37 +01:00
Saba Fallah 80322ebdaf
model: codefuse-ai/F2LLM-v2 support 2026-03-25 18:33:42 +01:00
Dowon 44c51e526b
model : allow causal_attn and pooling_type on all architectures (#20973)
* models : allow causal_attn and pooling_type on all architectures

* fix: move location
2026-03-25 18:12:38 +01:00
Aparna M P 1922f87c2f
snapdragon: add missing features to WoS scripts to achieve parity with ADB scripts (#20884)
* Add missing features to WoS scripts to achieve parity with ADB scripts

* Fix line-ending in run-mtmd.ps1

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>

---------

Signed-off-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-25 09:43:12 -07:00
Shreya Jain 345de3cd87
Use docker in build-android.yml (#20928)
* use docker instead of SDK separately

* fix whitespaces

* Update .github/workflows/build-android.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 09:36:27 -07:00
Aman Gupta 9c600bcd4b
llama-bench: print `-n-cpu-moe` when offloaded layers > 1 (#20984) 2026-03-25 21:17:27 +08:00
Masato Nakasaka b2704f9028
ci: Allow ninja to be used during unit test (#20742)
* Remove make dependency

* Added option to specify Ninja generator

* use ninja-build as default for several CI

* Revert "use ninja-build as default for several CI"

This reverts commit f552c4559b.

* changed use plain string rather than arrays

* Enabled ninja build by default for experimentation

* ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Enabled ninja build by default on self-hosted envs for experimentation

* ci: revert generator to ninja instead of ninja multi-config

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: install ninja-build for self-hosted workflows

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: revert ninja from self-hosted runners

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: missed one self-hosted step

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ci: fix windows ci errors from an errenous revert

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Added explicit build types for Ninja

Also reverted some needless change

* ci: use ninja multi-config for vulkan-x64 build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* added time command to measure build time

* Keeping some configs to use Ninja which show improvement

* minor fix based on review

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

* ci: rm `time` from custom containers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-25 21:00:49 +08:00
Georgi Gerganov 3fab96cd04
ci : disable self-hosted mac jobs (#20985) 2026-03-25 14:46:40 +02:00
Xuan-Son Nguyen 914eb5ff0c
jinja: fix macro with kwargs (#20960)
* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 12:22:48 +01:00
Francisco Herrera 8fc17493c3
gguf-split : clarify operation of gguf-split (#19749)
* clarify operation of gguf-split

so that you don't have to find out by trial and error

* formatting
2026-03-25 13:12:50 +02:00
Johannes Gäßler 36dafba5c4
llama: fix llama-model-saver (#20503)
* llama : add fd-based model loading via llama_model_load_from_fd

* llama : address review feedback for fd-based model loading

* llama : use FILE pointer instead of fd in public API

* llama : use FILE pointer consistently, address review feedback

* fixup

* fix tensor names

* fix llama-model-saver

* roundtrip tests

* fixup

* refactor tests

* fix prints

* fix model saving

* fix CI, disable Chameleon

* print seed

---------

Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>
2026-03-25 12:53:16 +02:00
Aleksander Grygier 69e0ecef06
webui: Fix editing assistant message without branching (#20944)
* fix: Editing assistant response without branching

* chore: update webui build output
2026-03-25 12:47:33 +02:00
Pascal 062cca58fc
Add SLEEPING status to the WebUI model selector (#20949)
* webui: handle sleeping model status, fix favourite -> favorite

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: fix optional event parameter in sleeping model onclick

* typo

* webui: restore orange sleeping indicator dot with hover unload

* chore: update webui build output

* webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore

* chore: update webui build output

* webui: fix favourite -> favorite (UK -> US spelling) everywhere

Address review feedback from WhyNotHugo

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-25 11:02:32 +01:00
yikechayedan 406f4e3f61
android : fix-pointer-dangling (#20974) 2026-03-25 11:51:26 +02:00
Neo Zhang 53dc8b59bf
sycl : fix wrong variable check by assert (#20903)
* fix wrong variable check by assert

* use GGML api
2026-03-25 11:48:37 +02:00
Sigbjørn Skjæret 403c9c9cef
ci : bump gguf publish python version (#20982) 2026-03-25 11:04:59 +02:00
Sigbjørn Skjæret 8fc85db9d2
ci : limit requirements versions (#20980)
* set requests version

* limit versions outside requirements
2026-03-25 10:55:37 +02:00
Dowon 3a60d06ad9
convert : register Qwen3Model architecture (#20967) 2026-03-25 10:37:59 +02:00
Ravi Panchumarthy abd86ef175
docs : Update OpenVINO backend docs (#20968)
* OpenVINO doc updates

* Update docs/backend/OPENVINO.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-25 10:33:51 +02:00
Georgi Gerganov 9f102a1407
models : move the token embedding norms to the first layer (#20943)
* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing
2026-03-24 17:00:30 +02:00
Aman Gupta 3fc6f1aed1
ggml-backend: re-enable graph reuse with pipeline parallelism (#20927) 2026-03-24 20:47:00 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 29771a0a4c
vendor : update cpp-httplib to 0.39.0 (#20933) 2026-03-24 13:33:33 +01:00
Adrien Gallouët 42ebce3beb
common : fix get_gguf_split_info (#20946)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 13:33:14 +01:00
BlueMöhre a94fdb090a
WebUI: fix edit msg form textarea height (#20830)
* autoresize textarea on mount

* allow textarea to grow to same height as rendered messages

* add UI build file
2026-03-24 13:17:45 +01:00
Adrien Gallouët c9dc43333f
readme : clarify MODEL_ENDPOINT usage (#20941)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 10:35:07 +01:00
Adrien Gallouët 2d2d9c2062
common : add a WARNING for HF cache migration (#20935)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 09:24:39 +01:00
nuri 92080b4396
metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 10:13:07 +02:00
Georgi Gerganov 342d6125bc
metal : add FA instantiations for HSK=512, HSV=512 (#20902) 2026-03-24 10:03:09 +02:00
Aaron Teo c2e224d829
issues: add openvino backends (#20932)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-24 14:41:10 +08:00
Adrien Gallouët 8c7957ca33
common : add standard Hugging Face cache support (#20775)
* common : add standard Hugging Face cache support

- Use HF API to find all files
- Migrate all manifests to hugging face cache at startup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check with the quant tag

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Improve error handling and report API errors

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Restore common_cached_model_info and align mmproj filtering

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Prefer main when getting cached ref

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use cached files when HF API fails

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use final_path..

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check all inputs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 07:30:33 +01:00
Aman Gupta e852eb4901
llama-fit: fix regex pattern for gate_up tensors (#20910)
* llama-fit: fix regex pattern for gate_up tensors

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-24 12:57:57 +08:00
Aldehir Rojas 312d870a89
common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912) 2026-03-23 22:21:47 -05:00
Max Krasnyansky 7cadbfce10
hexagon: general DMA and Binary Op fixes for large strides (#20918)
* hex-dma: make chained dma the default to handle newer models

This also includes some new instrumentation that we can remove later.

* hexagon: add uint32 dump helper

* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv

ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset
spans page boundaries.

* hexagon: update ssm-conv to make base-addr compute a bit easier to read

* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)

* hex-bin: fix incorrect stride logic

* hexagon: make sure repack buffs are dumped for verbose > 2

* hex-bin: consistently use dma_queue_push even for dummy dst transactions

* hex-dma: start using 2d-wide mode on v75 and up

The removes the need to deal with the 16-bit limitaion for the strides.

* hex-bin: cleanup kernel selection logic

* hex-bin: cleanup binary op core and fix transposed tensor handling

* snapdragon: update run-bench to use larger ubatch and fa-on
2026-03-23 15:33:49 -07:00
Max Krasnyansky 1fb2290a51
Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)
* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon
2026-03-23 14:57:18 -07:00
lhez 1772701f99
opencl: add q6_K gemm and gemv kernels for Adreno (#20089)
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast
2026-03-23 12:44:18 -07:00
las7 39bf0d3c6a
rpc : RCE patch (#20908) 2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen bd6992180b
contrib: add "Requirements" section to PR template (#20841)
* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-23 16:59:02 +01:00
Davi Henrique Linhares fd18364755
devops: upgraded default oneAPI version (#20731) 2026-03-23 21:47:34 +08:00
Aleksander Grygier 11fb11b901
webui: Improve chat form positioning (#20901) 2026-03-23 14:30:55 +01:00
Geo Maciolek 35b662bb5d
docs: Fix typo in reasoning flag documentation (#20780)
Tested to verify - the typo is just in the docs, not the actual flag.
2026-03-23 21:24:55 +08:00
Georgi Gerganov f93c09e267
memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887) 2026-03-23 14:08:46 +02:00
Eric Zhang 841bc203e2
docs : rerun llama-gen-docs to include new CLI args (#20892) 2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen 31a5cf4c3f
server: use httplib dynamic threads (#20817)
* server: use httplib dynamic threads

* change to n_threads_http + 1024
2026-03-23 12:22:46 +01:00
Georgi Gerganov e32d243849
ai : update gh permissions (#20895) 2026-03-23 13:21:41 +02:00
Pascal c44a932cf4
webui: fix --webui-config-file settings not applied on load (#20823)
* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output
2026-03-23 11:25:35 +01:00