Commit Graph

8520 Commits

Author SHA1 Message Date
Georgi Gerganov 3fab96cd04
ci : disable self-hosted mac jobs (#20985) 2026-03-25 14:46:40 +02:00
Xuan-Son Nguyen 914eb5ff0c
jinja: fix macro with kwargs (#20960)
* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2026-03-25 12:22:48 +01:00
Francisco Herrera 8fc17493c3
gguf-split : clarify operation of gguf-split (#19749)
* clarify operation of gguf-split

so that you don't have to find out by trial and error

* formatting
2026-03-25 13:12:50 +02:00
Johannes Gäßler 36dafba5c4
llama: fix llama-model-saver (#20503)
* llama : add fd-based model loading via llama_model_load_from_fd

* llama : address review feedback for fd-based model loading

* llama : use FILE pointer instead of fd in public API

* llama : use FILE pointer consistently, address review feedback

* fixup

* fix tensor names

* fix llama-model-saver

* roundtrip tests

* fixup

* refactor tests

* fix prints

* fix model saving

* fix CI, disable Chameleon

* print seed

---------

Co-authored-by: Siddhesh2377 <siddheshsonar2377@gmail.com>
2026-03-25 12:53:16 +02:00
Aleksander Grygier 69e0ecef06
webui: Fix editing assistant message without branching (#20944)
* fix: Editing assistant response without branching

* chore: update webui build output
2026-03-25 12:47:33 +02:00
Pascal 062cca58fc
Add SLEEPING status to the WebUI model selector (#20949)
* webui: handle sleeping model status, fix favourite -> favorite

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/models/ModelsSelectorOption.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: fix optional event parameter in sleeping model onclick

* typo

* webui: restore orange sleeping indicator dot with hover unload

* chore: update webui build output

* webui: move stopPropagation into ActionIcon onclick, remove svelte-ignore

* chore: update webui build output

* webui: fix favourite -> favorite (UK -> US spelling) everywhere

Address review feedback from WhyNotHugo

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
2026-03-25 11:02:32 +01:00
yikechayedan 406f4e3f61
android : fix-pointer-dangling (#20974) 2026-03-25 11:51:26 +02:00
Neo Zhang 53dc8b59bf
sycl : fix wrong variable check by assert (#20903)
* fix wrong variable check by assert

* use GGML api
2026-03-25 11:48:37 +02:00
Sigbjørn Skjæret 403c9c9cef
ci : bump gguf publish python version (#20982) 2026-03-25 11:04:59 +02:00
Sigbjørn Skjæret 8fc85db9d2
ci : limit requirements versions (#20980)
* set requests version

* limit versions outside requirements
2026-03-25 10:55:37 +02:00
Dowon 3a60d06ad9
convert : register Qwen3Model architecture (#20967) 2026-03-25 10:37:59 +02:00
Ravi Panchumarthy abd86ef175
docs : Update OpenVINO backend docs (#20968)
* OpenVINO doc updates

* Update docs/backend/OPENVINO.md

Co-authored-by: Aaron Teo <taronaeo@gmail.com>

---------

Co-authored-by: Aaron Teo <taronaeo@gmail.com>
2026-03-25 10:33:51 +02:00
Georgi Gerganov 9f102a1407
models : move the token embedding norms to the first layer (#20943)
* models : move the token embedding norms to the first layer

* cont : fix LLM_TENSOR_CONV1D + fix il indexing
2026-03-24 17:00:30 +02:00
Aman Gupta 3fc6f1aed1
ggml-backend: re-enable graph reuse with pipeline parallelism (#20927) 2026-03-24 20:47:00 +08:00
Alessandro de Oliveira Faria (A.K.A.CABELO) 29771a0a4c
vendor : update cpp-httplib to 0.39.0 (#20933) 2026-03-24 13:33:33 +01:00
Adrien Gallouët 42ebce3beb
common : fix get_gguf_split_info (#20946)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 13:33:14 +01:00
BlueMöhre a94fdb090a
WebUI: fix edit msg form textarea height (#20830)
* autoresize textarea on mount

* allow textarea to grow to same height as rendered messages

* add UI build file
2026-03-24 13:17:45 +01:00
Adrien Gallouët c9dc43333f
readme : clarify MODEL_ENDPOINT usage (#20941)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 10:35:07 +01:00
Adrien Gallouët 2d2d9c2062
common : add a WARNING for HF cache migration (#20935)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 09:24:39 +01:00
nuri 92080b4396
metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930)
Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>
2026-03-24 10:13:07 +02:00
Georgi Gerganov 342d6125bc
metal : add FA instantiations for HSK=512, HSV=512 (#20902) 2026-03-24 10:03:09 +02:00
Aaron Teo c2e224d829
issues: add openvino backends (#20932)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2026-03-24 14:41:10 +08:00
Adrien Gallouët 8c7957ca33
common : add standard Hugging Face cache support (#20775)
* common : add standard Hugging Face cache support

- Use HF API to find all files
- Migrate all manifests to Hugging Face cache at startup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check with the quant tag

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Cleanup

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Improve error handling and report API errors

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Restore common_cached_model_info and align mmproj filtering

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Prefer main when getting cached ref

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use cached files when HF API fails

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Use final_path..

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

* Check all inputs

Signed-off-by: Adrien Gallouët <angt@huggingface.co>

---------

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-03-24 07:30:33 +01:00
Aman Gupta e852eb4901
llama-fit: fix regex pattern for gate_up tensors (#20910)
* llama-fit: fix regex pattern for gate_up tensors

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-03-24 12:57:57 +08:00
Aldehir Rojas 312d870a89
common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912) 2026-03-23 22:21:47 -05:00
Max Krasnyansky 7cadbfce10
hexagon: general DMA and Binary Op fixes for large strides (#20918)
* hex-dma: make chained dma the default to handle newer models

This also includes some new instrumentation that we can remove later.

* hexagon: add uint32 dump helper

* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv

ssm-conv uses the HVX gather instruction, which cannot handle cases where the base+offset
spans page boundaries.

* hexagon: update ssm-conv to make base-addr compute a bit easier to read

* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)

* hex-bin: fix incorrect stride logic

* hexagon: make sure repack buffs are dumped for verbose > 2

* hex-bin: consistently use dma_queue_push even for dummy dst transactions

* hex-dma: start using 2d-wide mode on v75 and up

This removes the need to deal with the 16-bit limitation for the strides.

* hex-bin: cleanup kernel selection logic

* hex-bin: cleanup binary op core and fix transposed tensor handling

* snapdragon: update run-bench to use larger ubatch and fa-on
2026-03-23 15:33:49 -07:00
Max Krasnyansky 1fb2290a51
Add codeowners for scripts/snapdragon and docs/snapdragon (#20915)
* Add codeowners for scripts/snapdragon

* Also add docs/backends/snapdragon
2026-03-23 14:57:18 -07:00
lhez 1772701f99
opencl: add q6_K gemm and gemv kernels for Adreno (#20089)
* opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code

* opencl: add q6_K transpose

* opencl: fix cvt kernel name

* opencl: add call to q6_K gemv

* opencl: fix q6_K scale transpose

* opencl: fix loading for gemv q6_K, refactor

* opencl: fix transpose_8_buf kernel assignment, refactor

* opencl: refactor q6_K transpose

* opencl: add gemm_noshuffle_q6_k_f32

* opencl: fix qh loading

* opencl: refactor q6_K gemv host side, release bufs and imgs

* opencl: refactor

* opencl: fix q6_K dequant and scale selection

* opencl: workaround compiler bug, fix dump_tensor

* opencl: refactor q6_K convert kernels

* opencl: unpack transformed q6_K in get_tensor

* opencl: refactor, handle non-uniform workgroups

* opencl: support non-vector subgroup bcast
2026-03-23 12:44:18 -07:00
las7 39bf0d3c6a
rpc : RCE patch (#20908) 2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen bd6992180b
contrib: add "Requirements" section to PR template (#20841)
* contrib: add "Requirements" section to PR template

* typo [no ci]

* use h2, add "Additional information"

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
2026-03-23 16:59:02 +01:00
Davi Henrique Linhares fd18364755
devops: upgraded default oneAPI version (#20731) 2026-03-23 21:47:34 +08:00
Aleksander Grygier 11fb11b901
webui: Improve chat form positioning (#20901) 2026-03-23 14:30:55 +01:00
Geo Maciolek 35b662bb5d
docs: Fix typo in reasoning flag documentation (#20780)
Tested to verify - the typo is just in the docs, not the actual flag.
2026-03-23 21:24:55 +08:00
Georgi Gerganov f93c09e267
memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887) 2026-03-23 14:08:46 +02:00
Eric Zhang 841bc203e2
docs : rerun llama-gen-docs to include new CLI args (#20892) 2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen 31a5cf4c3f
server: use httplib dynamic threads (#20817)
* server: use httplib dynamic threads

* change to n_threads_http + 1024
2026-03-23 12:22:46 +01:00
Georgi Gerganov e32d243849
ai : update gh permissions (#20895) 2026-03-23 13:21:41 +02:00
Pascal c44a932cf4
webui: fix --webui-config-file settings not applied on load (#20823)
* webui: fix --webui-config-file settings not applied on load

* chore: update webui build output
2026-03-23 11:25:35 +01:00
Rashid Ul Islam 177c75852a
metal: add CONV_3D (#19927)
* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* metal:add conv_3d backend

Rebased with master and resolved conflicts.

* Resolved issues related to changes in variable names

* kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 09:45:34 +02:00
Jhen-Jie Hong 7a0b6a635e
common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859) 2026-03-23 08:35:27 +01:00
Chenguang Li 07ff000551
CANN: add RoPE cache preload before ACL graph capture (#20747)
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
  and allocation branches are skipped
2026-03-23 15:24:06 +08:00
Dan Hoffman cc18f965b6
fix(openvino): explicit memset in buffer_context allocation (#20857)
* fix(openvino): explicit memset in buffer_context allocation

* minor

---------

Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-23 08:05:37 +02:00
shaofeiqi 84ffd0c192
opencl: add flattened Q4_K mv and general Q4_K mm (#20773) 2026-03-22 22:45:11 -07:00
bssrdf ec2b787ebe
mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847)
* added support for internvl's dynamic high-resolution (Qianfan-OCR needed)

* add min/max dynamic patch to gguf meta

* clean up

* simplified handling min/max dynamic patch

* reuse llava_uhd logic for slice images

* provide default values for older models

* flake8

* prevent writing 0 value to gguf

* remove duplicated resolution candidates with a better algorithm

* fix indentation

* format

* add protection from divide by zero

* change to 0 to be safe

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2026-03-23 01:06:30 +01:00
DorianRudolph d3ac030a5d
mtmd : fix LightOnOCR image preprocessing (#20877) 2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen 49bfddeca1
server: allow router to report child instances sleep status (#20849)
* server: allow router to report child instances sleep status

* refactor

* move sleeping to state

* nits
2026-03-22 18:33:52 +01:00
Johannes Gäßler bd3f1d9d65
CUDA: fix BF16 FA compilation (#20865) 2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret 23c9182ce8
jinja : refactor token advancement (#20864)
* refactor token advancement

* exercise sub-expressions
2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky 81bc4d3ddc
server: fix Host header (#20843)
It should include port when it's not default.
2026-03-22 22:29:22 +08:00
Neo Zhang f40a80b4f3
support bf16 and quantized type (#20803) 2026-03-22 22:06:27 +08:00