llama.cpp

Commit Graph

Author	SHA1	Message	Date
Masato Nakasaka	cbb52582ef	Merge `b9cb6b651b` into `2d2d9c2062`	2026-03-24 08:50:45 +00:00
Aaron Teo	b9cb6b651b	ci: use ninja multi-config for vulkan-x64 build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 16:49:43 +08:00
Adrien Gallouët	2d2d9c2062	common : add a WARNING for HF cache migration (#20935 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 09:24:39 +01:00
nuri	92080b4396	metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930 ) Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>	2026-03-24 10:13:07 +02:00
Georgi Gerganov	342d6125bc	metal : add FA instantiations for HSK=512, HSV=512 (#20902 )	2026-03-24 10:03:09 +02:00
Nakasaka, Masato	a987c02a56	Added explicit build types for Ninja Also reverted some needless change	2026-03-24 16:50:38 +09:00
Aaron Teo	c2e224d829	issues: add openvino backends (#20932 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:41:10 +08:00
Aaron Teo	7842cf622c	ci: fix windows ci errors from an erroneous revert Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:33:58 +08:00
Adrien Gallouët	8c7957ca33	common : add standard Hugging Face cache support (#20775 ) * common : add standard Hugging Face cache support - Use HF API to find all files - Migrate all manifests to hugging face cache at startup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check with the quant tag Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve error handling and report API errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Restore common_cached_model_info and align mmproj filtering Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Prefer main when getting cached ref Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use cached files when HF API fails Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use final_path.. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check all inputs Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 07:30:33 +01:00
Aaron Teo	231d441a4a	ci: missed one self-hosted step Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:25:10 +08:00
Aaron Teo	c06031138f	ci: revert ninja from self-hosted runners Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:23:47 +08:00
Aaron Teo	3908b9675b	ci: install ninja-build for self-hosted workflows Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:18:31 +08:00
Aaron Teo	0a6263b1ee	ci: revert generator to ninja instead of ninja multi-config Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:13:52 +08:00
Nakasaka, Masato	a923ae3977	Enabled ninja build by default on self-hosted envs for experimentation	2026-03-24 15:11:11 +09:00
Aaron Teo	5bc9a63ae3	ci: add run.sh to test conditions to trigger GitHub CI and self-hosted runners Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 13:50:53 +08:00
Nakasaka, Masato	e7304e2f0e	Enabled ninja build by default for experimentation	2026-03-24 14:44:59 +09:00
Aman Gupta	e852eb4901	llama-fit: fix regex pattern for gate_up tensors (#20910 ) * llama-fit: fix regex pattern for gate_up tensors * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-24 12:57:57 +08:00
Nakasaka, Masato	29487c45f0	changed to use plain strings rather than arrays	2026-03-24 12:49:35 +09:00
Nakasaka, Masato	60b8cb0a2f	Revert "use ninja-build as default for several CI" This reverts commit `f552c4559b`.	2026-03-24 12:39:26 +09:00
Aldehir Rojas	312d870a89	common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912 )	2026-03-23 22:21:47 -05:00
Nakasaka, Masato	90264ca716	Merge remote-tracking branch 'origin/master' into remove-make-from-ci	2026-03-24 11:25:35 +09:00
Max Krasnyansky	7cadbfce10	hexagon: general DMA and Binary Op fixes for large strides (#20918 ) * hex-dma: make chained dma the default to handle newer models This also includes some new instrumentation that we can remove later. * hexagon: add uint32 dump helper * hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset spans page boundaries. * hexagon: update ssm-conv to make base-addr compute a bit easier to read * hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB) * hex-bin: fix incorrect stride logic * hexagon: make sure repack buffs are dumped for verbose > 2 * hex-bin: consistently use dma_queue_push even for dummy dst transactions * hex-dma: start using 2d-wide mode on v75 and up This removes the need to deal with the 16-bit limitation for the strides. * hex-bin: cleanup kernel selection logic * hex-bin: cleanup binary op core and fix transposed tensor handling * snapdragon: update run-bench to use larger ubatch and fa-on	2026-03-23 15:33:49 -07:00
Max Krasnyansky	1fb2290a51	Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 ) * Add codeowners for scripts/snapdragon * Also add docs/backends/snapdragon	2026-03-23 14:57:18 -07:00
lhez	1772701f99	opencl: add q6_K gemm and gemv kernels for Adreno (#20089 ) * opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code * opencl: add q6_K transpose * opencl: fix cvt kernel name * opencl: add call to q6_K gemv * opencl: fix q6_K scale transpose * opencl: fix loading for gemv q6_K, refactor * opencl: fix transpose_8_buf kernel assignment, refactor * opencl: refactor q6_K transpose * opencl: add gemm_noshuffle_q6_k_f32 * opencl: fix qh loading * opencl: refactor q6_K gemv host side, release bufs and imgs * opencl: refactor * opencl: fix q6_K dequant and scale selection * opencl: workaround compiler bug, fix dump_tensor * opencl: refactor q6_K convert kernels * opencl: unpack transformed q6_K in get_tensor * opencl: refactor, handle non-uniform workgroups * opencl: support non-vector subgroup bcast	2026-03-23 12:44:18 -07:00
las7	39bf0d3c6a	rpc : RCE patch (#20908 )	2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen	bd6992180b	contrib: add "Requirements" section to PR template (#20841 ) * contrib: add "Requirements" section to PR template * typo [no ci] * use h2, add "Additional information" --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-03-23 16:59:02 +01:00
Davi Henrique Linhares	fd18364755	devops: upgraded default oneAPI version (#20731 )	2026-03-23 21:47:34 +08:00
Aleksander Grygier	11fb11b901	webui: Improve chat form positioning (#20901 )	2026-03-23 14:30:55 +01:00
Geo Maciolek	35b662bb5d	docs: Fix typo in reasoning flag documentation (#20780 ) Tested to verify - the typo is just in the docs, not the actual flag.	2026-03-23 21:24:55 +08:00
Georgi Gerganov	f93c09e267	memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887 )	2026-03-23 14:08:46 +02:00
Eric Zhang	841bc203e2	docs : rerun llama-gen-docs to include new CLI args (#20892 )	2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen	31a5cf4c3f	server: use httplib dynamic threads (#20817 ) * server: use httplib dynamic threads * change to n_threads_http + 1024	2026-03-23 12:22:46 +01:00
Georgi Gerganov	e32d243849	ai : update gh permissions (#20895 )	2026-03-23 13:21:41 +02:00
Pascal	c44a932cf4	webui: fix --webui-config-file settings not applied on load (#20823 ) * webui: fix --webui-config-file settings not applied on load * chore: update webui build output	2026-03-23 11:25:35 +01:00
Rashid Ul Islam	177c75852a	metal: add CONV_3D (#19927 ) * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * metal:add conv_3d backend Rebased with master and resolved conflicts. * Resolved issues related to changes in variable names * kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 09:45:34 +02:00
Jhen-Jie Hong	7a0b6a635e	common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859 )	2026-03-23 08:35:27 +01:00
Chenguang Li	07ff000551	CANN: add RoPE cache preload before ACL graph capture (#20747 ) ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that: - Host-to-device copies and allocations run on the non-captured stream - Cache metadata is populated and memory pool is warmed up - During capture, only on-device computations are recorded; host-side and allocation branches are skipped	2026-03-23 15:24:06 +08:00
Dan Hoffman	cc18f965b6	fix(openvino): explicit memset in buffer_context allocation (#20857 ) * fix(openvino): explicit memset in buffer_context allocation * minor --------- Co-authored-by: Dan Hoffman <dhoffman@cyket.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 08:05:37 +02:00
shaofeiqi	84ffd0c192	opencl: add flattened Q4_K mv and general Q4_K mm (#20773 )	2026-03-22 22:45:11 -07:00
Nakasaka, Masato	f552c4559b	use ninja-build as default for several CI	2026-03-23 12:15:45 +09:00
Nakasaka, Masato	80a9985943	Added option to specify Ninja generator	2026-03-23 11:42:49 +09:00
bssrdf	ec2b787ebe	mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847 ) * added support for internvl's dynamic high-resolution (Qianfan-OCR needed) * add min/max dynamic patch to gguf meta * clean up * simplified handling min/max dynamic patch * reuse llava_uhd logic for slice images * provide default values for older models * flake8 * prevent writing 0 value to gguf * remove duplicated resolution candidates with a better algorithm * fix indentation * format * add protection from divide by zero * change to 0 to be safe --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-23 01:06:30 +01:00
DorianRudolph	d3ac030a5d	mtmd : fix LightOnOCR image preprocessing (#20877 )	2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849 ) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Johannes Gäßler	bd3f1d9d65	CUDA: fix BF16 FA compilation (#20865 )	2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret	23c9182ce8	jinja : refactor token advancement (#20864 ) * refactor token advancement * exercise sub-expressions	2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky	81bc4d3ddc	server: fix Host header (#20843 ) It should include port when it's not default.	2026-03-22 22:29:22 +08:00
Neo Zhang	f40a80b4f3	support bf16 and quantized type (#20803 )	2026-03-22 22:06:27 +08:00
Patrick Buckley	db9d8aa428	ggml-cuda: native bf16 flash attention for vec kernel (#20525 ) * ggml-cuda: native bf16 flash attention for vec and tile kernels mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo * ggml-cuda: address code owner review feedback reverted tile kernel changes to avoid larger refactor * fix ci failures on turing and hip * fix bf16 vec kernel compile on hip v_dot2 platforms * add comments --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-22 11:05:51 +01:00
Gaurav Garg	ccb87fa3ee	[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635 ) * Increase per-thread work if the K-dimension is small With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices. This change increases the number of output elements per block for such cases. * Limit this change to ncols_dst = 1 * tab to space	2026-03-22 16:49:35 +08:00

8519 Commits
