llama.cpp

Commit Graph

Author	SHA1	Message	Date
Johannes Gäßler	ed115841c0	Merge `60312f6a46` into `a94fdb090a`	2026-03-24 14:20:14 +02:00
BlueMöhre	a94fdb090a	WebUI: fix edit msg form textarea height (#20830 ) * autoresize textarea on mount * allow textarea to grow to same height as rendered messages * add UI build file	2026-03-24 13:17:45 +01:00
Adrien Gallouët	c9dc43333f	readme : clarify MODEL_ENDPOINT usage (#20941 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 10:35:07 +01:00
Adrien Gallouët	2d2d9c2062	common : add a WARNING for HF cache migration (#20935 ) Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 09:24:39 +01:00
nuri	92080b4396	metal : add FLOOR, CEIL, ROUND, TRUNC unary ops (#20930 ) Co-authored-by: nryoo <nryoo@nryooui-MacBookPro.local>	2026-03-24 10:13:07 +02:00
Georgi Gerganov	342d6125bc	metal : add FA instantiations for HSK=512, HSV=512 (#20902 )	2026-03-24 10:03:09 +02:00
Aaron Teo	c2e224d829	issues: add openvino backends (#20932 ) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2026-03-24 14:41:10 +08:00
Adrien Gallouët	8c7957ca33	common : add standard Hugging Face cache support (#20775 ) * common : add standard Hugging Face cache support - Use HF API to find all files - Migrate all manifests to hugging face cache at startup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check with the quant tag Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Cleanup Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Improve error handling and report API errors Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Restore common_cached_model_info and align mmproj filtering Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Prefer main when getting cached ref Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use cached files when HF API fails Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Use final_path.. Signed-off-by: Adrien Gallouët <angt@huggingface.co> * Check all inputs Signed-off-by: Adrien Gallouët <angt@huggingface.co> --------- Signed-off-by: Adrien Gallouët <angt@huggingface.co>	2026-03-24 07:30:33 +01:00
Aman Gupta	e852eb4901	llama-fit: fix regex pattern for gate_up tensors (#20910 ) * llama-fit: fix regex pattern for gate_up tensors * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-24 12:57:57 +08:00
Aldehir Rojas	312d870a89	common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912 )	2026-03-23 22:21:47 -05:00
Max Krasnyansky	7cadbfce10	hexagon: general DMA and Binary Op fixes for large strides (#20918 ) * hex-dma: make chained dma the default to handle newer models This also includes some new instrumentation that we can remove later. * hexagon: add uint32 dump helper * hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv ssm-conv uses HVX gather instruction and that instruction cannot handle cases where the base+offset spans page boundaries. * hexagon: update ssm-conv to make base-addr compute a bit easier to read * hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB) * hex-bin: fix incorrect stride logic * hexagon: make sure repack buffs are dumped for verbose > 2 * hex-bin: consistently use dma_queue_push even for dummy dst transactions * hex-dma: start using 2d-wide mode on v75 and up This removes the need to deal with the 16-bit limitation for the strides. * hex-bin: cleanup kernel selection logic * hex-bin: cleanup binary op core and fix transposed tensor handling * snapdragon: update run-bench to use larger ubatch and fa-on	2026-03-23 15:33:49 -07:00
Johannes Gäßler	60312f6a46	fix model saving	2026-03-23 22:59:37 +01:00
Johannes Gäßler	dd2564bc38	fix prints	2026-03-23 22:59:37 +01:00
Johannes Gäßler	445fc0bf21	refactor tests	2026-03-23 22:59:37 +01:00
Johannes Gäßler	e7f31055b3	fixup	2026-03-23 22:59:37 +01:00
Johannes Gäßler	c66fd8a227	roundtrip tests	2026-03-23 22:59:37 +01:00
Johannes Gäßler	e8e2f634e7	fix llama-model-saver	2026-03-23 22:59:37 +01:00
Johannes Gäßler	e0ee16ce77	fix tensor names	2026-03-23 22:59:37 +01:00
Johannes Gäßler	f76e53108c	fixup	2026-03-23 22:59:37 +01:00
Siddhesh2377	6de1857936	llama : use FILE pointer consistently, address review feedback	2026-03-23 22:59:37 +01:00
Siddhesh2377	c44d34ee73	llama : use FILE pointer instead of fd in public API	2026-03-23 22:59:37 +01:00
Siddhesh2377	2c3223177d	llama : address review feedback for fd-based model loading	2026-03-23 22:59:37 +01:00
Siddhesh2377	4101758ab6	llama : add fd-based model loading via llama_model_load_from_fd	2026-03-23 22:59:37 +01:00
Max Krasnyansky	1fb2290a51	Add codeowners for scripts/snapdragon and docs/snapdragon (#20915 ) * Add codeowners for scripts/snapdragon * Also add docs/backends/snapdragon	2026-03-23 14:57:18 -07:00
lhez	1772701f99	opencl: add q6_K gemm and gemv kernels for Adreno (#20089 ) * opencl: add q6_K noshuffle kernels, initial q6_K gemv, some host code * opencl: add q6_K transpose * opencl: fix cvt kernel name * opencl: add call to q6_K gemv * opencl: fix q6_K scale transpose * opencl: fix loading for gemv q6_K, refactor * opencl: fix transpose_8_buf kernel assignment, refactor * opencl: refactor q6_K transpose * opencl: add gemm_noshuffle_q6_k_f32 * opencl: fix qh loading * opencl: refactor q6_K gemv host side, release bufs and imgs * opencl: refactor * opencl: fix q6_K dequant and scale selection * opencl: workaround compiler bug, fix dump_tensor * opencl: refactor q6_K convert kernels * opencl: unpack transformed q6_K in get_tensor * opencl: refactor, handle non-uniform workgroups * opencl: support non-vector subgroup bcast	2026-03-23 12:44:18 -07:00
las7	39bf0d3c6a	rpc : RCE patch (#20908 )	2026-03-23 19:54:57 +02:00
Xuan-Son Nguyen	bd6992180b	contrib: add "Requirements" section to PR template (#20841 ) * contrib: add "Requirements" section to PR template * typo [no ci] * use h2, add "Additional information" --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-03-23 16:59:02 +01:00
Davi Henrique Linhares	fd18364755	devops: upgraded default oneAPI version (#20731 )	2026-03-23 21:47:34 +08:00
Aleksander Grygier	11fb11b901	webui: Improve chat form positioning (#20901 )	2026-03-23 14:30:55 +01:00
Geo Maciolek	35b662bb5d	docs: Fix typo in reasoning flag documentation (#20780 ) Tested to verify - the typo is just in the docs, not the actual flag.	2026-03-23 21:24:55 +08:00
Georgi Gerganov	f93c09e267	memory : fix seq_id bounds in llama_memory_recurrent::state_read_meta() (#20887 )	2026-03-23 14:08:46 +02:00
Eric Zhang	841bc203e2	docs : rerun llama-gen-docs to include new CLI args (#20892 )	2026-03-23 12:33:38 +01:00
Xuan-Son Nguyen	31a5cf4c3f	server: use httplib dynamic threads (#20817 ) * server: use httplib dynamic threads * change to n_threads_http + 1024	2026-03-23 12:22:46 +01:00
Georgi Gerganov	e32d243849	ai : update gh permissions (#20895 )	2026-03-23 13:21:41 +02:00
Pascal	c44a932cf4	webui: fix --webui-config-file settings not applied on load (#20823 ) * webui: fix --webui-config-file settings not applied on load * chore: update webui build output	2026-03-23 11:25:35 +01:00
Rashid Ul Islam	177c75852a	metal: add CONV_3D (#19927 ) * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * metal:add conv_3d backend Rebased with master and resolved conflicts. * Resolved issues related to changes in variable names * kernel void kernel_upscale_bilinear_f32 was missing in my branch, added back, should pass all tests now --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 09:45:34 +02:00
Jhen-Jie Hong	7a0b6a635e	common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859 )	2026-03-23 08:35:27 +01:00
Chenguang Li	07ff000551	CANN: add RoPE cache preload before ACL graph capture (#20747 ) ACL graph capture disallows host-to-device memcpy and device memory malloc/free on the captured stream. Pre-load the RoPE cache before capture so that: - Host-to-device copies and allocations run on the non-captured stream - Cache metadata is populated and memory pool is warmed up - During capture, only on-device computations are recorded; host-side and allocation branches are skipped	2026-03-23 15:24:06 +08:00
Dan Hoffman	cc18f965b6	fix(openvino): explicit memset in buffer_context allocation (#20857 ) * fix(openvino): explicit memset in buffer_context allocation * minor --------- Co-authored-by: Dan Hoffman <dhoffman@cyket.net> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-23 08:05:37 +02:00
shaofeiqi	84ffd0c192	opencl: add flattened Q4_K mv and general Q4_K mm (#20773 )	2026-03-22 22:45:11 -07:00
bssrdf	ec2b787ebe	mtmd: Add dynamic high-resolution image preprocessing for InternVL model (#20847 ) * added support for internvl's dynamic high-resolution (Qianfan-OCR needed) * add min/max dynamic patch to gguf meta * clean up * simplified handling min/max dynamic patch * reuse llava_uhd logic for slice images * provide default values for older models * flake8 * prevent writing 0 value to gguf * remove duplicated resolution candidates with a better algorithm * fix indentation * format * add protection from divide by zero * change to 0 to be safe --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co>	2026-03-23 01:06:30 +01:00
DorianRudolph	d3ac030a5d	mtmd : fix LightOnOCR image preprocessing (#20877 )	2026-03-23 01:04:14 +01:00
Xuan-Son Nguyen	49bfddeca1	server: allow router to report child instances sleep status (#20849 ) * server: allow router to report child instances sleep status * refactor * move sleeping to state * nits	2026-03-22 18:33:52 +01:00
Johannes Gäßler	bd3f1d9d65	CUDA: fix BF16 FA compilation (#20865 )	2026-03-22 17:53:33 +01:00
Sigbjørn Skjæret	23c9182ce8	jinja : refactor token advancement (#20864 ) * refactor token advancement * exercise sub-expressions	2026-03-22 17:45:10 +01:00
Evgeny Kurnevsky	81bc4d3ddc	server: fix Host header (#20843 ) It should include port when it's not default.	2026-03-22 22:29:22 +08:00
Neo Zhang	f40a80b4f3	support bf16 and quantized type (#20803 )	2026-03-22 22:06:27 +08:00
Patrick Buckley	db9d8aa428	ggml-cuda: native bf16 flash attention for vec kernel (#20525 ) * ggml-cuda: native bf16 flash attention for vec and tile kernels mma kernel still converts bf16 to fp16 before launch, native mma bf16 todo * ggml-cuda: address code owner review feedback reverted tile kernel changes to avoid larger refactor * fix ci failures on turing and hip * fix bf16 vec kernel compile on hip v_dot2 platforms * add comments --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2026-03-22 11:05:51 +01:00
Gaurav Garg	ccb87fa3ee	[CUDA] Increase number of output elements per-thread block if the K-dimension is small (#20635 ) * Increase per-thread work if the K-dimension is small With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MOEs. For example, Qwen3-30b-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536. The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices. This change increases the number of output elements per block for such cases. * Limit this change to ncols_dst = 1 * tab to space	2026-03-22 16:49:35 +08:00
ddh0	3306dbaef7	misc : prefer ggml-org models in docs and examples (#20827 ) * misc : prefer ggml-org models in docs and examples Prefer referring to known-good quantizations under ggml-org rather than 3rd-party uploaders. * remove accidentally committed file	2026-03-21 22:00:26 +01:00


8517 Commits
