Commit Graph

2234 Commits

Author SHA1 Message Date
Xuejun Zhai 42ca27f714 Removed API get_input_type 2026-01-15 11:39:08 -08:00
Xuejun Zhai 891a3beb2d Removed API get_input_type 2026-01-15 11:39:08 -08:00
Xuejun Zhai cd611782ef Removed API GgmlOvDecoder::get_input_stride(const std::string& name) 2026-01-15 11:39:08 -08:00
Xuejun Zhai 95c3071906 Removed API GgmlOvDecoder::get_input_names() 2026-01-15 11:39:08 -08:00
Xuejun Zhai 197ed992c0 Removed m_output_names 2026-01-15 11:39:08 -08:00
Xuejun Zhai 8ff73e5d53 Removed API m_outputs 2026-01-15 11:39:08 -08:00
Xuejun Zhai 111c96c266 Removed API get_output_ggml_tensor(const std::string & name) 2026-01-15 11:39:08 -08:00
Xuejun Zhai ba852f2a60 Removed API GgmlOvDecoder::get_output_op_params(const std::string & name) 2026-01-15 11:39:08 -08:00
Xuejun Zhai 6d7a0d6047 Modified API GgmlOvDecoder::get_output_type(const std::string & name) 2026-01-15 11:39:08 -08:00
Xuejun Zhai f516db1db5 remove unused API get_output_shape(const std::string & name) 2026-01-15 11:39:08 -08:00
Xuejun Zhai 497964afbb remove unused API GgmlOvDecoder::get_output_names() 2026-01-15 11:39:08 -08:00
Yu, Zijun 8f4ee4eee2 minor update due to ov 2025.4 2026-01-15 11:39:08 -08:00
Xuejun Zhai 0ea8238ad0 remove unused API GgmlOvDecoder::get_output_stride(const std::string & name) 2026-01-15 11:39:08 -08:00
Yu, Zijun 2a9d4ca836 Refactor: split ov_graph_compute for dynamic and static 2026-01-15 11:39:08 -08:00
Yu, Zijun 808619e274 NPU support llama-perplexity -b 512 --no-warmup 2026-01-15 11:39:08 -08:00
Yu, Zijun 65348b5d20 fallback naive run with accuracy issue 2026-01-15 11:39:08 -08:00
Yu, Zijun 59e7e7c47d NPU fix llama-bench 2026-01-15 11:39:08 -08:00
Yu, Zijun 38254cf592 NPU prefill chunking 2026-01-15 11:39:08 -08:00
XuejunZhai 992dea73fd Fix error for naive 2026-01-15 11:39:08 -08:00
XuejunZhai ae936519d2 Remove the second decoder for node. Moving the function into the model decoder 2026-01-15 11:39:05 -08:00
Arshath 4400b5cb4b Update ggml-decoder.cpp 2026-01-15 11:38:13 -08:00
Arshath 98396b275a Update ggml-decoder.cpp 2026-01-15 11:38:13 -08:00
Arshath 4a57b37d4d Update ggml-decoder.cpp 2026-01-15 11:38:13 -08:00
Arshath bed495226d Update ggml-decoder.cpp 2026-01-15 11:38:13 -08:00
Arshath 11b4cc5a67 Update ggml-decoder.cpp 2026-01-15 11:38:13 -08:00
Arshath 047bfb5c90 Update ggml-decoder.cpp
Hitting an error while compiling on Windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.

Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.
2026-01-15 11:38:07 -08:00
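The proposed fix in the commit above can be sketched as a small cross-platform helper; `unset_env_var` is a hypothetical name, not the actual symbol in the codebase. On MSVC, `_putenv_s(name, "")` removes the variable from the process environment, matching what `unsetenv()` does on POSIX:

```cpp
#include <cstdlib>

// Hypothetical helper: remove an environment variable on both POSIX and Windows.
// MSVC has no unsetenv(); _putenv_s with an empty value deletes the variable.
static int unset_env_var(const char * name) {
#ifdef _WIN32
    return _putenv_s(name, "");   // Windows/MSVC: empty value removes the variable
#else
    return unsetenv(name);        // POSIX
#endif
}
```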
Yu, Zijun 531941b348 Fix NPU 2026-01-15 11:28:48 -08:00
Yu, Zijun ae404f7cbb Fix llama-bench 2026-01-15 11:28:48 -08:00
Yu, Zijun 072dde0b2b change graph to 4d, support multi sequences 2026-01-15 11:28:48 -08:00
Yu, Zijun ea2c99be1c NPU unify PD (handled internally) 2026-01-15 11:28:48 -08:00
Yu, Zijun 303923aba7 Clean placeholders in ggml-openvino.cpp 2026-01-15 11:27:30 -08:00
Zijun Yu b8690bc055 NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims
2026-01-15 11:27:30 -08:00
Yu, Zijun eba8113dc4 Style: middle ptr and ref align, omit optional struct keyword 2026-01-15 11:27:30 -08:00
Yu, Zijun bd3093f90c Style: use switch in supports_ops 2026-01-15 11:27:30 -08:00
Ravi Panchumarthy 841d673bd0 Update to OV-2025.3 and CMakeLists.txt 2026-01-15 11:26:00 -08:00
Yu, Zijun 2d2f00a41f Fix llama-3-8b and phi3-mini q4_0 NPU 2026-01-15 11:26:00 -08:00
Yu, Zijun 299f4923bb fix after rebasing 2026-01-15 11:26:00 -08:00
Yu, Zijun 8b82d1153b Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working 2026-01-15 11:26:00 -08:00
Yu, Zijun a9371ea646 Fix llama-cli (need to run with --no-warmup) 2026-01-15 11:26:00 -08:00
cavusmustafa 05d7abae8c Fix for Phi3 2026-01-15 11:26:00 -08:00
cavusmustafa e7252920e1 env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added 2026-01-15 11:26:00 -08:00
cavusmustafa c112bc4e73 kvcachefusion support 2026-01-15 11:26:00 -08:00
Yu, Zijun 973a80fd02 Always apply Eliminate_ZP to fix GPU compile issue on some platforms 2026-01-15 11:26:00 -08:00
Yu, Zijun fdadca1e89 Fix after rebasing 2026-01-15 11:26:00 -08:00
Yu, Zijun f3afa7b914 Requantize Q6_K (gs16) to gs32 on GPU 2026-01-15 11:26:00 -08:00
Yu, Zijun e4bfe5a20d Add Q5_K to support phi-3-q4_k_m 2026-01-15 11:26:00 -08:00
Yu, Zijun 2f1d50fb07 Minor refactor 2026-01-15 11:26:00 -08:00
Yu, Zijun 67e178a2f6 Minor: not add attention_size_swa for non-swa model 2026-01-15 11:26:00 -08:00
Yu, Zijun 1a38339cea Fix ROPE accuracy when freq_scale != 1 2026-01-15 11:26:00 -08:00
Yu, Zijun 602f9ca4af Fix NPU accuracy 2026-01-15 11:26:00 -08:00
Yu, Zijun 9de874cb7b Support iSWA 2026-01-15 11:25:58 -08:00
Yu, Zijun 7d81861a18 Fix Hunyuan 2026-01-15 11:20:31 -08:00
Yu, Zijun 597561242f Add GeGLU 2026-01-15 11:20:31 -08:00
Yu, Zijun be07073e0e Apply EliminateZP only for npu 2026-01-15 11:20:31 -08:00
Yu, Zijun da2cc993bc WA for npu 1st token acc issue 2026-01-15 11:20:31 -08:00
Yu, Zijun 434059aef7 Fix NPU compile 2026-01-15 11:20:31 -08:00
Yu, Zijun bcc343af00 Support BF16 model 2026-01-15 11:20:31 -08:00
Yu, Zijun dc77cbb3f6 STYLE: make get_types_to_requant a function 2026-01-15 11:20:31 -08:00
Yu, Zijun 2ad1147b9b Improve debug util; Eliminate nop ReshapeReshape 2026-01-15 11:20:31 -08:00
Yu, Zijun 0f7b253cb3 Fix after rebasing 2026-01-15 11:20:31 -08:00
Yu, Zijun 810eb480f5 Simplify translation of get_rows 2026-01-15 11:20:31 -08:00
Yu, Zijun c5231a2448 Set m_is_static=false as default in decoder 2026-01-15 11:20:31 -08:00
Yu, Zijun 6926655f5b Add custom quant type: q8_1_c, q4_0_128 2026-01-15 11:20:31 -08:00
Yu, Zijun b593428eb3 Dequantize q4_1 q4_k q6_k for NPU 2026-01-15 11:20:31 -08:00
Yu, Zijun 82c98335d3 NPU perf: eliminate zp 2026-01-15 11:20:31 -08:00
Yu, Zijun 9ca53c7991 Add NPU Q4_0 support 2026-01-15 11:20:31 -08:00
Yu, Zijun 9900245e0b Fix test-backend-ops: Treat quantized tensors as weights 2026-01-15 11:20:31 -08:00
Yu, Zijun a1ce428004 Fix Q4_1 2026-01-15 11:19:15 -08:00
Yu, Zijun dd80b04235 Fix CI; Disable test-backend-ops 2026-01-15 11:19:15 -08:00
Yu, Zijun 6ab76ed10a Fix accuracy: disable cpu_repack 2026-01-15 11:19:15 -08:00
Yu, Zijun 663a0b8cce Quant models run with accuracy issue 2026-01-15 11:19:15 -08:00
Yu, Zijun d4ca760da8 Add quant weight conversion functions from genai gguf reader 2026-01-15 11:19:15 -08:00
Yu, Zijun 3e897df51c Update supports_buft and supports_op for quantized models 2026-01-15 11:19:15 -08:00
Yu, Zijun 56d596775d Change openvino device_type to GPU; Enable flash_attn 2026-01-15 11:19:15 -08:00
Yu, Zijun 65e1b1af6d Fix after rebasing
- Layouts of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT; flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failure in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun 14c8a85c32 Perf: RMS fused to OV internal RMS op 2026-01-15 11:19:15 -08:00
Yu, Zijun a7b611bc93 Minor updates for raising PR 2026-01-15 11:19:15 -08:00
Yu, Zijun f4123be967 Fix test-backend-ops 2026-01-15 11:19:15 -08:00
Yu, Zijun 839f8c66a0 Remove CPY 2026-01-15 11:19:15 -08:00
Yu, Zijun 7bda5021f9 Fix NPU 2026-01-15 11:19:15 -08:00
Yu, Zijun 63d000ba40 Support op SET_ROWS 2026-01-15 11:19:15 -08:00
Yu, Zijun 9a91ca6ef9 Optimize tensor conversion, improve TTFT 2026-01-15 11:19:15 -08:00
Yu, Zijun 37ff226bb6 Use CiD for NPU 2026-01-15 11:19:15 -08:00
shaofeiqi 785a710085
OpenCL: add SOLVE_TRI op support (#18846) 2026-01-15 11:17:17 -08:00
Georgi Gerganov 6e7fc8a146
cuda : print less debug logs when disabling cuda graphs (#18868) 2026-01-15 20:53:01 +02:00
Yu, Zijun fc865340d5 Fix test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun 43489bbfaa Revert changes in fuse_to_sdpa 2026-01-15 10:26:28 -08:00
Cavus Mustafa 1a19566b23 add mark decomp pass 2026-01-15 10:26:28 -08:00
Cavus Mustafa 93b2d09a2d mulmat type conversion update 2026-01-15 10:26:28 -08:00
Cavus Mustafa e2fdc1b988 mulmat input conversion fix 2026-01-15 10:26:28 -08:00
Yu, Zijun 01cdf4a9cc matmul in fp32 2026-01-15 10:26:28 -08:00
Cavus Mustafa 9cf56d6837 temp. changes for mark decomp 2026-01-15 10:26:28 -08:00
Yu, Zijun 4e7f04a307 Fix llama-perplexity 2026-01-15 10:26:28 -08:00
Yu, Zijun 75eec6265f Fix llama-bench; Clang-format 2026-01-15 10:26:28 -08:00
Yu, Zijun 6dc4b90635 Fix NPU 2026-01-15 10:26:28 -08:00
Yu, Zijun 44f4cf34b1 Fix Phi3 ROPE; Add test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun 1ed49bbfaf Fix llama-cli 2026-01-15 10:26:28 -08:00
Yu, Zijun d61f83c9b7 Fix CPY due to cgraph change 2026-01-15 10:23:35 -08:00
Yu, Zijun f3c0519096 Reduce memory: free ov weights node after graph conversion 2026-01-15 10:20:18 -08:00
Yu, Zijun a80da69448 Pull out sin cos from rope 2026-01-15 10:20:18 -08:00
Yu, Zijun 3533c14cf6 Fix Phi3 SwiGLU and SoftMax 2026-01-15 10:20:18 -08:00
Yu, Zijun 0fa7a5efef Refactor: remove past_token_len from extra_inputs 2026-01-15 10:20:18 -08:00
Yu, Zijun acf358d1ce Pull out indices creation for kv cache update 2026-01-15 10:20:18 -08:00
Yu, Zijun bf5414c95e Replace Concat with Broadcast in MulMat for GQA 2026-01-15 10:20:18 -08:00
Yu, Zijun ebc4fc9f95 Fuse to SDPA 2026-01-15 10:20:18 -08:00
Yu, Zijun 73ee84fffe Add SwiGLU 2026-01-15 10:20:18 -08:00
Yu, Zijun 4c582ac7a3 Stateful transformation for CPU GPU 2026-01-15 10:20:18 -08:00
Yu, Zijun 8afee795ad Update clang-format 2026-01-15 10:20:18 -08:00
Yu, Zijun 593484ce5f Refactor: clean, fix warning 2026-01-15 10:20:18 -08:00
Yu, Zijun 42d4240937 Change due to ggml cgraph changes, all device work 2026-01-15 10:20:18 -08:00
Yu, Zijun e27738a987 Add AMD64 to CMakeLists 2026-01-15 10:20:18 -08:00
Yu, Zijun 592d7f8bbb Change due to ggml cgraph changes, llama-3.2 CPU work 2026-01-15 10:20:18 -08:00
Yu, Zijun f7ad77930e Change due to ggml cgraph changes, not correct yet 2026-01-15 10:20:18 -08:00
Yu, Zijun d9ca8f5dbe NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 34531abce4 draft NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 7fec223334 Add initial NPU support 2026-01-15 10:20:18 -08:00
Yu, Zijun 8ce5cc597a Add cgraph tensor output name to OV op name 2026-01-15 10:20:18 -08:00
Yu, Zijun d7cc802292 PERF: use Slice+Concat in writing cache_v 2026-01-15 10:20:18 -08:00
Yu, Zijun 8ac5c225aa FIX: set_max_token_len 2026-01-15 10:20:18 -08:00
Yu, Zijun a30dc6e726 PERF: add weight constant in parallel 2026-01-15 10:20:18 -08:00
Yu, Zijun c57f61494a FIX: input shape of KQ_mask 2026-01-15 10:20:18 -08:00
Yu, Zijun 041d220dfa FIX: Re-add tensor names in cgraph, Add another case for RESHAPE 2026-01-15 10:20:13 -08:00
Yu, Zijun 0d505b4e56 STYLE and minor REFACTOR 2026-01-15 10:10:00 -08:00
Yu, Zijun cdf5370cb5 PERF: favor low precision matmul 2026-01-15 10:10:00 -08:00
Yu, Zijun 0d009fe61a FEAT: Add all conversion code from ov side 2026-01-15 10:10:00 -08:00
Yu, Zijun f15a2cc057 STYLE: clang-format 2026-01-15 10:10:00 -08:00
Yu, Zijun a0b30529bf FIX: backend buffer type issue 2026-01-15 10:10:00 -08:00
Zijun Yu 4c905b2b25 fix build error 2026-01-15 10:10:00 -08:00
Viraj Wadhwa ffabe95e2a Rebase - Bring up to date and fix build process 2026-01-15 10:09:23 -08:00
Yu, Zijun a8e5efa44e PERF: compile once (dynamic graph + cache) 2026-01-15 10:05:41 -08:00
Yu, Zijun 7d5e234254 FEAT: improve debug capability 2026-01-15 10:05:41 -08:00
Yu, Zijun 0a8cc9ab03 BUILD: update build doc, add cmake preset, add CACHE_DIR env var 2026-01-15 10:05:41 -08:00
Yu, Zijun d3bdca25bd PERF: share const nodes for weights for diff infer 2026-01-15 10:05:41 -08:00
Yu, Zijun 96ba47dd43 STYLE: minor refactor 2026-01-15 10:05:41 -08:00
Yu, Zijun c04966cda6 REFACTOR: support weights as constant 2026-01-15 10:05:41 -08:00
Yu, Zijun 0c7b026ecc FEAT: Add interleaved mode for ROPE 2026-01-15 10:05:41 -08:00
Yu, Zijun 6ed44a3dff FEAT: do PERMUTE eagerly 2026-01-15 10:05:41 -08:00
Yu, Zijun 8b408869ae Arbitrary token len (>32) work; Fix bug in mulmat 2026-01-15 10:05:41 -08:00
Yu, Zijun 8d263bd6a5 2nd+ token correct by fix CPY in OV, remove single op backend compute code 2026-01-15 10:05:41 -08:00
Yu, Zijun 91d2a195b5 change op mappings to list in openvino_supports_op 2026-01-15 10:05:41 -08:00
Yu, Zijun 651b2c06cb * Use find_package in CMake to configure OpenVINO
* Remove OPENVINO_OP_DEBUG
* Simplify set_input_output in decoder
* Fix CPY in set_input_output
* Use params from converted ov model in setting input
2026-01-15 10:05:41 -08:00
zhanmyz 84be5c6f15 1. Delete some comments
2. Process Prompt and predict first token is OK
2026-01-15 10:05:41 -08:00
zhanmyz eac9a99530 1. Solve the AC issue of PERMUTE+VIEW and the MULMAT issue in the phase of “1. Process Prompt and predict the first token”.
2. There is still an AC issue in the "2. Predict the subsequent tokens phase" and it is being debugged.
   A deviation has been detected in the computation of OpenVINO's CPY Node at stage 2, and it is currently being fixed.
2026-01-15 10:05:41 -08:00
zhanmyz 8ae700ae11 Process Prompt and predict first token is OK 2026-01-15 10:05:41 -08:00
zhanmyz 8020138406 add debug info 2026-01-15 10:05:41 -08:00
zhanmyz b02265a507 1. In the Prompt process and predict-first-token stage, the PERMUTE node needs to be integrated into the OV Frontend.
2. In the predict-latest-token stage, the VIEW, CONT, and Reshape nodes need to be integrated into the OV Frontend.
2026-01-15 10:05:41 -08:00
zhanmyz 19ec9b6bf5 Try to add VIEW node to OV Frontend and have some issues that need to be dealt with 2026-01-15 10:05:41 -08:00
zhanmyz b14b49d5f6 Minor Update 2026-01-15 10:05:41 -08:00
zhanmyz 467a5ddf04 1. Update the implementation of CPY node when it's non-contiguous
2. Remove duplicate get node operation function
2026-01-15 10:05:41 -08:00
zhanmyz cff473a9e2 1. All operators implemented using OpenVINO can be successfully executed individually.
2. The VIEW op output tensor shape is not the same as the CONT (non-contiguous) input tensor shape.
3. CPY (non-contiguous) can't be implemented with the original input/output tensor shape and data (the original shape needs to change when creating the input/output tensors).

Currently, the VIEW op executes in the ggml backend and the others execute in the OpenVINO Frontend.
2026-01-15 10:05:41 -08:00
zhanmyz e08a7fda33 All adjacent ops can be converted, but the calculation result is wrong and needs debugging 2026-01-15 10:05:41 -08:00
zhanmyz d05c458421 change CONT and MULMAT input node shape 2026-01-15 10:05:41 -08:00
zhanmyz 246a2d1021 Change the input and output node shape of MUL_MAT operator 2026-01-15 10:05:41 -08:00
zhanmyz f37fa21a5c Change the input and output node shape of MUL_MAT operator 2026-01-15 10:05:41 -08:00
zhanmyz f98d215162 Change the input parameter shape of CONT operator 2026-01-15 10:05:41 -08:00
zhanmyz 9a7b7d8d6d OV Frontend supports GET_ROWS/RMS_NORM/MUL/MUL_MAT/ROPE/SCALE/SOFTMAX/ADD adjacent op graph conversion 2026-01-15 10:05:41 -08:00
zhanmyz 95ae982d59 OV Frontend supports GET_ROWS/RMS_NORM/MUL/MUL_MAT graph conversion of consecutive OPs 2026-01-15 10:05:41 -08:00
zhanmyz 901f7347ff Execute CONT & VIEW operators in OV Frontend is OK 2026-01-15 10:05:41 -08:00
zhanmyz 081b52667b Execute single CONT operator is OK 2026-01-15 10:05:41 -08:00
zhanmyz afb8594194 add tmp source code files 2026-01-15 10:05:41 -08:00
zhanmyz 57582fda39 add implementation of CPY when the output tensor is non-contiguous 2026-01-15 10:05:41 -08:00
zhanmyz 8484769981 add implementation of MUL_MAT, CPY, CONT of GGML ops using OV ops 2026-01-15 10:05:41 -08:00
zhanmyz cb2729bc4a Move CPY from GGML OV Backend to OV Frontend 2026-01-15 10:05:41 -08:00
zhanmyz 2b04bd43be Add MUL_MAT,CPY,CONT as operators implemented in OpenVINO for GGML backend 2026-01-15 10:05:41 -08:00
zhanmyz 0f7d07de7d Add support for RMS_NORM OP 2026-01-15 10:05:41 -08:00
yumengbo 2353c73f53 Support ROPE op. 2026-01-15 10:05:41 -08:00
yumengbo 8aba03bac6 Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo d218c61e6d Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo 590f587b27 Add support for UNARY SILU op . Fix pytorch impl bugs. 2026-01-15 10:05:41 -08:00
yumengbo b100f89bad Change to implementation following pytorch frontend 2026-01-15 10:05:41 -08:00
yumengbo e95f29cbc0 Fix issue for output memory copy of infer request 2026-01-15 10:05:41 -08:00
zhanmyz 8c5a609f8d add the rms_norm operator implemented using OpenVINO to the GGML backend of llama.cpp 2026-01-15 10:05:41 -08:00
zhanmyz 80c330a469 Update build.md and add operation mapping(GGML to OpenVINO) 2026-01-15 10:05:41 -08:00
zhanmyz 49804f43fc add GET_ROWS operator of OpenVINO to GGML of llama.cpp 2026-01-15 10:05:41 -08:00
yumengbo 5b46dc23be Change output for infer request to set output tensor. Support scale, view op. 2026-01-15 10:05:41 -08:00
yumengbo 31bd816426 Add GGML_OV_FRONTEND option. Add readme. 2026-01-15 10:05:41 -08:00
yumengbo 9b7b63d12c Convert subgraph with add, sub, mul, div op to ov model and do infer on openvino device 2026-01-15 10:05:41 -08:00
yumengbo 34e826ac14 Implement GgmlOvDecoder. Add dump functions. 2026-01-15 10:05:41 -08:00
yumengbo 171c4681f4 Add PoC of integration of openvino frontend. Main changes: ggml-ov-frontend-utils, GraphIterator, Decoder 2026-01-15 10:05:41 -08:00
zhanmyz ee31dc1c1b add get openvino available ops function 2026-01-15 10:05:41 -08:00
zhanmyz 77d68146a8 add OpenVINO frontend convert process steps 2026-01-15 10:05:41 -08:00
zhanmyz 0a81aa19f7 Add compile options 2026-01-15 10:05:40 -08:00
zhanmyz adc2c70f44 Add OpenVINO MUL operator to GGML of Llama.cpp. 2026-01-15 10:05:40 -08:00
zhanmyz faa4a7de76 Solve the issue of abnormal model output caused by using OpenVINO ADD operator 2026-01-15 10:05:40 -08:00
zhanmyz 9b9d51dddf * Configure the device(default CPU) that uses OpenVINO to compile the model
* Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
2026-01-15 10:05:40 -08:00
zhanmyz 5294402b50 add openvino as optional backend for Llama.cpp ggml 2026-01-15 10:05:40 -08:00
Yanglei Zou fe5720e684 Add ggml-openvino base files 2026-01-15 10:05:40 -08:00
Johannes Gäßler 5c662d21a3
CUDA: fix alignment on register spill for FA (#18815) 2026-01-15 15:14:50 +01:00
shalinib-ibm 8cc0ba957b
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837) 2026-01-15 17:31:18 +08:00
Max Krasnyansky cff777f226
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)
* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply the fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.

Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param
2026-01-14 21:46:12 -08:00
Oliver Simons 36f0132464
CUDA: Factor out and re-use `block_reduce` function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function to the `__global__` function
allows for explicit smem reuse, as either the compiler or the CUDA runtime
seems not to free it afterwards (`cudaFuncSetAttribute` fails when not
accounting for it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE !=0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-15 10:44:54 +08:00
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
Daniel Bevenius 01cbdfd7eb
CUDA : fix typo in clang pragma comment [no ci] (#18830) 2026-01-14 10:31:49 +01:00
Ruben Ortlam 635ef78ec5
vulkan: work around Intel fp16 bug in mmq (#18814) 2026-01-14 09:41:23 +01:00
Perry Naseck 7d587e5544
ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705) 2026-01-14 09:22:25 +02:00
yulo ea4a321f2a
HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-13 13:52:16 +01:00
Georgi Gerganov 0a57271ab6
CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800) 2026-01-13 12:25:53 +02:00
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b.
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
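The 32-bit limit described in the commit above can be checked with a little arithmetic. Assuming Q8_0's block size of 32 elements (the `QUANT_K` the commit divides by), the Maverick A matrix it mentions (Q8_0 8192 x 5120 x 128) overflows 32-bit element indexing but fits once indexed in blocks; this is an illustrative sketch, not shader code:

```cpp
#include <cstdint>

// Q8_0 packs 32 quantized elements per block (assumed here), so dividing the
// batch stride by QUANT_K early lets the block index calculation stay in 32b
// for matrices whose raw element count exceeds 2^32.
constexpr uint64_t QUANT_K_Q8_0 = 32;

constexpr bool fits_in_u32(uint64_t v) { return v <= UINT32_MAX; }

constexpr uint64_t elements = 8192ull * 5120ull * 128ull;   // mul_mat_id A matrix
constexpr uint64_t blocks   = elements / QUANT_K_Q8_0;
```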
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00