llama.cpp

Commit Graph

Author	SHA1	Message	Date
Yu, Zijun	f3afa7b914	Requantize Q6_K (gs16) to gs32 on GPU	2026-01-15 11:26:00 -08:00
Yu, Zijun	e4bfe5a20d	Add Q5_K to support phi-3-q4_k_m	2026-01-15 11:26:00 -08:00
Yu, Zijun	2f1d50fb07	Minor refactor	2026-01-15 11:26:00 -08:00
Yu, Zijun	67e178a2f6	Minor: not add attention_size_swa for non-swa model	2026-01-15 11:26:00 -08:00
Yu, Zijun	1a38339cea	Fix ROPE accuracy when freq_scale != 1	2026-01-15 11:26:00 -08:00
Yu, Zijun	602f9ca4af	Fix NPU accuracy	2026-01-15 11:26:00 -08:00
Yu, Zijun	9de874cb7b	Support iSWA	2026-01-15 11:25:58 -08:00
Yu, Zijun	7d81861a18	Fix Hunyuan	2026-01-15 11:20:31 -08:00
Yu, Zijun	597561242f	Add GeGLU	2026-01-15 11:20:31 -08:00
Yu, Zijun	be07073e0e	Apply EliminateZP only for npu	2026-01-15 11:20:31 -08:00
Yu, Zijun	da2cc993bc	WA for npu 1st token acc issue	2026-01-15 11:20:31 -08:00
Yu, Zijun	434059aef7	Fix NPU compile	2026-01-15 11:20:31 -08:00
Yu, Zijun	bcc343af00	Support BF16 model	2026-01-15 11:20:31 -08:00
Yu, Zijun	dc77cbb3f6	STYLE: make get_types_to_requant a function	2026-01-15 11:20:31 -08:00
Yu, Zijun	2ad1147b9b	Improve debug util; Eliminate nop ReshapeReshape	2026-01-15 11:20:31 -08:00
Yu, Zijun	0f7b253cb3	Fix after rebasing	2026-01-15 11:20:31 -08:00
Yu, Zijun	810eb480f5	Simpilfy translation of get_rows	2026-01-15 11:20:31 -08:00
Yu, Zijun	c5231a2448	Set m_is_static=false as default in decoder	2026-01-15 11:20:31 -08:00
Yu, Zijun	6926655f5b	Add custom quant type: q8_1_c, q4_0_128	2026-01-15 11:20:31 -08:00
Yu, Zijun	b593428eb3	Dequantize q4_1 q4_k q6_k for NPU	2026-01-15 11:20:31 -08:00
Yu, Zijun	82c98335d3	NPU perf: eliminate zp	2026-01-15 11:20:31 -08:00
Yu, Zijun	9ca53c7991	Add NPU Q4_0 support	2026-01-15 11:20:31 -08:00
Yu, Zijun	9900245e0b	Fix test-backend-ops: Treat quantized tensors as weights	2026-01-15 11:20:31 -08:00
Yu, Zijun	a1ce428004	Fix Q4_1	2026-01-15 11:19:15 -08:00
Yu, Zijun	dd80b04235	Fix CI; Disable test-backend-ops	2026-01-15 11:19:15 -08:00
Yu, Zijun	6ab76ed10a	Fix accuracy: disable cpu_repack	2026-01-15 11:19:15 -08:00
Yu, Zijun	663a0b8cce	Quant models run with accuracy issue	2026-01-15 11:19:15 -08:00
Yu, Zijun	d4ca760da8	Add quant weight conversion functions from genai gguf reader	2026-01-15 11:19:15 -08:00
Yu, Zijun	3e897df51c	Update supports_buft and supports_op for quantized models	2026-01-15 11:19:15 -08:00
Yu, Zijun	56d596775d	Change openvino device_type to GPU; Enable flash_attn	2026-01-15 11:19:15 -08:00
Yu, Zijun	65e1b1af6d	Fix after rebasing - Layout of cache k and cache v are unified: [seq, n_head, head_size] - Add CPY and FLASH_ATTN_EXT, flash attn is not used yet - Skip test-backend-ops due to flash attn test crash - Add mutex around graph conversion to avoid test-thread-safety fali in the future - Update NPU config - Update GPU config to disable SDPA opt to make phi-3 run	2026-01-15 11:19:15 -08:00
Yu, Zijun	14c8a85c32	Perf: RMS fused to OV internal RMS op	2026-01-15 11:19:15 -08:00
Yu, Zijun	a7b611bc93	Minor updates for raising PR	2026-01-15 11:19:15 -08:00
Yu, Zijun	f4123be967	Fix test-backend-ops	2026-01-15 11:19:15 -08:00
Yu, Zijun	839f8c66a0	Remove CPY	2026-01-15 11:19:15 -08:00
Yu, Zijun	7bda5021f9	Fix NPU	2026-01-15 11:19:15 -08:00
Yu, Zijun	63d000ba40	Support op SET_ROWS	2026-01-15 11:19:15 -08:00
Yu, Zijun	9a91ca6ef9	Optimize tensor conversion, improve TTFT	2026-01-15 11:19:15 -08:00
Yu, Zijun	37ff226bb6	Use CiD for NPU	2026-01-15 11:19:15 -08:00
Yu, Zijun	fc865340d5	Fix test-backend-ops	2026-01-15 10:26:28 -08:00
Yu, Zijun	43489bbfaa	Revert changes in fuse_to_sdpa	2026-01-15 10:26:28 -08:00
Cavus Mustafa	1a19566b23	add mark decomp pass	2026-01-15 10:26:28 -08:00
Cavus Mustafa	93b2d09a2d	mulmat type conversion update	2026-01-15 10:26:28 -08:00
Cavus Mustafa	e2fdc1b988	mulmat input conversion fix	2026-01-15 10:26:28 -08:00
Yu, Zijun	01cdf4a9cc	matmul in fp32	2026-01-15 10:26:28 -08:00
Cavus Mustafa	9cf56d6837	temp. changes for mark decomp	2026-01-15 10:26:28 -08:00
Yu, Zijun	4e7f04a307	Fix llama-perplexity	2026-01-15 10:26:28 -08:00
Yu, Zijun	75eec6265f	Fix llama-bench; Clang-format	2026-01-15 10:26:28 -08:00
Yu, Zijun	6dc4b90635	Fix NPU	2026-01-15 10:26:28 -08:00
Yu, Zijun	44f4cf34b1	Fix Phi3 ROPE; Add test-backend-ops	2026-01-15 10:26:28 -08:00
Yu, Zijun	1ed49bbfaf	Fix llama-cli	2026-01-15 10:26:28 -08:00
Yu, Zijun	d61f83c9b7	Fix CPY due to cgraph change	2026-01-15 10:23:35 -08:00
Yu, Zijun	f3c0519096	Reduce memory: free ov weights node after graph conversion	2026-01-15 10:20:18 -08:00
Yu, Zijun	a80da69448	Pull out sin cos from rope	2026-01-15 10:20:18 -08:00
Yu, Zijun	3533c14cf6	Fix Phi3 SwiGLU and SoftMax	2026-01-15 10:20:18 -08:00
Yu, Zijun	0fa7a5efef	Refactor: remove past_token_len from extra_inputs	2026-01-15 10:20:18 -08:00
Yu, Zijun	acf358d1ce	Pull out indices creation for kv cache update	2026-01-15 10:20:18 -08:00
Yu, Zijun	bf5414c95e	Replace Concat with Broadcast in MulMat for GQA	2026-01-15 10:20:18 -08:00
Yu, Zijun	ebc4fc9f95	Fuse to SDPA	2026-01-15 10:20:18 -08:00
Yu, Zijun	73ee84fffe	Add SwiGLU	2026-01-15 10:20:18 -08:00
Yu, Zijun	4c582ac7a3	Statful transformation for CPU GPU	2026-01-15 10:20:18 -08:00
Yu, Zijun	8afee795ad	Update clang-format	2026-01-15 10:20:18 -08:00
Yu, Zijun	593484ce5f	Refactor: clean, fix warning	2026-01-15 10:20:18 -08:00
Yu, Zijun	42d4240937	Change due to ggml cgraph changes, all device work	2026-01-15 10:20:18 -08:00
Yu, Zijun	e27738a987	Add AMD64 to CMakeLists	2026-01-15 10:20:18 -08:00
Yu, Zijun	592d7f8bbb	Change due to ggml cgraph changes, llama-3.2 CPU work	2026-01-15 10:20:18 -08:00
Yu, Zijun	f7ad77930e	Change due to ggml cgraph changes, not correct yet	2026-01-15 10:20:18 -08:00
Yu, Zijun	d9ca8f5dbe	NPU support version 2: prefill + kvcache	2026-01-15 10:20:18 -08:00
Yu, Zijun	34531abce4	draft NPU support version 2: prefill + kvcache	2026-01-15 10:20:18 -08:00
Yu, Zijun	7fec223334	Add initial NPU support	2026-01-15 10:20:18 -08:00
Yu, Zijun	8ce5cc597a	Add cgraph tensor output name to OV op name	2026-01-15 10:20:18 -08:00
Yu, Zijun	d7cc802292	PERF: use Slice+Concat in writing cache_v	2026-01-15 10:20:18 -08:00
Yu, Zijun	8ac5c225aa	FIX: set_max_token_len	2026-01-15 10:20:18 -08:00
Yu, Zijun	a30dc6e726	PERF: add weight constant in parallel	2026-01-15 10:20:18 -08:00
Yu, Zijun	c57f61494a	FIX: input shape of KQ_mask	2026-01-15 10:20:18 -08:00
Yu, Zijun	041d220dfa	FIX: Re-add tensor names in cgraph, Add another case for RESHAPE	2026-01-15 10:20:13 -08:00
Yu, Zijun	0d505b4e56	STYLE and minor REFACTOR	2026-01-15 10:10:00 -08:00
Yu, Zijun	cdf5370cb5	PERF: favor low precision matmul	2026-01-15 10:10:00 -08:00
Yu, Zijun	0d009fe61a	FEAT: Add all conversion code from ov side	2026-01-15 10:10:00 -08:00
Yu, Zijun	f15a2cc057	STYLE: clang-format	2026-01-15 10:10:00 -08:00
Yu, Zijun	a0b30529bf	FIX: backend buffer type issue	2026-01-15 10:10:00 -08:00
Zijun Yu	4c905b2b25	fix build error	2026-01-15 10:10:00 -08:00
Viraj Wadhwa	ffabe95e2a	Rebase - Bring up to date and fix build process	2026-01-15 10:09:23 -08:00
Yu, Zijun	a8e5efa44e	PERF: compile once (dynamic graph + cache)	2026-01-15 10:05:41 -08:00
Yu, Zijun	7d5e234254	FEAT: improve debug capability	2026-01-15 10:05:41 -08:00
Yu, Zijun	0a8cc9ab03	BUILD: update build doc, add cmake preset, add CACHE_DIR env var	2026-01-15 10:05:41 -08:00
Yu, Zijun	d3bdca25bd	PERF: share const nodes for weights for diff infer	2026-01-15 10:05:41 -08:00
Yu, Zijun	96ba47dd43	STYLE: minor refactor	2026-01-15 10:05:41 -08:00
Yu, Zijun	c04966cda6	REFACTOR: support weigts as constant	2026-01-15 10:05:41 -08:00
Yu, Zijun	0c7b026ecc	FEAT: Add interleaved mode for ROPE	2026-01-15 10:05:41 -08:00
Yu, Zijun	6ed44a3dff	FEAT: do PERMUTE eagerly	2026-01-15 10:05:41 -08:00
Yu, Zijun	8b408869ae	Arbitrary token len (>32) work; Fix bug in mulmat	2026-01-15 10:05:41 -08:00
Yu, Zijun	8d263bd6a5	2nd+ token correct by fix CPY in OV, remove single op backend compute code	2026-01-15 10:05:41 -08:00
Yu, Zijun	91d2a195b5	change op mappings to list in openvino_supports_op	2026-01-15 10:05:41 -08:00
Yu, Zijun	651b2c06cb	* Use find_package in CMake to configure OpenVINO * Remove OPENVINO_OP_DEBUG * Simplify set_input_output in decoder * Fix CPY in set_input_output * Use params from converted ov model in setting input	2026-01-15 10:05:41 -08:00
zhanmyz	84be5c6f15	1. Delete some comments 2. Process Prompt and predict first token is OK	2026-01-15 10:05:41 -08:00
zhanmyz	eac9a99530	1. Solve the AC issue of Permute+VIEW and MULMAL issue in the phase of “1. Process Prompt and predict the first token”. 2. There is still an AC issue in the "2. Predict the subsequent tokens phase" and it is being debugged. A deviation has been detected in the computation of OpenVINO's CPY Node at stage 2, and it is currently being fixed.	2026-01-15 10:05:41 -08:00
zhanmyz	8ae700ae11	Process Prompt and predict first token is OK	2026-01-15 10:05:41 -08:00
zhanmyz	8020138406	add debug info	2026-01-15 10:05:41 -08:00
zhanmyz	b02265a507	1. In the Prompt process and predict first token stage, the PERMUTE node needs to be integrated into the OV Frontend 2. In the predict latest token stage, the VIEW, CONT, Reshape need to be integrated into the OV Frontend.	2026-01-15 10:05:41 -08:00

2088 Commits