Commit Graph

74 Commits

Author SHA1 Message Date
Xuejun Zhai 91a1b20c82 Fix error for decoder cache 2026-01-15 11:39:08 -08:00
Xuejun Zhai 42ca27f714 Removed API get_input_type 2026-01-15 11:39:08 -08:00
Yu, Zijun 8f4ee4eee2 minor update due to ov 2025.4 2026-01-15 11:39:08 -08:00
Yu, Zijun 2a9d4ca836 Refactor: split ov_graph_compute for dynamic and static 2026-01-15 11:39:08 -08:00
Yu, Zijun 808619e274 NPU support llama-perplexity -b 512 --no-warmup 2026-01-15 11:39:08 -08:00
Yu, Zijun 65348b5d20 fallback naive run with accuracy issue 2026-01-15 11:39:08 -08:00
Yu, Zijun 59e7e7c47d NPU fix llama-bench 2026-01-15 11:39:08 -08:00
Yu, Zijun 38254cf592 NPU prefill chunking 2026-01-15 11:39:08 -08:00
Yu, Zijun 531941b348 Fix NPU 2026-01-15 11:28:48 -08:00
Yu, Zijun ae404f7cbb Fix llama-bench 2026-01-15 11:28:48 -08:00
Yu, Zijun 072dde0b2b change graph to 4d, support multi sequences 2026-01-15 11:28:48 -08:00
Yu, Zijun ea2c99be1c NPU unify PD (handled internally) 2026-01-15 11:28:48 -08:00
Zijun Yu b8690bc055 NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims
2026-01-15 11:27:30 -08:00
Yu, Zijun eba8113dc4 Style: middle ptr and ref align, omit optional struct keyword 2026-01-15 11:27:30 -08:00
cavusmustafa e7252920e1 env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added 2026-01-15 11:26:00 -08:00
Yu, Zijun f3afa7b914 Requantize Q6_K (gs16) to gs32 on GPU 2026-01-15 11:26:00 -08:00
Yu, Zijun e4bfe5a20d Add Q5_K to support phi-3-q4_k_m 2026-01-15 11:26:00 -08:00
Yu, Zijun 2f1d50fb07 Minor refactor 2026-01-15 11:26:00 -08:00
Yu, Zijun 602f9ca4af Fix NPU accuracy 2026-01-15 11:26:00 -08:00
Yu, Zijun 9de874cb7b Support iSWA 2026-01-15 11:25:58 -08:00
Yu, Zijun da2cc993bc WA for npu 1st token acc issue 2026-01-15 11:20:31 -08:00
Yu, Zijun 434059aef7 Fix NPU compile 2026-01-15 11:20:31 -08:00
Yu, Zijun bcc343af00 Support BF16 model 2026-01-15 11:20:31 -08:00
Yu, Zijun dc77cbb3f6 STYLE: make get_types_to_requant a function 2026-01-15 11:20:31 -08:00
Yu, Zijun 2ad1147b9b Improve debug util; Eliminate nop ReshapeReshape 2026-01-15 11:20:31 -08:00
Yu, Zijun 6926655f5b Add custom quant type: q8_1_c, q4_0_128 2026-01-15 11:20:31 -08:00
Yu, Zijun b593428eb3 Dequantize q4_1 q4_k q6_k for NPU 2026-01-15 11:20:31 -08:00
Yu, Zijun 9900245e0b Fix test-backend-ops: Treat quantized tensors as weights 2026-01-15 11:20:31 -08:00
Yu, Zijun 65e1b1af6d Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failure in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun f4123be967 Fix test-backend-ops 2026-01-15 11:19:15 -08:00
Yu, Zijun 7bda5021f9 Fix NPU 2026-01-15 11:19:15 -08:00
Yu, Zijun 37ff226bb6 Use CiD for NPU 2026-01-15 11:19:15 -08:00
Yu, Zijun fc865340d5 Fix test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun 4e7f04a307 Fix llama-perplexity 2026-01-15 10:26:28 -08:00
Yu, Zijun 6dc4b90635 Fix NPU 2026-01-15 10:26:28 -08:00
Yu, Zijun 44f4cf34b1 Fix Phi3 ROPE; Add test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun f3c0519096 Reduce memory: free ov weights node after graph conversion 2026-01-15 10:20:18 -08:00
Yu, Zijun ebc4fc9f95 Fuse to SDPA 2026-01-15 10:20:18 -08:00
Yu, Zijun 4c582ac7a3 Stateful transformation for CPU GPU 2026-01-15 10:20:18 -08:00
Yu, Zijun 8afee795ad Update clang-format 2026-01-15 10:20:18 -08:00
Yu, Zijun 593484ce5f Refactor: clean, fix warning 2026-01-15 10:20:18 -08:00
Yu, Zijun 592d7f8bbb Change due to ggml cgraph changes, llama-3.2 CPU work 2026-01-15 10:20:18 -08:00
Yu, Zijun d9ca8f5dbe NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 34531abce4 draft NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 7fec223334 Add initial NPU support 2026-01-15 10:20:18 -08:00
Yu, Zijun 8ac5c225aa FIX: set_max_token_len 2026-01-15 10:20:18 -08:00
Yu, Zijun 0d505b4e56 STYLE and minor REFACTOR 2026-01-15 10:10:00 -08:00
Yu, Zijun 0d009fe61a FEAT: Add all conversion code from ov side 2026-01-15 10:10:00 -08:00
Viraj Wadhwa ffabe95e2a Rebase - Bring up to date and fix build process 2026-01-15 10:09:23 -08:00
Yu, Zijun a8e5efa44e PERF: compile once (dynamic graph + cache) 2026-01-15 10:05:41 -08:00