Commit Graph

2088 Commits

Author SHA1 Message Date
Yu, Zijun f3afa7b914 Requantize Q6_K (gs16) to gs32 on GPU 2026-01-15 11:26:00 -08:00
Yu, Zijun e4bfe5a20d Add Q5_K to support phi-3-q4_k_m 2026-01-15 11:26:00 -08:00
Yu, Zijun 2f1d50fb07 Minor refactor 2026-01-15 11:26:00 -08:00
Yu, Zijun 67e178a2f6 Minor: not add attention_size_swa for non-swa model 2026-01-15 11:26:00 -08:00
Yu, Zijun 1a38339cea Fix ROPE accuracy when freq_scale != 1 2026-01-15 11:26:00 -08:00
Yu, Zijun 602f9ca4af Fix NPU accuracy 2026-01-15 11:26:00 -08:00
Yu, Zijun 9de874cb7b Support iSWA 2026-01-15 11:25:58 -08:00
Yu, Zijun 7d81861a18 Fix Hunyuan 2026-01-15 11:20:31 -08:00
Yu, Zijun 597561242f Add GeGLU 2026-01-15 11:20:31 -08:00
Yu, Zijun be07073e0e Apply EliminateZP only for npu 2026-01-15 11:20:31 -08:00
Yu, Zijun da2cc993bc WA for npu 1st token acc issue 2026-01-15 11:20:31 -08:00
Yu, Zijun 434059aef7 Fix NPU compile 2026-01-15 11:20:31 -08:00
Yu, Zijun bcc343af00 Support BF16 model 2026-01-15 11:20:31 -08:00
Yu, Zijun dc77cbb3f6 STYLE: make get_types_to_requant a function 2026-01-15 11:20:31 -08:00
Yu, Zijun 2ad1147b9b Improve debug util; Eliminate nop ReshapeReshape 2026-01-15 11:20:31 -08:00
Yu, Zijun 0f7b253cb3 Fix after rebasing 2026-01-15 11:20:31 -08:00
Yu, Zijun 810eb480f5 Simplify translation of get_rows 2026-01-15 11:20:31 -08:00
Yu, Zijun c5231a2448 Set m_is_static=false as default in decoder 2026-01-15 11:20:31 -08:00
Yu, Zijun 6926655f5b Add custom quant type: q8_1_c, q4_0_128 2026-01-15 11:20:31 -08:00
Yu, Zijun b593428eb3 Dequantize q4_1 q4_k q6_k for NPU 2026-01-15 11:20:31 -08:00
Yu, Zijun 82c98335d3 NPU perf: eliminate zp 2026-01-15 11:20:31 -08:00
Yu, Zijun 9ca53c7991 Add NPU Q4_0 support 2026-01-15 11:20:31 -08:00
Yu, Zijun 9900245e0b Fix test-backend-ops: Treat quantized tensors as weights 2026-01-15 11:20:31 -08:00
Yu, Zijun a1ce428004 Fix Q4_1 2026-01-15 11:19:15 -08:00
Yu, Zijun dd80b04235 Fix CI; Disable test-backend-ops 2026-01-15 11:19:15 -08:00
Yu, Zijun 6ab76ed10a Fix accuracy: disable cpu_repack 2026-01-15 11:19:15 -08:00
Yu, Zijun 663a0b8cce Quant models run with accuracy issue 2026-01-15 11:19:15 -08:00
Yu, Zijun d4ca760da8 Add quant weight conversion functions from genai gguf reader 2026-01-15 11:19:15 -08:00
Yu, Zijun 3e897df51c Update supports_buft and supports_op for quantized models 2026-01-15 11:19:15 -08:00
Yu, Zijun 56d596775d Change openvino device_type to GPU; Enable flash_attn 2026-01-15 11:19:15 -08:00
Yu, Zijun 65e1b1af6d Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failures in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun 14c8a85c32 Perf: RMS fused to OV internal RMS op 2026-01-15 11:19:15 -08:00
Yu, Zijun a7b611bc93 Minor updates for raising PR 2026-01-15 11:19:15 -08:00
Yu, Zijun f4123be967 Fix test-backend-ops 2026-01-15 11:19:15 -08:00
Yu, Zijun 839f8c66a0 Remove CPY 2026-01-15 11:19:15 -08:00
Yu, Zijun 7bda5021f9 Fix NPU 2026-01-15 11:19:15 -08:00
Yu, Zijun 63d000ba40 Support op SET_ROWS 2026-01-15 11:19:15 -08:00
Yu, Zijun 9a91ca6ef9 Optimize tensor conversion, improve TTFT 2026-01-15 11:19:15 -08:00
Yu, Zijun 37ff226bb6 Use CiD for NPU 2026-01-15 11:19:15 -08:00
Yu, Zijun fc865340d5 Fix test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun 43489bbfaa Revert changes in fuse_to_sdpa 2026-01-15 10:26:28 -08:00
Cavus Mustafa 1a19566b23 add mark decomp pass 2026-01-15 10:26:28 -08:00
Cavus Mustafa 93b2d09a2d mulmat type conversion update 2026-01-15 10:26:28 -08:00
Cavus Mustafa e2fdc1b988 mulmat input conversion fix 2026-01-15 10:26:28 -08:00
Yu, Zijun 01cdf4a9cc matmul in fp32 2026-01-15 10:26:28 -08:00
Cavus Mustafa 9cf56d6837 temp. changes for mark decomp 2026-01-15 10:26:28 -08:00
Yu, Zijun 4e7f04a307 Fix llama-perplexity 2026-01-15 10:26:28 -08:00
Yu, Zijun 75eec6265f Fix llama-bench; Clang-format 2026-01-15 10:26:28 -08:00
Yu, Zijun 6dc4b90635 Fix NPU 2026-01-15 10:26:28 -08:00
Yu, Zijun 44f4cf34b1 Fix Phi3 ROPE; Add test-backend-ops 2026-01-15 10:26:28 -08:00
Yu, Zijun 1ed49bbfaf Fix llama-cli 2026-01-15 10:26:28 -08:00
Yu, Zijun d61f83c9b7 Fix CPY due to cgraph change 2026-01-15 10:23:35 -08:00
Yu, Zijun f3c0519096 Reduce memory: free ov weights node after graph conversion 2026-01-15 10:20:18 -08:00
Yu, Zijun a80da69448 Pull out sin cos from rope 2026-01-15 10:20:18 -08:00
Yu, Zijun 3533c14cf6 Fix Phi3 SwiGLU and SoftMax 2026-01-15 10:20:18 -08:00
Yu, Zijun 0fa7a5efef Refactor: remove past_token_len from extra_inputs 2026-01-15 10:20:18 -08:00
Yu, Zijun acf358d1ce Pull out indices creation for kv cache update 2026-01-15 10:20:18 -08:00
Yu, Zijun bf5414c95e Replace Concat with Broadcast in MulMat for GQA 2026-01-15 10:20:18 -08:00
Yu, Zijun ebc4fc9f95 Fuse to SDPA 2026-01-15 10:20:18 -08:00
Yu, Zijun 73ee84fffe Add SwiGLU 2026-01-15 10:20:18 -08:00
Yu, Zijun 4c582ac7a3 Stateful transformation for CPU GPU 2026-01-15 10:20:18 -08:00
Yu, Zijun 8afee795ad Update clang-format 2026-01-15 10:20:18 -08:00
Yu, Zijun 593484ce5f Refactor: clean, fix warning 2026-01-15 10:20:18 -08:00
Yu, Zijun 42d4240937 Change due to ggml cgraph changes, all device work 2026-01-15 10:20:18 -08:00
Yu, Zijun e27738a987 Add AMD64 to CMakeLists 2026-01-15 10:20:18 -08:00
Yu, Zijun 592d7f8bbb Change due to ggml cgraph changes, llama-3.2 CPU work 2026-01-15 10:20:18 -08:00
Yu, Zijun f7ad77930e Change due to ggml cgraph changes, not correct yet 2026-01-15 10:20:18 -08:00
Yu, Zijun d9ca8f5dbe NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 34531abce4 draft NPU support version 2: prefill + kvcache 2026-01-15 10:20:18 -08:00
Yu, Zijun 7fec223334 Add initial NPU support 2026-01-15 10:20:18 -08:00
Yu, Zijun 8ce5cc597a Add cgraph tensor output name to OV op name 2026-01-15 10:20:18 -08:00
Yu, Zijun d7cc802292 PERF: use Slice+Concat in writing cache_v 2026-01-15 10:20:18 -08:00
Yu, Zijun 8ac5c225aa FIX: set_max_token_len 2026-01-15 10:20:18 -08:00
Yu, Zijun a30dc6e726 PERF: add weight constant in parallel 2026-01-15 10:20:18 -08:00
Yu, Zijun c57f61494a FIX: input shape of KQ_mask 2026-01-15 10:20:18 -08:00
Yu, Zijun 041d220dfa FIX: Re-add tensor names in cgraph, Add another case for RESHAPE 2026-01-15 10:20:13 -08:00
Yu, Zijun 0d505b4e56 STYLE and minor REFACTOR 2026-01-15 10:10:00 -08:00
Yu, Zijun cdf5370cb5 PERF: favor low precision matmul 2026-01-15 10:10:00 -08:00
Yu, Zijun 0d009fe61a FEAT: Add all conversion code from ov side 2026-01-15 10:10:00 -08:00
Yu, Zijun f15a2cc057 STYLE: clang-format 2026-01-15 10:10:00 -08:00
Yu, Zijun a0b30529bf FIX: backend buffer type issue 2026-01-15 10:10:00 -08:00
Zijun Yu 4c905b2b25 fix build error 2026-01-15 10:10:00 -08:00
Viraj Wadhwa ffabe95e2a Rebase - Bring up to date and fix build process 2026-01-15 10:09:23 -08:00
Yu, Zijun a8e5efa44e PERF: compile once (dynamic graph + cache) 2026-01-15 10:05:41 -08:00
Yu, Zijun 7d5e234254 FEAT: improve debug capability 2026-01-15 10:05:41 -08:00
Yu, Zijun 0a8cc9ab03 BUILD: update build doc, add cmake preset, add CACHE_DIR env var 2026-01-15 10:05:41 -08:00
Yu, Zijun d3bdca25bd PERF: share const nodes for weights for diff infer 2026-01-15 10:05:41 -08:00
Yu, Zijun 96ba47dd43 STYLE: minor refactor 2026-01-15 10:05:41 -08:00
Yu, Zijun c04966cda6 REFACTOR: support weights as constant 2026-01-15 10:05:41 -08:00
Yu, Zijun 0c7b026ecc FEAT: Add interleaved mode for ROPE 2026-01-15 10:05:41 -08:00
Yu, Zijun 6ed44a3dff FEAT: do PERMUTE eagerly 2026-01-15 10:05:41 -08:00
Yu, Zijun 8b408869ae Arbitrary token len (>32) work; Fix bug in mulmat 2026-01-15 10:05:41 -08:00
Yu, Zijun 8d263bd6a5 2nd+ token correct by fix CPY in OV, remove single op backend compute code 2026-01-15 10:05:41 -08:00
Yu, Zijun 91d2a195b5 change op mappings to list in openvino_supports_op 2026-01-15 10:05:41 -08:00
Yu, Zijun 651b2c06cb * Use find_package in CMake to configure OpenVINO
* Remove OPENVINO_OP_DEBUG
* Simplify set_input_output in decoder
* Fix CPY in set_input_output
* Use params from converted ov model in setting input
2026-01-15 10:05:41 -08:00
zhanmyz 84be5c6f15 1. Delete some comments
2. Process Prompt and predict first token is OK
2026-01-15 10:05:41 -08:00
zhanmyz eac9a99530 1. Solve the AC issue of Permute+VIEW and MULMAT issue in the phase of “1. Process Prompt and predict the first token”.
2. There is still an AC issue in the "2. Predict the subsequent tokens phase" and it is being debugged.
   A deviation has been detected in the computation of OpenVINO's CPY Node at stage 2, and it is currently being fixed.
2026-01-15 10:05:41 -08:00
zhanmyz 8ae700ae11 Process Prompt and predict first token is OK 2026-01-15 10:05:41 -08:00
zhanmyz 8020138406 add debug info 2026-01-15 10:05:41 -08:00
zhanmyz b02265a507 1. In the Prompt process and predict first token stage, the PERMUTE node needs to be integrated into the OV Frontend
2. In the predict latest token stage, the VIEW, CONT, Reshape need to be integrated into the OV Frontend.
2026-01-15 10:05:41 -08:00