Yu, Zijun
f3afa7b914
Requantize Q6_K (gs16) to gs32 on GPU
2026-01-15 11:26:00 -08:00
Yu, Zijun
e4bfe5a20d
Add Q5_K to support phi-3-q4_k_m
2026-01-15 11:26:00 -08:00
Yu, Zijun
2f1d50fb07
Minor refactor
2026-01-15 11:26:00 -08:00
Yu, Zijun
67e178a2f6
Minor: do not add attention_size_swa for non-SWA models
2026-01-15 11:26:00 -08:00
Yu, Zijun
1a38339cea
Fix ROPE accuracy when freq_scale != 1
2026-01-15 11:26:00 -08:00
Yu, Zijun
602f9ca4af
Fix NPU accuracy
2026-01-15 11:26:00 -08:00
Yu, Zijun
9de874cb7b
Support iSWA
2026-01-15 11:25:58 -08:00
Yu, Zijun
7d81861a18
Fix Hunyuan
2026-01-15 11:20:31 -08:00
Yu, Zijun
597561242f
Add GeGLU
2026-01-15 11:20:31 -08:00
Yu, Zijun
be07073e0e
Apply EliminateZP only for npu
2026-01-15 11:20:31 -08:00
Yu, Zijun
da2cc993bc
Workaround for NPU 1st-token accuracy issue
2026-01-15 11:20:31 -08:00
Yu, Zijun
434059aef7
Fix NPU compile
2026-01-15 11:20:31 -08:00
Yu, Zijun
bcc343af00
Support BF16 model
2026-01-15 11:20:31 -08:00
Yu, Zijun
dc77cbb3f6
STYLE: make get_types_to_requant a function
2026-01-15 11:20:31 -08:00
Yu, Zijun
2ad1147b9b
Improve debug util; Eliminate nop ReshapeReshape
2026-01-15 11:20:31 -08:00
Yu, Zijun
0f7b253cb3
Fix after rebasing
2026-01-15 11:20:31 -08:00
Yu, Zijun
810eb480f5
Simplify translation of get_rows
2026-01-15 11:20:31 -08:00
Yu, Zijun
c5231a2448
Set m_is_static=false as default in decoder
2026-01-15 11:20:31 -08:00
Yu, Zijun
6926655f5b
Add custom quant type: q8_1_c, q4_0_128
2026-01-15 11:20:31 -08:00
Yu, Zijun
b593428eb3
Dequantize q4_1 q4_k q6_k for NPU
2026-01-15 11:20:31 -08:00
Yu, Zijun
82c98335d3
NPU perf: eliminate zp
2026-01-15 11:20:31 -08:00
Yu, Zijun
9ca53c7991
Add NPU Q4_0 support
2026-01-15 11:20:31 -08:00
Yu, Zijun
9900245e0b
Fix test-backend-ops: Treat quantized tensors as weights
2026-01-15 11:20:31 -08:00
Yu, Zijun
a1ce428004
Fix Q4_1
2026-01-15 11:19:15 -08:00
Yu, Zijun
dd80b04235
Fix CI; Disable test-backend-ops
2026-01-15 11:19:15 -08:00
Yu, Zijun
6ab76ed10a
Fix accuracy: disable cpu_repack
2026-01-15 11:19:15 -08:00
Yu, Zijun
663a0b8cce
Quant models run with accuracy issue
2026-01-15 11:19:15 -08:00
Yu, Zijun
d4ca760da8
Add quant weight conversion functions from genai gguf reader
2026-01-15 11:19:15 -08:00
Yu, Zijun
3e897df51c
Update supports_buft and supports_op for quantized models
2026-01-15 11:19:15 -08:00
Yu, Zijun
56d596775d
Change openvino device_type to GPU; Enable flash_attn
2026-01-15 11:19:15 -08:00
Yu, Zijun
65e1b1af6d
Fix after rebasing
...
- Layout of cache k and cache v is unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failures in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun
14c8a85c32
Perf: RMS fused to OV internal RMS op
2026-01-15 11:19:15 -08:00
Yu, Zijun
a7b611bc93
Minor updates for raising PR
2026-01-15 11:19:15 -08:00
Yu, Zijun
f4123be967
Fix test-backend-ops
2026-01-15 11:19:15 -08:00
Yu, Zijun
839f8c66a0
Remove CPY
2026-01-15 11:19:15 -08:00
Yu, Zijun
7bda5021f9
Fix NPU
2026-01-15 11:19:15 -08:00
Yu, Zijun
63d000ba40
Support op SET_ROWS
2026-01-15 11:19:15 -08:00
Yu, Zijun
9a91ca6ef9
Optimize tensor conversion, improve TTFT
2026-01-15 11:19:15 -08:00
Yu, Zijun
37ff226bb6
Use CiD for NPU
2026-01-15 11:19:15 -08:00
Yu, Zijun
fc865340d5
Fix test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
43489bbfaa
Revert changes in fuse_to_sdpa
2026-01-15 10:26:28 -08:00
Cavus Mustafa
1a19566b23
add mark decomp pass
2026-01-15 10:26:28 -08:00
Cavus Mustafa
93b2d09a2d
mulmat type conversion update
2026-01-15 10:26:28 -08:00
Cavus Mustafa
e2fdc1b988
mulmat input conversion fix
2026-01-15 10:26:28 -08:00
Yu, Zijun
01cdf4a9cc
matmul in fp32
2026-01-15 10:26:28 -08:00
Cavus Mustafa
9cf56d6837
temp. changes for mark decomp
2026-01-15 10:26:28 -08:00
Yu, Zijun
4e7f04a307
Fix llama-perplexity
2026-01-15 10:26:28 -08:00
Yu, Zijun
75eec6265f
Fix llama-bench; Clang-format
2026-01-15 10:26:28 -08:00
Yu, Zijun
6dc4b90635
Fix NPU
2026-01-15 10:26:28 -08:00
Yu, Zijun
44f4cf34b1
Fix Phi3 ROPE; Add test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
1ed49bbfaf
Fix llama-cli
2026-01-15 10:26:28 -08:00
Yu, Zijun
d61f83c9b7
Fix CPY due to cgraph change
2026-01-15 10:23:35 -08:00
Yu, Zijun
f3c0519096
Reduce memory: free ov weights node after graph conversion
2026-01-15 10:20:18 -08:00
Yu, Zijun
a80da69448
Pull out sin cos from rope
2026-01-15 10:20:18 -08:00
Yu, Zijun
3533c14cf6
Fix Phi3 SwiGLU and SoftMax
2026-01-15 10:20:18 -08:00
Yu, Zijun
0fa7a5efef
Refactor: remove past_token_len from extra_inputs
2026-01-15 10:20:18 -08:00
Yu, Zijun
acf358d1ce
Pull out indices creation for kv cache update
2026-01-15 10:20:18 -08:00
Yu, Zijun
bf5414c95e
Replace Concat with Broadcast in MulMat for GQA
2026-01-15 10:20:18 -08:00
Yu, Zijun
ebc4fc9f95
Fuse to SDPA
2026-01-15 10:20:18 -08:00
Yu, Zijun
73ee84fffe
Add SwiGLU
2026-01-15 10:20:18 -08:00
Yu, Zijun
4c582ac7a3
Stateful transformation for CPU GPU
2026-01-15 10:20:18 -08:00
Yu, Zijun
8afee795ad
Update clang-format
2026-01-15 10:20:18 -08:00
Yu, Zijun
593484ce5f
Refactor: clean, fix warning
2026-01-15 10:20:18 -08:00
Yu, Zijun
42d4240937
Change due to ggml cgraph changes, all device work
2026-01-15 10:20:18 -08:00
Yu, Zijun
e27738a987
Add AMD64 to CMakeLists
2026-01-15 10:20:18 -08:00
Yu, Zijun
592d7f8bbb
Change due to ggml cgraph changes, llama-3.2 CPU work
2026-01-15 10:20:18 -08:00
Yu, Zijun
f7ad77930e
Change due to ggml cgraph changes, not correct yet
2026-01-15 10:20:18 -08:00
Yu, Zijun
d9ca8f5dbe
NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
34531abce4
draft NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
7fec223334
Add initial NPU support
2026-01-15 10:20:18 -08:00
Yu, Zijun
8ce5cc597a
Add cgraph tensor output name to OV op name
2026-01-15 10:20:18 -08:00
Yu, Zijun
d7cc802292
PERF: use Slice+Concat in writing cache_v
2026-01-15 10:20:18 -08:00
Yu, Zijun
8ac5c225aa
FIX: set_max_token_len
2026-01-15 10:20:18 -08:00
Yu, Zijun
a30dc6e726
PERF: add weight constant in parallel
2026-01-15 10:20:18 -08:00
Yu, Zijun
c57f61494a
FIX: input shape of KQ_mask
2026-01-15 10:20:18 -08:00
Yu, Zijun
041d220dfa
FIX: Re-add tensor names in cgraph, Add another case for RESHAPE
2026-01-15 10:20:13 -08:00
Yu, Zijun
0d505b4e56
STYLE and minor REFACTOR
2026-01-15 10:10:00 -08:00
Yu, Zijun
cdf5370cb5
PERF: favor low precision matmul
2026-01-15 10:10:00 -08:00
Yu, Zijun
0d009fe61a
FEAT: Add all conversion code from ov side
2026-01-15 10:10:00 -08:00
Yu, Zijun
f15a2cc057
STYLE: clang-format
2026-01-15 10:10:00 -08:00
Yu, Zijun
a0b30529bf
FIX: backend buffer type issue
2026-01-15 10:10:00 -08:00
Zijun Yu
4c905b2b25
fix build error
2026-01-15 10:10:00 -08:00
Viraj Wadhwa
ffabe95e2a
Rebase - Bring up to date and fix build process
2026-01-15 10:09:23 -08:00
Yu, Zijun
a8e5efa44e
PERF: compile once (dynamic graph + cache)
2026-01-15 10:05:41 -08:00
Yu, Zijun
7d5e234254
FEAT: improve debug capability
2026-01-15 10:05:41 -08:00
Yu, Zijun
0a8cc9ab03
BUILD: update build doc, add cmake preset, add CACHE_DIR env var
2026-01-15 10:05:41 -08:00
Yu, Zijun
d3bdca25bd
PERF: share const nodes for weights for diff infer
2026-01-15 10:05:41 -08:00
Yu, Zijun
96ba47dd43
STYLE: minor refactor
2026-01-15 10:05:41 -08:00
Yu, Zijun
c04966cda6
REFACTOR: support weights as constant
2026-01-15 10:05:41 -08:00
Yu, Zijun
0c7b026ecc
FEAT: Add interleaved mode for ROPE
2026-01-15 10:05:41 -08:00
Yu, Zijun
6ed44a3dff
FEAT: do PERMUTE eagerly
2026-01-15 10:05:41 -08:00
Yu, Zijun
8b408869ae
Arbitrary token len (>32) works; Fix bug in mulmat
2026-01-15 10:05:41 -08:00
Yu, Zijun
8d263bd6a5
2nd+ tokens correct after fixing CPY in OV; remove single-op backend compute code
2026-01-15 10:05:41 -08:00
Yu, Zijun
91d2a195b5
change op mappings to list in openvino_supports_op
2026-01-15 10:05:41 -08:00
Yu, Zijun
651b2c06cb
* Use find_package in CMake to configure OpenVINO
...
* Remove OPENVINO_OP_DEBUG
* Simplify set_input_output in decoder
* Fix CPY in set_input_output
* Use params from converted ov model in setting input
2026-01-15 10:05:41 -08:00
zhanmyz
84be5c6f15
1. Delete some comments
...
2. Process Prompt and predict first token is OK
2026-01-15 10:05:41 -08:00
zhanmyz
eac9a99530
1. Solve the AC issue of Permute+VIEW and the MULMAT issue in the phase of “1. Process Prompt and predict the first token”.
...
2. There is still an AC issue in the "2. Predict the subsequent tokens phase" and it is being debugged.
A deviation has been detected in the computation of OpenVINO's CPY Node at stage 2, and it is currently being fixed.
2026-01-15 10:05:41 -08:00
zhanmyz
8ae700ae11
Process Prompt and predict first token is OK
2026-01-15 10:05:41 -08:00
zhanmyz
8020138406
add debug info
2026-01-15 10:05:41 -08:00
zhanmyz
b02265a507
1. In the Prompt process and predict first token stage, the PERMUTE node needs to be integrated into the OV Frontend
...
2. In the predict latest token stage, the VIEW, CONT, Reshape need to be integrated into the OV Frontend.
2026-01-15 10:05:41 -08:00