Xuejun Zhai
91a1b20c82
Fix error for decoder cache
2026-01-15 11:39:08 -08:00
Xuejun Zhai
42ca27f714
Removed API get_input_type
2026-01-15 11:39:08 -08:00
Yu, Zijun
8f4ee4eee2
minor update due to ov 2025.4
2026-01-15 11:39:08 -08:00
Yu, Zijun
2a9d4ca836
Refactor: split ov_graph_compute for dynamic and static
2026-01-15 11:39:08 -08:00
Yu, Zijun
808619e274
NPU support llama-perplexity -b 512 --no-warmup
2026-01-15 11:39:08 -08:00
Yu, Zijun
65348b5d20
fallback naive run with accuracy issue
2026-01-15 11:39:08 -08:00
Yu, Zijun
59e7e7c47d
NPU fix llama-bench
2026-01-15 11:39:08 -08:00
Yu, Zijun
38254cf592
NPU prefill chunking
2026-01-15 11:39:08 -08:00
Yu, Zijun
531941b348
Fix NPU
2026-01-15 11:28:48 -08:00
Yu, Zijun
ae404f7cbb
Fix llama-bench
2026-01-15 11:28:48 -08:00
Yu, Zijun
072dde0b2b
change graph to 4d, support multi sequences
2026-01-15 11:28:48 -08:00
Yu, Zijun
ea2c99be1c
NPU unify PD (handled internally)
2026-01-15 11:28:48 -08:00
Zijun Yu
b8690bc055
NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
2026-01-15 11:27:30 -08:00
Yu, Zijun
eba8113dc4
Style: middle ptr and ref align, omit optional struct keyword
2026-01-15 11:27:30 -08:00
cavusmustafa
e7252920e1
env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
2026-01-15 11:26:00 -08:00
Yu, Zijun
f3afa7b914
Requantize Q6_K (gs16) to gs32 on GPU
2026-01-15 11:26:00 -08:00
Yu, Zijun
e4bfe5a20d
Add Q5_K to support phi-3-q4_k_m
2026-01-15 11:26:00 -08:00
Yu, Zijun
2f1d50fb07
Minor refactor
2026-01-15 11:26:00 -08:00
Yu, Zijun
602f9ca4af
Fix NPU accuracy
2026-01-15 11:26:00 -08:00
Yu, Zijun
9de874cb7b
Support iSWA
2026-01-15 11:25:58 -08:00
Yu, Zijun
da2cc993bc
Workaround for NPU first-token accuracy issue
2026-01-15 11:20:31 -08:00
Yu, Zijun
434059aef7
Fix NPU compile
2026-01-15 11:20:31 -08:00
Yu, Zijun
bcc343af00
Support BF16 model
2026-01-15 11:20:31 -08:00
Yu, Zijun
dc77cbb3f6
STYLE: make get_types_to_requant a function
2026-01-15 11:20:31 -08:00
Yu, Zijun
2ad1147b9b
Improve debug util; Eliminate nop ReshapeReshape
2026-01-15 11:20:31 -08:00
Yu, Zijun
6926655f5b
Add custom quant type: q8_1_c, q4_0_128
2026-01-15 11:20:31 -08:00
Yu, Zijun
b593428eb3
Dequantize q4_1 q4_k q6_k for NPU
2026-01-15 11:20:31 -08:00
Yu, Zijun
9900245e0b
Fix test-backend-ops: Treat quantized tensors as weights
2026-01-15 11:20:31 -08:00
Yu, Zijun
65e1b1af6d
Fix after rebasing
- Layout of cache k and cache v is unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety failures in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun
f4123be967
Fix test-backend-ops
2026-01-15 11:19:15 -08:00
Yu, Zijun
7bda5021f9
Fix NPU
2026-01-15 11:19:15 -08:00
Yu, Zijun
37ff226bb6
Use CiD for NPU
2026-01-15 11:19:15 -08:00
Yu, Zijun
fc865340d5
Fix test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
4e7f04a307
Fix llama-perplexity
2026-01-15 10:26:28 -08:00
Yu, Zijun
6dc4b90635
Fix NPU
2026-01-15 10:26:28 -08:00
Yu, Zijun
44f4cf34b1
Fix Phi3 ROPE; Add test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
f3c0519096
Reduce memory: free ov weights node after graph conversion
2026-01-15 10:20:18 -08:00
Yu, Zijun
ebc4fc9f95
Fuse to SDPA
2026-01-15 10:20:18 -08:00
Yu, Zijun
4c582ac7a3
Stateful transformation for CPU GPU
2026-01-15 10:20:18 -08:00
Yu, Zijun
8afee795ad
Update clang-format
2026-01-15 10:20:18 -08:00
Yu, Zijun
593484ce5f
Refactor: clean, fix warning
2026-01-15 10:20:18 -08:00
Yu, Zijun
592d7f8bbb
Change due to ggml cgraph changes, llama-3.2 CPU work
2026-01-15 10:20:18 -08:00
Yu, Zijun
d9ca8f5dbe
NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
34531abce4
draft NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
7fec223334
Add initial NPU support
2026-01-15 10:20:18 -08:00
Yu, Zijun
8ac5c225aa
FIX: set_max_token_len
2026-01-15 10:20:18 -08:00
Yu, Zijun
0d505b4e56
STYLE and minor REFACTOR
2026-01-15 10:10:00 -08:00
Yu, Zijun
0d009fe61a
FEAT: Add all conversion code from ov side
2026-01-15 10:10:00 -08:00
Viraj Wadhwa
ffabe95e2a
Rebase - Bring up to date and fix build process
2026-01-15 10:09:23 -08:00
Yu, Zijun
a8e5efa44e
PERF: compile once (dynamic graph + cache)
2026-01-15 10:05:41 -08:00