Yu, Zijun
7d81861a18
Fix Hunyuan
2026-01-15 11:20:31 -08:00
Yu, Zijun
bcc343af00
Support BF16 model
2026-01-15 11:20:31 -08:00
Yu, Zijun
2ad1147b9b
Improve debug util; Eliminate nop ReshapeReshape
2026-01-15 11:20:31 -08:00
Yu, Zijun
6926655f5b
Add custom quant type: q8_1_c, q4_0_128
2026-01-15 11:20:31 -08:00
Yu, Zijun
b593428eb3
Dequantize q4_1 q4_k q6_k for NPU
2026-01-15 11:20:31 -08:00
Yu, Zijun
9900245e0b
Fix test-backend-ops: Treat quantized tensors as weights
2026-01-15 11:20:31 -08:00
Yu, Zijun
dd80b04235
Fix CI; Disable test-backend-ops
2026-01-15 11:19:15 -08:00
Yu, Zijun
6ab76ed10a
Fix accuracy: disable cpu_repack
2026-01-15 11:19:15 -08:00
Yu, Zijun
663a0b8cce
Quant models run with accuracy issue
2026-01-15 11:19:15 -08:00
Yu, Zijun
d4ca760da8
Add quant weight conversion functions from genai gguf reader
2026-01-15 11:19:15 -08:00
Yu, Zijun
56d596775d
Change openvino device_type to GPU; Enable flash_attn
2026-01-15 11:19:15 -08:00
Yu, Zijun
65e1b1af6d
Fix after rebasing
- Layout of cache k and cache v is unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT; flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid future test-thread-safety failures
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
2026-01-15 11:19:15 -08:00
Yu, Zijun
a7b611bc93
Minor updates for raising PR
2026-01-15 11:19:15 -08:00
Yu, Zijun
f4123be967
Fix test-backend-ops
2026-01-15 11:19:15 -08:00
Yu, Zijun
839f8c66a0
Remove CPY
2026-01-15 11:19:15 -08:00
Yu, Zijun
7bda5021f9
Fix NPU
2026-01-15 11:19:15 -08:00
Yu, Zijun
63d000ba40
Support op SET_ROWS
2026-01-15 11:19:15 -08:00
Yu, Zijun
9a91ca6ef9
Optimize tensor conversion, improve TTFT
2026-01-15 11:19:15 -08:00
Yu, Zijun
fc865340d5
Fix test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
01cdf4a9cc
matmul in fp32
2026-01-15 10:26:28 -08:00
Yu, Zijun
4e7f04a307
Fix llama-perplexity
2026-01-15 10:26:28 -08:00
Yu, Zijun
75eec6265f
Fix llama-bench; Clang-format
2026-01-15 10:26:28 -08:00
Yu, Zijun
6dc4b90635
Fix NPU
2026-01-15 10:26:28 -08:00
Yu, Zijun
44f4cf34b1
Fix Phi3 ROPE; Add test-backend-ops
2026-01-15 10:26:28 -08:00
Yu, Zijun
1ed49bbfaf
Fix llama-cli
2026-01-15 10:26:28 -08:00
Yu, Zijun
f3c0519096
Reduce memory: free ov weights node after graph conversion
2026-01-15 10:20:18 -08:00
Yu, Zijun
a80da69448
Pull out sin cos from rope
2026-01-15 10:20:18 -08:00
Yu, Zijun
0fa7a5efef
Refactor: remove past_token_len from extra_inputs
2026-01-15 10:20:18 -08:00
Yu, Zijun
bf5414c95e
Replace Concat with Broadcast in MulMat for GQA
2026-01-15 10:20:18 -08:00
Yu, Zijun
ebc4fc9f95
Fuse to SDPA
2026-01-15 10:20:18 -08:00
Yu, Zijun
73ee84fffe
Add SwiGLU
2026-01-15 10:20:18 -08:00
Yu, Zijun
4c582ac7a3
Stateful transformation for CPU and GPU
2026-01-15 10:20:18 -08:00
Yu, Zijun
593484ce5f
Refactor: clean, fix warning
2026-01-15 10:20:18 -08:00
Yu, Zijun
42d4240937
Change due to ggml cgraph changes, all devices work
2026-01-15 10:20:18 -08:00
Yu, Zijun
592d7f8bbb
Change due to ggml cgraph changes, llama-3.2 works on CPU
2026-01-15 10:20:18 -08:00
Yu, Zijun
f7ad77930e
Change due to ggml cgraph changes, not correct yet
2026-01-15 10:20:18 -08:00
Yu, Zijun
d9ca8f5dbe
NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
34531abce4
draft NPU support version 2: prefill + kvcache
2026-01-15 10:20:18 -08:00
Yu, Zijun
7fec223334
Add initial NPU support
2026-01-15 10:20:18 -08:00
Yu, Zijun
8ac5c225aa
FIX: set_max_token_len
2026-01-15 10:20:18 -08:00
Yu, Zijun
a30dc6e726
PERF: add weight constant in parallel
2026-01-15 10:20:18 -08:00
Yu, Zijun
c57f61494a
FIX: input shape of KQ_mask
2026-01-15 10:20:18 -08:00
Yu, Zijun
041d220dfa
FIX: Re-add tensor names in cgraph, Add another case for RESHAPE
2026-01-15 10:20:13 -08:00
Zijun Yu
4c905b2b25
fix build error
2026-01-15 10:10:00 -08:00
Viraj Wadhwa
ffabe95e2a
Rebase - Bring up to date and fix build process
2026-01-15 10:09:23 -08:00
Yu, Zijun
a8e5efa44e
PERF: compile once (dynamic graph + cache)
2026-01-15 10:05:41 -08:00
Yu, Zijun
7d5e234254
FEAT: improve debug capability
2026-01-15 10:05:41 -08:00
Yu, Zijun
d3bdca25bd
PERF: share const nodes for weights for diff infer
2026-01-15 10:05:41 -08:00
Yu, Zijun
96ba47dd43
STYLE: minor refactor
2026-01-15 10:05:41 -08:00
Yu, Zijun
c04966cda6
REFACTOR: support weights as constants
2026-01-15 10:05:41 -08:00