Add dequantize4() implementations for Q4_1, Q5_0, Q5_1, and IQ4_NL
in the flash attention base shader. Register them in the shader
generator, pipeline creation, and enable in the scalar/coopmat1 FA
support check.
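
For orientation, the arithmetic behind a Q4_1 dequantize4() can be sketched on the CPU side as below. This is only an illustrative sketch, not the shader code from this change: `BlockQ4_1` and `dequantize4_q4_1` are hypothetical names, the scale and minimum are held as plain floats (ggml stores them as fp16), and the nibble layout assumed is the one used by ggml's reference row dequantizer (low nibbles hold the first 16 weights, high nibbles the last 16).

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t QK4_1 = 32; // weights per block

// Hypothetical, simplified Q4_1 block (assumption: float instead of fp16).
struct BlockQ4_1 {
    float d;                                 // scale
    float m;                                 // minimum (offset)
    std::array<std::uint8_t, QK4_1 / 2> qs;  // two 4-bit codes per byte
};

// Dequantize four consecutive weights i..i+3 of one block.
// Weight j < 16 lives in the low nibble of qs[j];
// weight j >= 16 lives in the high nibble of qs[j - 16].
std::array<float, 4> dequantize4_q4_1(const BlockQ4_1 & b, std::size_t i) {
    assert(i % 4 == 0 && i + 4 <= QK4_1);
    std::array<float, 4> out{};
    for (std::size_t k = 0; k < 4; ++k) {
        const std::size_t  j    = i + k;
        const std::uint8_t byte = b.qs[j % (QK4_1 / 2)];
        const std::uint8_t q    = (j < QK4_1 / 2) ? (byte & 0x0F) : (byte >> 4);
        out[k] = b.d * static_cast<float>(q) + b.m; // value = scale * code + min
    }
    return out;
}
```

Q5_1 follows the same scale-plus-minimum form with an extra high bit per weight, while Q5_0 has no minimum term and IQ4_NL maps each 4-bit code through a lookup table instead of a linear formula.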
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Ensure bidirectional text support for mixed Arabic/English content
* Clean up commented duplicate function
Remove the commented-out duplicate transformMdastNode function
that was left over from refactoring.
* Fix Arabic RTL text rendering in web UI
- Add dir='auto' attributes to markdown containers and blocks
- Implement post-processing to add dir='auto' to all text elements
- Replace directional CSS properties with logical properties for proper RTL list alignment
- Minor code formatting improvements
This ensures bidirectional text support for mixed Arabic/English content in the llama.cpp web UI.
* Implement rehype plugin for comprehensive RTL text support
- Add rehypeRtlSupport plugin that applies dir='auto' to all elements with children
- Replace DOMParser-based approach with efficient HAST tree processing
- Remove hardcoded element lists for better maintainability
- Ensure proper bidirectional text rendering for mixed RTL/LTR content
* Fix RTL text rendering with rehype plugin and cleanup
* fix: prettier formatting
Extend the existing reorder optimization to Q8_0. The reorder
separates scale factors from weight data for coalesced memory
access. It was implemented for Q4_0/Q4_K/Q6_K, but Q8_0 was missing.
On Arc Pro B70 (Xe2), Q8_0 tg goes from 4.88 to 15.24 t/s (3.1x)
on Qwen3.5-27B. BW utilization: 21% -> 66%.
The key fix beyond the kernels: Q8_0 was missing from the type
check in ggml_backend_sycl_buffer_init_tensor() that allocates
the extra struct carrying the reorder flag -- so the optimization
was silently skipped.
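
The layout change described above can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the SYCL kernel code: `BlockQ8_0`, `ReorderedQ8_0`, `reorder_q8_0`, and `dequantize_reordered` are hypothetical names, and the scale is a plain float here (ggml stores it as fp16). The sketch assumes the reordered buffer places all quantized weights first and all scales after them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr std::size_t QK8_0 = 32; // weights per block

// Hypothetical, simplified Q8_0 block in the original interleaved layout.
struct BlockQ8_0 {
    float       d;          // per-block scale (assumption: float, not fp16)
    std::int8_t qs[QK8_0];  // quantized weights
};

// Hypothetical reordered layout: weights and scales in separate arrays.
struct ReorderedQ8_0 {
    std::vector<std::int8_t> qs; // nblocks * QK8_0 contiguous weights
    std::vector<float>       d;  // nblocks contiguous scales
};

// Split interleaved blocks into contiguous weight and scale arrays so that
// neighbouring threads read neighbouring bytes.
ReorderedQ8_0 reorder_q8_0(const std::vector<BlockQ8_0> & blocks) {
    ReorderedQ8_0 out;
    out.qs.resize(blocks.size() * QK8_0);
    out.d.resize(blocks.size());
    for (std::size_t ib = 0; ib < blocks.size(); ++ib) {
        std::memcpy(out.qs.data() + ib * QK8_0, blocks[ib].qs, QK8_0);
        out.d[ib] = blocks[ib].d;
    }
    return out;
}

// Dequantize one weight from the reordered layout.
float dequantize_reordered(const ReorderedQ8_0 & r, std::size_t idx) {
    assert(idx < r.qs.size());
    return r.d[idx / QK8_0] * static_cast<float>(r.qs[idx]);
}
```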
AI (Claude) was used to assist with root cause investigation and
writing the kernel code. All code was human-reviewed and tested
on real hardware.
Fixes: #21517
* Write an optimized flash_attn_stream_k_fixup kernel
Write a specialized, more optimized kernel for cases where nblocks_stream_k is a multiple of ntiles_dst.
Make nblocks_stream_k a multiple of ntiles_dst if nblocks_stream_k > 2 * ntiles_dst
* Use the new kernel only for nblocks_stream_k_raw > 4 * ntiles_dst to make sure we have enough concurrency on GPUs
* Address review comments
* Address review comments
* Revert variable names to original
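
The block-count selection described in this series can be sketched as below. This is a hedged illustration only: `StreamKPlan` and `plan_stream_k` are hypothetical names, the 4x threshold is taken from the later commit in the list, and rounding down to the nearest multiple is an assumption, since the messages do not state the rounding direction.

```cpp
#include <cassert>

// Hypothetical result of the launch-configuration decision.
struct StreamKPlan {
    int  nblocks_stream_k;     // number of stream-k blocks to launch
    bool use_uniform_fixup;    // whether the specialized fixup kernel applies
};

// Pick the stream-k block count. When there are plenty of blocks relative to
// the number of destination tiles, trim the count to a multiple of ntiles_dst
// so every tile is covered by the same number of blocks.
StreamKPlan plan_stream_k(int nblocks_stream_k_raw, int ntiles_dst) {
    assert(nblocks_stream_k_raw > 0 && ntiles_dst > 0);
    if (nblocks_stream_k_raw > 4 * ntiles_dst) {
        // Assumption: round down to the nearest multiple of ntiles_dst.
        const int rounded = (nblocks_stream_k_raw / ntiles_dst) * ntiles_dst;
        return { rounded, true };
    }
    // Too few blocks: keep the raw count to preserve concurrency.
    return { nblocks_stream_k_raw, false };
}
```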
Check the return value of sink.write() in the chunked content provider
and return false when the write fails, matching cpp-httplib's own
streaming contract. This prevents logging chunks as sent when the sink
rejected them and properly aborts the stream on connection failure.
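
The pattern can be sketched as follows. To keep the example self-contained, `DataSink` here is a small stand-in that models only the `write` and `done` members of cpp-httplib's sink, and `send_next_chunk` with its parameters is a hypothetical helper rather than the server's actual provider.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Stand-in for cpp-httplib's DataSink (assumption: only these two members
// are modeled; the real struct has more).
struct DataSink {
    std::function<bool(const char * data, std::size_t len)> write;
    std::function<void()>                                   done;
};

// Hypothetical provider step: send one chunk per call.
// Returns false to abort the stream, true to keep going.
bool send_next_chunk(DataSink & sink,
                     const std::vector<std::string> & chunks,
                     std::size_t & next,
                     std::size_t & n_sent) {
    assert(sink.write && sink.done);
    if (next >= chunks.size()) {
        sink.done();          // nothing left: finish the response
        return true;
    }
    const std::string & chunk = chunks[next];
    if (!sink.write(chunk.data(), chunk.size())) {
        // The sink rejected the data (e.g. the client disconnected):
        // do not count or log the chunk as sent, and abort.
        return false;
    }
    ++next;
    ++n_sent;                 // only counted after a successful write
    return true;
}
```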
This PR changes the logging that occurs at startup of llama-server.
Currently, the output is redundant (CPU information is printed twice) and
it is missing the build and commit info.
* Work towards removing bitcast
* Move rest of existing types over
* Add timeout back to wait and remove synchronous set_tensor/memset_tensor
* move to unpackf16 for wider compatibility
* cleanup
* Remove deadlock condition in free_bufs
* Start work on removing parameter buffer pools
* Simplify and optimize further
* simplify profile futures
* Fix stride
* Try using a single command buffer per batch
* formatting
* experimenting CI
* Experimenting CI fix for MinGW
* experimenting CI on Windows
* modified script for integration with VisualStudio
* added proxy handling
* adding python version for Windows execution
* fix iterator::end() dereference
* fixed proxy handling
* Fix errors occurring on Windows
* fixed ci script
* Reverted to master
* Stripping test items to simplify Windows test
* adjusting script for windows testing
* Changed shell
* Fixed shell
* Fixed shell
* Fix CI setting
* Fix CI setting
* Fix CI setting
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* Experimenting ci fix
* experimenting fix for unit test error
* Changed to use BUILD_LOW_PERF to skip python tests
* Fix CI
* Added option to specify Ninja generator
* Reverted proxy related changes
* common : fix tool call type detection for nullable and enum schemas
* common, tests : fix grammar delegation for nullable/enum schemas and add tests
Fix enum type inference to scan all enum values (not just index 0) so
schemas like {"enum": [0, "celsius"]} correctly detect string type.
Fix schema_delegates in peg-parser to handle nullable type arrays
(["string", "null"]) and typeless enum schemas in raw mode, allowing
the tagged parser to use raw text instead of JSON-formatted strings.
Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
- nullable string ["string", "null"]
- nullable string with null first ["null", "string"]
- nullable integer ["integer", "null"]
- enum without explicit type key
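
The two inference rules described above can be sketched in isolation as below. This is an illustrative sketch, not the code in common or the peg-parser: `EnumValue`, `enum_has_string`, and `resolve_nullable_type` are hypothetical names, and a `std::variant` stands in for a JSON value so the example needs no JSON library.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

// Stand-in for a JSON scalar inside an "enum" array (assumption).
using EnumValue = std::variant<std::monostate, long, double, bool, std::string>;

// Scan every enum value, not just the first one: a schema such as
// {"enum": [0, "celsius"]} counts as string-typed if any value is a string.
bool enum_has_string(const std::vector<EnumValue> & values) {
    for (const auto & v : values) {
        if (std::holds_alternative<std::string>(v)) {
            return true;
        }
    }
    return false;
}

// Resolve a "type" array such as ["string", "null"] or ["null", "string"]
// to its single non-null type. Returns "" when the array is ambiguous
// (more than one distinct non-null type) or contains only "null".
std::string resolve_nullable_type(const std::vector<std::string> & types) {
    assert(!types.empty());
    std::string found;
    for (const auto & t : types) {
        if (t == "null") {
            continue;         // nullability does not change the base type
        }
        if (!found.empty() && found != t) {
            return "";        // e.g. ["string", "integer"]
        }
        found = t;
    }
    return found;
}
```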
The `HSA_OVERRIDE_GFX_VERSION` variable can be used in ROCm to override an unsupported target architecture with a similar but supported target architecture.
This does not work on Windows and never has. The clarification should keep Windows users from being steered towards a solution that does not work.
* ggml-zendnn : add MUL_MAT_ID op support for MoE models
- Add MUL_MAT_ID op acceleration for Mixture-of-Experts models
- Fall back to the CPU backend for MUL_MAT_ID if total experts > 32
- Point ZenDNN lib to the latest bits (ZenDNN-2026-WW13)
* ggml-zendnn : add braces to sgemm failure condition for consistency
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
* seems to work
* fix case with new line
Co-authored-by: sayap <sokann@gmail.com>
* gemma 4: fix pre tok regex
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: sayap <sokann@gmail.com>
Reuse the buffer for the ggml context which is used for creating the
compute graph on the server side. This partially addresses a memory leak
created by the CUDA backend due to using buffer addresses as cache
keys.
ref: #21265
ref: #20315
* ci : add AMD CPU label to PR labeler
Add automatic labeling for PRs that modify AMD CPU (ZenDNN) backend files
* ci : rename label AMD CPU to AMD ZenDNN in labeler config
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
---------
Co-authored-by: Aaron Teo <taronaeo@gmail.com>
Bump ROCm version on Linux from 7.2 to 7.2.1
Add gfx1102 target
Delete the LLVM workaround, since ROCm 7.2.1 has a fix for the ROCm 7.2 perf regression https://github.com/ROCm/rocm-systems/issues/2865
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>