Adds support for Qwen3-Omni, Alibaba's multimodal LLM that handles
text and vision. This enables the main LLM architecture and vision
encoder support.
Main LLM changes:
- Add LLM_ARCH_QWEN3OMNI enum and architecture registration
- Add hparams loading for MoE-based architecture (48 layers, 128 experts)
- Reuse llm_build_qwen3moe graph builder
- Add IMROPE type for multimodal position encoding
Vision encoder changes (via mtmd):
- Add PROJECTOR_TYPE_QWEN3O with auto-conversion to QWEN3VL for vision
- Support different embedding dimensions (vision=8192, audio=2048)
- Add separate Q/K/V tensor support in qwen3vl graph builder
Tested with Qwen3-Omni-30B-Q8_0.gguf on distributed 5-GPU setup:
- 41-44 tokens/sec inference speed
- Text and vision inference working
Note: Audio encoder support is WIP and will follow in a separate PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
* add count equal for metal
* remove trailing whitespace
* updated doc ops table
* changed shmem to i32
* added multi tg and templating
* removed BLAS support from Metal docs
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add memset to set dst to 0
* metal : cleanup
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* cmake: work around broken IntelSYCLConfig.cmake in oneAPI 2025.x
* [AI] sycl: auto-detect and skip incompatible IntelSYCL package
Automatically detect compiler versions with incompatible IntelSYCL
CMake configuration files and fall back to manual SYCL flags instead
of requiring users to set options manually.
Fixes build failures with oneAPI 2025.x where IntelSYCLConfig.cmake
has SYCL_FEATURE_TEST_EXTRACT invocation errors.
* refactor: improve SYCL provider handling and error messages in CMake configuration
* refactor: enhance SYCL provider validation and error handling in CMake configuration
* ggml-sycl: wrap find_package(IntelSYCL) to prevent build crashes
* sampling: reuse token data buffer in llama_sampler_sample
* move cur buffer before timing section, after samplers
* minor : fix build
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
This commit updates the causal model verification script to use the
CONVERTED_MODEL environment variable instead of using the MODEL_PATH
(the original model path) as the basis for the converted model file
name.
The motivation for this that currently if the converted model file name
differs from the original model directory/name the verification script
will look for the wrong .bin file that was generating when running
the converted model.
This similar to the change made for the embeddings models script in
Commit db81d5ec4b ("model-conversion :
use CONVERTED_EMBEDDING_MODEL for embedding_verify_logits (#18079)"),
but we also verify the embeddings of for causal models as well.
* Prevent crash if TTFT >300sec, boosted to 90 days
* server : allow configurable HTTP timeouts for child models
* server : pass needed timeouts from params only
---------
Co-authored-by: Greg Slocum <fromgit@wbtek.slocum.net>
* Fix `msg` typo
* Fix thread safety in destroy() to support generation abortion in lifecycle callbacks.
* UI polish: stack new message change from below; fix GGUF margin not in view port
* Bug fixes: rare racing condition when main thread updating view and and default thread updating messages at the same time; user input not disabled during generation.
* Bump dependencies' versions; Deprecated outdated dsl usage.
This commit refactors the original model embedding script to include a
device selection option. Users can now specify the device (cpu, cuda,
mps, auto) via command-line arguments. It also refactors the code to be
more structured.
This was disabled in #9340 due to compiler crash, but seems to build now as confirmed by the latest comments in #11913.
I've also managed to build the image with `docker build -f .devops/rocm.Dockerfile .` (for all three stages, `full`, `server` and `light`).
A quick attempt at trying to build an arm64 image failed. Since none of the other images are build for arm, I only enabled the amd64 one.
The `runs_on` option was added to match the other entries.
* vulkan: Use BK=32 for coopmat2 mul_mat_id
* vulkan: optimize decodeFuncB in coopmat2 mul_mat_id shader
Disable robustness, remove the OOB check in decodeFuncB, and initialize the
row_ids to zero to avoid OOB access.
Don't slice/offset the B matrix to ic * BN, only to adjust the coord back down
to the range [0, BN) in decodeFuncB. Instead just slice with a row offset of
zero and remove the '& (BN - 1)'. This allows the compiler to common some of
the shared memory loads.