Adds support for Qwen3-Omni, Alibaba's multimodal LLM, covering text and vision.
This PR enables the main LLM architecture and the vision encoder.

Main LLM changes:
- Add LLM_ARCH_QWEN3OMNI enum and architecture registration (see the sketch after this list)
- Add hparams loading for MoE-based architecture (48 layers, 128 experts)
- Reuse llm_build_qwen3moe graph builder
- Add IMROPE rope type for multimodal position encoding
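
For reference, a minimal self-contained sketch of the wiring pattern described above. It is not the actual diff; apart from LLM_ARCH_QWEN3OMNI, the "qwen3omni" name, and the IMROPE rope type, the symbols here are illustrative stand-ins for the real llama-arch/llama-model code.

```cpp
// Sketch only: new architecture enum value, name registration, and selecting
// the IMROPE rope type for Qwen3-Omni while keeping the qwen3moe MoE
// hparams/graph (48 layers, 128 experts in the 30B GGUF).
#include <cstdio>
#include <map>
#include <string>

enum llm_arch  { LLM_ARCH_QWEN3MOE, LLM_ARCH_QWEN3OMNI };
enum rope_type { ROPE_TYPE_NEOX, ROPE_TYPE_IMROPE };

// architecture registration: enum value -> GGUF architecture string
static const std::map<llm_arch, std::string> arch_names = {
    { LLM_ARCH_QWEN3MOE,  "qwen3moe"  },
    { LLM_ARCH_QWEN3OMNI, "qwen3omni" },
};

// Qwen3-Omni reuses the qwen3moe graph builder but needs IMROPE positions
static rope_type rope_type_for(llm_arch arch) {
    return arch == LLM_ARCH_QWEN3OMNI ? ROPE_TYPE_IMROPE : ROPE_TYPE_NEOX;
}

int main() {
    const llm_arch arch = LLM_ARCH_QWEN3OMNI;
    std::printf("arch=%s rope=%s\n",
                arch_names.at(arch).c_str(),
                rope_type_for(arch) == ROPE_TYPE_IMROPE ? "imrope" : "neox");
    return 0;
}
```
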
Vision encoder changes (via mtmd):
- Add PROJECTOR_TYPE_QWEN3O with auto-conversion to QWEN3VL for vision (see the sketch after this list)
- Support different embedding dimensions (vision=8192, audio=2048)
- Add separate Q/K/V tensor support in the qwen3vl graph builder
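
A similarly hedged sketch of the projector handling: the GGUF declares the qwen3o projector, but the vision tower is graph-compatible with Qwen3-VL, so it is remapped at load time. The resolve_projector helper and field names below are hypothetical; only the two projector types and the 8192/2048 embedding dims come from this change, and the separate Q/K/V tensor path in the qwen3vl builder is not shown.

```cpp
// Sketch only: remap the declared projector to the existing QWEN3VL vision
// graph and pick the per-modality embedding dimension.
#include <cstdio>

enum projector_type { PROJECTOR_TYPE_QWEN3VL, PROJECTOR_TYPE_QWEN3O };

struct proj_config {
    projector_type type;   // which graph builder to run
    int            n_embd; // embedding dimension fed to the LLM
};

// hypothetical helper: resolve the effective projector for one modality
static proj_config resolve_projector(projector_type declared, bool is_vision) {
    proj_config cfg;
    cfg.n_embd = is_vision ? 8192 : 2048;   // vision vs audio embedding dims
    cfg.type   = declared;
    if (declared == PROJECTOR_TYPE_QWEN3O && is_vision) {
        // reuse the existing QWEN3VL vision graph instead of duplicating it
        cfg.type = PROJECTOR_TYPE_QWEN3VL;
    }
    return cfg;
}

int main() {
    const proj_config cfg = resolve_projector(PROJECTOR_TYPE_QWEN3O, /*is_vision=*/true);
    std::printf("proj=%d n_embd=%d\n", (int) cfg.type, cfg.n_embd);
    return 0;
}
```
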
Tested with Qwen3-Omni-30B-Q8_0.gguf on a distributed 5-GPU setup:
- 41-44 tokens/sec inference speed
- Text and vision inference working

Note: Audio encoder support is WIP and will follow in a separate PR.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>