Adds support for Qwen3-Omni, Alibaba's multimodal LLM that handles text and vision. This PR covers the main LLM architecture and vision encoder support.

Main LLM changes:
- Add `LLM_ARCH_QWEN3OMNI` enum and architecture registration
- Add hparams loading for the MoE-based architecture (48 layers, 128 experts)
- Reuse the `llm_build_qwen3moe` graph builder
- Add the IMROPE rope type for multimodal position encoding

Vision encoder changes (via mtmd):
- Add `PROJECTOR_TYPE_QWEN3O` with auto-conversion to `QWEN3VL` for vision
- Support different embedding dimensions (vision=8192, audio=2048)
- Add separate Q/K/V tensor support in the `qwen3vl` graph builder

Tested with Qwen3-Omni-30B-Q8_0.gguf on a distributed 5-GPU setup:
- 41-44 tokens/sec inference speed
- Text and vision inference working

Note: Audio encoder support is WIP and will follow in a separate PR.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
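For reference, a rough sketch of what the architecture registration and hparams reuse could look like on the llama.cpp side. This is illustrative only: the exact file layout, model-type constant (`LLM_TYPE_30B_A3B`), GGUF arch string, and the `llm_build_qwen3moe` constructor arguments are assumptions, not the actual diff.

```cpp
// Illustrative excerpts only -- follows the usual llama.cpp registration
// pattern; names and surrounding code may differ from the real patch.

// llama-arch.h: new architecture id next to the existing Qwen3 entries
enum llm_arch {
    // ...
    LLM_ARCH_QWEN3MOE,
    LLM_ARCH_QWEN3OMNI,   // Qwen3-Omni (text side)
    // ...
};

// llama-arch.cpp: map the id to the GGUF "general.architecture" string
static const std::map<llm_arch, const char *> LLM_ARCH_NAMES = {
    // ...
    { LLM_ARCH_QWEN3MOE,  "qwen3moe"  },
    { LLM_ARCH_QWEN3OMNI, "qwen3omni" },   // assumed GGUF arch name
    // ...
};

// llama-model.cpp, load_hparams(): the text model is the same 48-layer /
// 128-expert MoE as Qwen3-MoE, so the Omni case can share that branch.
case LLM_ARCH_QWEN3MOE:
case LLM_ARCH_QWEN3OMNI:
    {
        ml.get_key(LLM_KV_EXPERT_FEED_FORWARD_LENGTH, hparams.n_ff_exp, false);
        switch (hparams.n_layer) {
            case 48: type = LLM_TYPE_30B_A3B; break;   // Qwen3-Omni-30B (assumed type)
            default: type = LLM_TYPE_UNKNOWN;
        }
    } break;

// graph construction: reuse the existing Qwen3-MoE builder, as stated above
case LLM_ARCH_QWEN3MOE:
case LLM_ARCH_QWEN3OMNI:
    {
        llm = std::make_unique<llm_build_qwen3moe>(*this, params);
    } break;
```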
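Similarly, a minimal sketch of the projector-type auto-conversion on the mtmd side. The helper name `clip_projector_type_from_string`, the variable names, and the exact load-path hook are assumptions; the point is only that an mmproj GGUF declaring a `qwen3o` projector gets routed through the existing Qwen3-VL vision graph.

```cpp
// tools/mtmd/clip.cpp (sketch): the Omni mmproj GGUF declares a "qwen3o"
// projector, but its vision tower matches Qwen3-VL, so remap it at load time.
projector_type proj = clip_projector_type_from_string(proj_type_str);
if (proj == PROJECTOR_TYPE_QWEN3O) {
    // vision path is identical to Qwen3-VL; the audio branch will be wired up
    // in the follow-up PR mentioned above
    proj = PROJECTOR_TYPE_QWEN3VL;
}
```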