llama.cpp/README.md

# llama.cpp + Qwen3-Omni

> **Fork with Qwen3-Omni multimodal architecture support**

[![Qwen3-Omni](https://img.shields.io/badge/Qwen3--Omni-Supported-green)](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF)
[![Models](https://img.shields.io/badge/GGUF%20Models-HuggingFace-yellow)](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF)

## What's Added

Support for **Qwen3-Omni**, Alibaba's multimodal LLM:

- `LLM_ARCH_QWEN3OMNI` - Main LLM architecture (MoE: 48 layers, 128 experts)
- `PROJECTOR_TYPE_QWEN3O` - Vision encoder support
- IMROPE position encoding for multimodal inputs

**Note:** Audio encoder support is WIP.

## Quick Start

```bash
# Clone this fork
git clone https://github.com/phnxsystms/llama.cpp.git
cd llama.cpp

# Build with CUDA
mkdir build && cd build
cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . -j

# Download models from HuggingFace
huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF --local-dir models/

# Text inference
./bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99

# Vision inference
./bin/llama-mtmd-cli \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --image your_image.jpg \
    -p "What's in this image?"
```

## Distributed Inference (RPC)

For large models, use llama.cpp's RPC backend to distribute across multiple machines:

```bash
# On worker nodes - start RPC server
./bin/llama-rpc-server --host 0.0.0.0 --port 50052

# On main node - connect to workers
./bin/llama-cli \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --rpc worker1:50052,worker2:50052 \
    -ngl 99 \
    -p "Hello!"
```

## Models

| Model | Size | Description |
|-------|------|-------------|
| [qwen3-omni-30B-Q8_0.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF/resolve/main/qwen3-omni-30B-Q8_0.gguf) | 31GB | Main LLM (Q8_0) |
| [mmproj-qwen3-omni-30B-F16-fixed.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF/resolve/main/mmproj-qwen3-omni-30B-F16-fixed.gguf) | 2.3GB | Vision projector (F16) |

## Performance

Tested on multi-GPU distributed setup:
- **41-44 tokens/sec** inference speed
- Text and vision inference working

## Files Changed

```
src/llama-arch.cpp      # Architecture registration
src/llama-model.cpp     # Model loading & graph building
tools/mtmd/clip.cpp     # Vision projector support
tools/mtmd/mtmd.cpp     # Multimodal pipeline
```

## License

MIT (same as llama.cpp)