llama.cpp + Qwen3-Omni

A llama.cpp fork with support for the Qwen3-Omni multimodal architecture.

Quick Start

# Clone and build
git clone https://github.com/phnxsystms/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j
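
The CUDA flag is optional; without an NVIDIA GPU the same steps produce a CPU-only build (standard llama.cpp CMake options, assumed unchanged in this fork):

# CPU-only build (no CUDA toolkit required)
cmake -B build
cmake --build build -j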

# Download models
huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF --local-dir models/

Spin up an OpenAI-compatible API server:

./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99

Then query it with any OpenAI-compatible client, for example curl:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'

CLI Usage

# Text
./build/bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99

# Vision
./build/bin/llama-mtmd-cli \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --image photo.jpg \
    -p "Describe this image"

Multi-GPU / Distributed

The Q8_0 model is 31 GB, which is more than a single consumer GPU can hold. For multi-GPU or distributed inference over RPC:

# Distributed: start an RPC server on each worker machine
./build/bin/rpc-server --host 0.0.0.0 --port 50052

# Main: connect to workers
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --rpc worker1:50052,worker2:50052 \
    -ngl 99
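
In upstream llama.cpp the RPC backend is opt-in at build time, and splitting layers across local GPUs is controlled with --split-mode / --tensor-split; assuming this fork keeps those options:

# Build with the RPC backend enabled (needed on the workers and the main host)
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=ON
cmake --build build -j

# Single machine with two GPUs: split layers evenly across both
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    -ngl 99 --split-mode layer --tensor-split 1,1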

Models

File                                   Size
qwen3-omni-30B-Q8_0.gguf               31 GB
mmproj-qwen3-omni-30B-F16-fixed.gguf   2.3 GB
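
Individual files can also be fetched by name with huggingface-cli, for example just the projector (same repo and filenames as above):

huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF \
    mmproj-qwen3-omni-30B-F16-fixed.gguf --local-dir models/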

Status

  • ✅ Text inference
  • ✅ Vision inference
  • 🚧 Audio (WIP)

License

MIT