From 650c8f31ae75f26d20973ec23f0ccbf7be75542c Mon Sep 17 00:00:00 2001
From: EliteGPT AI
Date: Wed, 31 Dec 2025 20:41:56 +1000
Subject: [PATCH] docs: simplify - server-first approach

---
 README.md | 110 +++++++++++++++++++++++++++---------------------------
 1 file changed, 54 insertions(+), 56 deletions(-)

diff --git a/README.md b/README.md
index 92dbccf9dd..9fe32c1433 100644
--- a/README.md
+++ b/README.md
@@ -2,83 +2,81 @@
 
 > **Fork with Qwen3-Omni multimodal architecture support**
 
-[![Qwen3-Omni](https://img.shields.io/badge/Qwen3--Omni-Supported-green)](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF)
-[![Models](https://img.shields.io/badge/GGUF%20Models-HuggingFace-yellow)](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF)
-
-## What's Added
-
-Support for **Qwen3-Omni**, Alibaba's multimodal LLM:
-
-- `LLM_ARCH_QWEN3OMNI` - Main LLM architecture (MoE: 48 layers, 128 experts)
-- `PROJECTOR_TYPE_QWEN3O` - Vision encoder support
-- IMROPE position encoding for multimodal inputs
-
-**Note:** Audio encoder support is WIP.
-
 ## Quick Start
 
 ```bash
-# Clone this fork
+# Clone and build
 git clone https://github.com/phnxsystms/llama.cpp.git
 cd llama.cpp
+cmake -B build -DGGML_CUDA=ON
+cmake --build build -j
 
-# Build with CUDA
-mkdir build && cd build
-cmake .. -DGGML_CUDA=ON -DCMAKE_BUILD_TYPE=Release
-cmake --build . -j
-
-# Download models from HuggingFace
+# Download models
 huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF --local-dir models/
-
-# Text inference
-./bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99
-
-# Vision inference
-./bin/llama-mtmd-cli \
-  -m models/qwen3-omni-30B-Q8_0.gguf \
-  --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
-  --image your_image.jpg \
-  -p "What's in this image?"
 ```
 
-## Distributed Inference (RPC)
+## Run Server (Recommended)
 
-For large models, use llama.cpp's RPC backend to distribute across multiple machines:
+Spin up an OpenAI-compatible API server:
 
 ```bash
-# On worker nodes - start RPC server
-./bin/llama-rpc-server --host 0.0.0.0 --port 50052
+./build/bin/llama-server \
+  -m models/qwen3-omni-30B-Q8_0.gguf \
+  --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
+  --host 0.0.0.0 \
+  --port 8080 \
+  -ngl 99
+```
 
-# On main node - connect to workers
-./bin/llama-cli \
+Then use it:
+
+```bash
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
+```
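+
+If your `llama-server` build includes multimodal (mtmd) support, images can go through the same endpoint as OpenAI-style `image_url` content parts. A minimal sketch, assuming the payload shape below matches your build and reusing the `photo.jpg` from the CLI example:
+
+```bash
+# Sketch: vision request over the OpenAI-compatible API (server must be
+# started with --mmproj as above). `base64 -w0` is GNU coreutils; on
+# macOS use `base64 -i photo.jpg | tr -d '\n'` instead.
+IMG=$(base64 -w0 photo.jpg)
+curl http://localhost:8080/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [{
+      "role": "user",
+      "content": [
+        {"type": "text", "text": "Describe this image"},
+        {"type": "image_url", "image_url": {"url": "data:image/jpeg;base64,'"$IMG"'"}}
+      ]
+    }]
+  }'
+```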
+
+## CLI Usage
+
+```bash
+# Text
+./build/bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99
+
+# Vision
+./build/bin/llama-mtmd-cli \
+  -m models/qwen3-omni-30B-Q8_0.gguf \
+  --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
+  --image photo.jpg \
+  -p "Describe this image"
+```
+
+## Multi-GPU / Distributed
+
+The model is 31 GB, so a single GPU may not hold it. For multi-GPU or distributed inference, use the RPC backend:
+
+```bash
+# Distributed: start the RPC server on each worker machine
+./build/bin/llama-rpc-server --host 0.0.0.0 --port 50052
+
+# Main node: connect to the workers
+./build/bin/llama-server \
   -m models/qwen3-omni-30B-Q8_0.gguf \
   --rpc worker1:50052,worker2:50052 \
-  -ngl 99 \
-  -p "Hello!"
+  -ngl 99
 ```
 
 ## Models
 
-| Model | Size | Description |
-|-------|------|-------------|
-| [qwen3-omni-30B-Q8_0.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF/resolve/main/qwen3-omni-30B-Q8_0.gguf) | 31GB | Main LLM (Q8_0) |
-| [mmproj-qwen3-omni-30B-F16-fixed.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF/resolve/main/mmproj-qwen3-omni-30B-F16-fixed.gguf) | 2.3GB | Vision projector (F16) |
+| File | Size |
+|------|------|
+| [qwen3-omni-30B-Q8_0.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF) | 31 GB |
+| [mmproj-qwen3-omni-30B-F16-fixed.gguf](https://huggingface.co/phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF) | 2.3 GB |
 
-## Performance
+## Status
 
-Tested on multi-GPU distributed setup:
-- **41-44 tokens/sec** inference speed
-- Text and vision inference working
-
-## Files Changed
-
-```
-src/llama-arch.cpp    # Architecture registration
-src/llama-model.cpp   # Model loading & graph building
-tools/mtmd/clip.cpp   # Vision projector support
-tools/mtmd/mtmd.cpp   # Multimodal pipeline
-```
+- ✅ Text inference
+- ✅ Vision inference
+- 🚧 Audio (WIP)
 
 ## License
 
-MIT (same as llama.cpp)
+MIT