# llama.cpp + Qwen3-Omni

Fork of llama.cpp with Qwen3-Omni multimodal architecture support.

## Quick Start

```bash
# Clone and build
git clone https://github.com/phnxsystms/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Download models
huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF --local-dir models/
```
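The same download also works from Python via the `huggingface_hub` package — a minimal sketch, assuming `huggingface_hub` is installed (the repo id is the one from the CLI command above):

```python
# Sketch: download the GGUF files with huggingface_hub instead of the CLI.
# Assumes `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF",
    local_dir="models/",
)
```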

Spin up an OpenAI-compatible API server:

```bash
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99
```

Then use it:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```

## CLI Usage

```bash
# Text
./build/bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99

# Vision
./build/bin/llama-mtmd-cli \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --image photo.jpg \
    -p "Describe this image"
```

## Multi-GPU / Distributed

The Q8_0 model is 31 GB. For multi-GPU or distributed inference:

```bash
# Distributed: start an RPC server on each worker machine
./build/bin/llama-rpc-server --host 0.0.0.0 --port 50052

# Main: connect to the workers
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --rpc worker1:50052,worker2:50052 \
    -ngl 99
```

## Models

| File                                 | Size   |
|--------------------------------------|--------|
| qwen3-omni-30B-Q8_0.gguf             | 31 GB  |
| mmproj-qwen3-omni-30B-F16-fixed.gguf | 2.3 GB |

## Status

- ✅ Text inference
- ✅ Vision inference
- 🚧 Audio (WIP)

## License

MIT