# llama.cpp + Qwen3-Omni

Fork of llama.cpp with Qwen3-Omni multimodal architecture support.

## Quick Start

```bash
# Clone and build
git clone https://github.com/phnxsystms/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build -j

# Download models
huggingface-cli download phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF --local-dir models/
```
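The same download also works from Python via the `huggingface_hub` package — a minimal sketch, assuming `huggingface_hub` is installed (the repo id is the one from the CLI command above):

```python
# Sketch: download the GGUF files with huggingface_hub instead of the CLI.
# Assumes `pip install huggingface_hub`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="phnxsystms/Qwen3-Omni-30B-A3B-Instruct-GGUF",
    local_dir="models/",
)
```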

Spin up an OpenAI-compatible API server:

```bash
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --host 0.0.0.0 \
    --port 8080 \
    -ngl 99
```

Then use it:

```bash
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"Hello!"}]}'
```

## CLI Usage

```bash
# Text
./build/bin/llama-cli -m models/qwen3-omni-30B-Q8_0.gguf -p "Hello!" -ngl 99

# Vision
./build/bin/llama-mtmd-cli \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --mmproj models/mmproj-qwen3-omni-30B-F16-fixed.gguf \
    --image photo.jpg \
    -p "Describe this image"
```

## Multi-GPU / Distributed

The Q8_0 model is 31 GB. For multi-GPU or distributed inference:

```bash
# Distributed: start an RPC server on each worker machine
./build/bin/llama-rpc-server --host 0.0.0.0 --port 50052

# Main: connect to the workers
./build/bin/llama-server \
    -m models/qwen3-omni-30B-Q8_0.gguf \
    --rpc worker1:50052,worker2:50052 \
    -ngl 99
```

## Models

| File                                 | Size   |
|--------------------------------------|--------|
| qwen3-omni-30B-Q8_0.gguf             | 31 GB  |
| mmproj-qwen3-omni-30B-F16-fixed.gguf | 2.3 GB |

## Status

- ✅ Text inference
- ✅ Vision inference
- 🚧 Audio (WIP)

## License

MIT