# examples.openai: OpenAI API-compatible server

A simple Python server that sits on top of the C++ [server](../server) and offers improved OpenAI API compatibility.

## Usage

```bash
python -m examples.openai -m some-model.gguf
```
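
Once the server is up it can be queried with any OpenAI-compatible client. Below is a minimal sketch using the official `openai` Python package; the base URL / port and the placeholder API key are assumptions for illustration, adjust them to whatever your instance actually listens on.

```python
# Minimal sketch, assuming the server exposes an OpenAI-compatible /v1
# endpoint on localhost:8080 (the port is an assumption; adjust as needed).
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

response = client.chat.completions.create(
    model="some-model.gguf",
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)
print(response.choices[0].message.content)
```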


## Features

The new `examples/openai/server.py`:

- Uses the llama.cpp C++ server as a backend (spawns it, or connects to an existing one)
- Uses the actual Jinja2 chat templates read from the models
- Supports grammar-constrained output for both the JSON response format and tool calls (see the sketch after this list)
- Tool calling “works” w/ all models (even non-specialized ones like Mixtral 8x7B)
  - Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling fine-tunes
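
For illustration, here is a rough sketch of a tool-call request against this server with the `openai` Python client. The request shape follows the standard OpenAI chat-completions tool-calling schema; the endpoint URL, port, and the `get_weather` function are assumptions made for this example, not part of this repo.

```python
# Hedged sketch of a tool-call request, assuming the server mirrors the
# OpenAI chat-completions tool-calling schema. The URL/port and the
# get_weather tool are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-needed")

tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="some-model.gguf",
    messages=[{"role": "user", "content": "What's the weather in Paris?"}],
    tools=tools,
)
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, json.loads(call.function.arguments))
```

A grammar-constrained JSON response can be requested in the same way by passing `response_format={"type": "json_object"}` to the call, as in the OpenAI API.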

## TODO

- Embedding endpoint w/ distinct server subprocess
- Automatic/manual session caching
  - Spawns the main C++ CLI under the hood
  - Support precaching long prompts from the CLI
  - Instant incremental inference in long threads
- Improve examples/agent:
  - Interactive agent CLI that auto-discovers tools from OpenAPI endpoints
  - Script that wraps any Python source as a container-sandboxed OpenAPI endpoint (allowing running ~unsafe code w/ tools)
  - Basic memory / RAG / Python interpreter tools

- Follow-ups
  - Remove OAI support from the C++ server
  - Remove the non-Python JSON-schema-to-grammar converters
  - Reach out to frameworks to advertise the new option.