llama.cpp/examples/openai/README.md

# examples.openai: OpenAI API-compatible server + agent / tools examples

A simple Python server that sits above the C++ [../server](examples/server) and offers improved OAI compatibility.

## Usage

Run a simple test:

```bash
# Spawns a Python server (which spawns a C++ Server) then hits it w/ a tool-calling request
examples/openai/test.sh
```

To simply run the Python server (+ C++ server under the hood):

```bash
python -m examples.openai
```

## Tools usage (WIP)

```bash
git clone https://github.com/NousResearch/Hermes-Function-Calling examples/openai/hermes_function_calling
```

Then edit `examples/agents/hermes_function_calling/utils.py`:

```py
log_folder = os.environ.get('LOG_FOLDER', os.path.join(script_dir, "inference_logs"))
```

Then run tools in a sandbox:

```bash
REQUIREMENTS_FILE=<( cat examples/agents/hermes_function_calling/requirements.txt | grep -vE "bitsandbytes|flash-attn" ) \
  examples/agents/run_sandboxed_tools.sh \
    examples/agents/hermes_function_calling/functions.py \
    -e LOG_FOLDER=/data/inference_logs
```

TODO: reactor that reads OpenAPI definitions and does the tool calling

## Features

The new examples/openai/server.py:

- Uses llama.cpp C++ server as a backend (spawns it or connects to existing)

- Uses actual jinja2 chat templates read from the models

- Supports grammar-constrained output for both JSON response format and tool calls

- Tool calling “works” w/ all models (even non-specialized ones like Mixtral 7x8B)

    - Optimised support for Functionary & Nous Hermes, easy to extend to other tool-calling fine-tunes

## TODO

- Embedding endpoint w/ distinct server subprocess

- Automatic/manual session caching

    - Spawns the main C++ CLI under the hood

    - Support precaching long prompts from CLI

    - Instant incremental inference in long threads

- Improve examples/agent:

    - Interactive agent CLI that auto-discovers tools from OpenAPI endpoints

    - Script that wraps any Python source as a container-sandboxed OpenAPI endpoint (allowing running ~unsafe code w/ tools)

    - Basic memory / RAG / python interpreter tools

- Follow-ups

    - Remove OAI support from server

    - Remove non-Python json schema to grammar converters

    - Reach out to frameworks to advertise new option.