From 21603f86dd599325fdc933e346f45cfbd5afaefb Mon Sep 17 00:00:00 2001
From: Aaron Teo
Date: Sat, 21 Mar 2026 15:42:50 +0800
Subject: [PATCH] docs: update docs again via `llama-gen-docs`

Signed-off-by: Aaron Teo
---
 tools/cli/README.md        | 4 ++--
 tools/completion/README.md | 4 ++--
 tools/server/README.md     | 6 +++---
 3 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/tools/cli/README.md b/tools/cli/README.md
index 139d5ab5b6..cdfa21e17a 100644
--- a/tools/cli/README.md
+++ b/tools/cli/README.md
@@ -59,7 +59,7 @@
 | `--mlock` | DEPRECATED: force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK) |
 | `--mmap, --no-mmap` | DEPRECATED: whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
(env: LLAMA_ARG_MMAP) |
 | `-dio, --direct-io, -ndio, --no-direct-io` | DEPRECATED: use DirectIO if available
(env: LLAMA_ARG_DIO) |
-| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- mlock: force system to keep model in RAM rather than swapping or compressing.
- mmap: memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- dio: use DirectIO if available.

(env: LLAMA_ARG_LOAD_MODE) |
+| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- none: no special loading mode
- mmap: memory-map model (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- mlock: force system to keep model in RAM rather than swapping or compressing
- dio: use DirectIO if available

(env: LLAMA_ARG_LOAD_MODE) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA) |
 | `-dev, --device ` | comma-separated list of devices to use for offloading (none = don't offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE) |
 | `--list-devices` | print list of available devices and exit |
@@ -135,7 +135,7 @@
 | `--mirostat-lr N` | Mirostat learning rate, parameter eta (default: 0.10) |
 | `--mirostat-ent N` | Mirostat target entropy, parameter tau (default: 5.00) |
 | `-l, --logit-bias TOKEN_ID(+/-)BIAS` | modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' |
-| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
+| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) |
 | `--grammar-file FNAME` | file to read grammar from |
 | `-j, --json-schema SCHEMA` | JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
 | `-jf, --json-schema-file FILE` | File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
diff --git a/tools/completion/README.md b/tools/completion/README.md
index 9a9cd4287e..73c58703e5 100644
--- a/tools/completion/README.md
+++ b/tools/completion/README.md
@@ -142,7 +142,7 @@ llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
 | `--mlock` | DEPRECATED: force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK) |
 | `--mmap, --no-mmap` | DEPRECATED: whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
(env: LLAMA_ARG_MMAP) |
 | `-dio, --direct-io, -ndio, --no-direct-io` | DEPRECATED: use DirectIO if available
(env: LLAMA_ARG_DIO) |
-| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- mlock: force system to keep model in RAM rather than swapping or compressing.
- mmap: memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- dio: use DirectIO if available.

(env: LLAMA_ARG_LOAD_MODE) |
+| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- none: no special loading mode
- mmap: memory-map model (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- mlock: force system to keep model in RAM rather than swapping or compressing
- dio: use DirectIO if available

(env: LLAMA_ARG_LOAD_MODE) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA) |
 | `-dev, --device ` | comma-separated list of devices to use for offloading (none = don't offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE) |
 | `--list-devices` | print list of available devices and exit |
@@ -218,7 +218,7 @@ llama-completion.exe -m models\gemma-1.1-7b-it.Q4_K_M.gguf --ignore-eos -n -1
 | `--mirostat-lr N` | Mirostat learning rate, parameter eta (default: 0.10) |
 | `--mirostat-ent N` | Mirostat target entropy, parameter tau (default: 5.00) |
 | `-l, --logit-bias TOKEN_ID(+/-)BIAS` | modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' |
-| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
+| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) |
 | `--grammar-file FNAME` | file to read grammar from |
 | `-j, --json-schema SCHEMA` | JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
 | `-jf, --json-schema-file FILE` | File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
diff --git a/tools/server/README.md b/tools/server/README.md
index 12cbdbde79..a851d6eae3 100644
--- a/tools/server/README.md
+++ b/tools/server/README.md
@@ -76,7 +76,7 @@ For the full list of features, please refer to [server's changelog](https://gith
 | `--mlock` | DEPRECATED: force system to keep model in RAM rather than swapping or compressing
(env: LLAMA_ARG_MLOCK) |
 | `--mmap, --no-mmap` | DEPRECATED: whether to memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
(env: LLAMA_ARG_MMAP) |
 | `-dio, --direct-io, -ndio, --no-direct-io` | DEPRECATED: use DirectIO if available
(env: LLAMA_ARG_DIO) |
-| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- mlock: force system to keep model in RAM rather than swapping or compressing.
- mmap: memory-map model. (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- dio: use DirectIO if available.

(env: LLAMA_ARG_LOAD_MODE) |
+| `-lm, --load-mode MODE` | model loading mode (default: mmap)
- none: no special loading mode
- mmap: memory-map model (if mmap disabled, slower load but may reduce pageouts if not using mlock)
- mlock: force system to keep model in RAM rather than swapping or compressing
- dio: use DirectIO if available

(env: LLAMA_ARG_LOAD_MODE) |
 | `--numa TYPE` | attempt optimizations that help on some NUMA systems
- distribute: spread execution evenly over all nodes
- isolate: only spawn threads on CPUs on the node that execution started on
- numactl: use the CPU map provided by numactl
if run without this previously, it is recommended to drop the system page cache before using this
see https://github.com/ggml-org/llama.cpp/issues/1437
(env: LLAMA_ARG_NUMA) |
 | `-dev, --device ` | comma-separated list of devices to use for offloading (none = don't offload)
use --list-devices to see a list of available devices
(env: LLAMA_ARG_DEVICE) |
 | `--list-devices` | print list of available devices and exit |
@@ -152,7 +152,7 @@ For the full list of features, please refer to [server's changelog](https://gith
 | `--mirostat-lr N` | Mirostat learning rate, parameter eta (default: 0.10) |
 | `--mirostat-ent N` | Mirostat target entropy, parameter tau (default: 5.00) |
 | `-l, --logit-bias TOKEN_ID(+/-)BIAS` | modifies the likelihood of token appearing in the completion,
i.e. `--logit-bias 15043+1` to increase likelihood of token ' Hello',
or `--logit-bias 15043-1` to decrease likelihood of token ' Hello' |
-| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) (default: '') |
+| `--grammar GRAMMAR` | BNF-like grammar to constrain generations (see samples in grammars/ dir) |
 | `--grammar-file FNAME` | file to read grammar from |
 | `-j, --json-schema SCHEMA` | JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
 | `-jf, --json-schema-file FILE` | File containing a JSON schema to constrain generations (https://json-schema.org/), e.g. `{}` for any JSON object
For schemas w/ external $refs, use --grammar + example/json_schema_to_grammar.py instead |
@@ -238,7 +238,7 @@ For the full list of features, please refer to [server's changelog](https://gith
 | `-ngld, --gpu-layers-draft, --n-gpu-layers-draft N` | max. number of draft model layers to store in VRAM, either an exact number, 'auto', or 'all' (default: auto)
(env: LLAMA_ARG_N_GPU_LAYERS_DRAFT) |
 | `-md, --model-draft FNAME` | draft model for speculative decoding (default: unused)
(env: LLAMA_ARG_MODEL_DRAFT) |
 | `--spec-replace TARGET DRAFT` | translate the string in TARGET into DRAFT if the draft model and main model are not compatible |
-| `--spec-type [none\|ngram-cache\|ngram-simple\|ngram-map-k\|ngram-map-k4v\|ngram-mod]` | type of speculative decoding to use when no draft model is provided (default: none) |
+| `--spec-type [none\|ngram-cache\|ngram-simple\|ngram-map-k\|ngram-map-k4v\|ngram-mod]` | type of speculative decoding to use when no draft model is provided (default: none)

(env: LLAMA_ARG_SPEC_TYPE) |
 | `--spec-ngram-size-n N` | ngram size N for ngram-simple/ngram-map speculative decoding, length of lookup n-gram (default: 12) |
 | `--spec-ngram-size-m N` | ngram size M for ngram-simple/ngram-map speculative decoding, length of draft m-gram (default: 48) |
 | `--spec-ngram-min-hits N` | minimum hits for ngram-map speculative decoding (default: 1) |