* Make graph_max_nodes vary by ubatch size for models where chunking might explode the graph
* Update src/llama-context.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add missing const
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llama : remove quantization sanity check
This commit removes the quantization sanity check for attention layers.
The motivation for this is that there are hybrid models that combine
recurrent layers, expert layers, and attention layers. For these models
the current check fails because the expert layers are not taken into
account. After consideration, it was decided that this check is not
strictly necessary and can be removed to allow for more flexible
model architectures.
* llama : remove unused pruned_attention_w and is_clip_model vars
Add nosubs|optimize flags to std::regex constructors to prevent
catastrophic backtracking when processing prompts with repeated
identical characters (e.g., 'A' * 10000).
The nosubs flag disables subgroup capture, significantly reducing
memory usage and backtracking on uniform token sequences.
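As a rough illustration of the flag change described above, the sketch below constructs a `std::regex` with `nosubs` and `optimize` added to the default ECMAScript grammar. The helper names `make_split_regex` and `count_matches` are hypothetical and do not come from the code base; only the `std::regex_constants` flags are standard.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical helper (not from the code base): builds a regex with the
// nosubs and optimize flags added to the default ECMAScript grammar.
static std::regex make_split_regex(const std::string & pattern) {
    return std::regex(pattern,
        std::regex_constants::ECMAScript |
        std::regex_constants::nosubs     |   // do not store sub-expression matches
        std::regex_constants::optimize);     // favor matching speed over construction speed
}

// Hypothetical helper: counts non-overlapping matches of `re` in `text`.
static size_t count_matches(const std::regex & re, const std::string & text) {
    auto first = std::sregex_iterator(text.begin(), text.end(), re);
    auto last  = std::sregex_iterator();
    return static_cast<size_t>(std::distance(first, last));
}
```

Whole-match results are still available with `nosubs`; only the per-group sub-matches are dropped, so this fits call sites that just need match positions.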
This commit modifies the graph building logic to immediately continue
when a token has already been sampled by the backend sampler.
It also updates the test for backend temporary sampling to include
top-k and distribution samplers in the chain to verify that they are not
producing any logits (they are not run).
This commit fixes the implementation of the temperature-based sampler
for the case when the temperature is set to zero. This now correctly
selects the most probable token by masking out all other tokens in the
logits.
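The zero-temperature behavior described above could be sketched roughly as follows. This is a plain CPU illustration on a `std::vector<float>`, not the backend graph implementation; the function name `mask_all_but_argmax` is hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <limits>
#include <vector>

// Hypothetical helper: emulates temperature == 0 by keeping only the largest
// logit and masking every other entry to -infinity, so that a subsequent
// softmax puts all probability mass on the most probable token.
static void mask_all_but_argmax(std::vector<float> & logits) {
    if (logits.empty()) {
        return;
    }
    const auto   best_it = std::max_element(logits.begin(), logits.end());
    const size_t best    = static_cast<size_t>(std::distance(logits.begin(), best_it));
    for (size_t i = 0; i < logits.size(); ++i) {
        if (i != best) {
            logits[i] = -std::numeric_limits<float>::infinity();
        }
    }
}
```

On ties, `std::max_element` returns the first maximum, so this sketch keeps the lowest-index token.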
In addition to matching the algorithm proposed in the original
[paper](https://arxiv.org/abs/1904.09751), this resolves the edge case
where `max_p > top_p` for a single logit, where the mask would
otherwise be empty (and we would thus sample from the whole vocabulary
with equal likelihood).
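A minimal sketch of the edge case, assuming already-normalized probabilities on the CPU: the token that crosses the `top_p` threshold is always kept, so the kept set is never empty even when a single probability exceeds `top_p`. The function name `top_p_keep` is hypothetical and this is not the backend implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: returns the indices kept by top-p (nucleus) filtering.
// `probs` is assumed to be normalized. The token that pushes the cumulative
// probability to or past `top_p` is included in the result.
static std::vector<size_t> top_p_keep(const std::vector<float> & probs, float top_p) {
    std::vector<size_t> order(probs.size());
    std::iota(order.begin(), order.end(), size_t(0));
    // Sort indices by descending probability; stable to keep ties deterministic.
    std::stable_sort(order.begin(), order.end(),
        [&probs](size_t a, size_t b) { return probs[a] > probs[b]; });

    std::vector<size_t> keep;
    float cum = 0.0f;
    for (size_t idx : order) {
        keep.push_back(idx);   // keep first, then test the threshold
        cum += probs[idx];
        if (cum >= top_p) {
            break;
        }
    }
    return keep;
}
```

With probabilities `{0.03, 0.95, 0.02}` and `top_p = 0.9`, the first candidate alone exceeds the threshold and is the only index kept, rather than producing an empty mask.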
This commit changes the logging level for output buffer reallocations
in the llama_context::output_reserve function from INFO to DEBUG.
The motivation for this is that the message is currently logged at INFO
level, and when verbose logging is enabled for llama-cli it gets mixed
with the output, for example:
```console
What is the capital of Sweden?output_reserve: reallocating output buffer from size 0.58 MiB to 1.74 MiB
1. Stockholm
2. Helsinki
Based are the options
1. Stockholm
Explanation: Stockholm is the capital of
...
```
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>