This commit adds support for performing sampling operations on the
backend (e.g. GPU) as part of the model computation graph.
The motivation is to allow some or all of the sampling to run directly on the
backend as part of the computation graph being executed.
For example, the backend sampler chain might select/sample a token
directly, in which case only the sampled token needs to be transferred
from device memory to host memory.
It is also possible for the backend samplers to filter the logits, or to
compute and filter the probability distribution, in which case only the
filtered logits or probabilities need to be transferred back to system
memory for further processing by CPU samplers.
Currently, backend sampling works in a similar manner to pooling: it is a
function called by build_graph, and the sampler operations become part of
the model's computation graph.
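To make the idea concrete, here is a minimal CPU-only sketch of sampling expressed as graph operations, using generic ggml ops (ggml_soft_max, ggml_argmax) rather than the actual build_graph integration; depending on the ggml version, ggml_graph_compute_with_ctx may be declared in ggml.h or ggml-cpu.h:

```cpp
// Illustrative sketch only: greedy sampling (softmax + argmax) expressed as
// graph ops, so that only the selected token id needs to be read back.
#include "ggml.h"

#include <cstdio>
#include <cstring>
#include <vector>

int main() {
    const int n_vocab = 8;

    struct ggml_init_params params = {
        /*.mem_size   =*/ 16*1024*1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // stand-in for the logits produced by the model for one position
    struct ggml_tensor * logits = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, n_vocab);
    const std::vector<float> data = { 0.1f, 2.0f, -1.0f, 0.5f, 3.0f, 0.0f, 1.5f, -2.0f };
    memcpy(logits->data, data.data(), data.size()*sizeof(float));

    // sampling as part of the graph: softmax followed by argmax (greedy)
    struct ggml_tensor * probs = ggml_soft_max(ctx, logits);
    struct ggml_tensor * token = ggml_argmax (ctx, probs);

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, token);
    ggml_graph_compute_with_ctx(ctx, gf, /*n_threads =*/ 1);

    // only a single int32 token id has to leave the backend
    printf("sampled token = %d\n", ((const int32_t *) token->data)[0]);

    ggml_free(ctx);
    return 0;
}
```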
* sampling : optimize sorting using bucket sort in more places
ggml-ci
* sampling : do not sort in dist sampler
ggml-ci
* sampling : avoid heap allocations for sort buffers
ggml-ci
* common : add option to sort sampling candidates by probability
ggml-ci
* sampling : revert the change for preserving sort buffers
* sampling : use std::copy instead of memcpy
* sampling : clarify purpose of partial sort helpers
ggml-ci
* cont : remove wrong comment [no ci]
* common : update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* initial porting of previous LLG patch
* update for new APIs
* build: integrate llguidance as an external project
* use '%llguidance' as marker to enable llg lark syntax
* add some docs
* clarify docs
* code style fixes
* remove llguidance.h from .gitignore
* fix tests when llg is enabled
* pass vocab not model to llama_sampler_init_llg()
* copy test-grammar-integration.cpp to test-llguidance.cpp
* clang fmt
* fix ref-count bug
* build and run test
* gbnf -> lark syntax
* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag
* rename llguidance test file to test-grammar-llguidance.cpp
* add gh action for llg test
* align tests with LLG grammar syntax and JSON Schema spec
* llama_tokenizer() in fact requires valid utf8
* update llg
* format file
* add $LLGUIDANCE_LOG_LEVEL support
* fix whitespace
* fix warning
* include <cmath> for INFINITY
* add final newline
* fail llama_sampler_init_llg() at runtime
* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes
* simplify #includes
* improve doc string for LLAMA_LLGUIDANCE
* typo in merge
* bump llguidance to 0.6.12
- Add `struct llama_sampler` and `struct llama_sampler_i`
- Add `llama_sampler_` API
- Add `llama_sampler_chain_` API for chaining multiple samplers
- Remove `LLAMA_API_INTERNAL`
- Add `llama_perf_` API and remove old `llama_print_timings` and `llama_reset_timings`
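A minimal usage sketch of the new chain API (function names as in llama.h at the time of this change; treat exact signatures as indicative):

```cpp
// build a sampler chain: top-k -> temperature -> final distribution sampling
llama_sampler_chain_params sparams = llama_sampler_chain_default_params();
llama_sampler * smpl = llama_sampler_chain_init(sparams);

llama_sampler_chain_add(smpl, llama_sampler_init_top_k(40));
llama_sampler_chain_add(smpl, llama_sampler_init_temp(0.8f));
llama_sampler_chain_add(smpl, llama_sampler_init_dist(LLAMA_DEFAULT_SEED));

// after llama_decode(ctx, batch):
llama_token id = llama_sampler_sample(smpl, ctx, -1); // -1 = last output

llama_perf_sampler_print(smpl); // timings via the new llama_perf_ API

llama_sampler_free(smpl);
```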
* llama_sampling_sample with default args is more naively usable
* Batches populated by either llama_batch_get_one or llama_batch_add work with default args
* Previously get_one could use the default argument
* Previously add should usually have used the last index where logits[idx] == true
* This hopefully encourages the use of llama_batch_add by giving expected results when using default arguments.
* Adds "negative indexing" feature to llama_get_logits_ith and llama_get_embeddings_ith
* Believed to work with any currently well-behaved program
* Default arg now works for both cases (previously would give strange results for add case)
* Any non-negative number is unaffected and behaves as previously
* Negative arguments were previously invalid.
* Implemented as a special case of indexing as suggested by @compilade in https://github.com/ggerganov/llama.cpp/pull/6519
* Fixed type mismatch errors
* cited in macOS CI tests
* Missed in original updates based on PR feedback in https://github.com/ggerganov/llama.cpp/pull/6519
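A small sketch of the negative-indexing behavior described above, assuming a context `ctx` that has just been decoded:

```cpp
// negative indices count from the end of the batch outputs
float * logits_last = llama_get_logits_ith    (ctx, -1); // logits of the last output
float * embd_last   = llama_get_embeddings_ith(ctx, -1); // embeddings of the last output
                                                          // (if embeddings are enabled)

// non-negative indices behave exactly as before
float * logits_first = llama_get_logits_ith(ctx, 0);
```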
* sampling: remove duplicated code for probability distribution access
* free original_logits
* fix original_logits allocation
* fixes based on review @cebtenzzre
* change function name to `llama_sampling_prepare`
* (WIP) Implement stochastic speculative decoding
* sample from residual distribution on draft accept failure
* fix #5657: force greedy sampling with probs when temp is 0
* remove p_accept parameter
* fix style
* remove unused variables
* add srand() in speculative.cpp
* replace use of rand() with mt19937 sampling
* fixes based on review (@JohannesGaessler)
* fix `r` random generation
* randomly select next sequence to verify + fix bug in memory freeing
* fix bug in active_seqs sync
* fix uniform int distribution initialization
* remove warnings from comparison between int and size_t
* check grammar in `llama_sample_probability_distribution_impl`
* remove malloc code by utilizing vectors
* add PR link to README
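For reference, the accept/reject rule behind "sample from residual distribution on draft accept failure" can be sketched roughly as follows; this is an illustrative standalone function, not the code in speculative.cpp:

```cpp
#include <algorithm>
#include <random>
#include <vector>

// One accept/reject step of stochastic speculative decoding: accept the drafted
// token with probability min(1, p_target/p_draft); otherwise resample from the
// residual distribution max(0, p_target - p_draft), renormalized.
int accept_or_resample(
        const std::vector<float> & p_target, // target model probabilities
        const std::vector<float> & p_draft,  // draft model probabilities
        int draft_token,
        std::mt19937 & rng) {
    std::uniform_real_distribution<float> u01(0.0f, 1.0f);

    if (p_draft[draft_token] > 0.0f &&
        u01(rng) < std::min(1.0f, p_target[draft_token] / p_draft[draft_token])) {
        return draft_token; // draft accepted
    }

    // draft rejected: sample from the residual distribution
    std::vector<float> residual(p_target.size());
    float sum = 0.0f;
    for (size_t i = 0; i < residual.size(); ++i) {
        residual[i] = std::max(0.0f, p_target[i] - p_draft[i]);
        sum += residual[i];
    }
    if (sum <= 0.0f) {
        residual = p_target; // distributions match: fall back to the target distribution
    }

    std::discrete_distribution<int> dist(residual.begin(), residual.end());
    return dist(rng);
}
```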
* common: use enums for sampler types
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* minor : spaces
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* implemented dynamic temperature sampling from koboldcpp
* removed trailing whitespace
* removed unused temp parameter in llama_sample_entropy
* exposed exponent_val in dynamic temp sampler
* added debug check for printf statements
* use nullptr in llama_sample_softmax call during llama_sample_entropy
this avoids counting the timing stats twice
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* return earlier if there is only 1 candidate (i.e. max_entropy == 0)
* reformat 't' case in llama_sample_queue
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* check for one or zero candidates case in llama_sample_entropy
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
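The entropy-based dynamic temperature idea above can be sketched as follows (illustrative only, not the llama_sample_entropy implementation); `exponent_val` is the exposed exponent parameter:

```cpp
#include <cmath>
#include <vector>

// Map the normalized entropy of the candidate distribution to a temperature
// in [min_temp, max_temp]: low entropy (a confident model) -> low temperature,
// high entropy (a flat distribution) -> high temperature.
float dynamic_temperature(const std::vector<float> & probs,
                          float min_temp, float max_temp, float exponent_val) {
    // with one (or zero) candidates max_entropy would be 0, so return early
    // to avoid dividing by zero
    if (probs.size() <= 1) {
        return min_temp;
    }

    float entropy = 0.0f;
    for (float p : probs) {
        if (p > 0.0f) {
            entropy -= p*std::log(p);
        }
    }
    const float max_entropy = std::log((float) probs.size());

    const float norm = std::pow(entropy / max_entropy, exponent_val);
    return min_temp + (max_temp - min_temp)*norm;
}
```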
* Samplers sequence order with parameter
* Cleaned up commented-out code
* Fixed formatting
* Rewrote with unordered_map
* Revert and rewrite, too many problems and safeguards would be needed
* Fixed code style
* Code style fixes according to review
* More readable samplers input string, fixed help
* Style fix in sampler_queue
* Formatting fixes
* Fixing whitespaces
* Introduce the new Min-P sampler by @kalomaze
The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter *p* represents the minimum probability for a token to be considered, relative to the probability of the most likely token.
* Min-P enabled and set to 0.05 default
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>
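The Min-P rule described above amounts to the following filter (an illustrative sketch, not the llama.cpp implementation):

```cpp
#include <algorithm>
#include <utility>
#include <vector>

struct token_prob { int id; float p; };

// Keep only tokens whose probability is at least p times the top probability,
// but never drop below min_keep candidates.
void min_p_filter(std::vector<token_prob> & cands, float p, size_t min_keep = 1) {
    if (cands.empty() || p <= 0.0f) {
        return;
    }

    float max_prob = 0.0f;
    for (const auto & c : cands) {
        max_prob = std::max(max_prob, c.p);
    }
    const float threshold = p*max_prob; // e.g. p = 0.05 keeps tokens within 5% of the best

    std::vector<token_prob> kept;
    for (const auto & c : cands) {
        if (c.p >= threshold) {
            kept.push_back(c);
        }
    }
    if (kept.size() >= min_keep) {
        cands = std::move(kept);
    }
}
```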
* Fix mirostat state when using multiple sequences
* Fix mirostat by completely refactoring sampling!
* Try to fix zig build.
* Export function to fetch/create default sampler states
Clean up code formatting and add some comments
Silence a warning about id not being used when logging is disabled
* Apply some renaming suggestions.
Fix comments that were out of sync with the pull.
* Use a more consistent naming convention for sampling contexts