This commit fixes the issue where both sampled tokens and logits/probs
were not being copied correctly from the backend to the host when
multiple backend samplers were used.
A test for this scenario has also been added to ensure that both types
of data are copied correctly when different backend samplers are
employed.
* mmf for rdna4
* align the padding for rdna4
* forbit mul_mat_f for rdna4
* fix as comment
* remove device kernels
* add constexpr for early return
* update based on review comment
* change based on the review comment
* pass compile error
* keep code consistency
---------
Co-authored-by: zhang hui <you@example.com>
* Detect GigaChat3-10-A1.8B as deepseek lite
Hardcodes checking number of layers to detect if lite version of deepseek.
* Add commnent identifying deepseek lite variants
deepseek lite variants include DeepSeek-V2-Lite, GigaChat3-10B-A1.8B
This commit refactors the model loading process in common/common.cpp
to enable backend sampler to be configure prior to the llama_context
creation.
The motivation for this change is that just being able to set/reset the
backend samplers after the llama_context has been created will cause a
resize to occur in llama_context::output_reserve which we want to avoid.
* CANN: Refactor `evaluate_and_capture_cann_graph`
**Description of the problem**
* `matched_graph` is obtained even if graph mode is disabled.
* End of graph capture and graph replay are unnecessarily placed in different `if` blocks.
**Proposed solution**
* Obtain `matched_graph` only if graph mode is enabled.
* Place end of graph capture and graph reply inside the same `if` block.
* Unify graph related comments.
* Remove trailing whitespace
This commit introduces a sampling_info struct to encapsulate all
backend sampling related data within the llama_context class.
It also updates to use more descriptive names for sampled tokens and
candidates in the backend sampler ggml data structure.
* Fix DoS / integer overflow
* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)
* White space
* Actually, since it's unsigned, use UINT64_MAX
* fix: TypeError when loading base model remotely in convert_lora_to_gguf
* refactor: simplify base model loading using cache_dir from HuggingFace
* Update convert_lora_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* feat: add remote_hf_model_id to trigger lazy mode in LoRA converter
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
This commit updates common/sampler.cpp set_logits and
src/llama-sampling.cpp llama_sampler_sample to always populate the
logits field when backend sampled probabilities are available.
The motivation for this is that this ensure that CPU sampler always have
access to the logits values even when probabilites have been produced by
backend samplers.