* common : handle incomplete UTF-8 at end of input in PEG parser
* cont : if reached end prematurely, emit needs_more_input to propagate partial output
* cont: refactor peg parse context to add lenient flag
* cont : remove partial flag, keep lenient flag
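The incomplete-UTF-8 case from the commits above can be illustrated with a minimal sketch. It is not the PEG parser's actual code: `incomplete_utf8_tail` is a hypothetical helper that reports how many trailing bytes belong to a multi-byte sequence cut off by the end of input, so a caller could hold them back and ask for more input instead of failing.

```cpp
#include <cstddef>
#include <string_view>

// Hypothetical helper (not part of llama.cpp): returns the number of trailing
// bytes that start a UTF-8 sequence but are cut off by the end of `s`.
// Returns 0 when the input ends on a complete sequence (or is empty).
static size_t incomplete_utf8_tail(std::string_view s) {
    const size_t n = s.size();
    // A UTF-8 sequence is at most 4 bytes long, so inspect at most 4 bytes.
    for (size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = static_cast<unsigned char>(s[n - back]);
        if ((c & 0xC0) == 0x80) {
            continue; // continuation byte: keep searching for the lead byte
        }
        size_t need = 0; // total length announced by the lead byte
        if      ((c & 0x80) == 0x00) need = 1;
        else if ((c & 0xE0) == 0xC0) need = 2;
        else if ((c & 0xF0) == 0xE0) need = 3;
        else if ((c & 0xF8) == 0xF0) need = 4;
        else return 0; // invalid lead byte: malformed, not merely incomplete
        return need > back ? back : 0;
    }
    return 0; // no lead byte found within the last 4 bytes
}
```

A lenient caller could parse `s.substr(0, s.size() - incomplete_utf8_tail(s))` and signal that more input is needed when the result is non-zero.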
* ggml-Vulkan: add ELU support
* ggml-Vulkan: remove extra spaces and variables
* ggml-Vulkan: fix format issue
* ggml-Vulkan: fix format issue
* fix whitespace issue
* Update Vulkan.csv and ops.md
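For reference, a scalar sketch of the ELU activity the Vulkan commits above add support for. It assumes the standard definition with alpha = 1 (x for x > 0, exp(x) - 1 otherwise); `elu_ref` is a hypothetical name for illustration, not the shader or a ggml function.

```cpp
#include <cmath>

// Hypothetical scalar reference for ELU, assuming alpha = 1:
//   elu(x) = x            for x > 0
//   elu(x) = exp(x) - 1   otherwise
// std::expm1 keeps precision for small negative x.
static float elu_ref(float x) {
    return x > 0.0f ? x : std::expm1(x);
}
```

A reference like this can be useful when comparing a backend's output against the CPU result element by element.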
* vulkan: Fix data races in coopmat1 mul_mat(_id)
Add barriers between coopmat store and regular loads. We sort of got away with
this because it was the same subgroup accessing the values, but it's still a
race and may not work.
* switch to subgroup control barriers
* tests: add end-to-end tests per model architecture
* fixup for rebase
* fix use-after-free in llama-model-loader.cpp
* fix CI
* fix WebGPU
* fix CI
* disable CI for macOS-latest-cmake-arm64
* use expert_weights_scale only if != 0.0f
* comments
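The guard described above ("use expert_weights_scale only if != 0.0f") might look roughly like the following. This is an illustrative sketch only: `apply_expert_weights_scale` is a hypothetical helper, and it assumes that a scale of 0.0f means "not set" rather than "multiply by zero".

```cpp
// Hypothetical helper: apply the expert weights scale only when it is set.
// Assumption: 0.0f is the "unset" sentinel, so the weight passes through
// unchanged instead of being zeroed out.
static float apply_expert_weights_scale(float weight, float scale) {
    return scale != 0.0f ? weight * scale : weight;
}
```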
* Allow reshuffled arguments in tagged argument parser format tool calls.
* Remove shuffle; just keep the optional parsers in any order
* Remove unnecessary import
* ggml-cuda: add mem check for fusion
* Replace NaNs with -FLT_MAX
* fix typo
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
This patch addresses an Internal Compiler Error (Segmentation fault)
observed with GCC 15 by replacing the intrinsic + cast with a cast on
the data first, followed by the intrinsic call. This bypasses the
buggy compiler path while maintaining identical instruction selection.
Performance Verification:
Assembly analysis on RHEL 9 (GCC 15.1.1) confirms that both the original
code and this fix generate the identical Power10 prefixed load instruction:
`plxv 40, 2(14)`
This ensures zero performance regression while unblocking builds on
newer toolchains.
Reproduced on:
- Alpine Linux + GCC 15.2.0-r2
- RHEL 9 + GCC 15.1.1 (gcc-toolset-15)
Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
* CUDA: use shared mem for ssm_conv
* fuse silu + ssm_conv
* fuse unary + mul
* enable for fp16
* formatting
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* hexagon: add fp16 support for binary ops: add,sub,mul,div
* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)
* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad
* snapdragon: fix readme link
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* models : add llm_build_delta_net_base
* cont : keep qwen35 and qwen35moe graphs intact
* cont : add comments [no ci]
* add kimi linear to delta-net-base
* removed unnecessary ggml_cont from g_exp_t
* removed ggml_cont from g_diff_exp_t. moved ggml_cont for o to kimi-linear.cpp
* removed unnecessary diag mask
* cont : simplify
* cont : avoid graph splits
* scale q after mul instead of beginning
* scale q after mul instead of beginning
* identical ppl
* cont : fix scale and decay mask
* minor : remove TODO
* block implementation for kda
* remove space at the end of line 101
* concat+pad
* pad+binary row concat
* chunk size 16 for kda
* removed minor differences to master
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Adds CPU-to-CUDA copy capability to
ggml_backend_cuda_cpy_tensor_async()
* Adds function to relax sync requirements between input copies on
supported backends (CUDA for now)
* Exchanges synchronous copy with async copy function.
* Adds macro guards to allow compilation in non-CUDA builds
* Reworked backend detection in ggml-backend.cpp to avoid linking
conflicts
* Relax requirement of checks in async CUDA copies from backend and buffer type to just buffer type, to avoid linking issues
* Minor cleanup
* Makes opt-in to relax use of explicit syncs more general. Backends like
vulkan which require a synchronization between HtoD copies and graph
execution could also adopt this change now.
* Reintroduces stricter check for CPU->CUDA backend async copy via
GGML_DEVICE_TYPE_CPU.
* Corrects initialization of ggml_backend_sync_mode in
ggml_backend_sched_split initialization
* Simplifies synchronizations to adhere to `saaasg` pattern.
* Apply suggestion from @ggerganov (src->buffer to buf_src)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestion from @ggerganov (src->buffer to buf_src) v2
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* model : fix Qwen3.5 model type detection
* Update src/llama-model.cpp
whoops, my bad
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>