* common : expose json-schema functionality to extract type info
* common : fix PEG parser negation during needs_more_input
* common : add some defensive measures in the constructed PEG parser
* common : add nemotron nano 3 support
* common : add nemotron nano 3 tests
* remove debug line
* kv-cache : fix state restore with fragmented cache (#17527)
Change find_slot to allow non-contiguous allocation during state restore. This fixes the 'failed to find available cells in kv cache' error when restoring state into a fragmented cache.
* tests : update logic
* cleanup: tightened the state_read_meta signature, added an is_contiguous case
* fix: tie up loose ends from the state_read_meta argument reorder
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
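A minimal sketch of the non-contiguous find_slot idea from the fix above (hypothetical names and layout, not the actual llama.cpp code):

    #include <cstdint>
    #include <vector>

    // Sketch: collect n_tokens free cells even when they are not adjacent.
    // `used` is a hypothetical per-cell occupancy flag; during state restore
    // the restored sequence does not need one contiguous slot, any set of
    // free cells will do.
    std::vector<uint32_t> find_slot_noncontig(const std::vector<bool> & used, uint32_t n_tokens) {
        std::vector<uint32_t> cells;
        for (uint32_t i = 0; i < used.size() && cells.size() < n_tokens; ++i) {
            if (!used[i]) {
                cells.push_back(i); // accept fragmented cells instead of failing
            }
        }
        if (cells.size() < n_tokens) {
            cells.clear(); // not enough free cells anywhere in the cache
        }
        return cells;
    }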
When the number of columns is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
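A serial sketch of those three phases, with the per-row workgroup count as a parameter (illustrative host code standing in for the shader, not the shader itself):

    #include <algorithm>
    #include <cmath>
    #include <vector>

    // Serial stand-in for the split-row softmax: each "workgroup" handles one
    // chunk of the row, and the phases hand partial results through temp buffers.
    void softmax_split_row(const std::vector<float> & x, std::vector<float> & y, int n_wg) {
        const int n     = (int) x.size();
        const int chunk = (n + n_wg - 1) / n_wg;
        y.resize(n);

        // phase 1: per-chunk max partials
        std::vector<float> max_p(n_wg, -INFINITY);
        for (int w = 0; w < n_wg; ++w) {
            for (int i = w*chunk; i < std::min((w + 1)*chunk, n); ++i) {
                max_p[w] = std::max(max_p[w], x[i]);
            }
        }
        // phase 2: global max, then per-chunk sum(exp(x - max)) partials
        const float m = *std::max_element(max_p.begin(), max_p.end());
        std::vector<float> sum_p(n_wg, 0.0f);
        for (int w = 0; w < n_wg; ++w) {
            for (int i = w*chunk; i < std::min((w + 1)*chunk, n); ++i) {
                sum_p[w] += std::exp(x[i] - m);
            }
        }
        // phase 3: global sum, then the scaled result
        float s = 0.0f;
        for (float p : sum_p) s += p;
        for (int i = 0; i < n; ++i) {
            y[i] = std::exp(x[i] - m) / s;
        }
    }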
* Extended TRI
* Fix whitespace
* chore: update webui build output
* Just use cuBLAS for everything...
* Merge both versions
* Remove incorrect imports causing failures for CI
* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"
* Defines for hipBLAS
* Aaaand MUSA defines...
* I hate this job...
* Stupid typo...
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* tests: update barrier test to check for race condition in active threads
* cpu: combine n_graph and n_threads into a single atomic update
* tests: add multi-graph test for test_barrier
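One race-free way to do such a combined update is to pack both counters into a single 64-bit atomic, so a worker can never observe one counter updated without the other. A sketch (the packing layout is an assumption, not ggml's actual code):

    #include <atomic>
    #include <cstdint>

    // Pack two 32-bit counters into one 64-bit word so both change atomically.
    std::atomic<uint64_t> g_state{0};

    void publish(uint32_t n_graph, uint32_t n_threads) {
        g_state.store((uint64_t) n_graph << 32 | n_threads, std::memory_order_release);
    }

    void read(uint32_t & n_graph, uint32_t & n_threads) {
        const uint64_t s = g_state.load(std::memory_order_acquire);
        n_graph   = (uint32_t) (s >> 32);
        n_threads = (uint32_t) (s & 0xffffffff);
    }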
* feat: Add a batched version of ssm_conv
This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Optimized SSM_SCAN kernel for Metal
This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Add test-backend-ops perf tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Real representative tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use function constant for ssm_conv batch size
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: backend op tests for ssm_scan from granite4 1b-h
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: remove commented out templates
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: float4 version of ssm_conv_batched
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing ggml_metal_cv_free
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
The old implementation parallelizes rows across SMs, which does not fit the needs of backend sampling (where ncols >> nrows, so we want to parallelize ncols across SMs instead).
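An illustrative grid mapping for that case (a sketch, not the actual kernel): launch blocks along the column dimension, so even a single very long row produces enough blocks to occupy all SMs.

    #include <cstdint>

    // Sketch: block-per-column-tile mapping for the ncols >> nrows case.
    // Each block handles TILE consecutive columns of one row, so a single row
    // with e.g. 200k columns still launches enough blocks to fill the GPU.
    template <int TILE>
    __global__ void col_parallel(const float * x, float * y, int ncols) {
        const int     row  = blockIdx.y;                  // few rows
        const int64_t col0 = (int64_t) blockIdx.x * TILE; // many column tiles
        for (int i = threadIdx.x; i < TILE; i += blockDim.x) {
            const int64_t col = col0 + i;
            if (col < ncols) {
                const int64_t idx = (int64_t) row * ncols + col;
                y[idx] = x[idx]; // per-element work would go here
            }
        }
    }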
* Fix kimi-k2 parsing
* fix template & add more tests for kimi-k2
* Another fix for Kimi-K2 chat template.
* enable allow_toolcall_in_think for Kimi-K2
* Refine key-value separator and value end format
* Enable tool call in think for kimi-k2
* allow_toolcall_in_think is now tested with Kimi-K2
* Remove outdated TODO comment in XML tool call parser
Removed TODO comment about untested tool call feature.
* Rename function from "utf8_truncate_safe" to "utf8_truncate_safe_len"
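The new name suggests the function returns a byte length that never splits a multi-byte UTF-8 sequence. A generic sketch of that technique (an assumption about the function's behavior, not the actual implementation):

    #include <algorithm>
    #include <cstddef>
    #include <string>

    // Return the largest length <= max_len that does not cut a UTF-8 code
    // point in half: back up over continuation bytes (0b10xxxxxx) so the
    // truncation point lands on a code point boundary.
    size_t utf8_truncate_safe_len(const std::string & s, size_t max_len) {
        size_t len = std::min(max_len, s.size());
        while (len > 0 && ((unsigned char) s[len] & 0xC0) == 0x80) {
            --len;
        }
        return len;
    }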
* Feat: Added Vulkan circular tiling support (wrap-around padding; see the indexing sketch after this PR's notes)
* Feat: Added CPU circular pad
* Feat: Added CUDA kernels
* Added tests
* Added tests
* Removed non-pad operations
* Removed unneeded changes
* removed backend non-pad tests
* Update test-backend-ops.cpp
* Fixed comment on pad test
* removed trailing whitespace
* Removed unneeded test in test-backend-ops
* Removed the deleted test from its call sites
* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp
Co-authored-by: Ruben Ortlam <picard12@live.de>
* Fixed alignment
* Formatting
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Format pad
* Format
* Clang format
* format
* format
* don't change so much stuff
* clang format and update to bool
* fix duplicates
* don't need to fix the padding
* make circular bool
* duplicate again
* rename vulkan to wrap around
* Don't need indent
* moved to const expr
* removed unneeded extra line break
* More readable method calls
* Minor wording changes
* Added final newline
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added circular pad ext tests
* Gate non-circular pad devices
* Cleaned gating of non-circular pad devices
---------
Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
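The wrap-around indexing sketch referenced above: circular padding fills out-of-range positions by wrapping the source index around, instead of filling with a constant (a 1-D illustration with hypothetical names, not the actual kernels):

    #include <vector>

    // Circular (wrap-around) pad in 1-D: positions outside the source are
    // filled by wrapping the index back into the source, not with a constant.
    std::vector<float> pad_circular_1d(const std::vector<float> & src, int pad_left, int pad_right) {
        const int n = (int) src.size(); // assumed > 0
        std::vector<float> dst(n + pad_left + pad_right);
        for (int i = 0; i < (int) dst.size(); ++i) {
            const int j = ((i - pad_left) % n + n) % n; // wrap into [0, n)
            dst[i] = src[j];
        }
        return dst;
    }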
* vulkan: Reduce temporary memory usage for TOP_K
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size.
- Pass the output row size as "ncols_output"; what used to be "ncols_output" is now "k".
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
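A back-of-envelope accounting of those numbers, assuming 16 bytes of temporary storage per candidate and roughly 31k first-pass outputs (both assumptions, not taken from the change):

    old: 200000 input elements  x 16 B ~= 3.2 MB  (temp row sized by the input)
    new: ~31000 pass-1 outputs  x 16 B ~= 0.5 MB  (temp row sized by the pass-1 output)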
* vulkan: fix top_k bug when there are ties in the input
By inspection I noticed a bug in the Vulkan top_k shader: if the least value in the top-k appears multiple times, we could end up writing those extra copies out rather than some larger values (when the larger values are on higher-numbered threads).
I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
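A tie-robust verification compares the selected values as sorted multisets rather than comparing index sets; a sketch of that testing idea (not the actual test code):

    #include <algorithm>
    #include <vector>

    // With ties, two correct top-k selections can pick different indices for
    // the equal least values, so compare the sorted values instead of the
    // index sets. Exact float compare is fine: top-k copies values verbatim.
    bool topk_values_match(std::vector<float> got, std::vector<float> expected) {
        std::sort(got.begin(), got.end());
        std::sort(expected.begin(), expected.end());
        return got == expected;
    }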
* Update tests/test-backend-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add CUDA support for CUMSUM and TRI.
* Minor optimizations.
* Correct warp_prefix_inclusive_sum in the float2 variant to return float2 (see the warp-scan sketch after this PR's notes)
* Optimize TRI
* Whitespace
* Fix strides.
* Implement double loop
* Whitespace
* Fix HIP compilation bugs
* Optimizations + big case performance tests
* Implement using CUB with fallback to custom kernel
* Remove error message.
* Fixes from code review
* Comment out CPU-unsupported F16/BF16 cases to fix CI
* Fine, you win :P
* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
* Vary warp-size based on physical warp size
* Add GGML_UNUSED_VARS in tri as well
* Use constexpr and call prefix_inclusive with warp_size template param
* Update ggml/src/ggml-cuda/cumsum.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Change to tid % warp_size
* Fix strides; hardcode mask; add ggml_lane_mask_t
* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()
* Too hasty...
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
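For reference, the warp-scan sketch mentioned above: the shuffle-based warp inclusive prefix sum that custom cumsum kernels typically build on (a generic textbook sketch, not this PR's exact code):

    #include <cstdio>

    __device__ float warp_prefix_inclusive_sum(float x) {
        const int lane = threadIdx.x % 32;
        // Kogge-Stone scan: each step adds the value from `offset` lanes below.
        #pragma unroll
        for (int offset = 1; offset < 32; offset <<= 1) {
            const float y = __shfl_up_sync(0xffffffff, x, offset);
            if (lane >= offset) {
                x += y;
            }
        }
        return x;
    }

    __global__ void scan32(float * x) {
        x[threadIdx.x] = warp_prefix_inclusive_sum(x[threadIdx.x]);
    }

    int main() {
        float h[32];
        for (int i = 0; i < 32; ++i) h[i] = 1.0f;
        float * d;
        cudaMalloc(&d, sizeof(h));
        cudaMemcpy(d, h, sizeof(h), cudaMemcpyHostToDevice);
        scan32<<<1, 32>>>(d);
        cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
        printf("%g\n", h[31]); // expect 32
        cudaFree(d);
        return 0;
    }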
This commit modifies the graph building logic to immediately continue when a token has already been sampled by the backend sampler.
It also updates the test for backend temporary sampling to include top-k and distribution samplers in the chain, verifying that they do not produce any logits (i.e. they are not run).
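On the graph side, the control flow amounts to an early continue per token; a minimal sketch with hypothetical names (not the actual llama.cpp code):

    #include <vector>

    // Hypothetical sketch of the graph-build loop: tokens already sampled by
    // the backend sampler are skipped, so no logits are produced for them and
    // later samplers in the chain never run on them.
    void build_graph(const std::vector<bool> & backend_sampled) {
        for (size_t i = 0; i < backend_sampled.size(); ++i) {
            if (backend_sampled[i]) {
                continue; // token already sampled on the backend: skip logits
            }
            // ... add the logits/output nodes for token i here ...
        }
    }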