* vulkan: extend topk_moe to handle sigmoid w/exp_probs_b for nemotron
Also handle GGML_OP_SCALE at the end (nemotron, deepseek2).
Fewer pipeline variants and spec constants, just use push constants.
In test_topk_moe, change exp_probs_b to be 1D, matching real networks.
Update test-backend-ops and ggml-backend to allow verifying multiple outputs
in a fusion test (topk_moe has two outputs). Previously only the final node
was verified.
* change test_topk_moe to allow results in arbitrary order
* disable sigmoid fusion for moltenvk
* chat: make tool description and parameters optional per OpenAI spec
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix#17667
* refactor: use value() for cleaner optional field access
I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn
and added coverage for exp_probs_b and some other missing combinations. This
exposed a bug in both CUDA and Vulkan backends where they were assuming the
input to argsort and the input to get_rows are the same. I'd like to optimize
this graph in another change, but for now just get it functional.
CUDA also had a bug where it got n_experts from the wrong place, leading to
GGML_ASSERT failures in some of the new tests.
* arg: fix order to use short form before long form
* arg: update doc
* arg: update test-arg-parser
* arg: address review feedback from ngxson
simplified to check first.length() <= last.length() only
fixed: --sampler-seq, --rerank, --draft ordering
note: middle positions in 3+ arg sets are not verified
* arg: update doc
* common : expose json-schema functionality to extract type info
* common : fix peg parser negation during needs_more_input
* common : add some defensive measures in constructed peg parser
* common : add nemotron nano 3 support
* common : add nemotron nano 3 tests
* remove debug line
* kv-cache : fix state restore with fragmented cache (#17527)
Change find_slot to allow non-contiguous allocation during state restore. Fixes 'failed to find available cells in kv cache' error when restoring state to fragmented cache.
* tests : update logic
* cleanup: tightened state_read_meta sig, added is_contiguous case
* fix: state_read_meta arg reorder loose ends
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
When the number of cols is large, split each row across multiple workgroups.
There are three phases that communicate partial results through temp buffers:
(1) compute max partials
(2) take max of partials, compute sum(exp(x-max)) partials
(3) sum partials, compute scaled result
* Extended TRI
* Fix whitespace
* chore: update webui build output
* Just use cuBLAS for everything...
* Merge both versions
* Remove incorrect imports causing failures for CI
* Still failing... remove all direct cublas imports and rely on common imports from "common.cuh"
* Defines for hipBlas
* Aaaand MUSA defines...
* I hate this job...
* Stupid typo...
* Update ggml/src/ggml-cuda/solve_tri.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* tests: update barrier test to check for race condition in active threads
* cpu: combine n_graph and n_threads into a single atomic update
* tests: add multi-graph test for test_barrier
* feat: Add a batched version of ssm_conv
This was done using Claude Code. It found a number of optimizations around
how the threads were organized, resulting in a huge performance boost!
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Optimized SSM_SCAN kernel for metal
This used Claude Code and resulted in a modest performance improvement
while maintaining correctness.
Branch: Mamba2SSD
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Add test-backend-ops perf tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: Real representitive tests for SSM_CONV
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use function constant for ssm_conv batch size
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* test: backend op tests for ssm_scan from granite4 1b-h
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* style: remove commented out templates
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: float4 version of ssm_conv_batched
Branch: SSMKernelImprovements
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Add missing ggml_metal_cv_free
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Old implementation parallelizes rows across SMs, which does not fit the
needs of backend-sampling (where we have ncols >> nrows and thus want to
parallelize ncols across SMs)