* vulkan: Reduce temporary memory usage for TOP_K
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
* vulkan: fix top_k bug when there are ties in the input
I noticed by inspection a bug in the vulkan top_k shader where if the least
value in the top_k appears multiple times we could end up writing those extra
copies out rather than some larger values (if the larger values are on higher
numbered threads).
I rewrote the test verification to handle this case, where the final index set
is not necessarily the same.
* Update tests/test-backend-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add support for CUMSUM and TRI for CUDA.
* Minor optimizations.
* Correct warp_prefix_inclusive_sum in float2 variant to return float2
* Optimize TRI
* Whitespace
* Fix strides.
* Implement double loop
* Whitespace
* Fix HIP compilation bugs
* Optimizations + big case performance tests
* Implement using CUB with fallback to custom kernel
* Remove error message.
* Fixes from code review
* Comment out CPU-unsupported F16/BF16 cases to fix CI
* Fine, you win :P
* Fix last cast, use NO_DEVICE_CODE and GGML_UNUSED_VARS
* Vary warp-size based on physical warp size
* Add GGML_UNUSED_VARS in tri as well
* Use constexpr and call prefix_inclusive with warp_size template param
* Update ggml/src/ggml-cuda/cumsum.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestions from code review
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Change to tid % warp_size
* Fix strides; hardcode mask; add ggml_lane_mask_t
* Missing renames, remove unused get_warp_mask(), explicit calls to ggml_cuda_info()
* Too hasty...
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* common : implement parser combinators to simplify chat parsing
* add virtual destructor to parser_base
* fix memory leak from circular references of rules
* implement gbnf grammar building
* remove unused private variable
* create a base visitor and implement id assignment as a visitor
* fix const ref for grammar builder
* clean up types, friend classes, and class declarations
* remove builder usage from until_parser
* Use a counter class to help assign rule ids
* cache everything
* add short description for each parser
* create a type for the root parser
* implement repetition parser
* Make optional, one_or_more, and zero_or_more subclasses of repetition
* improve context constructor
* improve until parsing and add benchmarks
* remove cached() pattern, cache in parser_base with specialized parsing functions for each parser
* improve json parsing performance to better match legacy parsing
* fix const auto * it for windows
* move id assignment to classes instead of using a visitor
* create named rules in the command r7b example
* use '.' for any in GBNF
* fix parens around choices in gbnf grammar
* add convenience operators to turn strings to literals
* add free-form operators for const char * to simplify defining literals
* simplify test case parser
* implement semantic actions
* remove groups in favor of actions and a scratchpad
* add built in actions for common operations
* add actions to command r7b example
* use std::default_searcher for platforms that don't have bm
* improve parser_type handling and add cast helper
* add partial result type to better control when to run actions
* fix bug in until()
* run actions on partial results by default
* use common_chat_msg for result
* add qwen3 example wip
* trash partial idea and simplify
* move action arguments to a struct
* implement aho-corasick matcher for until_parser and to build exclusion grammars
* use std::string for input, since std::string_view is incompatible with std::regex
* Refactor tests
* improve qwen3 example
* implement sax-style parsing and refactor
* fix json string in test
* rename classes to use common_chat_ prefix
* remove is_ suffix from functions
* rename from id_counter to just counter
* Final refactored tests
* Fix executable name and editorconfig-checker
* Third time's the charm...
* add trigger parser to begin lazy grammar rule generation
* working lazy grammar
* refactor json rules now that we check for reachability
* reduce pointer usage
* print out grammars in example
* rename to chat-peg-parser* and common_chat_peg_parser*
* Revert unrelated changes
* New macros for CMakeLists to enable multi-file compilations
* starting unicode support
* add unicode support to char_parser
* use unparsed args as additional sources
* Refactor tests to new harness
* Fix CMakeLists
* fix rate calculation
* add unicode tests
* fix trailing whitespace and line endings
skip-checks: true
* Helpers + rewrite qwen3 with helpers
* Fix whitespace
* extract unicode functions to separate file
* refactor parse unicode function
* fix compiler error
* improve construction of sequence/choice parsers
* be less clever
* add make_parser helper function
* expand usage of make_parser, alias common_chat_msg_peg_parser_builder to builder in source
* lower bench iterations
* add unicode support to until_parser
* add unicode support to json_string_parser
* clean up unicode tests
* reduce unicode details to match src/unicode.cpp
* simplify even further
* remove unused functions
* fix type
* reformat char class parsing
* clean up json string parser
* clean up + fix diagnostics
* reorder includes
* compact builder functions
* replace action_parser with capture_parser, rename env to semantics
* rename env to semantics
* clean up common_chat_parse_context
* move type() to below constant
* use default constructor for common_chat_peg_parser
* make all operators functions for consistency
* fix compilation errors in test-optional.cpp
* simplify result values
* rename json_string_unquoted to json_string_content
* Move helper to separate class, add separate explicit and helper classes
* Whitespace
* Change + to append()
* Reformat
* Add extra helpers, tests and Minimax example
* Add some extra optional debugging prints + real example of how to use them
* fix bug in repetitions when min_count = 0 reports failures
* dump rule in debug
* fix token accumulation and assert parsing never fails
* indent debug by depth
* use LOG_* in tests so logs sync up with test logs
* - Add selective testing
- Refactor all messaging to use LOG_ERR
- Fix lack of argument / tool name capturing
- Temporary fix for double event capture
* refactor rule() and introduce ref()
* clean up visitor
* clean up indirection in root parser w.r.t rules
* store shared ptr directly in parser classes
* replace aho-corasick automation with a simple trie
* Reset prev for qwen3 helper example variant
* refactor to use value semantics with std::variant/std::visit
* simplify trie_matcher result
* fix linting issues
* add annotations to rules
* revert test workaround
* implement serializing the parser
* remove redundant parsers
* remove tests
* gbnf generation fixes
* remove LOG_* use in tests
* update gbnf tests to test entire grammar
* clean up gbnf generation and fix a few bugs
* fix typo in test output
* remove implicit conversion rules
* improve test output
* rename trie_matcher to trie
* simplify trie to just know if a node is the end of a word
* remove common_chat_ prefix and ensure a common_peg_ prefix to all types
* rename chat-peg-parser -> peg-parser
* promote chat-peg-parser-helper to chat-peg-parser
* checkpoint
* use a static_assert to ensure we handle every branch
* inline trivial peg parser builders
* use json strings for now
* implement basic and native chat peg parser builders/extractors
* resolve refs to their rules
* remove packrat caching (for now)
* update tests
* compare parsers with incremental input
* benchmark both complete and incremental parsing
* add raw string generation from json schema
* add support for string schemas in gbnf generation
* fix qwen example to include \n
* tidy up example
* rename extractor to mapper
* rename ast_arena to ast
* place basic tests into one
* use gbnf_format_literal from json-schema-to-grammar
* integrate parser with common/chat and server
* clean up schema and serialization
* add json-schema raw string tests
* clean up json creation and remove capture parser
* trim spaces from reasoning and content
* clean up redundant rules and comments
* rename input_is_complete to is_partial to match rest of project
* simplify json rules
* remove extraneous file
* remove comment
* implement += and |= operators
* add comments to qwen3 implementation
* reorder arguments to common_chat_peg_parse
* remove commented outdated tests
* add explicit copy constructor
* fix operators and constness
* wip: update test-chat for qwen3-coder
* bring json parser closer to json-schema-to-grammar rules
* trim trailing space for most things
* fix qwen3 coder rules w.r.t. trailing spaces
* group rules
* do not trim trailing space from string args
* tweak spacing of qwen3 grammar
* update qwen3-coder tests
* qwen3-coder small fixes
* place parser in common_chat_syntax to simplify invocation
* use std::set to collect rules to keep order predictable for tests
* initialize parser to make certain platforms happy
* revert back to std::unordered_set, sort rule names at the end instead
* uncomment rest of chat tests
* define explicit default constructor
* improve arena init and server integration
* fix chat test
* add json_member()
* add a comprehensive native example
* clean up example qwen test and add response_format example to native test
* make build_peg_parser accept std::function instead of template
* change peg parser parameters into const ref
* push tool call on tool open for constructed parser
* add parsing documentation
* clean up some comments
* add json schema support to qwen3-coder
* add id initializer in tests
* remove grammar debug line from qwen3-coder
* refactor qwen3-coder to use sequence over operators
* only call common_chat_peg_parse if appropriate format
* simplify qwen3-coder space handling
* revert qwen3-coder implementation
* revert json-schema-to-grammar changes
* remove unnecessary forward declaration
* small adjustment to until_parser
* rename C/C++ files to use dashes
* codeowners : add aldehir to peg-parser and related files
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.
* fix pipeline selection
* vulkan: Add N-ary search algorithm for topk
* microoptimizations
* vulkan: support larger argsort
This is an extension of the original bitonic sorting shader that puts the
temporary values in global memory and when more than 1024 threads are needed
it runs multiple workgroups and synchronizes through a pipelinebarrier.
To improve the memory access pattern, a copy of the float value is kept with
the index value. I've applied this same change to the original shared memory
version of the shader, which is still used when ncols <= 1024.
* Reduce the number of shader variants. Use smaller workgroups when doing a single pass, for a modest perf boost
* reduce loop overhead
* run multiple cols per invocation, to reduce barrier overhead
* Fix too relaxed check on CUDA "fast copy" (can_be_transposed) condition
* Argh.
* Making CISC happy ;)
* Integrate CONT tests
* Use loopy loop
* Skip new tests for (B)F16 for now.
* Add files via upload
* fix unit test
* fix crashes for --reasoning-format=none
* Patch buggy official MiniMax-M2 chat template
* add upstream minja fix: https://github.com/ochafik/minja/pull/7
* Fix <think> token not generated
* add test copied from https://github.com/ggml-org/llama.cpp/pull/16946
* cleanup
* Hopes to fix the compilation error on CI
* Delete chat template patching since it’s fixed by upstream Minja
* Remove undeeded Minimax-M2 template patch
https://github.com/ochafik/minja/pull/7#issuecomment-3480356100
* Add proper handling of optional parameters with test
merged tests from: 23d4bb75c4
* Fix making all tool parameters optional
* Move xml tool parser to separate file
* cleanup & add tests for GLM4.5
* add streaming tests & enhancement & cleanups
Add streaming test for both GLM 4.5 and minimax-m2.
Cleanup for preserved_tokens.
Cleanup for grammar rule name.
Enhance the parser's stability.
* cleanup & add support for Kimi-K2 Qwen3-Coder Apriel-1.5 Xiaomi-MiMo
* apply suggestions from reviewers
* fix a misuse for data.grammar_lazy
* fix grammar when tool have no argument
* Fix `no triggers set for lazy grammar!` for GLM4.5/4.6. Insert additional stops for Kimi-K2
* update chat.cpp
* fix grammar for GLM 4.5/4.6
* Try fix Jinja template for GLM
* Try fix GLM-4.6.jinja
* Update common/chat-parser-xml-toolcall.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update tests/test-chat.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* improve chat template for GLM, rename Kimi-K2 template to Kimi-K2-Thinking
* Improve Kimi-K2 chat template
* Fix unit test
* Fix "Invalid tool call arguments passed" in a rare case.
In a rare case, the model may emit a raw string that begins with a valid JSON string. This commit adds unit tests to cover that scenario and fixes the regression introduced during the Kimi-K2 adaptation.
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* extract rotate_pairs logic from ggml_compute_forward_rope_f32
* templateify ggml_compute_forward_rope_f32 and _f16
* abort when rope type not supported, remove GLM from test-rope
* add imrope branch to switch
* add rope tests for perf
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan : implement upscale with bicubic interpolation
* cuda : implement upscale with bicubic interpolation
* tests : add ggml_interpolate with GGML_SCALE_MODE_BICUBIC to backend tests
* adapt OpenCL backend to not support the OP in that case so tests don't fail
* print scale mode & flags in test-backend-ops
This change combines the rms_norm+mul and rope+view+set_rows fusions to
allow fusing the whole sequence together. This comes up in Qwen3, Bailing,
and some other models.
* WIP
* added a cpy kernel specific to transposed tensor which uses smem to avoid uncoalesced access; test cases also added shwoing improved memory bandwidth
* added BF16 support
* more strict check to make sure src0 is a transpose
* reformulated to handle more complicated transpose cases
* bring back 2D transpose for higher performance
* allow build on windows
* tranpose copy more shapes
* minor tweak
* final clean up
* restore some test cases
* keep only the kernel for true tranposed case; updated with review suggestions
* make CI happy
* remove headers not needed
* reduced bank conflicts for fp16 and bf16
* add missing const*
* now bank conflicts free
* use padding instead of swizzling
---------
Co-authored-by: bssrdf <bssrdf@gmail.com>
* tests: fix segfault in moe-expert-reduce test in support mode and --show-coverage
* tests: init gf and filter out fusion tests for support mode
* tests: filter out fusion cases before calling eval_support
* tests: filter out fusion cases from show_test_coverage as well, fix lint
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* server : support unified context across slots
* cont : fix speculative decoding initialization
* context : fix n_ctx_per_seq computation
* server : purge slots one by one
* tests : add unified cache server tests
* llama : update per-seq context computation
* test-thread-safety : handle tiny training context of the input model
* server : fix server_tokens clear()
* server : use 4 slots + unified KV by default
* llama : add note about context size queries
* cont : update todos [no ci]
* context : do not cap the size of the context
* tests : adjust parameters to be CI friendlier
* context : add warning
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).
Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).
Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.
Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.
Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.
Add new backend tests.
* ggml : fix interpolate with align-corners and ne=1
* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken
* fix clang warning
* SYCL: Add support for FLOOR,CEIL,ROUND and TRUNC unary operators
Clean up unrelated changes from previous commit
* Chore: remove empty lines and fix indentation
* Clean up: remove leftover blank lines and fix spacing
* chore: fix trailing whitespace and ensure final newline
* Cleanup: remove redundant declarations already defined in header
* Sync docs/ops.md with updated backend operation support
* docs: update ops.md after rebase
* docs: update ops.md - Vulkan supports SSM_CONV and SSM_SCAN
* opencl: add mm_q8_0_f32
* opencl: fix data loading for incomplete tile
* opencl: use q8_0 mm for larger matrix
* opencl: add some tests to cover the path
* optimise GGML_OP_SUM
* add non-contiguous tests by permuting the input
* change tests to require full contiguity of OP_SUM
* cuda : add check GGML_OP_SUM
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
* refactor: implement streaming-aware universal reasoning parser
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and
multiple reasoning segments, allowing proper separation of reasoning_content
and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments
The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.
Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
* refactor: address review feedback from allozaur
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
* refactor: address review feedback from ngxson
* debug: say goodbye to curl -N, hello one-click raw stream
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
* npm run format
* chat-parser: address review feedback from ngxson
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
* metal : pad K, V and Mask when needed
* cont : simplify
* cuda : add TODO about KV padding requirement
* metal : add comments
* metal : remove mask padding requirement
* tests : add -INF blocks to the KQ mask in the FA tests
* cont : bump -INF block size to 64
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* ggml : prevent division by zero in FA CPU op
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* fix: Fix duplicate fake image before token on first slice
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use double-newline before overview image
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Remove incorrect newline at the end of granite chat template gen prompt
There should not be one, even for the language models.
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* tests: Remove bad newline from granite chat template test (legacy)
Branch: GraniteDoclingStopping
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: added a dedicated Magistral chat format that preserves [THINK] spans, parses reasoning before tool calls
* feat: new flow in the chat template test suite for Magistral
* do not use more threads than physically available
* ensure n_threads > 0
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* First attempt
* No permute during convert (fixes qk tensors), proper norm application.
* RoPE = NeoX
* Coherence!
* Migrate xielu params from tensors to hyperparameters
* Simple CUDA kernel
* Revert stupid LLM refactorings
* Chat template support
* configchecker / flake8 errors
* Reorder unary.cu
* I do conclude that LLMs are, in fact, stupid.
* Fix after merge
* Final newline
* Make xIELU an UNARY_OP
* Final newline
* Correctly account for parameter shift
* Argh.
* Update ggml/src/ggml-cpu/unary-ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Refactor: remove unused methods, inline and factorize softplus, add const modifiers
* Revert CUDA changes, implement xIELU as a separate OP
* Pesky newline
* Add float2half / half2float for F16 inputs/outputs
* CUDA variants, attempt 2
* Actually, attempt 3
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Missing convert header
* Proper formula and reference for xIELU in the comments.
* Modify unary-ops.cpp to add the functor-based logic besides the template system to retain optimizations
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add tensor mappings for Apertus to global list instead
* Fix lazy on scalars
* Update ggml/src/ggml-cuda/unary.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add comment about the constraints on positive/negative alpha
* Change `softplus` to `ggml_softplus`
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Work on rope
* Simplify inplace operation generation and combine mul/add generation
* Work on rope variants
* implement neox rope
* rope complete
* Add sub,div,glu operators
* implement scale op
* Update cpy shader to handle cont/more types
* formatting
* Update test vars printing for rope,rms_norm
* Avoid ROPE hardcoded constants
* Add TODO to change ROPE constants to enum
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix TODO comment
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.
The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* vulkan: 64-bit im2col
Add variants of the im2col shaders that use buffer_device_address/buffer_reference,
and use 64-bit address calculations. This is needed for large convolutions used in
stable-diffusion.cpp.
* fix validation error for large im2col
* metal : support mul_mm with src1->type == GGML_TYPE_F16
* metal : support mul_mm_id with src1->type == GGML_TYPE_F16
[no ci]
* metal : mul_mm support ne00 % 32 != 0
* metal : support mul_mm_id with ne00 % 32 != 0
* cont : remove unnecessary unrolls
* cont : simplify data loading
* metal : optimize mul_mm when output bounds checks are not needed
* vulkan: handle mat_mul with A matrix > 4GB
This change splits mat_mul operations with huge A matrix into chunks in the M
dimension. This works well for stable-diffusion use cases where the im2col
matrix has very large M.
Fix the order of setting the stride in mul_mm_cm2 - setting the dimension
clobbers the stride, so stride should be set after.
* build fixes
* CUDA: mul_mat_id for mmf for bs <= 64 for f16 and bs <= 32 for f32
This commit adds mul_mat_id support for ncols_dst >= 16. It does this by
packing ncols_dst tiles into the blockDim.y.
My tests on a RTX 3090 show that this is faster than the cuBLAS fallback
for f16 till bs=64, and for f32 till bs=32
* Review: refactor if statement
* devops: move s390x and ppc64le ci build
we have access to ubuntu-24.04-s390x and ppc64le images now
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le for now since they have compiler errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: stop warnings as errors
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: switch to non-macro flag
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: going the llama macro route
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian gguf test models
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le to test s390x, check test build
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.inp files for big-endian tests
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dup .gguf.out files for big-endian too
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add python setup and endian byteswap
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: pooring thing does not have s390x python3
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add missing rust compiler for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try rust actions runner
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Revert "devops: try rust actions runner"
This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: try a different path for rust
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: dump home directory and user info
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: install gguf-py only
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: missed relative path
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: remove big-endian files since local swapping is working
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: revert test-tokenizer-0 cmakelists
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix unicode flags conversion from and to uint16_t
Bitfields are allocated in different order on s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Simplify byteswap command
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix endianness detection in vocab loader
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Disable test-thread-safety on s390x
In this test a model is downloaded,
then immediately loaded to check if more downloads are needed,
and then used for test.
There is no clean way to separate all those steps
to add byteswapping between them, so just skip this test.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q8_0 test in test-quantize-fns
vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add big-endian stories260K
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: add s390x test-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix test does not exist
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: fix model not found llama-eval-callback
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix q3_K dot product error in test-quantize-fns on s390x
Array q8bytes had only 4 elements allocated, but 8 elements accessed.
This lead to write out of bounds and later read of overwritten values out of bounds
and incorrect result.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: re-enable ppc64le for testing
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: activate test-thread-safety for s390x
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: disable ppc64le tests
for some reason it keeps failing test-thread-safety tests and I do not
have a machine that is able to replicate the tests.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* devops: LLAMA_FATAL_WARNINGS=ON
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Correct repository URL for s390x for test-thread-safety model
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fix fs_get_cache_directory
Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Re-enable CI for ppc64le
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Fortify ggml_rope_impl
Only memcpy data from sections argument if it's non-NULL.
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way
* Update URL for big-endian model
* Update .github/workflows/build.yml
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update remaining mentions of BE models to ggml-org/models repo
---------
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* CUDA: add a fused top-K MoE kernel
This kernel does the following:
1. softmax over the logits per token [n_experts, n_tokens]
2. argmax reduce over the top-k (n_experts_used) logits
3. write weights + ids to global memory
It is intended as fusion of softmax->top-k->get_rows pipeline for MoE models
* Refactor into ggml_cuda_should_use_topk_moe
* Review: Use better coalescing pattern, use WARP_SIZE, store logits into registers before
* Review: format + micro-optimizations
* Fix bug: fix tie breakers
* Add optional norm + clean-up code
* Use smem for final write
* Add bounds check
* Use better memory pattern for writeback
* run the x64 ci on regular machines
* set up the same thing for arm
fix test-quantize-perf just like #12306
* try to disable sve
* add another sve run
* ggml : make gallocr respect the backend's max buffer size
* if the graph requires more memory than can fit into a single allocation, split it into multiple backend buffers
* vulkan: report the actual max allocation size in buffer type interface
* fix missing newline, apple-clang warning
* track size of individual chunks in ggml_dyn_tallocr and raise max chunks.
revert to use suballocation_block_size as max chunk size for vulkan.
* track (chunk, offset) pairs instead of "global" offsets through gallocr.
* simpler, don't need loops to map between local/global offsets
* touches more code
* fix dyn_tallocr_max_size and initialization
* fix memory leak when buffers are reused due to same buffer type appearing multiple times
* make vbuffer allocation follow the same logic as backend_buffer did before
* continue to use leftover unallocated space of previous chunks after a new one has been created
* treat free blocks of each chunk as separate list
* they're still allocated together, but start/end of each chunk is tracked, and allocate/free iterate over sub-ranges
* exhaust freed blocks of all chunks before considering their last blocks with unallocated space
* start with 0 chunks/blocks and create chunks as needed
* allow the last chunk to grow beyond max size
* refactor: move adding new free block and new chunk into separate functions
* allocate chunks individually with a separate free-blocks list for each one
* needs a bit more memory/allocations/indirections, but code is simpler
* fix warnings (missing static) & debug checks
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* CUDA: Optimize PAD_REFLECT_1D
feat: add more test cases for PAD_REFLECT_1D
* use fast_div to improve performance
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Apply suggestion from JohannesGaessler
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* optimize
* use a concise expression to further speedup the cuda kernel
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Add paramater buffer pool, batching of submissions, refactor command building/submission
* Add header for linux builds
* Free staged parameter buffers at once
* Format with clang-format
* Fix thread-safe implementation
* Use device implicit synchronization
* Update workflow to use custom release
* Remove testing branch workflow
* some f32 tests passing
* Disable set_rows until it's implemented
* f32 add all tests passing
* Begin work on set_rows
* Work on set rows
* Add error buffers for reporting unsupported SET_ROWS indices
* Remove extra comments
* Add templated addition, clean up code
* Get addition and multiplication working
* Implement rms_norm
* Add get_rows implementation
* Add new get_rows files
* Refactor use of wg size entry
* Fix compilation
* Try manually unrolled q4_0 quant
* Revert "Try manually unrolled q4_0 quant"
This reverts commit 77f8b96515.
* Move to constant max wg size
* Check for tensor size in supports_op
* Vectorize f32 and change default workgroup size
* Move f32 get_rows from < 4 to % 4 != 0
* fix linter errors
* Add in-place tests
---------
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
* metal : improve naming
* metal : refactor device
ggml-ci
* cont : props
ggml-ci
* metal : apply ggml_mem_ranges_t
ggml-ci
* metal : remove GGML_METAL_USE_BF16
ggml-ci
* metal : refactor device buffer
ggml-ci
* cont : fix naming
* metal : sync before destroying the backend
ggml-ci
* metal : refactor context
ggml-ci
* metal : migrate ggml-metal.m to ggml-metal.cpp
ggml-ci
* metal : adjust ops API
ggml-ci
* metal : use C++ to store piplienes
ggml-ci
* metal : migrate ops to separate functions
ggml-ci
* metal : add ggml_metal_library_t
ggml-ci
* metal : improve naming
ggml-ci
* metal : cleanp
ggml-ci
* metal : add support for GGML_OP_LOG
ggml-ci
* metal : fix error handling
ggml-ci
* Add fastdiv and fastmodulo to k_bin_bcast kernel
* Address review comments
* `prod_` instead of `prod` suffix
* Add test case for `k_bin_bcast_unravel` in CUDA backend
* CUDA: Add mul_mat_id support the mmf
Add support for mul_mat_id for bs < 16
* Review: use warp_size, fix should_use_mmf condition
* Launch one block per expert, stride along n_expert_used
* templatize mul_mat_id
* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids
* Reduce compile times by dividing mmf into f16, bf16 and f32 variants
* Divide mmf by ncols_dst
* Add missing files
* Fix MUSA/HIP builds
* requirements : update transformers/torch for Embedding Gemma
This commit updates the requirements to support converting
Embedding Gemma 300m models.
The motivation for this change is that during development I had a local
copy of the transformers package which is what I used for converting
the models. This was a mistake on my part and I should have also updated
my transformers version to the official release.
I had checked the requirements/requirements-convert_legacy_llama.txt
file and noted that the version was >=4.45.1,<5.0.0 and came to the
conculusion that no updated would be needed, this assumed that
Embedding Gemma would be in a transformers release at the time
Commit fb15d649ed ("llama : add support
for EmbeddingGemma 300m (#15798)) was merged. So anyone wanting to
convert themselves would be able to do so. However, Embedding Gemma is
a preview release and this commit updates the requirements to use this
preview release.
* resolve additional python dependencies
* fix pyright errors in tokenizer test and remove unused import