llama.cpp

Commit Graph

Author	SHA1	Message	Date
Diego Devesa	360d6533db	ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type (#15797 ) * ggml-backend : add GGML_BACKEND_DEVICE_TYPE_IGPU device type ggml-backend : add device id to device props llama : only use iGPU devices if there are no GPU devices llama : do not use multiple devices from different backends with the same device id	2025-09-11 22:47:38 +02:00
Johannes Gäßler	0e6ff0046f	CUDA: larger SRAM reads for tile FA, AMD FP16 dot (#15927 ) * CUDA: larger SRAM reads for tile FA, AMD FP16 dot * fix logic for availability of v_dot2_f32_f16	2025-09-11 21:19:58 +02:00
ddh0	df082f5630	nitpick : correct MB to MiB (#15934 ) MB was incorrectly used for 1024 x 1024 bytes instead of MiB	2025-09-11 19:12:34 +02:00
Daniel Bevenius	24a6734daf	ggml-cpu : add check for ARM MATMUL_INT8/i8mm support (#15922 ) This commit adds a check for GGML_MACHINE_SUPPORTS_i8mm when enabling MATMUL_INT8 features, ensuring that i8mm intrinsics are only used when the target hardware actually supports them. The motivation for this is to fix ggml CI build failures where the feature detection correctly identifies that i8mm is not supported, adding the +noi8mm flag, but MATMUL_INT8 preprocessor definitions are still enabled, causing the compiler to attempt to use vmmlaq_s32 intrinsics without i8mm support. Refs: https://github.com/ggml-org/ggml/actions/runs/17525174120/job/49909199499	2025-09-11 14:39:12 +01:00
Charles Xu	2b3efea9a4	kleidiai: fix GGML_ASSERT(cur_backend_id != -1) failed (#15614 ) kleidiai: fix GGML_ASSERT(cur_backend_id != -1) failed removes the Whisper-specific check for GET_ROWS support	2025-09-11 12:45:40 +02:00
hipudding	c0389dba43	CANN: Disable acl_graph for prefill stage (#15933 ) Since the prefill length is not fixed, graphs constructed for the prefill stage cannot be reused. For this reason, ACL graph execution is disabled by default during prefill.	2025-09-11 15:59:37 +08:00
Oliver Simons	00681dfc16	CUDA: Add `fastdiv` to `k_bin_bcast`, giving 1-3% E2E performance (#15872 ) Add fastdiv and fastmodulo to k_bin_bcast kernel * Address review comments * `prod_` instead of `prod` suffix * Add test case for `k_bin_bcast_unravel` in CUDA backend	2025-09-10 22:04:03 +02:00
Jie Fu (傅杰)	4f658855fa	llama : support T5 models with unequal number of encoder-decoder layers (#15909 ) * Extend the support of T5 models with different encoder-decoder layers Signed-off-by: Jie Fu <jiefu@tencent.com> * Update convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py, src/llama-arch.cpp, src/llama-arch.h, src/llama-hparams.h and src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Rename n_dec_layer --> dec_n_layer Signed-off-by: Jie Fu <jiefu@tencent.com> * Adapt to cases when dec_n_layer > n_layer Signed-off-by: Jie Fu <jiefu@tencent.com> --------- Signed-off-by: Jie Fu <jiefu@tencent.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-10 20:51:51 +02:00
Sigbjørn Skjæret	6ab397e12b	graph : support non-contiguous Q in build_attn_mha (#15908 ) * support non-contiguous Q in build_attn_mha * Update src/llama-graph.cpp ggml-ci Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-10 19:08:59 +02:00
Daniel Bevenius	9de447d94e	ggml-cpu : fix padding in ggml_timestep_embedding (#15917 ) This commit fixes the zero padding for odd dimensions in ggml_compute_forward_timestep_embedding_f32. The motivation for this is that currently if an odd dimension is used, the padding check incorrectly uses the dimension value for indexing. For example, with dim=15: elements 0-6 are set to cosine values; elements 7-13 are set to sine values; element 14 is left uninitialized (contains garbage); element 15 is correctly set to zero. This fix changes embed_data[dim] to embed_data[2 * half] so that element 14 (the first unused element) is properly set to zero as well as the last element. Resolves: https://github.com/ggml-org/ggml/issues/1324	2025-09-10 17:31:40 +02:00
Georgi Gerganov	0f0a3c2851	metal : make the backend async (#15906 ) * metal : make the backend async ggml-ci * cont : add comments, extend op offload, clean up ggml-ci * metal : fix batch size for MUL_MAT_ID * metal : remove deprecated ggml_backend_metal_buffer_from_ptr * metal : create only metal buffers, no wrapping of host memory ggml-ci * metal : restore .alloc_buffer for buffer_from_ptr_type ggml-ci * metal : remove broken implementation of GGML_OP_SET ggml-ci * metal : clean-up loose ends, ready for tests ggml-ci * metal : support both private and shared buffers ggml-ci * metal : enable private buffers + add global device queue * metal : disable host buffer to prevent races ggml-ci * metal : avoid extra copy during set_tensor ggml-ci * metal : use separate buffer types for shared and private Metal buffers ggml-ci * metal : simplify synchronization logic ggml-ci * metal : fix build ggml-ci * metal : do not implement cpy_tensor ggml-ci * metal : separate implementations for shared and private buffers ggml-ci	2025-09-10 17:52:35 +03:00
Daniel Bevenius	33daece86b	ci : add caching for ROCm installation in release workflow (#15924 ) This commit applies the same caching to the release workflow which currently exists for the main CI workflow that was introduced in Commit `ff02caf9ee` ("ci : cache ROCm installation in windows-latest-cmake-hip (#15887)").	2025-09-10 15:39:57 +02:00
Daniel Bevenius	e7b6d83b52	tests : filter out no-ops from coverage report (#15900 ) * tests : filter out no-ops from coverage report This commit is a follow-up commit for #15745 to address the feedback on how no-op operations should be filtered out from the coverage report. Regarding the feedback that the UNARY and GLU sub-operations are not handled, I am not exactly sure what should be done. They are included in the coverage, for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc. are in the list of covered operations: ```console $ ./build/bin/test-backend-ops --show-coverage Operations covered by tests (89): ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT ✓ CLAMP ✓ CONCAT ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY ✓ CROSS_ENTROPY_LOSS ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP ✓ FLASH_ATTN_EXT ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS ✓ GET_ROWS_BACK ✓ GROUP_NORM ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW ✓ OPT_STEP_SGD ✓ OUT_PROD ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK ✓ ROLL ✓ ROPE ✓ ROPE_BACK ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH ✓ TIMESTEP_EMBEDDING ✓ UPSCALE Operations without tests (14): ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK ✗ GET_REL_POS ✗ IM2COL_BACK ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3 ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART Coverage Summary: Total operations: 103 Tested operations: 89 Untested operations: 14 Coverage: 86.4% ``` Refs: https://github.com/ggml-org/llama.cpp/pull/15745 * use of ggml_op enum values instead of strcmp	2025-09-10 14:17:09 +02:00
j-k	2cfef4d117	media : add transparent icon svg and png [no ci] (#15891 )	2025-09-10 14:51:28 +03:00
Jesse	09e72a037c	gitignore : Ignore vim swap files in tests (#15901 )	2025-09-10 14:28:47 +03:00
Chenguang Li	10d8b2b6b0	CANN: Add ROPE sin/cos cache for reuse (#15912 ) * CANN: Add ROPE sin/cos cache for reuse Introduce sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match. - Added sin_cache / cos_cache pointers and position_length tracking - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox) - Accelerates ROPE by eliminating repeated sin/cos generation This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-10 18:42:00 +08:00
Chenguang Li	28b5f190ef	CANN: implement LRU cache for ACL graphs (#15814 ) * CANN: implement LRU cache for ACL graphs in CANN backend - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects. - Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded. - Updated push, move_to_front, and clear methods to manage cached graphs efficiently. - Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend. * fix typo * The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com> * refactor acl graph * refactor && fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-10 15:29:12 +08:00
Daniel Bevenius	86587da03b	llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893 ) This commit adds check for two function pointers returned from ggml_backend_reg_get_proc_address. The motivation for this is that the function pointer could be nullptr if the get proc address function changes in the future. This is also consistent with all the other calls to ggml_backend_reg_get_proc_address in the code base.	2025-09-10 05:33:58 +02:00
Daniel Bevenius	ff02caf9ee	ci : cache ROCm installation in windows-latest-cmake-hip (#15887 ) This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job. The motivation for this is that the installation can sometimes hang and/or not complete properly leaving an invalid installation which later fails the build. By caching the installation hopefully we can keep a good installation available in the cache and avoid the installation step. Refs: https://github.com/ggml-org/llama.cpp/pull/15365	2025-09-10 05:23:19 +02:00
Ruben Ortlam	ae355f6f71	vulkan: throw the oom error instead of no memory type found (#15905 )	2025-09-09 22:26:03 +02:00
Jeff Bolz	4f63cd705c	vulkan: Fix OOB accesses in soft_max_back (#15861 )	2025-09-09 14:41:15 +02:00
Johannes Gäßler	17bc5a815f	HIP: use v_dot2_f32_f16 instruction for FA (#15884 )	2025-09-09 14:04:43 +02:00
lksj92hs	ed54e32558	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886 )	2025-09-09 14:01:15 +02:00
Aman Gupta	a972faebed	CUDA: Add mul_mat_id support for the mmf kernel (#15767 ) * CUDA: Add mul_mat_id support the mmf Add support for mul_mat_id for bs < 16 * Review: use warp_size, fix should_use_mmf condition * Launch one block per expert, stride along n_expert_used * templatize mul_mat_id * Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids * Reduce compile times by dividing mmf into f16, bf16 and f32 variants * Divide mmf by ncols_dst * Add missing files * Fix MUSA/HIP builds	2025-09-09 14:38:02 +08:00
Johannes Gäßler	550cf726e1	CUDA: fix GET_ROWS for large tensors (#15882 )	2025-09-09 08:11:01 +02:00
Georgi Gerganov	c252ce67c4	contrib : add notes about merging PRs (#15881 ) * contrib : add notes about merging PRs * Update CONTRIBUTING.md Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-09 08:42:10 +03:00
Daniel Bevenius	70cd37dbbe	requirements : update transformers/torch for Embedding Gemma (#15828 ) * requirements : update transformers/torch for Embedding Gemma This commit updates the requirements to support converting Embedding Gemma 300m models. The motivation for this change is that during development I had a local copy of the transformers package which is what I used for converting the models. This was a mistake on my part and I should have also updated my transformers version to the official release. I had checked the requirements/requirements-convert_legacy_llama.txt file, noted that the version was >=4.45.1,<5.0.0, and came to the conclusion that no update would be needed. This assumed that Embedding Gemma would be in a transformers release at the time Commit `fb15d649ed` ("llama : add support for EmbeddingGemma 300m (#15798)") was merged, so anyone wanting to convert themselves would be able to do so. However, Embedding Gemma is a preview release and this commit updates the requirements to use this preview release. * resolve additional python dependencies * fix pyright errors in tokenizer test and remove unused import	2025-09-09 06:06:52 +02:00
Piotr Wilkin (ilintar)	acc1b008cf	model-conversion : add extra debugging support for model conversion (#15877 ) * feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py	2025-09-09 06:05:55 +02:00
Aldehir Rojas	7057faf64b	json : support `enum` values within `allOf` (#15830 )	2025-09-08 16:14:32 -05:00
j-k	fe1c92cd7b	media : add llama1 icon (#15878 ) Add svg and png based off llama1-icon.svg	2025-09-08 21:57:01 +03:00
Jeff Bolz	e68aa10d8f	vulkan: sort graph to allow more parallel execution (#15850 ) * vulkan: sort graph to allow more parallel execution Add a backend proc to allow the backend to modify the graph. The vulkan implementation looks at which nodes depend on each other and greedily reorders them to group together nodes that don't depend on each other. It only reorders the nodes, doesn't change the contents of any of them. With #15489, this reduces the number of synchronizations needed. * call optimize_graph per-split	2025-09-09 02:10:07 +08:00
Aman Gupta	0a16bf52e6	CUDA: generate_cu_files.py - add missing mxfp4 (#15880 )	2025-09-09 01:23:46 +08:00
Jesse	88021565f0	chat : Deepseek V3.1 reasoning and tool calling support (OpenAI Style) (#15533 ) * Add DeepSeek V3.1 thinking mode support - Added COMMON_CHAT_FORMAT_DEEPSEEK_V3_1 enum value - Created common_chat_params_init_deepseek_v3_1() function (currently uses R1 implementation) - Created common_chat_parse_deepseek_v3_1() function that handles V3.1 thinking format: - Extracts reasoning content before '</think>' tag into reasoning_content - Extracts regular content after '</think>' tag into content - No opening '<think>' tag in V3.1 format - Added detection logic for V3.1 templates based on pattern: 'message['prefix'] is defined and message['prefix'] and thinking' - Added V3.1 case to parsing switch statement This addresses the issue where V3.1 outputs reasoning content followed by '</think>' and then regular content without the opening '<think>' tag. * Another attempt by V3.1 non-thinking * Fix test, but it's not asserting anything. * Ignore vim swap files in tests dir * Update the test * Try using try_find_literal instead of regex * passing test * Revert "Try using try_find_literal instead of regex" This reverts commit `c50d887ec2`. * Remove unnecessary change * Remove comment * Add code to handle non-thinking mode. * Try to set message['prefix'] when thinking is enabled. * This fixes reasoning, but breaks normal content. We need state in the chat parser. * DeepSeek V3.1 thinking is now the default. Disable with `--reasoning-budget 0`. * Simplify (DeepSeek V3.1 reasoning) * Fix sign inversion bug * Add some tool calling code (not working). * Tool calls working in non-reasoning mode. * Attempt a unit test for tool call parsing. * Passing test * Add tests for both happy path and broken fenced DeepSeek V3.1 tool call variants. * Passing DeepSeek V3.1 tool call tests, but model is not working. * Revert assistance response prefill change. Not my monkeys. * Add fenced_thinking unit test variant. Passes, but thinking tool calling still isn't working for some reason. * Tests pass in reasoning mode. Also e2e tool test passes. * Make a copy of the parse_json_tool_calls function for deepseek-v3.1 so as to not accidentally introduce regressions. * Fix thinking_forced_open logic. tool calling broken. Need to add another test case. * That's what I get for cargo culting a newline. * Add multi tool call test for deepseek v3.1 non-reasoning * Move test, remove .gitignore change * Place deepseek-v3.1 reasoning test directly into existing reasoning function per CISC's request. * Address whitespace CI failure. * Merge two assert_equals per CISC's request. * Add DeepSeek-V3.1 tests to tests/test-chat.cpp per CISC's request. * Merge deepseek V3.1 and regular parse_json_tool_calls() function behaviors by adding optional update_cursor argument. * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 fix reasoning_format none * Strip grammar down to strictly what we expect based on model card. Throw out parts we cargo culted from R1 that don't make sense. * Update tests/test-chat-parser.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * DeepSeek V3.1 - Add edge case where thinking is forced open, there is tool calling in the reasoning content, but then the model just stops the output without closing the </think> tag, so it's not a partial. In this case, use the tool call in the reasoning content. * DeepSeek V3.1 - simplify update_cursor * Update common/chat.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Fix indent --------- Co-authored-by: openhands <openhands@all-hands.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-08 16:59:48 +02:00
Xuan-Son Nguyen	56920f5665	server : bring back timings_per_token (#15879 )	2025-09-08 16:50:05 +02:00
Georgi Gerganov	b0d52998b9	cuda : fix supports_op condition for get_rows when number of blocks is too large (#15868 ) * cuda : fix supports_op condition for get_rows when src1->ne2 > 1 ggml-ci * ggml : add comment about ggml_get_rows ggml-ci * cuda : add FIXME [no ci] * cuda : update support condition ggml-ci	2025-09-08 13:56:51 +03:00
Georgi Gerganov	f28d4f4ac9	metal : refactor + optimize (#15857 ) * metal : refactor ggml-ci * cont : refactor FA-vec kernel * cont : print metal library load time * minor : warn to debug + better kernel names ggml-ci * metal : optimize mul_mv q8_0 ggml-ci * metal : simplify FA pipeline creation functions ggml-ci * metal : improve naming consistency * metal : safer function constants offsets ggml-ci * metal : comments ggml-ci	2025-09-08 13:34:56 +03:00
Xuan-Son Nguyen	9fcb29f22f	ggml: allow casting between f32 and i32 (#15783 ) * ggml: allow casting between f32 and i32 * fix cuda * add vulkan * fix CPU non-cont * add non-cont test case * add note * extend test number range * correct note * add cont version for vulkan	2025-09-08 12:33:01 +02:00
Sigbjørn Skjæret	5ef22d281d	CUDA: non-contiguous src0 not supported for PAD (#15869 )	2025-09-08 12:55:44 +03:00
Daniel Bevenius	233d773d02	convert : force setting sliding_window from original config (#15867 ) * convert : force setting sliding_window from original config This commit modifies the set_gguf_parameters method for EmbeddingGemma so that it reads the sliding_window parameter from the original model config.json and uses that value. The motivation for this change is that the Gemma3TextConfig constructor adjusts the sliding_window value, which can lead to inconsistencies when converting models as we expect this value to match the original model's configuration. Refs: `bb45d3631e/src/transformers/models/gemma3/configuration_gemma3.py (L230)` * fix flake8 error * add link to huggingface PR	2025-09-08 09:44:34 +02:00
Georgi Gerganov	a885dcff11	batched-bench : fix llama_synchronize usage during prompt processing (#15835 ) ggml-ci	2025-09-08 10:27:07 +03:00
Georgi Gerganov	663027fd54	context : fix n_outputs during reserve (#15858 ) ggml-ci	2025-09-08 10:26:36 +03:00
Georgi Gerganov	cf0e3ba150	model : avoid ggml_cont_3d for fused QKV weights (#15662 ) * model : avoid ggml_cont_3d for fused QKV weights ggml-ci * kv-cache : make cpy_k and cpy_v implementation more readable ggml-ci * cont : add comments ggml-ci * cont : minor fix [no ci] * cont : one more fix * cont : clarity ggml-ci * kv-cache : require contiguous heads of k_cur and v_cur ggml-ci	2025-09-08 10:25:33 +03:00
Jeff Bolz	d413dca003	tests: large sizes for get_rows (#15687 )	2025-09-07 23:23:41 -05:00
Chenguang Li	85ca66a746	CANN: Stream sync between devices for acl_graph (#15809 ) * CANN: Switch to stream synchronization Switch to stream synchronization because events are not effective. Co-authored-by: hipudding <huafengchun@gmail.com> * CANN: add Comments --------- Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-08 10:03:29 +08:00
Jeff Bolz	3976dfbe00	vulkan: support im2col_3d (#15795 )	2025-09-07 13:50:26 -05:00
Aaron Teo	d36e61c580	ggml-cpu: clean up s390x SIMD (#15855 ) * ggml-cpu: clean up s390x simd Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> (cherry picked from commit `0da4b6aa07`) Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * ggml-cpu: fix hsum data types Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>	2025-09-08 02:18:28 +08:00
Jeff Bolz	c97b5e5854	vulkan: Support pad_ext (#15794 )	2025-09-07 19:00:49 +02:00
Jeff Bolz	267e99867f	vulkan: Use larger loads in scalar/coopmat1 matmul (#15729 ) I think glslang will translate an access like x[i][1].z to OpAccessChain ... x, i, 1, 2 OpLoad float16_t ... rather than loading all of x[i] in a single OpLoad. Change the code to explicitly load the vector/matrix.	2025-09-07 18:53:07 +02:00
Daniel Bevenius	3b15924d71	ggml WebGPU: remove userdata from request adapter callback (#15527 ) * ggml WebGPU: remove userdata from request adapter callback This commit removes the `userdata` parameter from the WebGPU request adapter callback in `ggml-webgpu.cpp`. Instead, the lambda function captures the `webgpu_context` directly. The motivation for this change is to simplify the code and improve readability. * inline the callback lambda into the RequestAdapter call This commit removes the callback lambda variable and inlines it directly into the RequestAdapter call.	2025-09-07 11:19:45 +03:00
Johannes Gäßler	79bc429262	CUDA: faster tile FA (Pascal/AMD), headsize 256 (#15769 )	2025-09-07 00:26:28 +02:00
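
The padding fix described in commit 9de447d94e can be illustrated with a small standalone sketch. This is not the ggml code: the function name `timestep_embedding_row` is hypothetical, the sinusoidal frequency schedule is the commonly used one and is assumed here, and the row length of dim + 1 for odd `dim` is inferred from the commit message's mention of "element 15" for dim=15.

```cpp
#include <cassert>
#include <cmath>
#include <limits>
#include <vector>

// Hypothetical standalone sketch, not the ggml implementation.
// Assumption: an odd `dim` is stored in a row of dim + 1 floats, which is what
// the commit message's "element 15" for dim=15 implies.
std::vector<float> timestep_embedding_row(float timestep, int dim, int max_period) {
    assert(dim > 0 && max_period > 0);
    const int half    = dim / 2;
    const int row_len = dim + (dim % 2);

    // NaN stands in for uninitialized memory so a missed slot is easy to spot.
    std::vector<float> row(row_len, std::numeric_limits<float>::quiet_NaN());

    for (int j = 0; j < half; ++j) {
        // Usual sinusoidal frequency schedule (assumed, see lead-in).
        const float freq = std::exp(-std::log((float) max_period) * j / half);
        const float arg  = timestep * freq;
        row[j]        = std::cos(arg);  // elements [0, half)
        row[j + half] = std::sin(arg);  // elements [half, 2 * half)
    }

    // Zero everything from the first unused slot (2 * half) onward.
    // Indexing with `dim` alone would skip slot 2 * half when dim is odd.
    for (int i = 2 * half; i < row_len; ++i) {
        row[i] = 0.0f;
    }
    return row;
}
```

With dim=15 this gives half=7, so slots 0-6 hold cosines, 7-13 hold sines, and slots 14 and 15 are both zeroed, which matches the behaviour the commit message describes after the fix.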

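The DeepSeek V3.1 thinking format described in commit 88021565f0 (reasoning text, then a closing '</think>' tag with no opening '<think>' tag, then regular content) can be sketched roughly as follows. The struct `parsed_message` and the function `split_v3_1_output` are hypothetical names for illustration and are not the common/chat.cpp API. Treating output without a closing tag as plain content is also an assumption; the commit's forced-open and tool-call edge cases are not modelled.

```cpp
#include <string>

// Hypothetical types and names for illustration; not the common/chat.cpp API.
struct parsed_message {
    std::string reasoning_content;  // text before the closing tag
    std::string content;            // text after the closing tag
};

parsed_message split_v3_1_output(const std::string & text) {
    static const std::string end_tag = "</think>";
    parsed_message msg;

    const std::string::size_type pos = text.find(end_tag);
    if (pos == std::string::npos) {
        // Assumption: with no closing tag, treat everything as regular content.
        msg.content = text;
        return msg;
    }

    // No opening tag is expected, so reasoning starts at the beginning.
    msg.reasoning_content = text.substr(0, pos);
    msg.content           = text.substr(pos + end_tag.size());
    return msg;
}
```

Splitting on the first occurrence of the tag keeps the sketch stateless; the commit history notes that the real parser needed more state than this to handle reasoning and tool calls together.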
6551 Commits