llama.cpp

Commit Graph

Author	SHA1	Message	Date
Jeff Bolz	10dc500bdb	vulkan: handle rope with large number of rows (#18306)	2025-12-26 16:53:46 +01:00
o7si	4893cc07bb	server : fix crash when seq_rm fails for hybrid/recurrent models (#18391) * server : fix crash when seq_rm fails for hybrid/recurrent models * server : add allow_processing param to clear_slot	2025-12-26 16:35:29 +01:00
Francisco Herrera	af3be131c0	docs: added note for pre SYCL Intel hardware (#18016) Specify that it's for pre sycl hardware	2025-12-26 10:34:30 +08:00
0Marble	b07cda687c	CANN: implement the SSM_CONV operator (#17737) * CANN: implement SSM_CONV operator Co-authored-by: Aleksei Lobanov, <zeromarblectm@gmail.com> Co-authored-by: Sujin Kang, <waterjin326@gmail.com> * CANN: remove custom error limit for SSM_CONV * CANN: merge SSM_CONV tensor shape/strides into one line --------- Co-authored-by: Sujin Kang, <waterjin326@gmail.com>	2025-12-26 09:12:04 +08:00
Aman Gupta	85c40c9b02	ggml-cuda: fix regex for arch list (#18371) * ggml-cuda: fix regex for arch list * make regex exact	2025-12-26 01:35:14 +08:00
Aman Gupta	83b3b1c271	cuda: optimize cumsum cub path (#18362) * cuda: optimize cumsum cub path * remove heavy perf test	2025-12-25 23:55:38 +08:00
Aman Gupta	b0fb0f0aee	ggml-cuda: fix blackwell native builds (#18361) * ggml-cuda: fix blackwell native builds Replace 12x in native architectures by 12xa * replace for GGML_NATIVE=OFF too * only replace for native * remove 120f-virtual for default compilation --------- Co-authored-by: Aman Gupta <aman>	2025-12-25 22:12:11 +08:00
Penglin Cai	e68c19b0fd	CANN: Add support for CONV_TRANSPOSE_1D when kernel size > 255 (#17934) * CONV_TRANSPOSE_1D kernel_size>255 * remove condition check * fix the bug of type conversion * removing trailing whitespaces * fix: return true in the switch case	2025-12-25 16:46:09 +08:00
Aadeshveer Singh	c54bba869d	ggml : optimize cuda cumsum fallback kernel (#18343)	2025-12-25 12:11:13 +08:00
Xuan-Son Nguyen	f5acfb2ffa	server: (router) add stop-timeout option (#18350) * server: (router) add stop-timeout option * also allow stop while loading * add docs * unload_lru: also wait for unload to complete	2025-12-24 23:47:49 +01:00
Xuan-Son Nguyen	4cbafad4f0	model: support MiMo-V2-Flash (#18328) * mimov2: convert ok * rename mimov2 --> mimo2 * fix conversion * runnable not incorrect * use sink * add_sliding_window_pattern * add swa and per-layer n_head_kv * correct params * somewhat working * correct gating func * nits * mimo2: wire RMS eps + MoE bias + converter guards * add co-author Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com> * use add_rope_freq_base_swa --------- Co-authored-by: Aaryan Kapoor <aaryankapoor2006@gmail.com> Co-authored-by: Aaryan-Kapoor <Aaryan-Kapoor@users.noreply.github.com>	2025-12-24 23:07:08 +01:00
Aadeshveer Singh	c184284230	fit-params : fix race condition in fit-params output (#18276)	2025-12-24 15:57:38 +01:00
Aman Gupta	c8a2417d7b	CUDA: experimental native mxfp4 support for blackwell (#17906) * CUDA: experimental native mxfp4 support for blackwell * optimize load_tiles * optimize quantize_mxfp4 * cleanup * first pass review: formatting * use interleaved layout for mma * mmq: add assert for size * use __nv_fp4x4_e2m1 * use iter_k as 512, cleanup * Use 1200 as blackwell instead of 1000 * address review comments * mmq: fix stride * quantize.cu: use reference impl of e8m0 scale * address review comments * add 120f-virtual + minor fixes --------- Co-authored-by: Aman Gupta <aman>	2025-12-24 22:28:26 +08:00
Saba Fallah	54132f1b1f	model : support for LlamaBidirectionalModel architecture (#18220) * model: llama-embed-nemotron * minor: python lint * changed arch-name * templated llm_build_llama to be used for both llama and llama-embed arch	2025-12-24 14:02:36 +01:00
Jeff Bolz	2a9ea2020c	vulkan: fix command buffer corruption in ggml_backend_vk_event_wait (#18302)	2025-12-24 12:36:34 +01:00
Wang Weixuan	ce7a6dc0fc	CANN : refactor ACL graph cache (#17752) Move the graph property checking code into methods of LRU cache. Signed-off-by: Wang Weixuan <wangweixvan@gmail.com>	2025-12-24 17:50:24 +08:00
Jesse Ikonen	1ce0126b18	docs: Fix typos in SYCL documentation (#18269)	2025-12-24 17:19:47 +08:00
Ruben Ortlam	7f459c98e7	vulkan: use fewer FA rows for small cache runs (#18280)	2025-12-24 08:59:14 +01:00
TianHao324	cf2ffc02bc	CANN: Uses yarn_ramp cache in ROPE (#17725)	2025-12-24 14:55:33 +08:00
ddh0	10355dc7d0	common: add `LLAMA_ARG_OVERRIDE_TENSOR` env var for `-ot` arg (#18267)	2025-12-24 14:19:12 +08:00
Xuan-Son Nguyen	5ee4e43f26	server: return_progress to also report 0% processing state (#18305)	2025-12-23 21:49:05 +01:00
Pascal	5b6c9bc0f3	webui: apply webui_settings on first load (#18223) * webui: apply webui_settings on first load The webui_settings from /props were not applied on initial load when default_generation_settings.params was null Now syncs whenever serverProps is available, regardless of params, works for both single-model and router modes * chore: update webui build output	2025-12-23 15:48:03 +01:00
Xuan-Son Nguyen	849d021104	server: fix crash with model not having BOS/EOS (#18321)	2025-12-23 14:39:36 +01:00
Daniel Bevenius	8e3ead6e4d	model-conversion : add device option to run-org-model.py (#18318) * model-conversion : add device option to run-org-model.py This commit refactors the `run-org-model.py` script to include a `--device` argument, to allow users to specify the device on which to run the model (e.g., cpu, cuda, mps, auto). It also extracts a few common functions to prepare for future changes where some code duplication will be removed which there currently exists in embedding scripts. The Makefile is also been updated to pass the device argument, for example: ```console (venv) $ make causal-verify-logits DEVICE=cpu ``` * fix error handling and remove parser reference This commit fixes the error handling which previously referenced an undefined 'parser' variable.	2025-12-23 14:07:25 +01:00
Chris Rohlf	12ee1763a6	rpc : add check for rpc buffer type (#18242)	2025-12-23 11:56:49 +02:00
nullname	ed75977717	ggml-hexagon: create generalized functions for cpu side op (#17500) * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility * refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility * refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity * add comment * refactor: remove redundant buffer checks in hexagon supported operations * wip * add missing include to fix weak symbol warning * add ggml_hexagon_op_generic * refactor: simplify tensor operation initialization and buffer management in hexagon implementation * refactor: streamline hexagon operation initialization and buffer management * refactor: update function signatures and streamline request handling in hexagon operations * wip * ggml-hexagon: clean up code formatting and improve unary operation handling * wip * rename * fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations * refactor: replace ggml_hexagon_mul_mat with template-based binary operation for improved flexibility refactor: replace ggml_hexagon_mul_mat_id with template-based binary operation for improved flexibility refactor: initialize buffer types and streamline dspqueue_buffers_init calls for clarity refactor: remove redundant buffer checks in hexagon supported operations add missing include to fix weak symbol warning add ggml_hexagon_op_generic refactor: simplify tensor operation initialization and buffer management in hexagon implementation refactor: streamline hexagon operation initialization and buffer management refactor: update function signatures and streamline request handling in hexagon operations ggml-hexagon: clean up code formatting and improve unary operation handling fix: add support for permuted F16 tensors and enhance quantization checks in matrix operations # Conflicts: # ggml/src/ggml-hexagon/ggml-hexagon.cpp * hexagon: fix merge conflicts * 
hexagon: minor cleanup for buffer support checks * hexagon: factor out op_desc and the overal op logging * hexagon: further simplify and cleanup op dispatch logic * snapdragon: update adb scripts to use llama-cli and llama-completion * fix pipeline failure --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-22 23:13:24 -08:00
Daniel Bevenius	847c35f7d5	model-conversion : add trust_remote_code for embedding scripts (#18288) This commit adds the trust_remote_code=True parameter when loading models and configurations in the embedding model conversion scripts. It also adds a cast to float for models that might use a data type that is not supported by python, for example bfloat16. The motivation for this is that some models may require custom code to be executed during loading, and setting trust_remote_code to True avoids getting prompted for confirmation. Future work will consolidate the embedding conversion scripts with the causal conversion scripts to avoid code duplication. But in the mean time it would be nice to have this fix in place.	2025-12-23 07:27:37 +01:00
Neo Zhang	a6a552e4ec	[SYCL] replace llama-cli by llama-completion to rm the impact to test script (#18290) * replace llama-cli by llama-completion to rm the impact to test script * Update examples/sycl/run-llama2.sh Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/sycl/run-llama2.sh Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/sycl/run-llama3.sh Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/sycl/run-llama3.sh Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/sycl/win-run-llama2.bat Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update examples/sycl/win-run-llama3.bat Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-23 12:59:12 +08:00
Alessandro98-git	96e33a814e	model : fix div-by-zero for Nemotron V2 (#18309) * llama-model : fix Nemotron V2 crash by moving MoE parameters calculation * remove whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-23 03:04:57 +01:00
Ryan Mangeno	dfc959b886	model : Granite Embedding support (#15641) ModernBERT but without `head.norm` so will currently fail to convert and run any other ModernBERT models, PRs with `head.norm` support welcome! * constants and tensor mappings for modern bert support, model not supported yet but working on getting conversion to work for encoder only * conversion now working, hf -> gguf * working on support, now working on building graph * some cleanup * cleanup * continuing * correct tensor shape for qkv * fixed tensor mappings and working on buildin graph * tensor debugging now works -> (llama-eval-callback), instead of simulated gate split with views, GEGLU is now used which does exactly this * cleanup * cleanup * cleanup * more cleanup * ubatch issues, the assert for checking equal seqs in llama-graph.cpp when building attention keeps failing, setting ubatch size to 1 when running llama-embedding with --ubatch-size 1 makes it work, but needs to be looked into more * added cls token per previous modern bert attempt, still working on checking out the rest * fixed pre tokenizer and still working through previous pr * working through previous attemp, implimented more accurate conversion per previous attempt, added local sliding window attention that alternates every third layer * fixed pre tokenizer * working on swa with local and global alternating attention * some cleanup and now fails on build attn * starting to work, and some cleanup, currently failing on last layer construction in graph build * alternating rope implemented and modern bert graph build succeeds * fixed asser for equal ubatch seq * cleanup * added mask check in vocab * fixed alternating rope, the hparams.rope_freq_base_train and hparams.rope_freq_base_train_swa were the same and i set them to correct values * reuse variable * removed repeat * standard swa method can be used instead of a new enum being LLAMA_SWA_TYPE_LOCAL * correct swa layer indexing, is supposed to be 0, 3, 6 ... 
instead of 1, 4, 7 ... * more modular hparam setting * replaced attn out norm with ffn_norm and cosine similarity between hf embds and llama.cpp embds went way up, from 0.05 to 0.24, replaced the cacheless kv with swa todo per the previous conversion * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf_update.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-vocab.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/tensor_mapping.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-graph.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-arch.cpp Co-authored-by: Sigbjørn Skjæret 
<sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * removed redundant hparam set * enums for model sizes * conversion for modern-bert model supported rather than just granite-small * Update src/llama-model.cpp Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> * Update src/llama-model.cpp Co-authored-by: Gabe Goodhart <ghart@us.ibm.com> * fixed ordering of enum for freq_base_swa * fixed where I added residual, now gives much much better embeddings~ * readded cacheless logic * removing whitespace * conversion now working for swa pattern - dense every n layers * modern bert put into seperate src file * removing whitespace * fixed whitespace and newline errors in editorconfig job * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * better naming convention, n_swa_pattern -> swa_period * reusing sliding_window_pattern key rather than making new dense_every_n_layers key, and adding writing and reading support * fixing pyright type-check fail * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-saver.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret 
<sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/models/modern-bert.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model-loader.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * added descriptions in llama-model * fixed tensor mappings for conversion * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * mapping name for size * nits * unused --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-23 00:28:19 +01:00
compilade	8f48807380	gguf-py : do not align the data start offset (#18291) The safetensors format doesn't require alignment.	2025-12-22 20:25:16 +01:00
Shouyu	bf6bc3c155	ggml-hexagon: gelu optimization (#18151) * feat: working gelu with src0 put on vtcm * feat: gelu ping-pong for both in and out * fix: fixu compile error * break: distinguish dma ddr->vtcm and vtcm->ddr operation * fix: fix dma queue size * break: update dma api to either pop src or dst ptr * fix: fix activation vtcm allocation issue for src1 when swapperd * refactor: ping-pong gelu logic to avoid unnecessary if else * dma: improved queue interface and prefetch handling * gelu: fix N+2 block prefetch --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2025-12-22 10:56:52 -08:00
Xuan-Son Nguyen	179fd82a72	gen-docs: automatically update markdown file (#18294) * gen-docs: automatically update markdown file * also strip whitespace * do not add extra newline * update TOC	2025-12-22 19:30:19 +01:00
Taimur Ahmad	d34d5ca1e9	llamafile: add rvv support for sgemm kernels (#18199) Co-authored-by: Rehan Qasim <rehan.qasim@10xengineers.ai>	2025-12-22 20:20:23 +02:00
lhez	eb492bf43f	opencl: unpack q4_0 for adreno in get_tensor (#18278)	2025-12-22 10:19:01 -08:00
Jeff Bolz	e3b35ddf1c	vulkan: Extend rope fusions to allow mrope (#18264) Extend the test-backend-ops tests as well.	2025-12-22 11:03:13 -06:00
Xuan-Son Nguyen	6ce863c803	server: prevent data race from HTTP threads (#18263) * server: prevent data race from HTTP threads * fix params * fix default_generation_settings * nits: make handle_completions_impl looks less strange * stricter const * fix GGML_ASSERT(idx < states.size()) * move index to be managed by server_response_reader * http: make sure req & res lifecycle are tied together * fix compile * fix index handling buggy * fix data race for lora endpoint * nits: fix shadow variable * nits: revert redundant changes * nits: correct naming for json_webui_settings	2025-12-22 14:23:34 +01:00
Xuan-Son Nguyen	3997c78e33	server: fix data race in to_json_anthropic (#18283)	2025-12-22 13:21:43 +01:00
Mattt	ee74642982	release: update release workflow to store XCFramework as Zip file (#18284) * Update release workflow to store XCFramework as Zip file * Add comments to document Zip file requirement for XCFramework * Apply suggestions from code review Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-22 20:11:46 +08:00
Aaron Teo	a28310488c	convert: rework ftype heuristics (#18214) * convert: rework ftype heuristics Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> convert: fix type-check Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> convert: bring back heuristics comment Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * convert: revert to using first tensor Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * convert: rework heuristics logic Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * convert: rm redundant float32 check Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-22 20:03:49 +08:00
Xuan-Son Nguyen	86af848153	server: (docs) remove mention about extra_args (#18262)	2025-12-22 12:22:01 +01:00
Johannes Gäßler	147a521636	tool/ex/tests: consistently free ctx, then model (#18168)	2025-12-22 11:00:37 +01:00
Jeff Bolz	e1f15b454f	vulkan: Implement set_tensor_async and the event interfaces (#18047) The goal is to enable the async loading code paths in llama_model_loader::load_all_data, originally from #7896. This works and the loads themselves are faster, but with host visible vidmem I think the cost of allocating/mapping vidmem moves and becomes more expensive, and I don't see a benefit by default. But with GGML_VK_DISABLE_HOST_VISIBLE_VIDMEM=1 I do see a significant improvement in model loading time.	2025-12-21 21:52:09 +01:00
Johannes Gäßler	0e1ccf15c7	llama: fix RPC for -fit on (#18233)	2025-12-21 19:33:08 +01:00
Xuan-Son Nguyen	5e25ddebff	move copilot instructions to AGENTS.md (#18259) * move copilot --> agents.md * agents: add disclose AI usage * refine	2025-12-21 19:09:21 +01:00
Jeff Bolz	fd05c51cec	vulkan: fix im2col overflowing maxworkgroupcount (#18180)	2025-12-21 10:32:58 +01:00
Jeff Bolz	b365c3ff01	vulkan/cuda: fix topk_moe with exp_probs_b (#18071) I updated test_topk_moe to more closely match llm_graph_context::build_moe_ffn and added coverage for exp_probs_b and some other missing combinations. This exposed a bug in both CUDA and Vulkan backends where they were assuming the input to argsort and the input to get_rows are the same. I'd like to optimize this graph in another change, but for now just get it functional. CUDA also had a bug where it got n_experts from the wrong place, leading to GGML_ASSERT failures in some of the new tests.	2025-12-21 10:27:34 +01:00
Jeff Bolz	cb64222b0c	vulkan: support GGML_UNARY_OP_XIELU (#18062)	2025-12-21 10:17:58 +01:00
Jeff Bolz	6eb7081860	vulkan: in graph_optimize, try to group ADD operations (#18060) I saw the adds not staying together in the new nemotron 3 nano model.	2025-12-21 10:05:08 +01:00
lovedheart	4117ae5557	Vulkan: some improvement on mul_mat_iq2_xs (#18031) * Some improvement on mul_mat_iq2_xs Refactor calculations for db values and grid data to optimize performance and reduce redundancy. * Fix trailing whitespace	2025-12-21 09:59:52 +01:00

7544 Commits