llama.cpp

Commit Graph

Author	SHA1	Message	Date
yulo	c33a58bced	HIP: enable mmf for RDNA3 (#17879 ) * enable mmf for RDNA3 * disable mmf for some shape * move some mmvf to mmf * more mmfv to mmf * 3 is good in mmvf --------- Co-authored-by: zhang hui <you@example.com>	2025-12-12 11:34:33 +01:00
Pascal	a81a569577	Add a search field on model selector / improve mobile display (#17765 ) * webui: add search field to model selector and fixes mobile viewport overflow * webui: simplify model search style and code * refacor: Search Input component & consistent UI for Models Selector search * feat: Use Popover component + improve interactions * fix: Fetching props for only loaded models in ROUTER mode * webui: prevent models selector popover from overflowing viewport Use Floating UI's auto-positioning with 50dvh height limit and proper collision detection instead of forcing top positioning. Fixes overflow on desktop and mobile keyboard issues * webui: keep search field near trigger in models selector Place search at the 'near end' (closest to trigger) by swapping layout with CSS flexbox order based on popover direction. Prevents input from moving during typing as list shrinks * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2025-12-11 18:21:21 +01:00
Piotr Wilkin (ilintar)	53ecd4fdb9	SOLVE_TRI extension to more dimensions (#17793 ) * Extended TRI * Fix whitespace * chore: update webui build output * Just use cuBLAS for everything... * Merge both versions * Remove incorrect imports causing failures for CI * Still failing... remove all direct cublas imports and rely on common imports from "common.cuh" * Defines for hipBlas * Aaaand MUSA defines... * I hate this job... * Stupid typo... * Update ggml/src/ggml-cuda/solve_tri.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-12-11 17:20:43 +01:00
Georgi Gerganov	4d10b78e23	Merge branch 'master' into HEAD	2025-12-11 14:42:56 +02:00
Georgi Gerganov	c6f6e4f96a	ggml-alloc : fix reuse-parent logic for misaligned sizes (#17884 )	2025-12-11 14:30:10 +02:00
Georgi Gerganov	d9f8f60618	batch : fix sequence id ownership (#17915 ) * batch : fix sequence id ownage * cont : reduce allocations	2025-12-11 14:29:47 +02:00
Georgi Gerganov	ab65b47a52	tests : run backend sampler tests always on the CPU	2025-12-11 14:14:47 +02:00
Georgi Gerganov	74b112e3e7	sampling : fix greedy	2025-12-11 13:37:02 +02:00
Georgi Gerganov	8544aba37f	sampling : generic ggml op support detection	2025-12-11 13:19:43 +02:00
Georgi Gerganov	d5d16651a8	cont : fix build	2025-12-11 11:27:47 +02:00
Georgi Gerganov	54e9054017	sampling : optimize logit_bias sampler	2025-12-11 11:14:39 +02:00
Yuichiro Utsumi	e4ae383317	docs: use port 8080 in Docker examples (#17903 )	2025-12-11 17:12:07 +08:00
Daniel Bevenius	56720f8f01	Merge pull request #1 from JohannesGaessler/gpu-sampling-hip HIP/MUSA: fix build for backend sampling	2025-12-11 09:20:55 +01:00
nullname	34ce48d97a	ggml-hexagon: fix `rope` failure at `test-backend-ops` (#17565 ) * fix test failure * fix: correct scaling calculations in rope_cache_init * fix: optimize element copying in rope_hex_f32 using memcpy * fix: optimize loop boundaries in rope_hex_f32 for better performance * feat: add profiling macros for performance measurement in operations	2025-12-10 14:45:43 -08:00
Sigbjørn Skjæret	45e350e3d3	ci: fix riscv64-native build (#17916 )	2025-12-10 23:24:31 +01:00
Xuan-Son Nguyen	c6b2c9310c	mtmd: some small clean up (#17909 ) * clip: add support for fused qkv in build_vit * use bulid_ffn whenever possible * fix internvl * mtmd-cli: move image to beginning * test script: support custom args	2025-12-10 22:20:06 +01:00
Xuan-Son Nguyen	34a6d86982	cli: enable jinja by default (#17911 ) * cli: enable jinja by default * Update common/arg.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-12-10 22:19:42 +01:00
Johannes Gäßler	42cf5c01e5	HIP/MUSA: fix build for backend sampling	2025-12-10 22:19:28 +01:00
Pascal	f32ca51bfe	server: add presets (config) when using multiple models (#17859 ) * llama-server: recursive GGUF loading Replace flat directory scan with recursive traversal using std::filesystem::recursive_directory_iterator. Support for nested vendor/model layouts (e.g. vendor/model/.gguf). Model name now reflects the relative path within --models-dir instead of just the filename. Aggregate files by parent directory via std::map before constructing local_model server : router config POC (INI-based per-model settings) * server: address review feedback from @aldehir and @ngxson PEG parser usage improvements: - Simplify parser instantiation (remove arena indirection) - Optimize grammar usage (ws instead of zero_or_more, remove optional wrapping) - Fix last line without newline bug (+ operator instead of <<) - Remove redundant end position check Feature scope: - Remove auto-reload feature (will be separate PR per @ngxson) - Keep config.ini auto-creation and template generation - Preserve per-model customization logic Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> * server: adopt aldehir's line-oriented PEG parser Complete rewrite of INI parser grammar and visitor: - Use p.chars(), p.negate(), p.any() instead of p.until() - Support end-of-line comments (key=value # comment) - Handle EOF without trailing newline correctly - Strict identifier validation ([a-zA-Z_][a-zA-Z0-9_.-]) - Simplified visitor (no pending state, no trim needed) - Grammar handles whitespace natively via eol rule Business validation preserved: - Reject section names starting with LLAMA_ARG_ - Accept only keys starting with LLAMA_ARG_* - Require explicit section before key-value pairs Co-authored-by: aldehir <aldehir@users.noreply.github.com> * server: fix CLI/env duplication in child processes Children now receive minimal CLI args (executable, model, port, alias) instead of inheriting all router args. Global settings pass through LLAMA_ARG_* environment variables only, eliminating duplicate config warnings. Fixes: Router args like -ngl, -fa were passed both via CLI and env, causing 'will be overwritten' warnings on every child spawn * add common/preset.cpp * fix compile * cont * allow custom-path models * add falsey check * server: fix router model discovery and child process spawning - Sanitize model names: replace / and \ with _ for display - Recursive directory scan with relative path storage - Convert relative paths to absolute when spawning children - Filter router control args from child processes - Refresh args after port assignment for correct port value - Fallback preset lookup for compatibility - Fix missing argv[0]: store server binary path before base_args parsing * Revert "server: fix router model discovery and child process spawning" This reverts commit e3832b42eeea7fcb108995966c7584479f745857. * clarify about "no-" prefix * correct render_args() to include binary path * also remove arg LLAMA_ARG_MODELS_PRESET for child * add co-author for ini parser code Co-authored-by: aldehir <hello@alde.dev> * also set LLAMA_ARG_HOST * add CHILD_ADDR * Remove dead code --------- Co-authored-by: aldehir <aldehir@users.noreply.github.com> Co-authored-by: ngxson <ngxson@users.noreply.github.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: aldehir <hello@alde.dev>	2025-12-10 22:18:21 +01:00
Max Krasnyansky	e1f4921980	Fix race conditions in threadpool when dealing with dynamic/frequent n_threads changes (#17748 ) * tests: update barrier test to check for race condition in active threads * cpu: combine n_graph and n_threads into a single atomic update * tests: add multi-graph test for test_barrier	2025-12-10 12:32:23 -08:00
Georgi Gerganov	4dff236a52	ggml : remove GGML_KQ_MASK_PAD constant (#17910 ) * ggml : remove GGML_KQ_MASK_PAD constant * cont : remove comment	2025-12-10 20:53:16 +02:00
Georgi Gerganov	804e7e3795	graph : respect sampler order for graph reuse	2025-12-10 20:40:15 +02:00
Georgi Gerganov	44d5c4b592	batch : fix sequence id ownage	2025-12-10 20:35:58 +02:00
Sigbjørn Skjæret	4df6e859e9	cuda : add missing support check for xielu (#17895 )	2025-12-10 16:16:20 +01:00
Georgi Gerganov	38882247d3	Merge branch 'master' into HEAD	2025-12-10 17:07:21 +02:00
Xuan-Son Nguyen	6c2131773c	cli: new CLI experience (#17824 ) * wip * wip * fix logging, add display info * handle commands * add args * wip * move old cli to llama-completion * rm deprecation notice * move server to a shared library * move ci to llama-completion * add loading animation * add --show-timings arg * add /read command, improve LOG_ERR * add args for speculative decoding, enable show timings by default * add arg --image and --audio * fix windows build * support reasoning_content * fix llama2c workflow * color default is auto * fix merge conflicts * properly fix color problem Co-authored-by: bandoti <bandoti@users.noreply.github.com> * better loading spinner * make sure to clean color on force-exit * also clear input files on "/clear" * simplify common_log_flush * add warning in mtmd-cli * implement console writter * fix data race * add attribute * fix llama-completion and mtmd-cli * add some notes about console::log * fix compilation --------- Co-authored-by: bandoti <bandoti@users.noreply.github.com>	2025-12-10 15:28:59 +01:00
Eric Zhang	b677721819	model : Qwen3-Next-80B-A3B has 48 layers (#17898 ) * model : Qwen3-Next-80B-A3B has 48 layers * model : Add 80B-A3B type name	2025-12-10 15:22:40 +01:00
lhez	2d2e1030e3	docs : update opencl ops (#17904 )	2025-12-10 15:20:00 +01:00
Georgi Gerganov	c02654eb7d	graph : make the compute graph constant with respect to active samplers	2025-12-10 16:19:18 +02:00
Georgi Gerganov	0ecee8be37	server : reconnect the backend_sampling setting in the WebUI	2025-12-10 15:42:20 +02:00
Georgi Gerganov	81cb5783c8	Merge branch 'master' into HEAD	2025-12-10 13:41:32 +02:00
Johannes Gäßler	17f7f4baad	CUDA: fix unpadded strides in MMA FA kernel (#17891 )	2025-12-10 12:39:56 +01:00
Xuan-Son Nguyen	9e79b0116e	convert: allow using quantized Mistral weight (#17889 ) * convert: allow using quantized Mistral weight * data_torch.ndim * update dequant fn Co-authored-by: compilade <compilade@users.noreply.github.com> --------- Co-authored-by: compilade <compilade@users.noreply.github.com>	2025-12-10 10:26:22 +01:00
Neo Zhang Jianyu	2e9eab80c2	fix softmax for iGPU (#17838 )	2025-12-10 16:59:57 +08:00
Aldehir Rojas	2fbe3b7bb7	common : add parser for ministral/mistral large 3/devstral 2 (#17713 )	2025-12-09 17:31:04 -06:00
Sigbjørn Skjæret	63391852b0	docs : update cpu and cuda ops (#17890 ) * update cuda ops * update CPU as well	2025-12-09 23:31:29 +01:00
Gabe Goodhart	086a63e3a5	metal: SSM kernel improvements (#17876 ) * feat: Add a batched version of ssm_conv This was done using Claude Code. It found a number of optimizations around how the threads were organized, resulting in a huge performance boost! Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: Optimized SSM_SCAN kernel for metal This used Claude Code and resulted in a modest performance improvement while maintaining correctness. Branch: Mamba2SSD Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Add test-backend-ops perf tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: Real representitive tests for SSM_CONV Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * refactor: Use function constant for ssm_conv batch size Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * test: backend op tests for ssm_scan from granite4 1b-h Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * style: remove commented out templates Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * feat: float4 version of ssm_conv_batched Branch: SSMKernelImprovements Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add missing ggml_metal_cv_free Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-12-09 21:30:02 +02:00
Piotr Wilkin (ilintar)	b63509262a	Add DIAG for CUDA (#17873 ) * Add DIAG for CUDA * Refactor parameters	2025-12-09 20:28:57 +01:00
Johannes Gäßler	48f47565a7	docs: clarify that CPU support should be first (#17886 )	2025-12-09 20:10:36 +01:00
Oliver Simons	6dc6614bf0	Disable cooperative groups for musa Didn't find any doc online, so I don't even know if they support this	2025-12-09 19:09:52 +01:00
Oliver Simons	a25fda5290	Fix launch logic when supports_cooperative_launch=false	2025-12-09 19:03:47 +01:00
Oliver Simons	3f0594ad0b	Try fixing HIP build errors by adding corresponding #defines Will likely have to disable for MUSA as I didn't find any docs online	2025-12-09 18:51:28 +01:00
Gabe Goodhart	02e409a5be	ggml : Provide macos-specific backtrace printing to avoid terminal death (#17869 ) * fix: Provide macos-specific backtrace printing to avoid terminal death Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> * fix: Add GGML_BACKTRACE_LLDB env var to enable using lldb for backtrace Branch: MacOSSafeBacktrace Signed-off-by: Gabe Goodhart <ghart@us.ibm.com> --------- Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-12-09 18:29:07 +02:00
Georgi Gerganov	34b407b41c	sampling : use host buffer type for inputs	2025-12-09 17:53:17 +02:00
Georgi Gerganov	92ff767918	llama : require backend samplers to be of type llama_sampler_chain	2025-12-09 15:38:37 +02:00
Georgi Gerganov	6b82eb7883	metal : print node names for debugging (#17882 )	2025-12-09 15:25:49 +02:00
Oliver Simons	07003f1ffb	Fix compiler warnings by casting `const` away	2025-12-09 13:05:43 +01:00
Oliver Simons	886c3668b5	Add TODOs to and adjust heuristics of row-wise soft_max in CUDA Heuristics were selected based on the following numbers: ``` -- Before Backend 1/2: CUDA0 Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition Device memory: 97250 MB (96691 MB free) SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 2236 runs - 450.34 us/run - 655360 kB/run - 1401.20 GB/s SOFT_MAX(type=f32,ne=[12888,256,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 17748 runs - 56.80 us/run - 128880 kB/run - 2168.19 GB/s SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 57204 runs - 18.35 us/run - 12320 kB/run - 640.57 GB/s SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 9840 runs - 102.46 us/run - 81920 kB/run - 763.45 GB/s SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98064 runs - 10.25 us/run - 6160 kB/run - 573.43 GB/s SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98310 runs - 10.25 us/run - 10240 kB/run - 953.20 GB/s SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 172011 runs - 5.99 us/run - 640 kB/run - 101.84 GB/s SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 172011 runs - 5.97 us/run - 770 kB/run - 123.02 GB/s SOFT_MAX(type=f32,ne=[8192,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 172011 runs - 6.00 us/run - 64 kB/run - 10.16 GB/s SOFT_MAX(type=f32,ne=[8192,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 163820 runs - 6.12 us/run - 256 kB/run - 39.91 GB/s SOFT_MAX(type=f32,ne=[8192,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 147438 runs - 6.88 us/run - 1024 kB/run - 141.92 GB/s SOFT_MAX(type=f32,ne=[16384,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 122865 runs - 8.20 us/run - 128 kB/run - 14.89 GB/s SOFT_MAX(type=f32,ne=[16384,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 114674 runs - 8.87 us/run - 512 kB/run - 55.06 GB/s SOFT_MAX(type=f32,ne=[16384,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98292 runs - 10.24 us/run - 2048 kB/run - 190.82 GB/s SOFT_MAX(type=f32,ne=[32768,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 49146 runs - 21.37 us/run - 256 kB/run - 11.43 GB/s SOFT_MAX(type=f32,ne=[32768,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 49146 runs - 22.54 us/run - 1024 kB/run - 43.33 GB/s SOFT_MAX(type=f32,ne=[32768,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 49146 runs - 23.92 us/run - 4096 kB/run - 163.32 GB/s SOFT_MAX(type=f32,ne=[65536,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 32764 runs - 38.94 us/run - 512 kB/run - 12.54 GB/s SOFT_MAX(type=f32,ne=[65536,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 24573 runs - 41.94 us/run - 2048 kB/run - 46.57 GB/s SOFT_MAX(type=f32,ne=[65536,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 24582 runs - 43.09 us/run - 8192 kB/run - 181.32 GB/s SOFT_MAX(type=f32,ne=[131072,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 16382 runs - 74.56 us/run - 1024 kB/run - 13.10 GB/s SOFT_MAX(type=f32,ne=[131072,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 16382 runs - 79.85 us/run - 4096 kB/run - 48.92 GB/s SOFT_MAX(type=f32,ne=[131072,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 12294 runs - 82.41 us/run - 16384 kB/run - 189.64 GB/s SOFT_MAX(type=f32,ne=[262144,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 8191 runs - 145.16 us/run - 2048 kB/run - 13.46 GB/s SOFT_MAX(type=f32,ne=[262144,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 8194 runs - 155.46 us/run - 8192 kB/run - 50.26 GB/s SOFT_MAX(type=f32,ne=[262144,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 7175 runs - 160.70 us/run - 32768 kB/run - 194.56 GB/s SOFT_MAX(type=f32,ne=[524288,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 8191 runs - 285.81 us/run - 4096 kB/run - 13.67 GB/s SOFT_MAX(type=f32,ne=[524288,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 4098 runs - 306.91 us/run - 16384 kB/run - 50.92 GB/s SOFT_MAX(type=f32,ne=[524288,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 3591 runs - 317.06 us/run - 65536 kB/run - 197.32 GB/s -- After Backend 1/2: CUDA0 Device description: NVIDIA RTX PRO 6000 Blackwell Max-Q Workstation Edition Device memory: 97250 MB (96691 MB free) SOFT_MAX(type=f32,ne=[4096,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 2236 runs - 450.67 us/run - 655360 kB/run - 1400.15 GB/s SOFT_MAX(type=f32,ne=[12888,256,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 17748 runs - 56.97 us/run - 128880 kB/run - 2161.50 GB/s SOFT_MAX(type=f32,ne=[77,4096,5,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 57204 runs - 18.35 us/run - 12320 kB/run - 640.36 GB/s SOFT_MAX(type=f32,ne=[1024,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 9840 runs - 102.46 us/run - 81920 kB/run - 763.42 GB/s SOFT_MAX(type=f32,ne=[77,1024,10,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98064 runs - 10.25 us/run - 6160 kB/run - 573.43 GB/s SOFT_MAX(type=f32,ne=[256,256,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98310 runs - 10.25 us/run - 10240 kB/run - 953.21 GB/s SOFT_MAX(type=f32,ne=[64,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 147438 runs - 7.00 us/run - 640 kB/run - 87.26 GB/s SOFT_MAX(type=f32,ne=[77,64,20,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 147438 runs - 6.99 us/run - 770 kB/run - 105.05 GB/s SOFT_MAX(type=f32,ne=[8192,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 172011 runs - 6.02 us/run - 64 kB/run - 10.13 GB/s SOFT_MAX(type=f32,ne=[8192,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 163820 runs - 6.12 us/run - 256 kB/run - 39.87 GB/s SOFT_MAX(type=f32,ne=[8192,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 147438 runs - 6.91 us/run - 1024 kB/run - 141.40 GB/s SOFT_MAX(type=f32,ne=[16384,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 122865 runs - 8.20 us/run - 128 kB/run - 14.89 GB/s SOFT_MAX(type=f32,ne=[16384,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 114674 runs - 8.79 us/run - 512 kB/run - 55.54 GB/s SOFT_MAX(type=f32,ne=[16384,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98292 runs - 10.24 us/run - 2048 kB/run - 190.82 GB/s SOFT_MAX(type=f32,ne=[32768,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 131056 runs - 8.11 us/run - 256 kB/run - 30.12 GB/s SOFT_MAX(type=f32,ne=[32768,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 49146 runs - 22.54 us/run - 1024 kB/run - 43.33 GB/s SOFT_MAX(type=f32,ne=[32768,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 49146 runs - 23.32 us/run - 4096 kB/run - 167.50 GB/s SOFT_MAX(type=f32,ne=[65536,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 122865 runs - 8.19 us/run - 512 kB/run - 59.63 GB/s SOFT_MAX(type=f32,ne=[65536,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 40955 runs - 24.59 us/run - 2048 kB/run - 79.43 GB/s SOFT_MAX(type=f32,ne=[65536,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 24582 runs - 43.21 us/run - 8192 kB/run - 180.84 GB/s SOFT_MAX(type=f32,ne=[131072,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 122865 runs - 8.19 us/run - 1024 kB/run - 119.25 GB/s SOFT_MAX(type=f32,ne=[131072,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 40955 runs - 24.59 us/run - 4096 kB/run - 158.87 GB/s SOFT_MAX(type=f32,ne=[131072,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 12294 runs - 82.37 us/run - 16384 kB/run - 189.74 GB/s SOFT_MAX(type=f32,ne=[262144,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 122865 runs - 8.20 us/run - 2048 kB/run - 238.28 GB/s SOFT_MAX(type=f32,ne=[262144,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 36873 runs - 28.66 us/run - 8192 kB/run - 272.61 GB/s SOFT_MAX(type=f32,ne=[262144,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 9225 runs - 108.51 us/run - 32768 kB/run - 288.13 GB/s SOFT_MAX(type=f32,ne=[524288,1,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 98292 runs - 10.24 us/run - 4096 kB/run - 381.65 GB/s SOFT_MAX(type=f32,ne=[524288,4,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 32784 runs - 31.74 us/run - 16384 kB/run - 492.43 GB/s SOFT_MAX(type=f32,ne=[524288,16,1,1],mask=0,sinks=0,m_prec=f32,nr23=[1,1],scale=1.000000,max_bias=0.000000,inplace=0): 8721 runs - 121.20 us/run - 65536 kB/run - 516.19 GB/s ```	2025-12-09 12:58:56 +01:00
Oliver Simons	a84dfd3e10	CUDA: Add Cooperative-Groups-based parallelization of ncols in softmax Old implementation parallelizes rows across SMs, which does not fit the needs of backend-sampling (where we have ncols >> nrows and thus want to parallelize ncols across SMs)	2025-12-09 12:58:56 +01:00
Sigbjørn Skjæret	86a3f0fad8	ggml : allow fill node alloc inplace (#17870 )	2025-12-09 12:23:47 +01:00

... 4 5 6 7 8 ...

7747 Commits All Branches Search

7747 Commits

All Branches