Commit Graph

1351 Commits

Author SHA1 Message Date
bssrdf 55859a86aa remove implicit op and related calls; replace conv_2d with conv_2d_implicit kernel 2025-10-29 21:36:03 -04:00
bssrdf 2dfbbee73f clean up 2025-10-29 13:19:35 -04:00
bssrdf 1e568252b5 switch to default conv2d interface 2025-10-29 12:11:26 -04:00
bssrdf 4b1920e9e7 reduced bank conflicts for output 2025-10-29 10:40:52 -04:00
bssrdf 75dde410a8 WIP: minor tweak 2025-10-28 14:41:48 -04:00
bssrdf 3ea524e9c4 WIP: almost working 2025-10-27 23:10:19 -04:00
bssrdf 6d12288037 WIP: fixed a bug in cpy transpose index computation 2025-10-27 17:32:03 -04:00
bssrdf a3784e17ad WIP: debugging cpy transpose 2025-10-27 15:09:03 -04:00
bssrdf cc327f5224 added a specialization for cuda copy op when tensor is transposed 2025-10-27 11:23:27 -04:00
bssrdf c68fe36ae2 WIP: cleanup; enhanced test case 2025-10-25 21:57:39 -04:00
bssrdf 475f9879c5 WIP: fixed another bug 2025-10-25 20:24:14 -04:00
bssrdf 396f55831c WIP: bug fix 2025-10-25 18:14:12 -04:00
bssrdf 610e41ae2d still debugging 2025-10-25 11:10:39 -04:00
bssrdf 980ddc1e87 properly use __CUDA_ARCH__ to protect the tensor path 2025-10-24 21:56:58 -04:00
bssrdf 24b553204b WIP: fixed another bug 2025-10-24 16:53:40 -04:00
bssrdf 6c90c20cb1 WIP: bug fix 2025-10-24 15:33:57 -04:00
bssrdf be25be8ed3 WIP: debugging tensor core kernel 2025-10-24 14:24:26 -04:00
bssrdf 80a996cfc0 WIP: tensor core code compiled ok 2025-10-24 11:41:11 -04:00
bssrdf 2715341c1d WIP: output 2025-10-23 21:29:45 -04:00
bssrdf 66f6d16265 WIP 2025-10-23 13:52:26 -04:00
bssrdf 215ebf6526 WIP 2025-10-22 15:56:55 -04:00
bssrdf 1b69ed44c6 WIP 2025-10-21 17:15:26 -04:00
bssrdf f931ad883f WIP 2025-10-21 17:12:50 -04:00
bssrdf f0a480cc22 WIP 2025-10-21 15:43:35 -04:00
bssrdf 6a1f8b4d57 change padding size back to 4 2025-10-15 14:21:04 -04:00
bssrdf ac77b8d0e0 change padding size to 1; added padding to input smem 2025-10-15 14:07:24 -04:00
bssrdf 3f99818925 unroll some loops 2025-10-15 12:46:46 -04:00
bssrdf b70cca2ea3 add support for both NCHW and NHWC layouts 2025-10-14 14:24:35 -04:00
bssrdf 3e2f722d11 fixed missing dilation 2025-10-14 11:12:55 -04:00
bssrdf 2237722056 added block variants; to be debugged 2025-10-14 11:02:10 -04:00
bssrdf 16b0f0ae3c work in progress 2025-10-13 18:41:30 -04:00
bssrdf 0ca43582e8 reorder register tile loop 2025-10-08 13:52:56 -04:00
bssrdf 53a2ccbe12 minor update and add direct conv in benchmarking 2025-09-24 21:48:20 -04:00
bssrdf 2ec76aa8f3 Merge branch 'master' into conv2d-implicit 2025-09-10 22:04:20 -04:00
Oliver Simons 00681dfc16
CUDA: Add `fastdiv` to `k_bin_bcast*`, giving 1-3% E2E performance (#15872)
* Add fastdiv and fastmodulo to k_bin_bcast kernel

* Address review comments

* `prod_` instead of `prod` suffix

* Add test case for `k_bin_bcast_unravel` in CUDA backend
2025-09-10 22:04:03 +02:00
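The `fastdiv`/`fastmodulo` trick replaces integer division and modulo by a divisor that is fixed for a kernel launch with a multiply-high and a shift, which is far cheaper than hardware integer division on GPUs. Below is a minimal host-side C++ sketch of the idea (Granlund-Montgomery style); the FastDiv struct and helper names are illustrative, not the actual CUDA helpers in the commit:

    #include <cassert>
    #include <cstdint>

    // Precomputed "magic" constants for dividing by a fixed 32-bit d.
    struct FastDiv {
        uint32_t mp; // magic multiplier
        uint32_t l;  // shift amount, ceil(log2(d))
    };

    static FastDiv make_fastdiv(uint32_t d) {
        uint32_t l = 0;
        while (l < 32 && (uint64_t{1} << l) < d) {
            l++;
        }
        // mp = floor(2^32 * (2^l - d) / d) + 1
        const uint32_t mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
        return {mp, l};
    }

    // n / d  ==  (mulhi(n, mp) + n) >> l; the add is done in 64 bits to avoid overflow
    static uint32_t fastdiv(uint32_t n, FastDiv f) {
        const uint32_t hi = (uint32_t) (((uint64_t) n * f.mp) >> 32); // __umulhi(n, mp) on the GPU
        return (uint32_t) (((uint64_t) hi + n) >> f.l);
    }

    // n % d  ==  n - (n / d) * d
    static uint32_t fastmod(uint32_t n, FastDiv f, uint32_t d) {
        return n - fastdiv(n, f) * d;
    }

    int main() {
        for (uint32_t d = 1; d < 2000; d++) {
            const FastDiv f = make_fastdiv(d);
            for (uint32_t n : {0u, 1u, 7u, 65535u, 123456789u, 4294967295u}) {
                assert(fastdiv(n, f) == n / d);
                assert(fastmod(n, f, d) == n % d);
            }
        }
        return 0;
    }

The win comes from precomputing mp and l once on the host and passing them as launch parameters, so every thread avoids a divide when unraveling broadcast indices.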
bssrdf 735886b099 merged with upstream master 2025-09-10 12:28:59 -04:00
Daniel Bevenius 9de447d94e
ggml-cpu : fix padding in ggml_timestep_embedding (#15917)
This commit fixes the zero padding for odd dimensions in
ggml_compute_forward_timestep_embedding_f32.
The motivation for this is that currently if an odd dimension is used,
the padding check incorrectly uses the dimension value for indexing.
For example, with dim=15:

Elements 0-6 are set to cosine values
Elements 7-13 are set to sine values
Element 14 is left uninitialized (contains garbage)
Element 15 is correctly set to zero

This fix changes embed_data[dim] to embed_data[2 * half] so that
element 14 (the first unused element) is properly set to zero as well
as the last element.

Resolves: https://github.com/ggml-org/ggml/issues/1324
2025-09-10 17:31:40 +02:00
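A minimal sketch of the index arithmetic described in this message, assuming a simplified single-row loop (the function below is illustrative, not the actual ggml-cpu source):

    #include <math.h>
    #include <stdio.h>
    #include <vector>

    // For odd dim the output row is dim + 1 floats wide; "half" = dim / 2 (7 for dim = 15).
    static void timestep_embedding_row(float timestep, int dim, int max_period, float * embed_data) {
        const int half = dim / 2;
        for (int j = 0; j < half; j++) {
            const float freq = expf(-logf((float) max_period) * j / half);
            const float arg  = timestep * freq;
            embed_data[j]        = cosf(arg); // cosine values: indices 0 .. half-1   (0-6)
            embed_data[j + half] = sinf(arg); // sine values:   indices half .. 2*half-1 (7-13)
        }
        if (dim % 2 != 0) {
            embed_data[2 * half] = 0.0f;      // the fix: zero index 14 instead of embed_data[dim] (index 15)
        }
    }

    int main() {
        const int dim = 15;
        std::vector<float> row(dim + 1, 0.0f);
        timestep_embedding_row(100.0f, dim, 10000, row.data());
        printf("padding element row[%d] = %f\n", 2 * (dim / 2), row[2 * (dim / 2)]);
        return 0;
    }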
Georgi Gerganov 0f0a3c2851
metal : make the backend async (#15906)
* metal : make the backend async

ggml-ci

* cont : add comments, extend op offload, clean up

ggml-ci

* metal : fix batch size for MUL_MAT_ID

* metal : remove deprecated ggml_backend_metal_buffer_from_ptr

* metal : create only metal buffers, no wrapping of host memory

ggml-ci

* metal : restore .alloc_buffer for buffer_from_ptr_type

ggml-ci

* metal : remove broken implementation of GGML_OP_SET

ggml-ci

* metal : clean-up loose ends, ready for tests

ggml-ci

* metal : support both private and shared buffers

ggml-ci

* metal : enable private buffers + add global device queue

* metal : disable host buffer to prevent races

ggml-ci

* metal : avoid extra copy during set_tensor

ggml-ci

* metal : use separate buffer types for shared and private Metal buffers

ggml-ci

* metal : simplify synchronization logic

ggml-ci

* metal : fix build

ggml-ci

* metal : do not implement cpy_tensor

ggml-ci

* metal : separate implementations for shared and private buffers

ggml-ci
2025-09-10 17:52:35 +03:00
Chenguang Li 10d8b2b6b0
CANN: Add ROPE sin/cos cache for reuse (#15912)
* CANN: Add ROPE sin/cos cache for reuse

Introduce sin/cos caching mechanism in ROPE to avoid redundant
computation across layers. The cache is built on the first layer
per device and reused by subsequent layers if parameters match.

- Added sin_cache / cos_cache pointers and position_length tracking
- Introduced cache validity flags and properties:
  (ext_factor, theta_scale, freq_scale, attn_factor, is_neox)
- Accelerates ROPE by eliminating repeated sin/cos generation

This change reduces overhead in multi-layer scenarios while
preserving correctness by verifying parameter consistency.

Co-authored-by: hipudding <huafengchun@gmail.com>

* fix typo

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
Co-authored-by: hipudding <huafengchun@gmail.com>
2025-09-10 18:42:00 +08:00
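A rough C++ sketch of the reuse pattern described above, with hypothetical names; the real backend caches device buffers rather than std::vector, but the parameter-match check is the core idea:

    #include <math.h>
    #include <stdint.h>
    #include <vector>

    // First ROPE op on a device fills the sin/cos tables, later layers reuse them if
    // their parameters match; otherwise the tables are rebuilt and the key updated.
    struct rope_cache {
        bool    valid = false;
        int64_t position_length = 0;
        float   ext_factor = 0, theta_scale = 0, freq_scale = 0, attn_factor = 0;
        bool    is_neox = false;
        std::vector<float> sin_cache, cos_cache;

        bool matches(int64_t plen, float ext, float ts, float fs, float af, bool neox) const {
            return valid && plen == position_length && ext == ext_factor && ts == theta_scale &&
                   fs == freq_scale && af == attn_factor && neox == is_neox;
        }

        void rebuild(const std::vector<float> & theta,
                     int64_t plen, float ext, float ts, float fs, float af, bool neox) {
            position_length = plen; ext_factor = ext; theta_scale = ts;
            freq_scale = fs; attn_factor = af; is_neox = neox;
            sin_cache.resize(theta.size());
            cos_cache.resize(theta.size());
            for (size_t i = 0; i < theta.size(); i++) {
                sin_cache[i] = sinf(theta[i]) * af; // per-position rotation terms, computed once
                cos_cache[i] = cosf(theta[i]) * af;
            }
            valid = true;
        }
    };

    int main() {
        rope_cache cache;
        std::vector<float> theta = {0.0f, 0.5f, 1.0f};
        // first layer: cache miss -> build; subsequent layers with the same params: hit -> reuse
        if (!cache.matches(3, 0.f, 0.9f, 1.f, 1.f, false)) {
            cache.rebuild(theta, 3, 0.f, 0.9f, 1.f, 1.f, false);
        }
        return cache.matches(3, 0.f, 0.9f, 1.f, 1.f, false) ? 0 : 1;
    }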
Chenguang Li 28b5f190ef
CANN: implement LRU cache for ACL graphs (#15814)
* CANN: implement LRU cache for ACL graphs in CANN backend

- Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects.
- Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded.
- Updated push, move_to_front, and clear methods to manage cached graphs efficiently.
- Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend.

* fix typo

* The LRU cache capacity can be configured via an env variable

Signed-off-by: noemotiovon <757486878@qq.com>

* refactor acl graph

* refactor && fix review comments

Signed-off-by: noemotiovon <757486878@qq.com>

---------

Signed-off-by: noemotiovon <757486878@qq.com>
2025-09-10 15:29:12 +08:00
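An illustrative C++ sketch of such an LRU cache; the type names, environment variable, and default capacity below are assumptions for the sketch, not the actual CANN backend code:

    #include <cstdlib>
    #include <iterator>
    #include <list>
    #include <memory>

    struct acl_graph { /* stand-in for a captured device graph */ };

    // Most recently used graph sits at the front; the least recently used one is
    // evicted when the configured capacity is exceeded.
    struct graph_lru_cache {
        size_t capacity;
        std::list<std::unique_ptr<acl_graph>> graphs; // front = most recently used

        graph_lru_cache() {
            const char * env = std::getenv("ACL_GRAPH_CACHE_CAPACITY"); // hypothetical name
            capacity = (env != nullptr && std::atoi(env) > 0) ? (size_t) std::atoi(env) : 12;
        }

        // insert a newly captured graph as most recently used, evicting the oldest if full
        void push(std::unique_ptr<acl_graph> graph) {
            graphs.push_front(std::move(graph));
            if (graphs.size() > capacity) {
                graphs.pop_back(); // destroys the least recently used graph
            }
        }

        // on a cache hit, mark the matching entry as most recently used
        void move_to_front(std::list<std::unique_ptr<acl_graph>>::iterator it) {
            graphs.splice(graphs.begin(), graphs, it);
        }

        void clear() { graphs.clear(); }
    };

    int main() {
        graph_lru_cache cache;
        cache.push(std::make_unique<acl_graph>());
        cache.push(std::make_unique<acl_graph>());
        cache.move_to_front(std::next(cache.graphs.begin())); // pretend the older graph was hit
        cache.clear();
        return 0;
    }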
Ruben Ortlam ae355f6f71
vulkan: throw the oom error instead of no memory type found (#15905) 2025-09-09 22:26:03 +02:00
Jeff Bolz 4f63cd705c
vulkan: Fix OOB accesses in soft_max_back (#15861) 2025-09-09 14:41:15 +02:00
Johannes Gäßler 17bc5a815f
HIP: use v_dot2_f32_f16 instruction for FA (#15884) 2025-09-09 14:04:43 +02:00
lksj92hs ed54e32558
Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886) 2025-09-09 14:01:15 +02:00
Aman Gupta a972faebed
CUDA: Add mul_mat_id support for the mmf kernel (#15767)
* CUDA: Add mul_mat_id support for the mmf kernel

Add support for mul_mat_id for bs < 16

* Review: use warp_size, fix should_use_mmf condition

* Launch one block per expert, stride along n_expert_used

* templatize mul_mat_id

* Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids

* Reduce compile times by dividing mmf into f16, bf16 and f32 variants

* Divide mmf by ncols_dst

* Add missing files

* Fix MUSA/HIP builds
2025-09-09 14:38:02 +08:00
Johannes Gäßler 550cf726e1
CUDA: fix GET_ROWS for large tensors (#15882) 2025-09-09 08:11:01 +02:00
Jeff Bolz e68aa10d8f
vulkan: sort graph to allow more parallel execution (#15850)
* vulkan: sort graph to allow more parallel execution

Add a backend proc to allow the backend to modify the graph. The
vulkan implementation looks at which nodes depend on each other
and greedily reorders them to group together nodes that don't
depend on each other. It only reorders the nodes; it doesn't
change the contents of any of them.

With #15489, this reduces the number of synchronizations needed.

* call optimize_graph per-split
2025-09-09 02:10:07 +08:00
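An illustrative sketch of the greedy grouping idea on a toy node type (not the actual Vulkan backend code): each pass emits every remaining node whose sources were all emitted in earlier passes, so the nodes of one pass have no dependencies on each other and can overlap on the GPU.

    #include <cstdio>
    #include <unordered_set>
    #include <vector>

    // Toy node: an id plus the ids of the nodes whose output it reads.
    struct node {
        int              id;
        std::vector<int> srcs;
    };

    // Only the order of the nodes changes, never the contents of a node.
    static std::vector<node> sort_graph(const std::vector<node> & nodes) {
        std::vector<node>       out;
        std::vector<bool>       emitted(nodes.size(), false);
        std::unordered_set<int> done; // ids emitted in previous passes

        out.reserve(nodes.size());
        while (out.size() < nodes.size()) {
            const size_t pass_start = out.size();
            for (size_t i = 0; i < nodes.size(); i++) {
                if (emitted[i]) {
                    continue;
                }
                bool ready = true;
                for (int s : nodes[i].srcs) {
                    if (done.count(s) == 0) { // a source has not been emitted yet
                        ready = false;
                        break;
                    }
                }
                if (ready) {
                    out.push_back(nodes[i]);
                    emitted[i] = true;
                }
            }
            if (out.size() == pass_start) {
                break; // cycle or dangling source: keep whatever order we have
            }
            // mark this pass's nodes only after the pass, so nodes emitted together
            // never depend on each other and need no synchronization in between
            for (size_t j = pass_start; j < out.size(); j++) {
                done.insert(out[j].id);
            }
        }
        return out;
    }

    int main() {
        // 1 reads 0, 3 reads 1 and 2; expected grouping: {0, 2}, {1}, {3}
        const std::vector<node> g = {{0, {}}, {1, {0}}, {2, {}}, {3, {1, 2}}};
        for (const node & n : sort_graph(g)) {
            printf("%d ", n.id); // prints: 0 2 1 3
        }
        printf("\n");
        return 0;
    }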
Aman Gupta 0a16bf52e6
CUDA: generate_cu_files.py - add missing mxfp4 (#15880) 2025-09-09 01:23:46 +08:00
Georgi Gerganov b0d52998b9
cuda : fix supports_op condition for get_rows when number of blocks is too large (#15868)
* cuda : fix supports_op condition for get_rows when src1->ne2 > 1

ggml-ci

* ggml : add comment about ggml_get_rows

ggml-ci

* cuda : add FIXME [no ci]

* cuda : update support condition

ggml-ci
2025-09-08 13:56:51 +03:00
Georgi Gerganov f28d4f4ac9
metal : refactor + optimize (#15857)
* metal : refactor

ggml-ci

* cont : refactor FA-vec kernel

* cont : print metal library load time

* minor : warn to debug + better kernel names

ggml-ci

* metal : optimize mul_mv q8_0

ggml-ci

* metal : simplify FA pipeline creation functions

ggml-ci

* metal : improve naming consistency

* metal : safer function constants offsets

ggml-ci

* metal : comments

ggml-ci
2025-09-08 13:34:56 +03:00