llama.cpp

Commit Graph

Author	SHA1	Message	Date
bssrdf	30990788e8	WIP	2025-10-27 08:29:20 -04:00
bssrdf	c68fe36ae2	WIP: cleanup; enhanced test case	2025-10-25 21:57:39 -04:00
bssrdf	475f9879c5	WIP: fixed another bug	2025-10-25 20:24:14 -04:00
bssrdf	396f55831c	WIP: bug fix	2025-10-25 18:14:12 -04:00
bssrdf	610e41ae2d	still debugging	2025-10-25 11:10:39 -04:00
bssrdf	c45df12ee7	this case is broken; to be debugged	2025-10-24 22:40:34 -04:00
bssrdf	980ddc1e87	properly use __CUDA_ARCH__ to protect the tensor path	2025-10-24 21:56:58 -04:00
bssrdf	24b553204b	WIP: fixed another bug	2025-10-24 16:53:40 -04:00
bssrdf	6c90c20cb1	WIP: bug fix	2025-10-24 15:33:57 -04:00
bssrdf	be25be8ed3	WIP: debugging tensor core kernel	2025-10-24 14:24:26 -04:00
bssrdf	80a996cfc0	WIP: tensore code compiled ok	2025-10-24 11:41:11 -04:00
bssrdf	2715341c1d	WIP: output	2025-10-23 21:29:45 -04:00
bssrdf	66f6d16265	WIP	2025-10-23 13:52:26 -04:00
bssrdf	215ebf6526	WIP	2025-10-22 15:56:55 -04:00
bssrdf	1b69ed44c6	WIP	2025-10-21 17:15:26 -04:00
bssrdf	f931ad883f	WIP	2025-10-21 17:12:50 -04:00
bssrdf	f0a480cc22	WIP	2025-10-21 15:43:35 -04:00
bssrdf	15484c9bd6	turn on tests for implicit conv2d	2025-10-17 22:16:16 -04:00
bssrdf	6a1f8b4d57	change padding size back to 4	2025-10-15 14:21:04 -04:00
bssrdf	ac77b8d0e0	change padding size to 1; added padding to input smem	2025-10-15 14:07:24 -04:00
bssrdf	3f99818925	unroll some loops	2025-10-15 12:46:46 -04:00
bssrdf	b70cca2ea3	add support for both NCHW and NHWC layouts	2025-10-14 14:24:35 -04:00
bssrdf	3e2f722d11	fixed missing dilation	2025-10-14 11:12:55 -04:00
bssrdf	2237722056	added block variants; to be debugged	2025-10-14 11:02:10 -04:00
bssrdf	16b0f0ae3c	work in progress	2025-10-13 18:41:30 -04:00
bssrdf	0ca43582e8	reorder register tile loop	2025-10-08 13:52:56 -04:00
bssrdf	c6255442bb	minor updates	2025-10-08 13:38:16 -04:00
bssrdf	53a2ccbe12	minor update and add direct conv in benchmarking	2025-09-24 21:48:20 -04:00
bssrdf	2ec76aa8f3	Merge branch 'master' into conv2d-implicit	2025-09-10 22:04:20 -04:00
Oliver Simons	00681dfc16	CUDA: Add `fastdiv` to `k_bin_bcast`, giving 1-3% E2E performance (#15872 ) Add fastdiv and fastmodulo to k_bin_bcast kernel * Address review comments * `prod_` instead of `prod` suffix * Add test case for `k_bin_bcast_unravel` in CUDA backend	2025-09-10 22:04:03 +02:00
Jie Fu (傅杰)	4f658855fa	llama : support T5 models with unequal number of encoder-decoder layers (#15909) * Extend the support of T5 models with different encoder-decoder layers Signed-off-by: Jie Fu <jiefu@tencent.com> * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/constants.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update gguf-py/gguf/gguf_writer.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-arch.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-arch.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-hparams.h Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update src/llama-model.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Rename n_dec_layer --> dec_n_layer Signed-off-by: Jie Fu <jiefu@tencent.com> * Adapt to cases when dec_n_layer > n_layer Signed-off-by: Jie Fu <jiefu@tencent.com> --------- Signed-off-by: Jie Fu <jiefu@tencent.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-10 20:51:51 +02:00
Sigbjørn Skjæret	6ab397e12b	graph : support non-contiguous Q in build_attn_mha (#15908 ) * support non-contiguous Q in build_attn_mha * Update src/llama-graph.cpp ggml-ci Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-09-10 19:08:59 +02:00
bssrdf	735886b099	merged with upstream master	2025-09-10 12:28:59 -04:00
Daniel Bevenius	9de447d94e	ggml-cpu : fix padding in ggml_timestep_embedding (#15917 ) This commit fixes the zero padding for odd dimensions in ggml_compute_forward_timestep_embedding_f32. The motivation for this is that currently if an odd dimension is used, the padding check incorrectly uses the dimension value for indexing. For example, with dim=15: Elements 0-6 are set to cosine values Elements 7-13 are set to sine values Element 14 is left uninitialized (contains garbage) Element 15 is correctly set to zero This fix changes embed_data[dim] to embed_data[2 * half] so that element 14 (the first unused element) is properly set to zero as well as the last element. Resolves: https://github.com/ggml-org/ggml/issues/1324	2025-09-10 17:31:40 +02:00
Georgi Gerganov	0f0a3c2851	metal : make the backend async (#15906 ) * metal : make the backend async ggml-ci * cont : add comments, extend op offload, clean up ggml-ci * metal : fix batch size for MUL_MAT_ID * metal : remove deprecated ggml_backend_metal_buffer_from_ptr * metal : create only metal buffers, no wrapping of host memory ggml-ci * metal : restore .alloc_buffer for buffer_from_ptr_type ggml-ci * metal : remove broken implementation of GGML_OP_SET ggml-ci * metal : clean-up loose ends, ready for tests ggml-ci * metal : support both private and shared buffers ggml-ci * metal : enable private buffers + add global device queue * metal : disable host buffer to prevent races ggml-ci * metal : avoid extra copy during set_tensor ggml-ci * metal : use separate buffer types for shread and private Metal buffers ggml-ci * metal : simplify synchronization logic ggml-ci * metal : fix build ggml-ci * metal : do not implement cpy_tensor ggml-ci * metal : separate implementations for shared and private buffers ggml-ci	2025-09-10 17:52:35 +03:00
Daniel Bevenius	33daece86b	ci : add caching for ROCm installation in release workflow (#15924 ) This commit applies the same caching to the release workflow which currently exists for the main CI workflow that was introduced in Commit `ff02caf9ee` ("ci : cache ROCm installation in windows-latest-cmake-hip (#15887)").	2025-09-10 15:39:57 +02:00
Daniel Bevenius	e7b6d83b52	tests : filter out no-ops from coverage report (#15900 ) * tests : filter out no-ops from coverage report This commit is a follow-up commit for #15745 to address the feedback on how no-op operations should be filtered out from the coverage report. The feedback regarding the UNARY and GLU sub-operations not being handled I not exactly sure what should be done. They are included in the coverage, for example ABS, ELU, EXP, GELU, GEGLU, GEGLU_ERF etc are in the list of covered operations: ```console $ ./build/bin/test-backend-ops --show-coverage Operations covered by tests (89): ✓ ABS ✓ ACC ✓ ADD ✓ ADD1 ✓ ADD_ID ✓ ARANGE ✓ ARGMAX ✓ ARGSORT ✓ CLAMP ✓ CONCAT ✓ CONV_2D ✓ CONV_2D_DW ✓ CONV_3D ✓ CONV_TRANSPOSE_1D ✓ CONV_TRANSPOSE_2D ✓ COS ✓ COUNT_EQUAL ✓ CPY ✓ CROSS_ENTROPY_LOSS ✓ CROSS_ENTROPY_LOSS_BACK ✓ DIAG_MASK_INF ✓ DIV ✓ DUP ✓ ELU ✓ EXP ✓ FLASH_ATTN_EXT ✓ GATED_LINEAR_ATTN ✓ GEGLU ✓ GEGLU_ERF ✓ GEGLU_QUICK ✓ GELU ✓ GELU_ERF ✓ GELU_QUICK ✓ GET_ROWS ✓ GET_ROWS_BACK ✓ GROUP_NORM ✓ HARDSIGMOID ✓ HARDSWISH ✓ IM2COL ✓ IM2COL_3D ✓ L2_NORM ✓ LEAKY_RELU ✓ LOG ✓ MEAN ✓ MUL ✓ MUL_MAT ✓ MUL_MAT_ID ✓ NEG ✓ NORM ✓ OPT_STEP_ADAMW ✓ OPT_STEP_SGD ✓ OUT_PROD ✓ PAD ✓ PAD_REFLECT_1D ✓ POOL_2D ✓ REGLU ✓ RELU ✓ REPEAT ✓ REPEAT_BACK ✓ RMS_NORM ✓ RMS_NORM_BACK ✓ ROLL ✓ ROPE ✓ ROPE_BACK ✓ RWKV_WKV6 ✓ RWKV_WKV7 ✓ SCALE ✓ SET ✓ SET_ROWS ✓ SGN ✓ SIGMOID ✓ SILU ✓ SILU_BACK ✓ SIN ✓ SOFT_MAX ✓ SOFT_MAX_BACK ✓ SQR ✓ SQRT ✓ SSM_CONV ✓ SSM_SCAN ✓ STEP ✓ SUB ✓ SUM ✓ SUM_ROWS ✓ SWIGLU ✓ SWIGLU_OAI ✓ TANH ✓ TIMESTEP_EMBEDDING ✓ UPSCALE Operations without tests (14): ✗ ADD_REL_POS ✗ CUSTOM ✗ DIAG ✗ DIAG_MASK_ZERO ✗ FLASH_ATTN_BACK ✗ GET_REL_POS ✗ IM2COL_BACK ✗ MAP_CUSTOM1 ✗ MAP_CUSTOM2 ✗ MAP_CUSTOM3 ✗ POOL_1D ✗ POOL_2D_BACK ✗ WIN_PART ✗ WIN_UNPART Coverage Summary: Total operations: 103 Tested operations: 89 Untested operations: 14 Coverage: 86.4% ``` Refs: https://github.com/ggml-org/llama.cpp/pull/15745 * use of ggml_op enum values instead of strcmp	2025-09-10 14:17:09 +02:00
j-k	2cfef4d117	media : add transparent icon svg and png [no ci] (#15891)	2025-09-10 14:51:28 +03:00
Jesse	09e72a037c	gitignore : Ignore vim swap files in tests (#15901)	2025-09-10 14:28:47 +03:00
Chenguang Li	10d8b2b6b0	CANN: Add ROPE sin/cos cache for reuse (#15912 ) * CANN: Add ROPE sin/cos cache for reuse Introduce sin/cos caching mechanism in ROPE to avoid redundant computation across layers. The cache is built on the first layer per device and reused by subsequent layers if parameters match. - Added sin_cache / cos_cache pointers and position_length tracking - Introduced cache validity flags and properties: (ext_factor, theta_scale, freq_scale, attn_factor, is_neox) - Accelerates ROPE by eliminating repeated sin/cos generation This change reduces overhead in multi-layer scenarios while preserving correctness by verifying parameter consistency. Co-authored-by: hipudding <huafengchun@gmail.com> * fix typo Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com> Co-authored-by: hipudding <huafengchun@gmail.com>	2025-09-10 18:42:00 +08:00
Chenguang Li	28b5f190ef	CANN: implement LRU cache for ACL graphs (#15814 ) * CANN: implement LRU cache for ACL graphs in CANN backend - Introduce ggml_cann_graph_lru_cache to store multiple ggml_cann_graph objects. - Graphs are loaded on demand and evicted using LRU policy when capacity is exceeded. - Updated push, move_to_front, and clear methods to manage cached graphs efficiently. - Ensures reuse of graphs, reducing graph reconstruction overhead in CANN backend. * fix typo * The LRU cache capacity can be configured via an env variable Signed-off-by: noemotiovon <757486878@qq.com> * refactory acl graph * refactory && fix review comments Signed-off-by: noemotiovon <757486878@qq.com> --------- Signed-off-by: noemotiovon <757486878@qq.com>	2025-09-10 15:29:12 +08:00
Daniel Bevenius	86587da03b	llama : check returned fn ptrs from ggml_backend_reg_get_proc_address (#15893 ) This commit adds check for two function pointers returned from ggml_backend_reg_get_proc_address. The motivation for this is that the function pointer could be nullptr if the get proc address function changes in the future. This is also consistent with all the other calls to ggml_backend_reg_get_proc_address in the code base.	2025-09-10 05:33:58 +02:00
Daniel Bevenius	ff02caf9ee	ci : cache ROCm installation in windows-latest-cmake-hip (#15887 ) This commit adds caching of the ROCm installation for the windows-latest-cmake-hip job. The motivation for this is that the installation can sometimes hang and/or not complete properly leaving an invalid installation which later fails the build. By caching the installation hopefully we can keep a good installation available in the cache and avoid the installation step. Refs: https://github.com/ggml-org/llama.cpp/pull/15365	2025-09-10 05:23:19 +02:00
Ruben Ortlam	ae355f6f71	vulkan: throw the oom error instead of no memory type found (#15905)	2025-09-09 22:26:03 +02:00
Jeff Bolz	4f63cd705c	vulkan: Fix OOB accesses in soft_max_back (#15861)	2025-09-09 14:41:15 +02:00
Johannes Gäßler	17bc5a815f	HIP: use v_dot2_f32_f16 instruction for FA (#15884)	2025-09-09 14:04:43 +02:00
lksj92hs	ed54e32558	Workaround for subgroup arithmetic failing on MoltenVK with AMD GPUs (issue 15846) (#15886)	2025-09-09 14:01:15 +02:00
Aman Gupta	a972faebed	CUDA: Add mul_mat_id support for the mmf kernel (#15767 ) * CUDA: Add mul_mat_id support the mmf Add support for mul_mat_id for bs < 16 * Review: use warp_size, fix should_use_mmf condition * Launch one block per expert, stride along n_expert_used * templatize mul_mat_id * Pad shmem to 16 bytes, add helper function mul_mat_f_switch_ids * Reduce compile times by dividing mmf into f16, bf16 and f32 variants * Divide mmf by ncols_dst * Add missing files * Fix MUSA/HIP builds	2025-09-09 14:38:02 +08:00
Johannes Gäßler	550cf726e1	CUDA: fix GET_ROWS for large tensors (#15882)	2025-09-09 08:11:01 +02:00
Georgi Gerganov	c252ce67c4	contrib : add notes about merging PRs (#15881 ) * contrib : add notes about merging PRs * Update CONTRIBUTING.md Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update CONTRIBUTING.md Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-09-09 08:42:10 +03:00
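
The padding fix described in commit 9de447d94e above can be illustrated with a minimal C++ sketch. This is not the ggml implementation: the function name `timestep_embedding_row`, the frequency formula, the rounding of an odd `dim` up to an even row length, and the choice to zero every slot from `2 * half` to the end of the row are all assumptions made for the example. Only the cosine/sine layout and the move from index `dim` to index `2 * half` are taken from the commit message.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a ggml function): builds one embedding row for a
// single timestep. Assumption: an odd `dim` is rounded up to an even row
// length, which matches the dim=15 example that mentions elements 0..15.
std::vector<float> timestep_embedding_row(float timestep, int dim, int max_period) {
    assert(dim >= 0 && max_period > 0);

    const int half = dim / 2;
    const std::size_t row_len = static_cast<std::size_t>(dim + (dim % 2));

    // NaN stands in for uninitialized memory so that unwritten slots are
    // easy to detect.
    std::vector<float> row(row_len, std::nanf(""));

    for (int j = 0; j < half; ++j) {
        // Assumed frequency schedule (standard sinusoidal embedding).
        const float freq = std::exp(-std::log(static_cast<float>(max_period)) *
                                    static_cast<float>(j) / static_cast<float>(half));
        const float arg = timestep * freq;
        row[static_cast<std::size_t>(j)]        = std::cos(arg);  // elements 0 .. half-1
        row[static_cast<std::size_t>(j + half)] = std::sin(arg);  // elements half .. 2*half-1
    }

    // Start zeroing at 2*half, the first unused slot, rather than at `dim`.
    // With dim=15 this covers element 14 as well as element 15.
    for (std::size_t i = static_cast<std::size_t>(2 * half); i < row_len; ++i) {
        row[i] = 0.0f;
    }
    return row;
}
```

Calling `timestep_embedding_row(0.0f, 15, 10000)` returns 16 values in which elements 14 and 15 are both zero and no slot is left unwritten. For an even `dim` the zeroing loop does nothing.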

6482 Commits, All Branches
