Commit Graph

1930 Commits

Author SHA1 Message Date
zhanmyz f98d215162 Change the input parameter shape of CONT operator 2026-01-15 10:05:41 -08:00
zhanmyz 9a7b7d8d6d OV Frontend supports graph conversion of adjacent GET_ROWS/RMS_NORM/MUL/MUL_MAT/ROPE/SCALE/SOFTMAX/ADD ops 2026-01-15 10:05:41 -08:00
zhanmyz 95ae982d59 OV Frontend supports graph conversion of consecutive GET_ROWS/RMS_NORM/MUL/MUL_MAT ops 2026-01-15 10:05:41 -08:00
zhanmyz 901f7347ff Executing CONT & VIEW operators in the OV Frontend works 2026-01-15 10:05:41 -08:00
zhanmyz 081b52667b Executing a single CONT operator works 2026-01-15 10:05:41 -08:00
zhanmyz afb8594194 add tmp source code files 2026-01-15 10:05:41 -08:00
zhanmyz 57582fda39 add implementation of CPY when the output tensor is non-contiguous 2026-01-15 10:05:41 -08:00
zhanmyz 8484769981 add implementation of MUL_MAT, CPY, CONT of GGML ops using OV ops 2026-01-15 10:05:41 -08:00
zhanmyz cb2729bc4a Move CPY from GGML OV Backend to OV Frontend 2026-01-15 10:05:41 -08:00
zhanmyz 2b04bd43be Add MUL_MAT,CPY,CONT as operators implemented in OpenVINO for GGML backend 2026-01-15 10:05:41 -08:00
zhanmyz 0f7d07de7d Add support for RMS_NORM OP 2026-01-15 10:05:41 -08:00
yumengbo 2353c73f53 Support ROPE op. 2026-01-15 10:05:41 -08:00
yumengbo 8aba03bac6 Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo d218c61e6d Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo 590f587b27 Add support for UNARY SILU op. Fix PyTorch impl bugs. 2026-01-15 10:05:41 -08:00
yumengbo b100f89bad Change to an implementation following the PyTorch frontend 2026-01-15 10:05:41 -08:00
yumengbo e95f29cbc0 Fix issue for output memory copy of infer request 2026-01-15 10:05:41 -08:00
zhanmyz 8c5a609f8d add the rms_norm operator implemented using OpenVINO to the GGML backend of llama.cpp 2026-01-15 10:05:41 -08:00
zhanmyz 80c330a469 Update build.md and add operation mapping (GGML to OpenVINO) 2026-01-15 10:05:41 -08:00
zhanmyz 49804f43fc add OpenVINO GET_ROWS operator to the GGML backend of llama.cpp 2026-01-15 10:05:41 -08:00
yumengbo 5b46dc23be Change output for infer request to set output tensor. Support scale, view op. 2026-01-15 10:05:41 -08:00
yumengbo 31bd816426 Add GGML_OV_FRONTEND option. Add readme. 2026-01-15 10:05:41 -08:00
yumengbo 9b7b63d12c Convert subgraphs with add, sub, mul, div ops to an OV model and run inference on an OpenVINO device 2026-01-15 10:05:41 -08:00
yumengbo 34e826ac14 Implement GgmlOvDecoder. Add dump functions. 2026-01-15 10:05:41 -08:00
yumengbo 171c4681f4 Add PoC of integration of openvino frontend. Main changes: ggml-ov-frontend-utils, GraphIterator, Decoder 2026-01-15 10:05:41 -08:00
zhanmyz ee31dc1c1b add a function to get available OpenVINO ops 2026-01-15 10:05:41 -08:00
zhanmyz 77d68146a8 add OpenVINO frontend conversion process steps 2026-01-15 10:05:41 -08:00
zhanmyz 0a81aa19f7 Add compile options 2026-01-15 10:05:40 -08:00
zhanmyz adc2c70f44 Add OpenVINO MUL operator to the GGML backend of Llama.cpp. 2026-01-15 10:05:40 -08:00
zhanmyz faa4a7de76 Solve the issue of abnormal model output caused by using OpenVINO ADD operator 2026-01-15 10:05:40 -08:00
zhanmyz 9b9d51dddf * Configure the device (default: CPU) on which OpenVINO compiles the model
* Add OpenVINO ADD operator to Llama.cpp. The output is somewhat abnormal and needs further debugging.
2026-01-15 10:05:40 -08:00
zhanmyz 5294402b50 add OpenVINO as an optional backend for Llama.cpp ggml 2026-01-15 10:05:40 -08:00
Yanglei Zou fe5720e684 Add ggml-openvino base files 2026-01-15 10:05:40 -08:00
Johannes Gäßler 5c662d21a3
CUDA: fix alignment on register spill for FA (#18815) 2026-01-15 15:14:50 +01:00
shalinib-ibm 8cc0ba957b
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837) 2026-01-15 17:31:18 +08:00
Max Krasnyansky cff777f226
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)
* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all HVX-related utils, helpers, etc.
htp is higher-level stuff like Ops, etc.

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply the fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic (see the generic sketch after this log entry).
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_sum_of_squares_f32`, `hvx_min_scalar_f32`, and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` to use `dst, src, ..., n` argument order. Updated call sites in `act-ops.c`.

Refactor Hexagon HVX arithmetic functions (min, clamp) to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use `dst, src, ..., n` argument order and updated call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param
2026-01-14 21:46:12 -08:00
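A minimal sketch of the aligned/unaligned variant pattern described in the hvx-copy.h / hvx-reduce.h refactoring above. This is not the actual Hexagon code: plain scalar reads stand in for HVX vector loads, and every name below is illustrative only.

```cpp
// Hypothetical illustration: one macro body generates both the _a (aligned)
// and _u (unaligned) variants of a reduction, differing only in the load
// macro passed in. The real hvx-reduce.h would use HVX vector intrinsics.
#include <cstddef>

#define HVX_REDUCE_SUM_F32(suffix, LOAD)                                \
    static inline float hvx_reduce_sum_f32_##suffix(const float * src, \
                                                    size_t n) {         \
        float acc = 0.0f;                                               \
        for (size_t i = 0; i < n; i++) {                                \
            acc += LOAD(src + i); /* aligned or unaligned access */     \
        }                                                               \
        return acc;                                                     \
    }

#define LOAD_A(p) (*(p)) // stand-in for an aligned HVX vector load
#define LOAD_U(p) (*(p)) // stand-in for an unaligned HVX vector load

HVX_REDUCE_SUM_F32(a, LOAD_A) // generates hvx_reduce_sum_f32_a
HVX_REDUCE_SUM_F32(u, LOAD_U) // generates hvx_reduce_sum_f32_u
```

The `dst, src, ..., n` argument-order change mentioned above would apply the same way to the variants that write their result to memory.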
Oliver Simons 36f0132464
CUDA: Factor out and re-use `block_reduce` function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows explicit smem reuse, as either the compiler or the CUDA runtime seems
not to free it afterwards (`cudaFuncSetAttribute` fails when not accounting
for it once for each call to two_stage_warp_reduce). A generic sketch of the
two-stage pattern follows this log entry.

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-15 10:44:54 +08:00
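The `block_reduce` (formerly `two_stage_warp_reduce`) function discussed above follows the standard two-stage reduction: a shuffle reduction within each warp, then a second reduction of the per-warp partials through shared memory. A hedged sketch, not the actual common.cuh code; names, signatures and the sum-only specialization are illustrative:

```cpp
// CUDA C++ sketch of a two-stage (warp -> block) sum reduction.
#include <cuda_runtime.h>

constexpr int WARP_SIZE = 32;

__device__ float warp_reduce_sum(float x) {
    // stage 1: butterfly reduction within one warp
    for (int offset = WARP_SIZE/2; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset);
    }
    return x;
}

// smem holds one float per warp and is declared by the calling __global__
// kernel, mirroring the PR's point that keeping shared memory in the kernel
// makes its reuse explicit.
__device__ float block_reduce_sum(float x, float * smem) {
    const int lane    = threadIdx.x % WARP_SIZE;
    const int warp    = threadIdx.x / WARP_SIZE;
    const int n_warps = (blockDim.x + WARP_SIZE - 1) / WARP_SIZE;

    x = warp_reduce_sum(x);   // per-warp partial
    if (lane == 0) {
        smem[warp] = x;
    }
    __syncthreads();

    // stage 2: every warp re-reduces the partials, so the result is valid
    // in all threads of the block
    x = lane < n_warps ? smem[lane] : 0.0f;
    return warp_reduce_sum(x);
}
```

The same skeleton extends to max or norm reductions by swapping the combine operation, which is presumably where the `enum class` / type-trait dispatch mentioned above comes in.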
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
Daniel Bevenius 01cbdfd7eb
CUDA : fix typo in clang pragma comment [no ci] (#18830) 2026-01-14 10:31:49 +01:00
Ruben Ortlam 635ef78ec5
vulkan: work around Intel fp16 bug in mmq (#18814) 2026-01-14 09:41:23 +01:00
Perry Naseck 7d587e5544
ggml-metal: do not copy headers for embedded, use current binary dir for embedded (#18705) 2026-01-14 09:22:25 +02:00
yulo ea4a321f2a
HIP: add fattn-mma-f16 for RDNA4 (#18481)
* finish VQ mma

* flash_attn_ext_f16_iter

* KQ_rowsum

* correct exp

* fix scale error

* fix softmax scale

* fix softmax scale

* enable fattn on cpu side

* fix random error

* disable fattn-mma-f16 on rdna3

* fix wrong col for rdna

* use identity mat to transpose

* resolve conflicts

* basic tuning for DeepSeek-R1-Distill-Qwen-1.5B

* fix volta compile error

* align rdna4 policy for fattn

* adjust fattn policy

* adjust kernel selection logic

* update as per the review comments

* keep fattn-wmma logic

* adjust kernel selection logic

---------

Co-authored-by: zhang hui <you@example.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2026-01-13 13:52:16 +01:00
Georgi Gerganov 0a57271ab6
CUDA : fix unused argument when USE_CUDA_GRAPH=OFF (#18800) 2026-01-13 12:25:53 +02:00
Jeff Bolz 8e2da778da
vulkan: change memory_logger to be controlled by an env var (#18769) 2026-01-12 13:32:55 +01:00
Jeff Bolz 2bbe4c2cf8
vulkan: Use VK_EXT_shader_64bit_indexing to handle large mat_mul(_id) (#18678)
This fixes incoherent output in Llama-4-Maverick-17B-128E-PAB-Q8_0, which
has a mul_mat_id with an A matrix that's Q8_0 8192 x 5120 x 128.

This should work when the number of blocks in the A matrix is less than 2^32
(for mul_mat_vec or mul_mm_cm2), or for mul_mm I think the limit is like
2^32*LOAD_VEC_A elements.

- Divide batch_stride by QUANT_K earlier, so the block index calculation works in 32b (see the arithmetic check after this entry).
- Each vk_pipeline_struct has a linked list of pipelines that will allow it to handle
variants. So far this change just adds a single use case for this, compiling with the
e64BitIndexingEXT flag.
- Use the 64b indexing variant when the A matrix is larger than maxStorageBufferRange.

64-bit indexing has some cost - around 3-5% in MoE models, so it's worth the effort
to avoid enabling it unconditionally.
2026-01-12 12:32:13 +01:00
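For context on the 32-bit limit above: GGML's Q8_0 packs 32 elements per block (QK8_0 == 32), so the Maverick A matrix overflows a 32-bit element or byte index, but its block count still fits in 32 bits, which is why dividing batch_stride by QUANT_K before indexing helps. A small stand-alone check of that arithmetic (illustrative only, not backend code):

```cpp
#include <cstdint>
#include <cstdio>

int main() {
    const uint64_t ne0 = 8192, ne1 = 5120, ne2 = 128; // A matrix from the commit message
    const uint64_t QUANT_K = 32;                      // Q8_0 block size (QK8_0)

    const uint64_t n_elements = ne0 * ne1 * ne2;      // ~5.37e9 elements
    const uint64_t n_blocks   = n_elements / QUANT_K; // ~1.68e8 blocks

    printf("elements: %llu (fits in 32 bits: %s)\n",
           (unsigned long long) n_elements, n_elements <= UINT32_MAX ? "yes" : "no");
    printf("blocks:   %llu (fits in 32 bits: %s)\n",
           (unsigned long long) n_blocks,   n_blocks   <= UINT32_MAX ? "yes" : "no");
    return 0;
}
```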
Ruben Ortlam 1051ecd289
vulkan: Disable large coopmat matmul configuration on proprietary AMD driver (#18763)
* vulkan: Disable large coopmat matmul configuration on proprietary AMD driver

* Also disable the large tile size
2026-01-12 07:29:35 +01:00
Ruben Ortlam 0e76501e1d
Vulkan: Optimize Matmul parameters for AMD GPUs with Coopmat support (#18749)
* vulkan: Enable and optimize large matmul parameter combination for AMD

* limit tuning to AMD GPUs with coopmat support

* use tx_m values instead of _l
2026-01-11 17:33:33 +01:00
shaofeiqi 707cbafcaa
opencl: add SOFTPLUS op support (#18726) 2026-01-10 21:57:44 -08:00
Johannes Gäßler d2ff4e23ac
HIP: adjust RDNA3.5 MMQ kernel selection logic (#18666) 2026-01-10 17:19:01 +01:00
Perry Naseck 657a2e644b
cmake : update blas logic (#18205) 2026-01-10 18:00:54 +02:00