Commit Graph

1942 Commits

Author SHA1 Message Date
zhanmyz eac9a99530 1. Solve the AC issue of PERMUTE+VIEW and the MUL_MAT issue in the "1. Process prompt and predict the first token" phase.
2. There is still an AC issue in the "2. Predict the subsequent tokens" phase; it is being debugged.
   A deviation has been detected in the computation of OpenVINO's CPY node at stage 2 and is currently being fixed.
2026-01-15 10:05:41 -08:00
zhanmyz 8ae700ae11 Prompt processing and first-token prediction are OK 2026-01-15 10:05:41 -08:00
zhanmyz 8020138406 add debug info 2026-01-15 10:05:41 -08:00
zhanmyz b02265a507 1. In the prompt-processing / first-token prediction stage, the PERMUTE node needs to be integrated into the OV Frontend.
2. In the subsequent-token prediction stage, the VIEW, CONT, and RESHAPE nodes need to be integrated into the OV Frontend.
2026-01-15 10:05:41 -08:00
zhanmyz 19ec9b6bf5 Try to add the VIEW node to the OV Frontend; some issues still need to be dealt with 2026-01-15 10:05:41 -08:00
zhanmyz b14b49d5f6 Minor Update 2026-01-15 10:05:41 -08:00
zhanmyz 467a5ddf04 1. Update the implementation of the CPY node when it is non-contiguous
2. Remove the duplicate get-node-operation function
2026-01-15 10:05:41 -08:00
zhanmyz cff473a9e2 1. All operators implemented using OpenVINO can be successfully executed individually.
2. The VIEW op's output tensor shape is not the same as the CONT (non-contiguous) input tensor shape.
3. CPY (non-contiguous) can't be implemented with the original input/output tensor shape and data (the original shape needs to be changed when creating the input/output tensors).

Currently, the VIEW op executes in the ggml backend and the others execute in the OpenVINO Frontend.
2026-01-15 10:05:41 -08:00
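The shape-vs-layout mismatch described above can be reproduced with the public ggml API. A minimal sketch (the tensor sizes are made up for illustration): PERMUTE yields a non-contiguous view whose logical shape no longer matches its memory strides, and CONT materializes it into contiguous memory, which is exactly where a frontend must reconcile shape against layout.

```cpp
#include "ggml.h"

int main() {
    ggml_init_params params = { /*mem_size*/ 16*1024*1024, /*mem_buffer*/ nullptr, /*no_alloc*/ false };
    ggml_context * ctx = ggml_init(params);

    // 4 x 8 x 2 contiguous tensor.
    ggml_tensor * x = ggml_new_tensor_3d(ctx, GGML_TYPE_F32, 4, 8, 2);

    // Swap the first two axes: the logical shape (ne) becomes 8 x 4 x 2, but
    // the strides (nb) still describe x's original layout, so p is
    // non-contiguous and ggml_is_contiguous(p) is false.
    ggml_tensor * p = ggml_permute(ctx, x, 1, 0, 2, 3);

    // CONT copies p into a tensor with the same logical shape but contiguous
    // strides; a VIEW taken of p before this step sees yet another shape/layout pair.
    ggml_tensor * c = ggml_cont(ctx, p);
    (void) c;

    ggml_free(ctx);
    return 0;
}
```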
zhanmyz e08a7fda33 All adjacent ops can be converted, but the calculation result is wrong and needs debugging 2026-01-15 10:05:41 -08:00
zhanmyz d05c458421 Change the CONT and MUL_MAT input node shapes 2026-01-15 10:05:41 -08:00
zhanmyz 246a2d1021 Change the input and output node shape of the MUL_MAT operator 2026-01-15 10:05:41 -08:00
zhanmyz f37fa21a5c Change the input and output node shape of the MUL_MAT operator 2026-01-15 10:05:41 -08:00
zhanmyz f98d215162 Change the input parameter shape of CONT operator 2026-01-15 10:05:41 -08:00
zhanmyz 9a7b7d8d6d OV Frontend supports GET_ROWS/RMS_NORM/MUL/MUL_MAT/ROPE/SCALE/SOFTMAX/ADD adjacent op graph conversion 2026-01-15 10:05:41 -08:00
zhanmyz 95ae982d59 OV Frontend supports GET_ROWS/RMS_NORM/MUL/MUL_MAT graph conversion of consecutive OPs 2026-01-15 10:05:41 -08:00
zhanmyz 901f7347ff Executing the CONT & VIEW operators in the OV Frontend is OK 2026-01-15 10:05:41 -08:00
zhanmyz 081b52667b Executing a single CONT operator is OK 2026-01-15 10:05:41 -08:00
zhanmyz afb8594194 add tmp source code files 2026-01-15 10:05:41 -08:00
zhanmyz 57582fda39 add implementation of CPY when the output tensor is non-contiguous 2026-01-15 10:05:41 -08:00
zhanmyz 8484769981 add implementation of MUL_MAT, CPY, CONT of GGML ops using OV ops 2026-01-15 10:05:41 -08:00
zhanmyz cb2729bc4a Move CPY from GGML OV Backend to OV Frontend 2026-01-15 10:05:41 -08:00
zhanmyz 2b04bd43be Add MUL_MAT,CPY,CONT as operators implemented in OpenVINO for GGML backend 2026-01-15 10:05:41 -08:00
zhanmyz 0f7d07de7d Add support for RMS_NORM OP 2026-01-15 10:05:41 -08:00
yumengbo 2353c73f53 Support ROPE op. 2026-01-15 10:05:41 -08:00
yumengbo 8aba03bac6 Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo d218c61e6d Support Softmax op 2026-01-15 10:05:41 -08:00
yumengbo 590f587b27 Add support for the UNARY SILU op. Fix PyTorch impl bugs. 2026-01-15 10:05:41 -08:00
yumengbo b100f89bad Change to an implementation following the PyTorch frontend 2026-01-15 10:05:41 -08:00
yumengbo e95f29cbc0 Fix an issue with the output memory copy of the infer request 2026-01-15 10:05:41 -08:00
zhanmyz 8c5a609f8d add the rms_norm operator implemented using OpenVINO to the GGML backend of llama.cpp 2026-01-15 10:05:41 -08:00
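As a rough illustration of what mapping a GGML op onto OpenVINO ops involves, here is a hedged sketch of RMS_NORM composed from standard opset nodes. This is an assumed decomposition (rms_norm(x) = x / sqrt(mean(x^2) + eps)), not the repository's actual implementation; `eps`, the shape, and the opset version are placeholders.

```cpp
#include <cstdint>
#include <memory>
#include <vector>
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset10.hpp>

// rms_norm(x) = x / sqrt(mean(x^2, last axis) + eps)
std::shared_ptr<ov::Model> make_rms_norm(const ov::PartialShape & shape, float eps) {
    using namespace ov::opset10;

    auto x     = std::make_shared<Parameter>(ov::element::f32, shape);
    auto sq    = std::make_shared<Multiply>(x, x);                                        // x^2
    auto axis  = Constant::create(ov::element::i64, ov::Shape{1}, std::vector<int64_t>{-1});
    auto mean  = std::make_shared<ReduceMean>(sq, axis, /*keep_dims*/ true);              // mean over last axis
    auto eps_c = Constant::create(ov::element::f32, ov::Shape{}, std::vector<float>{eps});
    auto denom = std::make_shared<Sqrt>(std::make_shared<Add>(mean, eps_c));              // sqrt(mean + eps)
    auto out   = std::make_shared<Divide>(x, denom);
    return std::make_shared<ov::Model>(ov::OutputVector{out}, ov::ParameterVector{x});
}
```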
zhanmyz 80c330a469 Update build.md and add operation mapping(GGML to OpenVINO) 2026-01-15 10:05:41 -08:00
zhanmyz 49804f43fc add GET_ROWS operator of OpenVINO to GGML of llama.cpp 2026-01-15 10:05:41 -08:00
yumengbo 5b46dc23be Change output for infer request to set output tensor. Support scale, view op. 2026-01-15 10:05:41 -08:00
yumengbo 31bd816426 Add GGML_OV_FRONTEND option. Add readme. 2026-01-15 10:05:41 -08:00
yumengbo 9b7b63d12c Convert a subgraph with add, sub, mul, div ops to an ov model and run inference on an OpenVINO device 2026-01-15 10:05:41 -08:00
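A minimal, self-contained sketch of the flow this commit describes: build an `ov::Model` for a tiny elementwise subgraph, compile it for a device, and run an infer request. The shapes, values, and opset version are placeholders, not the repository's actual conversion code.

```cpp
#include <memory>
#include <openvino/openvino.hpp>
#include <openvino/opsets/opset10.hpp>

int main() {
    using namespace ov::opset10;

    // Tiny stand-in for a converted subgraph: (a + b) * b.
    auto a   = std::make_shared<Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto b   = std::make_shared<Parameter>(ov::element::f32, ov::Shape{1, 4});
    auto add = std::make_shared<Add>(a, b);
    auto mul = std::make_shared<Multiply>(add, b);
    auto model = std::make_shared<ov::Model>(ov::OutputVector{mul}, ov::ParameterVector{a, b});

    ov::Core core;
    auto compiled = core.compile_model(model, "CPU");  // device string is configurable
    auto request  = compiled.create_infer_request();

    float in_a[4] = {1, 2, 3, 4}, in_b[4] = {5, 6, 7, 8};
    request.set_input_tensor(0, ov::Tensor(ov::element::f32, ov::Shape{1, 4}, in_a));
    request.set_input_tensor(1, ov::Tensor(ov::element::f32, ov::Shape{1, 4}, in_b));
    request.infer();

    ov::Tensor out = request.get_output_tensor();  // out.data<float>() holds (a + b) * b
    return 0;
}
```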
yumengbo 34e826ac14 Implement GgmlOvDecoder. Add dump functions. 2026-01-15 10:05:41 -08:00
yumengbo 171c4681f4 Add PoC of integration of openvino frontend. Main changes: ggml-ov-frontend-utils, GraphIterator, Decoder 2026-01-15 10:05:41 -08:00
zhanmyz ee31dc1c1b Add a function to get the available OpenVINO ops 2026-01-15 10:05:41 -08:00
zhanmyz 77d68146a8 Add OpenVINO frontend conversion process steps 2026-01-15 10:05:41 -08:00
zhanmyz 0a81aa19f7 Add compile options 2026-01-15 10:05:40 -08:00
zhanmyz adc2c70f44 Add OpenVINO MUL operator to GGML of Llama.cpp. 2026-01-15 10:05:40 -08:00
zhanmyz faa4a7de76 Solve the issue of abnormal model output caused by using OpenVINO ADD operator 2026-01-15 10:05:40 -08:00
zhanmyz 9b9d51dddf * Configure the device (default: CPU) on which OpenVINO compiles the model
* Add the OpenVINO ADD operator to llama.cpp. The output is somewhat abnormal and needs further debugging.
2026-01-15 10:05:40 -08:00
zhanmyz 5294402b50 Add OpenVINO as an optional backend for llama.cpp ggml 2026-01-15 10:05:40 -08:00
Yanglei Zou fe5720e684 Add ggml-openvino base files 2026-01-15 10:05:40 -08:00
Johannes Gäßler 5c662d21a3
CUDA: fix alignment on register spill for FA (#18815) 2026-01-15 15:14:50 +01:00
shalinib-ibm 8cc0ba957b
ggml-cpu: optimize ggml_vec_dot_bf16 for Power9 (#18837) 2026-01-15 17:31:18 +08:00
Max Krasnyansky cff777f226
hexagon: support for OP_CPY, host buffers now optional, hvx-utils refactoring and optimizations (#18822)
* hexagon: disable repack buffers if host buffers are disabled, improved handling of env vars

* hexagon: add support for OP_CPY fp16/fp32 -> fp16/fp32

Factored out all hvx_copy functions into the hvx-copy.h header and reduced code duplication.
Update HTP ops infra to support OP_CPY

* hexagon: cleanup and refactor hex/hvx/htp headers and helper libs

hex is basically all scalar/core platform stuff (L2, DMA, basic utils)
hvx is all hvx related utils, helpers, etc
htp is higher level stuff like Ops, etc

hvx-utils library got a nice round of cleanup and refactoring to reduce duplication

use hvx_vec_store_a where possible

* hexagon: refactor HVX sigmoid functions to hvx-sigmoid.h

Moved sigmoid and tanh vector functions from hvx-utils.h to a new header
hvx-sigmoid.h. Implemented aligned and unaligned variants for sigmoid
array processing using a macro pattern similar to hvx-copy.h. Updated
act-ops.c to use the new aligned variant hvx_sigmoid_f32_aa. Removed
unused hvx-sigmoid.c.

* hexagon: factor out hvx-sqrt.h

* hexagon: minor update to hvx-utils.h

* hexagon: remove spurious log

* hexagon: factor out and optimize hvx_add/sub/mul

* hexagon: remove _opt variants of add/sub/mul as they are simply the fully aligned versions

* hexagon: refactor reduction functions to hvx-reduce.h

Moved `hvx_self_max_f32` and `hvx_self_sum_f32` from `hvx-utils.h`/`.c` to `hvx-reduce.h`.
Renamed them to `hvx_reduce_max_f32` and `hvx_reduce_sum_f32`.
Added aligned (`_a`) and unaligned (`_u`) variants and used macros to unify logic.
Updated `softmax-ops.c` to use the new functions.

* hexagon: refactor the rest of arithmetic functions to hvx-arith.h

Moved `hvx_min_scalar_f32` and `hvx_clamp_scalar_f32` from `hvx-utils.c/h` to `hvx-arith.h`. Implemented aligned/unaligned variants (`_aa`, `_au`, etc.) and used macros to reduce code duplication. Updated these functions to use the `dst, src, ..., n` argument order and updated the call sites in `act-ops.c`. `hvx_sum_of_squares_f32` remains in `hvx-utils.c` as requested.

* hexagon: refactor hvx_sum_of_squares_f32

- Modify `hvx_sum_of_squares_f32` in `ggml/src/ggml-hexagon/htp/hvx-reduce.h` to use `dst, src` signature.
- Implement `_a` (aligned) and `_u` (unaligned) variants for `hvx_sum_of_squares_f32`.
- Update `hvx_reduce_loop_body` macro to support both returning and storing results via `finalize_op`.
- Update existing reduction functions in `hvx-reduce.h` to use the updated macro.
- Update `rms_norm_htp_f32` in `ggml/src/ggml-hexagon/htp/unary-ops.c` to match the new signature.

* hexagon: use hvx_splat instead of memset

* hexagon: consistent use of f32/f16 in all function names to match the rest of GGML

* hexagon: fix hvx_copy_f16_f32 on v75 and older

* hexagon: update readme to include GGML_HEXAGON_EXPERIMENTAL

* scripts: update snapdragon/adb scripts to enable host param
2026-01-14 21:46:12 -08:00
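The aligned/unaligned variant pattern these commits keep referring to can be sketched in plain C++. This is a hypothetical reconstruction, not the actual headers: the real code uses HVX intrinsics for the vector loads/stores, `memcpy` stands in for them here, and the `dst, src, ..., n` argument order follows the log.

```cpp
#include <cstring>

constexpr int VEC_BYTES = 128;             // HVX vector width in bytes
constexpr int VEC_F32   = VEC_BYTES / 4;   // f32 lanes per vector

struct vec_f32 { float v[VEC_F32]; };

// Stand-ins for unaligned/aligned HVX loads and an aligned store.
static inline vec_f32 load_u(const float * p) { vec_f32 r; std::memcpy(&r, p, VEC_BYTES); return r; }
static inline vec_f32 load_a(const float * p) { return load_u(p); }  // real code: aligned vector load
static inline void store_a(float * p, vec_f32 x) { std::memcpy(p, &x, VEC_BYTES); }

// One macro body instantiated per alignment combination, hvx-copy.h style:
// "aa" = dst and src aligned, "au" = dst aligned, src unaligned.
#define HVX_COPY_F32(suffix, LOAD)                                              \
    static void hvx_copy_f32_##suffix(float * dst, const float * src, int n) {  \
        int i = 0;                                                              \
        for (; i + VEC_F32 <= n; i += VEC_F32) {                                \
            store_a(dst + i, LOAD(src + i));                                    \
        }                                                                       \
        for (; i < n; ++i) {                                                    \
            dst[i] = src[i];  /* scalar tail */                                 \
        }                                                                       \
    }

HVX_COPY_F32(aa, load_a)
HVX_COPY_F32(au, load_u)
```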
Oliver Simons 36f0132464
CUDA: Factor out and re-use `block_reduce` function (#18785)
* CUDA: Refactor and expose two_stage_warp_reduce_* function

* Use `two_stage_warp_reduce` also in softmax kernel, move smem out of it

Moving smem out of the `__device__` function into the `__global__` function
allows for explicit smem reuse, as either the compiler or the CUDA runtime
seems not to free it afterwards (`cudaFuncSetAttribute` fails when not
accounting for it once for each call to two_stage_warp_reduce)

* Update ggml/src/ggml-cuda/common.cuh

Co-authored-by: Aman Gupta <amangupta052@gmail.com>

* Use two_stage_warp_reduce in group_norm_f32

* Use two_stage_warp_reduce in rms_norm_f32

* Fix smem calculation which expects bytes

* Make `two_stage_warp_reduce` accept all values warp_reduce accepts

Also integrate it into norm_f32 function

* Use two_stage_warp_reduce in l2_norm_f32

* Use type traits for block reduction for better legibility

Also address other requests by @am17an, such as variable renaming

* Make norm tests cover all cuda paths

* Mark columns % WARP_SIZE != 0 as supported for RMS_NORM_BACK

Unit-tests passed locally, let's see if they pass in the CI as well

* Use `enum class` for `block_reduce_method`

This is more type-safe than a plain enum

* Rename variables as suggested in code review by @am17an

* Rename two_stage_warp_reduce -> block_reduce

* Fix trailing whitespace in common.cuh

* Make condition of static_assert type-dependent

This delays evaluation until the template is actually instantiated.
Otherwise, some compilers may evaluate the assert when parsing the
template, resulting in build errors as observed here:

https://github.com/ggml-org/llama.cpp/actions/runs/20960323123/job/60235530068?pr=18785

* Inline definitions

---------

Co-authored-by: Aman Gupta <amangupta052@gmail.com>
2026-01-15 10:44:54 +08:00
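The last few items ("enum class", "type-dependent static_assert") combine into a standard C++ idiom. A minimal sketch, with only the enum name taken from the log and the reduction bodies reduced to placeholders:

```cpp
// A static_assert whose condition mentions no template parameter may be
// evaluated while the template is merely parsed; making the condition depend
// on T defers evaluation until an instantiation actually reaches the branch.
template <typename... Ts>
inline constexpr bool dependent_false_v = false;

enum class block_reduce_method { warp_only, two_stage };  // enum class: more type-safe than a plain enum

template <block_reduce_method M, typename T>
T block_reduce(T v) {
    if constexpr (M == block_reduce_method::warp_only) {
        return v;  // placeholder: reduce within a single warp
    } else if constexpr (M == block_reduce_method::two_stage) {
        return v;  // placeholder: warp reduce, stage through shared memory, reduce again
    } else {
        // static_assert(false, ...) here could fire at parse time on some compilers.
        static_assert(dependent_false_v<T>, "unknown block_reduce_method");
    }
}
```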
Jeff Bolz 3e4bb29666
vulkan: Check maxStorageBufferRange in supports_op (#18709)
* vulkan: Check maxStorageBufferRange in supports_op

* skip maxStorageBufferRange check when shader64BitIndexing is enabled
2026-01-14 10:59:05 +01:00
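For context, a hedged sketch of what such a supports_op check can look like; the struct and flag names here are assumptions for illustration, not the actual backend code (`maxStorageBufferRange` itself is a real `VkPhysicalDeviceLimits` field):

```cpp
#include <cstdint>

// Assumed capability summary; in the real backend these values come from the
// queried VkPhysicalDeviceLimits and device/shader feature detection.
struct device_caps {
    uint64_t max_storage_buffer_range;  // VkPhysicalDeviceLimits::maxStorageBufferRange
    bool     shader_64bit_indexing;     // assumed name for the 64-bit indexing capability
};

// supports_op-style check: reject an op whose tensor does not fit in a single
// storage-buffer binding, unless 64-bit indexing lets the shader address past
// that limit (the follow-up commit skips the check in that case).
static bool tensor_fits(const device_caps & caps, uint64_t tensor_nbytes) {
    if (caps.shader_64bit_indexing) {
        return true;
    }
    return tensor_nbytes <= caps.max_storage_buffer_range;
}
```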