Commit Graph

6551 Commits

Author SHA1 Message Date
Georgi Gerganov 0320ac5264
metal : refactor + optimize v2 (#15995)
* metal : improve naming

* metal : refactor device

ggml-ci

* cont : props

ggml-ci

* metal : apply ggml_mem_ranges_t

ggml-ci

* metal : remove GGML_METAL_USE_BF16

ggml-ci

* metal : refactor device buffer

ggml-ci

* cont : fix naming

* metal : sync before destroying the backend

ggml-ci

* metal : refactor context

ggml-ci

* metal : migrate ggml-metal.m to ggml-metal.cpp

ggml-ci

* metal : adjust ops API

ggml-ci

* metal : use C++ to store pipelines

ggml-ci

* metal : migrate ops to separate functions

ggml-ci

* metal : add ggml_metal_library_t

ggml-ci

* metal : improve naming

ggml-ci

* metal : cleanup

ggml-ci

* metal : add support for GGML_OP_LOG

ggml-ci

* metal : fix error handling

ggml-ci
2025-09-17 20:38:12 +03:00
Aleksander Grygier a7a98e0fff
SvelteKit-based WebUI (#14839) 2025-09-17 19:29:13 +02:00
Xuan-Son Nguyen 8f8f2274ee
convert : add Llama4ForCausalLM (#16042)
* convert : add Llama4ForCausalLM

* handle swa

* half working version

* fix use_kq_norm

* fix use_kq_norm
2025-09-17 19:18:21 +02:00
Johannes Gäßler c959b676be
CUDA: fix FA occupancy, optimize tile kernel (#15982) 2025-09-17 15:32:42 +02:00
David Ribeiro Alves cd08fc3ecc
common : Fix corrupted memory error on json grammar initialization (#16038)
Initializing RESERVED_NAME in is_reserved_name() is not thread-safe
and leads to corrupted memory when used from multiple threads,
as can be seen in the ASan trace below. This fixes the initialization
to make it thread-safe.

    #0 0x000100abd018 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) __hash_table:1565
    #1 0x000100ab0320 in SchemaConverter::visit(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&) json-schema-to-grammar.cpp:802
    #2 0x000100aafc48 in std::__1::__function::__func<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2, std::__1::allocator<build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&)::$_2>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> (std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #3 0x000100a2c938 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&), std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0::operator()(common_grammar_builder const&) const::'lambda'(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>, void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)>::operator()(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&) function.h:319
    #4 0x000100a139f8 in foreach_function(nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&, std::__1::function<void (nlohmann::json_abi_v3_12_0::basic_json<nlohmann::json_abi_v3_12_0::ordered_map, std::__1::vector, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, bool, long long, unsigned long long, double, std::__1::allocator, nlohmann::json_abi_v3_12_0::adl_serializer, std::__1::vector<unsigned char, std::__1::allocator<unsigned char>>, void> const&)> const&) chat.cpp:762
    #5 0x000100a2a7f4 in std::__1::__function::__func<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0, std::__1::allocator<common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool)::$_0>, void (common_grammar_builder const&)>::operator()(common_grammar_builder const&) function.h:319
    #6 0x000100aa98f4 in build_grammar(std::__1::function<void (common_grammar_builder const&)> const&, common_grammar_options const&) json-schema-to-grammar.cpp:982
    #7 0x0001009c9314 in common_chat_params_init_llama_3_x(minja::chat_template const&, templates_params const&, bool) chat.cpp:1110
    #8 0x0001009b8afc in common_chat_templates_apply_jinja(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:1992
    #9 0x0001009b533c in common_chat_templates_apply(common_chat_templates const*, common_chat_templates_inputs const&) chat.cpp:2074
    #10 0x000100810120 in llamacpp_apply_chat_template+0x724 (predict_oai-98384e17fb94e863:arm64+0x100090120)
    ...

==45482==Register values:
 x[0] = 0x00006020004147f8   x[1] = 0x00006080000013c8   x[2] = 0x0000000000000000   x[3] = 0x0000604006289738
 x[4] = 0x0000000000000002   x[5] = 0x0000000000000001   x[6] = 0x04034000004b4000   x[7] = 0x0000000000000001
 x[8] = 0xbebebebebebebebe   x[9] = 0x17d7d7d7d7d7d7d7  x[10] = 0x00000c04000828ff  x[11] = 0x0000000000000001
x[12] = 0x000000002018d383  x[13] = 0x0000000000000000  x[14] = 0xfa0000000000fafa  x[15] = 0x000010700001ffff
x[16] = 0x000000019dc012c0  x[17] = 0x00000001021284f8  x[18] = 0x0000000000000000  x[19] = 0x00000001700acdc0
x[20] = 0x0000000000000002  x[21] = 0x000000002018d384  x[22] = 0x16dd16fd2e731151  x[23] = 0x0000007000020000
x[24] = 0x0000000100c69c08  x[25] = 0x0000000100c69c20  x[26] = 0x00006080000013c7  x[27] = 0x0000000100c69c00
x[28] = 0x00000001700acd60     fp = 0x00000001700aceb0     lr = 0x0000000100abce30     sp = 0x00000001700acd60
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV __hash_table:1565 in std::__1::pair<std::__1::__hash_iterator<std::__1::__hash_node<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, void*>*>, bool> std::__1::__hash_table<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::hash<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::equal_to<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>, std::__1::allocator<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>>>::__emplace_unique_key_args<std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>>, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&>(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char>> const&)
Thread T5 created by T0 here:
    #0 0x0001020b99d4 in pthread_create+0x5c (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x359d4)
    #1 0x000100873910 in std::sys::pal::unix:🧵:Thread:🆕:h77254fdd87a28e05+0x118 (predict_oai-98384e17fb94e863:arm64+0x1000f3910)
    #2 0x0001007c7a1c in test::run_test::haeb3c2bcd5ed6cf6+0x76c (predict_oai-98384e17fb94e863:arm64+0x100047a1c)
    #3 0x0001007aedb0 in test::console::run_tests_console::he9d142d704f3a986+0x149c (predict_oai-98384e17fb94e863:arm64+0x10002edb0)
    #4 0x0001007c5758 in test::test_main::hf86a5e20735245b9+0x118 (predict_oai-98384e17fb94e863:arm64+0x100045758)
    #5 0x0001007c5da0 in test::test_main_static::h61ee9c8fd30abca0+0x54 (predict_oai-98384e17fb94e863:arm64+0x100045da0)
    ...

==45482==ABORTING
2025-09-17 11:08:02 +03:00
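A note on the fix above: the usual remedy for this kind of race is to build the reserved-name set inside a function-local static, which C++11 initializes exactly once even under concurrent calls. The following is only a sketch of that pattern with made-up names and entries, not the actual llama.cpp code.

```cpp
#include <cstdio>
#include <string>
#include <unordered_set>

// Hypothetical illustration of the pattern, not the real is_reserved_name().
// C++11 guarantees the function-local static is initialized exactly once,
// even when multiple threads call this function concurrently.
static bool is_reserved_name(const std::string & name) {
    static const std::unordered_set<std::string> RESERVED_NAMES = {
        "root", "space", "char",   // illustrative entries only
    };
    return RESERVED_NAMES.count(name) > 0;
}

int main() {
    std::printf("%d %d\n", is_reserved_name("root"), is_reserved_name("grammar"));
}
```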
Eve cb5bb6cc05
vulkan: automatically remove unsupported devices (#15976)
* remove unsupported vulkan devices

* make this happen during selection instead

* pass by reference
2025-09-17 09:35:37 +02:00
Daniel Bevenius a91d035b90
ci : revert back to macos-13 for macOS-latest-cmake-x64 (#16040)
This commit reverts the runs-on parameter of the macOS-latest-cmake-x64
job back to macos-13, undoing the change made in commit 51abc96bdc
("ci : update macos-latest* jobs to use macos-latest (#15938)").

The motivation for this is that using macos-latest causes an ARM-based
runner to be used rather than an x64-based runner.

Refs: https://github.com/ggml-org/llama.cpp/pull/15938#issuecomment-3300805127
2025-09-17 09:34:09 +02:00
Jie Fu (傅杰) 745cbcf2fe
llama-quant : fix the verification of attention layers for encoder-decoder models (#16023)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-17 09:30:55 +02:00
Jie Fu (傅杰) 1cbd80f8cf
examples : support encoder-decoder models in the simple example (#16002)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-17 10:29:00 +03:00
Shane A 85286f3548
model : add OLMo3 support (#16015)
* Add HF to gguf conversion logic for Olmo3

* Add Olmo3 implementation

* Update rope comment

* Fix indentation

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-17 09:01:58 +02:00
Chenguang Li d5fabe3682
CANN: Optimize ggml_cann_set_device (#15935)
* CANN: Fix ggml_cann_set_device to avoid redundant device switches

- Added a check to skip aclrtSetDevice if the current device is already set.
- Prevents unnecessary context switches while keeping thread/device consistency.

* CANN: add device default id
2025-09-17 14:33:08 +08:00
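The check described in the commit above amounts to caching the current device id and skipping the switch when it already matches. A minimal sketch of that idea, assuming a hypothetical set_device_cached() helper and a stubbed driver call (the real backend would call the CANN runtime's set-device API):

```cpp
#include <cstdint>
#include <cstdio>

// Stub standing in for the real driver call to switch devices.
static void backend_set_device(int32_t device) {
    std::printf("switching to device %d\n", (int) device);
}

// Hypothetical helper: skip the switch when the requested device is already
// current on this thread, keeping per-thread device consistency.
static void set_device_cached(int32_t device) {
    static thread_local int32_t current_device = -1;
    if (current_device == device) {
        return; // no redundant device switch
    }
    backend_set_device(device);
    current_device = device;
}

int main() {
    set_device_cached(0); // switches
    set_device_cached(0); // skipped
    set_device_cached(1); // switches
}
```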
jacekpoplawski 8ff206097c
llama-bench: add --n-cpu-moe support (#15952)
* llama-bench: add --n-cpu-moe support

Support --n-cpu-moe in llama-bench the same way it is supported by
llama-server.
2025-09-16 16:17:08 +02:00
Daniel Bevenius 77475530b8
ci : use macos-latest for arm64 webgpu build (#16029)
This commit updates the runs-on field for the macOS arm64 webgpu build
job to use macos-latest instead of just latest.

The motivation for this is that this job can wait for a runner to pick
up the job for a very long time, sometimes over 7 hours. This is an
attempt to see if this change can help reduce the wait time.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17754163447/job/50454257570?pr=16004
2025-09-16 15:27:52 +02:00
Daniel Bevenius 3913f8730e
ggml : fix padding in timestep embedding kernels (#15932)
* ggml : remove adding extra dim timestep embedding

This commit updates the ggml_timestep_embedding function to no longer
add an extra dimension when the specified dimension is odd.

The motivation for this change is that the extra dimension is unnecessary
when the dimension is odd, and it caused an issue in kernels that were not
expecting it, resulting in uninitialized memory for the second-to-last
dimension.

* ggml-cuda : fix padding in timestep embedding kernel

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.

* ggml-metal : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-opencl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-sycl : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-vulkan : fix padding in timestep embedding kernel

This commit fixes the zero padding for odd dimensions in
the timestep embedding kernel.

* ggml-cpu : fix padding in timestep embedding function

This commit removes the zeroing out of the last dimension now that we
are not adding the extra padding dimension.
2025-09-16 15:25:57 +02:00
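For context on the padding behavior described above, here is a CPU-side sketch of a sinusoidal timestep embedding where an odd dimension is handled by zero-padding the last element instead of growing the tensor. This mirrors the common diffusion-model formulation and is illustrative only, not the ggml kernels:

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative: sinusoidal timestep embedding of size `dim` for one timestep.
// When `dim` is odd, the final element is simply left as zero padding.
static std::vector<float> timestep_embedding(float t, int dim, int max_period = 10000) {
    std::vector<float> emb(dim, 0.0f);       // zero-init covers the odd padding slot
    const int half = dim / 2;
    for (int i = 0; i < half; ++i) {
        const float freq = std::exp(-std::log((float) max_period) * (float) i / (float) half);
        emb[i]        = std::cos(t * freq);  // first half: cos
        emb[half + i] = std::sin(t * freq);  // second half: sin
    }
    return emb;                              // emb[dim - 1] stays 0 when dim is odd
}

int main() {
    const auto e = timestep_embedding(42.0f, 5); // odd dim: last value remains 0
    for (float v : e) std::printf("%8.5f ", v);
    std::printf("\n");
}
```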
Daniel Bevenius 76888d202e
ci : upload xcframework artifact from ios-xcode-build job (#16010)
This commit updates the github workflows build.yml file to include steps
for uploading and downloading the xcframework artifact. The
macos-latest-swift job now depends on the ios-xcode-build job and
downloads the xcframework artifact produced by it.

The motivation for this change is that it takes a long time to build
the xcframework and we are currently doing this twice in the workflow.
With this change, we only build it once and reuse the artifact.
2025-09-16 13:41:38 +02:00
Bowen Han f1fbffb5c0
fix: apply clang-format to CUDA macros (#16017)
clang-format previously broke long CUDA macros (e.g. __launch_bounds__) into
unreadable line breaks inside template declarations, such as:

  template<int D, int ncols, int nwarps, int VKQ_stride,
           typename KQ_acc_t, bool use_logit_softcap>
      __launch_bounds__(nwarps*ggml_cuda_get_physical_warp_size(), 1)

This change adjusts formatting rules so that CUDA macros remain consistent
and aligned with the surrounding template syntax.
2025-09-16 08:59:19 +02:00
Daniel Bevenius 51abc96bdc
ci : update macos-latest* jobs to use macos-latest (#15938)
* ci : update macos-latest* jobs to use macos-latest

This commit updates the jobs that are named macos-latest* to use the
macos-latest label instead of explicit versions.

The motivation for this is that there is currently a mixture of
versions in this workflow and there are jobs that are failing because
they require a newer version.

Refs: https://github.com/ggml-org/llama.cpp/actions/runs/17644792595/job/50140010907#step:5:1759

* ci : add xcodebuild -downloadPlatform iOS command
2025-09-16 05:57:16 +02:00
Yuri Khrustalev 07808ebb07
cmake : Do not install tools on iOS targets (#15903) 2025-09-16 09:54:44 +07:00
Aman Gupta 6d758839ff
Add LLaDA-7b-MoE diffusion model (#16003) 2025-09-16 10:38:28 +08:00
Jake Karnes 3d4053f77f
CUDA: fix im2col_3d to respect non-contiguous inputs (views) (#15956)
* fix im2col_3d to respect non-contiguous inputs (views)

The CUDA 3D im2col kernel computed source addresses assuming a compact layout (products of dims), ignoring nb[] strides.

This patch switches im2col_3d source indexing to use true strides derived from src1->nb[] (in elements), mirroring the approach used in the 2D CUDA im2col path. Destination indexing is unchanged.

* use ggml_element_size() for src strides

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-09-16 00:28:31 +02:00
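The stride fix above boils down to addressing the source tensor through its actual byte strides (nb[]) rather than assuming a packed layout built from products of the dimensions. Below is a host-side sketch of the difference, assuming ggml-style shapes ne[] and byte strides nb[]; the function names are illustrative, not the CUDA kernel:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Wrong for views: assumes a packed layout computed from products of ne[].
static int64_t offset_contiguous(const int64_t ne[4], int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    return ((i3 * ne[2] + i2) * ne[1] + i1) * ne[0] + i0;
}

// Correct for views: derive per-dimension element strides from the byte strides nb[].
static int64_t offset_strided(const size_t nb[4], size_t elem_size,
                              int64_t i0, int64_t i1, int64_t i2, int64_t i3) {
    const int64_t s0 = nb[0] / elem_size;
    const int64_t s1 = nb[1] / elem_size;
    const int64_t s2 = nb[2] / elem_size;
    const int64_t s3 = nb[3] / elem_size;
    return i0 * s0 + i1 * s1 + i2 * s2 + i3 * s3;
}

int main() {
    // A transposed (non-contiguous) f32 view: shape 3x4, byte strides 16 and 4.
    const int64_t ne[4] = {3, 4, 1, 1};
    const size_t  nb[4] = {16, 4, 48, 48};
    std::printf("packed: %lld  strided: %lld\n",
                (long long) offset_contiguous(ne, 1, 2, 0, 0),
                (long long) offset_strided(nb, sizeof(float), 1, 2, 0, 0));
}
```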
Diego Devesa dc381aa9a6
docker : enable rocWMMA in ROCm images, add gfx1151 (#15997) 2025-09-15 23:38:52 +02:00
Diego Devesa 10d197409b
releases : switch to rocWMMA develop branch, add gfx1151 (#15992)
* releases : switch to rocWMMA develop branch, add gfx1151

* remove unused variable ROCM_VERSION
2025-09-15 23:38:42 +02:00
yael-works b907255f4b
SYCL: Add COUNT_EQUAL operator support (#15991)
* SYCL: Add COUNT_EQUAL operator support (rebased on master)

* SYCL: remove duplicate op_count_equal definition

* tests: remove test_count_equal_typed and use test_count_equal for all cases

* tests: keep only I32 case for COUNT_EQUAL as suggested

* tests: keep only I32 case for COUNT_EQUAL as requested
2025-09-15 18:51:35 +02:00
Nikolay Popov 28c39da7c6
llama-run: Fix model download on Windows (#15988)
* llama-run: Fix model download on Windows
 * fix SSL error (SSL peer certificate or SSH remote key was not OK)
 * fix program crash on std::filesystem::rename

* llama-run: create a separate method to utilize RAII

* llama-run: handle rename exception
2025-09-15 11:08:30 +01:00
Aman Gupta 106220562a
CUDA: some micro-optimizations in mmf.cuh for mul_mat_id (#15926) 2025-09-15 17:35:11 +08:00
ddh0 a68f31edd7
fix KLD percentile output (#15999)
In `llama-perplexity`, when using `--kl-divergence`, the KL divergence statistics output mistakenly displays the 99th percentile twice. This change fixes that and correctly displays the 90th percentile as originally intended (presumably).
2025-09-15 09:54:57 +02:00
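For reference, the percentile in question is an order statistic over per-token KL divergence values. A hedged sketch of one common way to compute it (nearest-rank on a sorted copy); this is illustrative and not the llama-perplexity code:

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// Nearest-rank percentile: p in (0, 100]; the input need not be pre-sorted.
static double percentile(std::vector<double> values, double p) {
    std::sort(values.begin(), values.end());
    const size_t rank = (size_t) std::ceil(p / 100.0 * values.size());
    return values[rank == 0 ? 0 : rank - 1];
}

int main() {
    std::vector<double> kld = {0.01, 0.02, 0.03, 0.05, 0.08, 0.13, 0.21, 0.34, 0.55, 0.89};
    std::printf("p90 = %.2f\n", percentile(kld, 90.0)); // 0.55
    std::printf("p99 = %.2f\n", percentile(kld, 99.0)); // 0.89
}
```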
Sigbjørn Skjæret b8e09f08b9
model : add grok-2 support (#15539)
* add grok-2 support

* type fix

* type fix

* type fix

* "fix" vocab for invalid sequences

* fix expert tensor mapping and spaces in vocab

* add chat template

* fix norm tensor mapping

* rename layer_out_norm to ffn_post_norm

* ensure ffn_post_norm is mapped

* fix experts merging

* remove erroneous FFN_GATE entry

* concatenate split tensors and add more metadata

* process all expert layers and try cat instead of hstack

* add support for community BPE vocab

* fix expert feed forward length and ffn_down concat

* commit this too

* add ffn_up/gate/down, unsure if sequence is right

* add ffn_gate/down/up to tensor names

* correct residual moe (still not working)

* mess--

* fix embedding scale being applied twice

* add built in chat template

* change beta fast for grok if default value

* remove spm vocab in favor of community bpe vocab

* change attention temp length metadata type to integer

* update attention temp length metadata

* remove comment

* replace M_SQRT2 with std::sqrt(2)

* add yarn metadata, move defaults to hparams
2025-09-14 23:00:59 +02:00
Sigbjørn Skjæret 6c019cb04e
server : only attempt to enable thinking if using jinja (#15967) 2025-09-14 21:17:04 +02:00
Georgi Gerganov 9dcd200d57
metal : remove memory pools (#15966)
* metal : remove mem pool usage

ggml-ci

* metal : remove mem pool implementation

ggml-ci

* metal : take into account the actual allocated memory of the tensor

ggml-ci

* cont : use ggml_backend_buft_get_alloc_size

ggml-ci

* cont : improve, comments

ggml-ci

* cont : add functions for the extra tensor sizes

* metal : add comments

ggml-ci

* metal : implement .get_alloc_size for the rest of the buffer types

ggml-ci

* metal : remove ggml_metal_heap

ggml-ci
2025-09-14 22:02:32 +03:00
Adam 0fa154e350
rocm.Dockerfile: added gfx1200,gfx1201 architectures to support AMD Radeon RX 9000 series (#15994)
* rocm.Dockerfile: added gfx1200,gfx1201 architectures to support  AMD Radeon RX 9000 series

https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.4.1/reference/system-requirements.html#rdna-os
states the Radeon RX 9000 series is supported from Ubuntu 24.04.2, and the Dockerfile is using 24.04, which is ROCm 6.4.

This fixed the `ROCm error: invalid device function` I was getting when trying to use the rocm container.
2025-09-14 20:43:54 +02:00
Ruben Ortlam 261e6a20ff
Vulkan: Clean up mul_mm shader (#15987)
* vulkan: move mul_mm dequantization steps into a separate file and functions

* improve mul_mm vector load code

* fix debug mode issues and warnings
2025-09-14 16:56:28 +02:00
lcy a0e13dcbe5
build: fix the build failures of Windows HIP release job (#15984)
* build: fix the cache keys for Windows HIP release job

Update the cache keys to include the HIP SDK version, preventing the
use of outdated ROCm installation caches.

* build: sync changes from release.yml to build.yml

- Update HIP SDK version to 25.Q3 and ROCm version to 6.4.2
- Update the cache keys to reflect the new versions

* build: remove Windows HIP release for gfx1151
since the current stable rocWMMA does not support gfx1151.
2025-09-14 07:20:35 -07:00
Georgi Gerganov a14bd35014
metal : fix kernel requirements (#15983)
* metal : fix kernel requirements

ggml-ci

* cont : fix supports_op

* cont : fix supports_op for ARGMAX
2025-09-14 15:33:22 +03:00
Radoslav Gerganov 918b26f197
rpc : fix regression when --device is used (#15981)
Fix regression introduced with commit 50f4281a6
2025-09-14 12:28:18 +03:00
Diego Devesa 9ecb884346
releases : update ROCM, add gfx1200, gfx1201, gfx1151 (#15972)
* releases : update ROCM, add gfx1200, gfx1201, gfx1151

* releases : set target to 13.3 for macos-x64

* add hipblaslt.dll to release

* add hipblaslt/library to release
2025-09-14 02:21:59 -07:00
Radoslav Gerganov d1c6f11f47
doc : update documentation for --tensor-split (#15980)
* doc : update documentation for --tensor-split

* Update tools/main/README.md

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* Update tools/main/README.md

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-09-14 12:10:07 +03:00
Aaron Teo 6380d6a3e7
ggml-zdnn: rm user mapped buffers (#15965)
* ggml-zdnn: rm user mapped buffers

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm dead code

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: attempt to fix missing extra data buffer free

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-14 13:37:03 +08:00
Jeff Bolz aa0c461efe
vulkan: fix failing dequant shaders (#15862)
* vulkan: fix failing dequant shaders

* add missing const
2025-09-13 17:29:43 +02:00
Jeff Bolz b9c9c9f789
vulkan: initialize vulkan-hpp to allow using extension function pointers (#15705)
Use this to query register count for shader compiles on NVIDIA. Currently
this is only for performance debug, but it could eventually be used in some
heuristics like split_k.
2025-09-13 17:23:30 +02:00
Diego Devesa 50f4281a6f
llama : allow using iGPUs with --device (#15951)
* llama : allow using iGPUs with --device

* mtmd : allow iGPU

* rpc-server : allow iGPU
2025-09-13 16:49:49 +02:00
Georgi Gerganov 55758b00ca
metal : refactor kernel loading (#15964)
* metal : refactor bin kernels loading

ggml-ci

* metal : refactor rms kernel loading

ggml-ci

* ci : try to add memory leaks check

ggml-ci

* ci : try to enable memory leak detection for Mac

* cont : seems to be working
2025-09-13 16:24:22 +03:00
Georgi Gerganov f161463a54
metal : allow ops to run concurrently (#15929)
* metal : run graphs ops concurrently

ggml-ci

* cont : add flags for debugging and disabling concurrency

ggml-ci

* cont : refactor and handle fusing

ggml-ci

* cont : simplify - no need to use GPU address

ggml-ci

* cont : prepare mem ranges for reuse + add ggml-metal-common.cpp

ggml-ci

* cont : avoid redundant keywords in cpp [no ci]

* metal : reorder graph for better concurrency

ggml-ci

* metal : fix race on mem pool buffers

ggml-ci

* cont : add env GGML_METAL_GRAPH_OPTIMIZE_DISABLE

ggml-ci

* cont : refactor, optimize, add comments

ggml-ci

* cont : refactor ggml-metal.m

ggml-ci

* minor : update logs [no ci]
2025-09-13 13:54:28 +03:00
Georgi Gerganov 84d7b2fca1
metal : fix memory leaks (#15962)
ggml-ci
2025-09-13 12:45:04 +03:00
Aaron Teo 40be51152d
ggml-zdnn: fix #15414, activate FP16 and BF16 acceleration and incorrect zTensor free (#15839) 2025-09-13 02:39:52 +08:00
Eric Curtin 4bf5549269
Add docker protocol support for llama-server model loading (#15790)
To pull and run models via: llama-server -dr gemma3
Add some validators and sanitizers for Docker Model urls and metadata

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
2025-09-12 16:31:50 +01:00
Haiyue Wang f4e664f838
context : remove redundant explicit casting to the same type (#15948)
The function 'output_reserve' already returns 'uint32_t', so there is
no need to add an explicit cast.
2025-09-12 18:16:32 +03:00
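The cleanup above is the trivial pattern sketched below, with a stand-in function rather than the actual context code: when a call already returns uint32_t, assigning it to a uint32_t variable needs no cast.

```cpp
#include <cstdint>

// Illustrative stand-in for the real member function.
static uint32_t output_reserve_stub(uint32_t n_outputs) {
    return n_outputs;
}

int main() {
    uint32_t a = (uint32_t) output_reserve_stub(32); // before: redundant cast
    uint32_t b = output_reserve_stub(32);            // after: cast removed
    return (int) (a - b);                            // 0
}
```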
Georgi Gerganov f088b6a84f
server : adjust prompt similarity thold + add logs (#15913)
ggml-ci
2025-09-12 17:02:55 +03:00
Ruben Ortlam 304ac5693d
Vulkan iGPU device selection overhaul and PCI ID API support (#15947)
* vulkan: implement ggml igpu device type, implement pci id support

* fix compiler warning

* prevent printf overflow warning
2025-09-12 13:24:21 +02:00
Mathieu Baudier 6c88ad8fa7
vulkan: Make device memory check more portable (#15939) 2025-09-12 09:06:20 +02:00
Neo Zhang Jianyu 704d90c987
Revert "sycl: add usage of enqueue_functions extension (#14244)" (#15910)
* Revert "sycl: add usage of enqueue_functions extension (#14244)"

This reverts commit 8308f98c7f.

* fix missed revert code, format the code
2025-09-12 09:15:12 +08:00