llama.cpp

Commit Graph

Author	SHA1	Message	Date
Gaurav Garg	b1b132efcb	cuda : enable CUDA Graph on CUDA Toolkit < 12.x (#12394 ) * Enable CUDA Graph on CTK < 12.x `cudaGraphExecUpdate` API was changed on 12.x. For this reason CUDA graph support was disabled on older CUDA toolkit. This change enables CUDA support in CTK version < 12.x by using older API if CTK < 12.x. * Fix compilation errors with MUSA * Disable CUDA Graph for MUSA	2025-03-17 20:25:13 +02:00
uvos	34c961b181	CUDA/HIP: Fix fattn-vec-* when device warp size is not 32 (#12315 ) When fattn-wmma was ported over to warp64 various bits that also touch fattn-vec where converted to selectable warp size, however the fattn-vec kernels dont work with 64 wide warps for now, so we need to avoid launching them with parameters for warp64	2025-03-12 10:14:11 +01:00
uvos	10f2e81809	CUDA/HIP: refractor mmqv to unify the calculation of nwarps and rows per block between host and device code. (#12177 ) refactor mmqv to unify the calculation of nwarps and rows per block between host and device code. --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-03-11 20:16:03 +01:00
Johannes Gäßler	5220a16d18	CUDA: fix FA logic for PTX 7.0 and CC >= 7.5 (#12222 )	2025-03-06 18:45:09 +01:00
uvos	e721c05c93	HIP/CUDA: set the paramerter value in maintain_cuda_graph instead of replaceing it. (#12209 ) This avoids conflict with internal cuda/hip runtimes memory managment behavior.	2025-03-06 08:20:52 +01:00
David Huang	becade5de7	HIP: implement FlashAttention via rocWMMA for CDNA and RDNA3+ (#12032 ) Adds GGML_HIP_ROCWMMA_FATTN and rocwmma header check Adds rocWMMA support to fattn-wmma-f16 --- Signed-off-by: Carl Klemm <carl@uvos.xyz> Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Ben Jackson <ben@ben.com>	2025-03-03 22:10:54 +01:00
cmdr2	b64d7cc272	cuda: unary ops as float + de-duplicate (ggml/1130)	2025-03-03 18:18:11 +02:00
cmdr2	0cbee131ad	cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129) ggml-ci	2025-03-03 18:18:11 +02:00
cmdr2	87abb7e903	cuda/cpu: Increase support for fp16 unary operations (ggml/1125) * Support fp16 unary operations in the CUDA backend * cpu: increase fp16 support for unary operators in the CPU backend * cuda: increase fp16 support for unary operators in the CUDA backend * Add test cases for fp16 unary operators * metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing * metal: fix PR comments for unary op support after fp16 unary tests	2025-03-03 18:18:11 +02:00
cmdr2	f54a4ba11e	Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121) * Support float16-to-float16 add/sub/mul/div operations in the CUDA backend * Add fp16 support for add/sub/mul/div on the CPU backend * Add test cases for fp16 add/sub/mul/div	2025-03-03 18:18:11 +02:00
Erik Scholz	80c41ddd8f	CUDA: compress mode option and default to size (#12029 ) cuda 12.8 added the option to specify stronger compression for binaries, so we now default to "size".	2025-03-01 12:57:22 +01:00
William Tambellini	70680c48e5	ggml : upgrade init_tensor API to return a ggml_status (#11854 ) * Upgrade init_tensor API to return a ggml_status To prepare for an 'abort-free' ggml (ggml not to abort on OOMs but return a OOM status), as agreeed with Diego in the ggml repo, upgrade the init_tensor() and view_init() APIs to return a ggml_status. * misc fixes --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-02-28 14:41:47 +01:00
Johannes Gäßler	9c42b1718c	CUDA: fix logic for V100 + GGML_CUDA_FORCE_MMQ (#12098 )	2025-02-28 09:26:43 +01:00
Johannes Gäßler	a28e0d5eb1	CUDA: app option to compile without FlashAttention (#12025 )	2025-02-22 20:44:34 +01:00
Johannes Gäßler	5fa07c2f93	CUDA: optimize FA for GQA + large batches (#12014 )	2025-02-22 12:20:17 +01:00
Gian-Carlo Pascutto	d70908421f	cuda: Add Q5_1, Q5_0, Q4_1 and Q4_0 to F32 conversion support. (#12000 )	2025-02-22 09:43:24 +01:00
PureJourney	ecc8e3aeff	CUDA: correct the lowest Maxwell supported by CUDA 12 (#11984 ) * CUDA: correct the lowest Maxwell supported by CUDA 12 --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-02-21 12:21:05 +01:00
Bodhi	0b3863ff95	MUSA: support ARM64 and enable dp4a .etc (#11843 ) * MUSA: support ARM64 and enable __dp4a .etc * fix cross entropy loss op for musa * update * add cc info log for musa * add comment for the MUSA .cc calculation block --------- Co-authored-by: Bodhi Hu <huaishun.hu@mthreads.com>	2025-02-21 09:46:23 +02:00
Johannes Gäßler	73e2ed3ce3	CUDA: use async data loading for FlashAttention (#11894 ) * CUDA: use async data loading for FlashAttention --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-17 14:03:24 +01:00
Diego Devesa	94b87f87b5	cuda : add ampere to the list of default architectures (#11870 )	2025-02-14 15:33:52 +01:00
R0CKSTAR	bd6e55bfd3	musa: bump MUSA SDK version to rc3.1.1 (#11822 ) * musa: Update MUSA SDK version to rc3.1.1 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * musa: Remove workaround in PR #10042 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2025-02-13 13:28:18 +01:00
uvos	5c4284d57b	HIP: Remove GCN from list of devices that avoid MMQ (#11831 )	2025-02-12 22:25:28 +01:00
uvos	e598697d63	HIP: Switch to std::vector in rocblas version check (#11820 )	2025-02-12 17:25:03 +01:00
Johannes Gäßler	c3d6af7cd2	CUDA: fix CUDART_VERSION checks (#11821 )	2025-02-12 13:16:39 +01:00
Johannes Gäßler	b9ab0a4d0b	CUDA: use arch list for compatibility check (#11775 ) * CUDA: use arch list for feature availability check --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-11 00:17:22 +01:00
Johannes Gäßler	d80be897ac	CUDA: fix min. version for movmatrix (#11751 )	2025-02-08 10:46:07 +01:00
Johannes Gäßler	fa62da9b2d	CUDA: support for mat. mul. with ne03 != ne13 (#11656 )	2025-02-05 08:58:31 +01:00
Johannes Gäßler	fd08255d0d	CUDA: non-contiguous (RMS) norm support (#11659 ) * CUDA: non-contiguous (RMS) norm support --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2025-02-04 22:21:42 +01:00
Johannes Gäßler	21c84b5d2d	CUDA: fix Volta FlashAttention logic (#11615 )	2025-02-03 14:25:56 +02:00
Johannes Gäßler	6eecde3cc8	HIP: fix flash_attn_stream_k_fixup warning (#11604 )	2025-02-02 23:48:29 +01:00
uvos	396856b400	CUDA/HIP: add support for selectable warp size to mmv (#11519 ) CUDA/HIP: add support for selectable warp size to mmv	2025-02-02 22:40:09 +01:00
uvos	4d0598e144	HIP: add GGML_CUDA_CC_IS_* for amd familys as increasing cc archtectures for amd gpus are not supersets of eatch other (#11601 ) This fixes a bug where RDNA1 gpus other than gfx1010 where not handled correctly	2025-02-02 22:08:05 +01:00
Johannes Gäßler	864a0b67a6	CUDA: use mma PTX instructions for FlashAttention (#11583 ) * CUDA: use mma PTX instructions for FlashAttention * __shfl_sync workaround for movmatrix * add __shfl_sync to HIP Co-authored-by: Diego Devesa <slarengh@gmail.com>	2025-02-02 19:31:09 +01:00
uvos	6af1ca48cb	HIP: Prepare reduction operators for wave 64	2025-01-30 16:25:44 +01:00
uvos	c300e68ef4	CUDA/HIP: add warp_size to cuda_device_info	2025-01-30 16:25:44 +01:00
uvos	be5ef7963f	HIP: Supress transformation warning in softmax.cu loops with bounds not known at compile time can not be unrolled. when ncols_template == 0, the bounds of the loop are not constexpr, thus llvm cant unroll the loops here.	2025-01-28 23:06:32 +01:00
Nikita Sarychev	cae9fb4361	HIP: Only call rocblas_initialize on rocblas versions with the multiple instantation bug (#11080 ) This disables the workaround on rocblas fixed versions (>=4.0.0) to eliminate the runtime cost and unnecessary VRAM allocation of loading all tensile objects.	2025-01-28 16:42:20 +01:00
Haus1	d6d24cd9ed	AMD: parse the architecture as supplied by gcnArchName (#11244 ) The value provided by minor doesn't include stepping for AMD, parse the value returned by gcnArchName instead to retrieve an accurate ID.	2025-01-27 14:58:17 +01:00
uvos	26771a1491	Hip: disable VMM on hip as it seams that it dosent work in some configurations (#11420 )	2025-01-25 21:01:12 +01:00
uvos	5f0db9522f	hip : Add hipGraph and VMM support to ROCM (#11362 ) * Add hipGraph support * Enable VMM on rocm	2025-01-25 00:02:23 +01:00
Johannes Gäßler	c5d9effb49	CUDA: fix FP16 cuBLAS GEMM (#11396 )	2025-01-24 21:02:43 +01:00
uvos	9fbadaef4f	rocBLAS: Avoid fp32->fp16->fp32 conversion on cdna (#11356 )	2025-01-24 17:50:49 +01:00
Johannes Gäßler	8137b4bb2b	CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380 )	2025-01-24 12:38:31 +01:00
Johannes Gäßler	9c8dcefe17	CUDA: backwards pass for misc. ops, add tests (#11257 ) * CUDA: backwards pass for misc. ops, add tests * remove restrict from pointers	2025-01-16 16:43:38 +01:00
Johannes Gäßler	432df2d5f9	RoPE: fix back, CUDA support for back + noncont. (#11240 ) * RoPE: fix back, CUDA support for back + noncont. * fix comments reg. non-cont. RoPE support [no-ci]	2025-01-15 12:51:37 +01:00
Andreas Kieslinger	39509fb082	cuda : CUDA Graph Compute Function Refactor (precursor for performance improvements) (#11042 ) * Refactor: Moves cuda graph executable update step to separate function. * Refactor: Moves cuda graph update check to separate function. * Refactor: Moves cuda graph maintenance (update or adjusting copy parameters) to separate function for improved readability. * Fix: Adds missing reference to maintain_cuda_graph() definition. * Refactor: Improves structure and abstractions by moving CUDA graph evaluation and capture to its own function. * Refactor: Moves node graph checks and copy ops into individual function for improved readability. * Refactor: Removes code permanently excluded from compilation to increase readability. * Style: Adds missing newline * Style: Consolidates several neighboring '#ifdef USE_CUDA_GRAPH' into a single one * Refactor: Makes 'cuda_graph_update_required' a local variable * remove double lines between functions --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-01-13 16:45:53 +01:00
Molly Sophia	ee7136c6d1	llama: add support for QRWKV6 model architecture (#11001 ) llama: add support for QRWKV6 model architecture (#11001) * WIP: Add support for RWKV6Qwen2 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV: Some graph simplification Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Add support for RWKV6Qwen2 with cpu and cuda GLA Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix some typos Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * code format changes Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix wkv test & add gla test Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Fix cuda warning Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update README.md Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * Update ggml/src/ggml-cuda/gla.cu Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Fix fused lerp weights loading with RWKV6 Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * better sanity check skipping for QRWKV6 in llama-quant thanks @compilade Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: compilade <git@compilade.net> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: compilade <git@compilade.net>	2025-01-10 09:58:08 +08:00
hydai	8d59d91171	fix: add missing msg in static_assert (#11143 ) Signed-off-by: hydai <z54981220@gmail.com>	2025-01-08 20:03:28 +00:00
Johannes Gäßler	46e3556e01	CUDA: add BF16 support (#11093 ) * CUDA: add BF16 support	2025-01-06 02:33:52 +01:00
HimariO	ba1cb19cdd	llama : add Qwen2VL support + multimodal RoPE (#10361 ) * Barebone Qwen2VL LLM convertor * Add Qwen2VL cli entrypoint * [WIP] add qwen2vl arch * Verify m-rope output * Add vl-rope/2d-rope support for qwen2vl ViT * update qwen2vl cli tool * update 5D tensor op workaround * [WIP] qwen2vl vision model * make batch and clip utils compatible with qwen2vl * [WIP] create inference workflow, gguf convert script but fix * correcting vision-rope behavior, add the missing last layer back to ViT * add arg parser to qwen2vl_surgery * replace variable size array with vector * cuda-gdb cmake preset * add fp32 mrope, vision rope kernel * add fp16 support for qwen2vl and m-rope * add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION` * fix rope op mode switching, out dated func args * update `llama_hparams` * update to keep up stream changes * resolve linter, test errors * add makefile entry, update speical image padding token * add mrope unit test, fix few compiler warnings * rename `mrope` related function, params * minor updates on debug util, bug fixs * add `m-rope` testcase to `test-backend-ops` * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix traililng whitespce * store `llama_hparams.rope_sections` with fixed size array * update position id tensor size check in GGML_OP_ROPE * minor updates * update `ggml_backend__supports_op` of unsupported backends remote old `rope_section` compare operator --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-14 14:43:46 +02:00
a3sh	8faa1d4dd4	CUDA: faster non-contiguous concat (#10760 ) * faster uncontiguous concat * Use a lambda to avoid code duplication Co-authored-by: Diego Devesa <slarengh@gmail.com> * Update ggml/src/ggml-cuda/concat.cu * add constexpr and static assert --------- Co-authored-by: Diego Devesa <slarengh@gmail.com>	2024-12-12 19:09:50 +01:00
Andreas Kieslinger	750cb3e246	CUDA: rename macros to avoid conflicts with WinAPI (#10736 ) * Renames NVIDIA GPU-architecture flags to avoid name clashes with WinAPI. (e.g. CC_PASCAL, GPU architecture or WinAPI pascal compiler flag?) * Reverts erroneous rename in SYCL-code. * Renames GGML_CUDA_MIN_CC_DP4A to GGML_CUDA_CC_DP4A. * Renames the rest of the compute capability macros for consistency.	2024-12-10 18:23:24 +01:00
Johannes Gäßler	26a8406ba9	CUDA: fix shared memory access condition for mmv (#10740 )	2024-12-09 20:07:12 +01:00
Djip007	19d8762ab6	ggml : refactor online repacking (#10446 ) * rename ggml-cpu-aarch64.c to .cpp * reformat extra cpu backend. - clean Q4_0_N_M and IQ4_0_N_M - remove from "file" tensor type - allow only with dynamic repack - extract cpu extra bufts and convert to C++ - hbm - "aarch64" - more generic use of extra buffer - generalise extra_supports_op - new API for "cpu-accel": - amx - aarch64 * clang-format * Clean Q4_0_N_M ref Enable restrict on C++ * add op GGML_OP_MUL_MAT_ID for Q4_0_N_M with runtime repack * added/corrected control on tensor size for Q4 repacking. * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * Update ggml/src/ggml-cpu/ggml-cpu-aarch64.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * add debug logs on repacks. --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-12-07 14:37:50 +02:00
mahorozte	e9e661bd59	CUDA: remove unnecessary warp reduce in FA (ggml/1032) * kqmax_new_j in every thread within warp is same after operate at line 199,this reduce can be omit * same problem in vec32 --------- Co-authored-by: ZhaoXiaoYu <zhao.xiaoyu@zte.com.cn>	2024-12-03 20:04:49 +02:00
uvos	3ad5451f3b	Add some minimal optimizations for CDNA (#10498 ) * Add some minimal optimizations for CDNA * ggml_cuda: set launch bounds also for GCN as it helps there too	2024-11-27 17:10:08 +01:00
Georgi Gerganov	ab96610b1e	cmake : enable warnings in llama (#10474 ) * cmake : enable warnings in llama ggml-ci * cmake : add llama_get_flags and respect LLAMA_FATAL_WARNINGS * cmake : get_flags -> ggml_get_flags * speculative-simple : fix warnings * cmake : reuse ggml_get_flags ggml-ci * speculative-simple : fix compile warning ggml-ci	2024-11-26 14:18:08 +02:00
Diego Devesa	5931c1f233	ggml : add support for dynamic loading of backends (#10469 ) * ggml : add support for dynamic loading of backends --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-11-25 15:13:39 +01:00
Diego Devesa	a5e47592b6	cuda : optimize argmax (#10441 ) * cuda : optimize argmax * remove unused parameter ggml-ci * fixup : use full warps ggml-ci * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * fix ub * ggml : check ne00 <= INT32_MAX in argmax and argsort --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-11-21 18:18:50 +01:00
Diego Devesa	3ee6382d48	cuda : fix CUDA_FLAGS not being applied (#10403 )	2024-11-19 14:29:38 +01:00
Diego Devesa	d3481e6316	cuda : only use native when supported by cmake (#10389 )	2024-11-18 18:43:40 +01:00
Johannes Gäßler	76e9e58b78	CUDA: fix MMV kernel being used for FP16 src1 (#10357 )	2024-11-17 23:20:42 +01:00
Johannes Gäßler	ce2e59ba10	CMake: fix typo in comment [no ci] (#10360 )	2024-11-17 12:59:38 +01:00
Johannes Gäßler	c3ea58aca4	CUDA: remove DMMV, consolidate F16 mult mat vec (#10318 )	2024-11-17 09:09:55 +01:00
Johannes Gäßler	467576b6cc	CMake: default to -arch=native for CUDA build (#10320 )	2024-11-17 09:06:34 +01:00
Johannes Gäßler	8a43e940ab	ggml: new optimization interface (ggml/988)	2024-11-17 08:30:29 +02:00
Diego Devesa	ae8de6d50a	ggml : build backends as libraries (#10256 ) * ggml : build backends as libraries --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>	2024-11-14 18:04:35 +01:00
SXX	5b359bb1e3	ggml: fix zero division in ‘dne’ calculation in CUDA COUNT_EQUAL operator when ‘ne’ is small (#10213 )	2024-11-09 08:35:46 +01:00
Georgi Gerganov	841f27abdb	metal : optimize FA kernels (#10171 ) * ggml : add ggml_flash_attn_ext_get_prec * metal : use F16 precision in FA kernels ggml-ci * metal : minor clean-up * metal : compile-guard bf16 FA kernels ggml-ci * build : remove obsolete compile flag [no ci] * metal : prevent int overflows [no ci] * cuda : disable BF16 FA ggml-ci * metal : fix BF16 requirement for FA kernels ggml-ci * make : clean-up [no ci]	2024-11-08 13:47:22 +02:00
Zhiyuan Li	3bcd40b3c5	Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133 ) * rwkv6: rename to wkv6 * rwkv6: support avx2 avx512 armv8 armv9 * rwkv6: update cuda file name * rwkv6: rename params * wkv on sycl * sycl: add some ops * sycl: Enhance OP support judgment * wkv6: drop armv9 and tranfer to GGML style ggml-ci * sync : ggml * update the function to use appropriate types * fix define error * Update ggml/src/ggml-cpu.c * add appropriate asserts * move element-wise functions outside * put the declaration outside the loop * rewrite to be more inline with the common pattern for distributing threads * use recommended way GGML_TENSOR_LOCALS --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Diego Devesa <slarengh@gmail.com> Co-authored-by: Plamen Minev <pacominev@gmail.com> Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com> Co-authored-by: Meng, Hengyu <airdldl@163.com>	2024-11-07 15:19:10 +08:00
bssrdf	8c60a8a462	increase cuda_cpy block size (ggml/996) Co-authored-by: bssrdf <bssrdf@gmail.com>	2024-10-26 10:33:56 +03:00
Johannes Gäßler	c39665f589	CUDA: fix MMQ for non-contiguous src0, add tests (#10021 ) * CUDA: fix MMQ for non-contiguous src0, add tests * revise test code	2024-10-24 11:09:36 +02:00
Johannes Gäßler	80273a306d	CUDA: fix 1D im2col, add tests (ggml/993)	2024-10-23 16:50:02 +03:00
agray3	13dca2a54a	Vectorize load instructions in dmmv f16 CUDA kernel (#9816 ) * Vectorize load instructions in dmmv f16 CUDA kernel Replaces scalar with vector load instructions, which substantially improves performance on NVIDIA HBM GPUs, e.g. gives a 1.27X overall speedup for Meta-Llama-3-8B-Instruct-F16 BS1 inference evaluation on H100 SXM 80GB HBM3. On GDDR GPUs, there is a slight (1.01X) speedup. * addressed comment * Update ggml/src/ggml-cuda/dmmv.cu Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-10-14 02:49:08 +02:00
Johannes Gäßler	fabdc3bda3	ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980)	2024-10-03 21:17:26 +03:00
Johannes Gäßler	aaa4099925	CUDA: remove bad assert (ggml/972)	2024-09-29 21:15:37 +03:00
Ivan	116efee0ee	cuda: add q8_0->f32 cpy operation (#9571 ) llama: enable K-shift for quantized KV cache It will fail on unsupported backends or quant types.	2024-09-24 02:14:24 +02:00
R0CKSTAR	c35e586ea5	musa: enable building fat binaries, enable unified memory, and disable Flash Attention on QY1 (MTT S80) (#9526 ) * mtgpu: add mp_21 support Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: disable flash attention on qy1 (MTT S80); disable q3_k and mul_mat_batched_cublas Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: enable unified memory Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * mtgpu: map cublasOperation_t to mublasOperation_t (sync code to latest) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-22 16:55:49 +02:00
Johannes Gäßler	a5b57b08ce	CUDA: enable Gemma FA for HIP/Pascal (#9581 )	2024-09-22 09:34:52 +02:00
Molly Sophia	2a63caaa69	RWKV v6: RWKV_WKV op CUDA implementation (#9454 ) * ggml: CUDA unary op EXP Signed-off-by: Molly Sophia <mollysophia379@gmail.com> * ggml: rwkv_wkv op CUDA impl Signed-off-by: Molly Sophia <mollysophia379@gmail.com> --------- Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-09-22 04:29:12 +02:00
agray3	41f477879f	Update CUDA graph on scale change plus clear nodes/params (#9550 ) * Avoid using saved CUDA graph if scale changes and reset nodes/params on update Fixes https://github.com/ggerganov/llama.cpp/issues/9451 * clear before resize	2024-09-21 02:41:07 +02:00
Georgi Gerganov	d13edb17ed	ggml : fix builds (#0 ) ggml-ci	2024-09-20 21:15:05 +03:00
Johannes Gäßler	424c5d00a9	ggml/examples: add backend support for numerical optimization (ggml/949) * CUDA eval works * stochastic gradient descent op * Adam except decay * CUDA CROSS_ENTROPY_LOSS_BACK * CUDA mnist-fc training works * backend CLI arg * refactor gguf load * remove sched from opt_step_adam * implement l1 regularization (weight decay) * extra call to add optimizer * initialize gradients with ggml_graph_reset * gradient accumulation * increment iter per eval instead of epoch * adjust backend interfaces * fix ggml_graph_reset without backend * fix ggml graph export/import * fixup * rename * revert ggml_opt changes * more general CUDA repeat_back * update documentation, fix CNN * validation split * add clarifying comment * optimize PyTorch training * adjust buffer size, thread count * fix 0.0f validation split * Update examples/mnist/mnist-common.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * fix gradient accumulation * tensor flag for accumulators -> tensor hash set * Update include/ggml.h Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * Update tests/test-backend-ops.cpp Co-authored-by: slaren <slarengh@gmail.com> * fix test prints * Update src/ggml-backend.c Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * better CUDA support for noncontiguous out_prod * add comment --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: slaren <slarengh@gmail.com>	2024-09-20 21:15:05 +03:00
Johannes Gäßler	5cb12f6839	CUDA: fix sum.cu compilation for CUDA < 11.7 (#9562 )	2024-09-20 18:35:35 +02:00
Johannes Gäßler	5af118efda	CUDA: fix --split-mode row race condition (#9413 )	2024-09-11 10:22:40 +02:00
R0CKSTAR	b34e023480	musa: remove Clang builtins mapping (#9421 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-09-11 03:46:55 +02:00
Johannes Gäßler	8e6e2fbe14	CUDA: fix variable name conflict for Windows build (#9382 )	2024-09-09 14:22:53 +02:00
Georgi Gerganov	e079bffb66	cuda : fix FA Q src index (1 -> 0) (#9374 )	2024-09-08 22:01:02 +03:00
Johannes Gäßler	202084d31d	tests: add gradient tests for all backends (ggml/932) * tests: add gradient checking to test-backend-ops * remove old comment * reorder includes * adjust SIN/COS parameters * add documentation, use supports_op if possible	2024-09-08 11:05:55 +03:00
slaren	4db04784f9	cuda : fix defrag with quantized KV (#9319 )	2024-09-05 11:13:11 +02:00
Georgi Gerganov	231cff5f6f	sync : ggml	2024-08-27 22:41:27 +03:00
Johannes Gäßler	e11bd856d5	CPU/CUDA: Gemma 2 FlashAttention support (#8542 ) * CPU/CUDA: Gemma 2 FlashAttention support * apply logit_softcap to scale in kernel * disable logit softcapping tests on Metal * remove metal check	2024-08-24 21:34:59 +02:00
Daniel Bevenius	06943a69f6	ggml : move rope type enum to ggml.h (#8949 ) * ggml : move rope type enum to ggml.h This commit moves the `llama_rope_type` enum from `llama.h` to `ggml.h` and changes its name to `ggml_rope_type`. The motivation for this change is to address the TODO in `llama.h` and use the enum in ggml. Note: This commit does not change the `mode` parameter to be of type `enum ggml_rope_type`. The name `mode` and its usage suggest that it might be more generic and possibly used as a bit field for multiple flags. Further investigation/discussion may be needed to determine if `mode` should be restricted to RoPE types. * squash! ggml : move rope type enum to ggml.h This commit removes GGML_ROPE_TYPE_NONE and GGML_ROPE_TYPE_GLM from ggml.h, and back the llama_rope_type enum. I've kept the assert for GGML_ROPE_TYPE_GLM as I'm not sure if it is safe to remove it yet. * squash! ggml : move rope type enum to ggml.h This commit removes the enum ggml_rope_type from ggml.h and replaces it with a define (GGML_ROPE_TYPE_NEOX). This define is used in the code to check if the mode is set to GPT-NeoX. Also the enum llama_rope_type has been updated to reflect this change. * squash! ggml : move rope type enum to ggml.h This commit contains a suggestion enable the GGML_ROPE_TYPE_NEOX macro/define to be passed to the shader compiler. * squash! ggml : move rope type enum to ggml.h This commit fixes the editorconfig-checker warnings. * squash! ggml : move rope type enum to ggml.h Update comment for ggml_rope function. * Revert "squash! ggml : move rope type enum to ggml.h" This reverts commit `6261222bd0`. * squash! ggml : move rope type enum to ggml.h Add GGML_ROPE_TYPE_NEOX to rope_common.comp. * remove extra line --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-08-13 21:13:15 +02:00
Molly Sophia	2d5dd7bb3f	ggml : add epsilon as a parameter for group_norm (#8818 ) Signed-off-by: Molly Sophia <mollysophia379@gmail.com>	2024-08-06 10:26:46 +03:00
slaren	7a11eb3a26	cuda : fix dmmv cols requirement to 2GGML_CUDA_DMMV_X (#8800 ) cuda : fix dmmv cols requirement to 2GGML_CUDA_DMMV_X update asserts * only use dmmv for supported types * add test	2024-08-01 15:26:22 +02:00
R0CKSTAR	439b3fc75a	cuda : organize vendor-specific headers into vendors directory (#8746 ) Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-29 14:56:12 +02:00
R0CKSTAR	e54c35e4fb	feat: Support Moore Threads GPU (#8383 ) * Update doc for MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in Makefile Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Add GGML_MUSA in CMake Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * CUDA => MUSA Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * MUSA adds support for __vsubss4 Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> * Fix CI build failure Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com> --------- Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>	2024-07-28 01:41:25 +02:00
slaren	2b1f616b20	ggml : reduce hash table reset cost (#8698 ) * ggml : reduce hash table reset cost * fix unreachable code warnings after GGML_ASSERT(false) * GGML_ASSERT(false) -> GGML_ABORT("fatal error") * GGML_ABORT use format string	2024-07-27 04:41:55 +02:00
Jeroen Mostert	46e47417aa	Allow all RDNA2 archs to use sdot4 intrinsic (#8629 ) The check gating the use of `__builtin_amdgc_sdot4` specifically checks for gfx1030. This causes a severe perf regression for anything gfx103? that's not gfx1030 and not using `HSA_OVERRIDE_GFX_VERSION` (if you've built ROCm to support it). We already have a generic RDNA2 define, let's use it.	2024-07-23 10:50:40 +02:00
Johannes Gäßler	69c487f4ed	CUDA: MMQ code deduplication + iquant support (#8495 ) * CUDA: MMQ code deduplication + iquant support * 1 less parallel job for CI build	2024-07-20 22:25:26 +02:00
Daniel Bevenius	b078c619aa	cuda : suppress 'noreturn' warn in no_device_code (#8414 ) * cuda : suppress 'noreturn' warn in no_device_code This commit adds a while(true) loop to the no_device_code function in common.cuh. This is done to suppress the warning: ```console /ggml/src/ggml-cuda/template-instances/../common.cuh:346:1: warning: function declared 'noreturn' should not return [-Winvalid-noreturn] 346 \| } \| ^ ``` The motivation for this is to reduce the number of warnings when compilng with GGML_HIPBLAS=ON. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> * squash! cuda : suppress 'noreturn' warn in no_device_code Update __trap macro instead of using a while loop to suppress the warning. Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com> --------- Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>	2024-07-11 17:53:42 +02:00
Johannes Gäßler	808aba3916	CUDA: optimize and refactor MMQ (#8416 ) * CUDA: optimize and refactor MMQ * explicit q8_1 memory layouts, add documentation	2024-07-11 16:47:47 +02:00
John Balis	fde13b3bb9	feat: cuda implementation for `ggml_conv_transpose_1d` (ggml/854) * conv transpose 1d passing test for 1d input and kernel * working for different input and output channel counts, added test for variable stride * initial draft appears to work with stride other than 1 * working with all old and new conv1d tests * added a test for large tensors * removed use cuda hardcoding * restored test-conv-transpose.c * removed unused arugments, and fixed bug where test failure would cause subsequent tests to fail * fixed accumulator bug * added test to test-backend-ops * fixed mistake * addressed review * fixed includes * removed blank lines * style and warning fixes * return failure when test fails * fix supports_op --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-07-08 12:23:00 +03:00
Johannes Gäßler	8e558309dc	CUDA: MMQ support for iq4_nl, iq4_xs (#8278 )	2024-07-05 09:06:31 +02:00
Daniele	0a423800ff	CUDA: revert part of the RDNA1 optimizations (#8309 ) The change on the launch_bounds was causing a small performance drop in perplexity of 25 t/s	2024-07-05 09:06:09 +02:00
Johannes Gäßler	bcefa03bc0	CUDA: fix MMQ stream-k rounding if ne00 % 128 != 0 (#8311 )	2024-07-05 09:05:34 +02:00
Daniele	d23287f122	Define and optimize RDNA1 (#8085 )	2024-07-04 01:02:58 +02:00
Clint Herron	07a3fc0608	Removes multiple newlines at the end of files that is breaking the editorconfig step of CI. (#8258 )	2024-07-02 12:18:10 -04:00
Johannes Gäßler	cb5fad4c6c	CUDA: refactor and optimize IQ MMVQ (#8215 ) * CUDA: refactor and optimize IQ MMVQ * uint -> uint32_t * __dp4a -> ggml_cuda_dp4a * remove MIN_CC_DP4A checks * change default * try CI fix	2024-07-01 20:39:06 +02:00
Johannes Gäßler	85a267daaa	CUDA: fix MMQ stream-k for --split-mode row (#8167 )	2024-06-27 16:26:05 +02:00
Georgi Gerganov	f3f65429c4	llama : reorganize source code + improve CMake (#8006 ) * scripts : update sync [no ci] * files : relocate [no ci] * ci : disable kompute build [no ci] * cmake : fixes [no ci] * server : fix mingw build ggml-ci * cmake : minor [no ci] * cmake : link math library [no ci] * cmake : build normal ggml library (not object library) [no ci] * cmake : fix kompute build ggml-ci * make,cmake : fix LLAMA_CUDA + replace GGML_CDEF_PRIVATE ggml-ci * move public backend headers to the public include directory (#8122) * move public backend headers to the public include directory * nix test * spm : fix metal header --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * scripts : fix sync paths [no ci] * scripts : sync ggml-blas.h [no ci] --------- Co-authored-by: slaren <slarengh@gmail.com>	2024-06-26 18:33:02 +03:00

... 4 5 6 7 8

361 Commits