llama.cpp

History

shalinib-ibm a6a58d6478 llamafile: PowerPC Sgemm Optimization (#15558)		2025-08-26 23:35:25 +08:00

This patch improves GEMM for the FP32 data type on PowerPC.

- Implements GEMM on large blocks with configurable block sizes mc, nc, kc (default: 256, 256, 256).
- Packing function optimized to access blocks as per memory layout.
- GEMM optimized to work on larger blocks.
- Isolated packing from GEMM operations for better MMA utilization.

Verified functionality and correctness using llama-cli and a standalone test case (performs matmul and compares the final matrix C result with the base).

Minor code refactoring changes:

- Replace macro with inline function
- Code indent made consistent with 4 spaces

Performance testing:

Observed 50% ~ 70% improvement in prompt processing speed measured using llama-bench with the Meta-Llama3-8B FP32 model. Similar gains observed with the Mistral-7b-Instruct-v0.3 model.

model	Size	Params	Backend	Threads	Test	Patch	Base
llama 8B all F32	29.92 GiB	8.03 B	CPU	20	pp512	98.58	60.3
llama 8B all F32	29.92 GiB	8.03 B	CPU	20	pp1024	95.88	57.36
llama 8B all F32	29.92 GiB	8.03 B	CPU	20	pp2048	85.46	53.26
llama 8B all F32	29.92 GiB	8.03 B	CPU	20	pp4096	68.66	45.78
llama 8B all F32	29.92 GiB	8.03 B	CPU	20	pp6144	57.35	40.44

25 ~ 30% improvement in llama-batched-bench with Meta-Llama3-8B in prompt processing speed for large prompts (256, 512, 1024, 2048, 4096 tokens) with various batch sizes (1, 2, 4, 8, 16).

Signed-off-by: Shalini Salomi Bodapati <Shalini.Salomi.Bodapati@ibm.com>
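
The blocking and packing scheme named in the commit message can be illustrated with a portable sketch. This is not the code from the patch: the real kernel targets PowerPC MMA instructions, while the version below uses plain scalar loops. The function names `pack_tile`, `sgemm_blocked` and `sgemm_reference`, the row-major layout, and the loop order are all assumptions made for illustration; only the block-size names mc, nc, kc and their default of 256 come from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a rows x cols tile of a row-major matrix
// (leading dimension ld) into a contiguous buffer, so the compute loop
// below reads memory sequentially.
static void pack_tile(const float* src, std::size_t ld,
                      std::size_t rows, std::size_t cols, float* dst) {
    for (std::size_t r = 0; r < rows; ++r) {
        std::copy(src + r * ld, src + r * ld + cols, dst + r * cols);
    }
}

// Hypothetical blocked SGEMM: C (m x n) += A (m x k) * B (k x n), row-major.
// Packing is kept separate from the multiply, as the commit message describes.
void sgemm_blocked(const float* A, const float* B, float* C,
                   std::size_t m, std::size_t n, std::size_t k,
                   std::size_t mc = 256, std::size_t nc = 256,
                   std::size_t kc = 256) {
    assert(mc > 0 && nc > 0 && kc > 0);
    std::vector<float> Ap(mc * kc);  // packed tile of A
    std::vector<float> Bp(kc * nc);  // packed tile of B

    for (std::size_t jc = 0; jc < n; jc += nc) {
        const std::size_t nb = std::min(nc, n - jc);
        for (std::size_t pc = 0; pc < k; pc += kc) {
            const std::size_t kb = std::min(kc, k - pc);
            pack_tile(B + pc * n + jc, n, kb, nb, Bp.data());
            for (std::size_t ic = 0; ic < m; ic += mc) {
                const std::size_t mb = std::min(mc, m - ic);
                pack_tile(A + ic * k + pc, k, mb, kb, Ap.data());
                // Multiply the two packed tiles and accumulate into C.
                for (std::size_t i = 0; i < mb; ++i) {
                    for (std::size_t p = 0; p < kb; ++p) {
                        const float a = Ap[i * kb + p];
                        for (std::size_t j = 0; j < nb; ++j) {
                            C[(ic + i) * n + jc + j] += a * Bp[p * nb + j];
                        }
                    }
                }
            }
        }
    }
}

// Naive triple loop, used as the baseline to compare the final matrix C.
void sgemm_reference(const float* A, const float* B, float* C,
                     std::size_t m, std::size_t n, std::size_t k) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] += acc;
        }
    }
}
```

A standalone check in the spirit of the one mentioned in the commit message would fill A and B, zero two output buffers, call `sgemm_blocked` on one and `sgemm_reference` on the other, and compare the results element by element. Choosing block sizes that do not divide m, n and k exercises the edge tiles.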
ggml-blas	ggml : fix field name when new ggml_backend (#14944)	2025-08-08 14:37:22 +02:00
ggml-cann	CANN: ROPE cache sin/cos repeat (#15501)	2025-08-25 10:32:21 +08:00
ggml-cpu	llamafile: PowerPC Sgemm Optimization (#15558)	2025-08-26 23:35:25 +08:00
ggml-cuda	CUDA: return -1 for nonexistent compiled arch (#15587)	2025-08-26 16:01:20 +02:00
ggml-hip	HIP: bump requirement to rocm 6.1 (#15296)	2025-08-13 20:44:30 +02:00
ggml-metal	metal : optimize FA vec for large sequences and BS <= 8 (#15566)	2025-08-26 14:22:14 +03:00
ggml-musa	CUDA: replace GGML_CUDA_F16 with CUDA arch checks (#15433)	2025-08-20 16:58:49 +02:00
ggml-opencl	opencl: fix support ops condition for `rms_norm` (#15560)	2025-08-25 14:18:09 -07:00
ggml-rpc	ggml-rpc: chunk send()/recv() to avoid EINVAL for very large tensors over RPC (macOS & others) (#15188)	2025-08-13 08:54:30 +03:00
ggml-sycl	vulkan : support ggml_mean (#15393)	2025-08-23 08:35:21 +02:00
ggml-vulkan	vulkan: Remove splitting for mul_mat_id (#15568)	2025-08-26 06:42:44 +02:00
ggml-webgpu	ggml WebGPU: add support for quantization types (#15440)	2025-08-22 11:28:03 -07:00
ggml-zdnn	ggml: initial IBM zDNN backend (#14975)	2025-08-15 21:11:22 +08:00
CMakeLists.txt	ggml: initial IBM zDNN backend (#14975)	2025-08-15 21:11:22 +08:00
ggml-alloc.c	llama : add gpt-oss (#15091)	2025-08-05 22:10:36 +03:00
ggml-backend-impl.h	ggml : upgrade init_tensor API to return a ggml_status (#11854)	2025-02-28 14:41:47 +01:00
ggml-backend-reg.cpp	ggml: initial IBM zDNN backend (#14975)	2025-08-15 21:11:22 +08:00
ggml-backend.cpp	sched : fix possible use of wrong ids tensor when offloading moe prompt processing (#15488)	2025-08-21 23:09:32 +02:00
ggml-common.h	llama : add gpt-oss (#15091)	2025-08-05 22:10:36 +03:00
ggml-impl.h	llama : add gpt-oss (#15091)	2025-08-05 22:10:36 +03:00
ggml-opt.cpp	finetune: SGD optimizer, more CLI args (#13873)	2025-08-14 12:03:57 +02:00
ggml-quants.c	ggml-quants : fix make_qp_quants NANs and IQ1 assertion errors (#15379)	2025-08-18 09:23:56 +02:00
ggml-quants.h	llama : add gpt-oss (#15091)	2025-08-05 22:10:36 +03:00
ggml-threading.cpp	ggml : build backends as libraries (#10256)	2024-11-14 18:04:35 +01:00
ggml-threading.h	remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797)	2024-12-12 19:02:49 +01:00
ggml.c	ggml: add `conv3d` op (#15182)	2025-08-22 15:33:15 +02:00
ggml.cpp	ggml : Print backtrace on uncaught C++ exceptions (ggml/1232)	2025-06-01 13:43:57 +03:00
gguf.cpp	ggml : prevent integer overflow in gguf tensor size calculation (#14595)	2025-07-09 14:33:53 +02:00