llama.cpp

Commit Graph

Author	SHA1	Message	Date
Victor Villar	58c81f7e81	model : fix Granite Hybrid type check for 7B.A1B (#20795 ) * Check granite hybrid expert count to set type as LLM_TYPE_7B_A1B or LLM_TYPE_1B * Use feed fwd dim instead of num of experts Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-20 15:16:09 +01:00
Xuan-Son Nguyen	fb78ad29bb	server: (doc) clarify in-scope and out-scope features (#20794 ) * server: (doc) clarify in-scope and out-scope features * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 14:03:50 +01:00
Jeff Bolz	e06c3ab2bc	vulkan: change gated_delta_net to shard a column across a subgroup (#20662 ) * vulkan: change gated_delta_net to shard a column across a subgroup This is based on https://github.com/ggml-org/llama.cpp/pull/20391, I used an LLM to port the CUDA code to Vulkan, and guided it to make various fixes to work with Vulkan (e.g. handling different subgroup sizes, unknown mapping of subgroup to invocation id, using subgroupAdd optionally, etc.). This fixes a perf regression from the transposing of the values in memory (!20443). * vulkan: Spread columns across fewer lanes to reduce the number of workgroups	2026-03-20 12:17:15 +01:00
Ruikai Peng	dc6592431b	context: zero output buffer on allocation (#20781 ) * context: zero output buffer on allocation Address GHSA-wqq9-25mr-rw76. The logits output buffer allocated in output_reserve() uses posix_memalign(), which does not zero memory. The buffer is only written during decode when needs_raw_logits() returns true. When backend samplers cover all output sequences, needs_raw_logits() returns false and the buffer is never written, but llama_get_logits() still returns a pointer to it, exposing stale heap content. Zero the buffer after allocation to prevent information disclosure through the public logits API. Found-by: Pwno * Update src/llama-context.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-20 11:31:34 +02:00
Ruikai Peng	3adbef7776	model: assert nextn_predict_layers to prevent underflow (#20783 ) Address GHSA-645x-v54x-34w8. When nextn_predict_layers >= n_layer, n_layer - nextn_predict_layers can underflow (unsigned wrap), which corrupts n_layer_kv_from_start. Assert nextn_predict_layers immediately after parsing the GGUF key. Found-by: Pwno	2026-03-20 10:17:58 +01:00
Georgi Gerganov	ab9d4c3678	server : improve mtmd ctx checkpoints (#20726 ) * server : improve mtmd ctx checkpoints * server : fix off-by-one in pos_min_thold	2026-03-20 11:13:12 +02:00
hipudding	1af9dab32b	CANN: add BF16 support for core operators (#20152 ) * CANN: add BF16 support for core operators Add BF16 (bfloat16) type support to the CANN backend for the following operators: MUL_MAT, MUL_MAT_ID, GET_ROWS, SET_ROWS, CPY, CONT, and OUT_PROD. This enables BF16 models to run on Ascend NPUs. * CANN: skip NZ weight format for BF16 and add 310P compile guards NZ weight format conversion does not support BF16 tensors, skip it in set_tensor, get_alloc_size and mul_mat. Remove BF16 from MUL_MAT_ID and OUT_PROD as there are no BF16 use cases. Add #ifndef ASCEND_310P guards for all BF16 operator support since 310P does not support BF16.	2026-03-20 17:08:39 +08:00
Seyoung Jeong	6d99b44c7e	docs : fix Metal backend op support status in ops.md (#20779 ) Regenerate docs/ops/Metal.csv using test-backend-ops on Apple M5 and rebuild docs/ops.md via scripts/create_ops_docs.py. Five ops were incorrectly marked as not supported (❌) for Metal: - DIAG: ❌ → ✅ - POOL_1D: ❌ → ✅ - SET: ❌ → ✅ - SOLVE_TRI: ❌ → ✅ - GATED_DELTA_NET:❌ → 🟡 (partial, depends on head_size % 32)	2026-03-20 11:06:38 +02:00
Georgi Gerganov	464fd0e71f	ai : update find-related action (#20790 ) * ai : update "related issues" prompt * cont * cont * cont	2026-03-20 10:28:14 +02:00
Ruikai Peng	21c8045214	jinja : fix heap OOB read in value equality comparison (#20782 ) Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp. The three-iterator overload of std::equal in value_array_t::equivalent() and value_object_t::equivalent() reads past the end of the shorter container when comparing arrays or objects of different lengths. Use the four-iterator overload (C++14) which checks both range lengths. Found-by: Pwno	2026-03-20 07:15:17 +01:00
James O'Leary	c46583b86b	common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777 ) * chat : fix out_of_range crash in throw path (#20424 regression) #20424 introduced effective_input = generation_prompt + input, but the throw path uses input.substr(result.end) where result.end is a position within effective_input. Every thinking model with a non-empty generation_prompt crashes with std::out_of_range instead of the intended error message. Test crashes on unpatched master, passes with fix: cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF cmake --build build --target test-chat ./build/bin/test-chat * Update test-chat.cpp * Update test-chat.cpp * Update test-chat.cpp --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>	2026-03-20 02:37:22 +01:00
Ben Racicot	c1b911654a	server: fix router mode deadlock on child crash and TOCTOU race in models_max (#20763 ) Two bugs in `server_models::load()` that affect router mode reliability: Bug 1: Deadlock when child process crashes When a child process is killed (e.g., SIGKILL from OS code signature validation), the monitoring thread deadlocks on `stopping_thread.join()` because the stopping_thread's wait predicate (`is_stopping`) is never satisfied — the model name was never inserted into `stopping_models`. `update_status()` is never reached and the model stays stuck in LOADING state permanently. Fix: extend the stopping_thread's wait predicate to also wake when the child process is no longer alive (`!subprocess_alive()`). When woken by a dead child, the thread skips the shutdown sequence and returns immediately. The original `stopping_models.erase()` logic is preserved for normal unloads. Bug 2: TOCTOU race bypasses `--models-max` (ref #20137) `unload_lru()` is called outside the mutex, then `load()` acquires the lock afterward. Under concurrent requests, multiple threads observe capacity and all proceed to load, exceeding the limit. Fix: re-check capacity under the lock after `unload_lru()` returns. If another thread filled the slot in the window between `unload_lru()` and the lock acquisition, reject with an error instead of silently exceeding the limit.	2026-03-19 22:16:05 +01:00
Tomeamis	b739738dad	docs: Update server README to reflect PR #20297 (#20560 )	2026-03-19 21:28:44 +01:00
Sundaram krishnan	a0bbcdd9b6	ggml: guard KleidiAI DOWNLOAD_EXTRACT_TIMESTAMP for cmake < 3.24 (#20767 )	2026-03-19 21:36:23 +02:00
Georgi Gerganov	6c72646a61	ci : improve action for duplicate issue (#20772 ) * ci : show thinking traces of the agent * cont : increase thinking * cont : remove agent files * cont : move the model selection to the provider	2026-03-19 21:11:53 +02:00
Rail Chabdarov	340807273b	hip: Avoid compiler bug in RDNA code generation during debug builds on Windows (#20655 )	2026-03-19 19:14:08 +01:00
Ryan Goulden	26c9ce1288	server: Add cached_tokens info to oaicompat responses (#19361 ) * tests : fix fetch_server_test_models.py * server: to_json_oaicompat cached_tokens Adds OpenAI and Anthropic compatible information about the number of cached prompt tokens used in a response.	2026-03-19 19:09:33 +01:00
James O'Leary	76f2dc70c3	chat : handle tool calls with no required args in TAG_WITH_TAGGED format (#20764 ) * chat : handle tool calls with no required args in TAG_WITH_TAGGED format * Update tests/test-chat.cpp [no ci] Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-03-19 17:53:11 +01:00
Georgi Gerganov	900efd531d	ci : clarify gh command for viewing issues (#20766 )	2026-03-19 18:43:54 +02:00
Yiwei Shao	74c42ee1f4	hexagon: add Matrix Extensions (HMX) for Hexagon NPU backend (#20693 ) * migrate(vtcm): unify VTCM management for HMX merge - Add HMX fields to htp_context (#ifdef HTP_HAS_HMX): hmx_enabled, hmx_dma, vtcm_scratch_size, exp2_table - Add HTP_VTCM_SESSION_HOLD CMake option (default ON): hold VTCM for entire session instead of per-op acquire/release - Add vtcm_op_acquire/vtcm_op_release inline wrappers: no-op in session-hold mode, delegate in per-op mode - Add VTCM tail reservation for precompute tables (256KB, 64KB aligned) in htp_iface_start under HTP_HAS_HMX - Add HMX init/cleanup hooks in htp_iface_start/stop - Add precompute table recovery in vtcm_acquire after VTCM preemption - Do NOT migrate vtcm_mgr from htp-ops-lib (replaced by tail reservation) * migrate(repack): replace x4x2 with HMX tile-permuted super-block format - Add hmx_block_q4_0/q8_0 struct definitions (scales-first + sequential quants) - Implement forward repack: repack_q4_0_to_hmx_superblock, repack_q8_0_to_hmx_superblock, repack_f16_to_tile_permuted - Implement inverse repack for get_tensor debug verification - Route set_tensor/get_tensor via opt_arch >= 73 to HMX path, else existing HVX x4x2 - MXFP4 on v73+ falls back to HVX x4x2 repack (not memcpy) - Extend supports_op: add IQ4_NL for v73+, F16 tile alignment checks - Tail blocks (K not multiple of 256): repack to x4x2 via pad-repack-truncate - Add CMake GGML_HEXAGON_HMX_TAIL_HVX option (default ON); OFF rejects non-256-aligned K in supports_op * migrate(dma): add dma_queue_push_1d() convenience wrapper for HMX ops Add 1D linear DMA transfer helper to hex-dma.h for upcoming HMX op migration. Reuses existing dma_queue_flush() for sync points instead of adding redundant dma_queue_drain(). * migrate(hmx): reorganize HMX files into htp/hmx/ and simplify HMX locking Move all 14 HMX-related files from htp/ to htp/hmx/ subdirectory for cleaner separation between HVX and HMX code. 
Simplify HMX hardware locking by replacing the two-level lock design (SHARED HAP lock + custom asm spin-lock) with direct HAP_compute_res_hmx_lock/unlock on the existing vtcm_rctx, which already has HMX capability. Key changes: - Create htp/hmx/ subdirectory with all HMX infrastructure and ops - Replace hmx_mgr_ctx_id + spin-lock with HAP_compute_res_hmx_lock(vtcm_rctx) - Remove hmx_manager_enable/disable_execution() (SHARED lock no longer needed) - Add hmx_set_vtcm_state() call in main.c (was missing, caused null globals) - Update main.c includes to use hmx/ prefix - Clean up duplicate declarations from hmx-worker-pool.h * migrate(hmx-infra): consolidate HMX infrastructure into htp_context - Remove hmx-mgr.c/h: eliminate global HMX state singleton, thread htp_context through all HMX ops - Remove hmx-worker-pool.c/h: replace separate HMX worker pool with main worker_pool API (worker_pool_run_func) - Replace hmx_unit_acquire/release with direct HAP_compute_res_hmx_lock/unlock on ctx->vtcm_rctx - Remove HTP_VTCM_SESSION_HOLD compile option: always use per-op vtcm_acquire/release - Remove hmx_dma from htp_context: HMX ops use ctx->dma[0] instead of separate DMA queue - Simplify main.c init/cleanup: remove hmx_manager_setup/reset and vtcm_op_acquire/release wrappers - Delete upstream llama.cpp AGENTS.md (not applicable to fork) * migrate(flash-attn): remove HTP_EXP2_TABLE_COPIES, use single exp2 table - Remove HTP_EXP2_TABLE_COPIES compile definition and CMake cache variable - Remove table duplication loop in precompute-table.c - Remove worker_index % N sub-table indexing in hmx-flash-attn-ops.c - Fix table_size to 65536 (single 64 KB copy) in main.c The exp2 lookup table is read-only; concurrent VTCM reads do not cause bank conflicts, so duplicating the table wastes 192 KB of VTCM for no benefit. 
* migrate(dsp-main): add HMX priority dispatch in packet_callback - Add proc_hmx_matmul_req() wrapper for HMX mat_mul (F16 and quantized types) - Add proc_hmx_flash_attn_req() wrapper for HMX simple_flash_attn (FP16 only, falls back to HVX for non-FP16) - Add proc_hmx_rms_norm_req() wrapper using hvx_rms_norm_f32 - Route MUL_MAT, FLASH_ATTN_EXT, RMS_NORM through HMX path when ctx->hmx_enabled - Split RMS_NORM and SCALE into separate case blocks for independent dispatch - All HMX wrappers guarded by #ifdef HTP_HAS_HMX * migrate(cmake-dsp): add HMX source files and -mhmx for v73+ skels Add HTP_VTCM_SESSION_HOLD option (default ON) and v73+ HMX build integration: compile hmx-matmul-ops, hmx-flash-attn-ops, hmx-rms-norm-ops and precompute-table into v73/v75/v79/v81 skels with -mhmx flag and HTP_HAS_HMX=1 definition. v68/v69 skels remain unchanged. * migrate(hmx-ops): fix compile errors in HMX ops for ggml struct compatibility - hmx-matmul-ops.c: include ggml-common.h for block_q4_0/block_q8_0 definitions - hmx-matmul-ops.c: rename quants->qs, scale->d to match upstream ggml field names - hmx-flash-attn-ops.c: suppress -Wunused-function/-Wunused-variable warnings - hmx-flash-attn-ops.c: inline ctx->n_threads, remove unused n_workers variable * hmx: set Q/O element type to fp16 for flash attention The llama.cpp integration passes fp16 Q/O tensors, so qo_fp32_element should be false to match the actual data layout. * hexagon: unify HMX weight format to x4x2, add IQ4_NL and DSP-side fallback Remove the v73+ HMX-specific super-block/tile-permuted weight format and unify all architectures on the HVX x4x2 packed format. The DSP now decides at runtime whether to use the HMX or HVX matmul path based on dimension constraints (M%32, N%32, K%256 alignment), rather than the host rejecting ops in supports_op. This simplifies the host repack logic, eliminates ~400 lines of HMX super-block code, and adds IQ4_NL quantization support across host and DSP. 
Key changes: - Remove hmx_block_q4_0/q8_0 types, repack functions, and F16 tile permutation (ggml-hexagon.cpp, hmx-quants.h) - Simplify set_tensor/get_tensor to always use x4x2 repack, add IQ4_NL - Force is_host=false so tensor copies go through format conversion - Add HTP_TYPE_IQ4_NL to DSP message protocol (htp-msg.h) - Rewrite DSP dequantizers to work directly on x4x2 layout (hmx-matmul-ops.c) - Fix mxclracc.hf placement: clear per output tile, not once globally - Move HMX eligibility checks to DSP proc_hmx_matmul_req (main.c) - Remove dma_queue_push_1d wrapper, use 2D DMA for weight sub-blocks - Add VTCM allocation overflow asserts - Remove GGML_HEXAGON_HMX_TAIL_HVX build option (CMakeLists.txt) * Enhance HMX debugging capabilities with new tile dumping functions - Introduced hmx_dump_tile_mem and hmx_dump_fp32_tile_region for improved memory layout visualization of tile data. - Updated hmx_dump_tile_rows to provide raw memory output for debugging. - Added debug logging for activation and weight tile pairs during processing to facilitate troubleshooting. - Refined existing macros for dumping HVX vector values to streamline debugging output. These changes aim to enhance the debugging experience for HMX matmul operations, ensuring better visibility into data handling and transformations. * OK for small mat mul * hexagon: fix UDMA roiwidth 16-bit overflow in HMX matmul DMA transfers The UDMA descriptor roiwidth field is 16-bit (max 65535), but large matrix DMA transfers (e.g. 32×2304 = 73728 bytes) exceeded this limit, causing truncated transfers and NaN results. Fix by using 2D DMA (per-row stride × n_rows) instead of 1D (total_size × 1) for all 4 DMA push calls in both x4x2 and fp16 weight paths. 
Also includes: - Use standard vlut16 instead of _nomatch variant for dequantization - Add per-tile vscatter drain barrier for correctness - Add compile-time HMX_DEBUG_TRACE_VALUES instrumentation (disabled by default) * hexagon: remove HMX RMS norm fallback and re-enable matmul pipeline Remove hmx-rms-norm-ops.c as the HVX RMS norm offers no benefit over the generic unary path. Re-enable DMA pipeline mode for QK matmul. * hexagon: guard all HMX matmul DMA transfers against UDMA 16-bit field overflow All UDMA type1 descriptor fields (roiwidth, roiheight, srcstride, dststride) are 16-bit (max 65535). Commit 40d2a9cc fixed roiwidth overflow in the non-pipeline path by switching from 1D to 2D DMA, but the pipeline path (3 call sites) was left unchanged and still used 1D DMA with chunk_size = n_cols * row_stride as roiwidth, which overflows for any practical matrix size when the pipeline is active. Add a local hmx_dma_push_safe() helper that transparently handles overflow: - Fast path (zero overhead): all params fit in 16 bits -> direct call. - Contiguous block: reshapes into a single 2D descriptor with sub_width that fits in 16 bits, preserving async DMA behavior. - Stride overflow: row-by-row fallback for future large-k models where per-row stride itself exceeds 65535. Convert all 8 external dma_queue_push calls in hmx-matmul-ops.c to use the safe helper, including the 3 pipeline sites (1D -> 2D fix), the FP16 and x4x2 weight paths, qweight_fetch sub-block DMA, and the output-stationary activation fetch. 
* hexagon: multithread activation/output transfer and add HMX matmul fallback - Replace single-threaded transfer_activation_chunk_fp32_to_fp16 with transfer_activation_chunk_multithread across all HMX matmul paths - Add multi-threaded transfer_output_chunk_multithread for FP16-to-FP32 output store, following the same worker pool pattern - Rename transfer_activation_chunk_no_prefetch back to transfer_activation_chunk_fp32_to_fp16 and clean up stale comments - Add HVX fallback in proc_hmx_matmul_req when HMX matmul returns error * [todo]: dynamic alloc vtcm, cause prefill regression. * hexagon: constrain HMX mxmem tile load region to avoid VTCM bank boundary faults Set activation/weight mxmem Rt to 2047 for single-tile loads and document the 4MB VTCM bank boundary constraint, preventing precise bus errors when dynamic VTCM allocation places tiles near bank edges. * hexagon: split unaligned-M HMX matmul into HMX+HVX phases - keep HMX for the 32-aligned head rows and process tail rows with HVX - force re-quantization for HVX tail after HMX phase to avoid stale VTCM state - preserve fallback behavior when N is unaligned or no aligned M rows exist * hexagon: batch-4 Q4_0 dequantize fast path and remove debug traces Add dequantize_x4x2_q4_0_x4groups_hvx() that processes 4 contiguous K-tiles with a single vmemu + vlut16 per row, reducing per-tile overhead. The dequantize loop now takes the batch-4 path when 4 aligned K-tiles are available within the same column tile, falling back to the original single-tile path otherwise. Also removes HMX_DEBUG_TRACE_VALUES instrumentation blocks that are no longer needed. * hexagon: abort on DSP error and fix HMX-to-HVX fallback quantize flag Promote DSP response error from log to GGML_ABORT for fail-fast behavior. Clear SKIP_QUANTIZE flag when falling back from HMX to HVX matmul so the HVX path correctly re-quantizes activations. * hexagon: support batch matmul. 
This fixes a perplexity issue. The problem comes from Grouped-Query Attention (GQA): strides between batches are not respected. TODO: optimize batch matmul to reuse weights between batches. * hexagon: reuse weights in fp16 batch matmul * hexagon: remove unused HMX flash attention operations and precomputation table, remove the log system for test * hexagon: remove unused HVX math helpers, debug infrastructure, and stale build options * hexagon: fix HMX not enabled due to missing force_hvx parameter in IDL * hexagon: remove the unnecessary changes not related to HMX * hexagon: bypass HMX by default * hexagon: add upstream repo link to htp-ops-lib ported file headers * hexagon: restore host buffer support * hexagon: add HMX=1 option for the adb scripts * hex-hmx: improve DMA pipelining * hex-hmx: further improvements to dma pipelining * hex-hmx: minor cleanup * hex-hmx: move hmx lock out of inner loops/calls * hex-hmx: remove unnecessary state and wrappers * hex-hmx: remove hmx dir and unify f32 to f16 conversions * hex-hmx: further unify hvx conversions * hex-hmx: revert f16 converter to the original for now * hex-hmx: minor cleanup for f16 to f32 converter * hex-mm: replace incorrect fp16-to-fp32 hmx converter and reformatted related code * hex-dma: move chained dma push into hex-dma.h header and update hmx-mm * hex-mm: use hex_is_aligned instead of a duplicated hmx_is_aligned * hex-mm: use hvx_vec_splat_f16 in the hmx code * hex-mm: use VLEN and HTP types in hmx-code * hex-mm: remove duplicate QK and defs * hexagon: pre-shuffle quants before vlut16 * hexagon: enable HMX by default * hex-mm: code indent fixes for hmx-matmul * hexagon: update hex-utils to include align/smin/etc helpers and use that in hmx mm * hex-mm: more formatting fixes * hex-mm: minor naming updates in hmx code * hex-mm: remove leftover from rebase conflict * Fix the incorrect indents --------- Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>	2026-03-19 09:11:06 -07:00
uvos	b49d8b8757	ci : add hip quality check (#20430 ) * CI: add hip quality check * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update .github/workflows/hip-quality-check.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update scripts/hip/gcn-cdna-vgpr-check.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Revert "Update .github/workflows/hip-quality-check.yml" This reverts commit `efa0bfcdb0`. * scripts: gcn-cdna-vgpr-check.py: enforce int type for total_vgprs * scripts: gcn-cdna-vgpr-check.py: add flash attention instances to ignore list * Bump ccache version * Add missing separators to list --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-19 17:05:44 +01:00
Piotr Wilkin (ilintar)	5e54d51b19	common/parser: add proper reasoning tag prefill reading (#20424 ) * Implement proper prefill extraction * Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp * Update tools/server/server-task.cpp * refactor: move grammars to variant, remove grammar_external, handle exception internally * Make code less C++y Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2026-03-19 16:58:21 +01:00
Reese Levine	c1258830b2	ggml webgpu: ops support for qwen3.5 (SET, TRI_SOLVE, SSM_CONV, GATED_DELTA_NET) + GET_ROWS optimization (#20687 ) * Implement l2_norm, set, tri * Add DIAG/SOLVE_TRI * Add SSM_CONV * Better get_rows and gated_delta_net to support qwen3.5 * Clean up, update ops.md * Fix binding_index type for wasm * Fix read write annotations * cleanups	2026-03-19 08:45:28 -07:00
ddh0	922b90e567	common : add LLAMA_ARG_SPEC_TYPE (#20744 )	2026-03-19 16:16:55 +01:00
Georgi Gerganov	f071ce67c9	ci : add action for finding duplicate issues (#20756 ) * ci : add action for finding duplicates issues * cont : gen info * cont : formatting * cont : fix * cont : instructions * cont : bump checkout action Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-19 16:17:37 +02:00
Pascal	4065c1a3a6	Server becomes the source of truth for sampling parameter defaults (#20558 ) * webui: make server the source of truth for sampling defaults * webui: fix Custom badge for sampling parameters * webui: log user overrides after server sync * chore: update webui build output * fix: Default values for sampling settings config object * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-19 13:20:39 +01:00
Xuan-Son Nguyen	1e64534570	mtmd: add clip_graph::build_mm() (#20751 ) * clip: add build_mm() * apply to all models * add TODO for bias overload	2026-03-19 13:11:39 +01:00
Pascal	cd708db0cc	WebUI: Persist the on/off state of the MCP servers for new conversations (#20750 ) * webui: add persistent storage for MCP server on/off state in new chats * webui: simplify MCP enabled checks, remove dead server.enabled fallback * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>	2026-03-19 12:54:06 +01:00
Aleksander Grygier	512bba6ee0	webui: Improve model parsing logic + add unit tests (#20749 ) * add tests for model id parser * add test case having activated params * add structured tests for model id parser * add ToDo * feat: Improve model parsing logic + tests * chore: update webui build output --------- Co-authored-by: bluemoehre <bluemoehre@gmx.de>	2026-03-19 12:25:50 +01:00
Dowon	b486c17b3e	convert : support is_causal hyperparameter (#20746 ) * convert : support is_causal hyperparameter Check for the `is_causal` attribute in the Hugging Face model configuration and include it in the GGUF metadata. * Update convert_hf_to_gguf.py Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * style: fix F541 f-string is missing placeholders --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-19 11:41:11 +01:00
Aldehir Rojas	1b9bbaa357	common : fix gpt-oss content removal (#20745 )	2026-03-19 11:40:39 +01:00
Eve	07feeaa92e	vulkan: dequantize iq4_xs 4 at a time (#20657 )	2026-03-19 11:32:04 +01:00
Charles Xu	3fee84e156	cmake : fix build warning when kleidiai is enabled (#20457 ) * cmake : fix build warning when kleidiai is enabled * remove LLAMA_ARG_THREADS from KleidiAI backend	2026-03-19 10:14:48 +02:00
Sigbjørn Skjæret	811397745e	vocab : assert array size of scores and toktypes (#20737 )	2026-03-19 08:34:04 +01:00
Kevin Hannon	c014c3f83a	docs: add information about openvino in the docker page (#20743 )	2026-03-19 15:08:47 +08:00
Chenguang Li	7f2cbd9a4d	CANN: handle in-place ROPE on non-contiguous f32 tensors (#20274 ) RotaryPositionEmbedding on CANN fails when src and dst share the same non-contiguous buffer (inplace + view), because the operator overwrites source data before it is fully read. Add a branch that detects this case and uses contiguous temporary buffers: copy src to temp, run ROPE into another temp, then copy back to the non-contiguous dst. Fixes 20 failing ROPE tests (f32, v=1, inplace=1). Signed-off-by: noemotiovon <757486878@qq.com>	2026-03-19 14:05:01 +08:00
Masashi Yoshimura	509a31d00f	ggml-webgpu: Update the `RMS_NORM` preprocessor and add `L2_NORM` (#20665 ) * Update the preprocessor of RMS_NORM and add L2_NORM. * Fix the name of rms_norm to row_norm.	2026-03-18 21:08:59 -07:00
Masashi Yoshimura	ea01d196d7	ggml-webgpu: Add support for `DIAG` and `TRI` (#20664 ) * Add support for DIAG and TRI. * Remove extra ttype and add a comment for TRI op.	2026-03-18 21:08:35 -07:00
Chenguang Li	07ba6d275b	CANN: support flash attention for head dim not multiple of 16, fix ALiBi slope offset (#20031 ) - Allow FLASH_ATTN_EXT when head dimension D is not a multiple of 16 by padding Q/K/V to D_padded = GGML_PAD(D, 16), running FusedInferAttentionScoreV2, then slicing the output back to D (ggml-cann.cpp + aclnn_ops.cpp). - Fix aclnn_get_slope second-part offset: use ggml_type_size(dtype) instead of sizeof(float) so ALiBi slopes are correct when dtype is F16 (e.g. GQA with 48 heads); fixes buffer overflow and large numerical errors in those cases.	2026-03-19 11:02:42 +08:00
Michael Grau	6729d4920c	model : add control vector support where missing (#20653 ) * Add control vector functions to qwen3.5 and qwen-next models * Add missing cvec compatibility to the rest of the models * Adjust comments and formatting * cleanup * whitespace --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2026-03-18 23:25:12 +01:00
Sigbjørn Skjæret	d13d60af1d	gguf-py : cleaner way to get the first key (#20727 )	2026-03-18 23:21:42 +01:00
crsawyer	5744d7ec43	Rebuild index.html.gz (#20724 )	2026-03-18 18:49:57 +01:00
Reese Levine	8ced5f41f9	Move to no timeout for WaitAny in graph submission to avoid deadlocks in some cases on llvm-pipe backends (#20618 )	2026-03-18 10:23:47 -07:00
Shaw Nguyen	78d550b541	ggml-cpu/x86: fix unused changemask warning in repack (#20692 )	2026-03-18 18:45:06 +02:00
Georgi Gerganov	4efd326e71	sync : ggml	2026-03-18 15:17:28 +02:00
Georgi Gerganov	b08f7322ee	ggml : bump version to 0.9.8 (ggml/1442)	2026-03-18 15:17:28 +02:00
Georgi Gerganov	79187f2fb8	ggml : restore ggml_type_sizef() to avoid major version bump (ggml/1441)	2026-03-18 15:17:28 +02:00
Julien Chaumond	48e61238e1	webui: improve tooltip wording for attachment requirements (#20688 ) * webui: improve tooltip wording for attachment requirements Co-Authored-By: Claude <Agents+claude@huggingface.co> * chore: update webui build output * chore: update webui build output --------- Co-authored-by: Claude <Agents+claude@huggingface.co>	2026-03-18 14:01:02 +01:00
Pop Flamingo	312cf03328	llama : re-enable manual LoRA adapter free (#19983 ) * Re-enable manual LoRA adapter free * Remove stale "all adapters must be loaded before context creation" comments	2026-03-18 12:03:26 +02:00
Masato Nakasaka	f4049ad735	tests : fix test-jinja-py Windows failures by bypassing command-line args [no ci] (#20483 ) * Fix errors occurring on Windows * Reverted fix; #20365 will take care of the CRLF issue * Changed to write directly to stdin * Prevent fclose from happening twice	2026-03-18 10:43:31 +01:00
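
Note on 21c8045214 (jinja heap OOB read): a minimal sketch of the difference between the two `std::equal` overloads named in that message. `same_elements` is a hypothetical helper for illustration only; it is not the code in `value_array_t::equivalent()` or `value_object_t::equivalent()`.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical helper (not from llama.cpp): compares two sequences safely.
// The three-iterator form std::equal(a.begin(), a.end(), b.begin()) assumes
// that b holds at least a.size() elements and reads past the end of b when
// it does not. The four-iterator form (C++14) is given both ends and returns
// false when the lengths differ.
bool same_elements(const std::vector<int> & a, const std::vector<int> & b) {
    return std::equal(a.begin(), a.end(), b.begin(), b.end());
}
```

With this form, comparing a three-element vector against a two-element one yields `false` rather than touching memory beyond the shorter container.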
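
Note on 3adbef7776 (nextn_predict_layers underflow): a small sketch of why the unsigned subtraction needs a guard. The function name is hypothetical and the parameter names are borrowed from the commit message for readability. The commit asserts; the sketch throws instead, only so the failure can be observed without aborting.

```cpp
#include <cstdint>
#include <stdexcept>

// Hypothetical helper (not from llama.cpp). With unsigned arithmetic,
// n_layer - nextn_predict_layers wraps around to a huge value when
// nextn_predict_layers > n_layer, so the operand is validated before the
// subtraction. The >= comparison mirrors the condition in the commit message.
std::uint32_t layers_before_nextn(std::uint32_t n_layer,
                                  std::uint32_t nextn_predict_layers) {
    if (nextn_predict_layers >= n_layer) {
        // Assumption: an exception stands in for the assertion in the commit.
        throw std::invalid_argument("nextn_predict_layers must be < n_layer");
    }
    return n_layer - nextn_predict_layers;
}
```

Without the check, a GGUF file declaring more next-token-prediction layers than total layers would produce a wrapped layer count instead of an error.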
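
Note on dc6592431b (zero output buffer on allocation): a sketch of the allocate-then-clear pattern that message describes. `alloc_zeroed_floats` and its default alignment are assumptions made for illustration; this is not the code in `output_reserve()`.

```cpp
#include <cstddef>
#include <cstring>
#include <stdlib.h> // posix_memalign, free (POSIX)

// Hypothetical helper (not from llama.cpp). posix_memalign() returns
// uninitialized memory, so the buffer is cleared right after allocation.
// That way a caller who reads it before any write sees zeros, not stale
// heap content. The caller releases the buffer with free().
float * alloc_zeroed_floats(std::size_t n, std::size_t alignment = 64) {
    if (n == 0) {
        return nullptr;
    }
    void * ptr = nullptr;
    // alignment must be a power of two and a multiple of sizeof(void *)
    if (posix_memalign(&ptr, alignment, n * sizeof(float)) != 0) {
        return nullptr;
    }
    std::memset(ptr, 0, n * sizeof(float));
    return static_cast<float *>(ptr);
}
```

The cost is one `memset` per allocation, which is paid when the buffer is reserved rather than on every decode.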
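
Note on c46583b86b (out_of_range in the throw path): a sketch of the offset mismatch that message describes. `text_after` is a hypothetical helper and the clamping is an assumption; the actual fix in the parser may differ.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper (not from llama.cpp). `end` is an offset into
// generation_prompt + input, so it is applied to that combined string.
// Applying it to `input` alone throws std::out_of_range whenever
// end > input.size().
std::string text_after(const std::string & generation_prompt,
                       const std::string & input,
                       std::size_t end) {
    const std::string effective_input = generation_prompt + input;
    if (end > effective_input.size()) {
        end = effective_input.size(); // clamp so substr() cannot throw
    }
    return effective_input.substr(end);
}
```

For example, with a generation prompt of `<think>` (7 characters) and an input of `abc`, offset 8 is valid in the combined string but lies beyond the end of `abc`.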
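
Note on c1b911654a (TOCTOU race in models_max): a sketch of checking capacity under the same lock that performs the insertion. `model_registry` is a hypothetical class for illustration, not `server_models`, and it leaves out LRU unloading entirely.

```cpp
#include <cstddef>
#include <mutex>
#include <set>
#include <string>

// Hypothetical registry (not from llama.cpp). The capacity check and the
// insertion happen while holding one lock, so two concurrent callers cannot
// both observe free capacity and both proceed to load.
class model_registry {
public:
    explicit model_registry(std::size_t max_models) : max_models_(max_models) {}

    // Returns false instead of silently exceeding the limit.
    bool try_register(const std::string & name) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (loaded_.count(name) > 0) {
            return true; // already registered
        }
        if (loaded_.size() >= max_models_) {
            return false; // another caller filled the slot first
        }
        loaded_.insert(name);
        return true;
    }

    std::size_t size() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return loaded_.size();
    }

private:
    mutable std::mutex    mutex_;
    std::set<std::string> loaded_;
    std::size_t           max_models_;
};
```

A caller that gets `false` back would report an error to the client, which matches the "reject with an error" behavior the commit message describes.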
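
Note on 74c42ee1f4 (UDMA 16-bit field overflow): a sketch of reshaping a contiguous transfer into a 2D descriptor whose width and row count both fit in 16 bits. `split_contiguous` and `dma_shape_2d` are hypothetical names; this is not `hmx_dma_push_safe()` and it ignores strides. The commit's first fix used the matrix's own row stride and row count, whereas this sketch simply takes the first factorization that fits.

```cpp
#include <cstdint>

// Hypothetical descriptor shape (not from llama.cpp).
struct dma_shape_2d {
    std::uint32_t width; // bytes per row
    std::uint32_t rows;  // number of rows
    bool          ok;    // false when no factorization fits
};

// Splits a contiguous transfer of total_bytes into width x rows such that
// both values are at most 65535, the largest value a 16-bit field can hold.
dma_shape_2d split_contiguous(std::uint32_t total_bytes) {
    const std::uint32_t limit = 65535;
    for (std::uint32_t rows = 1; rows <= limit; ++rows) {
        if (total_bytes % rows == 0 && total_bytes / rows <= limit) {
            return {total_bytes / rows, rows, true};
        }
    }
    return {0, 0, false}; // caller would need a row-by-row fallback
}
```

For the 73728-byte transfer mentioned in the message, a 1D descriptor overflows the width field, while a split into two rows of 36864 bytes fits.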

8555 Commits