* feat: add mixed precision dot product implementation and function declaration
* feat: implement mixed precision vector dot product and conversion functions
* fix: update data type handling in matrix multiplication implementation
* fix: adjust row count handling in matrix multiplication implementation for accurate slicing
* fix: optimize matrix multiplication implementation by unroll loop
* update performance tracking for matrix multiplication implementation
* add fetching
* wip
* fix: support F16 * F32 multiplication in is_mul_mat_supported function
* fix: improve src0 fetching logic in vec_dot_product_mixed_impl for better alignment handling
* fix test failure for row width 67
* try fix failed test
* fix: rename aligned_address to align_down for clarity in vector alignment handling
* wip
* qnn fix: update device capabilities for quantized types in qnn-lib to improve compatibility
* fix test failure at width == 193
* fix: replace zero vector initialization with previous vector in mixed dot product implementation
* wip
* fix: improve handling of last vector in mixed dot product implementation
* wip
* wip
* wip
* wip
* Enhance mul_mat_f32 function to support quantized types and improve static assertions
* rename
* Refactor dequantization functions to use npu_device_fp16_t and improve type handling
* Optimize dequantization in dequantize_row_q8_0 by replacing qf32 multiplication with qf16
* Optimize dequantization in dequantize_row_q4_0 by replacing qf32 multiplication with qf16
* Add hvx_vsf_convert_vhf function for improved vector conversion
* add perf logs
* Refactor dequantize_row_q4_0 for alignment
* Update logging in supports_op_impl and supports_op to use ggml_op_desc for better clarity
* Add support for ROPE operation in NPU capabilities and related functions
* Implement ROPE operation in tensor and op_rope, including cache initialization and correction dimension calculations
* enable ROPE by adding operation validation
* add support to freq is null case
* wip
* Refactor rope_f32 to improve indexing by introducing total_planes calculation
* reformat
* Refactor rope_f32 to optimize data access patterns by introducing row and plane pointers
* Add performance tracking to rope_f32 function for enhanced profiling
* Refactor rope_f32 to use a templated implementation
* Refactor rope_impl to replace loop with memcpy for improved performance
* Refactor mul_mat_impl to support quantization as a template parameter
* wip
* wip
* Refactor rope_impl to optimize plane indexing in the processing loop
* Add aligned vector dot product implementation for mixed precision types
* wip
* Enhance matrix multiplication for F32 and F16 types with alignment checks
* Optimize vec_dot_product_mix_aligned_impl for improved performance with additional vector sums
* Add alignment checks for matrix multiplication and vector dot products
* Refactor matrix multiplication to use function pointers for improved readability and maintainability
* Fix alignment check in is_dot_product_aligned to ensure correct vector size handling
* Remove unused f16_to_f32_table parameter from quantization and dequantization functions
* wip
* Add L2 fetch for src1 plane rows in matrix multiplication implementation
* wip
* Refactor hvx_vsf_convert_vhf to accept an additional parameter for flexibility in vector multiplication
* Refactor vec_dot_product_mix_aligned_impl to improve variable naming for clarity
* Refactor load_dual_block_generic and dequantize_row_q4_0 to improve performance
* Refactor vector operation functions to improve clarity and consistency in variable usage
* wip
* wip
* Refactor dequantize_row_q4_0_impl for improved clarity and performance in vector operations
* wip
* Update load_dual_block_generic to use intrinsics
* Refactor load_dual_block_generic and load_qual_block_generic for improved performance and clarity
* wip
* wip
* Optimize dequantize_row_q8_0 for improved performance by unrolling for loop
* wip
* wip
* fix typo
* initial commit for handling extra template kwargs
* enable_thinking and assistant prefill cannot be enabled at the same time
* can set chat_template_kwargs in command line
* added doc
* fixed formatting
* add support for extra context in generic template init
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* coding standard: common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Apply suggestions from code review
coding standard: cosmetic changes
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix merge conflict
* chat.cpp: simplify calls to apply to ensure systematic propagation of extra_context (+ the odd existing additional_context)
* normalize environment variable name
* simplify code
* prefill cannot be used with thinking models
* compatibility with the new reasoning-budget parameter
* fix prefill for non thinking models
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
* SYCL: disable faulty fp16 CPU exponent for now
* Revert "SYCL: disable faulty fp16 CPU exponent for now"
This reverts commit ed0aab1ec3.
* SYCL: disable faulty fp16 CPU exponent for now
* Fix logic of disabling exponent kernel
* implement unary REGLU/GEGLU/SWIGLU cpu ops
* relax constraints
* duplicate shape of source
* fix ggml_vec_geglu_f16
* special case gated ops
* implement unary REGLU/GEGLU/SWIGLU cuda ops
* tighten constraints again
* refactor into GGML_GLU_OP
* metal : add glu kernels
ggml-ci
* add CUDA_GLU_BLOCK_SIZE [no ci]
* more constraints and use 64bit ints
ggml-ci
* 64bit multiplication [no ci]
* implement swapped variants (cpu/cuda)
* update comment [no ci]
ggml-ci
* Vulkan: Add GLU ops and shaders
* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate
* ggml : implement GLU for split up/gate (#14181)
* implement GLU for split up/gate
* add tests for ggml_glu_split
* Vulkan: Implement glu_split logic and shader support
* add split to logging [no ci]
* SYCL: refactor element_size ops and add split up and gate support to gated kernels
* SYCL: switch GEGLU to use tanh approximation
---------
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
* GGML: increase OP count in assertion
* Refactor: Optimize SYCL element-wise operations with unary function inlining
This commit refactors the SYCL element-wise operations to improve performance by:
- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.
The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.
* vulkan: Increase workgroup size for GLU, for performance (#14345)
* vulkan: Increase workgroup size for GLU, for performance
* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup
* merge fix
* metal : add support for split and swap
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* vulkan: Add fusion support for RMS_NORM+MUL
- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.
* extract some common fusion logic
* fix -Winconsistent-missing-override
* move ggml_can_fuse to a common function
* build fix
* C and C++ versions of can_fuse
* move use count to the graph to avoid data races and double increments when used in multiple threads
* use hash table lookup to find node index
* change use_counts to be indexed by hash table slot
* minimize hash lookups
style fixes
* last node doesn't need single use.
fix type.
handle mul operands being swapped.
* remove redundant parameter
---------
Co-authored-by: slaren <slarengh@gmail.com>
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched
* Review: add type traits and make function more generic
* Review: make check more explicit, add back comments, and fix formatting
* Review: fix formatting, remove useless type conversion, fix naming for bools
* ggml : add ggml_set_rows
Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.
ref: #8366
* use I64 for indices
* ggml : add repeat impl for i64
* ggml : add ggml_is_contiguous_rows
* ggml : ggml_set_rows support broadcast
* ggml : ggml_set_rows support quantized dst
ggml-ci
* ggml : support GGML_TYPE_F32 ".from_float" trait
* ggml : ggml_set_rows update comment + better index name
* tests : add ggml_set_rows
* metal : add ggml_set_rows implementation
ggml-ci
* ggml : simplify forward_dup_f32
* ggml : fix supports_op
* tests : add comment to set_rows
* ggml : leave the repeat_i64 for a separate PR
ggml-ci
* ggml : set_rows use std::min instead of MIN
* ggml : better error message for set_rows unsupported type
* metal : perform op->type check only once
* tests : more consistent implementation + more tests
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Mistral Small 2506 models using Pixtral vision encoder were running out
of GPU memory when processing images larger than 1024x1024 pixels due to
exponential memory growth from unlimited image size.
This fix applies the same 1024x1024 limit used by Qwen2VL models to
prevent OOM issues while maintaining compatibility with existing models.