* ggml-cpu:fix RISC-V Q4_0 repack select and RVV feature reporting
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* using the name VLEN instead of CNT
* Update ggml/include/ggml-cpu.h
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* tests: update barrier test to check for race condition in active threads
* cpu: combine n_graph and n_threads into a single atomic update
* tests: add multi-graph test for test_barrier
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* cmake: enable RISC-V zihintpause extension for Spacemit builds
* readme : add ZIHINTPAUSE support for RISC-V
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* Feat: Added vulkan circular tiling support
* Feat: Added cpu circular
* Feat: Added cuda kernels
* Added tests
* Added tests
* Removed non-pad operations
* Removed unneded changes
* removed backend non pad tests
* Update test-backend-ops.cpp
* Fixed comment on pad test
* removed trailing whitespace
* Removed unneded test in test-backend-ops
* Removed removed test from calls
* Update ggml/src/ggml-vulkan/vulkan-shaders/pad.comp
Co-authored-by: Ruben Ortlam <picard12@live.de>
* Fixed alignment
* Formatting
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
* Format pad
* Format
* Clang format
* format
* format
* don't change so much stuff
* clang format and update to bool
* fix duplicates
* don't need to fix the padding
* make circular bool
* duplicate again
* rename vulkan to wrap around
* Don't need indent
* moved to const expr
* removed unneded extra line break
* More readable method calls
* Minor wording changes
* Added final newline
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/include/ggml.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Added circular pad ext tests
* Gate non circular pad devices
* Cleaned gating of non-circular pad devices
---------
Co-authored-by: Phylliida <phylliidadev@gmail.com>
Co-authored-by: Ruben Ortlam <picard12@live.de>
Co-authored-by: Aman Gupta <amangupta052@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Usual suspects
* fix my bad suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Undo acc rename
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Q4_K arm dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix: q4_qs reinterpret from uint to int
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
On arm64 with `cmake` version 3.31.6, the final feature verification fails:
-- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SME
-- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Failed
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Failed
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Failed
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
We need to explicitly replace `;` with spaces from the list to make
`CMAKE_REQUIRED_FLAGS` work correctly...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* ggml: add RISC-V cpu-feats
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* fix comment[1]
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang_format
* GGML_ASSERT to check wdata within bounds
* Accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
* extract rotate_pairs logic from ggml_compute_forward_rope_f32
* templateify ggml_compute_forward_rope_f32 and _f16
* abort when rope type not supported, remove GLM from test-rope
* add imrope branch to switch
* add rope tests for perf
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add i8mm route with SVE ggml_vec_dot_q4_K_q8_K and ggml_vec_dot_q6_K_q8_K
* Surround SVE function with compiler directive
* fix compile switch
* fix coding style
* ggml : fix indent
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
When using GCC 9 and GCC 12 on the arm64 platform of ubuntu 2004,
the command "gcc -mcpu=native -E -v -" fails to detect the correct CPU flags,
which results in compilation failures for certain extended instructions,
but the correct CPU flags can be obtained by using gcc -march.
Signed-off-by: lizhenneng <lizhenneng@kylinos.cn>
Co-authored-by: lizhenneng <lizhenneng@kylinos.cn>