Previously, cmake was forcing `_WIN32_WINNT=0x0A00` for MinGW builds,
This caused "macro redefined" warnings with toolchains that define the version.
This also removes the `GGML_WIN_VER` variable as it is no longer needed.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Faster tensors (#8)
Add fast matrix and matrix/vector multiplication.
* Use map for shader replacements instead of pair of strings
* Wasm (#9)
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still % sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Gemv working scalar
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Remove extra whitespace
* Move wasm single-thread logic out of test-backend-ops for cpu backend
* Disable multiple threads for emscripten single-thread builds in ggml_graph_plan
* Fix .gitignore
* Add memory64 option and remove unneeded macros for setting threads to 1
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
- Compute row size for the temp buffer based on the output of the first pass.
- Update shader addressing math to use the output row size
- Pass the output row size as "ncols_output", what used to be "ncols_output" is now "k"
For the common case of K=40 and src0=(200000,1,1,1), this reduces the temporary buffer
from about 3.2MB to 500KB.
This commit removes a redundant check for sched->n_copies > 1 when
setting input and output flags on tensor copies in
ggml_backend_sched_split_graph.
The motivation for this change is to clarify the code as the outer if
statement already performs this check.
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Qwen3 Next - cleaned up version
* Whitespaces and stuff
* Correct minor errors
* Update src/llama-model.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Misc. fixes.
* Clean up code, add missing hybrid qualifier
* Did someone transpose the SOLVE_TRI result matrix? Perhaps...
* Whitespace
* Proper tensors for cb calls
* Use llama-graph.h vertical alignment
* BROKEN: chunking
* Set new tensors as inputs.
* Proper chunk logic
* It's the circle of life...
* More shenanigans for n_seq > 1
* Nail in the coffin?
* Fix Windows build
* Eh, one fails on Windows, the other fails on Mac... just use general capture.
* quant : cleanup
* model : cleanup
* qwen3 : cleanup
* cont : cleanup
* cont : cleanup
* ggml : revert change
* qwen3 : cleanup
* cont : cleanup
* Readd cmath
* qwen3 : fix typo
* Update convert_hf_to_gguf.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Usual suspects
* fix my bad suggestion
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.
* enable mmf for rdna4
* move some mmvf to mmf
* revert lds128 for wmma loading
* Revert "revert lds128 for wmma loading"
This reverts commit db9ae8b6b4.
* Revert "enable mmf for rdna4"
This reverts commit 698c9f2418.
* Revert "move some mmvf to mmf"
This reverts commit 99b92bd665.
* enable mul_mat for rdna4
---------
Co-authored-by: zhang hui <you@example.com>
* Enabled q4_K_4x8 path
* Fixed generic Q4_K 8x4 implementation
* wip: dotprod gemm
* Working arm q4_K dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Undo acc rename
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Q4_K arm dotprod gemm
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Fix: q4_qs reinterpret from uint to int
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
* Removed comments
* Fixed macro guards
* Fixed unused vars in generic implementation
* Fixed unused vars in 8x4 repack
* Fixed unused vars in generic implementation, unneeded comment
* Missing arch fallback for x86
* minor : style
---------
Signed-off-by: Alberto Cabrera <alberto.cabrera@liquid.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* vulkan: Implement top-k
Each pass launches workgroups that each sort 2^N elements (where N is usually 7-10)
and discards all but the top K. Repeat until only K are left. And there's a fast
path when K==1 to just find the max value rather than sorting.
* fix pipeline selection
* vulkan: Add N-ary search algorithm for topk
* microoptimizations
On arm64 with `cmake` version 3.31.6, the final feature verification fails:
-- ARM detected flags: -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod
-- Performing Test GGML_MACHINE_SUPPORTS_dotprod - Success
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm
-- Performing Test GGML_MACHINE_SUPPORTS_i8mm - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sve
-- Performing Test GGML_MACHINE_SUPPORTS_sve - Success
-- Performing Test GGML_MACHINE_SUPPORTS_sme
-- Performing Test GGML_MACHINE_SUPPORTS_sme - Failed
-- Performing Test GGML_MACHINE_SUPPORTS_nosme
-- Performing Test GGML_MACHINE_SUPPORTS_nosme - Success
-- Checking for ARM features using flags:
-- -U__ARM_FEATURE_SME
-- -mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
-- Performing Test HAVE_DOTPROD
-- Performing Test HAVE_DOTPROD - Failed
-- Performing Test HAVE_SVE
-- Performing Test HAVE_SVE - Failed
-- Performing Test HAVE_MATMUL_INT8
-- Performing Test HAVE_MATMUL_INT8 - Failed
-- Performing Test HAVE_FMA
-- Performing Test HAVE_FMA - Success
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC
-- Performing Test HAVE_FP16_VECTOR_ARITHMETIC - Failed
-- Performing Test HAVE_SME
-- Performing Test HAVE_SME - Failed
-- Adding CPU backend variant ggml-cpu: -U__ARM_FEATURE_SME;-mcpu=neoverse-v2+crc+sve2-aes+sve2-sha3+nossbs+dotprod+i8mm+sve+nosme
We need to explicitly replace `;` with spaces from the list to make
`CMAKE_REQUIRED_FLAGS` work correctly...
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* patch failed test case MUL_MAT(type_a=q4_0,type_b=f32,m=576,n=512,k=576,bs=[1,1],nr=[1,1],per=[0,1,2,3],k_v=0,o=1) for enabling WMMA on RDNA4
* Quick clean up on mma.cuh to add ggml_cuda_memcpy_1 back in for half2 and bfloat162
* CANN: ROPE supports both MROPE and IMROPE.
1. Optimize the caching logic of rope_cache_init.
2. Add support for mRoPE and i-mRoPE.
Note that on Ascend 910B devices, it is necessary to disable FA
in CLIP and disable NZ-format conversion. These two issues are
still under investigation.
* Resolve review comments