* ggml-cpu: handle 3d tensors in repack mul_mat
* Removed unnecessary branch, removed need for <algorithm>
* Fixed dst_ptr pointer in chunk + clang-format
* GGML_ASSERT to check wdata within bounds
* Remove accidental ggml.h inclusion
* Improved GGML_ASSERT on wdata boundaries
* Address performance regression in Qwen and llama.cpp due to chunking
* vulkan: remove shell call from vulkan-shaders-gen tool
* use string vector for command execution
* Fix condition
* use string, remove const_cast
* Fix dependency file quotation on Windows
---------
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* update L2_NORM op support
* update L2_NORM op support
* remove extra whitespace
* cann: update cross_entropy_loss op support
* remove trailing whitespaces
* rebase onto the latest code in the main repository and remove the l2_norm operator that already exists in another pull request.
* undo the l2_norm operator deletion
* CUDA: add fused rope
* move k forward_expand up
* create helper function instead of re-using params
* make assert statement more in line with comment
* rope_norm: coalesced writes to global mem
* webgpu : fix build on emscripten
* more debugging stuff
* test-backend-ops: force single thread on wasm
* fix single-thread case for init_tensor_uniform
* use jspi
* add pthread
* test: remember to set n_thread for cpu backend
* Add buffer label and enable dawn-specific toggles to turn off some checks
* Intermediate state
* Fast working f16/f32 vec4
* Working float fast mul mat
* Clean up naming of mul_mat to match logical model, start work on q mul_mat
* Setup for subgroup matrix mat mul
* Basic working subgroup matrix
* Working subgroup matrix tiling
* Handle weirder sg matrix sizes (but still a multiple of the sg matrix size)
* Working start to gemv
* working f16 accumulation with shared memory staging
* Print out available subgroup matrix configurations
* Vectorize dst stores for sg matrix shader
* Working scalar gemv
* Minor set_rows optimization (#4)
* updated optimization, fixed errors
* non-vectorized version now dispatches one thread per element
* Simplify
* Change logic for set_rows pipelines
---------
Co-authored-by: Neha Abbas <nehaabbas@macbookpro.lan>
Co-authored-by: Neha Abbas <nehaabbas@ReeseLevines-MacBook-Pro.local>
Co-authored-by: Reese Levine <reeselevine1@gmail.com>
* Comment on dawn toggles
* Working subgroup matrix code for (semi)generic sizes
* Remove some comments
* Cleanup code
* Update dawn version and move to portable subgroup size
* Try to fix new dawn release
* Update subgroup size comment
* Only check for subgroup matrix configs if they are supported
* Add toggles for subgroup matrix/f16 support on nvidia+vulkan
* Make row/col naming consistent
* Refactor shared memory loading
* Move sg matrix stores to correct file
* Working q4_0
* Formatting
* Work with emscripten builds
* Fix test-backend-ops emscripten for f16/quantized types
* Use emscripten memory64 to support get_memory
* Add build flags and try ci
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* hexagon: explicitly check for ops with zero nrows
llm_graph_context::build_inp_out_ids() can generate tensors with zero nrows.
Somehow other backends seem to handle this without obvious explicit checks.
In the hexagon case we need to check explicitly and skip them.
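A rough sketch of the added guard (illustrative, not the literal hexagon
backend code):
```cpp
#include "ggml.h"

// Skip ops whose destination has zero rows, e.g. the tensors produced by
// llm_graph_context::build_inp_out_ids() mentioned above. A later commit
// below generalizes this to ggml_op_is_empty() / ggml_is_empty().
static bool hex_skip_op(const struct ggml_tensor * dst) {
    return ggml_nrows(dst) == 0;  // nothing to compute for this op
}
```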
* hexagon: introduce fastdiv (sketched below), fix test-backend-ops for ADD/SUB/MUL
Co-authored-by: chraac <chraac@gmail.com>
* hexagon: use fastdiv in ADD_ID
* hexagon: use ggml_op_is_empty and ggml_is_empty to check for NOPs
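For reference, a self-contained sketch of the fastdiv idea used above:
division by a fixed divisor becomes a precomputed multiply-high, add, and
shift (illustrative names, not the hexagon backend's actual helpers; the
construction is the usual Granlund-Montgomery scheme):
```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

struct fastdiv_vals {
    uint32_t mp; // magic multiplier
    uint32_t l;  // shift amount, l = ceil(log2(d))
};

// precompute the constants for a fixed divisor d > 0
static fastdiv_vals fastdiv_init(uint32_t d) {
    assert(d != 0);
    uint32_t l = 0;
    while (l < 32 && (uint64_t{1} << l) < d) {
        l++;
    }
    fastdiv_vals v;
    v.mp = (uint32_t) (((uint64_t{1} << 32) * ((uint64_t{1} << l) - d)) / d + 1);
    v.l  = l;
    return v;
}

// n / d without a hardware divide (valid for the index ranges kernels use,
// i.e. n comfortably below 2^31)
static inline uint32_t fastdiv_u32(uint32_t n, fastdiv_vals v) {
    const uint32_t hi = (uint32_t) (((uint64_t) n * v.mp) >> 32); // mulhi(n, mp)
    return (hi + n) >> v.l;
}

int main() {
    for (uint32_t d : {3u, 7u, 12u, 4096u}) {
        const fastdiv_vals v = fastdiv_init(d);
        for (uint32_t n = 0; n < 100000; n++) {
            assert(fastdiv_u32(n, v) == n / d);
        }
    }
    printf("fastdiv ok\n");
    return 0;
}
```
The point is to keep integer division out of per-element index math, where
a hardware (or emulated) divide in the inner loop is comparatively costly.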
---------
Co-authored-by: chraac <chraac@gmail.com>
* extract rotate_pairs logic from ggml_compute_forward_rope_f32
* templateify ggml_compute_forward_rope_f32 and _f16
* abort when rope type not supported, remove GLM from test-rope
* add imrope branch to switch
* add rope tests for perf
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update ggml/src/ggml-cpu/ops.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
When compiling llama.cpp in Yocto, the build fails QA checks because the generated .so files aren't versioned. This applies a version to all generated .so files, allowing the package to build without errors.
Register UMT5Model as a supported architecture variant for T5 model conversion.
This allows the conversion to work for models downloaded with AutoModel.
* feat(memory): Only fail partial erasure of recurrent tail
The recurrent state is always assumed to be the state as of the last update
from the final token in the sequence. When doing a partial erasure, if the
range does not include the final token, the erasure can be considered a
success since any memory used for the sequence prior to the final token
(which is no memory) has been successfully removed.
There is one potential case this doesn't address: pruning the cache to
remove sensitive data from the context. That wouldn't work for a partial
removal in the middle of the attention cache either, since the KV state
is linearly dependent and states at later sequence positions would still
be based on the state derived from the sensitive data, even if that data
is no longer cached, so I don't think this is relevant. It is worth
noting, though, that the semantics of this change for a partial erasure
in the middle of the cache are essentially "my context is already
compressed" and not "all trace of the removed tokens has been removed."
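A minimal sketch of that rule (hypothetical names, not the actual
llama_memory_recurrent code): the erasure only has to fail when the
requested range covers the tail position without starting at the
beginning of the sequence.
```cpp
#include <limits>

#include "llama.h"  // llama_pos

// Hypothetical sketch: the only state kept for a sequence is the cell at
// pos_tail, so an erasure range [p0, p1) that misses the tail has nothing
// to remove and can succeed trivially.
static bool recurrent_seq_rm_sketch(llama_pos p0, llama_pos p1, llama_pos pos_tail) {
    if (p0 < 0) { p0 = 0; }
    if (p1 < 0) { p1 = std::numeric_limits<llama_pos>::max(); }

    if (pos_tail < p0 || pos_tail >= p1) {
        // the range does not include the final token: nothing is stored for
        // it, so report success (this also covers the p0 > pos case fixed in
        // a later commit)
        return true;
    }

    // the range covers the tail: the in-place state cannot be rolled back to
    // an earlier position, so only erasing the whole sequence can succeed
    return p0 == 0;
}
```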
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(main): Check the output of seq_rm for prefix matching
This prefix matching is explicitly attempting to remove the tokens at the
end of the sequence that don't match. This is the operation that can't be
performed on a recurrent cache due to the state being updated in place, so
if this removal fails, we need to clear the whole cache.
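Roughly, the recovery looks like the sketch below (illustrative only,
using the public llama_memory_* API; the exact call sites in main.cpp
differ):
```cpp
#include "llama.h"

// Sketch of the fallback described above: try to drop the non-matching tail
// of sequence 0; if the memory module refuses (e.g. a recurrent cache whose
// state cannot be rolled back in place), clear everything and reuse nothing
// from the previous session.
static void truncate_to_matching_prefix(llama_context * ctx, size_t & n_matching_session_tokens) {
    llama_memory_t mem = llama_get_memory(ctx);

    if (!llama_memory_seq_rm(mem, /*seq_id=*/0, (llama_pos) n_matching_session_tokens, /*p1=*/-1)) {
        llama_memory_clear(mem, /*data=*/true);
        n_matching_session_tokens = 0;  // nothing from the old session can be reused
    }
}
```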
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix(memory): Fix condition for partial erasure failure if p0 > pos
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
* style: Fix extra parens
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* fix(main.cpp): Set n_matching_session_tokens to 0 on cache clear
https://github.com/ggml-org/llama.cpp/issues/16768
Branch: HybridContextShift-16768
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: compilade <git@compilade.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>