The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
of new backend tests that hit this failure on NVIDIA GPUs.
Add verbose output to server_task_result_cmpl_final::to_json_oaicompat_chat_stream, making it conform with server_task_result_cmpl_final::to_json_oaicompat_chat, as well as the other to_json methods.
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders
* vulkan: Optimize mul_mat_vec p021 and nc shaders.
These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).
Using subgroupAdd in the p021 shader also helps, so use it conditionally.
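
The grouped-query reuse described above can be illustrated with a small CPU-side sketch. This is not the shader code: the function name, the nested-list layout, and the `loads` counter are all hypothetical, and the sketch only shows why loading each A row once per group cuts the number of A-matrix loads by the group size.

```python
def mul_mat_vec_grouped(a, b, gqa_ratio):
    """Hypothetical sketch of reusing A-matrix loads across a GQA group.

    a[kv_head][row] is a list of k values (one A matrix per KV head).
    b[q_head] is a list of k values (one vector per query head).
    Query heads kv_head*gqa_ratio .. (kv_head+1)*gqa_ratio-1 share a[kv_head].
    Returns (out, loads) where out[q_head][row] is the dot product.
    """
    out = [[0] * len(a[q_head // gqa_ratio]) for q_head in range(len(b))]
    loads = 0  # counts how many times a row of A is fetched
    for kv_head, rows in enumerate(a):
        group = range(kv_head * gqa_ratio, (kv_head + 1) * gqa_ratio)
        for r, row in enumerate(rows):
            loads += 1  # the row is fetched once for the whole group
            for q_head in group:
                out[q_head][r] = sum(x * y for x, y in zip(row, b[q_head]))
    return out, loads
```

With two KV heads, two rows each, and a group size of 2, the sketch performs 4 row loads instead of the 8 that a per-query-head loop would need.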
* move op key generate function to kOpCaps
* fix op desc print
* try fix rms_norm
* Revert "try fix rms_norm"
This reverts commit 33b296098012909cb482fc29b52b28098dc971cd.
* add quantization type support by converting them to float
* enable quantization tensor for mulmat in gpu/npu
* fix asan error
* add log and assert
* insert output convert operator after mulmat
* add log
* fix some errors at runtime
* disable permute again
* add log
* add error function
* Revert "add error function"
This reverts commit f92ff47798ac8053fb776c55efbb1a98469c7af1.
* add log
* more log
* disable convert op in graph
* wip
* add f16 config for graph
* set f16 precision for f16 graph
* fix override data type
* add comment
* add config flag to enable quantize type
* add log
* more quantized type for cpu and gpu backend
* enable all quant types for cpu and gpu backend
* rename
* wip
* add log
* remove unused functions
* skip permute
* remove get_qnn_op_input_param_count
* fallback to generic_get_op_desc if no op_desc
* revert 'skip permute'
* Revert "revert 'skip permute'"
This reverts commit 5761e31fd23c69c4cabf6fd9fac1a0d3e5a74968.
* wip
* add log
* print qnn tensor type
* add log
* limit the max size of tensor
* add log
* fix tensor size limiter
* small improvement to tensor info printer
* disable sqrt and div to pass test-backend-ops for 8 gen 2
* remove debug log in release build
* add log
* skip permute in src
* wip
* disable reshape
* skip mul at decoder start
* wip
* add log
* add qnn_scoped_timer
* add perf tracker in graph
* add cmake options GGML_QNN_ENABLE_PERFORMANCE_TRACKING
* fix flag name
* use milliseconds
* wip
* fix comment string
* add file for profiler
* change qnn-cpu to GGML_BACKEND_DEVICE_TYPE_ACCEL, so that we can run tests on cpu
* wip
* profiler: refactoring
* wip
* add implementation for print_profile_events
* set up profiler for graph
* set profiler to graph execute
* pretty print events
* unified log print prefix
* print event count
* enable optrace
* print duration at event end
* wip
* add more detailed soc information
* wip
* move device caps array into qnn-lib.cpp
* remove lib_name in device_context
* move get_graph_key_from_cgraph to graph.cpp
* add override type for tensor key
* use override_type instead of original data type for graph key
* append op type to tensor name to fix error in qwen
* remove todo
* wip
* [SYCL] Fix build on Windows when ccache enabled (#9954)
* take effect only on windows and force it to icl
---------
Co-authored-by: Romain Biessy <romain.biessy@codeplay.com>
* webui: Make textarea uncontrolled to eliminate devastating lag
* Update index.html.gz
* use signal-style implementation
* rm console log
* no duplicated savedInitValue set
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Add block interleaving support for Q4_K quantization
* Remove whitespaces and fix CI/CD issues
* Update pointer of bsums from int16_t to const int16_t
* Add vector version of quantize_q8_K_4x8 function
* Update code formatting based on review comments
tokenizer.added_tokens_decoder returns a fresh dict on every access and is relatively slow (~0.04s on average), which results in massive slowdowns when there is a huge number of added tokens.
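
A minimal sketch of the general remedy, reading the property once and reusing the result inside the loop. `SlowTokenizer` and `collect_added_tokens` are hypothetical stand-ins written for illustration; they are not the conversion script's code or the transformers API, and only mimic the "fresh dict on every access" behaviour described above.

```python
class SlowTokenizer:
    """Hypothetical stand-in: rebuilds the dict on every property access."""

    def __init__(self, added):
        self._added = added
        self.calls = 0  # number of times the property was evaluated

    @property
    def added_tokens_decoder(self):
        self.calls += 1
        return dict(self._added)  # fresh dict each time, like the real property


def collect_added_tokens(tokenizer, token_ids):
    # Evaluate the expensive property once, outside the loop.
    decoder = tokenizer.added_tokens_decoder
    return [decoder[i] for i in token_ids if i in decoder]
```

Accessing `tokenizer.added_tokens_decoder` inside the comprehension instead would rebuild the dict once or twice per token id, which is where the per-access cost multiplies.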
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1
Fixes Issue: #12182
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* ci: add visionOS build workflow
Add a new GitHub Actions workflow for building on visionOS with CMake and Xcode.
* ggml: Define _DARWIN_C_SOURCE for visionOS to fix missing u_xxx typedefs
* ci: remove define hacks for u_xxx system types
---------
Co-authored-by: Giovanni Petrantoni <7008900+sinkingsugar@users.noreply.github.com>
* Add support for GPT2, Bloom and CodeShell tied word embeddings
* Deduplicate tied word embeddings weights
* Workaround for incorrect weight map
It appears transformer.wte.weight is in the weight map even though the weights are not there; remove it if output weights are encountered first.
* check++
* fatfingers--
I've been seeing significantly worse performance for tg with flash attention
enabled vs disabled, and it seems to be related to the submit heuristic.
Change the heuristic to check how many bytes worth of weight matrix are
used and flush every 100MB, and ramp up after the first few submits.
This seems to resolve the issue, and also increases perf for non-FA a bit.
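
A rough sketch of a byte-based submit heuristic of this kind. The class name, the linear ramp, and the `ramp_submits` parameter are assumptions made for illustration; the commit only states that the flush happens every 100MB of weight matrix and ramps up after the first few submits, not how the ramp is shaped.

```python
MB = 1024 * 1024


class SubmitTracker:
    """Hypothetical sketch: flush once enough weight-matrix bytes are queued."""

    def __init__(self, target_bytes=100 * MB, ramp_submits=3):
        self.target_bytes = target_bytes
        self.ramp_submits = ramp_submits  # assumed length of the ramp-up phase
        self.submit_count = 0
        self.pending_bytes = 0

    def threshold(self):
        # Assumed ramp: the threshold grows linearly over the first few
        # submits, then stays at target_bytes.
        step = min(self.submit_count + 1, self.ramp_submits)
        return self.target_bytes * step // self.ramp_submits

    def record(self, weight_bytes):
        """Account for one op's weight bytes; return True if a submit is due."""
        self.pending_bytes += weight_bytes
        if self.pending_bytes >= self.threshold():
            self.pending_bytes = 0
            self.submit_count += 1
            return True
        return False
```

Starting with a smaller threshold gets work to the GPU early, while the larger steady-state threshold avoids the overhead of many small submits.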
* opencl: more profiling timing
* opencl: generate trace for profiling
* opencl: reduce profiling overhead
* Populate profiling timing info at the end rather than after each
kernel run
* opencl: fix for chrome tracing
* graph : normalize Q, K, V shapes and add comments
ggml-ci
* context : synchronize before getting cross attention data
* model : fix command-r attention norm check
* Enable CUDA Graph on CTK < 12.x
The `cudaGraphExecUpdate` API changed in 12.x, so CUDA graph support was disabled on older CUDA toolkits. This change enables CUDA graph support on CTK < 12.x by using the older API there.
* Fix compilation errors with MUSA
* Disable CUDA Graph for MUSA