* common : add standard Hugging Face cache support
- Use HF API to find all files
- Migrate all manifests to the Hugging Face cache at startup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check with the quant tag
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Cleanup
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Improve error handling and report API errors
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Restore common_cached_model_info and align mmproj filtering
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Prefer main when getting cached ref
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use cached files when HF API fails
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Use final_path
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* Check all inputs
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
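For readers unfamiliar with the layout referenced above: the standard Hugging Face hub cache stores each model repo under a folder named `models--<org>--<name>`, with `refs/` and `snapshots/` inside it. The sketch below shows that naming plus a "prefer main" ref choice. Both helper names are hypothetical and are not the functions added by this change; the fallback to the first listed ref is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hub cache folder name for a model repo: "org/name" -> "models--org--name".
std::string hf_repo_dir_name(const std::string & repo_id) {
    std::string out = "models--";
    for (char c : repo_id) {
        if (c == '/') {
            out += "--";
        } else {
            out += c;
        }
    }
    return out;
}

// Hypothetical helper: pick "main" when it is cached, otherwise fall back to
// the first ref that was found (the fallback is an assumption of this sketch).
std::string pick_cached_ref(const std::vector<std::string> & refs) {
    assert(!refs.empty());
    for (const auto & r : refs) {
        if (r == "main") {
            return r;
        }
    }
    return refs.front();
}
```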
* hex-dma: make chained dma the default to handle newer models
This also includes some new instrumentation that we can remove later.
* hexagon: add uint32 dump helper
* hexagon: use single-page VTCM allocation to avoid issues with large gather ops in ssm-conv
ssm-conv uses the HVX gather instruction, which cannot handle cases where base+offset
spans a page boundary.
* hexagon: update ssm-conv to make base-addr compute a bit easier to read
* hex-dma: use 1d mode for reshaping, it supports sizes up to 24-bits (>16MB)
* hex-bin: fix incorrect stride logic
* hexagon: make sure repack buffs are dumped for verbose > 2
* hex-bin: consistently use dma_queue_push even for dummy dst transactions
* hex-dma: start using 2d-wide mode on v75 and up
This removes the need to deal with the 16-bit limitation for the strides.
* hex-bin: cleanup kernel selection logic
* hex-bin: cleanup binary op core and fix transposed tensor handling
* snapdragon: update run-bench to use larger ubatch and fa-on
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
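The ssm-conv note above says the gather cannot handle a base+offset range that spans page boundaries. A minimal sketch of that condition, assuming a power-of-two page size (the actual VTCM page size is not stated in this log, and `within_single_page` is a hypothetical name, not a function from the backend):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// True when the byte range [base, base + len) lies inside one page.
// page_size must be a power of two; its real value is an assumption here.
bool within_single_page(std::uintptr_t base, std::size_t len, std::size_t page_size) {
    assert(len > 0 && page_size > 0 && (page_size & (page_size - 1)) == 0);
    const std::uintptr_t first_page = base / page_size;
    const std::uintptr_t last_page  = (base + len - 1) / page_size;
    return first_page == last_page;
}
```

Allocating the gather source as a single page makes this check hold by construction, which appears to be the intent of the single-page VTCM allocation.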
* metal: add conv_3d backend
Rebased with master and resolved conflicts.
* Resolved issues related to changes in variable names
* Restore kernel_upscale_bilinear_f32, which was missing in my branch; all tests should pass now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
ACL graph capture disallows host-to-device memcpy and device memory
malloc/free on the captured stream. Pre-load the RoPE cache before
capture so that:
- Host-to-device copies and allocations run on the non-captured stream
- Cache metadata is populated and memory pool is warmed up
- During capture, only on-device computations are recorded; host-side
and allocation branches are skipped
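The ordering described above can be sketched with a mock stream. This does not use the ACL API; `MockStream`, `RopeCache`, and `capture_rope` are hypothetical stand-ins that only model the rule "no host-to-device copy or allocation while capturing".

```cpp
#include <cassert>
#include <string>
#include <vector>

// Mock stream: while capturing, host-to-device copies and allocations fail.
struct MockStream {
    bool capturing = false;
    std::vector<std::string> recorded;  // ops recorded during capture

    bool alloc_and_upload() {           // stands in for malloc + H2D memcpy
        return !capturing;
    }
    void compute(const std::string & op) {
        if (capturing) {
            recorded.push_back(op);
        }
    }
};

struct RopeCache {
    bool loaded = false;

    // Returns false if the cache would have to be built on a capturing stream.
    bool ensure_loaded(MockStream & s) {
        if (loaded) {
            return true;                // host-side and allocation branch skipped
        }
        if (!s.alloc_and_upload()) {
            return false;
        }
        loaded = true;
        return true;
    }
};

// Warm the cache on a non-capturing stream, then capture device-side work only.
bool capture_rope(RopeCache & cache, MockStream & warmup, MockStream & captured) {
    if (!cache.ensure_loaded(warmup)) {
        return false;
    }
    captured.capturing = true;
    const bool ok = cache.ensure_loaded(captured);  // already loaded: no copy needed
    captured.compute("rope");
    captured.capturing = false;
    return ok;
}
```

Without the warm-up call, `ensure_loaded` on the capturing stream would hit the allocation branch and fail, which is the situation the pre-load avoids.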
* fix(openvino): explicit memset in buffer_context allocation
* minor
---------
Co-authored-by: Dan Hoffman <dhoffman@cyket.net>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add support for InternVL's dynamic high-resolution (needed for Qianfan-OCR)
* add min/max dynamic patch to gguf meta
* clean up
* simplified handling min/max dynamic patch
* reuse llava_uhd logic for slice images
* provide default values for older models
* flake8
* prevent writing 0 value to gguf
* remove duplicated resolution candidates with a better algorithm
* fix indentation
* format
* add protection from divide by zero
* change to 0 to be safe
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
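One way to build the resolution candidates without producing duplicates is to enumerate each (cols, rows) grid exactly once, bounded by the min/max dynamic patch values. This is a sketch under that assumption, not necessarily the algorithm used in this change; `grid_candidates` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Enumerate (cols, rows) grids with min_patches <= cols * rows <= max_patches.
// Each pair is produced exactly once, so no de-duplication pass is needed.
std::vector<std::pair<int, int>> grid_candidates(int min_patches, int max_patches) {
    std::vector<std::pair<int, int>> out;
    if (min_patches < 1) {
        min_patches = 1;  // guard against 0 or negative metadata values
    }
    for (int cols = 1; cols <= max_patches; ++cols) {
        for (int rows = 1; cols * rows <= max_patches; ++rows) {
            if (cols * rows >= min_patches) {
                out.emplace_back(cols, rows);
            }
        }
    }
    // Smaller grids first; stable sort keeps enumeration order for equal areas.
    std::stable_sort(out.begin(), out.end(),
        [](const std::pair<int, int> & a, const std::pair<int, int> & b) {
            return a.first * a.second < b.first * b.second;
        });
    return out;
}
```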
* ggml-cuda: native bf16 flash attention for vec and tile kernels
The mma kernel still converts bf16 to fp16 before launch; native mma bf16 is a TODO.
* ggml-cuda: address code owner review feedback
reverted tile kernel changes to avoid larger refactor
* fix ci failures on turing and hip
* fix bf16 vec kernel compile on hip v_dot2 platforms
* add comments
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Increase per-thread work if the K-dimension is small
With tensor parallelism, the K-dimension of the FFN-down matrices is split, which makes it quite small, especially for MoEs. For example, Qwen3-30B-A3B has a K-dimension of 768, and Qwen3-235B-A22B has a K-dimension of 1536.
The current heuristic uses a group of 4 warps irrespective of K-dimension size, resulting in some of the threads being idle. This results in poor performance for these matrices.
This change increases the number of output elements per block for such cases.
* Limit this change to ncols_dst = 1
* tab to space
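The idea behind the small-K change can be sketched as a host-side heuristic. The function name, the per-thread element count, and the row cap below are assumptions for illustration; only the 4-warp group and the ncols_dst = 1 restriction come from the messages above.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical heuristic: number of output rows handled by one 4-warp block.
int rows_per_block(int k, int ncols_dst, int elems_per_thread, int max_rows) {
    assert(k > 0 && elems_per_thread > 0 && max_rows > 0);
    const int threads_per_block = 4 * 32;  // 4 warps of 32 threads
    if (ncols_dst != 1) {
        return 1;                          // change is limited to ncols_dst = 1
    }
    // Threads needed to cover one row of K in a single pass (rounded up).
    const int threads_needed = (k + elems_per_thread - 1) / elems_per_thread;
    if (threads_needed >= threads_per_block) {
        return 1;                          // large K: the block is already busy
    }
    // Small K: give the otherwise idle threads additional output rows.
    return std::min(max_rows, threads_per_block / threads_needed);
}
```

With an assumed 32 elements per thread and a cap of 4 rows, K = 768 needs only 24 of the 128 threads for one row, so the sketch assigns 4 rows to the block, while a K of 8192 keeps the single-row layout.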