This is a configuration of the hparams in the GraniteHybrid architecture
that devolves to the Granite (or GraniteMoe) architecture (ie Granite 3.x).
It may be used for some models in the Granite 4 family with the
GraniteHybrid architecture acting as a superset arch. Rather than support
it directly in the c++ graph, we simply coerce the architecture flag back
to the correct "granite" or "granitemoe" architecture.
Branch: gabe-l-hart/GraniteNonHybridConversion
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Disable 'performance-enum-size' checking:
Enum 'llama_token_type' uses a larger base type ('unsigned int', size: 4 bytes)
than necessary for its value set, consider using 'std::uint8_t' (1 byte) as the
base type to reduce its size.
* implement set_rows with i32 index
* template fix
* test quantized path
warnings--
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* forgotten name change
* deduplicate cuda/sycl and test-fix
* indent++
* vulkan: support set_rows with i32 index type (#16162)
* disable i32 index for webgpu for now
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
* common : use the json parser
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
* common : enable --offline mode without CURL support
This change refactors the download logic to properly support offline mode
even when the project is built without CURL.
Without this commit, using `--offline` would give the following error:
error: built without CURL, cannot download model from the internet
even if all the files are already cached.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
---------
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
These two local variables 'arg' and 'arg_prefix' have been overriden by:
1. for (const auto & arg : opt.args)
2. for (int i = 1; i < argc; i++) {
const std::string arg_prefix = "--";
std::string arg = argv[i];
* Vulkan: add conv_transpose_2d operation
* Vulkan: fix typo in conv_transpose_2d shader(s0mp, s0L, s1mp, s1L)
* Vulkan: fix incorrect indentation in conv_transpose_2d shader
* Vulkan: add checking the push constants size limit and reuse conv2d_mm.comp for conv_transpose_2d operation
* Vulkan: revert the order of the index calculation and bound check in conv_2d shader
* Vulkan: explicity check push constants limit in supports_op() for conv_transpose_2d operation.
* Vulkan: remove unnecessary lower bound checks for H/W_idx in the conv_2d shader.
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions
Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
* ci : migrate ggml ci to a self-hosted runners
* ci : add T4 runner
* ci : add instructions for adding self-hosted runners
* ci : disable test-backend-ops from debug builds due to slowness
* ci : add AMD V710 runner (vulkan)
* cont : add ROCM workflow
* ci : switch to qwen3 0.6b model
* cont : fix the context size
* vulkan: optimize UMA buffer operations and fix driver hangs
The previous implementation was blocking the GPU for extended periods,
causing the i915 driver to reset the context due to the hangcheck
protection.
[32628.443070] i915 0000:00:02.0: [drm] GPU HANG: ecode 12:1:85dffffb, in llama-server [194114]
[32628.443091] i915 0000:00:02.0: [drm] llama-server[194114] context reset due to GPU hang
* vulkan: implement deferred_memset on UMA
---------
Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>