* more log
* split graph implementation into cpp file
* rename: ggml_qnn_graph -> qnn_graph
* add input/output tensor to graph
* fix assert
* wip
* add _ggml_tensor field in qnn tensor
* add comments
* add set_data_buffer with raw memory buffer
* use set_data_buffer
* op param buffer use qnn_buffer_ptr
* add qnn_mem_buffer_slice
* use qnn_buffer_ptr as tensor buffer
* use new set_data_buffer to reduce copy
* ggml_qnn_op_config: add function to set input/output tensor before init node
* remove ggml_qnn_connectable_op_config and use ggml_qnn_single_op_config instead
* wip
* add initialize_op_nodes without tensor params
* wip
* add op caps table
* merge kGgmlOpToQnnOp and kOpCaps tables
* wip
* add cache parameter to create_tensors
* add init_from_ggml_graph
* disable gelu for all backends
* wip
* move op index calc to op config module
* use the ggml_tensor as parameter of build_graph
* add log
* use create_operation_from_op_tensor in old build_graph function
* remove unused constructors
* fix parameter count
* remove unused member func/var
* make init_from_ggml_graph a class member: build_graph_from_ggml_graph
* move graph finalize into member function `finalize()`
* get graph key from ggml op tensor directly
* append output type
* reduce tensor key length
* add function to generate key from ggml_cgraph
* simplify graph cache insert and delete
* remove template param at get_qnn_graph_from_cache
* wip
* merge kQnnUnaryOpsTable and kQnnBinaryOpsTable
* refactor device_supports_op
* add log
* wip
* use framework function to check same shape
* wip
* extract some logic into a separate function
* wip
* add execution function that runs graph
* add function to create qnn graph from ggml_cgraph with cache
* execute graph directly
* return null graph key for empty graph
* add more qualcomm chipset enums
* add cap for reshape
* disable some ops
* try to skip GGML_OP_VIEW
* more log for view tensor
* append param tensor to intermediate tensor key
* use 'ordered' set
* fix warning in release
* wip
* SYCL: refactor ggml_sycl_compute_forward
* SYCL: add back GGML_UNUSED(dst) to ggml_sycl_cpy
* SYCL: add function name to noop debug
* SYCL: Some device info print refactoring and add details of XMX availability
Since NVIDIA does not release CUDA for in-maintenance versions of Fedora, the process of setting up the CUDA toolkit on Fedora has become quite involved. This guide should help mere mortals install CUDA for development in a Fedora 39 toolbox environment, without affecting the host system.
* server : add tooltips to settings and themes btn
This commit adds tooltips to the settings and themes buttons in the
webui. The tooltip will be displayed below the actual buttons when
hovered over.
The motivation for this change is to clarify the purpose of the themes
button.
* squash! server : add tooltips to settings and themes btn
This commit adds a tooltip to the '...' button when a chat has been
started. The tooltip is "Chat options", which I think could be a good
description as the dropdown contains options to delete or download the
current chat.
* rm tooltip for 3 dots button
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* Moved scripts dir and fixed pyproject.toml
* updated readme
* fixed README urls
* bump pypi gguf to v0.14.0
* retrigger ci
* empty commit - trigger ci
The main motivation for this change is that it was not handling
ctrl-c/ctrl-d correctly. Modify `read_user_input` to handle EOF,
the "/bye" command, and empty input cases. Introduce a
`get_user_input` function to manage the user input loop and handle
the different return cases.
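The split described above can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `input_status` enum and the stream parameter are hypothetical, introduced here so the loop is self-contained and testable; the real functions read from the terminal.

```cpp
#include <iostream>
#include <sstream>
#include <string>

// Hypothetical status values for the sketch.
enum class input_status { ok, done, empty };

// read_user_input: classify one line of input.
// Returns 'done' on EOF (e.g. ctrl-d) or the "/bye" command,
// 'empty' on a blank line, 'ok' otherwise.
static input_status read_user_input(std::istream & in, std::string & line) {
    if (!std::getline(in, line)) {
        return input_status::done;   // EOF ends the session cleanly
    }
    if (line == "/bye") {
        return input_status::done;   // explicit quit command
    }
    if (line.empty()) {
        return input_status::empty;  // blank line: prompt again
    }
    return input_status::ok;
}

// get_user_input: loop until we get a non-empty line or the session ends,
// so the caller only has to handle the 'ok' and 'done' cases.
static input_status get_user_input(std::istream & in, std::string & line) {
    while (true) {
        const input_status s = read_user_input(in, line);
        if (s != input_status::empty) {
            return s;
        }
    }
}
```

Keeping the retry loop in `get_user_input` means `read_user_input` stays a single-shot classifier, which is easier to reason about for the EOF path.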
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
* (wip) support mergekit-extracted lora
* support mergekit-extract-lora
* use lora->get_scale
* correct comment
* correct norm name & condition
* add some hints
This change upstreams llamafile's CPU matrix
multiplication kernels for ppc64le, using MMA
builtins for the quantised int8 datatype.
This change results in a 10% - 70% improvement
in total speed (i.e. all tokens/total time) across
various batch sizes.
The patch is tested with Meta-Llama-3-8B,
Mistral-7B, and Llama-2-7B-chat-hf models on an
IBM POWER10 machine.
Signed-off-by: Amrita H S <amritahs@linux.vnet.ibm.com>
* Disable GL_KHR_cooperative_matrix Vulkan extension if not available.
* Perform Vulkan extensions checks in a more sensible order
* Remove unnecessary #ifdef directive
* GGUF: C++ refactor, backend support, misc fixes
remove ggml_tensor.backend
update CODEOWNERS [no ci]
remove gguf_get_data from API
revise GGUF API data types
* SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6
* Revert "SYCL: Use get_multi_ptr instead of deprecated get_pointer in wkv6"
This reverts commit f62dc45f31.
* Reland: Use get_multi_ptr instead of deprecated get_pointer in wkv6
This commit renames the `batch` parameter to `ubatch` in the
`llama_kv_cache_find_slot`, `llm_build_inp_embd`, and
`llm_build_mamba` functions.
The motivation for this is that this should have been done as part of
Commit 19d900a756 ("llama : rename batch
to ubatch (#9950)") but for some reason I missed these functions in
that commit and only noticed them now (sorry).
* convert : extend DEEPSEEK2 model architecture to support DeepseekV3ForCausalLM by adding EXPERT_WEIGHTS_NORM and EXPERT_GATING_FUNC model parameters and FFN_EXP_PROBS_B tensor type
* vocab : add DeepSeek V3 pre-tokenizer regexes
* unicode : handle ACCENT_MARK and SYMBOL categories in regex
* llama : add DeepSeek V3 chat template, handle new model parameters and tensor types
---------
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
This commit attempts to improve the log message for the inputs of the
splits in the sched_print_assignments function.
The motivation for this change is that currently a colon is displayed
at the end of the line even when there are no inputs, which can make
the output a little confusing to read, as the line below could be
misread as inputs when it in fact lists nodes. With this change the
colon is only printed if there actually are inputs.
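The formatting change can be sketched like this. The helper name `format_split` and its parameters are hypothetical, chosen only to show the conditional colon; the real code prints directly inside sched_print_assignments.

```cpp
#include <string>
#include <vector>

// Hypothetical sketch: emit the trailing colon and input list only when
// the split actually has inputs, so an empty split prints no dangling ":".
static std::string format_split(int split_id, const std::vector<std::string> & inputs) {
    std::string out = "split #" + std::to_string(split_id);
    if (!inputs.empty()) {           // previously the colon was printed unconditionally
        out += ": inputs:";
        for (const auto & name : inputs) {
            out += " " + name;
        }
    }
    return out;
}
```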