Commit Graph

491 Commits

Author SHA1 Message Date
Zijun Yu 9789c4ecdc
ggml : add OpenVINO backend (#15307)
* Update build doc

* Add cgraph tensor output name to OV op name

* Update openvino build instructions

* Add initial NPU support

* draft NPU support version 2: prefill + kvcache

* NPU support version 2: prefill + kvcache

* Change due to ggml cgraph changes, not correct yet

* Change due to ggml cgraph changes, llama-3.2 CPU work

* Add AMD64 to CMakeLists

* Change due to ggml cgraph changes, all device work

* Refactor: clean, fix warning

* Update clang-format

* Statful transformation for CPU GPU

* Add SwiGLU

* Fuse to SDPA

* Replace Concat with Broadcast in MulMat for GQA

* Pull out indices creation for kv cache update

* Refactor: remove past_token_len from extra_inputs

* Fix Phi3 SwiGLU and SoftMax

* Pull out sin cos from rope

* Reduce memory: free ov weights node after graph conversion

* Fix CPY due to cgraph change

* Added OpenVINO CI/CD. Updated docs

* Fix llama-cli

* Fix Phi3 ROPE; Add test-backend-ops

* Fix NPU

* Fix llama-bench; Clang-format

* Fix llama-perplexity

* temp. changes for mark decomp

* matmul in fp32

* mulmat input conversion fix

* mulmat type conversion update

* add mark decomp pass

* Revert changes in fuse_to_sdpa

* Update build.md

* Fix test-backend-ops

* Skip test-thread-safety; Run ctest only in ci/run.sh

* Use CiD for NPU

* Optimize tensor conversion, improve TTFT

* Support op SET_ROWS

* Fix NPU

* Remove CPY

* Fix test-backend-ops

* Minor updates for raising PR

* Perf: RMS fused to OV internal RMS op

* Fix after rebasing

- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run

* Change openvino device_type to GPU; Enable flash_attn

* Update supports_buft and supports_op for quantized models

* Add quant weight conversion functions from genai gguf reader

* Quant models run with accuracy issue

* Fix accuracy: disable cpu_repack

* Fix CI; Disable test-backend-ops

* Fix Q4_1

* Fix test-backend-ops: Treat quantized tensors as weights

* Add NPU Q4_0 support

* NPU perf: eliminate zp

* Dequantize q4_1 q4_k q6_k for NPU

* Add custom quant type: q8_1_c, q4_0_128

* Set m_is_static=false as default in decoder

* Simpilfy translation of get_rows

* Fix after rebasing

* Improve debug util; Eliminate nop ReshapeReshape

* STYLE: make get_types_to_requant a function

* Support BF16 model

* Fix NPU compile

* WA for npu 1st token acc issue

* Apply EliminateZP only for npu

* Add GeGLU

* Fix Hunyuan

* Support iSWA

* Fix NPU accuracy

* Fix ROPE accuracy when freq_scale != 1

* Minor: not add attention_size_swa for non-swa model

* Minor refactor

* Add Q5_K to support phi-3-q4_k_m

* Requantize Q6_K (gs16) to gs32 on GPU

* Fix after rebasing

* Always apply Eliminate_ZP to fix GPU compile issue on some platforms

* kvcachefusion support

* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added

* Fix for Phi3

* Fix llama-cli (need to run with --no-warmup)

* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working

* fix after rebasing

* Fix llama-3-8b and phi3-mini q4_0 NPU

* Update to OV-2025.3 and CMakeLists.txt

* Add OV CI cache

* Apply CISC review and update CI to OV2025.3

* Update CI to run OV dep install before build

* Update OV dockerfile to use OV2025.3 and update build docs

* Style: use switch in supports_ops

* Style: middle ptr and ref align, omit optional struct keyword

* NPU Unify PD (#14)

* Stateless. Fix llama-cli llama-server

* Simplify broadcast op in attention

* Replace get_output_tensor+memcpy with set_output_tensor

* NPU unify PD. Unify dynamic and static dims

* Clean placeholders in ggml-openvino.cpp

* NPU unify PD (handled internally)

* change graph to 4d, support multi sequences

* Fix llama-bench

* Fix NPU

* Update ggml-decoder.cpp

Hitting error while compiling on windows:

error C3861: 'unsetenv': identifier not found

Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.

Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.

This keeps cross-platform compatibility.

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Update ggml-decoder.cpp

* Remove the second decoder for node. Moving the function into the model decoder

* Fix error for naive

* NPU prefill chunking

* NPU fix llama-bench

* fallback naive run with accuracy issue

* NPU support llma-perplexity -b 512 --no-warmup

* Refactor: split ov_graph_compute for dynamic and static

* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)

* minor update due to ov 2025.4

* remove unused API GgmlOvDecoder::get_output_names()

* remove unused API get_output_shape(const std::string & name)

* Modified API GgmlOvDecoder::get_output_type(const std::string & name)

* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)

* Removed API get_output_ggml_tensor(const std::string & name)

* Removed API m_outputs

* Removed m_output_names

* Removed API GgmlOvDecoder::get_input_names()

* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)

* Removed API get_input_type

* Removed API get_input_type

* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)

* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)

* Fix error for decoder cache

* Reuse cached decoder

* GPU remove Q6_K requantization

* NPU fix wrong model output shape

* NPU fix q4 perf regression

* Remove unused variable nodes

* Fix decoder can_reuse for llama-bench

* Update build.md for Windows

* backend buffer: allocate on host

* Use shared_buffer for GPU NPU; Refactor

* Add ov_backend_host_buffer; Use cached remote context

* Put kvcache on GPU

* Use ggml_aligned_malloc

* only use remote tensor for kvcache

* only use remote tensor for kvcache for GPU

* FIX: use remote tensor from singleton

* Update build.md to include OpenCL

* NPU always requant to q4_0_128

* Optimize symmetric quant weight extraction: use single zp

* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant

* Update build.md

* Support -ctk f32

* Initial stateful graph support

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* code cleanup

* npu perf fix

* requant to f16 for Q6 embed on NPU

* Update ggml/src/ggml-openvino/ggml-decoder.cpp

* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp

* Create OPENVINO.md in llama.cpp backend docs

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update build.md

* Update OPENVINO.md

* Update OPENVINO.md

* Update OPENVINO.md

* kq_mask naming fix

* Syntax correction for workflows build file

* Change ov backend buffer is_host to false

* Fix llama-bench -p -n where p<=256

* Fix --direct-io 0

* Don't put kvcache on GPU in stateful mode

* Remove hardcode names

* Fix stateful shapes

* Simplification for stateful and update output shape processing

* Remove hardcode names

* Avoid re-compilation in llama-bench

* Extract zp directly instead of bias

* Refactor weight tensor processing

* create_weight_node accept non-ov backend buffer

* remove changes in llama-graph.cpp

* stateful masking fix (#38)

Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.

* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add

* hardcoded name handling for rope_freqs.weight

* Suppress logging and add error handling to allow test-backend-ops to complete

* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases

* Use bias instead of zp in test-backend-ops

* Update OV in CI, Add OV CI Tests in GH Actions

* Temp fix for multithreading bug

* Update OV CI, fix review suggestions.

* fix editorconfig-checker, update docs

* Fix tabs to spaces for editorconfig-checker

* fix editorconfig-checker

* Update docs

* updated model link to be GGUF model links

* Remove GGML_CPU_REPACK=OFF

* Skip permuted ADD and MUL

* Removed static variables from utils.cpp

* Removed initializing non-existing variable

* Remove unused structs

* Fix test-backend-ops for OV GPU

* unify api calling

* Update utils.cpp

* When the dim is dynamic, throw an error, need to is stastic forst

* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using

* No need to return

* Fix test-backend-ops for OV GPU LNL

* Fix test-thread-safety

* use the shape from infer request of output tensor create to avoid issue

* fix dynamic output shape  issue

* fix issue for the unused node in tests

* Remove unused lock

* Add comment

* Update openvino docs

* update to OV release version 2026.0

* add ci ov-gpu self hosted runner

* fix editorconfig

* Fix perplexity

* Rewrite the model inputs finding mechanism  (#54)

* Rewrite the model inputs finding logistic

* Put stateful shape handle in get input shape

* Put the iteration logistic in func

* Added ggml-ci-intel-openvino-gpu and doc update

* .hpp files converted to .h

* fix ggml-ci-x64-intel-openvino-gpu

* Fix for stateful execution bug in llama-bench

* Minor updates after stateful llama-bench fix

* Update ggml/src/ggml-openvino/utils.cpp

Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>

* Remove multiple get_shape calls

* Bring back mutex into compute

* Fix VIEW op, which slice the input node

* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access

* Temp. fix for test requant errors

* Update to OV ggml-ci to low-perf

* ci : temporary disable "test-llama-archs"

* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag

* docs : update url

* Fix OV link in docker and Update docs

---------

Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-03-14 07:56:55 +02:00
Christopher Maher a95047979a
readme : update infra list (#20212) 2026-03-08 12:42:28 +02:00
YardenTal44 2b10b62677
hexagon: add fp16 support for binary ops: add,sub,mul,div (#20139)
* hexagon: add fp16 support for binary ops: add,sub,mul,div

* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)

* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad

* snapdragon: fix readme link

---------

Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
2026-03-05 18:29:13 -08:00
Kevin Pouget f5e7734ff2
ggml-virtgpu: add backend documentation (#19354)
* ggml-virtgpu: add backend documentation

Assisted-by-AI: Claude Code

* CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget

* README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md

* docs/ggml-virt: add link to testing + configuration

* Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget"

This reverts commit 8ece8e72e2.

* drop the ggml- prefix

* s/ggerganov/ggml-org

* Relocate VirtGPU.md

* reorganize the text

* turn turn the ascii diagram into a mermaid

* README.md: update the link to the main doc
2026-02-09 20:15:42 +08:00
Antonis Makropoulos b316895ff9
docs: Add LlamaLib to UI projects (#19181) 2026-01-30 14:54:28 +08:00
Molly Sophia 1243f93a2d
readme: update RWKV7 model links (#19061)
Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2026-01-25 09:11:19 +02:00
Xuan-Son Nguyen c15395f73c
common : implement new jinja template engine (#18462)
* jinja vm

* lexer

* add vm types

* demo

* clean up

* parser ok

* binary_expression::execute

* shadow naming

* bin ops works!

* fix map object

* add string builtins

* add more builtins

* wip

* use mk_val

* eval with is_user_input

* render gemma tmpl ok

* track input string even after transformations

* support binded functions

* keyword arguments and slicing array

* use shared_ptr for values

* add mk_stmt

* allow print source on exception

* fix negate test

* testing more templates

* mostly works

* add filter_statement

* allow func to access ctx

* add jinja-value.cpp

* impl global_from_json

* a lot of fixes

* more tests

* more fix, more tests

* more fixes

* rm workarounds

* demo: type inferrence

* add placeholder for tojson

* improve function args handling

* rm type inference

* no more std::regex

* trailing spaces

* make testing more flexible

* make output a bit cleaner

* (wip) redirect minja calls

* test: add --output

* fix crash on macro kwargs

* add minimal caps system

* add some workarounds

* rm caps_apply_workarounds

* get rid of preprocessing

* more fixes

* fix test-chat-template

* move test-chat-jinja into test-chat-template

* rm test-chat-jinja from cmake

* test-chat-template: use common

* fix build

* fix build (2)

* rename vm --> interpreter

* improve error reporting

* correct lstrip behavior

* add tojson

* more fixes

* disable tests for COMMON_CHAT_FORMAT_GENERIC

* make sure tojson output correct order

* add object.length

* fully functional selectattr / rejectattr

* improve error reporting

* more builtins added, more fixes

* create jinja rendering tests

* fix testing.h path

* adjust whitespace rules

* more fixes

* temporary disable test for ibm-granite

* r/lstrip behavior matched with hf.js

* minimax, glm4.5 ok

* add append and pop

* kimi-k2 ok

* test-chat passed

* fix lstrip_block

* add more jinja tests

* cast to unsigned char

* allow dict key to be numeric

* nemotron: rm windows newline

* tests ok

* fix test

* rename interpreter --> runtime

* fix build

* add more checks

* bring back generic format support

* fix Apertus

* [json.exception.out_of_range.403] key 'content' not found

* rm generic test

* refactor input marking

* add docs

* fix windows build

* clarify error message

* improved tests

* split/rsplit with maxsplit

* non-inverse maxsplit

forgot to change after simplifying

* implement separators for tojson and fix indent

* i like to move it move it

* rename null -- > none

* token::eof

* some nits + comments

* add exception classes for lexer and parser

* null -> none

* rename global -> env

* rm minja

* update docs

* docs: add input marking caveats

* imlement missing jinja-tests functions

* oops

* support trim filter with args, remove bogus to_json reference

* numerous argument fixes

* updated tests

* implement optional strip chars parameter

* use new chars parameter

* float filter also has default

* always leave at least one decimal in float string

* jinja : static analysis + header cleanup + minor fixes

* add fuzz test

* add string.cpp

* fix chat_template_kwargs

* nits

* fix build

* revert

* unrevert

sorry :)

* add fuzz func_args, refactor to be safer

* fix array.map()

* loosen ensure_vals max count condition, add not impl for map(int)

* hopefully fix windows

* check if empty first

* normalize newlines

---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2026-01-16 11:22:06 +01:00
Adrien Gallouët 516a4ca9b5
refactor : remove libcurl, use OpenSSL when available (#18828) 2026-01-14 18:02:47 +01:00
thom-dev-fr 79456a690a
readme : update UIs (#18751) 2026-01-11 13:46:50 +02:00
Adrien Gallouët 56d2fed2b3
tools : remove llama-run (#18661)
* tools : remove llama-run
* Remove licenses/LICENSE-linenoise

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2026-01-07 16:18:26 +01:00
Naco Siren 5c0d18881e
llama.android : Rewrite Android binding (w/o cpu_features dep) (#17413)
* UI: implement basic UI components

* util: implement performance monitor; wrap it with a viewmodel

* util: implement user preferences utility

* UI: implement core flow's screens

* UI: add a new MainActivity; update manifest

* [WIP] DI: implement simple local vm factory provider

* UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark

* UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark

* UI: split a nested parent settings screen into separate child settings screens

* UI: polish system prompt setup UI

* Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject

* DB: setup Room database

* data: introduce repo for System Prompt; flow data from Room to VM

* bugfix: properly handle user's quitting conversation screen while tokens in generation

* UI: rename `ModeSelection` to `ModelLoading` for better clarity

* UI: update app name to be more Arm

* UI: polish conversation screen

* data: code polish

* UI: code polish

* bugfix: handle user quitting on model loading

* UI: locks user in alert dialog when model is unloading

* vm: replace token metrics stubs with actual implementation

* UI: refactor top app bars

* nit: combine temperatureMetrics and useFahrenheit

* DI: introduce Hilt plugin + processor + lib dependencies

* DI: make app Hilt injectable

* DI: make viewmodels Hilt injectable

* DI: replace manual DI with Hilt DI

* UI: optimize AppContent's composing

* bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController

* UI: navigation with more natural animated transitions

* DI: Optimize AppModule

* Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule

* UI: polish UI for ModelsManagementScreen; inject ModelsManagementVieModel

* DI: abstract the protocol of SystemPromptRepository; update AppModule

* data: [WIP] prepare for ModelRepository refactor & impl

* data: introduce Model entity and DAO; update DI module

* UI: replace Models Management screen's stubbing with instrumentation

* UI: polish sort order menu

* data: import local model with file picker

* bugfix: use List instead of Collection for ModelDao's deletion

* data: add a util file for extracting file name & size and model metadata

* UI: enrich ModelManagementState; extract filename to show correct importing UI

* UI: implement multiple models deletion; update Models Management screen

* UI: handle back navigation when user is in multi-selection mode

* util: extract file size formatting into ModelUtils

* UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog

* UI: extract a shared ModelCard component

* UI: replace model selection screen's data stubbing; add empty view

* nit: tidy SystemPromptViewModel

* Util: split FileUtils from ModelUtils; extract copy methods into FileUtils

* data: pass through getModelById from ModelDao into ModelRepository

* core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine

* vm: split mono MainViewModel into separate individual ViewModels

* vm: merge SystemPromptViewModel into ModelLoadingViewModel

* core: break down InferenceManager due to Interface Segregation Principle

* UI: show model card in Model Loading screen

* UI: show model card in Conversation screen

* UI: unify Model Card components

* core: swap in LLamaAndroid and mark stub engine for testing only

* data: allow canceling the ongoing model import

* UI: update UI ongoing model import's cancellation

* LLama: update engine state after handling the cancellation of sendUserPrompt

* VM: handle the cancellation of ongoing token generation

* LLama: refactor loadModel by splitting the system prompt setting into a separate method

* feature: check for available space before copying local model

* UI: centralize the AppScaffold and modularize its configs

* UI: refactor BottomBarConfig.ModelsManagement APIs

* UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig

* UI: replace ugly optional as casts in AppScaffold with extension functions

* UI: fix the typo `totalGb` in `StorageMetrics`

* UI: remove code duplication in sort menu

* LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages

* UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent

* UI: implement BenchmarkScreen's individual back handling

* LLama: add a new Initializing state; ; add two extension properties; rename LibraryLoaded state to Initialized

* UI: Introduce an abstract ViewModel to handle additional model unloading logics

* UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt

* UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen

* UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen

* nit: extract app name into a constant value; remove unused onBackPressed callbacks

* UI: update AppContent to pass in correct navigation callbacks

* nit: polish ModelLoadingScreen UI

* core: throw Exception instead of returning null if model fails to load

* navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens

* gguf: add GGUF metadata data holder and its corresponding extractor implementation

* DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library

* GGUF: make GgufMetadata serializable in order to be compatible with Room

* nit: refactor data.local package structure

* nit: rename lastUsed field to dateLastUsed; add dateAdded field

* UI: refactor ModelCard UI to show GGUF metadata

* UI: update ModelSelectionScreen with a preselect mechanism

* UI: polish model card

* nit: allow deselect model on Model Selection screen

* nit: revert accidental committing of debug code

* UI: polish ModelLoading screen

* util: extract formatting helper functions from FileUtils into a new FormatUtils

* UI: polish model cards on Benchmark and Conversation screens to show model loading metrics

* UI: show a Snack bar to warn user that system prompt is not always supported

* UI: handle back press on Model Selection screen

* UI: finally support theme modes; remove hardcoded color schemes, default to dynamic color scheme implementation

* feature: support searching on Model Selection screen

* nit: move scaffold related UI components into a separate package

* UI: extract InfoView out into a separate file for reusability

* data: move Model related actions (query, filter, sort) into ModelInfo file

* UI: animate FAB on model preselection states

* feature: support filtering in Model Management screen

* ui: show empty models info in Model Management screen

* ui: add filter off icon to "Clear filters" menu item

* [WIP] ui: polish Benchmark screen; implement its bottom app bar

* ui: polish Benchmark screen; implement its bottom app bar's rerun and share

* nit: disable mode selection's radio buttons when loading model

* feature: implement Conversation screen's bottom app bar

* pkg: restructure BottomAppBars into separate files in a child package

* pkg: restructure TopBarApps into separate files in a child package

* pkg: restructure system metrics into a separate file

* UI: polish Conversation screen

* data: update system prompt presets

* UI: allow hide or show model card on Conversation & Benchmark screens; fix message arrangement

* data: update & enhance system prompt presets

* deps: introduce Retrofit2

* data: implement HuggingFace data model, data source with Retrofit API

* data: update Model data repository to support fetching HuggingFace models

* [WIP] UI: replace the HuggingFace stub in Model Management screen with actual API call

* UI: map language codes into country Emojis

* ui: add "clear results" action to Benchmark screen

* nit: print current pp & tg in llama-bench

* UI: disable landscape mode; prevent duplicated benchmark running

* llama: migrate C/CXX flags into CMakeList

* [WIP] llama: ABI split builds five .so artifacts.

However, all .so are performing on SVE level

* [WIP] llama: ABI split where five tiers are built sequentially.

* [WIP] llama: disable OpenMP in ABI split since most SoCs are big.LITTLE

* [WIP] llama: enable KleidiAI and disable tier 4 due to `+sve+sve2` bug caused by `ggml_add_cpu_backend_variant_impl` as explained below

```CMake
if (NOT SME_ENABLED MATCHES -1)
...
    set(PRIVATE_ARCH_FLAGS "-fno-tree-vectorize;${PRIVATE_ARCH_FLAGS}+sve+sve2")
...
```

* core: add Google's cpu_features as a submodule

* core: implement cpu_detector native lib

* core: swap out hardcoded LlamaAndroid library loading

* core: add back OpenMP due to huge perf loss on TG128

* misc: reorg the pkg structure

* misc: rename LlamaAndroid related class to InferenceEngine prefixes

* [WIP] lib: move GgufMetadata into the lib submodule

* lib: expose GgufMetadataReader as interface only

* lib: replace the naive & plain SharedPreferences with DataStore implementation

* lib: hide the internal implementations, only expose a facade and interfaces

* lib: expose Arm features

* di: add a stub TierDetection; provide both actual impl and stub in AppModule

* UI: add visualizer UI for Arm features

* misc: UI polish

* lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier

* UI: support `NONE` Llama Tier in general settings

* lib: optimize engine loader; always perform a fresh detection when cache is null

* remote: add HuggingFaceModelDetails data class

* remote: refine HuggingFaceModel data class

* nit: remove `trendingScore` field from HuggingFace model entities, weird...

* remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource

* remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response

* UI: scaffold Models Management screen and view model

* UI: implement a dialog UI to show fetched HuggingFace models.

* UI: use a broadcast receiver to listen for download complete events and show local import dialog.

* data: handle network exceptions elegantly

* pkg: restructure `data`'s packages

* data: extract local file info, copy and cleanup logics into LocalFileDataSource

* nit: minor UI patch; add missing comments

* bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action.

* UI: improve autoscroll during token generation

* lib: tested on JFrog Artifactory for Maven publishing

* UI: show RAM warning if model too large

* UI: polish model management screen's error dialog

* util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code

* llm: properly propagate error to UI upon failing to load selected model

* UI: avoid duplicated calculation of token metrics

* lib: read & validate the magic number from the picked source file before executing the import

* UI: add "Learn More" hyperlinks to Error dialog upon model import failures

* lib: refactor the GgufMetadataReader to take  InputStream instead of absolute path as argument

* lib: fix the `SIMD` typo in Tier description

* core: verify model file path is readable

* lib: add UnsupportedArchitectureException for triaged error message

* util: split FormatUtils into multiple utils for better readability

* UI: change benchmark screen from raw markdown to table view

* bugfix: reset preselection upon running the preselected model

* misc: linter issue

* bugfix: fix the malfunctioning monitoring switch

* UI: update Arm features indicator; fix the broken hyperlinks

* UI: add quick action buttons to benchmark screen's result card

* UI: hide share fab after clearing all benchmark results

* UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen;

* UI: hide the stubbing actions in Conversation screen

* UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder

* UI: add a info button to explain token metrics

* misc: remove the redundant `Companion` added due to refactoring

* UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator

* UI: add info button to System Prompt switch; expand the model card by default

* UI: disable tag & language chips; add section headers to explain what they are

* misc: replace top bar indicator's spacer with padding

* UI: merge the Model Selection and Model Management into a unified Models screen

* UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity

* UI: add model loading in progress view; polish the empty model info view

* UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models

* build: [BREAKING] bump the versions of libraries and plugins

* UI: fix the breaking build

* UI: add Tooltip on Import FAB for user onboarding

* UI: adds AppPreferences to track user onboarding status

* UI: tracks user's first success on importing a model

* data: add hand crafted rules to filter the models fetched from HuggingFace API

* UI: update app name & about; polish top bars' indicators & buttons

* UI: polish Hugging Face download dialog UI

* UX: implement onboarding tooltips for model import and onboarding

* misc: use sentence case for CTA button labels

* [WIP] UI: add Arm color palette from Philip.Watson3

* UI: address Rojin's UX feedbacks

* UI: address Rojin's UX feedbacks - part 2

* UI: update Arm color palette from Philip.Watson3

* data: make sure fetch preselected models in the same order of their IDs

* UI: fix UI issues in the generic settings screen and navigation drawer

* nit: address Rojin's feedbacks on model import message again

* nit: append `®` to all `Arm` labels

* UI: extract a reusable InfoAlertDialog

* core: support GGML_CPU_ALL_VARIANTS on Android!

* core: restructure Kleidi-Llama library

* core: organizing cmake arguments

* data: sort preselected models according to device's available RAM

* app: update adaptive + themed + legacy icons and app name

* UI: fix the font size auto scaling for ArmFeaturesVisualizer

* core: further improve the performance on native methods

* UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label

* UI: make more room for assistant message bubble's width

* UI: better usage of tertiary colors to highlight model cards but not for warnings

* UI: fix the layout issue on large font sizes

* lib: support x86-64 by dynamically set Arm related definitions

* lib: replace the factory pattern for  deprecated tiered lib loading with single instance pattern

* llama: update the library name in JNI and CMake project

* llama: update the library's package name and namespace

* llama: update the app's package name and namespace

* app: bump ksp version

* app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge

* app: extract AppContent from MainActivity to a separate file in ui package

* lib: add File version for GGUF Magic number verification

* lib: perform engine state check inclusively instead of exclusively

* lib: change `LlamaTier` to `ArmCpuTier`

* lib: remove kleidi-llama related namings

* cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample

Note: the full Google Play version of AI Chat app will be open will be open sourced in another repo soon, therefore didn't go through the trouble of pruning the history using `git filter-repo` here.

* [WIP] doc: update main and Android README docs; add self to code owners

* lib: revert System.load back to System.loadLibrary

* jni: introduce a logging util to filter different logging levels on different build types

* lib: enable app optimization

* doc: replace stub Google Play app URL with the actual link add screenshots; add my GitHub ID to maintainer list

* Remove cpu_features

* Fix linters issues in editorconfig-checker job

https://github.com/ggml-org/llama.cpp/actions/runs/19548770247/job/55974800633?pr=17413

* Remove unnecessary Android CMake flag

* purge include/cpu_features directory

---------

Co-authored-by: Han Yin <han.yin@arm.com>
2025-12-17 10:14:47 +02:00
Andrew Aladjev 4a4f7e6550
cli: fixed dead links to tools/main for cli and completion, fixed code owners (#17993)
Co-authored-by: Andrew Aladjev <andrew.aladjev@gmail.com>
2025-12-15 11:47:04 +01:00
Xuan-Son Nguyen 6c2131773c
cli: new CLI experience (#17824)
* wip

* wip

* fix logging, add display info

* handle commands

* add args

* wip

* move old cli to llama-completion

* rm deprecation notice

* move server to a shared library

* move ci to llama-completion

* add loading animation

* add --show-timings arg

* add /read command, improve LOG_ERR

* add args for speculative decoding, enable show timings by default

* add arg --image and --audio

* fix windows build

* support reasoning_content

* fix llama2c workflow

* color default is auto

* fix merge conflicts

* properly fix color problem

Co-authored-by: bandoti <bandoti@users.noreply.github.com>

* better loading spinner

* make sure to clean color on force-exit

* also clear input files on "/clear"

* simplify common_log_flush

* add warning in mtmd-cli

* implement console writter

* fix data race

* add attribute

* fix llama-completion and mtmd-cli

* add some notes about console::log

* fix compilation

---------

Co-authored-by: bandoti <bandoti@users.noreply.github.com>
2025-12-10 15:28:59 +01:00
ixgbe 79d61896d3
ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support (#17784)
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>

* cmake: enable RISC-V zihintpause extension for Spacemit builds

* readme : add ZIHINTPAUSE support for RISC-V

---------

Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-12-08 10:41:34 +02:00
Vishal Singh 017761daf5
ggml-zendnn : add ZenDNN backend for AMD CPUs (#17690)
* ggml-zennn: add ZenDNN backend support

* ggml-zendnn : address ZenDNN backend review fixes and suggestions

* docs : apply blockquote syntax to ZenDNN docs

---------

Co-authored-by: Manoj Kumar <mkumar@zettabolt.com>
2025-12-07 00:13:33 +08:00
Xuan-Son Nguyen ec18edfcba
server: introduce API for serving / loading / unloading multiple models (#17470)
* server: add model management and proxy

* fix compile error

* does this fix windows?

* fix windows build

* use subprocess.h, better logging

* add test

* fix windows

* feat: Model/Router server architecture WIP

* more stable

* fix unsafe pointer

* also allow terminate loading model

* add is_active()

* refactor: Architecture improvements

* tmp apply upstream fix

* address most problems

* address thread safety issue

* address review comment

* add docs (first version)

* address review comment

* feat: Improved UX for model information, modality interactions etc

* chore: update webui build output

* refactor: Use only the message data `model` property for displaying model used info

* chore: update webui build output

* add --models-dir param

* feat: New Model Selection UX WIP

* chore: update webui build output

* feat: Add auto-mic setting

* feat: Attachments UX improvements

* implement LRU

* remove default model path

* better --models-dir

* add env for args

* address review comments

* fix compile

* refactor: Chat Form Submit component

* ad endpoint docs

* Merge remote-tracking branch 'webui/allozaur/server_model_management_v1_2' into xsn/server_model_maagement_v1_2

Co-authored-by: Aleksander <aleksander.grygier@gmail.com>

* feat: Add copy to clipboard to model name in model info dialog

* feat: Model unavailable UI state for model selector

* feat: Chat Form Actions UI logic improvements

* feat: Auto-select model from last assistant response

* chore: update webui build output

* expose args and exit_code in API

* add note

* support extra_args on loading model

* allow reusing args if auto_load

* typo docs

* oai-compat /models endpoint

* cleaner

* address review comments

* feat: Use `model` property for displaying the `repo/model-name` naming format

* refactor: Attachments data

* chore: update webui build output

* refactor: Enum imports

* feat: Improve Model Selector responsiveness

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* refactor: Formatters

* chore: update webui build output

* refactor: Copy To Clipboard Icon component

* chore: update webui build output

* refactor: Cleanup

* chore: update webui build output

* refactor: UI badges

* chore: update webui build output

* refactor: Cleanup

* refactor: Cleanup

* chore: update webui build output

* add --models-allow-extra-args for security

* nits

* add stdin_file

* fix merge

* fix: Retrieve lost setting after resolving merge conflict

* refactor: DatabaseStore -> DatabaseService

* refactor: Database, Conversations & Chat services + stores architecture improvements (WIP)

* refactor: Remove redundant settings

* refactor: Multi-model business logic WIP

* chore: update webui build output

* feat: Switching models logic for ChatForm or when regenerating messges + modality detection logic

* chore: update webui build output

* fix: Add `untrack` inside chat processing info data logic to prevent infinite effect

* fix: Regenerate

* feat: Remove redundant settigns + rearrange

* fix: Audio attachments

* refactor: Icons

* chore: update webui build output

* feat: Model management and selection features WIP

* chore: update webui build output

* refactor: Improve server properties management

* refactor: Icons

* chore: update webui build output

* feat: Improve model loading/unloading status updates

* chore: update webui build output

* refactor: Improve API header management via utility functions

* remove support for extra args

* set hf_repo/docker_repo as model alias when posible

* refactor: Remove ConversationsService

* refactor: Chat requests abort handling

* refactor: Server store

* tmp webui build

* refactor: Model modality handling

* chore: update webui build output

* refactor: Processing state reactivity

* fix: UI

* refactor: Services/Stores syntax + logic improvements

Refactors components to access stores directly instead of using exported getter functions.

This change centralizes store access and logic, simplifying component code and improving maintainability by reducing the number of exported functions and promoting direct store interaction.

Removes exported getter functions from `chat.svelte.ts`, `conversations.svelte.ts`, `models.svelte.ts` and `settings.svelte.ts`.

* refactor: Architecture cleanup

* feat: Improve statistic badges

* feat: Condition available models based on modality + better model loading strategy & UX

* docs: Architecture documentation

* feat: Update logic for PDF as Image

* add TODO for http client

* refactor: Enhance model info and attachment handling

* chore: update webui build output

* refactor: Components naming

* chore: update webui build output

* refactor: Cleanup

* refactor: DRY `getAttachmentDisplayItems` function + fix UI

* chore: update webui build output

* fix: Modality detection improvement for text-based PDF attachments

* refactor: Cleanup

* docs: Add info comment

* refactor: Cleanup

* re

* refactor: Cleanup

* refactor: Cleanup

* feat: Attachment logic & UI improvements

* refactor: Constants

* feat: Improve UI sidebar background color

* chore: update webui build output

* refactor: Utils imports + move types to `app.d.ts`

* test: Fix Storybook mocks

* chore: update webui build output

* test: Update Chat Form UI tests

* refactor: Tooltip Provider from core layout

* refactor: Tests to separate location

* decouple server_models from server_routes

* test: Move demo test  to tests/server

* refactor: Remove redundant method

* chore: update webui build output

* also route anthropic endpoints

* fix duplicated arg

* fix invalid ptr to shutdown_handler

* server : minor

* rm unused fn

* add ?autoload=true|false query param

* refactor: Remove redundant code

* docs: Update README documentations + architecture & data flow diagrams

* fix: Disable autoload on calling server props for the model

* chore: update webui build output

* fix ubuntu build

* fix: Model status reactivity

* fix: Modality detection for MODEL mode

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-12-01 19:41:04 +01:00
Daniel Han dd0f321941
readme : add Unsloth exporting to GGUF in tools (#17411) 2025-11-20 20:07:36 +01:00
ixgbe 307772fcda
readme : add RVV,ZVFH,ZFH,ZICBOP support for RISC-V (#17259)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-14 09:12:56 +02:00
Georgi Gerganov afd353246d
readme : update hot topics (#17002) 2025-11-04 17:21:31 +02:00
amirai21 8d8862829c
docs : add Jamba to Text-only models list (#16778) 2025-10-26 13:01:20 +01:00
Max Krasnyansky 63d2fc46e1
Add experimental ggml-hexagon backend for the Hexagon NPU (#16547)
* model: add support for extra bufs for all devices

* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU

This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.

Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX

**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.

Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>

* hexagon: fix format checker errors

* hexagon: update readme and cmake presets

* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions

* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input

* hexagon: move ADB helper scripts into scripts/snapdragon/adb

* hexagon: replace all f/printfs with GGML_LOG_...

* readme: add hexagon to the list supported backends

* hexagon: stack malmuts with quantized inputs only

* hexagon: add TODO for fixing issues in hexagon_graph_optimize

* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC

* scripts: fix lint errors

* scripts: update qdc pytest script to make linter happy

* hexagon: add reduce sum in fp32

* hexagon: reduce number of vector stores in matmul output

* hexagon: remove the need for vdelta in reduce-multiply-x8

* hexagon: consistent use of reduce_sum_fp32 for row_sums

* hexagon: some more matmul optimizations and comments

Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.

* hexagon: update cmake presets

* hexagon: add OPMASK support for run-bench.sh wrapper

* hexagon: update to use GGML_BACKEND_API

* hexagon: remove unused logic for setting tensor flags for the views

* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors

Same asserts as the CPU backend.

* hexagon: use cpy_tensor slow path for non-host buffers

* hexagon: error checks in the buffer allocator

* cmake: move include(extProj) under ggml-hexagon

* hexagon: don't forget to delete the backend on free

* hexagon: set/get_tensor size assert apply only to quantized tensors

* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now

GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.

* docs: typos in hexagon developer docs (libggm-...)

* hexagon: overhaul error handling in the session/device allocation

this should handle all failure paths in the session allocation.

* hexagon: update cmake presets to enable fp16 vectors

* hexagon: remove unused time_usec function

* hexagon: don't forget to release buffer contexts

* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)

* hexagon: remove custom can_repeat function and use ggml_can_repeat

---------

Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
2025-10-22 13:47:09 -07:00
Sigbjørn Skjæret 84bf3c6778
model : add BailingMoeV2 support (#16063)
* add BailingMoeV2 support

* update llm types

* undo

* undo

* update llm types

* add model collection link

* update

* almost working

* correct group selection and rename n_group_exp

* avoid large top_k and use argmax instead for now

if we had something like argmax2 that would be equivalent, but this works fine until then

* poke

* skip group selection when there are no tokens

* fix 1T conversion

* hopefully fixed expert group selection

third time's the charm?

* make expert group selection generally available

The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.

* allow n_expert_groups to be 1 (Kimi K2)

* address review suggestions
2025-10-20 21:38:20 +02:00
Ron Evans 72d53e6c4d
readme: update bindings (#16651)
Signed-off-by: deadprogram <ron@hybridgroup.com>
2025-10-20 11:20:04 +03:00
rtaluyev 27052978e4
readme : update bindings (#16144)
Link to Java JNA bindings to llama.cpp native libraries
2025-09-25 18:20:34 +03:00
Aaron Teo 264f1b5187
zdnn: refactor codebase + add docs (#16178)
* zdnn: initial matmul refactor

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm static from funcs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: update ggml-zdnn.h

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: change header files to hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: switch to common.hpp

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: move mulmat forward around

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: rm inline from utils

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* ggml-zdnn: code cleanup

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* docs: add zDNN docs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-09-23 14:53:05 +08:00
Georgi Gerganov 5c6106a696
contrib : update roles (#16113)
* contrib : update roles

* contrib : merge PR sections + add link to CI instructions

Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
2025-09-22 10:58:02 +03:00
Jie Fu (傅杰) 4795c91c32
docs : add Hunyuan to models section (#15707)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-01 10:34:59 +03:00
Tarek Dakhran e288693669
readme : model : mtdm : lfm2 improvements (#15476)
* Support untied embeddings

* Increase number of image tokens to 1024

* Add LFM2-VL to readme

* Actually use untied embeddings
2025-08-22 09:29:08 +02:00
Georgi Gerganov 3007baf201
readme : update hot topics (#15397) 2025-08-18 18:11:44 +03:00
Georgi Gerganov 1a01899b61
readme : update hot topics (#15315) 2025-08-14 17:16:03 +03:00
Zagaj 27093afe78
readme : update infra list (#15234) 2025-08-11 15:27:54 +03:00
Georgi Gerganov be42642581
readme : update hot topics (#15097) 2025-08-05 20:19:33 +03:00
Radoslav Gerganov 2ba1333b35
docs : fix backends table in README.md (#14796) 2025-07-21 14:03:49 +02:00
Aman Gupta 2be60cbc27
docs : fix link for tools/perplexity in README.md (#14780) 2025-07-20 20:13:47 +02:00
Reese Levine 21c021745d
ggml: Add initial WebGPU backend (#14521)
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults

* Initialize webgpu device

* Making progress on setting up the backend

* Finish more boilerplate/utility functions

* Organize file and work on alloc buffer

* Add webgpu_context to prepare for actually running some shaders

* Work on memset and add shader loading

* Work on memset polyfill

* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it

* Implement get_tensor and buffer_clear

* Finish rest of setup

* Start work on compute graph

* Basic mat mul working

* Work on emscripten build

* Basic WebGPU backend instructions

* Use EMSCRIPTEN flag

* Work on passing ci, implement 4d tensor multiplication

* Pass thread safety test

* Implement permuting for mul_mat and cpy

* minor cleanups

* Address feedback

* Remove division by type size in cpy op

* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends

* Fix name

* Fix macos dawn prefix path
2025-07-16 18:18:51 +03:00
Tarek Dakhran 67eade1bf9
docs : add LFM2 to models section (#14650)
* readme : add LFM2 to models section

* fix copy paste...
2025-07-12 19:07:08 +02:00
Georgi Gerganov aaa088d87f
readme : add hot PRs (#14636)
* readme : add hot PRs

* cont

* readme : update title

* readme : hot PRs links

* cont
2025-07-11 16:07:55 +03:00
Georgi Gerganov b7cc7745e3
readme : remove survey link (#14168) 2025-06-13 11:55:44 +03:00
Georgi Gerganov a681b4ba83
readme : remove project status link (#14149) 2025-06-12 14:43:09 +03:00
Olexandr88 d01d112abb
readme : add badge (#13938) 2025-06-05 10:50:55 +03:00
Xuan-Son Nguyen ea1431b0fa
docs : add "Quick start" section for new users (#13862)
* docs : add "Quick start" section for non-technical users

* rm flox

* Update README.md
2025-06-03 13:09:36 +02:00
ddh0 8726392d3d
readme : update bindings (#13950) 2025-06-01 11:44:30 +03:00
Xuan-Son Nguyen 797990c4bc
mtmd : add ultravox audio input (#13623)
* convert ok, load ok

* warmup ok

* test

* still does not work?

* fix padding

* temporary give up

* fix merge conflict

* build_ultravox()

* rm test

* fix merge conflict

* add necessary mtmd APIs

* first working version (only 4s of audio)

* will this monster compile?

* fix compile

* please compile

* fPIC

* fix windows

* various fixes

* clean up audio_helpers

* fix conversion

* add some debug stuff

* long audio input ok

* adapt the api

* add --audio arg

* final touch UX

* add miniaudio to readme

* fix typo

* refactor kv metadata

* mtmd_default_marker()
2025-05-22 20:42:48 +02:00
R0CKSTAR 33983057d0
musa: Upgrade MUSA SDK version to rc4.0.1 and use mudnn::Unary::IDENTITY op to accelerate D2D memory copy (#13647)
* musa: fix build warning (unused parameter)

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: upgrade MUSA SDK version to rc4.0.1

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

* Update ggml/src/ggml-cuda/cpy.cu

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-05-21 09:58:49 +08:00
Xuan-Son Nguyen 06c1e4abc1
readme : add list of dependencies and their license (#13591) 2025-05-16 20:04:18 +02:00
Xuan-Son Nguyen 33eff40240
server : vision support via libmtmd (#12898)
* server : (experimental) vision support via libmtmd

* mtmd : add more api around mtmd_image_tokens

* mtmd : add more api around mtmd_image_tokens

* mtmd : ability to calc image hash

* shared_ptr for mtmd_image_tokens

* move hash to user-define ID (fixed)

* abstract out the batch management

* small fix

* refactor logic adding tokens to batch

* implement hashing image

* use FNV hash, now hash bitmap instead of file data

* allow decoding image embedding to be split into batches

* rm whitespace

* disable some features when mtmd is on

* fix --no-mmproj-offload

* mtmd_context_params no timings

* refactor server_inp to server_tokens

* fix the failing test case

* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* improve server_input struct

* clip :  fix confused naming ffn_up and ffn_down

* rm ffn_i/o/g naming

* rename n_embd, n_ff

* small fix

* no check n_ff

* fix detokenize

* add const to various places

* add warning about breaking changes

* add c api

* helper: use mtmd_image_tokens_get_n_pos

* fix ctx_shift

* fix name shadowing

* more strict condition

* support remote image_url

* remote image_url log

* add CI test

* do not log base64

* add "has_multimodal" to /props

* remove dangling image

* speculative: use slot.cache_tokens.insert

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* rm can_be_detokenized

* on prmpt processing done, assert cache_tokens.size

* handle_completions_impl returns void

* adapt the new web ui

* update docs and hot topics

* rm assert

* small fix (2)

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-09 19:29:37 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Xuan-Son Nguyen 00e3e5a194
mtmd : add qwen2vl and qwen2.5vl (#13141)
* llava : add clip_n_output_tokens, deprecate clip_n_patches

* mtmd : add qwen2vl and qwen2.5vl

* decode_embd_batch::set_position_...

* working version

* deprecate llama-qwen2vl-cli

* correct order W, H of clip_embd_nbytes_by_img

* edit existing line in hot topics
2025-04-29 11:47:04 +02:00
Georgi Gerganov d0a417f3c7
readme : update hot topics (#13150) 2025-04-28 12:10:18 +03:00
Xuan-Son Nguyen 84a9bf2fc2
mtmd : merge llava, gemma3 and minicpmv CLI into single `llama-mtmd-cli` (#13012)
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`

* support for minicpmv

* remove cpp files of llava and minicpmv

* update hot topics

* mtmd : add not supported msg for qwen2vl

* Update examples/llava/mtmd.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-04-21 15:32:58 +02:00