* Update build doc
* Add cgraph tensor output name to OV op name
* Update openvino build instructions
* Add initial NPU support
* draft NPU support version 2: prefill + kvcache
* NPU support version 2: prefill + kvcache
* Change due to ggml cgraph changes, not correct yet
* Change due to ggml cgraph changes, llama-3.2 CPU work
* Add AMD64 to CMakeLists
* Change due to ggml cgraph changes, all device work
* Refactor: clean, fix warning
* Update clang-format
* Statful transformation for CPU GPU
* Add SwiGLU
* Fuse to SDPA
* Replace Concat with Broadcast in MulMat for GQA
* Pull out indices creation for kv cache update
* Refactor: remove past_token_len from extra_inputs
* Fix Phi3 SwiGLU and SoftMax
* Pull out sin cos from rope
* Reduce memory: free ov weights node after graph conversion
* Fix CPY due to cgraph change
* Added OpenVINO CI/CD. Updated docs
* Fix llama-cli
* Fix Phi3 ROPE; Add test-backend-ops
* Fix NPU
* Fix llama-bench; Clang-format
* Fix llama-perplexity
* temp. changes for mark decomp
* matmul in fp32
* mulmat input conversion fix
* mulmat type conversion update
* add mark decomp pass
* Revert changes in fuse_to_sdpa
* Update build.md
* Fix test-backend-ops
* Skip test-thread-safety; Run ctest only in ci/run.sh
* Use CiD for NPU
* Optimize tensor conversion, improve TTFT
* Support op SET_ROWS
* Fix NPU
* Remove CPY
* Fix test-backend-ops
* Minor updates for raising PR
* Perf: RMS fused to OV internal RMS op
* Fix after rebasing
- Layout of cache k and cache v are unified: [seq, n_head, head_size]
- Add CPY and FLASH_ATTN_EXT, flash attn is not used yet
- Skip test-backend-ops due to flash attn test crash
- Add mutex around graph conversion to avoid test-thread-safety fali in the future
- Update NPU config
- Update GPU config to disable SDPA opt to make phi-3 run
* Change openvino device_type to GPU; Enable flash_attn
* Update supports_buft and supports_op for quantized models
* Add quant weight conversion functions from genai gguf reader
* Quant models run with accuracy issue
* Fix accuracy: disable cpu_repack
* Fix CI; Disable test-backend-ops
* Fix Q4_1
* Fix test-backend-ops: Treat quantized tensors as weights
* Add NPU Q4_0 support
* NPU perf: eliminate zp
* Dequantize q4_1 q4_k q6_k for NPU
* Add custom quant type: q8_1_c, q4_0_128
* Set m_is_static=false as default in decoder
* Simpilfy translation of get_rows
* Fix after rebasing
* Improve debug util; Eliminate nop ReshapeReshape
* STYLE: make get_types_to_requant a function
* Support BF16 model
* Fix NPU compile
* WA for npu 1st token acc issue
* Apply EliminateZP only for npu
* Add GeGLU
* Fix Hunyuan
* Support iSWA
* Fix NPU accuracy
* Fix ROPE accuracy when freq_scale != 1
* Minor: not add attention_size_swa for non-swa model
* Minor refactor
* Add Q5_K to support phi-3-q4_k_m
* Requantize Q6_K (gs16) to gs32 on GPU
* Fix after rebasing
* Always apply Eliminate_ZP to fix GPU compile issue on some platforms
* kvcachefusion support
* env variable GGML_OPENVINO_DISABLE_SDPA_OPTIMIZATION added
* Fix for Phi3
* Fix llama-cli (need to run with --no-warmup)
* Fix add_sliced_mask; Revert mulmat, softmax; Remove input attention_size, iSWA model not working
* fix after rebasing
* Fix llama-3-8b and phi3-mini q4_0 NPU
* Update to OV-2025.3 and CMakeLists.txt
* Add OV CI cache
* Apply CISC review and update CI to OV2025.3
* Update CI to run OV dep install before build
* Update OV dockerfile to use OV2025.3 and update build docs
* Style: use switch in supports_ops
* Style: middle ptr and ref align, omit optional struct keyword
* NPU Unify PD (#14)
* Stateless. Fix llama-cli llama-server
* Simplify broadcast op in attention
* Replace get_output_tensor+memcpy with set_output_tensor
* NPU unify PD. Unify dynamic and static dims
* Clean placeholders in ggml-openvino.cpp
* NPU unify PD (handled internally)
* change graph to 4d, support multi sequences
* Fix llama-bench
* Fix NPU
* Update ggml-decoder.cpp
Hitting error while compiling on windows:
error C3861: 'unsetenv': identifier not found
Reason: unsetenv() is a POSIX function; it doesn’t exist on Windows. Visual Studio (MSVC) won’t recognize it.
Proposed fix: Use _putenv_s() (Windows equivalent)
This is supported by MSVC and achieves the same effect: it removes the environment variable from the process environment.
This keeps cross-platform compatibility.
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Update ggml-decoder.cpp
* Remove the second decoder for node. Moving the function into the model decoder
* Fix error for naive
* NPU prefill chunking
* NPU fix llama-bench
* fallback naive run with accuracy issue
* NPU support llma-perplexity -b 512 --no-warmup
* Refactor: split ov_graph_compute for dynamic and static
* remove unused API GgmlOvDecoder::get_output_stride(const std::string & name)
* minor update due to ov 2025.4
* remove unused API GgmlOvDecoder::get_output_names()
* remove unused API get_output_shape(const std::string & name)
* Modified API GgmlOvDecoder::get_output_type(const std::string & name)
* Removed API GgmlOvDecoder::get_output_op_params(const std::string & name)
* Removed API get_output_ggml_tensor(const std::string & name)
* Removed API m_outputs
* Removed m_output_names
* Removed API GgmlOvDecoder::get_input_names()
* Removed API GgmlOvDecoder::get_input_stride(const std::string& name)
* Removed API get_input_type
* Removed API get_input_type
* Removed API GgmlOvDecoder::get_input_shape(const std::string & name)
* Removed API GgmlOvDecoder::get_input_op_params(const std::string & name)
* Fix error for decoder cache
* Reuse cached decoder
* GPU remove Q6_K requantization
* NPU fix wrong model output shape
* NPU fix q4 perf regression
* Remove unused variable nodes
* Fix decoder can_reuse for llama-bench
* Update build.md for Windows
* backend buffer: allocate on host
* Use shared_buffer for GPU NPU; Refactor
* Add ov_backend_host_buffer; Use cached remote context
* Put kvcache on GPU
* Use ggml_aligned_malloc
* only use remote tensor for kvcache
* only use remote tensor for kvcache for GPU
* FIX: use remote tensor from singleton
* Update build.md to include OpenCL
* NPU always requant to q4_0_128
* Optimize symmetric quant weight extraction: use single zp
* Use Q8_0_C in token embd, lm_head, and for 5 and 6 bits quant
* Update build.md
* Support -ctk f32
* Initial stateful graph support
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* code cleanup
* npu perf fix
* requant to f16 for Q6 embed on NPU
* Update ggml/src/ggml-openvino/ggml-decoder.cpp
* Update ggml/src/ggml-openvino/ggml-openvino-extra.cpp
* Create OPENVINO.md in llama.cpp backend docs
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update build.md
* Update OPENVINO.md
* Update OPENVINO.md
* Update OPENVINO.md
* kq_mask naming fix
* Syntax correction for workflows build file
* Change ov backend buffer is_host to false
* Fix llama-bench -p -n where p<=256
* Fix --direct-io 0
* Don't put kvcache on GPU in stateful mode
* Remove hardcode names
* Fix stateful shapes
* Simplification for stateful and update output shape processing
* Remove hardcode names
* Avoid re-compilation in llama-bench
* Extract zp directly instead of bias
* Refactor weight tensor processing
* create_weight_node accept non-ov backend buffer
* remove changes in llama-graph.cpp
* stateful masking fix (#38)
Fix for stateful accuracy issues and cl_out_of_resources error in stateful GPU with larger context sizes.
* Fix test-backend-ops crash glu, get_rows, scale, rms_norm, add
* hardcoded name handling for rope_freqs.weight
* Suppress logging and add error handling to allow test-backend-ops to complete
* Fix MUL_MAT with broadcast; Add unsupported MUL_MAT FLASH_ATTN cases
* Use bias instead of zp in test-backend-ops
* Update OV in CI, Add OV CI Tests in GH Actions
* Temp fix for multithreading bug
* Update OV CI, fix review suggestions.
* fix editorconfig-checker, update docs
* Fix tabs to spaces for editorconfig-checker
* fix editorconfig-checker
* Update docs
* updated model link to be GGUF model links
* Remove GGML_CPU_REPACK=OFF
* Skip permuted ADD and MUL
* Removed static variables from utils.cpp
* Removed initializing non-existing variable
* Remove unused structs
* Fix test-backend-ops for OV GPU
* unify api calling
* Update utils.cpp
* When the dim is dynamic, throw an error, need to is stastic forst
* Add interface compute_model_outputs(), which get the model output through computing the node use count & status in the cgraph to avoid the flag using
* No need to return
* Fix test-backend-ops for OV GPU LNL
* Fix test-thread-safety
* use the shape from infer request of output tensor create to avoid issue
* fix dynamic output shape issue
* fix issue for the unused node in tests
* Remove unused lock
* Add comment
* Update openvino docs
* update to OV release version 2026.0
* add ci ov-gpu self hosted runner
* fix editorconfig
* Fix perplexity
* Rewrite the model inputs finding mechanism (#54)
* Rewrite the model inputs finding logistic
* Put stateful shape handle in get input shape
* Put the iteration logistic in func
* Added ggml-ci-intel-openvino-gpu and doc update
* .hpp files converted to .h
* fix ggml-ci-x64-intel-openvino-gpu
* Fix for stateful execution bug in llama-bench
* Minor updates after stateful llama-bench fix
* Update ggml/src/ggml-openvino/utils.cpp
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
* Remove multiple get_shape calls
* Bring back mutex into compute
* Fix VIEW op, which slice the input node
* Added token_len_per_seq existence check before slicing masks and moved node retrieval inside guarded block to prevent missing-key access
* Temp. fix for test requant errors
* Update to OV ggml-ci to low-perf
* ci : temporary disable "test-llama-archs"
* ci : cache v4 -> v5, checkout v4 -> v6, fix runner tag
* docs : update url
* Fix OV link in docker and Update docs
---------
Co-authored-by: Ravi Panchumarthy <ravi.panchumarthy@intel.com>
Co-authored-by: Cavus Mustafa <mustafa.cavus@intel.com>
Co-authored-by: Arshath <arshath.ramzan@intel.com>
Co-authored-by: XuejunZhai <Xuejun.Zhai@intel.com>
Co-authored-by: Yamini Nimmagadda <yamini.nimmagadda@intel.com>
Co-authored-by: Xuejun Zhai <Xuejun.Zhai@intel>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* hexagon: add fp16 support for binary ops: add,sub,mul,div
* hexagon: fix test-backend-ops failures for fp16 binary ops on older arches (<v79)
* hexagon: decide on n_threads (aka n_jobs) early to avoid overallocating scratchpad
* snapdragon: fix readme link
---------
Co-authored-by: Max Krasnyansky <maxk@qti.qualcomm.com>
* ggml-virtgpu: add backend documentation
Assisted-by-AI: Claude Code
* CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget
* README: add the link to docs/backend/GGML-VirtGPU/ggml-virt.md
* docs/ggml-virt: add link to testing + configuration
* Revert "CODEOWNERS: add /docs/backend/GGML-VirtGPU/ -> kpouget"
This reverts commit 8ece8e72e2.
* drop the ggml- prefix
* s/ggerganov/ggml-org
* Relocate VirtGPU.md
* reorganize the text
* turn turn the ascii diagram into a mermaid
* README.md: update the link to the main doc
* UI: implement basic UI components
* util: implement performance monitor; wrap it with a viewmodel
* util: implement user preferences utility
* UI: implement core flow's screens
* UI: add a new MainActivity; update manifest
* [WIP] DI: implement simple local vm factory provider
* UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark
* UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark
* UI: split a nested parent settings screen into separate child settings screens
* UI: polish system prompt setup UI
* Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject
* DB: setup Room database
* data: introduce repo for System Prompt; flow data from Room to VM
* bugfix: properly handle user's quitting conversation screen while tokens in generation
* UI: rename `ModeSelection` to `ModelLoading` for better clarity
* UI: update app name to be more Arm
* UI: polish conversation screen
* data: code polish
* UI: code polish
* bugfix: handle user quitting on model loading
* UI: locks user in alert dialog when model is unloading
* vm: replace token metrics stubs with actual implementation
* UI: refactor top app bars
* nit: combine temperatureMetrics and useFahrenheit
* DI: introduce Hilt plugin + processor + lib dependencies
* DI: make app Hilt injectable
* DI: make viewmodels Hilt injectable
* DI: replace manual DI with Hilt DI
* UI: optimize AppContent's composing
* bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController
* UI: navigation with more natural animated transitions
* DI: Optimize AppModule
* Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule
* UI: polish UI for ModelsManagementScreen; inject ModelsManagementVieModel
* DI: abstract the protocol of SystemPromptRepository; update AppModule
* data: [WIP] prepare for ModelRepository refactor & impl
* data: introduce Model entity and DAO; update DI module
* UI: replace Models Management screen's stubbing with instrumentation
* UI: polish sort order menu
* data: import local model with file picker
* bugfix: use List instead of Collection for ModelDao's deletion
* data: add a util file for extracting file name & size and model metadata
* UI: enrich ModelManagementState; extract filename to show correct importing UI
* UI: implement multiple models deletion; update Models Management screen
* UI: handle back navigation when user is in multi-selection mode
* util: extract file size formatting into ModelUtils
* UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog
* UI: extract a shared ModelCard component
* UI: replace model selection screen's data stubbing; add empty view
* nit: tidy SystemPromptViewModel
* Util: split FileUtils from ModelUtils; extract copy methods into FileUtils
* data: pass through getModelById from ModelDao into ModelRepository
* core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine
* vm: split mono MainViewModel into separate individual ViewModels
* vm: merge SystemPromptViewModel into ModelLoadingViewModel
* core: break down InferenceManager due to Interface Segregation Principle
* UI: show model card in Model Loading screen
* UI: show model card in Conversation screen
* UI: unify Model Card components
* core: swap in LLamaAndroid and mark stub engine for testing only
* data: allow canceling the ongoing model import
* UI: update UI ongoing model import's cancellation
* LLama: update engine state after handling the cancellation of sendUserPrompt
* VM: handle the cancellation of ongoing token generation
* LLama: refactor loadModel by splitting the system prompt setting into a separate method
* feature: check for available space before copying local model
* UI: centralize the AppScaffold and modularize its configs
* UI: refactor BottomBarConfig.ModelsManagement APIs
* UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig
* UI: replace ugly optional as casts in AppScaffold with extension functions
* UI: fix the typo `totalGb` in `StorageMetrics`
* UI: remove code duplication in sort menu
* LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages
* UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent
* UI: implement BenchmarkScreen's individual back handling
* LLama: add a new Initializing state; ; add two extension properties; rename LibraryLoaded state to Initialized
* UI: Introduce an abstract ViewModel to handle additional model unloading logics
* UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt
* UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen
* UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen
* nit: extract app name into a constant value; remove unused onBackPressed callbacks
* UI: update AppContent to pass in correct navigation callbacks
* nit: polish ModelLoadingScreen UI
* core: throw Exception instead of returning null if model fails to load
* navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens
* gguf: add GGUF metadata data holder and its corresponding extractor implementation
* DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library
* GGUF: make GgufMetadata serializable in order to be compatible with Room
* nit: refactor data.local package structure
* nit: rename lastUsed field to dateLastUsed; add dateAdded field
* UI: refactor ModelCard UI to show GGUF metadata
* UI: update ModelSelectionScreen with a preselect mechanism
* UI: polish model card
* nit: allow deselect model on Model Selection screen
* nit: revert accidental committing of debug code
* UI: polish ModelLoading screen
* util: extract formatting helper functions from FileUtils into a new FormatUtils
* UI: polish model cards on Benchmark and Conversation screens to show model loading metrics
* UI: show a Snack bar to warn user that system prompt is not always supported
* UI: handle back press on Model Selection screen
* UI: finally support theme modes; remove hardcoded color schemes, default to dynamic color scheme implementation
* feature: support searching on Model Selection screen
* nit: move scaffold related UI components into a separate package
* UI: extract InfoView out into a separate file for reusability
* data: move Model related actions (query, filter, sort) into ModelInfo file
* UI: animate FAB on model preselection states
* feature: support filtering in Model Management screen
* ui: show empty models info in Model Management screen
* ui: add filter off icon to "Clear filters" menu item
* [WIP] ui: polish Benchmark screen; implement its bottom app bar
* ui: polish Benchmark screen; implement its bottom app bar's rerun and share
* nit: disable mode selection's radio buttons when loading model
* feature: implement Conversation screen's bottom app bar
* pkg: restructure BottomAppBars into separate files in a child package
* pkg: restructure TopBarApps into separate files in a child package
* pkg: restructure system metrics into a separate file
* UI: polish Conversation screen
* data: update system prompt presets
* UI: allow hide or show model card on Conversation & Benchmark screens; fix message arrangement
* data: update & enhance system prompt presets
* deps: introduce Retrofit2
* data: implement HuggingFace data model, data source with Retrofit API
* data: update Model data repository to support fetching HuggingFace models
* [WIP] UI: replace the HuggingFace stub in Model Management screen with actual API call
* UI: map language codes into country Emojis
* ui: add "clear results" action to Benchmark screen
* nit: print current pp & tg in llama-bench
* UI: disable landscape mode; prevent duplicated benchmark running
* llama: migrate C/CXX flags into CMakeList
* [WIP] llama: ABI split builds five .so artifacts.
However, all .so are performing on SVE level
* [WIP] llama: ABI split where five tiers are built sequentially.
* [WIP] llama: disable OpenMP in ABI split since most SoCs are big.LITTLE
* [WIP] llama: enable KleidiAI and disable tier 4 due to `+sve+sve2` bug caused by `ggml_add_cpu_backend_variant_impl` as explained below
```CMake
if (NOT SME_ENABLED MATCHES -1)
...
set(PRIVATE_ARCH_FLAGS "-fno-tree-vectorize;${PRIVATE_ARCH_FLAGS}+sve+sve2")
...
```
* core: add Google's cpu_features as a submodule
* core: implement cpu_detector native lib
* core: swap out hardcoded LlamaAndroid library loading
* core: add back OpenMP due to huge perf loss on TG128
* misc: reorg the pkg structure
* misc: rename LlamaAndroid related class to InferenceEngine prefixes
* [WIP] lib: move GgufMetadata into the lib submodule
* lib: expose GgufMetadataReader as interface only
* lib: replace the naive & plain SharedPreferences with DataStore implementation
* lib: hide the internal implementations, only expose a facade and interfaces
* lib: expose Arm features
* di: add a stub TierDetection; provide both actual impl and stub in AppModule
* UI: add visualizer UI for Arm features
* misc: UI polish
* lib: refactored InferenceEngineLoader; added a `NONE` Llama Tier
* UI: support `NONE` Llama Tier in general settings
* lib: optimize engine loader; always perform a fresh detection when cache is null
* remote: add HuggingFaceModelDetails data class
* remote: refine HuggingFaceModel data class
* nit: remove `trendingScore` field from HuggingFace model entities, weird...
* remote: refactor HuggingFaceApiService; implement download feature in HuggingFaceRemoteDataSource
* remote: fix the incorrect parse of HuggingFace's inconsistent & weird JSON response
* UI: scaffold Models Management screen and view model
* UI: implement a dialog UI to show fetched HuggingFace models.
* UI: use a broadcast receiver to listen for download complete events and show local import dialog.
* data: handle network exceptions elegantly
* pkg: restructure `data`'s packages
* data: extract local file info, copy and cleanup logics into LocalFileDataSource
* nit: minor UI patch; add missing comments
* bugfix: tapping "Home" in navigation drawer should simply close it without any navigation action.
* UI: improve autoscroll during token generation
* lib: tested on JFrog Artifactory for Maven publishing
* UI: show RAM warning if model too large
* UI: polish model management screen's error dialog
* util: add more items into the mapping table of ISO 639-1 language code to ISO 3166-1 country code
* llm: properly propagate error to UI upon failing to load selected model
* UI: avoid duplicated calculation of token metrics
* lib: read & validate the magic number from the picked source file before executing the import
* UI: add "Learn More" hyperlinks to Error dialog upon model import failures
* lib: refactor the GgufMetadataReader to take InputStream instead of absolute path as argument
* lib: fix the `SIMD` typo in Tier description
* core: verify model file path is readable
* lib: add UnsupportedArchitectureException for triaged error message
* util: split FormatUtils into multiple utils for better readability
* UI: change benchmark screen from raw markdown to table view
* bugfix: reset preselection upon running the preselected model
* misc: linter issue
* bugfix: fix the malfunctioning monitoring switch
* UI: update Arm features indicator; fix the broken hyperlinks
* UI: add quick action buttons to benchmark screen's result card
* UI: hide share fab after clearing all benchmark results
* UI: fix the model unload dialog message; elevate the model card and hide it by default on Conversation screen;
* UI: hide the stubbing actions in Conversation screen
* UI: add show/hide stats control to conversation screen's assistant message bubble; fix placeholder
* UI: add a info button to explain token metrics
* misc: remove the redundant `Companion` added due to refactoring
* UI: show corresponding system metrics detailed info upon tapping RAM / storage / temperature indicator
* UI: add info button to System Prompt switch; expand the model card by default
* UI: disable tag & language chips; add section headers to explain what they are
* misc: replace top bar indicator's spacer with padding
* UI: merge the Model Selection and Model Management into a unified Models screen
* UI: split the ModelsManagementViewModel from a unified ModelsViewModel due to huge complexity
* UI: add model loading in progress view; polish the empty model info view
* UI: polish the bottom bars and info view when no models found; show loading in progress while fetching models
* build: [BREAKING] bump the versions of libraries and plugins
* UI: fix the breaking build
* UI: add Tooltip on Import FAB for user onboarding
* UI: adds AppPreferences to track user onboarding status
* UI: tracks user's first success on importing a model
* data: add hand crafted rules to filter the models fetched from HuggingFace API
* UI: update app name & about; polish top bars' indicators & buttons
* UI: polish Hugging Face download dialog UI
* UX: implement onboarding tooltips for model import and onboarding
* misc: use sentence case for CTA button labels
* [WIP] UI: add Arm color palette from Philip.Watson3
* UI: address Rojin's UX feedbacks
* UI: address Rojin's UX feedbacks - part 2
* UI: update Arm color palette from Philip.Watson3
* data: make sure fetch preselected models in the same order of their IDs
* UI: fix UI issues in the generic settings screen and navigation drawer
* nit: address Rojin's feedbacks on model import message again
* nit: append `®` to all `Arm` labels
* UI: extract a reusable InfoAlertDialog
* core: support GGML_CPU_ALL_VARIANTS on Android!
* core: restructure Kleidi-Llama library
* core: organizing cmake arguments
* data: sort preselected models according to device's available RAM
* app: update adaptive + themed + legacy icons and app name
* UI: fix the font size auto scaling for ArmFeaturesVisualizer
* core: further improve the performance on native methods
* UI: minor color palette changes; emphasize the bottom bar FABs; fix Settings Screen menu item label
* UI: make more room for assistant message bubble's width
* UI: better usage of tertiary colors to highlight model cards but not for warnings
* UI: fix the layout issue on large font sizes
* lib: support x86-64 by dynamically set Arm related definitions
* lib: replace the factory pattern for deprecated tiered lib loading with single instance pattern
* llama: update the library name in JNI and CMake project
* llama: update the library's package name and namespace
* llama: update the app's package name and namespace
* app: bump ksp version
* app: remove deprecated SystemUIController from accompanist by migrating to EdgeToEdge
* app: extract AppContent from MainActivity to a separate file in ui package
* lib: add File version for GGUF Magic number verification
* lib: perform engine state check inclusively instead of exclusively
* lib: change `LlamaTier` to `ArmCpuTier`
* lib: remove kleidi-llama related namings
* cleanup: remove Arm AI Chat/Playground app source code; replace with the basic sample app from https://github.com/hanyin-arm/Arm-AI-Chat-Sample
Note: the full Google Play version of AI Chat app will be open will be open sourced in another repo soon, therefore didn't go through the trouble of pruning the history using `git filter-repo` here.
* [WIP] doc: update main and Android README docs; add self to code owners
* lib: revert System.load back to System.loadLibrary
* jni: introduce a logging util to filter different logging levels on different build types
* lib: enable app optimization
* doc: replace stub Google Play app URL with the actual link add screenshots; add my GitHub ID to maintainer list
* Remove cpu_features
* Fix linters issues in editorconfig-checker job
https://github.com/ggml-org/llama.cpp/actions/runs/19548770247/job/55974800633?pr=17413
* Remove unnecessary Android CMake flag
* purge include/cpu_features directory
---------
Co-authored-by: Han Yin <han.yin@arm.com>
* ggml-cpu: add ggml_thread_cpu_relax with Zihintpause support
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* cmake: enable RISC-V zihintpause extension for Spacemit builds
* readme : add ZIHINTPAUSE support for RISC-V
---------
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
* model: add support for extra bufs for all devices
* hexagon: add experimental ggml-hexagon backend for the Hexagon NPU
This commit introduces a new experimental backend `ggml-hexagon` with support for the Hexagon NPU.
Highlights:
- Supports Hexagon versions: v73, v75, v79, and v81
- Targets Android devices based on Snapdragon SoCs: Gen3, 8-Elite, and 8-Elite Gen5
- Supports Q4_0, Q8_0, MXFP4, and FP32 data types
- Implements core LLM ops: MUL_MAT/MUL_MAT_ID, ADD/SUB/MUL/ADD_ID, RMS_NORM, ROPE, GLU/SWIGLU, SOFTMAX
**Note:** This backend is experimental and may exhibit instability or limited performance across supported devices.
It is intended for early testing and feedback from llama.cpp/ggml developer and user community.
Co-Authored-By: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-Authored-By: Todor Boinovski <todorb@qti.qualcomm.com>
* hexagon: fix format checker errors
* hexagon: update readme and cmake presets
* ci: add android-ndk-build jobs that build plain ARM64 and Snapdragon versions
* hexagon: add simple graph optimizer for stacking MUL_MAT ops with the same input
* hexagon: move ADB helper scripts into scripts/snapdragon/adb
* hexagon: replace all f/printfs with GGML_LOG_...
* readme: add hexagon to the list supported backends
* hexagon: stack malmuts with quantized inputs only
* hexagon: add TODO for fixing issues in hexagon_graph_optimize
* hexagon: update to hex-sdk 6.4.0 and add scripts for running on QDC
* scripts: fix lint errors
* scripts: update qdc pytest script to make linter happy
* hexagon: add reduce sum in fp32
* hexagon: reduce number of vector stores in matmul output
* hexagon: remove the need for vdelta in reduce-multiply-x8
* hexagon: consistent use of reduce_sum_fp32 for row_sums
* hexagon: some more matmul optimizations and comments
Optimize cases where tensor dims are not multiple of 1024 (e.g in Qwen models).
We've handled those cases already but at a higher overhead.
* hexagon: update cmake presets
* hexagon: add OPMASK support for run-bench.sh wrapper
* hexagon: update to use GGML_BACKEND_API
* hexagon: remove unused logic for setting tensor flags for the views
* hexagon: add asserts to set/get_tensor to make sure we handle complete tensors
Same asserts as the CPU backend.
* hexagon: use cpy_tensor slow path for non-host buffers
* hexagon: error checks in the buffer allocator
* cmake: move include(extProj) under ggml-hexagon
* hexagon: don't forget to delete the backend on free
* hexagon: set/get_tensor size assert apply only to quantized tensors
* hexagon: reintroduce HEX_VERBOSE wrapper for GGML_LOG_DEBUG for now
GGML_LOG_DEBUG is always enabled for test-backend-ops and the output gets in the way.
Ideally we need a bit more finer log levels.
* docs: typos in hexagon developer docs (libggm-...)
* hexagon: overhaul error handling in the session/device allocation
this should handle all failure paths in the session allocation.
* hexagon: update cmake presets to enable fp16 vectors
* hexagon: remove unused time_usec function
* hexagon: don't forget to release buffer contexts
* hexagon: fixed indents in hvx-utils (missed clang-format auto-format failure)
* hexagon: remove custom can_repeat function and use ggml_can_repeat
---------
Co-authored-by: Rajdeep Ganguly <rganguly@qti.qualcomm.com>
Co-authored-by: Todor Boinovski <todorb@qti.qualcomm.com>
* add BailingMoeV2 support
* update llm types
* undo
* undo
* update llm types
* add model collection link
* update
* almost working
* correct group selection and rename n_group_exp
* avoid large top_k and use argmax instead for now
if we had something like argmax2 that would be equivalent, but this works fine until then
* poke
* skip group selection when there are no tokens
* fix 1T conversion
* hopefully fixed expert group selection
third time's the charm?
* make expert group selection generally available
The new LLaDA2Moe model uses this method too, make it generally available regardless of architecture.
* allow n_expert_groups to be 1 (Kimi K2)
* address review suggestions
* contrib : update roles
* contrib : merge PR sections + add link to CI instructions
Updated pull request guidelines for contributors and collaborators, and clarified merging practices for maintainers.
* Minimal setup of webgpu backend with dawn. Just prints out the adapter and segfaults
* Initialize webgpu device
* Making progress on setting up the backend
* Finish more boilerplate/utility functions
* Organize file and work on alloc buffer
* Add webgpu_context to prepare for actually running some shaders
* Work on memset and add shader loading
* Work on memset polyfill
* Implement set_tensor as webgpu WriteBuffer, remove host_buffer stubs since webgpu doesn't support it
* Implement get_tensor and buffer_clear
* Finish rest of setup
* Start work on compute graph
* Basic mat mul working
* Work on emscripten build
* Basic WebGPU backend instructions
* Use EMSCRIPTEN flag
* Work on passing ci, implement 4d tensor multiplication
* Pass thread safety test
* Implement permuting for mul_mat and cpy
* minor cleanups
* Address feedback
* Remove division by type size in cpy op
* Fix formatting and add github action workflows for vulkan and metal (m-series) webgpu backends
* Fix name
* Fix macos dawn prefix path
* convert ok, load ok
* warmup ok
* test
* still does not work?
* fix padding
* temporary give up
* fix merge conflict
* build_ultravox()
* rm test
* fix merge conflict
* add necessary mtmd APIs
* first working version (only 4s of audio)
* will this monster compile?
* fix compile
* please compile
* fPIC
* fix windows
* various fixes
* clean up audio_helpers
* fix conversion
* add some debug stuff
* long audio input ok
* adapt the api
* add --audio arg
* final touch UX
* add miniaudio to readme
* fix typo
* refactor kv metadata
* mtmd_default_marker()
* musa: fix build warning (unused parameter)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: upgrade MUSA SDK version to rc4.0.1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* musa: use mudnn::Unary::IDENTITY op to accelerate D2D memory copy
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/cpy.cu
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* musa: remove MUDNN_CHECK_GEN and use CUDA_CHECK_GEN instead in MUDNN_CHECK
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* server : (experimental) vision support via libmtmd
* mtmd : add more api around mtmd_image_tokens
* mtmd : add more api around mtmd_image_tokens
* mtmd : ability to calc image hash
* shared_ptr for mtmd_image_tokens
* move hash to user-define ID (fixed)
* abstract out the batch management
* small fix
* refactor logic adding tokens to batch
* implement hashing image
* use FNV hash, now hash bitmap instead of file data
* allow decoding image embedding to be split into batches
* rm whitespace
* disable some features when mtmd is on
* fix --no-mmproj-offload
* mtmd_context_params no timings
* refactor server_inp to server_tokens
* fix the failing test case
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* improve server_input struct
* clip : fix confused naming ffn_up and ffn_down
* rm ffn_i/o/g naming
* rename n_embd, n_ff
* small fix
* no check n_ff
* fix detokenize
* add const to various places
* add warning about breaking changes
* add c api
* helper: use mtmd_image_tokens_get_n_pos
* fix ctx_shift
* fix name shadowing
* more strict condition
* support remote image_url
* remote image_url log
* add CI test
* do not log base64
* add "has_multimodal" to /props
* remove dangling image
* speculative: use slot.cache_tokens.insert
* Apply suggestions from code review
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* rm can_be_detokenized
* on prmpt processing done, assert cache_tokens.size
* handle_completions_impl returns void
* adapt the new web ui
* update docs and hot topics
* rm assert
* small fix (2)
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* llava : add clip_n_output_tokens, deprecate clip_n_patches
* mtmd : add qwen2vl and qwen2.5vl
* decode_embd_batch::set_position_...
* working version
* deprecate llama-qwen2vl-cli
* correct order W, H of clip_embd_nbytes_by_img
* edit existing line in hot topics
* mtmd : merge `llava-cli` and `gemma3-cli` into single `mtmd-cli`
* support for minicpmv
* remove cpp files of llava and minicpmv
* update hot topics
* mtmd : add not supported msg for qwen2vl
* Update examples/llava/mtmd.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>