Commit Graph

1765 Commits

Author SHA1 Message Date
Han Yin a8dc825aef UI: handle back press on Model Selection screen 2025-10-28 11:39:17 -07:00
Han Yin d1b018e375 UI: show a Snack bar to warn user that system prompt is not always supported 2025-10-28 11:39:17 -07:00
Han Yin 56a7272858 UI: polish model cards on Benchmark and Conversation screens to show model loading metrics 2025-10-28 11:39:17 -07:00
Han Yin 10ca2fa834 util: extract formatting helper functions from FileUtils into a new FormatUtils 2025-10-28 11:39:17 -07:00
Han Yin d7afcc41d5 UI: polish ModelLoading screen 2025-10-28 11:39:17 -07:00
Han Yin 57b5001f5c nit: revert accidental committing of debug code 2025-10-28 11:39:17 -07:00
Han Yin ec47fa3d14 nit: allow deselect model on Model Selection screen 2025-10-28 11:39:17 -07:00
Han Yin 6b74c49e6b UI: polish model card 2025-10-28 11:39:17 -07:00
Han Yin c12ef7a779 UI: update ModelSelectionScreen with a preselect mechanism 2025-10-28 11:39:17 -07:00
Han Yin b81a0c6e6d UI: refactor ModelCard UI to show GGUF metadata 2025-10-28 11:39:17 -07:00
Han Yin 9056f27a91 nit: rename lastUsed field to dateLastUsed; add dateAdded field 2025-10-28 11:39:17 -07:00
Han Yin 7540c2a8b9 nit: refactor data.local package structure 2025-10-28 11:39:17 -07:00
Han Yin 7ed79319e5 GGUF: make GgufMetadata serializable in order to be compatible with Room 2025-10-28 11:39:17 -07:00
Han Yin 8ae0c3d2fa DB: introduce Kotlin serialization extension's library and plugin; add Room runtime library 2025-10-28 11:39:17 -07:00
Han Yin 67499727ef gguf: add GGUF metadata data holder and its corresponding extractor implementation 2025-10-28 11:39:17 -07:00
Han Yin a9466c0370 navigation: sink model loading state management from AppContent down into ModelLoadingScreen; pass ModelLoadingMetrics to Benchmark and Conversation screens 2025-10-28 11:39:17 -07:00
Han Yin 8a682ff85d core: throw Exception instead of returning null if model fails to load 2025-10-28 11:39:17 -07:00
Han Yin f313362ced nit: polish ModelLoadingScreen UI 2025-10-28 11:39:17 -07:00
Han Yin 1d508f367e UI: update AppContent to pass in correct navigation callbacks 2025-10-28 11:39:17 -07:00
Han Yin 0d65c4b06b nit: extract app name into a constant value; remove unused onBackPressed callbacks 2025-10-28 11:39:17 -07:00
Han Yin 9f1d26ac95 UI: migrate ConversationViewModel onto ModelLoadingViewModel; update & refine ConversationScreen 2025-10-28 11:39:17 -07:00
Han Yin cb508be782 UI: migrate ModelLoadingScreen onto ModelLoadingViewModel; update & refine ModelLoadingScreen 2025-10-28 11:39:17 -07:00
Han Yin f61c512223 UI: expose a single facade ModelUnloadDialogHandler; move UnloadModelState into ModelUnloadingViewModel.kt 2025-10-28 11:39:17 -07:00
Han Yin c5a3ac7eb1 UI: Introduce an abstract ViewModel to handle additional model unloading logics 2025-10-28 11:39:17 -07:00
Han Yin e1c77c6bbd LLama: add a new Initializing state; add two extension properties; rename LibraryLoaded state to Initialized 2025-10-28 11:39:17 -07:00
Han Yin ba40d689a1 UI: implement BenchmarkScreen's individual back handling 2025-10-28 11:39:17 -07:00
Han Yin 8203ddb97a UI: refactor back handling by removing centralized BackHandlerSetup and UnloadModelConfirmationDialog from AppContent 2025-10-28 11:39:17 -07:00
Han Yin c08d02d233 LLama: add ModelUnloadingState to engine State; add missing state checks in stub engine; fix instrumentation engine's error messages 2025-10-28 11:39:17 -07:00
Han Yin 481ba6e9d3 UI: remove code duplication in sort menu 2025-10-28 11:39:17 -07:00
Han Yin 41615be5ae UI: fix the typo `totalGb` in `StorageMetrics` 2025-10-28 11:39:17 -07:00
Han Yin 69f2bd62f9 UI: replace ugly optional as casts in AppScaffold with extension functions 2025-10-28 11:39:17 -07:00
Han Yin e269da655f UI: combine TopBarConfig and BottomBarConfig into each route's ScaffoldConfig 2025-10-28 11:39:17 -07:00
Han Yin 225c5435c5 UI: refactor BottomBarConfig.ModelsManagement APIs 2025-10-28 11:39:17 -07:00
Han Yin 63fc56d603 UI: centralize the AppScaffold and modularize its configs 2025-10-28 11:39:17 -07:00
Han Yin 72e97b93c5 feature: check for available space before copying local model 2025-10-28 11:39:16 -07:00
Han Yin 65d4a57a8b LLama: refactor loadModel by splitting the system prompt setting into a separate method 2025-10-28 11:39:16 -07:00
Han Yin 9f77155535 VM: handle the cancellation of ongoing token generation 2025-10-28 11:39:16 -07:00
Han Yin 46859c10f0 LLama: update engine state after handling the cancellation of sendUserPrompt 2025-10-28 11:39:16 -07:00
Han Yin 06448a60a5 UI: update UI for ongoing model import's cancellation 2025-10-28 11:39:16 -07:00
Han Yin 9ba74a9d3d data: allow canceling the ongoing model import 2025-10-28 11:39:16 -07:00
Han Yin d70b8fe323 core: swap in LLamaAndroid and mark stub engine for testing only 2025-10-28 11:39:16 -07:00
Han Yin c2426a42e5 UI: unify Model Card components 2025-10-28 11:39:16 -07:00
Han Yin 434933f5b3 UI: show model card in Conversation screen 2025-10-28 11:39:16 -07:00
Han Yin 9769467723 UI: show model card in Model Loading screen 2025-10-28 11:39:16 -07:00
Han Yin 9cfa74f754 core: break down InferenceManager due to Interface Segregation Principle 2025-10-28 11:39:16 -07:00
Han Yin 286ed05f13 vm: merge SystemPromptViewModel into ModelLoadingViewModel 2025-10-28 11:39:16 -07:00
Han Yin 23d411d86e vm: split mono MainViewModel into separate individual ViewModels 2025-10-28 11:39:16 -07:00
Han Yin 32d778bb8e core: extract conversation and benchmark logics into InferenceManager; add logs and missing state updates in stub InferenceEngine 2025-10-28 11:39:16 -07:00
Han Yin 51b120f464 data: pass through getModelById from ModelDao into ModelRepository 2025-10-28 11:39:16 -07:00
Han Yin 59f5caa699 Util: split FileUtils from ModelUtils; extract copy methods into FileUtils 2025-10-28 11:39:16 -07:00
Han Yin 4913ad0dae nit: tidy SystemPromptViewModel 2025-10-28 11:39:16 -07:00
Han Yin 2614f91226 UI: replace model selection screen's data stubbing; add empty view 2025-10-28 11:39:16 -07:00
Han Yin 6b48f7473f UI: extract a shared ModelCard component 2025-10-28 11:39:16 -07:00
Han Yin 0d41e75ca5 UI: add a confirmation step when user picks a file; refactor model import overlay into AlertDialog 2025-10-28 11:39:16 -07:00
Han Yin 1bebd1bb07 util: extract file size formatting into ModelUtils 2025-10-28 11:39:16 -07:00
Han Yin 561fe0222f UI: handle back navigation when user is in multi-selection mode 2025-10-28 11:39:16 -07:00
Han Yin 2d6b8856f6 UI: implement multiple models deletion; update Models Management screen 2025-10-28 11:39:16 -07:00
Han Yin 025e3d2417 UI: enrich ModelManagementState; extract filename to show correct importing UI 2025-10-28 11:39:16 -07:00
Han Yin adfbfe3ffb data: add a util file for extracting file name & size and model metadata 2025-10-28 11:39:16 -07:00
Han Yin 290a6bfebe bugfix: use List instead of Collection for ModelDao's deletion 2025-10-28 11:39:16 -07:00
Han Yin 5de0b5d6d0 data: import local model with file picker 2025-10-28 11:39:16 -07:00
Han Yin a3ebdac58f UI: polish sort order menu 2025-10-28 11:39:16 -07:00
Han Yin 760d66c97d UI: replace Models Management screen's stubbing with instrumentation 2025-10-28 11:39:16 -07:00
Han Yin bc93c384a7 data: introduce Model entity and DAO; update DI module 2025-10-28 11:39:16 -07:00
Han Yin f5e2edda87 data: [WIP] prepare for ModelRepository refactor & impl 2025-10-28 11:39:16 -07:00
Han Yin b6cc8f0c01 DI: abstract the protocol of SystemPromptRepository; update AppModule 2025-10-28 11:39:16 -07:00
Han Yin eebc05b559 UI: polish UI for ModelsManagementScreen; inject ModelsManagementViewModel 2025-10-28 11:39:16 -07:00
Han Yin 6e82bb37d3 Feature: Introduce ModelRepository and ModelsManagementViewModel; update AppModule 2025-10-28 11:39:16 -07:00
Han Yin aedf442632 DI: Optimize AppModule 2025-10-28 11:39:16 -07:00
Han Yin d60bba9b8f UI: navigation with more natural animated transitions 2025-10-28 11:39:16 -07:00
Han Yin 511df35704 bugfix: wait for model to load before navigating to benchmark screen; use NavigationActions instead of raw navController 2025-10-28 11:39:16 -07:00
Han Yin ea11ee3c94 UI: optimize AppContent's composing 2025-10-28 11:39:16 -07:00
Han Yin 0afd087f35 DI: replace manual DI with Hilt DI 2025-10-28 11:39:16 -07:00
Han Yin a1f6e7e476 DI: make viewmodels Hilt injectable 2025-10-28 11:39:16 -07:00
Han Yin 564b095427 DI: make app Hilt injectable 2025-10-28 11:39:16 -07:00
Han Yin 65741a7e64 DI: introduce Hilt plugin + processor + lib dependencies 2025-10-28 11:39:16 -07:00
Han Yin af0d68d611 nit: combine temperatureMetrics and useFahrenheit 2025-10-28 11:39:16 -07:00
Han Yin 5e4972e93e UI: refactor top app bars 2025-10-28 11:39:16 -07:00
Han Yin 2a41c0e354 vm: replace token metrics stubs with actual implementation 2025-10-28 11:39:16 -07:00
Han Yin e47e3b77ee UI: locks user in alert dialog when model is unloading 2025-10-28 11:39:16 -07:00
Han Yin 6b341b0fbe bugfix: handle user quitting on model loading 2025-10-28 11:39:16 -07:00
Han Yin e8b84c6ebf UI: code polish 2025-10-28 11:39:16 -07:00
Han Yin fddf060d92 data: code polish 2025-10-28 11:39:16 -07:00
Han Yin 3b499ac7e4 UI: polish conversation screen 2025-10-28 11:39:16 -07:00
Han Yin 64ebdc67a6 UI: update app name to be more Arm 2025-10-28 11:39:16 -07:00
Han Yin 55681847e9 UI: rename `ModeSelection` to `ModelLoading` for better clarity 2025-10-28 11:39:16 -07:00
Han Yin 75c986afc5 bugfix: properly handle user quitting the conversation screen while tokens are being generated 2025-10-28 11:39:16 -07:00
Han Yin 4848bf93d0 data: introduce repo for System Prompt; flow data from Room to VM 2025-10-28 11:39:16 -07:00
Han Yin 5596d5203b DB: setup Room database 2025-10-28 11:39:16 -07:00
Han Yin 4046cd16fd Deps: bump Kotlin plugin; introduce KSP; apply in :app subproject 2025-10-28 11:39:16 -07:00
Han Yin 5868eaa66b UI: polish system prompt setup UI 2025-10-28 11:39:16 -07:00
Han Yin a7ee3d305f UI: split a nested parent settings screen into separate child settings screens 2025-10-28 11:39:16 -07:00
Han Yin 65c09b2b32 UI: allow drawer's gesture control only on Home and Settings screens; enable alert dialog on back navigation inside conversation and benchmark 2025-10-28 11:39:16 -07:00
Han Yin 648b97818e UI: disable triggering drawer via gesture; enable alert dialog on back navigation inside conversation and benchmark 2025-10-28 11:39:16 -07:00
Han Yin a7ae8b7ce0 [WIP] DI: implement simple local vm factory provider 2025-10-28 11:39:16 -07:00
Han Yin ca2b7772ce UI: add a new MainActivity; update manifest 2025-10-28 11:39:16 -07:00
Han Yin 7e5c80cee9 UI: implement core flow's screens 2025-10-28 11:39:16 -07:00
Han Yin 5ad65919e9 util: implement user preferences utility 2025-10-28 11:39:16 -07:00
Han Yin 46bd638c5f util: implement performance monitor; wrap it with a viewmodel 2025-10-28 11:39:16 -07:00
Han Yin 4dd755e25b UI: implement basic UI components 2025-10-28 11:39:16 -07:00
Han Yin 32608fb225 UI: app navigation 2025-10-28 11:39:16 -07:00
Han Yin 3f913ce440 LLM: stub a local inference engine for faster iteration 2025-10-28 11:39:16 -07:00
Han Yin 3787fbddb0 data: define data models for LLM and system prompts 2025-10-28 11:39:16 -07:00
Han Yin 697d778db7 UI: define theme, color palette, typography and shape 2025-10-28 11:39:16 -07:00
Han Yin cbe7133742 UI: introduce new dependencies, update versions & references 2025-10-28 11:39:16 -07:00
Han Yin 44a522dbc8 UI: move existing UI src files into `legacy` package 2025-10-28 11:39:16 -07:00
Han Yin 37f3e1c415 Feature: use local llama_context for benchmarking; support context init with custom context size 2025-10-28 11:39:16 -07:00
Han Yin 6d2279e9cd REWRITE JNI bridge; Update viewmodel 2025-10-28 11:39:16 -07:00
Han Yin e1bc87610e Perf: allocate `llama_batch` on stack with `llama_batch_init` 2025-10-28 11:39:16 -07:00
Han Yin 2b52563737 Polish: better logging & documentation 2025-10-28 11:39:16 -07:00
Han Yin ec502cfde9 Feature: implement infinite conversation via context shifting 2025-10-28 11:39:16 -07:00
Han Yin 4e515727b4 Abort on system prompt too long; Truncate user prompt if too long. 2025-10-28 11:39:16 -07:00
Han Yin 4809112ec5 Polish: adopt common naming; init modularization; 2025-10-28 11:39:16 -07:00
Han Yin 8bf2f4d412 Feature: chat template auto formatting 2025-10-28 11:39:16 -07:00
Han Yin 1b0754c0f5 Perf: optimize performance with ARM features 2025-10-28 11:39:16 -07:00
Han Yin bb5b824208 Polish: populate backend names in `benchModel` 2025-10-28 11:39:16 -07:00
Han Yin c14c11dcbd Feature: decode system and user prompt in batches 2025-10-28 11:39:16 -07:00
Han Yin 02465137ca Bug fix: null system prompt state update; Safeguard empty user prompt 2025-10-28 11:39:16 -07:00
Han Yin 7bbb53aaf8 Clang-tidy linting: make functions & global variables static 2025-10-28 11:39:16 -07:00
Han Yin f44882aeeb Enforce centralized dependency management; bump Gradle & deps versions 2025-10-28 11:39:16 -07:00
Han Yin 0ade7fb4d7 Polish binding: Remove verbose setup JNI APIs; Update state machine states. 2025-10-28 11:39:16 -07:00
Han Yin 7dc9968f82 Restructure `LLamaAndroid.kt` 2025-10-28 11:39:16 -07:00
Han Yin 44720859d6 Rewrite llama-android JNI implementation 2025-10-28 11:39:15 -07:00
Han Yin d4ab3832cf Use common sampler 2025-10-28 11:39:15 -07:00
Han Yin 1f255d4bca Tidy & clean LLamaAndroid binding 2025-10-28 11:39:15 -07:00
Daniel Bevenius 56b4795842
model-conversion : add support for SentenceTransformers (#16387)
* model-conversion : add support for SentenceTransformers

This commit adds support for models that use SentenceTransformer layers.

The motivation for this is that if a converted model includes any of the
numbered layers specified in the original model's repository, these
changes enable such models to be used and verified. Previously, the
model-conversion only supported the base model output, without any of
the additional transformation layers.

Usage:
Convert the model that also includes the SentenceTransformer layers:
```console
(venv) $ export EMBEDDING_MODEL_PATH="~/google/embeddinggemma-300M"
(venv) make embedding-convert-model
```

Verify the produced embeddings from the converted model against the
original model embeddings:
```console
(venv) make embedding-verify-logits-st
```

The original model can be run using SentenceTransformer:
```console
(venv) make embedding-run-original-model-st
```

Run the converted model using "SentenceTransformer" layers, which
enable pooling and normalization:
```console
(venv) make embedding-run-converted-model-st
```

* add model-conversion example requirements

* add support for -st flag in embedding model conversion

This commit adds support for the -st flag in the embedding model
conversion script. This will enable models to be converted using
SentenceTransformers dense layers.
2025-10-09 14:35:22 +02:00
Aaron Teo 624207e676
devops: add s390x & ppc64le CI (#15925)
* devops: move s390x and ppc64le ci build

we have access to ubuntu-24.04-s390x and ppc64le images now

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le for now since they have compiler errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: stop warnings as errors

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: switch to non-macro flag

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: going the llama macro route

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add big-endian gguf test models

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le to test s390x, check test build

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dup .gguf.inp files for big-endian tests

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dup .gguf.out files for big-endian too

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add python setup and endian byteswap

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: poor thing does not have s390x python3

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add missing rust compiler for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: try rust actions runner

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Revert "devops: try rust actions runner"

This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: try a different path for rust

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: dump home directory and user info

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: install gguf-py only

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: missed relative path

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: remove big-endian files since local swapping is working

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: revert test-tokenizer-0 cmakelists

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix unicode flags conversion from and to uint16_t

Bitfields are allocated in different order on s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
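The underlying issue is that C bitfield allocation order is ABI-specific, so the same flags struct can serialize to a different uint16_t on s390x than on x86. A minimal Python sketch (hypothetical flag names, not the actual `unicode_cpt_flags` fields) of the endian-independent alternative: define bit positions with explicit shifts and fix the byte order on serialization.

```python
import struct

def pack_flags(is_letter: bool, is_digit: bool, is_whitespace: bool) -> int:
    # Explicit shifts define the bit positions ourselves, instead of
    # letting the compiler's bitfield allocation order decide them.
    v = 0
    v |= 1 << 0 if is_letter else 0
    v |= 1 << 1 if is_digit else 0
    v |= 1 << 2 if is_whitespace else 0
    return v

def to_bytes(flags: int) -> bytes:
    # Byte order still matters once the value hits a file; pinning it
    # (little-endian here) keeps the serialized form portable.
    return struct.pack("<H", flags)
```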

* Simplify byteswap command

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix endianness detection in vocab loader

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Disable test-thread-safety on s390x

In this test a model is downloaded, then immediately loaded to check if
more downloads are needed, and then used for the test.

There is no clean way to separate these steps
to add byteswapping between them, so just skip this test.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix q8_0 test in test-quantize-fns

vec_signed uses unexpected rounding mode.
Explicitly use different rounding function.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
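The distinction here is truncation versus round-to-nearest. A toy scalar sketch in Python (purely illustrative; the real fix concerns s390x vector intrinsics, not this code) of how the two rounding modes diverge during quantization:

```python
import math

def quantize_trunc(x: float, scale: float) -> int:
    # A plain float->int conversion truncates toward zero, which is
    # what a rounding-mode-sensitive intrinsic may effectively do.
    return int(x / scale)

def quantize_nearest(x: float, scale: float) -> int:
    # Round to nearest, ties away from zero: the behavior the
    # reference quantization expects.
    q = x / scale
    return int(math.floor(q + 0.5)) if q >= 0 else int(math.ceil(q - 0.5))
```

For a value like 0.9 with scale 1.0, truncation yields 0 while nearest rounding yields 1, which is enough to fail a bit-exact quantization test.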

* devops: add big-endian stories260K

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: add s390x test-eval-callback

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix test does not exist

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: fix model not found llama-eval-callback

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix q3_K dot product error in test-quantize-fns on s390x

Array q8bytes had only 4 elements allocated, but 8 elements were accessed.
This led to out-of-bounds writes, followed by out-of-bounds reads of the
overwritten values, producing incorrect results.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: re-enable ppc64le for testing

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: activate test-thread-safety for s390x

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: disable ppc64le tests

for some reason it keeps failing the test-thread-safety tests, and I do
not have a machine able to replicate the failures.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* devops: LLAMA_FATAL_WARNINGS=ON

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Correct repository URL for s390x for test-thread-safety model

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fix fs_get_cache_directory

Ensure it works even if both XDG_CACHE_HOME and HOME are unset.
This might happen in containers.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
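The fallback order described can be sketched in Python (illustrative only; the actual `fs_get_cache_directory` is C++ in llama.cpp's common code, and the directory name here is an assumption):

```python
import os

def get_cache_directory(app: str = "llama.cpp") -> str:
    # Prefer XDG_CACHE_HOME; fall back to $HOME/.cache; if neither is
    # set (as can happen in minimal containers), use a relative path
    # rather than failing.
    base = os.environ.get("XDG_CACHE_HOME")
    if not base:
        home = os.environ.get("HOME")
        base = os.path.join(home, ".cache") if home else ".cache"
    return os.path.join(base, app)
```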

* Re-enable CI for ppc64le

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Fortify ggml_rope_impl

Only memcpy data from sections argument if it's non-NULL.

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>

* Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way

* Update URL for big-endian model

* Update .github/workflows/build.yml

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update remaining mentions of BE models to ggml-org/models repo

---------

Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com>
Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-09-27 02:03:33 +08:00
Daniel Bevenius aa3ee0eb0b
model-conversion : add embedding prompt file support (#15871)
This commit adds support for passing a prompt file to the model
conversion targets/scripts. It also updates the logits.cpp to print out
embedding information in the same format as when running the original
embedding model.

The motivation for this is that it allows us to pass files of different
sizes when running the converted models and validating the logits.

This can be particularly important when testing the sliding window
functionality of models where the sequence length needs to exceed a
certain number of tokens to trigger the sliding window logic.
2025-09-25 12:02:36 +02:00
Douglas Hanley b5bd037832
llama : add support for qwen3 reranker (#15824) 2025-09-25 11:53:09 +03:00
Jie Fu (傅杰) 63b54c81a6
model-conversion : make causal-verify-logits work with model names containing "." (#16215)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 10:25:26 +02:00
Jie Fu (傅杰) 7735706b93
model-conversion : run-org-model.py fails to run on mac m1 (#16213)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 08:46:52 +02:00
Jie Fu (傅杰) 8ba548dae2
model-conversion : fix the make targets in the README.md (#16209)
Fix two incorrect make targets in the readme.

Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-24 06:19:23 +02:00
Georgi Gerganov 432cf4304c
codeowners : update + cleanup (#16174)
---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-09-22 18:20:21 +03:00
GideonSerf c6db9a1027
embedding : fix typos in README (#16171) 2025-09-22 11:49:58 +03:00
Jie Fu (傅杰) 1cbd80f8cf
examples : support encoder-decoder models in the simple example (#16002)
Signed-off-by: Jie Fu <jiefu@tencent.com>
2025-09-17 10:29:00 +03:00
Aman Gupta 6d758839ff
Add LLaDA-7b-MoE diffusion model (#16003) 2025-09-16 10:38:28 +08:00
Piotr Wilkin (ilintar) acc1b008cf
model-conversion : add extra debugging support for model conversion (#15877)
* feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py
2025-09-09 06:05:55 +02:00
Aldehir Rojas 7057faf64b
json : support `enum` values within `allOf` (#15830) 2025-09-08 16:14:32 -05:00
Erik Scholz a81283820a
gguf: gguf_writer refactor (#15691)
* gguf: split gguf writer into base and buf impl
* gguf: templated gguf write out
* gguf: file based writer (avoid writing everything to memory first!)
* examples(llama2c): fix log not being the same level and compiler nits
2025-09-05 11:34:28 +02:00
Daniel Bevenius 5d6688de08
model-conversion : add --embeddings flag to modelcard.template [no ci] (#15801)
This commit updates the modelcard.template file used in the model
conversion scripts for embedding models to include the llama-server
--embeddings flag in the recommended command to run the model.

The motivation for this change was that when using the model-conversion
"tool" to upload the EmbeddingGemma models to Hugging Face, this flag was
missing and the embedding endpoint was therefore not available when
copy-and-pasting the command.
2025-09-05 04:36:23 +02:00
Daniel Bevenius 407c23786d
model-conversion : fix pyright errors (#15770)
This commit addresses type errors reported by pyright in the model
conversion scripts.
2025-09-03 18:28:36 +02:00
Daniel Bevenius 40a751ea9a
model-conversion : remove hardcoded /bin/bash shebangs [no ci] (#15765)
* model-conversion : remove hardcoded /bin/bash shebangs [no ci]

This commit updates the bash scripts to use env instead of a
hardcoded /bin/bash in the shebang line.

The motivation for this is that some systems may have bash installed
in a different location, and using /usr/bin/env bash ensures that
the script will use the first bash interpreter found in the user's
PATH, making the scripts more portable across different environments.
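The change amounts to the first line of each script; a minimal sketch:

```shell
#!/usr/bin/env bash
# env searches PATH for bash, so this works even when bash is not
# installed at /bin/bash (e.g. on NixOS or some BSDs).
set -euo pipefail
echo "found bash at: $(command -v bash)"
```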

* model-conversion : rename script to .py [no ci]

This commit renames run-casual-gen-embeddings-org.sh to
run-casual-gen-embeddings-org.py to reflect its Python nature.
2025-09-03 12:50:47 +02:00
Daniel Bevenius 8c3fdf44ec
model-conversion : add missing curl script [no ci] (#15761)
This commit adds a curl script to the model-conversion examples
which is currently missing. This script is required for the running the
embedding server targets to test llama-server embeddings functionality.
2025-09-03 09:48:35 +02:00
Georgi Gerganov e92d53b29e
sampling : optimize samplers by reusing bucket sort (#15665)
* sampling : optimize sorting using bucket sort in more places

ggml-ci

* sampling : do not sort in dist sampler

ggml-ci

* sampling : avoid heap allocations for sort buffers

ggml-ci

* common : add option to sort sampling candidates by probability

ggml-ci

* sampling : revert the change for preserving sort buffers

* sampling : use std::copy instead of memcpy

* sampling : clarify purpose of partial sort helpers

ggml-ci

* cont : remove wrong comment [no ci]

* common : update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-08-31 20:41:02 +03:00
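The idea behind the bucket-sort optimization is that when only the top-k candidates are needed, most of the candidate array never has to be sorted: values are binned by range, buckets are consumed from the top, and only the collected prefix gets a real sort. A simplified Python sketch (illustrative only; the actual implementation operates on llama.cpp's sampler candidate arrays):

```python
def partial_top_k(probs, k, n_buckets=64):
    # Bin each probability into a bucket by value range, then walk the
    # buckets from highest to lowest until at least k candidates are
    # collected; only those few need an exact sort.
    lo, hi = min(probs), max(probs)
    width = (hi - lo) / n_buckets or 1.0  # guard the all-equal case
    buckets = [[] for _ in range(n_buckets)]
    for i, p in enumerate(probs):
        b = min(int((p - lo) / width), n_buckets - 1)
        buckets[b].append((p, i))
    out = []
    for bucket in reversed(buckets):
        out.extend(bucket)
        if len(out) >= k:
            break
    out.sort(key=lambda t: -t[0])  # exact sort of the small prefix only
    return out[:k]
```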
Johannes Gäßler e81b8e4b7f
llama: use FA + max. GPU layers by default (#15434)
* llama: use max. GPU layers by default, auto -fa

* ggml-backend: abort instead of segfault
2025-08-30 16:32:10 +02:00
Gabe Goodhart a8bca68f72
fix: Compute the full sum in llama-eval-callback, not just the sum of printed values (#15637)
This makes it much easier to compare between llama.cpp and transformers!

https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409
Branch: gabe-l-hart/nvidia-nemotron-nano-15409

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
2025-08-28 15:27:36 -05:00
Daniel Bevenius 46d9caa27a
model-conversion : add mmproj conversion target (#15628)
This commit adds a new target to the Makefile for converting models that
are multimodal. This target will convert the original model and in
addition also create the mmproj GGUF model.

The motivation for this change is that for models that are multimodal,
for example those that contain a vision encoder, we will often want to
upload both the quantized model and the vision encoder model to
HuggingFace.

Example usage:
```console
$ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/
...
The environment variable CONVERTED_MODEL can be set to this path using:
export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf
The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf
```
The converted original model can then be quantized, and after that both
the quantized model and the mmproj file can then be uploaded to
HuggingFace.

Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main
2025-08-28 09:26:48 +02:00
Daniel Bevenius 62cef26ac5
model-conversion : add qat-q4 quantization targets (#15588)
This commit adds two targets to the Makefile for quantizing of
Quantization Aware Trained (QAT) models to Q4_0 format.

The motivation for this is that it sets the token embedding and the
output tensor data types to Q8_0 instead of the default Q6_K. This is
something that we wish to enforce for QAT Q4_0 models that are to be
uploaded to ggml-org on Huggingface to guarantee the best quality.
2025-08-26 16:12:29 +02:00
Daniel Bevenius dfd9b5f6c7
model-conversion : set pooling type to none in logits.cpp (#15564)
This commit explicitly sets the pooling type to 'none' in the logits.cpp
to support models that have a pooling type specified.

The motivation for this is that some models may have a pooling type set
in the model file (.gguf file), and for this specific case where we only
want to extract logits, we need to ensure that no pooling is used so
that we are comparing raw logits and not pooled embeddings.
2025-08-25 15:00:43 +02:00
Daniel Bevenius 5a6bc6b1a6
model-conversion : add model card template for embeddings [no ci] (#15557)
* model-conversion: add model card template for embeddings [no ci]

This commit adds a separate model card template (model repository
README.md template) for embedding models.

The motivation for this is that the server command for the embedding
model is a little different, and some additional information can be
useful in the model card for embedding models which might not be
directly relevant for causal models.

* squash! model-conversion: add model card template for embeddings [no ci]

Fix pyright lint error.

* remove --pooling override and clarify embd_normalize usage
2025-08-25 14:25:25 +02:00