llama.cpp

Commit Graph

Author	SHA1	Message	Date
Han Yin	4dd755e25b	UI: implement basic UI components	2025-10-28 11:39:16 -07:00
Han Yin	32608fb225	UI: app navigation	2025-10-28 11:39:16 -07:00
Han Yin	3f913ce440	LLM: stub a local inference engine for faster iteration	2025-10-28 11:39:16 -07:00
Han Yin	3787fbddb0	data: define data models for LLM and system prompts	2025-10-28 11:39:16 -07:00
Han Yin	697d778db7	UI: define theme, color palette, typography and shape	2025-10-28 11:39:16 -07:00
Han Yin	cbe7133742	UI: introduce new dependencies, update versions & references	2025-10-28 11:39:16 -07:00
Han Yin	44a522dbc8	UI: move existing UI src files into `legacy` package	2025-10-28 11:39:16 -07:00
Han Yin	37f3e1c415	Feature: use local llama_context for benchmarking; support context init with custom context size	2025-10-28 11:39:16 -07:00
Han Yin	6d2279e9cd	REWRITE JNI bridge; Update viewmodel	2025-10-28 11:39:16 -07:00
Han Yin	e1bc87610e	Perf: allocate `llama_batch` on stack with `llama_batch_init`	2025-10-28 11:39:16 -07:00
Han Yin	2b52563737	Polish: better logging & documentation	2025-10-28 11:39:16 -07:00
Han Yin	ec502cfde9	Feature: implement infinite conversation via context shifting	2025-10-28 11:39:16 -07:00
Han Yin	4e515727b4	Abort on system prompt too long; Truncate user prompt if too long.	2025-10-28 11:39:16 -07:00
Han Yin	4809112ec5	Polish: adopt common naming; init modularization;	2025-10-28 11:39:16 -07:00
Han Yin	8bf2f4d412	Feature: chat template auto formatting	2025-10-28 11:39:16 -07:00
Han Yin	1b0754c0f5	Perf: optimize performance with ARM features	2025-10-28 11:39:16 -07:00
Han Yin	bb5b824208	Polish: populate backend names in `benchModel`	2025-10-28 11:39:16 -07:00
Han Yin	c14c11dcbd	Feature: decode system and user prompt in batches	2025-10-28 11:39:16 -07:00
Han Yin	02465137ca	Bug fix: null system prompt state update; Safeguard empty user prompt	2025-10-28 11:39:16 -07:00
Han Yin	7bbb53aaf8	Clang-tidy linting: make functions & global variables static	2025-10-28 11:39:16 -07:00
Han Yin	f44882aeeb	Enforce centralized dependency management; bump Gradle & deps versions	2025-10-28 11:39:16 -07:00
Han Yin	0ade7fb4d7	Polish binding: Remove verbose setup JNI APIs; Update state machine states.	2025-10-28 11:39:16 -07:00
Han Yin	7dc9968f82	Restructure `LLamaAndroid.kt`	2025-10-28 11:39:16 -07:00
Han Yin	44720859d6	Rewrite llama-android JNI implementation	2025-10-28 11:39:15 -07:00
Han Yin	d4ab3832cf	Use common sampler	2025-10-28 11:39:15 -07:00
Han Yin	1f255d4bca	Tidy & clean LLamaAndroid binding	2025-10-28 11:39:15 -07:00
Daniel Bevenius	56b4795842	model-conversion : add support for SentenceTransformers (#16387 ) * model-conversion : add support for SentenceTransformers This commit adds support for models that use SentenceTransformer layers. The motivation for this is that if converted model includes any of the numbered layers specified in the original models repository then these changes enable these models to be used and verified. Currently the model-conversion only support the base model output without any of the additional transformation layers. Usage: Convert the model that also includes the SentenceTransformer layers: ```console (venv) $ export EMBEDDING_MODEL_PATH="~/google/embeddinggemma-300M" (venv) make embedding-convert-model ``` Verify the produced embeddings from the converted model against the original model embeddings: ```console (venv) make embedding-verify-logits-st ``` The original model can be run using SentenceTransformer: ```console (venv) make embedding-run-original-model-st ``` Run the converted model using "SentenceTransformer" layers whic enables pooling and normalization: ```console (venv) make embedding-run-converted-model-st ``` * add model-conversion example requirements * add support for -st flag in embedding model conversion This commit add support for the -st flag in the embedding model conversion script. This will enable models to be converted using sentence transformers dense layers.	2025-10-09 14:35:22 +02:00
Aaron Teo	624207e676	devops: add s390x & ppc64le CI (#15925 ) * devops: move s390x and ppc64le ci build we have access to ubuntu-24.04-s390x and ppc64le images now Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le for now since they have compiler errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: stop warnings as errors Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: switch to non-macro flag Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: going the llama macro route Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian gguf test models Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le to test s390x, check test build Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.inp files for big-endian tests Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dup .gguf.out files for big-endian too Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add python setup and endian byteswap Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: pooring thing does not have s390x python3 Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add missing rust compiler for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try rust actions runner Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Revert "devops: try rust actions runner" This reverts commit 3f8db04356033d6c1d7eccc75ca396bc5298250c. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: try a different path for rust Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: dump home directory and user info Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: install gguf-py only Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: missed relative path Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: remove big-endian files since local swapping is working Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: revert test-tokenizer-0 cmakelists Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix unicode flags conversion from and to uint16_t Bitfields are allocated in different order on s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Simplify byteswap command Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add byteswapping and git-lfs for test-tokenizers-ggml-vocabs Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix endianness detection in vocab loader Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Disable test-thread-safety on s390x In this test a model is downloaded, then immediately loaded to check if more downloads are needed, and then used for test. There is no clean way to separate all those steps to add byteswapping between them, so just skip this test. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q8_0 test in test-quantize-fns vec_signed uses unexpected rounding mode. Explicitly use different rounding function. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add big-endian stories260K Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: add s390x test-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix test does not exist Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: fix model not found llama-eval-callback Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix q3_K dot product error in test-quantize-fns on s390x Array q8bytes had only 4 elements allocated, but 8 elements accessed. This lead to write out of bounds and later read of overwritten values out of bounds and incorrect result. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: re-enable ppc64le for testing Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: activate test-thread-safety for s390x Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: disable ppc64le tests for some reason it keeps failing test-thread-safety tests and I do not have a machine that is able to replicate the tests. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * devops: LLAMA_FATAL_WARNINGS=ON Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Correct repository URL for s390x for test-thread-safety model Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fix fs_get_cache_directory Ensure it works even if both XDG_CACHE_HOME and HOME are unset. This might happen in containers. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Re-enable CI for ppc64le Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Fortify ggml_rope_impl Only memcpy data from sections argument if it's non-NULL. Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> * Add TODO in struct unicode_cpt_flags to reimplement it in endian-independent way * Update URL for big-endian model * Update .github/workflows/build.yml Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * Update remaining mentions of BE models to ggml-org/models repo --------- Signed-off-by: Aaron Teo <aaron.teo1@ibm.com> Co-authored-by: Aleksei Nikiforov <aleksei.nikiforov@linux.ibm.com> Co-authored-by: Aleksei Nikiforov <103434461+AlekseiNikiforovIBM@users.noreply.github.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>	2025-09-27 02:03:33 +08:00
Daniel Bevenius	aa3ee0eb0b	model-conversion : add embedding prompt file support (#15871 ) This commit adds support for passing a prompt file to the model conversion targets/scripts. It also updates the logits.cpp to print out embedding information in the same format as when running the original embedding model. The motivation for this is that it allows us to pass files of different sizes when running the converted models and validating the logits. This can be particularly important when testing the sliding window functionality of models where the sequence length needs to exceed a certain number of tokens to trigger the sliding window logic.	2025-09-25 12:02:36 +02:00
Douglas Hanley	b5bd037832	llama : add support for qwen3 reranker (#15824 )	2025-09-25 11:53:09 +03:00
Jie Fu (傅杰)	63b54c81a6	model-conversion : make causal-verify-logits fails with model names containing "." (#16215 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 10:25:26 +02:00
Jie Fu (傅杰)	7735706b93	model-conversion : run-org-model.py fails to run on mac m1 (#16213 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 08:46:52 +02:00
Jie Fu (傅杰)	8ba548dae2	model-conversion : fix the make targets in the README.md (#16209 ) Fix two incorrect make targets in the readme. Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-24 06:19:23 +02:00
Georgi Gerganov	432cf4304c	codeowners : update + cleanup (#16174 ) --------- Co-authored-by: slaren <slarengh@gmail.com>	2025-09-22 18:20:21 +03:00
GideonSerf	c6db9a1027	embedding : fix typos in README (#16171 )	2025-09-22 11:49:58 +03:00
Jie Fu (傅杰)	1cbd80f8cf	examples : support encoder-decoder models in the simple example (#16002 ) Signed-off-by: Jie Fu <jiefu@tencent.com>	2025-09-17 10:29:00 +03:00
Aman Gupta	6d758839ff	Add LLaDA-7b-MoE diffusion model (#16003 )	2025-09-16 10:38:28 +08:00
Piotr Wilkin (ilintar)	acc1b008cf	model-conversion : add extra debugging support for model conversion (#15877 ) * feat: Extra debugging support for model conversion - added BF16 support for llama-callback-eval and support for dumping intermediate steps in run-org-model.py	2025-09-09 06:05:55 +02:00
Aldehir Rojas	7057faf64b	json : support `enum` values within `allOf` (#15830 )	2025-09-08 16:14:32 -05:00
Erik Scholz	a81283820a	gguf: gguf_writer refactor (#15691 ) * gguf: split gguf writer into base and buf impl * gguf: templated gguf write out * gguf: file based writer (avoid writing everything to memory first!) * examples(llama2c): fix log not being the same level and compiler nits	2025-09-05 11:34:28 +02:00
Daniel Bevenius	5d6688de08	model-conversion : add --embeddings flag to modelcard.template [no ci] (#15801 ) This commit updates the modelcard.template file used in the model conversion scripts for embedding models to include the llama-server --embeddings flag in the recommended command to run the model. The motivation for this change was that when using the model-conversion "tool" to upload the EmbeddingGemma models to Hugging Face this flag was missing and the embedding endpoint was there for not available when copy&pasting the command.	2025-09-05 04:36:23 +02:00
Daniel Bevenius	407c23786d	model-conversion : fix pyright errors (#15770 ) This commit addresses type errors reported by pyright in the model conversion scripts.	2025-09-03 18:28:36 +02:00
Daniel Bevenius	40a751ea9a	model-conversion : remove hardcoded /bin/bash shebangs [no ci] (#15765 ) * model-conversion : remove hardcoded /bin/bash shebangs [no ci] This commit updates the bash scripts to use env instead of using hardcoded /bin/bash in the shebang line. The motivation for this is that some systems may have bash installed in a different location, and using /usr/bin/env bash ensures that the script will use the first bash interpreter found in the user's PATH, making the scripts more portable across different environments. * model-conversion : rename script to .py [no ci] This commit renames run-casual-gen-embeddings-org.sh to run-casual-gen-embeddings-org.py to reflect its Python nature.	2025-09-03 12:50:47 +02:00
Daniel Bevenius	8c3fdf44ec	model-conversion : add missing curl script [no ci] (#15761 ) This commit adds a curl script to the model-conversion examples which is currently missing. This script is required for the running the embedding server targets to test llama-server embeddings functionality.	2025-09-03 09:48:35 +02:00
Georgi Gerganov	e92d53b29e	sampling : optimize samplers by reusing bucket sort (#15665 ) * sampling : optimize sorting using bucket sort in more places ggml-ci * sampling : do not sort in dist sampler ggml-ci * sampling : avoid heap allocations for sort buffers ggml-ci * common : add option to sort sampling candidates by probability ggml-ci * sampling : revert the change for preserving sort buffers * sampling : use std::copy instead of memcpy * sampling : clarify purpose of partial sort helpers ggml-ci * cont : remove wrong comment [no ci] * common : update comment Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2025-08-31 20:41:02 +03:00
Johannes Gäßler	e81b8e4b7f	llama: use FA + max. GPU layers by default (#15434 ) * llama: use max. GPU layers by default, auto -fa * ggml-backend: abort instead of segfault	2025-08-30 16:32:10 +02:00
Gabe Goodhart	a8bca68f72	fix: Compute the full sum in llama-eval-callback, not just the sum of printed values (#15637 ) This makes it much easier to compare between llama.cpp and transformers! https://github.com/ggml-org/llama.cpp/issues/nemotron-nano-15409 Branch: gabe-l-hart/nvidia-nemotron-nano-15409 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>	2025-08-28 15:27:36 -05:00
Daniel Bevenius	46d9caa27a	model-conversion : add mmproj conversion target (#15628 ) This commit adds a new target to the Makefile for converting models that are multimodal. This target will convert the original model and in addition also create the mmproj GGUF model. The motivation for this change is that for models that are multimodal, for example those that contain a vision encoders, we will often want to upload both the quantized model and the vision encoder model to HuggingFace. Example usage: ```console $ make causal-convert-mm-model MODEL_PATH=~/work/ai/models/gemma-3-4b-it-qat-q4_0-unquantized/ ... The environment variable CONVERTED_MODEL can be set to this path using: export CONVERTED_MODEL=/home/danbev/work/ai/llama.cpp/models/gemma-3-4b-it-qat-q4_0-unquantized.gguf The mmproj model was created in /home/danbev/work/ai/llama.cpp/models/mmproj-gemma-3-4b-it-qat-q4_0-unquantized.gguf ``` The converted original model can then be quantized, and after that both the quantized model and the mmproj file can then be uploaded to HuggingFace. Refs: https://huggingface.co/ggml-org/gemma-3-4b-it-qat-GGUF/tree/main	2025-08-28 09:26:48 +02:00
Daniel Bevenius	62cef26ac5	model-conversion : add qat-q4 quantization targets (#15588 ) This commit adds two targets to the Makefile for quantizing of Quantization Aware Trained (QAT) models to Q4_0 format. The motivation for this is that this sets the token embedding and the output tensors data types to Q8_0 instead of the default Q6_K. This is someting that we wish to enforce for QAT Q4_0 models that are to be uploaded to ggml-org on Huggingface to guarantee the best quality.	2025-08-26 16:12:29 +02:00
Daniel Bevenius	dfd9b5f6c7	model-conversion : set pooling type to none in logits.cpp (#15564 ) This commit explicitly sets the pooling type to 'none' in the logits.cpp to support models that have a pooling type specified. The motivation for this is that some models may have a pooling type set in the model file (.gguf file) and for this specific case where we only want to extract logits, we need to ensure that no pooling is used to so that we are comparing raw logits and not pooled embeddings.	2025-08-25 15:00:43 +02:00

1 2 3 4 5 ...

1566 Commits