Francis Couture-Harpin
7c3f9c226f
Merge branch 'master' into compilade/test-model-random
2025-06-26 17:23:16 -04:00
Georgi Gerganov
e8215dbb96
metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
2025-06-26 15:51:19 +03:00
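Not the Metal kernel itself, but a scalar C++ sketch of the case being specialized: with ne00 == 4, each output element is a dot product of exactly four values, so the inner loop can be fully unrolled with no remainder handling (the Metal version would presumably use vectorized loads; this reference only shows the shape of the computation).

```cpp
#include <cstddef>
#include <vector>

// Scalar sketch of a mat-vec product specialized for ne00 == 4:
// each row of A has exactly 4 elements, so the dot product unrolls
// into a single expression (no inner loop, no remainder handling).
void mat_vec_ne00_4(const std::vector<float> & A, // n_rows x 4, row-major
                    const float x[4],
                    std::vector<float> & y,       // n_rows
                    size_t n_rows) {
    for (size_t r = 0; r < n_rows; ++r) {
        const float * a = &A[r * 4];
        y[r] = a[0]*x[0] + a[1]*x[1] + a[2]*x[2] + a[3]*x[3];
    }
}
```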
Aman Gupta
aa064b2eb7
CUDA: add mean operation (#14313)
* CUDA: add mean operation
* add back sum_rows_f32_cuda
* Review: early exit if col != 0
2025-06-22 12:39:54 +08:00
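The CUDA kernel isn't reproduced here; as a plain C++ reference for what the op computes, a row-wise mean is just sum_rows scaled by 1/ne00 (which fits with sum_rows_f32_cuda being added back):

```cpp
#include <cstddef>
#include <vector>

// Reference semantics for a row-wise mean op: mean = sum_rows / ne00.
std::vector<float> mean_rows(const std::vector<float> & src, size_t ne00, size_t n_rows) {
    std::vector<float> dst(n_rows);
    for (size_t r = 0; r < n_rows; ++r) {
        float sum = 0.0f;
        for (size_t c = 0; c < ne00; ++c) {
            sum += src[r * ne00 + c];
        }
        dst[r] = sum / (float) ne00;
    }
    return dst;
}
```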
Aman Gupta
c959f462a0
CUDA: add conv_2d_transpose (#14287)
* CUDA: add conv_2d_transpose
* remove direct include of cuda_fp16
* Review: add brackets for readability, remove ggml_set_param and add asserts
2025-06-20 22:48:24 +08:00
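For the op's semantics (a reference sketch, not the CUDA implementation): transposed convolution scatters each input element through the kernel into the output, so with stride s the single-channel, unpadded output is ((H-1)*s + KH) x ((W-1)*s + KW):

```cpp
#include <cstddef>
#include <vector>

// Naive single-channel transposed 2D convolution (no padding), scatter form.
std::vector<float> conv_2d_transpose_ref(const std::vector<float> & in, size_t H, size_t W,
                                         const std::vector<float> & k,  size_t KH, size_t KW,
                                         size_t stride) {
    const size_t OH = (H - 1) * stride + KH;
    const size_t OW = (W - 1) * stride + KW;
    std::vector<float> out(OH * OW, 0.0f);
    for (size_t ih = 0; ih < H; ++ih)
    for (size_t iw = 0; iw < W; ++iw)
    for (size_t kh = 0; kh < KH; ++kh)
    for (size_t kw = 0; kw < KW; ++kw)
        // each input element is scattered through the kernel into the output
        out[(ih * stride + kh) * OW + (iw * stride + kw)] += in[ih * W + iw] * k[kh * KW + kw];
    return out;
}
```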
Francis Couture-Harpin
ccb2bb9988
test-model-random : show max error
2025-06-18 15:11:23 -04:00
Francis Couture-Harpin
9d873d7543
test-model-random : shuffle across sequences but not within
There isn't really a use-case for fully shuffled batches.
* test-model-random : use F32 as the KV cache type
Temporary until F16 is fixed on ARM when using FP16_VECTOR_ARITHMETIC
2025-06-18 15:07:24 -04:00
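One way to realize "shuffle across sequences but not within" (a hypothetical helper, not the test's actual code) is to repeatedly pop the next token from a randomly chosen non-empty sequence: the interleaving is randomized while each sequence's internal token order is preserved.

```cpp
#include <deque>
#include <random>
#include <utility>
#include <vector>

// Shuffle tokens across sequences but never within a sequence:
// pick a random non-empty sequence and take its next token, so each
// sequence's internal order is preserved in the resulting batch.
std::vector<std::pair<int, int>> // (seq_id, token)
shuffle_across_sequences(std::vector<std::deque<int>> seqs, std::mt19937 & rng) {
    std::vector<std::pair<int, int>> batch;
    std::vector<size_t> alive(seqs.size());
    for (size_t i = 0; i < alive.size(); ++i) alive[i] = i;
    while (!alive.empty()) {
        std::uniform_int_distribution<size_t> pick(0, alive.size() - 1);
        const size_t j = pick(rng);
        const size_t s = alive[j];
        batch.emplace_back((int) s, seqs[s].front());
        seqs[s].pop_front();
        if (seqs[s].empty()) { // swap-remove exhausted sequences
            alive[j] = alive.back();
            alive.pop_back();
        }
    }
    return batch;
}
```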
Francis Couture-Harpin
04b8f5143d
Merge branch 'master' into compilade/test-model-random
2025-06-16 21:45:48 -04:00
Francis Couture-Harpin
352703b08b
test-model-random : better default tensor initialization distribution
2025-06-16 21:37:45 -04:00
Diego Devesa
6adc3c3ebc
llama : add thread safety test (#14035)
* llama : add thread safety test
* llamafile : remove global state
* llama : better LLAMA_SPLIT_MODE_NONE logic
when main_gpu < 0, GPU devices are not used
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 08:11:43 -07:00
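The main_gpu < 0 rule reduces to a small device-selection branch; an illustrative sketch (names are made up here, not llama.cpp internals):

```cpp
#include <vector>

// Illustrative device selection for a "no split" mode:
// main_gpu >= 0 selects that single GPU; main_gpu < 0 means CPU only.
std::vector<int> select_devices_split_none(int main_gpu, int n_gpus) {
    std::vector<int> devices;
    if (main_gpu >= 0 && main_gpu < n_gpus) {
        devices.push_back(main_gpu); // single GPU holds the whole model
    }
    return devices; // empty => no GPU devices are used
}
```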
Francis Couture-Harpin
dfa3c18266
tests : add LLAMA, LLAMA4, and GEMMA2 to test-model-random
2025-06-13 20:02:47 -04:00
Francis Couture-Harpin
8fe213af76
tests : avoid sprintf in test-model-random
2025-06-12 02:48:11 -04:00
Francis Couture-Harpin
7657835b33
tests : fix overflow and memory leaks in test-model-random
* tests : fix integer types in test-model-random
2025-06-12 02:41:36 -04:00
Francis Couture-Harpin
9cd402cbe1
tests : add test-model-random
This generates random models and then tests different concurrencies
of batches to check if the output is consistent.
This can detect when e.g. the recurrent cache has been broken,
or anything else which would affect the consistency of the output
when running inference on multiple distinct sequences.
More architectures will be added, but for now this starts with Mamba.
Eventually, consistency of pooled embeddings will also be tested.
The goal is to reduce accidental regressions
by making it easy to quickly test a lot of edge cases
on the supported architectures,
without having to download any model.
2025-06-12 01:00:57 -04:00
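In spirit, the consistency check resembles this toy harness (with a trivial stand-in for a stateful model, not the test's real random models): compute reference outputs with each sequence evaluated alone, re-evaluate with the sequences interleaved into a shared batch, and track the max error, which should be zero unless per-sequence state leaks.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Toy stand-in for a stateful model: one recurrent state per sequence.
static float model_step(float & state, float x) {
    state = 0.9f * state + x;
    return state;
}

int main() {
    const int n_seq = 3, n_tok = 8;
    // reference: each sequence evaluated entirely on its own
    std::vector<std::vector<float>> ref(n_seq, std::vector<float>(n_tok));
    for (int s = 0; s < n_seq; ++s) {
        float state = 0.0f;
        for (int t = 0; t < n_tok; ++t) ref[s][t] = model_step(state, (float)(s + t));
    }
    // interleaved: all sequences advance together, one token per step, each
    // with its own state slot (this is what a per-sequence cache must get right)
    std::vector<float> state(n_seq, 0.0f);
    float max_err = 0.0f;
    for (int t = 0; t < n_tok; ++t)
        for (int s = 0; s < n_seq; ++s)
            max_err = std::fmax(max_err, std::fabs(model_step(state[s], (float)(s + t)) - ref[s][t]));
    std::printf("max error: %g\n", max_err); // non-zero would indicate state mixing
    return max_err > 1e-6f;
}
```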
Sigbjørn Skjæret
d4e0d95cf5
chore : clean up relative source dir paths (#14128)
2025-06-11 19:04:23 +02:00
Sigbjørn Skjæret
cc66a7f78f
tests : add test-tokenizers-repo (#14017)
2025-06-11 17:16:32 +02:00
Diego Devesa
7f4fbe5183
llama : allow building all tests on windows when not using shared libs (#13980)
* llama : allow building all tests on windows when not using shared libraries
* add static windows build to ci
* tests : enable debug logs for test-chat
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-09 20:03:09 +02:00
Ervin Áron Tasnádi
0d3984424f
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* ggml-vulkan: adds op CONV_TRANSPOSE_1D
* test-backend-ops: adds more sophisticated tests for CONV_TRANSPOSE_1D
* Missing barrier added to shader.
Number of additional tests reduced to 108.
* Fixes typo in variable name.
* Removes extra whitespaces.
* Adds int64->int32 casts to prevent possible warnings.
* Problem size reduced in tests to pass tests with llvmpipe.
* supports_op condition moved from unintended position
2025-06-04 22:02:00 +02:00
Olivier Chafik
c9bbc77931
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Johannes Gäßler
7675c555a1
gguf: fix failure on version == 0 (#13956)
2025-06-01 18:08:05 +02:00
Olivier Chafik
e15898d1c7
server: allow unclosed thinking tags (#13931)
2025-05-31 08:26:10 -07:00
Georgi Gerganov
53f925074d
sync : vendor (#13901)
* sync : vendor
ggml-ci
* cont : fix httplib version
ggml-ci
* cont : fix lint
* cont : fix lint
* vendor : move to common folder /vendor
ggml-ci
* cont : fix lint
* cont : move httplib to /vendor + use json_fwd.hpp
ggml-ci
* cont : fix server build
ggml-ci
* cont : add missing headers
ggml-ci
* cont : header clean-up
ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen
07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list
* code style
* update tokenizer out
* rm inp/out files for models not having gguf
* fixed hash for glm
* skip nomic-bert-moe test
* Update convert_hf_to_gguf_update.py
* fix minerva-7b hash
* rm redundant import
2025-05-30 12:24:37 +02:00
Georgi Gerganov
66c92061f5
tests : remove json.hpp from a test (#13880)
ggml-ci
2025-05-29 12:17:16 +03:00
Georgi Gerganov
f9cd68398b
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token
ggml-ci
* sampling : same for typical sampling
* tests : sampling tests use min_keep == 0
ggml-ci
2025-05-27 12:07:52 +03:00
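A sketch of the guarantee (not the actual sampler code): min-p keeps candidates with p >= min_p * p_max, and the fix ensures the cutoff never drops below one surviving token even when min_keep == 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// min-p filtering over probabilities sorted in descending order: keep tokens
// with p >= min_p * p_max, but always keep at least max(1, min_keep) tokens.
void min_p_filter(std::vector<float> & probs /* sorted desc */, float min_p, size_t min_keep) {
    if (probs.empty()) return;
    const float threshold = min_p * probs.front();
    size_t keep = probs.size();
    for (size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] < threshold) { keep = i; break; }
    }
    keep = std::max(keep, std::max<size_t>(1, min_keep)); // never return 0 tokens
    keep = std::min(keep, probs.size());
    probs.resize(keep);
}
```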
Olivier Chafik
03f582ae8f
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match
* allow all parsers to parse non-tool-call content.
* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. Still not ideal, but hopefully less prone to crashing
2025-05-26 16:03:57 +01:00
Olivier Chafik
d74e94c1b3
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name
* fix tool_call.id (was in tool_call.function.id!) + add function type
* add tool_call.type
* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik
e121edc432
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Sigbjørn Skjæret
aa50ba462f
tests : improve UGM tokenizer test coverage (#13773)
2025-05-25 16:22:29 +02:00
Olivier Chafik
f5cd27b71d
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing
* add common_chat_msg_diff
* partial common_chat_parse
* refactor parser w/ optionals
* server: wire chat diffs in stream mode
* fix trigger of thinking models (must happen after thoughts are closed)
* fix functionary v3.2 raw python!
* rename: common_chat_syntax (now contains format)
* rm common_regex.at_start
* don't return empty <think></think>
* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)
* fix QwQ 32B tool call parsing after thoughts (hermes2)
* better logs for grammar triggers
* consume spaces after parse_json_tool_calls
* fix required tool calls w/ thinking models that have pre-opened thinking tags
* fix thinking model's initial trigger + test qwq's template
* run most test_tool_call tests in stream + non-stream modes
* make functionary v3.2 parsing more strict (differentiate first match from others)
* send final diff from server, to close off raw python arguments
* support partial content streaming in Generic mode
* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)
* Update function-calling.md
* Update tool_bench.py
* chat-parser: remove input from exception (llm output may contain PII)
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
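At its core, streaming a diff means comparing the message parsed from the accumulated output so far against what was already sent and emitting only the new suffix; a minimal sketch, assuming content grows monotonically (which the truncated-JSON healing above is there to preserve):

```cpp
#include <string>

// Compute the delta to stream: if the newly parsed content extends what
// was already sent, emit only the new suffix; otherwise resend in full.
std::string content_delta(const std::string & sent, const std::string & parsed) {
    if (parsed.size() >= sent.size() &&
        parsed.compare(0, sent.size(), sent) == 0) {
        return parsed.substr(sent.size());
    }
    return parsed; // fallback; shouldn't happen if parsing is monotonic
}
```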
Sigbjørn Skjæret
759e37b0d8
tests : avoid github urls due to throttling (#13654)
2025-05-20 12:03:17 +02:00
Olivier Chafik
aa48e373f2
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2
https://github.com/ggml-org/llama.cpp/issues/12729
* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
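Injecting date_string presumably means formatting today's date the way llama 3.x templates expect (e.g. "26 Jul 2024"); a small sketch, with the exact format string being an assumption here:

```cpp
#include <ctime>
#include <string>

// Produce a date_string like "26 Jul 2024" for template injection.
// The "%d %b %Y" format is an assumption, not taken from the commit.
std::string today_date_string() {
    std::time_t t = std::time(nullptr);
    char buf[32];
    std::strftime(buf, sizeof(buf), "%d %b %Y", std::localtime(&t));
    return buf;
}
```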
Olivier Chafik
3198405e98
`common`: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common
* add common_regex (supports partial matches)
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/regex-partial.h
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* partial regex: add missing iterator end checks
* string utils: use string_views
* direct throw to avoid ggml.h include
* regex-partial: replace missed ggml_asserts
---------
Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
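Partial matching matters for streaming: a stop string (or regex match) may be cut off at the end of the current chunk. A sketch in the flavor of string_find_partial_stop (not the actual common/ implementation): find where a suffix of the text could be the beginning of the stop string, so the caller can hold those characters back.

```cpp
#include <algorithm>
#include <string>

// Return the index where a (possibly partial) occurrence of `stop` begins
// at the end of `text`, or npos if no suffix of `text` is a prefix of `stop`.
size_t find_partial_stop(const std::string & text, const std::string & stop) {
    const size_t max_len = std::min(text.size(), stop.size());
    for (size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len; // longest suffix wins
        }
    }
    return std::string::npos;
}
```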
Johannes Gäßler
10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support
more compact progress bar
llama_save_model_to_file
llama_opt_param_filter
ggml_graph_dup force_grads
refactor ggml_opt, fix test-opt
* remove logits_all
* refactor CUDA implementation for ACC
* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang
7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386)
2025-05-11 14:18:39 +02:00
DocShotgun
ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345)
2025-05-06 22:36:24 +02:00
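Top-n-sigma keeps candidates whose logit is within n standard deviations of the maximum logit; the fix makes the sampler a no-op for n <= 0 or a single candidate. A hedged sketch of the idea (the real sampler presumably masks logits rather than erasing entries):

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// top-n-sigma: keep logits >= max - n * stddev(logits).
// No-op when n <= 0 or there is at most one candidate (the fix).
void top_n_sigma(std::vector<float> & logits, float n) {
    if (n <= 0.0f || logits.size() <= 1) return;
    const float max_l = *std::max_element(logits.begin(), logits.end());
    float mean = 0.0f;
    for (float l : logits) mean += l;
    mean /= (float) logits.size();
    float var = 0.0f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / (float) logits.size());
    const float thr = max_l - n * sigma;
    logits.erase(std::remove_if(logits.begin(), logits.end(),
                                [thr](float l) { return l < thr; }),
                 logits.end());
}
```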
Xuan-Son Nguyen
27aa259532
mtmd : add C public API (#13184)
* init
* wip
* working version
* add mtmd::bitmaps
* add test target
* rm redundant define
* test: mtmd_input_chunks_free
* rm outdated comment
* fix merging issue
* explicitly create mtmd::input_chunks
* mtmd_input_chunk_copy
* add clone()
* add const to various places
* add warning about breaking changes
* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
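The pattern here, a C public API over opaque handles with thin C++ RAII helpers like mtmd::bitmaps layered on top, looks roughly like this (all names and signatures below are illustrative stand-ins, not mtmd's real API):

```cpp
#include <memory>

// C-style public API with an opaque handle (illustrative stand-in).
extern "C" {
    typedef struct mtmd_bitmap mtmd_bitmap;
    struct mtmd_bitmap { int w, h; };
    inline mtmd_bitmap * mtmd_bitmap_init(int w, int h) { return new mtmd_bitmap{w, h}; }
    inline void          mtmd_bitmap_free(mtmd_bitmap * p) { delete p; }
}

// Thin C++ RAII wrapper in the style of mtmd::bitmaps: the handle is
// freed automatically, while the C API remains the stable ABI boundary.
namespace mtmd {
struct bitmap {
    struct deleter { void operator()(mtmd_bitmap * p) const { mtmd_bitmap_free(p); } };
    std::unique_ptr<mtmd_bitmap, deleter> ptr;
    bitmap(int w, int h) : ptr(mtmd_bitmap_init(w, h)) {}
};
} // namespace mtmd
```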
Diego Devesa
9f2da5871f
llama : build windows releases with dl backends (#13220)
2025-05-04 14:20:49 +02:00
Diego Devesa
1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Georgi Gerganov
b34443923c
sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)
* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)
* review: remove src_x/y < 0 checks; add performance tests
* sync : ggml
ggml-ci
* vulkan : fix lint (#0)
---------
Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
piDack
2af6880178
llama-chat : reset glmedge chat template (#13253)
* reset glmedge chat template
* fix glmedge chat template
2025-05-02 11:06:09 +02:00
matteo
e0f572c846
llama-chat : update GLM4 chat template (#13238)
* update GLM4 chat template
* Update chat template
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-01 21:16:38 +02:00
Johannes Gäßler
b0ecbd434b
test: non-cont. b in test-backend-ops -o MUL_MAT (#13187)
2025-05-01 20:18:56 +02:00
Johannes Gäßler
e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199)
2025-04-30 23:12:59 +02:00
Xuan-Son Nguyen
da84c04d8f
docker : do not build tests (#13204)
* docker : do not build tests
* include "ggml-cpu.h"
2025-04-30 10:44:07 +02:00
Xuan-Son Nguyen
4e87962e34
mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count
* fix chat template
* temporarily disable GLMEdge test chat tmpl
2025-04-28 16:12:56 +02:00
Xuan-Son Nguyen
2d451c8059
common : add common_remote_get_content (#13123)
* common : add common_remote_get_content
* support max size and timeout
* add tests
2025-04-26 22:58:12 +02:00
frob
d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-04-26 10:10:20 +02:00
Xuan-Son Nguyen
edb18b6e8f
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends
* refactor inp_raw set
* rm outdated comment
* fix dynamic size
* add TODO
2025-04-25 14:31:42 +02:00
Georgi Gerganov
13b4548877
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama
ggml-ci
* cmake : rework tests
ggml-ci
* llguidance : remove unicode include
ggml-ci
* cmake : make c++17 private
ggml-ci
2025-04-24 16:00:10 +03:00
Johannes Gäßler
658987cfc9
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID
* fix logic for RoPE support, CUDA graphs
2025-04-22 21:27:40 +02:00