Yee Man Chan
8bd617eb1c
set n_embd_head_k/v to ensure kv cache works
2026-01-03 08:26:41 +08:00
Yee Man Chan
f85e5c73b9
Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
2026-01-02 21:20:34 +08:00
Yee Man Chan
f67a42d572
reduce OP count by 1 due to removal of kda_scan
2025-12-19 07:37:33 +08:00
Yee Man Chan
776294c04e
removed all traces of kda_scan
2025-12-19 07:36:06 +08:00
Yee Man Chan
f9a11d7758
rewrite get_vocab for KimiLinear. Removed all kda_scan code
2025-12-18 20:46:10 +08:00
Yee Man Chan
ae9771d1dc
removed unnecessary internal methods called by the old set_vocab of KimiLinear
2025-12-18 08:14:15 +08:00
Yee Man Chan
ef5bc30544
use DeepseekV2 tokenizer
2025-12-14 17:43:30 +08:00
Yee Man Chan
a0269af292
removed all hard code
2025-12-06 11:51:16 +08:00
Yee Man Chan
9f1265fec1
removed some hard coded code
2025-12-05 19:51:02 +08:00
Yee Man Chan
772ca88070
read MoE params
2025-12-02 20:16:24 +08:00
Yee Man Chan
83d328d0d3
remove type mismatch warning
2025-12-02 14:09:02 +08:00
Yee Man Chan
139548d070
remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
2025-12-02 12:11:15 +08:00
Yee Man Chan
e308026f64
kimi linear src/llama
2025-12-02 12:02:35 +08:00
Yee Man Chan
d73d3e51a5
Kimi Linear ggml.c
2025-12-02 11:27:57 +08:00
Yee Man Chan
bf42bc0606
Kimi Linear ggml-cuda
2025-12-02 11:24:37 +08:00
Yee Man Chan
26a6553155
kimi linear ggml-cpu
2025-12-02 11:20:46 +08:00
Yee Man Chan
6167f39e08
Kimi Linear ggml.h
2025-12-02 11:14:34 +08:00
Yee Man Chan
57cca52779
kimi linear constants.py tensor_mapping.py
2025-12-02 10:40:44 +08:00
Yee Man Chan
84f822c5a5
kimi linear convert_hf_to_gguf
2025-12-02 08:51:09 +08:00
Yee Man Chan
27baad43d5
kimi linear model implementation
2025-12-02 08:35:14 +08:00
Xuan-Son Nguyen
7733409734
common: improve verbosity level definitions ( #17630 )
* common: improve verbosity level definitions
* string_format
* update autogen docs
2025-12-01 14:38:13 +01:00
Xuan-Son Nguyen
cd3c118908
model: support Ministral3 ( #17644 )
* conversion script
* support ministral 3
* maybe this is better?
* add TODO for rope_yarn_log_mul
* better ppl (tested on 14B-Instruct)
* Add Ministral3 support to Mistral format
* improve arch handling
* add sizes
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* nits
---------
Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Georgi Gerganov
649495c9d9
metal : add FA head size 48 ( #17619 )
2025-12-01 12:49:53 +02:00
Georgi Gerganov
90c72a614a
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler ( #17617 )
2025-12-01 12:49:33 +02:00
Aman Gupta
6eea666912
llama-graph: avoid expand_forward for fusion ( #17633 )
2025-12-01 11:12:48 +02:00
Xuan-Son Nguyen
ff90508d68
contributing: update guidelines for AI-generated code ( #17625 )
* contributing: update guidelines for AI-generated code
* revise
2025-11-30 22:51:34 +01:00
Adrien Gallouët
0a4aeb927d
cmake : add option to build and link LibreSSL ( #17552 )
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 22:14:32 +01:00
Tarek Dakhran
2ba719519d
model: LFM2-VL fixes ( #17577 )
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
2025-11-30 21:57:31 +01:00
Xuan-Son Nguyen
7f8ef50cce
clip: fix nb calculation for qwen3-vl ( #17594 )
2025-11-30 15:33:55 +01:00
Xuan-Son Nguyen
3c136b21a3
cli: add migration warning ( #17620 )
2025-11-30 15:32:43 +01:00
Adrien Gallouët
beb1f0c503
common : throttle download progress output to reduce IO flush ( #17427 )
This change limits progress updates to approximately every 0.1% of the
file size to minimize stdio overhead.
Also fixes compiler warnings regarding __func__ in lambdas.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 14:22:44 +02:00
Aaron Teo
def5404f26
common: add LLAMA_LOG_FILE env var ( #17609 )
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-11-30 12:12:32 +01:00
Gilad S.
fa0465954f
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` ( #17581 )
2025-11-30 10:00:59 +08:00
ddh0
5a6241feb0
common: update env var name ( #17588 )
2025-11-30 09:59:25 +08:00
Aman Gupta
c7af376c29
CUDA: add stream-based concurrency ( #16991 )
* CUDA: add stream-based concurrency
* HIP: fix hipStreamWaitEvent define and nodiscard warnings
* ggml-cuda: fix fusion inside stream
* ggml-cuda: fix bug w.r.t first stream launch
* ggml-cuda: format
* ggml-cuda: improve assert message
* ggml-cuda: use lambda instead of duplicating code
* ggml-cuda: add some more comments
* ggml-cuda: add more detailed comments about concurrency
* ggml-cuda: rename + remove unused var
* ggml-cuda: fix condition for stream launch
* ggml-cuda: address review comments, add destructor
* common.cuh: add is_valid for concurrent events
* common.cuh: make comment better
* update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* common.cuh: fix lower_bound condition + remove join_node data from write_ranges
* ggml-cuda: fix overlap condition + shadowing parameter
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-30 08:17:55 +08:00
Mahekk Shaikh
00425e2ed1
cuda : add error checking for cudaMemcpyAsync in argsort ( #17599 )
* cuda : add error checking for cudaMemcpyAsync in argsort (#12836 )
* fix indentation
2025-11-30 08:16:28 +08:00
Acly
385c3da5e6
vulkan : fix FA mask load with bounds check (coopmat2) ( #17606 )
2025-11-30 01:03:21 +01:00
Xuan-Son Nguyen
ab49f094d2
server: move server-context to its own cpp|h ( #17595 )
* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-29 22:04:44 +01:00
Haiyue Wang
8c32d9d96d
server: explicitly set the function name in lambda ( #17538 )
As [1] explained, the real debug message will be like:
"res operator(): operator() : queue result stop"
Set the name explicitly, the message is easy for debugging:
"res operator(): recv : queue result stop"
The left "operator()" is generated by 'RES_DBG() ... __func__'
[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
2025-11-29 18:43:29 +01:00
Igor Smirnov
0874693b44
common : fix json schema with '\' in literals ( #17307 )
* Fix json schema with '\' in literals
* Add "literal string with escapes" test
2025-11-29 17:06:32 +01:00
Neo Zhang
7d2add51d8
sycl : support to malloc memory on device more than 4GB, update the doc and script ( #17566 )
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-11-29 14:59:44 +02:00
ixgbe
f698a79c63
ggml: replace hwcap with riscv_hwprobe for RVV detection ( #17567 )
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-29 14:56:31 +02:00
Ruben Ortlam
47a268ea50
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support ( #16900 )
* vulkan: split mul_mmq_funcs for mul_mat_vecq use
* add mxfp4 mmvq
* add q2_k mmvq
* add q3_k mmvq
* add q4_k and q5_k mmvq
* add q6_k mmvq
* handle 4x4 quants per mmvq thread
* enable MUL_MAT_ID mmvq support
* enable subgroup optimizations for mul_mat_vec_id shaders
* device tuning
* request prealloc_y sync after quantization
* fix indentation
* fix llvmpipe test failures
* fix mul_mat_id mmvq condition
* fix unused variable warning
2025-11-29 09:37:22 +01:00
Jeff Bolz
59d8d4e963
vulkan: improve topk perf for large k, fix overflow in unit tests ( #17582 )
2025-11-29 08:39:57 +01:00
Aleksei Nikiforov
d82b7a7c1d
gguf-py : fix passing non-native endian tensors (editor-gui and new-metadata) ( #17553 )
gguf_new_metadata.py reads data from reader.
Reader doesn't byteswap tensors to native endianness.
But writer does expect tensors in native endianness to convert them
into requested endianness.
There are two ways to fix this: update reader and do conversion to native endianness and back,
or skip converting endianness in writer in this particular USE-case.
gguf_editor_gui.py doesn't allow editing or viewing tensor data.
Let's go with skipping excessive byteswapping.
If eventually capability to view or edit tensor data is added,
tensor data should be instead byteswapped when reading it.
2025-11-28 20:53:01 +01:00
DAN™
03914c7ef8
common : move all common_chat_parse_* to chat-parser.cpp. ( #17481 )
2025-11-28 19:29:36 +01:00
o7si
3ce7a65c2f
server: fix: /metrics endpoint returning JSON-escaped Prometheus format ( #17386 )
* fix: /metrics endpoint returning JSON-escaped Prometheus format
* mod: remove string overload from ok() method
2025-11-28 19:14:00 +01:00
Diego Devesa
e072b2052e
ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched ( #17276 )
* ggml : add GGML_SCHED_NO_REALLOC option to disable reallocations in ggml_backend_sched
Enabled in ggml-ci for testing.
* llama : update worst-case graph for unified cache
* ci : disable op offload in some tests
* fix spelling
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-28 17:33:23 +02:00
R0CKSTAR
c6f7a423c8
[MUSA] enable fp16/fast_fp16/bf16_mma on PH1 ( #17551 )
* [MUSA] enable fp16/fast_fp16/bf16_mma on PH1
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
* Update ggml/src/ggml-cuda/fattn-vec.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/fattn-vec.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Update ggml/src/ggml-cuda/fattn-tile.cuh
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* Address review comments
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
---------
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-28 14:08:29 +01:00
Aman Gupta
2e7ef98f18
ggml-cuda: add stricter checking for fusion ( #17568 )
* ggml-cuda: make conditions for fusion more explicit
* ggml-cuda: remove size check as std::equal already does it
2025-11-28 20:34:51 +08:00