Yee Man Chan
67bee56013
add Kimi-K2 specific tokens to be recognized as EOG
2026-01-06 21:15:12 +08:00
Yee Man Chan
e3542ff8a2
fixed some comments
2026-01-06 11:35:25 +08:00
Yee Man Chan
cfed14e31b
implemented naive chunking
2026-01-06 11:23:53 +08:00
Yee Man Chan
aba181ebad
removed LOG_INFO
2026-01-05 19:21:06 +08:00
Yee Man Chan
66c0c5d8d4
make Kimi Linear backend-agnostic
2026-01-05 16:35:19 +08:00
Yee Man Chan
a4020d867f
don't quantize conv1d of Kimi Linear
2026-01-03 08:27:29 +08:00
Yee Man Chan
8bd617eb1c
set n_embd_head_k/v to ensure kv cache works
2026-01-03 08:26:41 +08:00
Yee Man Chan
f85e5c73b9
Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache
2026-01-02 21:20:34 +08:00
Yee Man Chan
f67a42d572
reduce OP count by 1 due to removal of kda_scan
2025-12-19 07:37:33 +08:00
Yee Man Chan
776294c04e
removed all traces of kda_scan
2025-12-19 07:36:06 +08:00
Yee Man Chan
f9a11d7758
rewrite get_vocab for KimiLinear; remove all kda_scan code
2025-12-18 20:46:10 +08:00
Yee Man Chan
ae9771d1dc
removed unnecessary internal methods called by the old set_vocab of KimiLinear
2025-12-18 08:14:15 +08:00
Yee Man Chan
ef5bc30544
use DeepseekV2 tokenizer
2025-12-14 17:43:30 +08:00
Yee Man Chan
a0269af292
removed all hard-coded values
2025-12-06 11:51:16 +08:00
Yee Man Chan
9f1265fec1
removed some hard-coded code
2025-12-05 19:51:02 +08:00
Yee Man Chan
772ca88070
read MoE params
2025-12-02 20:16:24 +08:00
Yee Man Chan
83d328d0d3
remove type mismatch warning
2025-12-02 14:09:02 +08:00
Yee Man Chan
139548d070
remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning
2025-12-02 12:11:15 +08:00
Yee Man Chan
e308026f64
kimi linear src/llama
2025-12-02 12:02:35 +08:00
Yee Man Chan
d73d3e51a5
Kimi Linear ggml.c
2025-12-02 11:27:57 +08:00
Yee Man Chan
bf42bc0606
Kimi Linear ggml-cuda
2025-12-02 11:24:37 +08:00
Yee Man Chan
26a6553155
kimi linear ggml-cpu
2025-12-02 11:20:46 +08:00
Yee Man Chan
6167f39e08
Kimi Linear ggml.h
2025-12-02 11:14:34 +08:00
Yee Man Chan
57cca52779
kimi linear constants.py tensor_mapping.py
2025-12-02 10:40:44 +08:00
Yee Man Chan
84f822c5a5
kimi linear convert_hf_to_gguf
2025-12-02 08:51:09 +08:00
Yee Man Chan
27baad43d5
kimi linear model implementation
2025-12-02 08:35:14 +08:00
Xuan-Son Nguyen
7733409734
common: improve verbosity level definitions ( #17630 )
* common: improve verbosity level definitions
* string_format
* update autogen docs
2025-12-01 14:38:13 +01:00
Xuan-Son Nguyen
cd3c118908
model: support Ministral3 ( #17644 )
* conversion script
* support ministral 3
* maybe this is better?
* add TODO for rope_yarn_log_mul
* better ppl (tested on 14B-Instruct)
* Add Ministral3 support to Mistral format
* improve arch handling
* add sizes
* Apply suggestions from code review
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* nits
---------
Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Georgi Gerganov
649495c9d9
metal : add FA head size 48 ( #17619 )
2025-12-01 12:49:53 +02:00
Georgi Gerganov
90c72a614a
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler ( #17617 )
2025-12-01 12:49:33 +02:00
Aman Gupta
6eea666912
llama-graph: avoid expand_forward for fusion ( #17633 )
2025-12-01 11:12:48 +02:00
Xuan-Son Nguyen
ff90508d68
contributing: update guidelines for AI-generated code ( #17625 )
* contributing: update guidelines for AI-generated code
* revise
2025-11-30 22:51:34 +01:00
Adrien Gallouët
0a4aeb927d
cmake : add option to build and link LibreSSL ( #17552 )
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 22:14:32 +01:00
Tarek Dakhran
2ba719519d
model: LFM2-VL fixes ( #17577 )
* Adjust to pytorch
* Add antialiasing upscale
* Increase number of patches to 1024
* Handle default marker insertion for LFM2
* Switch to flag
* Reformat
* Cuda implementation of antialias kernel
* Change placement in ops.cpp
* consistent float literals
* Pad only for LFM2
* Address PR feedback
* Rollback default marker placement changes
* Fallback to CPU implementation for antialias implementation of upscale
2025-11-30 21:57:31 +01:00
Xuan-Son Nguyen
7f8ef50cce
clip: fix nb calculation for qwen3-vl ( #17594 )
2025-11-30 15:33:55 +01:00
Xuan-Son Nguyen
3c136b21a3
cli: add migration warning ( #17620 )
2025-11-30 15:32:43 +01:00
Adrien Gallouët
beb1f0c503
common : throttle download progress output to reduce IO flush ( #17427 )
This change limits progress updates to approximately every 0.1% of the
file size to minimize stdio overhead.
Also fixes compiler warnings regarding __func__ in lambdas.
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 14:22:44 +02:00
Aaron Teo
def5404f26
common: add LLAMA_LOG_FILE env var ( #17609 )
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-11-30 12:12:32 +01:00
Gilad S.
fa0465954f
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` ( #17581 )
2025-11-30 10:00:59 +08:00
ddh0
5a6241feb0
common: update env var name ( #17588 )
2025-11-30 09:59:25 +08:00
Aman Gupta
c7af376c29
CUDA: add stream-based concurrency ( #16991 )
* CUDA: add stream-based concurrency
* HIP: fix hipStreamWaitEvent define and nodiscard warnings
* ggml-cuda: fix fusion inside stream
* ggml-cuda: fix bug w.r.t first stream launch
* ggml-cuda: format
* ggml-cuda: improve assert message
* ggml-cuda: use lambda instead of duplicating code
* ggml-cuda: add some more comments
* ggml-cuda: add more detailed comments about concurrency
* ggml-cuda: rename + remove unused var
* ggml-cuda: fix condition for stream launch
* ggml-cuda: address review comments, add destructor
* common.cuh: add is_valid for concurrent events
* common.cuh: make comment better
* update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* update comment
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
* common.cuh: fix lower_bound condition + remove join_node data from write_ranges
* ggml-cuda: fix overlap condition + shadowing parameter
---------
Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-30 08:17:55 +08:00
Mahekk Shaikh
00425e2ed1
cuda : add error checking for cudaMemcpyAsync in argsort ( #17599 )
* cuda : add error checking for cudaMemcpyAsync in argsort (#12836 )
* fix indentation
2025-11-30 08:16:28 +08:00
Acly
385c3da5e6
vulkan : fix FA mask load with bounds check (coopmat2) ( #17606 )
2025-11-30 01:03:21 +01:00
Xuan-Son Nguyen
ab49f094d2
server: move server-context to its own cpp|h ( #17595 )
* git mv
* add server-context.h
* add server-context.h
* clean up headers
* cont : cleanup
* also expose server_response_reader (to be used by CLI)
* fix windows build
* decouple server_routes and server_http
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-29 22:04:44 +01:00
Haiyue Wang
8c32d9d96d
server: explicitly set the function name in lambda ( #17538 )
As [1] explains, the actual debug message looks like:
"res operator(): operator() : queue result stop"
With the name set explicitly, the message is easier to read when debugging:
"res operator(): recv : queue result stop"
The remaining "operator()" is generated by 'RES_DBG() ... __func__'
[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html
Signed-off-by: Haiyue Wang <haiyuewa@163.com>
2025-11-29 18:43:29 +01:00
Igor Smirnov
0874693b44
common : fix json schema with '\' in literals ( #17307 )
* Fix json schema with '\' in literals
* Add "literal string with escapes" test
2025-11-29 17:06:32 +01:00
Neo Zhang
7d2add51d8
sycl : support allocating more than 4GB of memory on device; update the doc and script ( #17566 )
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-11-29 14:59:44 +02:00
ixgbe
f698a79c63
ggml: replace hwcap with riscv_hwprobe for RVV detection ( #17567 )
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-29 14:56:31 +02:00
Ruben Ortlam
47a268ea50
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support ( #16900 )
* vulkan: split mul_mmq_funcs for mul_mat_vecq use
* add mxfp4 mmvq
* add q2_k mmvq
* add q3_k mmvq
* add q4_k and q5_k mmvq
* add q6_k mmvq
* handle 4x4 quants per mmvq thread
* enable MUL_MAT_ID mmvq support
* enable subgroup optimizations for mul_mat_vec_id shaders
* device tuning
* request prealloc_y sync after quantization
* fix indentation
* fix llvmpipe test failures
* fix mul_mat_id mmvq condition
* fix unused variable warning
2025-11-29 09:37:22 +01:00
Jeff Bolz
59d8d4e963
vulkan: improve topk perf for large k, fix overflow in unit tests ( #17582 )
2025-11-29 08:39:57 +01:00