Commit Graph

7243 Commits

Author SHA1 Message Date
Yee Man Chan 67bee56013 add Kimi-K2 specific tokens to be recognized as EOG 2026-01-06 21:15:12 +08:00
Yee Man Chan e3542ff8a2 fixed some comments 2026-01-06 11:35:25 +08:00
Yee Man Chan cfed14e31b naive chunking form implemented 2026-01-06 11:23:53 +08:00
Yee Man Chan aba181ebad removed LOG_INFO 2026-01-05 19:21:06 +08:00
Yee Man Chan 66c0c5d8d4 Kimi Linear backend agnostic 2026-01-05 16:35:19 +08:00
Yee Man Chan a4020d867f don't quantize conv1d of Kimi Linear 2026-01-03 08:27:29 +08:00
Yee Man Chan 8bd617eb1c set n_embd_head_k/v to ensure kv cache works 2026-01-03 08:26:41 +08:00
Yee Man Chan f85e5c73b9 Move KIMI_LINEAR to llm_arch_is_hybrid to enable KV cache 2026-01-02 21:20:34 +08:00
Yee Man Chan f67a42d572 reduce OP count by 1 due to removal of kda_scan 2025-12-19 07:37:33 +08:00
Yee Man Chan 776294c04e removed all traces of kda_scan 2025-12-19 07:36:06 +08:00
Yee Man Chan f9a11d7758 rewrite get_vocab for KimiLinear. Removed all kda_scan code 2025-12-18 20:46:10 +08:00
Yee Man Chan ae9771d1dc removed unnecessary internal methods called by the old set_vocab of KimiLinear 2025-12-18 08:14:15 +08:00
Yee Man Chan ef5bc30544 use DeepseekV2 tokenizer 2025-12-14 17:43:30 +08:00
Yee Man Chan a0269af292 removed all hard code 2025-12-06 11:51:16 +08:00
Yee Man Chan 9f1265fec1 removed some hard coded code 2025-12-05 19:51:02 +08:00
Yee Man Chan 772ca88070 read MoE params 2025-12-02 20:16:24 +08:00
Yee Man Chan 83d328d0d3 remove type mismatch warning 2025-12-02 14:09:02 +08:00
Yee Man Chan 139548d070 remove "const int64_t n_seq_tokens = q->ne[2];" to get rid of unused variable warning 2025-12-02 12:11:15 +08:00
Yee Man Chan e308026f64 kimi linear src/llama 2025-12-02 12:02:35 +08:00
Yee Man Chan d73d3e51a5 Kimi Linear ggml.c 2025-12-02 11:27:57 +08:00
Yee Man Chan bf42bc0606 Kimi Linear ggml-cuda 2025-12-02 11:24:37 +08:00
Yee Man Chan 26a6553155 kimi linear ggml-cpu 2025-12-02 11:20:46 +08:00
Yee Man Chan 6167f39e08 Kimi Linear ggml.h 2025-12-02 11:14:34 +08:00
Yee Man Chan 57cca52779 kimi linear constants.py tensor_mapping.py 2025-12-02 10:40:44 +08:00
Yee Man Chan 84f822c5a5 kimi linear convert_hf_to_gguf 2025-12-02 08:51:09 +08:00
Yee Man Chan 27baad43d5 kimi linear model implementation 2025-12-02 08:35:14 +08:00
Xuan-Son Nguyen 7733409734
common: improve verbosity level definitions (#17630)
* common: improve verbosity level definitions

* string_format

* update autogen docs
2025-12-01 14:38:13 +01:00
Xuan-Son Nguyen cd3c118908
model: support Ministral3 (#17644)
* conversion script

* support ministral 3

* maybe this is better?

* add TODO for rope_yarn_log_mul

* better ppl (tested on 14B-Instruct)

* Add Ministral3 support to Mistral format

* improve arch handling

* add sizes

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* nits

---------

Co-authored-by: Julien Denize <julien.denize@mistral.ai>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-12-01 12:26:52 +01:00
Georgi Gerganov 649495c9d9
metal : add FA head size 48 (#17619) 2025-12-01 12:49:53 +02:00
Georgi Gerganov 90c72a614a
ggml : extend the GGML_SCHED_NO_REALLOC debug logic of the scheduler (#17617) 2025-12-01 12:49:33 +02:00
Aman Gupta 6eea666912
llama-graph: avoid expand_forward for fusion (#17633) 2025-12-01 11:12:48 +02:00
Xuan-Son Nguyen ff90508d68
contributing: update guidelines for AI-generated code (#17625)
* contributing: update guidelines for AI-generated code

* revise
2025-11-30 22:51:34 +01:00
Adrien Gallouët 0a4aeb927d
cmake : add option to build and link LibreSSL (#17552)
Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 22:14:32 +01:00
Tarek Dakhran 2ba719519d
model: LFM2-VL fixes (#17577)
* Adjust to pytorch

* Add antialiasing upscale

* Increase number of patches to 1024

* Handle default marker insertion for LFM2

* Switch to flag

* Reformat

* Cuda implementation of antialias kernel

* Change placement in ops.cpp

* consistent float literals

* Pad only for LFM2

* Address PR feedback

* Rollback default marker placement changes

* Fallback to CPU implementation for antialias implementation of upscale
2025-11-30 21:57:31 +01:00
Xuan-Son Nguyen 7f8ef50cce
clip: fix nb calculation for qwen3-vl (#17594) 2025-11-30 15:33:55 +01:00
Xuan-Son Nguyen 3c136b21a3
cli: add migration warning (#17620) 2025-11-30 15:32:43 +01:00
Adrien Gallouët beb1f0c503
common : throttle download progress output to reduce IO flush (#17427)
This change limits progress updates to approximately every 0.1% of the
file size to minimize stdio overhead.

Also fixes compiler warnings regarding __func__ in lambdas.

Signed-off-by: Adrien Gallouët <angt@huggingface.co>
2025-11-30 14:22:44 +02:00
Aaron Teo def5404f26
common: add LLAMA_LOG_FILE env var (#17609)
Signed-off-by: Aaron Teo <aaron.teo1@ibm.com>
2025-11-30 12:12:32 +01:00
Gilad S. fa0465954f
ggml: fix: macOS build with `-DGGML_BACKEND_DL=ON` (#17581) 2025-11-30 10:00:59 +08:00
ddh0 5a6241feb0
common: update env var name (#17588) 2025-11-30 09:59:25 +08:00
Aman Gupta c7af376c29
CUDA: add stream-based concurrency (#16991)
* CUDA: add stream-based concurrency

* HIP: fix hipStreamWaitEvent define and nodiscard warnings

* ggml-cuda: fix fusion inside stream

* ggml-cuda: fix bug w.r.t first stream launch

* ggml-cuda: format

* ggml-cuda: improve assert message

* ggml-cuda: use lambda instead of duplicating code

* ggml-cuda: add some more comments

* ggml-cuda: add more detailed comments about concurrency

* ggml-cuda: rename + remove unused var

* ggml-cuda: fix condition for stream launch

* ggml-cuda: address review comments, add destructor

* common.cuh: add is_valid for concurrent events

* common.cuh: make comment better

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* update comment

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* common.cuh: fix lower_bound condition + remove join_node data from write_ranges

* ggml-cuda: fix overlap condition + shadowing parameter

---------

Co-authored-by: Carl Philipp Klemm <carl@uvos.xyz>
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-11-30 08:17:55 +08:00
Mahekk Shaikh 00425e2ed1
cuda : add error checking for cudaMemcpyAsync in argsort (#17599)
* cuda : add error checking for cudaMemcpyAsync in argsort (#12836)

* fix indentation
2025-11-30 08:16:28 +08:00
Acly 385c3da5e6
vulkan : fix FA mask load with bounds check (coopmat2) (#17606) 2025-11-30 01:03:21 +01:00
Xuan-Son Nguyen ab49f094d2
server: move server-context to its own cpp|h (#17595)
* git mv

* add server-context.h

* add server-context.h

* clean up headers

* cont : cleanup

* also expose server_response_reader (to be used by CLI)

* fix windows build

* decouple server_routes and server_http

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-11-29 22:04:44 +01:00
Haiyue Wang 8c32d9d96d
server: explicitly set the function name in lambda (#17538)
As [1] explains, the debug message would otherwise read:
	"res    operator(): operator() : queue result stop"

With the name set explicitly, the message is easier to follow when debugging:
	"res    operator(): recv : queue result stop"

The leading "operator()" is generated by 'RES_DBG() ... __func__'

[1]: https://clang.llvm.org/extra/clang-tidy/checks/bugprone/lambda-function-name.html

Signed-off-by: Haiyue Wang <haiyuewa@163.com>
2025-11-29 18:43:29 +01:00
Igor Smirnov 0874693b44
common : fix json schema with '\' in literals (#17307)
* Fix json schema with '\' in literals

* Add "literal string with escapes" test
2025-11-29 17:06:32 +01:00
Neo Zhang 7d2add51d8
sycl : support to malloc memory on device more than 4GB, update the doc and script (#17566)
Co-authored-by: Neo Zhang Jianyu <jianyu.zhang@intel.com>
2025-11-29 14:59:44 +02:00
ixgbe f698a79c63
ggml: replace hwcap with riscv_hwprobe for RVV detection (#17567)
Signed-off-by: Wang Yang <yangwang@iscas.ac.cn>
2025-11-29 14:56:31 +02:00
Ruben Ortlam 47a268ea50
Vulkan: MMVQ Integer Dot K-Quant and MUL_MAT_ID support (#16900)
* vulkan: split mul_mmq_funcs for mul_mat_vecq use

* add mxfp4 mmvq

* add q2_k mmvq

* add q3_k mmvq

* add q4_k and q5_k mmvq

* add q6_k mmvq

* handle 4x4 quants per mmvq thread

* enable MUL_MAT_ID mmvq support

* enable subgroup optimizations for mul_mat_vec_id shaders

* device tuning

* request prealloc_y sync after quantization

* fix indentation

* fix llvmpipe test failures

* fix mul_mat_id mmvq condition

* fix unused variable warning
2025-11-29 09:37:22 +01:00
Jeff Bolz 59d8d4e963
vulkan: improve topk perf for large k, fix overflow in unit tests (#17582) 2025-11-29 08:39:57 +01:00