Commit Graph

587 Commits

Author SHA1 Message Date
Oliver Simons 6028bf7435
CUDA: Optimize `reduce_rows_f32` kernel, leading up to 25x perf improvement on kernel-level and 10% perf increase for Gemma3n (#15132)
* Factor out `reduce_rows_f32` from common.cuh

This increases iteration cycle speed by not having to recompile
every kernel all the time

* Hide memory-latency by loop unrolling in reduce_rows_f32

* Further optimizations to `reduce_rows_f32`

1. Increase threadblock size to better hide latency of memory requests.
   As a consequence of bigger threadblocks, do 2-step summation, using
   shared memory to communicate results between invocations
2. Use sum_temp array to reduce waits on sum
3. Adjust num_unroll to reflext bigger threadblock
4. Improve default block_dims, increase support for more block_dims

* Add perf tests for `reduce_rows_f32` kernel

* Add heuristic to toggle 128/512 threads based on sm count

Break even point was the minimum of the following multiples.

| GPU Model                     | Nrow SM Count Multiple |
| -----------                   | -----------            |
| RTX 4000 SFF ADA              | 2.0x                   |
| RTX 6000 ADA                  | 2.5x                   |
| RTX PRO 6000 Blackwell Max-Q  | 3.04x                  |
| RTX PRO 4500 Blackwell	| 3.15x                  |

* Ensure perf gains also for small ncols and large nrows

Alternative to this, one could have also made the number of unrollings
template-able, but that would require compiling the kernel multiple
times, increasing binary size unnecessarily

* Modify perf and unit-tests

* Apply auto-formatting by clang

* Fix CI build failure

See https://github.com/ggml-org/llama.cpp/actions/runs/16798370266/job/47573716079?pr=15132#step:7:486
Building with VS generator worked though.

* Remove sm_count property from `ggml_backend_cuda_context`

Requested by @JohannesGaessler, and should fix remaining CI issues as a
side-effect

* Add CUB-based implementation for GGML_OP_MEAN

Currently this branch is only executed for nrows==1

* Add heuristics to execute CUB branch only when it brings perf

Heuristics were determined on the following HW:

* RTX 4000 SFF ADA
* RTX 6000 ADA
* RTX PRO 6000 Blackwell Max-Q
* RTX PRO 4500 Blackwell

* Add unit-test for CUB-based mean

Tests should run with CUDA Graphs enabled per default on NVGPUs

* Rename `USE_CUB` to `GGML_CUDA_USE_CUB`

Suggested by @JohannesGaessler

* Unindent Preprocessor directives

See
https://github.com/ggml-org/llama.cpp/pull/15132#discussion_r2269213506
2025-08-13 10:04:46 +02:00
Sachin Desai 3db4da56a5
chat : support Granite model reasoning and tool call (#14864) 2025-08-06 20:27:30 +02:00
Sigbjørn Skjæret 65c797c4fa
chat : fix yandex chat template (#15116) 2025-08-06 13:26:49 +02:00
Georgi Gerganov fd1234cb46
llama : add gpt-oss (#15091)
* oai moe

* compat with new checkpoint

* add attn sink impl

* add rope scaling yarn

* logits match with latest transformers code

* wip chat template

* rm trailing space

* use ggml_scale_bias

* rm redundant is_swa_all

* convert interleaved gate_up

* graph : fix activation function to match reference (#7)

* vocab : handle o200k_harmony special tokens

* ggml : add attention sinks support (#1)

* llama : add attn sinks

* ggml : add attn sinks

* cuda : add attn sinks

* vulkan : add support for sinks in softmax

remove unnecessary return

* ggml : add fused swiglu_oai op (#11)

* ggml : add fused swiglu_oai op

* Update ggml/src/ggml-cpu/ops.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* update CUDA impl

* cont : metal impl

* add vulkan impl

* test-backend-ops : more test cases, clean up

* llama : remove unfused impl

* remove extra lines

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: slaren <slarengh@gmail.com>

* repack mxfp4 upon conversion

* clean up a bit

* enable thinking

* add quick hack to render only some special tokens

* fix bf16 conversion

* remove vocab hack

* webui ok

* support chat parsing for gpt-oss

* fix webui

* direct mapping mxfp4, FINALLY

* force using mxfp4

* properly use lazy tensor

* ggml : add mxfp4

ggml : use e8m0 conversion instead of powf

Co-authored-by: Diego Devesa <slarengh@gmail.com>

change kvalues_mxfp4 table to match e2m1 (#6)

metal : remove quantization for now (not used)

cuda : fix disabled CUDA graphs due to ffn moe bias

vulkan : add support for mxfp4

cont : add cm2 dequant

* ggml : add ggml_add_id (#13)

* ggml : add ggml_add_id

* add cuda impl

* llama : add weight support check for add_id

* perf opt

* add vulkan impl

* rename cuda files

* add metal impl

* allow in-place ggml_add_id

* llama : keep biases on CPU with --cpu-moe

* llama : fix compile error

ggml-ci

* cuda : add fallback for __nv_cvt_e8m0_to_bf16raw

ggml-ci

* cleanup

ggml-ci

* sycl : fix supports_op for MXFP4

ggml-ci

* fix Unknown reasoning format

* ggml-cpu : fix AVX build

ggml-ci

* fix hip build

ggml-ci

* cuda : add mxfp4 dequantization support for cuBLAS

ggml-ci

* ggml-cpu : fix mxfp4 fallback definitions for some architectures

ggml-ci

* cuda : fix version required for __nv_cvt_e8m0_to_bf16raw

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: slaren <slarengh@gmail.com>
2025-08-05 22:10:36 +03:00
Sigbjørn Skjæret f324a3b715
chat : only remove double bos/eos if added (#15086)
* only remove double bos/eos if added

* fix tests
2025-08-05 20:43:36 +02:00
Jhen-Jie Hong f738989dcb
chat : fix multiple tool_calls on hermes-2-pro (#14962) 2025-08-02 18:04:48 +08:00
Jeff Bolz ec0b18802c
vulkan: Support ne[3]>1 in noncontig matrix-vector multiply (#15015) 2025-08-02 10:48:30 +02:00
Georgi Gerganov 00131d6eaf
tests : update for LLAMA_SET_ROWS=1 (#14961)
* test-thread-safety : each context uses a single sequence

* embedding : handle --parallel argument

ggml-ci

* save-load : handle -np 1

ggml-ci

* thread-safety : avoid overriding threads, reduce test case arg

ggml-ci
2025-07-30 15:12:02 +03:00
Sigbjørn Skjæret 138b288b59
cuda : add softcap fusion (#14907) 2025-07-29 14:22:03 +02:00
Leonard Mosescu bda62193b2
test-backend-ops : extend test case filtering (#14865)
* Extend test case filtering

1. Allow passing multiple (comma-separated?) ops to test-backend-ops. This can be convenient when working on a set of ops, when you'd want to test them together (but without having to run every single op). For example:

`test-backend-ops.exe test -o "ADD,RMS_NORM,ROPE,SILU,SOFT_MAX"`

2. Support full test-case variation string in addition to basic op names. This would make it easy to select a single variation, either for testing or for benchmarking. It can be particularly useful for profiling a particular variation (ex. a CUDA kernel), for example:

`test-backend-ops.exe perf -b CUDA0 -o "MUL_MAT(type_a=f16,type_b=f32,m=4096,n=512,k=14336,bs=[1,1],nr=[1,1],per=[0,1,2,3],v=2)"`

These two can be combined. As the current `-o`, this change doesn't try to detect/report an error if an filter doesn't name existing ops (ex. misspelled)

* Updating the usage help text

* Update tests/test-backend-ops.cpp
2025-07-28 18:04:27 +02:00
Erik Scholz 89d1029559
vulkan : add fp16 support for the conv_2d kernel (#14872)
* add f16 to conv_2d testing
* weaken conv2d test error threshold
2025-07-27 12:04:33 +02:00
Aman Gupta 446595b9b3
Docs: add instructions for adding backends (#14889) 2025-07-27 09:36:43 +08:00
Georgi Gerganov 18f3b5ff9e tests : add non-cont K,V FA tests
ggml-ci
2025-07-23 14:08:09 +03:00
Aman Gupta 8c988fa41d
CUDA: add fused rms norm (#14800) 2025-07-23 09:25:42 +08:00
Jeff Bolz c2e058f1b4
vulkan/cuda: Fix im2col when KW!=KH (#14789)
The tid is decomposed into "ow + ky*OW + kx*OW*KH". Change "ksize" to match.
2025-07-21 13:35:40 +02:00
Ervin Áron Tasnádi a979ca22db
ggml: adds CONV_2D op and direct GEMM Vulkan implementation (#14316)
* ggml/ggml-vulkan/test-backend-ops: adds CONV_2D for Vulkan

* ggml-vulkan: adds f32 scalar shader to compute 2D convolution directly
with gemm (no need for im2col),

* test-backend-ops: adds test_case_ref to check the validity/performance of ops
against reference implementations having different graphs, adds tests

* * Performance fixes: minimized branch divergence, uses collectives to
  eliminate redundant calculation, macros removed.

* Kernel shared memory size check

* Updates test-backend-ops to support graphs for performance
  measurement.

* * Apple/Win32 compile errors fixed

* Subgroup size used to determine tile size -> fixes llvmpipe errors.

* Collectives disabled by default.

* Intel support is disabled as the performance is poor.

* Conv2d enabled for Intel with disabled collectives, disabled for Apple

* test-backend-ops modifications are reverted

* Trailing spaces and missing override fixed.

* Triggering pipeline relaunch.

* Code formatted with .clang-format.
2025-07-19 21:59:08 +02:00
Georgi Gerganov bf9087f59a
metal : fuse add, mul + add tests (#14596)
ggml-ci
2025-07-18 20:37:26 +03:00
Georgi Gerganov 225e7a1438
llama : add high-throughput mode (#14363)
* kv-cache : prepare K/V buffers for separation

ggml-ci

* batched-bench : fix oob write

ggml-ci

* llama : add "virtual sequences"

ggml-ci

* llama : use "stream" vs "virtual sequence"

ggml-ci

* graph : fix stream splitting when KV cache is not used

ggml-ci

* kv-cache : add multi-stream save/load support

ggml-ci

* llama : add "--attn-streams" flag

ggml-ci

* kv-cache : fix handling when find_slot fails

ggml-ci

* kv-cache : restore find_slot impl

ggml-ci

* kv-cache : add comments

* kv-cache : add bounds checks for sequence id

ggml-ci

* cont : add n_seq_max to batch allocr

ggml-ci

* kv-cache : perform stream copies lazily after llama_synchronize

ggml-ci

* kv-cache : avoid throwing exceptions across the C boundary

ggml-ci

* CUDA: 4D FlashAttention support (#14628)

* CUDA: 4D FlashAttention support

* CUDA: fix WMMA FA kernel

* llama : rename attn_streams -> kv_unified

ggml-ci

* common : rename kv_split -> kv_unified

ggml-ci

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-07-16 16:35:42 +03:00
Tarek Dakhran c31e60647d
tests : cover lfm2 cases in test_ssm_conv (#14651) 2025-07-12 19:10:14 +02:00
Acly 3e303b1107 vulkan : implement ggml_roll (ggml/1290)
ggml-ci
2025-07-12 14:25:44 +03:00
Aman Gupta 11ee0fea2a
Docs: script to auto-generate ggml operations docs (#14598)
* Docs: script to auto-generate ggml operations docs

* Review: formatting changes + change github action

* Use built-in types instead of typing

* docs : add BLAS and Metal ops

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-07-10 23:29:01 +08:00
compilade a57d1bcb3c
cuda : support Falcon-H1 state size for SSM_SCAN (#14602) 2025-07-09 23:54:38 -04:00
Xuan-Son Nguyen 98bab638fb
ggml : add ggml_scale_bias (#14417)
* ggml : add ggml_scale_bias

* ggml_vec_mad1_f32

* add more simd

* add CUDA

* sycl

* vulkan

* cann (placeholder)

* opencl

* will this fix cpu?

* fix cuda

* suggestions from coderabbit

* fix cann compile error

* vDSP_vsmsa

* rm __ARM_FEATURE_SVE

* use memcpy for op params

* make code looks more consistent

* use scalar for __ARM_FEATURE_SVE

* add x param to ggml_vec_mad1_f32
2025-07-09 18:16:12 +02:00
Georgi Gerganov 4d0dcd4a06
cuda : fix rope with partial rotation and non-cont src (#14580)
* cuda : fix rope non-cont

ggml-ci

* cont : fix multi-rope + add test

ggml-ci

* sycl : try fix

ggml-ci

* cont : fix sycl + clean-up cuda

ggml-ci
2025-07-08 10:15:21 +03:00
Jeff Bolz e592be1575
vulkan: fix rms_norm+mul fusion (#14545)
The fused operation was grabbing the epsilon value from the wrong place.

Add an env var to disable fusion.

Add some missing checks for supported shapes/types.

Handle fused rms_norm+mul in check_results.
2025-07-06 10:08:16 +02:00
R0CKSTAR b81510a7b7
test-backend-ops: add support for specifying output format (#14368)
* test-backend-ops: add support for specifying output format

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Add build_commit and build_number in test_result

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* refactor

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Get build commit from ggml_commit()

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Merge errors into test_operation_info && address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* remove visitor nonsense

* remove visitor comment

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

* Address review comments

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>

---------

Signed-off-by: Xiaodong Ye <yeahdongcn@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2025-07-05 12:10:53 +08:00
Johannes Gäßler c8c4495b8d
ggml: backward pass for split swiglu (#14483) 2025-07-03 17:05:18 +02:00
Georgi Gerganov 9067487c44
ggml : fix FA mask dim 2 and 3 (#14505)
* ggml : fix FA mask dim 2 and 3

ggml-ci

* backends : unsupport batched FA in CUDA and Vulkan

ggml-ci

* vulkan : disable FA for mask->ne[2] != 1
2025-07-03 10:46:57 +03:00
Georgi Gerganov d4cdd9c1c3
ggml : remove kompute backend (#14501)
ggml-ci
2025-07-03 07:48:32 +03:00
Aman Gupta 55c2646b45
CUDA: add dynamic shared mem to softmax, refactor general usage (#14497) 2025-07-03 07:45:11 +08:00
compilade 5d46babdc2
llama : initial Mamba-2 support (#9126)
* llama : initial Mamba-2 support

* ggml : SIMD ggml_ssm_scan for Mamba-2

* ggml : improve ggml_mul speed when masking recurrent states

* llama : support running Mamba-Codestral-7B-v0.1

* llama : fix Mamba-2 conv state saving

* ggml : make the ggml_mul fast broadcast path more consistently formatted

* llama : remove unused variable

* llama : add missing break

* convert_hf : prefer SentencePiece tokenizer for Mamba-2 when present

The tokenzier.json of Mamba-Codestral-7B-v0.1 otherwise requires
workarounds to work correctly.

* llama : avoid redundant state copy for Mamba 1 and 2

* metal : attempt to adapt SSM_SCAN for Mamba-2

* metal : fix SSM_SCAN pipeline scope

* metal : use log and exp instead of log1pf and expf in SSM_SCAN

* metal : remove unused arguments for SSM_SCAN

The max index is 31, so trimming the arguments is necessary.

* metal : add back n_seqs to SSM_SCAN args

Whoops, this is needed for the offset in the concatenated output.

* metal : fix SSM_SCAN state head offset

* metal : fix wrong number of tokens per sequence in SSM_SCAN

* ggml : remove unused fast broadcast path in GGML_MUL

This was initially added because states were masked with ggml_mul,
but this is no longer done and so this "optimisation" is no longer
necessary, or at least not worth the additional code complexity.

* ggml : avoid multiply by D in GGML_OP_SSM_SCAN

This makes the weight buft detection in src/llama.cpp simpler.

* convert : transpose Mamba-2 A, D and reshape SSM_NORM

This breaks existing conversions of Mamba-2 models
to avoid some reshapes.

Not sure if it's a good idea,
but it makes the graph slightly cleaner.

* llama : more appropriate SSM_SCAN and SSM_CONV buft support checks

* convert : fix flake8 lint

* metal : fix confusion between ; and ,

* metal : add missing args for nb references in ssm_scan_f32_group

* metal : single-user mamba2 inference works

* kv-cache : remove const_cast when setting inputs for s_copy

And also fix multi-user inference for recurrent models
by using cell_id instead of i as the kv cell index
when populating s_copy.

* convert : avoid AutoConfig for Mamba and Mamba2 hparams

* kv-cache : allow context shift for recurrent models

* graph : fix recurrent state copies when avoiding copies

Works, but using lambda functions might not be that clean.

* ggml : fix mamba2 ssm scan when compiled with SVE

* ggml-cpu : reorder SVE FMA for consistency with other SIMD arches

* cuda : implement ssm scan for Mamba2

There is still room for improvement, but it works!

* cuda : adapt Mamba1 ssm scan to shape changes from Mamba2

* mamba : fix mismatched new and delete size for llm_build_mamba

Subclasses of llm_graph_context cannot have extra fields,
because the called destructor is not the one from the subclass.
This otherwise would cause problems when runnning Mamba-(1|2) inference
when compiled -DGGML_SANITIZE_ADDRESS=ON

* cuda : graceful fallback for Mamba-1 models with weird embd size
2025-07-02 13:10:24 -04:00
Georgi Gerganov ec68e84c32 ggml : support bcast ggml_soft_max_ext, ggml_flash_attn_ext (#14435)
ggml-ci
2025-07-02 15:48:33 +03:00
Jeff Bolz 6a746cf9c4
vulkan: Split large mul_mat_id to fit in shared memory (#14451) 2025-07-01 10:43:08 +02:00
Acly 431b2c24f3 ggml-cpu : "align corners" for bilinear upscale/downscale (ggml/1285)
* add "align corners" mode for bilinear upscale, and allow downscaling
* add ggml_interpolate, deprecate ggml_upscale_ext, pass in align-corners as bit-flag
* test-backend-ops: replace ggml_upscale_ext with ggml_interpolate, add test cases for downscale and align-corners
2025-07-01 11:06:39 +03:00
Diego Devesa eb3fa2913e
test-backend-ops : disable llama test (#14461) 2025-06-30 12:43:15 +02:00
Vedran Miletić e9b6350e61
scripts : make the shell scripts cross-platform (#14341) 2025-06-30 10:17:18 +02:00
Sigbjørn Skjæret a0535ffa0d
ggml : implement REGLU/GEGLU/SWIGLU ops (#14158)
* implement unary REGLU/GEGLU/SWIGLU cpu ops

* relax constraints

* duplicate shape of source

* fix ggml_vec_geglu_f16

* special case gated ops

* implement unary REGLU/GEGLU/SWIGLU cuda ops

* tighten constraints again

* refactor into GGML_GLU_OP

* metal : add glu kernels

ggml-ci

* add CUDA_GLU_BLOCK_SIZE [no ci]

* more constraints and use 64bit ints

ggml-ci

* 64bit multiplication [no ci]

* implement swapped variants (cpu/cuda)

* update comment [no ci]

ggml-ci

* Vulkan: Add GLU ops and shaders

* SYCL: Implement fused kernel GEGLU, SWIGLU and REGLU for single up+gate

* ggml : implement GLU for split up/gate (#14181)

* implement GLU for split up/gate

* add tests for ggml_glu_split

* Vulkan: Implement glu_split logic and shader support

* add split to logging [no ci]

* SYCL: refactor element_size ops and add split up and gate support to gated kernels

* SYCL: switch GEGLU to use tanh approximation

---------

Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>

* GGML: increase OP count in assertion

* Refactor: Optimize SYCL element-wise operations with unary function inlining

This commit refactors the SYCL element-wise operations to improve performance by:

- Inlining unary operations (sgn, abs, elu, gelu, silu, etc.) to reduce kernel launch overhead.
- Introducing helper functions `op_xxx` for each unary operation to encapsulate the logic.
- Replacing direct kernel calls with calls to these inlined functions.
- Using `__dpct_inline__` to encourage compiler inlining.
- Minor code cleanup and consistency improvements.

The changes aim to reduce kernel launch overhead and improve the overall efficiency of element-wise operations on SYCL devices.

* vulkan: Increase workgroup size for GLU, for performance (#14345)

* vulkan: Increase workgroup size for GLU, for performance

* vulkan: change GLU shaders to do one element per invocation rather than one row per workgroup

* merge fix

* metal : add support for split and swap

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: 0cc4m <picard12@live.de>
Co-authored-by: Akarshan <akarshan@menlo.ai>
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-06-29 11:04:10 +02:00
Jeff Bolz bd9c981d72
vulkan: Add fusion support for RMS_NORM+MUL (#14366)
* vulkan: Add fusion support for RMS_NORM+MUL

- Add a use_count to ggml_tensor, so we can detect if an output is used more than once.
- Change the ggml-vulkan rms_norm shader to optionally multiply by another tensor.
- Add detection logic and basic fusion logic in ggml-vulkan.
- Add some testing support for fusion. Rather than computing one node at a time, allow
for computing the whole graph and just testing one node's results. Add rms_norm_mul tests
and enable a llama test.

* extract some common fusion logic

* fix -Winconsistent-missing-override

* move ggml_can_fuse to a common function

* build fix

* C and C++ versions of can_fuse

* move use count to the graph to avoid data races and double increments when used in multiple threads

* use hash table lookup to find node index

* change use_counts to be indexed by hash table slot

* minimize hash lookups

style fixes

* last node doesn't need single use.
fix type.
handle mul operands being swapped.

* remove redundant parameter

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-06-29 09:43:36 +02:00
Aman Gupta 27208bf657
CUDA: add bf16 and f32 support to cublas_mul_mat_batched (#14361)
* CUDA: add bf16 and f32 support to cublas_mul_mat_batched

* Review: add type traits and make function more generic

* Review: make check more explicit, add back comments, and fix formatting

* Review: fix formatting, remove useless type conversion, fix naming for bools
2025-06-29 01:30:53 +08:00
Radoslav Gerganov 8d94219a4a
ggml : add ggml_set_rows (#14274)
* ggml : add ggml_set_rows

Add ggml_set_rows(a, b, c) which copies rows from 'b' into 'a' using
indices from 'c'.

ref: #8366

* use I64 for indices

* ggml : add repeat impl for i64

* ggml : add ggml_is_contiguous_rows

* ggml : ggml_set_rows support broadcast

* ggml : ggml_set_rows support quantized dst

ggml-ci

* ggml : support GGML_TYPE_F32 ".from_float" trait

* ggml : ggml_set_rows update comment + better index name

* tests : add ggml_set_rows

* metal : add ggml_set_rows implementation

ggml-ci

* ggml : simplify forward_dup_f32

* ggml : fix supports_op

* tests : add comment to set_rows

* ggml : leave the repeat_i64 for a separate PR

ggml-ci

* ggml : set_rows use std::min instead of MIN

* ggml : better error message for set_rows unsupported type

* metal : perform op->type check only once

* tests : more consistent implementation + more tests

ggml-ci

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-27 16:41:40 +03:00
Georgi Gerganov e8215dbb96
metal : add special-case mat-vec mul for ne00 == 4 (#14385)
ggml-ci
2025-06-26 15:51:19 +03:00
Aman Gupta aa064b2eb7
CUDA: add mean operation (#14313)
* CUDA: add mean operation

* add back sum_rows_f32_cuda

* Review: early exit if col!=0
2025-06-22 12:39:54 +08:00
Aman Gupta c959f462a0
CUDA: add conv_2d_transpose (#14287)
* CUDA: add conv_2d_transpose

* remove direct include of cuda_fp16

* Review: add brackets for readability, remove ggml_set_param and add asserts
2025-06-20 22:48:24 +08:00
Diego Devesa 6adc3c3ebc
llama : add thread safety test (#14035)
* llama : add thread safety test

* llamafile : remove global state

* llama : better LLAMA_SPLIT_MODE_NONE logic

when main_gpu < 0 GPU devices are not used

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-16 08:11:43 -07:00
Sigbjørn Skjæret d4e0d95cf5
chore : clean up relative source dir paths (#14128) 2025-06-11 19:04:23 +02:00
Sigbjørn Skjæret cc66a7f78f
tests : add test-tokenizers-repo (#14017) 2025-06-11 17:16:32 +02:00
Diego Devesa 7f4fbe5183
llama : allow building all tests on windows when not using shared libs (#13980)
* llama : allow building all tests on windows when not using shared libraries

* add static windows build to ci

* tests : enable debug logs for test-chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-06-09 20:03:09 +02:00
Ervin Áron Tasnádi 0d3984424f
ggml-vulkan: adds support for op CONV_TRANSPOSE_1D (#13813)
* * ggml-vulkan: adds op CONV_TRANSPOSE_1D

* test-backend-ops: adds more spohisticated tests for CONV_TRANSPOSE_1D

* Missing barrier added to shader.
Number of additional tests reduced to 108.

* * Fixes typo in variable name.

* Removes extra whitespaces.

* Adds int64->int32 casts to prevent possible warnings.

* Problem size reduced in tests to pass tests with llvmpipe.

* supports_op condition moved from unintended position
2025-06-04 22:02:00 +02:00
Olivier Chafik c9bbc77931
`server`: update deepseek reasoning format (pass reasoning_content as diffs) (#13933)
* server: update deepseek reasoning format (now in reasoning_content diffs), add legacy option for compat
* update unit/test_tool_call.py::test_thoughts
2025-06-02 10:15:44 -07:00
Johannes Gäßler 7675c555a1
gguf: fix failure on version == 0 (#13956) 2025-06-01 18:08:05 +02:00
Olivier Chafik e15898d1c7
server: allow unclosed thinking tags (#13931) 2025-05-31 08:26:10 -07:00
Georgi Gerganov 53f925074d
sync : vendor (#13901)
* sync : vendor

ggml-ci

* cont : fix httplib version

ggml-ci

* cont : fix lint

* cont : fix lint

* vendor : move to common folder /vendor

ggml-ci

* cont : fix lint

* cont : move httplib to /vendor + use json_fwd.hpp

ggml-ci

* cont : fix server build

ggml-ci

* cont : add missing headers

ggml-ci

* cont : header clean-up

ggml-ci
2025-05-30 16:25:45 +03:00
Xuan-Son Nguyen 07e4351ce6
convert : allow partial update to the chkhsh pre-tokenizer list (#13847)
* convert : allow partial update to the chkhsh pre-tokenizer list

* code style

* update tokenizer out

* rm inp/out files for models not having gguf

* fixed hash for glm

* skip nomic-bert-moe test

* Update convert_hf_to_gguf_update.py

* fix minerva-7b hash

* rm redundant import
2025-05-30 12:24:37 +02:00
Georgi Gerganov 66c92061f5
tests : remove json.hpp from a test (#13880)
ggml-ci
2025-05-29 12:17:16 +03:00
Georgi Gerganov f9cd68398b
sampling : make sure samplers return at least 1 token (#13822)
* sampling : min-p should always return at least one token

ggml-ci

* sampling : same for typical sampling

* tests : sampling tests use min_keep == 0

ggml-ci
2025-05-27 12:07:52 +03:00
Olivier Chafik 03f582ae8f
server: fix streaming crashes (#13786)
* add preludes to content on partial regex match

* allow all parsers to parse non-tool-call content.

* tweak order of <|python_tag|> vs <function= parsing for functionary v3.1 format. still not ideal but hopefully less prone to crash
2025-05-26 16:03:57 +01:00
Olivier Chafik d74e94c1b3
`server`: fix format of streamed tool call deltas (diff name, fix id location) (#13800)
* fix deltas of tool_call.function.name

* fix tool_call.id (was in tool_call.function.id!) + add function type

* add tool_call.type

* populate empty tool_call.function.arguments on first delta
2025-05-26 14:56:49 +01:00
Olivier Chafik e121edc432
`server`: add `--reasoning-budget 0` to disable thinking (incl. qwen3 w/ enable_thinking:false) (#13771)
---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-26 00:30:51 +01:00
Sigbjørn Skjæret aa50ba462f
tests : improve UGM tokenizer test coverage (#13773) 2025-05-25 16:22:29 +02:00
Olivier Chafik f5cd27b71d
`server`: streaming of tool calls and thoughts when `--jinja` is on (#12379)
* add common_json w/ support for truncated json healing

* add common_chat_msg_diff

* partial common_chat_parse

* refactor parser w/ optionals

* server: wire chat diffs in stream mode

* fix trigger of thinking models (must happen after thoughts are closed)

* fix functionary v3.2 raw python!

* rename: common_chat_syntax (now contains format)

* rm common_regex.at_start

* don't return empty <think></think>

* accommodate yet another deepseek r1 distill fantasy syntax (`<|tool▁calls|>`)

* fix QwQ 32B tool call parsing after thoughts (hermes2)

* better logs for grammar triggers

* consume spaces after parse_json_tool_calls

* fix required tool calls w/ thinking models that have pre-opened thinking tags

* fix thinking model's initial trigger + test qwq's template

* run most test_tool_call tests in stream + non-stream modes

* make functionary v3.2 parsing more strict (differentiate first match from others)

* send final diff from server, to close off raw python arguments

* support partial content streaming in Generic mode

* tool-call: allow content prelude before hermes2 tool calls (for Qwen2.5)

* Update function-calling.md

* Update tool_bench.py

* chat-parser: remove input from exception (llm output may contain PII)

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Olivier Chafik <ochafik@users.noreply.github.com>
2025-05-25 01:48:08 +01:00
Sigbjørn Skjæret 759e37b0d8
tests : avoid github urls due to throttling (#13654) 2025-05-20 12:03:17 +02:00
Olivier Chafik aa48e373f2
`server`: inject date_string in llama 3.x template + fix date for firefunction v2 (#12802)
* Inject date_string in llama 3.x + fix for functionary v2

https://github.com/ggml-org/llama.cpp/issues/12729

* move/fix detection of functionary v3.1 before llama 3.x, fix & test their non-tool mode

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* generate more tokens in test_completion_with_required_tool_tiny_fast to avoid truncation

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
2025-05-15 02:39:51 +01:00
Olivier Chafik 3198405e98
`common`: add partial regex support (#12808)
* move string_find_partial_stop & string_ends_with to common

* add common_regex (supports partial matches)

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/regex-partial.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* partial regex: add missing iterator end checks

* string utils: use string_views

* direct throw to avoid ggml.h include

* regex-partial: replace missed ggml_asserts

---------

Co-authored-by: ochafik <ochafik@google.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-05-14 19:50:57 +01:00
Johannes Gäßler 10d2af0eaa
llama/ggml: add LLM training support (#10544)
* llama/ggml: add LLM training support

more compact progress bar

llama_save_model_to_file

llama_opt_param_filter

ggml_graph_dup force_grads

refactor ggml_opt, fix test-opt

* remove logits_all

* refactor CUDA implementation for ACC

* reset graph at beginning of opt period
2025-05-12 14:44:49 +02:00
David Huang 7f323a589f
Add `--no-op-offload` to improve `-ot` pp perf in MoE models like llama4 400B (#13386) 2025-05-11 14:18:39 +02:00
DocShotgun ffc727203a
sampling : make top_n_sigma no-op at <=0 or a single candidate (#13345) 2025-05-06 22:36:24 +02:00
Xuan-Son Nguyen 27aa259532
mtmd : add C public API (#13184)
* init

* wip

* working version

* add mtmd::bitmaps

* add test target

* rm redundant define

* test: mtmd_input_chunks_free

* rm outdated comment

* fix merging issue

* explicitly create mtmd::input_chunks

* mtmd_input_chunk_copy

* add clone()

* add const to various places

* add warning about breaking changes

* helper: use mtmd_image_tokens_get_n_pos
2025-05-04 23:43:42 +02:00
Diego Devesa 9f2da5871f
llama : build windows releases with dl backends (#13220) 2025-05-04 14:20:49 +02:00
Diego Devesa 1d36b3670b
llama : move end-user examples to tools directory (#13249)
* llama : move end-user examples to tools directory

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-05-02 20:27:13 +02:00
Georgi Gerganov b34443923c
sync : ggml (#13268)
* vulkan : kernels for depthwise 2D convolution (CONV_2D_DW) (ggml/1204)

* vulkan : add kernels for depthwise 2d convolution (OP_CONV_2D_DW)

* review: remove src_x/y < 0 checks; add performance tests

* sync : ggml

ggml-ci

* vulkan : fix lint (#0)

---------

Co-authored-by: Acly <aclysia@gmail.com>
2025-05-02 20:54:30 +03:00
piDack 2af6880178
llama-chat : reset glmedge chat template (#13253)
* reset glmedge chat template

* fix glmedge chat template
2025-05-02 11:06:09 +02:00
matteo e0f572c846
llama-chat : update GLM4 chat template (#13238)
* update GLM4 chat template

* Update chat template

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-05-01 21:16:38 +02:00
Johannes Gäßler b0ecbd434b
test: non-cont. b in test-backend-ops -o MUL_MAT (#13187) 2025-05-01 20:18:56 +02:00
Johannes Gäßler e1e8e0991f
CUDA: batched+noncont MMQ, refactor bs>1 MoE code (#13199) 2025-04-30 23:12:59 +02:00
Xuan-Son Nguyen da84c04d8f
docker : do not build tests (#13204)
* docker : do not build tests

* include "ggml-cpu.h"
2025-04-30 10:44:07 +02:00
Xuan-Son Nguyen 4e87962e34
mtmd : fix glm-edge redundant token count (#13139)
* mtmd : fix glm-edge redundant token count

* fix chat template

* temporary disable GLMEdge test chat tmpl
2025-04-28 16:12:56 +02:00
Xuan-Son Nguyen 2d451c8059
common : add common_remote_get_content (#13123)
* common : add common_remote_get_content

* support max size and timeout

* add tests
2025-04-26 22:58:12 +02:00
frob d5fe4e81bd
grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
2025-04-26 10:10:20 +02:00
Xuan-Son Nguyen edb18b6e8f
clip : fix pixtral on some GPU backends (#13097)
* clip : fix pixtral on some GPU backends

* refactor inp_raw set

* rm outdated comment

* fix dynamic size

* add TODO
2025-04-25 14:31:42 +02:00
Georgi Gerganov 13b4548877
cmake : do not include ./src as public for libllama (#13062)
* cmake : do not include ./src as public for libllama

ggml-ci

* cmake : rework tests

ggml-ci

* llguidance : remove unicode include

ggml-ci

* cmake : make c++17 private

ggml-ci
2025-04-24 16:00:10 +03:00
Johannes Gäßler 658987cfc9
CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID (#13014)
* CUDA: noncont MMVQ + batched bs1 MUL_MAT_ID

* fix logic for RoPE support, CUDA graphs
2025-04-22 21:27:40 +02:00
Georgi Gerganov 2f74c354c0
graph : make FA compatible with MLA + add initial Metal kernels (#12953)
* graph : make mla compatible with FA

* metal : add exp FA kernels for DeepSeek models

ggml-ci

* llama : minor naming updates

ggml-ci

* ggml : disable FA for DS head sizes

* tests : add FA tests for MLA shapes

ggml-ci
2025-04-17 18:16:36 +03:00
Jeff Bolz 015022bb53
vulkan: enable coopmat2 FA gqa and split_k optimizations more often (#12931)
The grouped query attention optmization doesn't require a power of two ratio,
the only thing relying on it was the modulo operation written as bitwise &.

split_k need not depend on gqa_ratio - enable it any time there's only one
workgroup in the X dimension. The shader gets the split index from the x coord,
and multiple workgroups in the X dimension (pre-split) indicates a larger
FA operation that wouldn't need splitting.
2025-04-16 20:37:25 +02:00
Olivier Chafik b6930ebc42
`tool-call`: fix non-tool-calling grammar crashes w/ Qwen / Hermes 2 templates (#12900)
* `tool-call`: don't call common_chat_params_init_hermes_2_pro when there aren't tools (or when there's a schema)

* test all chat formats w/o tools
2025-04-11 21:47:52 +02:00
Georgi Gerganov 1d2b613445 tests : fix init order (#0)
ggml-ci
2025-04-11 00:17:47 +03:00
Diego Devesa fe92821ea9 ggml : add bilinear upscale support (ggml/1185) 2025-04-11 00:17:47 +03:00
Plamen Minev 381603a775
ci: detach common from the library (#12827)
* fix: detach common from the library

* fix: building chat test template
2025-04-09 10:11:11 +02:00
Xuan-Son Nguyen bd3f59f812
cmake : enable curl by default (#12761)
* cmake : enable curl by default

* no curl if no examples

* fix build

* fix build-linux-cross

* add windows-setup-curl

* fix

* shell

* fix path

* fix windows-latest-cmake*

* run: include_directories

* LLAMA_RUN_EXTRA_LIBS

* sycl: no llama_curl

* no test-arg-parser on windows

* clarification

* try riscv64 / arm64

* windows: include libcurl inside release binary

* add msg

* fix mac / ios / android build

* will this fix xcode?

* try clearing the cache

* add bunch of licenses

* revert clear cache

* fix xcode

* fix xcode (2)

* fix typo
2025-04-07 13:35:19 +02:00
R0CKSTAR 5f696e88e0
sync : minja (inclusionAI/Ling) and update tests (#12699)
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
2025-04-03 13:51:35 +02:00
Jeff Bolz f01bd02376
vulkan: Implement split_k for coopmat2 flash attention. (#12627)
When using group query attention, we have one workgroup per KV batch and this
can be very few workgroups (e.g. just 8 in some models). Enable split_k to
spread the work across SMs. This helps a lot when the KV cache is large.
2025-04-02 14:25:08 -05:00
Xuan-Son Nguyen 267c1399f1
common : refactor downloading system, handle mmproj with -hf option (#12694)
* (wip) refactor downloading system [no ci]

* fix all examples

* fix mmproj with -hf

* gemma3: update readme

* only handle mmproj in llava example

* fix multi-shard download

* windows: fix problem with std::min and std::max

* fix 2
2025-04-01 23:44:05 +02:00
Sergei Vorobyov 7242dd9675
llama-chat : Add Yandex instruct model template support (#12621)
* add yandex template

* update yandex chat template

* fix tests

* adjust chat template

* fix style

* fix tool macro in template

* add clarify comment

---------

Co-authored-by: Sergei Vorobev <serv01@yandex-team.ru>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
2025-03-30 20:12:03 +02:00
Georgi Gerganov b4ae50810e
metal : improve FA + improve MoE (#12612)
* ggml : FA with different K, V head sizes (CPU)

ggml-ci

* metal : add FA with HS=192

* metal : extend FA to support different K and V head sizes

ggml-ci

* metal : add FA vector kernels for heads K 192 and V 128

ggml-ci

* ggml : restrict op on other backends to equal head sizes

ggml-ci

* metal : optimize FA-vec kernel

ggml-ci

* metal : FA remove mq registers

* metal : improve MoE mul_mat_id condition

ggml-ci

* metal : fix comments + remove unnecessary addition

ggml-ci

* metal : avoid too much shared memory usage with mul_mat_id

ggml-ci
2025-03-28 20:21:59 +02:00
Michał Moskal 2447ad8a98
upgrade to llguidance 0.7.10 (#12576) 2025-03-26 11:06:09 -07:00
Jeff Bolz 9b169a4d4e
vulkan: fix mul_mat_vec failure in backend tests (#12529)
The OOB calculation could be wrong if the last iteration was during one of
the unrolled loops. Adjust the unrolling counts to avoid this. Add a couple
new backend tests that hit this failure on NVIDIA GPUs.
2025-03-24 07:56:17 +01:00
Georgi Gerganov ba932dfb50
ggml : fix quantized cpy op (#12310)
* ggml : fix quantized cpy op

ggml-ci

* tests : add cpy tests for all types

ggml-ci

* tests : add BF16 copy tests

ggml-ci

* tests : fix loop for same-type copy

ggml-ci

* tests : add option to permute the dst tensor

ggml-ci
2025-03-22 16:23:26 +02:00
Jeff Bolz eddfb43850
vulkan: Optimize mul_mat_vec p021 and nc shaders (#12505)
* tests: add mul_mat perf/functional tests for p021/nc vulkan shaders

* vulkan: Optimize mul_mat_vec p021 and nc shaders.

These shaders are used in attention calculations, and when the KV cache grows
large they start to dominate the run time. For the nc shader (which is called
with large 'k' dimension), use unrolling and vector loads. For the p021 shader
(which is called with large 'm' and small 'k' dimensions), take advantage of
grouped query attention to reuse loads from the A matrix for the whole group,
and reduce the number of workgroups (too much overhead from tiny dispatches).

Using subgroupAdd in the p021 shader also helps, use that conditionally.
2025-03-22 09:40:11 +01:00
Gaurav Garg 517b5ddbf0
CUDA: Improve flash decoding kernel GPU occupancy for BS=1 case (#12183)
- Find out active blocks per SM using cudaOccupancyMaxActiveBlocksPerMultiprocessor API. Use this value to determine the optimal parallel_blocks value.
- Prefer vector flash attention kernels over MMA kernel for BS=1

Fixes Issue: #12182
---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-03-19 20:52:06 +01:00
Molly Sophia 7dfad387e3
llama: Add support for RWKV v7 architecture (#12412)
* ggml: Add op l2_norm

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: Add op rwkv_wkv7

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: Add support for RWKV7 and ARWKV7 models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix inference with RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: add more (a)rwkv7 variants in size

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Apply code-format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* fix MUSA build

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* llama: fix shape error with rwkv using llama-parallel

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2025-03-18 07:27:50 +08:00
Jeff Bolz bf69cfe62f
vulkan: fix bug in coopmat1 mul_mat_id (#12316)
* tests: run mul_mat_id with a larger N

* vulkan: fix bug in coopmat1 mul_mat_id
2025-03-12 06:59:19 +01:00
Georgi Gerganov e128a1bf5b
tests : fix test-quantize-fns to init the CPU backend (#12306)
ggml-ci
2025-03-10 14:07:15 +02:00
Olivier Chafik 4e39a3c332
`server`: extract <think> tags from qwq outputs (#12297)
* extract <think> tags from qwq outputs

* const for all static regexes in chat.cpp
2025-03-10 10:59:03 +00:00
Olivier Chafik 87c2630546
allow missing content in message if tool_calls provided (#12293) 2025-03-10 09:45:07 +00:00
Olivier Chafik 669912d9a5
`tool-call`: fix Qwen 2.5 Coder support, add micro benchmarks, support trigger patterns for lazy grammars (#12034)
* sampler: turn lazy grammar trigger words to regexes

* add scripts/tool_bench.sh & .py

* constrain llama json output regardless of function name if matches at beginning

* update relaxed newline space rule in grammar tests

* support add_generation_prompt query parameter (useful for /apply_template)

* Update src/llama-grammar.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-03-05 13:05:13 +00:00
cmdr2 0cbee131ad cuda/vulkan: specify fp32-only support for some operations in supports_op (ggml/1129)
ggml-ci
2025-03-03 18:18:11 +02:00
cmdr2 87abb7e903 cuda/cpu: Increase support for fp16 unary operations (ggml/1125)
* Support fp16 unary operations in the CUDA backend

* cpu: increase fp16 support for unary operators in the CPU backend

* cuda: increase fp16 support for unary operators in the CUDA backend

* Add test cases for fp16 unary operators

* metal: update supports_op for unary operators that don't support fp16, to prevent test-backend-ops from failing

* metal: fix PR comments for unary op support after fp16 unary tests
2025-03-03 18:18:11 +02:00
cmdr2 f54a4ba11e Support pure float16 add/sub/mul/div operations in the CUDA (and CPU) backend (ggml/1121)
* Support float16-to-float16 add/sub/mul/div operations in the CUDA backend

* Add fp16 support for add/sub/mul/div on the CPU backend

* Add test cases for fp16 add/sub/mul/div
2025-03-03 18:18:11 +02:00
Diego Devesa d5c63cd7f9
test-backend-ops : add option -p to filter by op params (#12155) 2025-03-03 14:00:46 +01:00
William Tambellini 70680c48e5
ggml : upgrade init_tensor API to return a ggml_status (#11854)
* Upgrade init_tensor API to return a ggml_status

To prepare for an 'abort-free' ggml
(ggml not to abort on OOMs but return a OOM status),
as agreeed with Diego in the ggml repo,
upgrade the init_tensor() and view_init() APIs
to return a ggml_status.

* misc fixes

---------

Co-authored-by: slaren <slarengh@gmail.com>
2025-02-28 14:41:47 +01:00
Johannes Gäßler 5fa07c2f93
CUDA: optimize FA for GQA + large batches (#12014) 2025-02-22 12:20:17 +01:00
Olivier Chafik 63e489c025
tool-call: refactor common chat / tool-call api (+ tests / fixes) (#11900)
* tool-call refactoring: moved common_chat_* to chat.h, common_chat_templates_init return a unique_ptr to opaque type

* addressed clang-tidy lints in [test-]chat.*

* rm minja deps from util & common & move it to common/minja/

* add name & tool_call_id to common_chat_msg

* add common_chat_tool

* added json <-> tools, msgs conversions to chat.h

* fix double bos/eos jinja avoidance hack (was preventing inner bos/eos tokens)

* fix deepseek r1 slow test (no longer <think> opening w/ new template)

* allow empty tools w/ auto + grammar

* fix & test server grammar & json_schema params w/ & w/o --jinja
2025-02-18 18:03:23 +00:00
Rémy O 2eea03d86a
vulkan: implement several ops relevant for ggml_opt (#11769)
* vulkan: support memset_tensor

* vulkan: support GGML_OP_SUM

* vulkan: implement GGML_OP_ARGMAX

* vulkan: implement GGML_OP_SUB

* vulkan: implement GGML_OP_COUNT_EQUAL

* vulkan: implement GGML_OP_OPT_STEP_ADAMW

* vulkan: fix check_results RWKV_WKV6 crash and memory leaks

* vulkan: implement GGML_OP_REPEAT_BACK

* tests: remove invalid test-backend-ops REPEAT_BACK tests

* vulkan: fix COUNT_EQUAL memset using a fillBuffer command
2025-02-17 07:55:57 +01:00
Olivier Chafik c7f460ab88
`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` (#11607)
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 10:05:16 +00:00
Vinesh Janarthanan 27e8a23300
sampling: add Top-nσ sampler (#11223)
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
bandoti fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf (#11811) 2025-02-12 10:06:53 -04:00
Olivier Chafik 9f4cc8f8d3
`sync`: minja (#11641)
* `sync`: minja

182de30cda

https://github.com/google/minja/pull/46

https://github.com/google/minja/pull/45
2025-02-05 01:00:12 +00:00
Johannes Gäßler fd08255d0d
CUDA: non-contiguous (RMS) norm support (#11659)
* CUDA: non-contiguous (RMS) norm support

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-04 22:21:42 +01:00
Olivier Chafik db288b60cb
`tool-call`: command r7b fix for normal responses (#11608)
* fix command r7b normal response regex + add to server test

* test multiline non-tool-call responses in test-chat
2025-02-04 15:48:53 +00:00
Olivier Chafik bfcce4d693
`tool-call`: support Command R7B (+ return tool_plan "thoughts" in API) (#11585)
* `tool-call`: support Command R7B (w/ tool_plan return)

* `tool-call`: cleaner preservation of tokens + warn when likely bad chat template override

* `tool-call`: test cleanup / handle lazy grammar triggers
2025-02-02 09:25:38 +00:00
Michał Moskal ff227703d6
sampling : support for llguidance grammars (#10224)
* initial porting of previous LLG patch

* update for new APIs

* build: integrate llguidance as an external project

* use '%llguidance' as marker to enable llg lark syntax

* add some docs

* clarify docs

* code style fixes

* remove llguidance.h from .gitignore

* fix tests when llg is enabled

* pass vocab not model to llama_sampler_init_llg()

* copy test-grammar-integration.cpp to test-llguidance.cpp

* clang fmt

* fix ref-count bug

* build and run test

* gbnf -> lark syntax

* conditionally include llguidance test based on LLAMA_LLGUIDANCE flag

* rename llguidance test file to test-grammar-llguidance.cpp

* add gh action for llg test

* align tests with LLG grammar syntax and JSON Schema spec

* llama_tokenizer() in fact requires valid utf8

* update llg

* format file

* add $LLGUIDANCE_LOG_LEVEL support

* fix whitespace

* fix warning

* include <cmath> for INFINITY

* add final newline

* fail llama_sampler_init_llg() at runtime

* Link gbnf_to_lark.py script; fix links; refer to llg docs for lexemes

* simplify #includes

* improve doc string for LLAMA_LLGUIDANCE

* typo in merge

* bump llguidance to 0.6.12
2025-02-02 09:55:32 +02:00
piDack 0cec062a63
llama : add support for GLM-Edge and GLM-Edge-V series models (#10573)
* add glm edge chat model

* use config partial_rotary_factor as rope ratio

* support for glm edge model

* vision model support

* remove debug info

* fix format

* llava.cpp trailing whitespace

* remove unused AutoTokenizer

* Update src/llama.cpp for not contain <|end|> or </s>

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* add edge template

* fix chat template

* fix confict

* fix confict

* fix ci err

* fix format err

* fix template err

* 9b hf chat support

* format

* format clip.cpp

* fix format

* Apply suggestions from code review

* Apply suggestions from code review

* Update examples/llava/clip.cpp

* fix format

* minor : style

---------

Co-authored-by: liyuhang <yuhang.li@zhipuai.cn>
Co-authored-by: piDack <pcdack@hotmail.co>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: liyuhang <yuhang.li@aminer.cn>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-02 09:48:46 +02:00
Olivier Chafik 8b576b6c55
Tool call support (generic + native for Llama, Functionary, Hermes, Mistral, Firefunction, DeepSeek) w/ lazy grammars (#9639)
---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-01-30 19:13:58 +00:00
Akarshan Biswas 6e84b0ab8e
SYCL : SOFTMAX F16 mask support and other fixes (#11261)
Implemented ggml_sycl_op_soft_max() F16 src1(mask) support for which a pragma deprecation warning was added during #5021.
To do this, had to decouple it from ggml_sycl_op_flatten which always considered src1 to be of fp32 type(many OP functions are dependent on it).

* SYCL: SOFTMAX F16 mask support and other fixes

* test-backend-ops: Add F16 mask test cases
2025-01-28 09:56:58 +00:00
Johannes Gäßler 8137b4bb2b
CPU/CUDA: fix (GQA) mul mat back, add CUDA support (#11380) 2025-01-24 12:38:31 +01:00
Jeff Bolz 564804b79b
tests: fix some mul_mat test gaps (#11375)
Now that we have batched mat-vec mul Vulkan shaders for up to n==8,
these tests weren't actually exercising the mat-mat mul path. Test
n==9 as well. Also, change to use all_types.
2025-01-23 14:51:24 -06:00
Olivier Chafik 6171c9d258
Add Jinja template support (#11016)
* Copy minja from 58f0ca6dd7

* Add --jinja and --chat-template-file flags

* Add missing <optional> include

* Avoid print in get_hf_chat_template.py

* No designated initializers yet

* Try and work around msvc++ non-macro max resolution quirk

* Update test_chat_completion.py

* Wire LLM_KV_TOKENIZER_CHAT_TEMPLATE_N in llama_model_chat_template

* Refactor test-chat-template

* Test templates w/ minja

* Fix deprecation

* Add --jinja to llama-run

* Update common_chat_format_example to use minja template wrapper

* Test chat_template in e2e test

* Update utils.py

* Update test_chat_completion.py

* Update run.cpp

* Update arg.cpp

* Refactor common_chat_* functions to accept minja template + use_jinja option

* Attempt to fix linkage of LLAMA_CHATML_TEMPLATE

* Revert LLAMA_CHATML_TEMPLATE refactor

* Normalize newlines in test-chat-templates for windows tests

* Forward decl minja::chat_template to avoid eager json dep

* Flush stdout in chat template before potential crash

* Fix copy elision warning

* Rm unused optional include

* Add missing optional include to server.cpp

* Disable jinja test that has a cryptic windows failure

* minja: fix vigogne (https://github.com/google/minja/pull/22)

* Apply suggestions from code review

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Finish suggested renamings

* Move chat_templates inside server_context + remove mutex

* Update --chat-template-file w/ recent change to --chat-template

* Refactor chat template validation

* Guard against missing eos/bos tokens (null token otherwise throws in llama_vocab::impl::token_get_attr)

* Warn against missing eos / bos tokens when jinja template references them

* rename: common_chat_template[s]

* reinstate assert on chat_templates.template_default

* Update minja to b8437df626

* Update minja to https://github.com/google/minja/pull/25

* Update minja from https://github.com/google/minja/pull/27

* rm unused optional header

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-01-21 13:18:51 +00:00
Georgi Gerganov 4dd34ff831
cmake : add sanitizer flags for llama.cpp (#11279)
* cmake : add sanitizer flags for llama.cpp

ggml-ci

* tests : fix compile warnings

ggml-ci

* cmake : move sanitizer flags to llama_add_compile_flags

ggml-ci

* cmake : move llama.cpp compile flags to top level lists

ggml-ci

* cmake : apply only sanitizer flags at top level

ggml-ci

* tests : fix gguf context use in same_tensor_data

* gguf-test: tensor data comparison

* dummy : trigger ggml-ci

* unicode : silence gcc warnings

ggml-ci

* ci : use sanitizer builds only in Debug mode

ggml-ci

* cmake : add status messages [no ci]

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2025-01-18 16:18:15 +02:00
Jeff Bolz 44e18ef939
vulkan: fix coopmat2 flash attention for non-contiguous inputs (#11281)
Add code similar to mul_mm_cm2 to force alignment of strides, to avoid
a performance regression.

Add noncontiguous FA tests in test-backend-ops.

Fixes #11268.
2025-01-18 09:26:50 +01:00
Jeff Bolz bd38ddea01
vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl (#11166)
* vulkan: support copy from f32 to q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl

Shaders are based on cpy.cu.

* vulkan: support copy from q4_0/q4_1/q5_0/q5_1/q8_0/iq4_nl to f32

* ggml: copy q->f32 assumes some contiguity in the destination
2025-01-16 22:47:10 +01:00
Johannes Gäßler 9c8dcefe17
CUDA: backwards pass for misc. ops, add tests (#11257)
* CUDA: backwards pass for misc. ops, add tests

* remove restrict from pointers
2025-01-16 16:43:38 +01:00
Georgi Gerganov f11cfdfd7f
ci : use -no-cnv in gguf-split tests (#11254)
* ci : use -no-cnv in gguf-split tests

ggml-ci

* ci : use -no-cnv in requantize tests

ggml-ci

* scripts : fix [no ci]
2025-01-15 18:28:35 +02:00
Johannes Gäßler 432df2d5f9
RoPE: fix back, CUDA support for back + noncont. (#11240)
* RoPE: fix back, CUDA support for back + noncont.

* fix comments reg. non-cont. RoPE support [no-ci]
2025-01-15 12:51:37 +01:00
Olivier Chafik 091592d758
Refactor test-chat-template.cpp (#11224)
* Refactor test-chat-template

* Update test-chat-template.cpp
2025-01-14 10:16:41 +00:00
Georgi Gerganov afa8a9ec9b
llama : add `llama_vocab`, functions -> methods, naming (#11110)
* llama : functions -> methods (#11110)

* llama : add struct llama_vocab to the API (#11156)

ggml-ci

* hparams : move vocab params to llama_vocab (#11159)

ggml-ci

* vocab : more pimpl (#11165)

ggml-ci

* vocab : minor tokenization optimizations (#11160)

ggml-ci

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* lora : update API names (#11167)

ggml-ci

* llama : update API names to use correct prefix (#11174)

* llama : update API names to use correct prefix

ggml-ci

* cont

ggml-ci

* cont

ggml-ci

* minor [no ci]

* vocab : llama_vocab_add_[be]os -> llama_vocab_get_add_[be]os (#11174)

ggml-ci

* vocab : llama_vocab_n_vocab -> llama_vocab_n_tokens (#11174)

ggml-ci

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-01-12 11:32:42 +02:00
Molly Sophia ee7136c6d1
llama: add support for QRWKV6 model architecture (#11001)
llama: add support for QRWKV6 model architecture (#11001)

* WIP: Add support for RWKV6Qwen2

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Some graph simplification

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Add support for RWKV6Qwen2 with cpu and cuda GLA

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV6[QWEN2]: Concat lerp weights together to reduce cpu overhead

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix some typos

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* code format changes

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix wkv test & add gla test

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Fix cuda warning

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update README.md

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update ggml/src/ggml-cuda/gla.cu

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Fix fused lerp weights loading with RWKV6

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* better sanity check skipping for QRWKV6 in llama-quant

thanks @compilade

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: compilade <git@compilade.net>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: compilade <git@compilade.net>
2025-01-10 09:58:08 +08:00
Xuan Son Nguyen d9feae1c06
llama-chat : add phi 4 template (#11148) 2025-01-09 10:07:33 +01:00
Johannes Gäßler 53ff6b9b9f
GGUF: C++ refactor, backend support, misc fixes (#11030)
* GGUF: C++ refactor, backend support, misc fixes

remove ggml_tensor.backend

update CODEOWNERS [no ci]

remove gguf_get_data from API

revise GGUF API data types
2025-01-07 18:01:58 +01:00
Georgi Gerganov 47182dd03f
llama : update llama_model API names (#11063)
* llama : deprecate llama_free_model, add llama_model_free

ggml-ci

* llama : change `llama_load_model_from_file` -> `llama_model_load_from_file`

ggml-ci
2025-01-06 10:55:18 +02:00
Jeff Bolz 716bd6dec3
vulkan: optimize mul_mat for small values of N (#10991)
Make the mul_mat_vec shaders support N>1 (as a spec constant, NUM_COLS) where
the batch_strides are overloaded to hold the row strides. Put the loads from the
B matrix in the innermost loop because it should cache better.

Share some code for reducing the result values to memory in mul_mat_vec_base.
2024-12-30 18:27:11 +01:00
Jeff Bolz a813badbbd
vulkan: im2col and matmul optimizations for stable diffusion (#10942)
* tests: Add im2col perf tests

* vulkan: optimize im2col, more elements per thread

* vulkan: increase small tile size for NV_coopmat2

* vulkan: change im2col to 512 elements per workgroup
2024-12-29 10:16:34 +01:00
Yun Dou b92a14a841
llama : support InfiniAI Megrez 3b (#10893)
* Support InfiniAI Megrez 3b

* Fix tokenizer_clean_spaces for megrez
2024-12-23 01:35:44 +01:00
Georgi Gerganov 5cab3e4aaa
llama : minor grammar refactor (#10897)
ggml-ci
2024-12-19 17:42:13 +02:00
Johannes Gäßler cd920d0ac3
tests: disable GGUF test for bad value size (#10886) 2024-12-19 08:53:58 +01:00
Johannes Gäßler 081b29bd2a
tests: add tests for GGUF (#10830) 2024-12-17 19:09:35 +01:00
Georgi Gerganov 0006f5a74a
ggml : update ggml_backend_cpu_device_supports_op (#10867)
* ggml : fix cpy op for IQ-quants to use reference impl

ggml-ci

* ggml : disable tests involving i-matrix quantization

* ggml : update ggml_backend_cpu_device_supports_op

ggml-ci
2024-12-17 18:35:42 +02:00
Georgi Gerganov 644fd71b44
sampling : refactor + optimize penalties sampler (#10803)
* sampling : refactor + optimize penalties sampler

ggml-ci

* common : apply ignore_eos as logit bias

ggml-ci

* batched : remove penalties sampler

* params : allow penalty_last_n == -1 to be equal to context size

ggml-ci

* common : by default, move the penalties at the end of the sampling chain

ggml-ci

* common : ignore all EOG tokens

Co-authored-by: Diego Devesa <slarengh@gmail.com>

* common : move back the penalties at the front of the sampling chain

ggml-ci

* readme : restore hint about --ignore-eos flag [no ci]

* llama : minor

ggml-ci

* webui : update

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-16 12:31:14 +02:00
Valentin Mamedov a0974156f3
llama : add Deepseek MoE v1 & GigaChat models (#10827)
* Add deepseek v1 arch & gigachat template

* improve template code

* add readme

* delete comments

* remove comment

* fix format

* lint llama.cpp

* fix order of deepseek and deepseek2, move gigachat temlate to the end of func

* fix order of deepseek and deepseek2 in constants; mark shared exp as deepseek arch need

* remove comments

* move deepseek above deepseek2

* change placement of gigachat chat template
2024-12-15 19:02:46 +02:00
HimariO ba1cb19cdd
llama : add Qwen2VL support + multimodal RoPE (#10361)
* Barebone Qwen2VL LLM convertor

* Add Qwen2VL cli entrypoint

* [WIP] add qwen2vl arch

* Verify m-rope output

* Add vl-rope/2d-rope support for qwen2vl ViT

* update qwen2vl cli tool

* update 5D tensor op workaround

* [WIP] qwen2vl vision model

* make batch and clip utils compatible with qwen2vl

* [WIP] create inference workflow, gguf convert script but fix

* correcting vision-rope behavior, add the missing last layer back to ViT

* add arg parser to qwen2vl_surgery

* replace variable size array with vector

* cuda-gdb cmake preset

* add fp32 mrope, vision rope kernel

* add fp16 support for qwen2vl and m-rope

* add `GGML_ROPE_TYPE_MROPE`, `GGML_ROPE_TYPE_VISION`

* fix rope op mode switching, out dated func args

* update `llama_hparams`

* update to keep up stream changes

* resolve linter, test errors

* add makefile entry, update speical image padding token

* add mrope unit test, fix few compiler warnings

* rename `mrope` related function, params

* minor updates on debug util, bug fixs

* add `m-rope` testcase to `test-backend-ops`

* Apply suggestions from code review

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix traililng whitespce

* store `llama_hparams.rope_sections` with fixed size array

* update position id tensor size check in GGML_OP_ROPE

* minor updates

* update `ggml_backend_*_supports_op` of unsupported backends

* remote old `rope_section` compare operator

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-12-14 14:43:46 +02:00
Diego Devesa cb13ef85a4
remove CMAKE_WINDOWS_EXPORT_ALL_SYMBOLS (#10797)
other windows build fixes
2024-12-12 19:02:49 +01:00
PAB a8cbab201d
ggml: add `GGML_SET` Metal kernel + i32 CPU kernel (ggml/1037)
* implemented cpu kernel

* add i32 test cases in test-backend-ops

* typedef `ggml_metal_kargs_set`

* implemented `kernel_set`

* memcpy
2024-12-05 13:27:33 +02:00
PAB c2082d93a8
ggml : add `GGML_PAD_REFLECT_1D` operation (ggml/1034)
* ggml_pad_reflect_1d defined in header

* implemented on CPU

* called the forward pass

* impl Metal kernel

* added Metal kernel

* added OP_PAD_REFLECT_1D in test-backend-ops.cpp

* add test-pad-reflect-1d test case

* test case support multiple backend
2024-12-05 13:27:31 +02:00
ltoniazzi 253b7fde91
Fix HF repo commit to clone lora test models (#10649) 2024-12-04 10:45:48 +01:00
Jeff Bolz 2759916d86
vulkan: Implement "fast divide" (mul+shift) for unary ops like copy (#10642) 2024-12-04 08:28:59 +01:00
PAB efb6ae9630 feat: add `GGML_UNARY_OP_ARGMAX` Metal kernel (ggml/1019)
* implemented argmax kernel

* tpig -> tgpig

* change to strides

* contiguous assertions

* kernel working and tested

* argmax simd parallel implementation

* added 2 new tests for argmax in test-backend-ops

* cosmit

* added 3 tests cases for perf eval

* add test_argmax in make_test_cases_perf

* Update test-backend-ops.cpp

Co-authored-by: Diego Devesa <slarengh@gmail.com>

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2024-12-03 20:04:49 +02:00
Georgi Gerganov 0115df2f65
metal : small-batch mat-mul kernels (#10581)
* metal : small-batch mat-mul kernels

ggml-ci

* metal : add rest of types

ggml-ci

* metal : final adjustments

ggml-ci

* metal : add comments

ggml-ci
2024-12-03 11:52:33 +02:00
Xuan Son Nguyen 642330ac7c
llama : add enum for built-in chat templates (#10623)
* llama : add enum for supported chat templates

* use "built-in" instead of "supported"

* arg: print list of built-in templates

* fix test

* update server README
2024-12-02 22:10:19 +01:00
Juk Armstrong 917786f43d
Add `mistral-v1`, `mistral-v3`, `mistral-v3-tekken` and `mistral-v7` chat template types (#10572)
* Templates: `mistral-v1`, `mistral-v2`, `mistral-v3`, `mistral-v3-tekken`

* Changed system message logic and added tests for all 4

* Invalid `system_message` instead of `content` fixed

* Removed tab-indented lines

* Added template code and test for `mistral-v7`

* Added all tests. Fixed bug with `tmpl == "llama2"` test.

* Replaced tabs with spaces.

* Removed `'mistral-v2'` option as no (open) models ever used it

* Removed all references to 'v2' template from comments

* Update llama.cpp

Fixed `trim_assistant_message` bug
2024-12-01 23:09:49 +01:00
Diego Devesa 7cc2d2c889
ggml : move AMX to the CPU backend (#10570)
* ggml : move AMX to the CPU backend

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-29 21:54:58 +01:00
Georgi Gerganov f0678c5ff4
ggml : fix I8MM Q4_1 scaling factor conversion (#10562)
ggml-ci
2024-11-29 16:25:39 +02:00
Jeff Bolz 904109ed0d
vulkan: fix group_norm (#10496)
Fix bad calculation of the end of the range. Add a backend test that
covers the bad case (taken from stable diffusion).

Fixes https://github.com/leejet/stable-diffusion.cpp/issues/439.
2024-11-26 16:45:05 +01:00
Diego Devesa 5931c1f233
ggml : add support for dynamic loading of backends (#10469)
* ggml : add support for dynamic loading of backends

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-11-25 15:13:39 +01:00
Georgi Gerganov f6d12e7df8
tests : fix compile warning 2024-11-25 15:17:32 +02:00
Georgi Gerganov d9d54e498d
speculative : refactor and add a simpler example (#10362)
* speculative : refactor and add a simpler example

ggml-ci

* speculative : clean-up and add comments and TODOs [no ci]

* speculative : manage context in common_speculative

ggml-ci

* speculative : simplify

ggml-ci

* speculative : simplify (cont)

ggml-ci

* speculative : add --draft-min CLI arg

* speculative : minor fixup

* make : build fixes

* speculative : do not redraft previous drafts

ggml-ci

* speculative : fix the draft sampling

ggml-ci

* speculative : fix compile warning

* common : refactor args

ggml-ci

* common : change defaults [no ci]

* common : final touches

ggml-ci
2024-11-25 09:58:41 +02:00
Diego Devesa a5e47592b6
cuda : optimize argmax (#10441)
* cuda : optimize argmax

* remove unused parameter

ggml-ci

* fixup : use full warps

ggml-ci

* Apply suggestions from code review

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>

* fix ub

* ggml : check ne00 <= INT32_MAX in argmax and argsort

---------

Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-11-21 18:18:50 +01:00
Johannes Gäßler 02e4eaf22f
ggml-opt: fix data corruption (ggml/1022) 2024-11-21 09:22:02 +02:00
Jeff Bolz b3e585988f
vulkan: Optimize soft_max (#10301)
* vulkan: Optimize soft_max

Large soft_max could already saturate memory, but small/medium sizes were
pretty slow. The bulk of the gains for them comes from using a smaller
workgroup size, and making the workgroup size match the subgroup size also
makes the barriers much cheaper.

Cache some values in locals to avoid refetching/recomputing. And stamp
out a few "template instantiations" so smaller cases will fully unroll.

Add a missing early return for OOB rows. This happens when there are more
than 512 rows and the dispatch is 512 x H.

* vulkan: Further soft_max optimizations

Restore the workgroup size of 512 case, use it for >1024.

Use unrollable loops for more iteration counts.
2024-11-19 08:25:17 +01:00
Georgi Gerganov 24203e9dd7 ggml : inttypes.h -> cinttypes (#0)
ggml-ci
2024-11-17 08:30:29 +02:00
Georgi Gerganov 84274a10c3 tests : remove test-grad0 2024-11-17 08:30:29 +02:00
Johannes Gäßler 8a43e940ab ggml: new optimization interface (ggml/988) 2024-11-17 08:30:29 +02:00
Diego Devesa ae8de6d50a
ggml : build backends as libraries (#10256)
* ggml : build backends as libraries

---------

Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: R0CKSTAR <xiaodong.ye@mthreads.com>
2024-11-14 18:04:35 +01:00
Jeff Bolz 80dd7ff22f
vulkan: Optimize contiguous copies (#10254)
* tests: Fix memory bandwidth calculation for perf tests

Add a flops calculation for flash attention.

Add one GGML_OP_CPY perf test.

* vulkan: Optimize contiguous copies

Add a variant of the copy shader for when the tensors are contiguous. Avoid
the complex addressing calculations, and do four elements per invocation
to hide some other overhead.

Apply similar changes to the scale shader, since scale is always contiguous.

Add a "progress bar" for shader compiles.
2024-11-13 07:58:57 +01:00
Georgi Gerganov 841f27abdb
metal : optimize FA kernels (#10171)
* ggml : add ggml_flash_attn_ext_get_prec

* metal : use F16 precision in FA kernels

ggml-ci

* metal : minor clean-up

* metal : compile-guard bf16 FA kernels

ggml-ci

* build : remove obsolete compile flag [no ci]

* metal : prevent int overflows [no ci]

* cuda : disable BF16 FA

ggml-ci

* metal : fix BF16 requirement for FA kernels

ggml-ci

* make : clean-up [no ci]
2024-11-08 13:47:22 +02:00
Xuan Son Nguyen a71d81cf8c
server : revamp chat UI with vuejs and daisyui (#10175)
* server : simple chat UI with vuejs and daisyui

* move old files to legacy folder

* embed deps into binary

* basic markdown support

* add conversation history, save to localStorage

* fix bg-base classes

* save theme preferences

* fix tests

* regenerate, edit, copy buttons

* small fixes

* docs: how to use legacy ui

* better error handling

* make CORS preflight more explicit

* add GET method for CORS

* fix tests

* clean up a bit

* better auto scroll

* small fixes

* use collapse-arrow

* fix closeAndSaveConfigDialog

* small fix

* remove console.log

* fix style for <pre> element

* lighter bubble color (less distract when reading)
2024-11-07 17:31:10 -04:00
Zhiyuan Li 3bcd40b3c5
Optimize RWKV6 Operator Naming and Implement Multi-core CPU/ SYCL Acceleration (#10133)
* rwkv6: rename to wkv6

* rwkv6: support avx2 avx512 armv8 armv9

* rwkv6: update cuda file name

* rwkv6: rename params

* wkv on sycl

* sycl: add some ops

* sycl: Enhance OP support judgment

* wkv6: drop armv9 and tranfer to GGML style

ggml-ci

* sync : ggml

* update the function to use appropriate types

* fix define error

* Update ggml/src/ggml-cpu.c

* add appropriate asserts

* move element-wise functions outside

* put the declaration outside the loop

* rewrite to be more inline with the common pattern for distributing threads

* use recommended way GGML_TENSOR_LOCALS

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Diego Devesa <slarengh@gmail.com>
Co-authored-by: Plamen Minev <pacominev@gmail.com>
Co-authored-by: Yuri Khrustalev <ykhrustalev@users.noreply.github.com>
Co-authored-by: Meng, Hengyu <airdldl@163.com>
2024-11-07 15:19:10 +08:00
Georgi Gerganov 5c333e0140
metal : add BF16 support (#8439)
* ggml : add initial BF16 support

ggml-ci

* metal : add mul_mat_id BF16 support

ggml-ci

* metal : check for bfloat support on the Metal device

ggml-ci

* metal : better var names [no ci]

* metal : do not build bfloat kernels when not supported

ggml-ci

* metal : try to fix BF16 support check

ggml-ci

* metal : this should correctly check bfloat support
2024-11-06 19:53:51 +02:00
Diego Devesa 9f40989351
ggml : move CPU backend to a separate file (#10144) 2024-11-03 19:34:08 +01:00
Georgi Gerganov 8d8ff71536
llama : remove Tail-Free sampling (#10071)
ggml-ci
2024-10-29 10:42:05 +02:00
arch-btw 61715d5cc8
llama : Add IBM granite template (#10013)
* Add granite template to llama.cpp

* Add granite template to test-chat-template.cpp

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Update tests/test-chat-template.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Added proper template and expected output

* Small change to \n

Small change to \n

* Add code space &

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

* Fix spacing

* Apply suggestions from code review

* Update src/llama.cpp

---------

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-28 18:45:33 +01:00
wwoodsTM ff252ea48e
llama : add DRY sampler (#9702)
* sampling : add DRY sampler (post-refactor)

* DRY: Trying to fix coauthors, removed unneeded line

* DRY: Fixed redundant code

* DRY: Fixed crash issue due to DRY being in chain but uninitialized

---------

Co-authored-by: l3utterfly <gc.pthzfoldr@gmail.com>
Co-authored-by: pi6am <34464159+pi6am@users.noreply.github.com>
2024-10-25 19:07:34 +03:00
Johannes Gäßler c39665f589
CUDA: fix MMQ for non-contiguous src0, add tests (#10021)
* CUDA: fix MMQ for non-contiguous src0, add tests

* revise test code
2024-10-24 11:09:36 +02:00
Johannes Gäßler 80273a306d CUDA: fix 1D im2col, add tests (ggml/993) 2024-10-23 16:50:02 +03:00
Jun Hee Yoo 4c9388fb96
metal : add POOL2D and fix IM2COL (#9943)
* add pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix im2col and add unittest for N>=1024

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* add tests for N % 1024 != 0

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* remove trailing whitespaces

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply suggestions

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply more optimization

- original IM2COL kernel + _ext with MIN()

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review: change kernel name of pool_2d

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* apply review

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

* fix more formatting and enhance readability

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>

---------

Signed-off-by: Junhee Yoo <junhee.yoo@navercorp.com>
2024-10-23 13:33:45 +03:00
Molly Sophia 11d47057a5
Rwkv chat template fix (#10001)
* llama: remove useless template matching for rwkv-world

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* converter: Add comment about the hack for rwkv models

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* Update src/llama.cpp

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
2024-10-22 15:22:26 +02:00
Molly Sophia 4ff7fe1fb3
llama : add chat template for RWKV-World + fix EOT (#9968)
* Add chat template for RWKV-World

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV: Fix the chat template not being used

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* RWKV v6: Set EOT token to ``\n\n``

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* readme: add rwkv into supported model list

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-10-22 13:33:37 +03:00
Georgi Gerganov 55e47786e3
llama : default sampling changes + greedy update (#9897)
* llama : deprecate softmax sampler + fix dist sampler

ggml-ci

* tests : replace macros with functions

ggml-ci

* sampling : change temperature sampler logic

For t <= 0.0f, keep the max logit intact and set the rest to -inf

* cont : no need for special "greedy" logic

top-k == 1 is the same

* tests : init prob correctly

* llama : handle temp <= 0.0 in the temp_ext sampler too

ggml-ci

* cont : avoid extra loop in temperature sampler for sub-zero temp

ggml-ci
2024-10-21 09:46:40 +03:00
Joe Eli McIlvain 66c2c93082
grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
2024-10-16 19:03:24 +03:00
MaggotHATE fbc98b748e
sampling : add XTC sampler (#9742)
* Initial XTC commit

Adds XTC sampler, not activated by default, but recommended settings by default.

* Cleanup

* Simplified chances calculation

To be more inline with the original implementation, chance is calculated once at the beginning.

* First fixes by comments

Still need to look into sorting

* Fixed trailing backspaces

* Fixed RNG to be reproduceable 

Thanks to @slaren for directions

* Fixed forgotten header

* Moved `min_keep` 

Moved from conditions to a simple check at the end.

* Fixed broken randomization

Thanks to @slaren for explanation

* Swapped sorting for a custom algorithm

Shifts tokens to remove the penalized ones, then puts the penalized at the back. Should make `min_keep` still viable.

* Algorithm rework

1. Scan token from top till the first non-penalizable
2. Remove the last captured token (the least probable above threshold)
3. Shift all tokens to override the remaining penalizable
4. Penalize and put them at the the bottom.

* Added XTC to `test-sampling`

* Simplified algorithm and more tests

* Updated info in common and args

* Merged back lost commits in common and arg

* Update dump info in common

* Fixed incorrect min_keep check

* Added XTC to README

* Renamed parameters, fixed info and defaults

* probability is at 0 by default, but XTC is included in sampling queue
* threshold higher than 0.5 switches XTC off

* Initial server support

* Added XTC to server UIs

* Fixed labels in old server UI

* Made algorithm safer and more readable

* Removed xtc_threshold_max

* Fixed arg after update

* Quick fixes by comments

* Simplified algorithm since threshold_max is removed

* Renamed random distribution

* Fixed tests and outdated README

* Small fixes
2024-10-15 12:54:55 +02:00
Diego Devesa 7eee341bee
common : use common_ prefix for common library functions (#9805)
* common : use common_ prefix for common library functions

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-10-10 22:57:42 +02:00
Diego Devesa dca1d4b58a
ggml : fix BLAS with unsupported types (#9775)
* ggml : do not use BLAS with types without to_float

* ggml : return pointer from ggml_internal_get_type_traits to avoid unnecessary copies

* ggml : rename ggml_internal_get_type_traits -> ggml_get_type_traits

it's not really internal if everybody uses it
2024-10-08 14:21:43 +02:00
Diego Devesa 6374743747
ggml : add backend registry / device interfaces to BLAS backend (#9752)
* ggml : add backend registry / device interfaces to BLAS backend

* fix mmap usage when using host buffers
2024-10-07 21:55:08 +02:00
Johannes Gäßler fabdc3bda3
ggml/ex: calculate accuracy in graph, adapt MNIST (ggml/980) 2024-10-03 21:17:26 +03:00
Diego Devesa c83ad6d01e
ggml-backend : add device and backend reg interfaces (#9707)
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
2024-10-03 01:49:47 +02:00
Johannes Gäßler e98c1c188e
test: fix OPT_STEP_ADAMW for test-backend-ops (ggml/974) 2024-10-01 16:07:40 +03:00
Johannes Gäßler 7254cdf7e8
ggml: fix gradient allocation logic (ggml/966)
* ggml: fix gradient allocation logic

* gradient allocation in ggml_build_backward_expand

* fixup

* fix test-backend-ops grad

* suggestions by slaren

* fix test1.c

* fix legacy opt API

* fix test-grad0

* remove keep arg
2024-10-01 16:07:38 +03:00
slaren 1b2f992cd2
test-backend-ops : use flops for some performance tests (#9657)
* test-backend-ops : use flops for some performance tests

- parallelize tensor quantization

- use a different set of cases for performance and correctness tests

- run each test for at least one second
2024-09-28 14:32:46 +02:00
Zhenwei Jin 6102037bbb
vocab : refactor tokenizer to reduce init overhead (#9449)
* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* refactor tokenizer

* refactor tokenizer

* llama : make llm_tokenizer more private

ggml-ci

* remove unused files

* remove unused fileds to avoid unused filed build error

* avoid symbol link error

* Update src/llama.cpp

* Update src/llama.cpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-09-28 15:10:58 +03:00
Georgi Gerganov b0f27361f3
sampling : avoid expensive softmax during greedy sampling (#9605)
* sampling : avoid expensive softmax during greedy sampling

ggml-ci

* speculative : fix default RNG seed + set sparams.n_probs

* Update tests/test-sampling.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* sampling : add clarifying comment [no ci]

---------

Co-authored-by: slaren <slarengh@gmail.com>
2024-09-24 09:03:17 +03:00
Johannes Gäßler a5b57b08ce
CUDA: enable Gemma FA for HIP/Pascal (#9581) 2024-09-22 09:34:52 +02:00
Molly Sophia 2a63caaa69
RWKV v6: RWKV_WKV op CUDA implementation (#9454)
* ggml: CUDA unary op EXP

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

* ggml: rwkv_wkv op CUDA impl

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>

---------

Signed-off-by: Molly Sophia <mollysophia379@gmail.com>
2024-09-22 04:29:12 +02:00
Johannes Gäßler 424c5d00a9 ggml/examples: add backend support for numerical optimization (ggml/949)
* CUDA eval works

* stochastic gradient descent op

* Adam except decay

* CUDA CROSS_ENTROPY_LOSS_BACK

* CUDA mnist-fc training works

* backend CLI arg

* refactor gguf load

* remove sched from opt_step_adam

* implement l1 regularization (weight decay)

* extra call to add optimizer

* initialize gradients with ggml_graph_reset

* gradient accumulation

* increment iter per eval instead of epoch

* adjust backend interfaces

* fix ggml_graph_reset without backend

* fix ggml graph export/import

* fixup

* rename

* revert ggml_opt changes

* more general CUDA repeat_back

* update documentation, fix CNN

* validation split

* add clarifying comment

* optimize PyTorch training

* adjust buffer size, thread count

* fix 0.0f validation split

* Update examples/mnist/mnist-common.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* fix gradient accumulation

* tensor flag for accumulators -> tensor hash set

* Update include/ggml.h

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* Update tests/test-backend-ops.cpp

Co-authored-by: slaren <slarengh@gmail.com>

* fix test prints

* Update src/ggml-backend.c

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* better CUDA support for noncontiguous out_prod

* add comment

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: slaren <slarengh@gmail.com>
2024-09-20 21:15:05 +03:00