Commit Graph

4801 Commits

Author SHA1 Message Date
Olivier Chafik c7f460ab88
`server`: fix tool-call of DeepSeek R1 Qwen, return reasoning_content (Command 7RB & DeepSeek R1) unless `--reasoning-format none` (#11607)
* extract & return thoughts in reasoning_content field (unless --reasoning-format) for DeepSeek R1 & Command R7B

* tool-calls: add deepseek r1 template (models/templates/llama-cpp-deepseek-r1.jinja) + hackommodate broken official template

* tool-calls: accommodate variety of wrong tool call opening tags both R1 Qwen 32B and 7B distills like to spit out

* server/oai: ensure content is null when there are tool calls, and reasoning_content appears before content for readability

* tool-calls: add DeepSeek R1 Qwen distills to server/README.md & server tests

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 10:05:16 +00:00
Vinesh Janarthanan 27e8a23300
sampling: add Top-nσ sampler (#11223)
* initial sampling changes:

* completed top nsigma sampler implementation

* apply parameter to only llama-cli

* updated readme

* added tests and fixed nsigma impl

* cleaned up pr

* format

* format

* format

* removed commented tests

* cleanup pr and remove explicit floats

* added top-k sampler to improve performance

* changed sigma to float

* fixed string format to float

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update common/sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update src/llama-sampling.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* added llama_sampler_init

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
Oleksandr Kuvshynov e4376270d9
llama.cpp: fix warning message (#11839)
There was a typo-like error, which would print the same number twice if
request is received with n_predict > server-side config.

Before the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```

After the fix:
```
slot launch_slot_: id  0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
2025-02-13 08:25:34 +02:00
Daniel Bevenius 3e69319772
llama : update llama_decode_internal ref [no ci] (#11840)
This commit updates the comment in llama_kv_cache.h to reflect the
change of the function name from llama_decode_internal to
llama_decode_impl.
2025-02-13 08:07:51 +02:00
Diego Devesa a394039db0
ggml-cpu : add chunking support to mul_mat_id (#11666)
* ggml-cpu : add chunking support to mul_mat_id

* allocate chunk counter in wdata
parallelize src1 quantization by column to allows parallelization even when there is only one row

* disable for arm

* cleanup

* better way to disable for arm

* fix uninitialized counter when using 1 thread only

* revert test-backend-ops changes
2025-02-13 01:02:38 +01:00
Xuan-Son Nguyen be3bbd6215
ggml : x2 speed for WASM by optimizing SIMD (#11453)
* ggml : x2 speed for WASM by optimizing SIMD

* fix bad merging

* rm trailing spaces

* rm redundant clamp

* better quantize_row_q8_K

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>

---------

Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
2025-02-13 00:33:45 +01:00
Woof Dog 31afcbee0e
server : (webui) Give copy button back to all message bubbles (#11814)
* All messages get the copy button

* Update index.html.gz
2025-02-12 23:47:11 +01:00
uvos 5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ (#11831) 2025-02-12 22:25:28 +01:00
JC bfd11a2344
Fix: Compile failure due to Microsoft STL breaking change (#11836) 2025-02-12 21:36:11 +01:00
Georgi Gerganov 0fb77f821f
sync : ggml 2025-02-12 21:46:02 +02:00
uvos e598697d63
HIP: Switch to std::vector in rocblas version check (#11820) 2025-02-12 17:25:03 +01:00
Georgi Gerganov fbe6a07256
context : rename to llama_context_kv_self 2025-02-12 17:16:44 +02:00
Georgi Gerganov 6ee86e5e0f
graph : restore ubatch in build_cb
ggml-ci
2025-02-12 16:29:15 +02:00
bandoti fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf (#11811) 2025-02-12 10:06:53 -04:00
Richard 748ee9fe93
ggml : fix multi-threaded clamp_f32 (#11824)
* Bug fix for clamp_f32

When using tensors larger than 1d clamp operation does not work due to the restriction of returning if ith is not 0.

* Bug fix for clamp_f32

* Bug fix for clamp_f32
2025-02-12 15:57:33 +02:00
Georgi Gerganov f63aeecce6
llama : models now build their graphs using llama_graph_i
ggml-ci
2025-02-12 15:08:40 +02:00
Weizhao Ouyang 198b1ec611
ggml-cpu: Fix duplicate MATMUL_INT8 (#11817)
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-02-12 13:22:58 +01:00
Johannes Gäßler c3d6af7cd2
CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
Georgi Gerganov 0ab50f1bbb
context : prepare llama_model graph build
ggml-ci
2025-02-12 14:09:55 +02:00
Georgi Gerganov e633dc171a
context : introduce llama_graph_i
ggml-ci
2025-02-12 13:49:44 +02:00
Georgi Gerganov 5eae8e5183
context : move build_rope_factors to base class
ggml-ci
2025-02-12 13:32:02 +02:00
Georgi Gerganov d146a14f77
context : minor naming fix 2025-02-12 12:41:36 +02:00
Georgi Gerganov 8da7f612b7
context : improve llama_context encapsulation
ggml-ci
2025-02-12 12:15:04 +02:00
Georgi Gerganov b52b79b048
context : move encode/decode to llama-context.cpp 2025-02-12 11:23:38 +02:00
Daniel Bevenius 369be5598a
llama : fix typo in llama-grammar.h [no ci] (#11816) 2025-02-12 09:40:01 +02:00
lhez 4078c77f98
docs: add OpenCL (#11697) 2025-02-11 15:04:13 -07:00
Georgi Gerganov 02ef4be975
context : initial abstraction
ggml-ci
2025-02-11 22:27:21 +02:00
Sheldon Robinson 90e4dba461
Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx (#11803)
* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Daniel Bevenius a18f481f99
server : use common_token_to_piece instead of common_detokenize (#11740)
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.

Resolves: https://github.com/ggerganov/llama.cpp/issues/11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
Johannes Gäßler b9ab0a4d0b
CUDA: use arch list for compatibility check (#11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush 7b891bdc86
fix: typos in documentation files (#11791)
* Update ggml.c

* Update arg.cpp

* Update speculative.h
2025-02-10 23:21:31 +01:00
jason_w 81732619fd
docs: utilize the forward slash (/) as the path separator for Unix-like systems (#11770) 2025-02-10 23:17:48 +01:00
Xuan-Son Nguyen 507f9174fe
server : (webui) introduce conversation branching + idb storage (#11792)
* server : (webui) introduce conversation branching + idb storage

* mark old conv as "migrated" instead deleting them

* improve migration

* add more comments

* more clarification
2025-02-10 21:23:17 +01:00
Wilken Gottwalt 19b392d58d
llama-mmap: fix missing include (#11796)
Technically the fixed width types come only from iostream and
cstdint/stdint.h headers. memory and vector headers should not provide
these. In GCC 15 the headers are cleaned up and you require the proper
header cstdint.

src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
   26 |     uint32_t read_u32() const;
      |     ^~~~~~~~
2025-02-10 20:58:18 +02:00
Xuan-Son Nguyen 0893e0114e
server : correct signal handler (#11795) 2025-02-10 18:03:28 +01:00
Georgi Gerganov 2cd8a903c8
context : make output functions members
ggml-ci
2025-02-10 17:01:27 +02:00
Georgi Gerganov d1d8d53008
bman : remove ubatch member
ggml-ci
2025-02-10 16:50:14 +02:00
Georgi Gerganov ef358ee78f
context : add decode/encode
ggml-ci
2025-02-10 16:14:13 +02:00
Georgi Gerganov 879ba82777
server : increase context size for the tests
ggml-ci
2025-02-10 15:00:02 +02:00
Georgi Gerganov f9971ef2e1
llama : dedup reserve code 2025-02-10 14:59:51 +02:00
Georgi Gerganov 972f91c7d7
Merge branch 'master' into gg/llama-kv-cache
ggml-ci
2025-02-10 14:45:54 +02:00
Olivier Chafik d7b31a9d84
sync: minja (a72057e519) (#11774) 2025-02-10 09:34:09 +00:00
pascal-lc 9ac3457b39
Update README.md [no ci] (#11781)
typo: `\` -> `/`
Change the UNIX path separator to` \`.
2025-02-10 09:05:57 +01:00
Danny Milosavljevic c2a67efe38
vulkan: Make Vulkan optional at runtime (#11493). (#11494)
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna b044a0fe3c
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592) 2025-02-10 07:08:22 +01:00
Eric Curtin 19d3c8293b
There's a better way of clearing lines (#11756)
Use the ANSI escape code for clearing a line.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
Jeff Bolz 98f6b0fd1e
vulkan: account for lookup tables when checking shared memory size (#11502) 2025-02-09 08:43:51 +01:00
Xuan-Son Nguyen 55ac8c7791
server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759)
* redo Settings modal UI

* add python code interpreter

* fix auto scroll

* build

* fix overflow for long output lines

* bring back sticky copy button

* adapt layout on mobile view

* fix multiple lines output and color scheme

* handle python exception

* better state management

* add webworker

* add headers

* format code

* speed up by loading pyodide on page load

* (small tweak) add small animation to make it feels like claude
2025-02-08 21:54:50 +01:00
Woof Dog e6e6583199
server : (webui) increase edit textarea size (#11763) 2025-02-08 20:09:55 +01:00
Georgi Gerganov aaa5505307
server : minor log updates (#11760)
ggml-ci
2025-02-08 18:08:43 +02:00