Vinesh Janarthanan
27e8a23300
sampling: add Top-nσ sampler ( #11223 )
...
* initial sampling changes:
* completed top nsigma sampler implementation
* apply parameter to only llama-cli
* updated readme
* added tests and fixed nsigma impl
* cleaned up pr
* format
* format
* format
* removed commented tests
* cleanup pr and remove explicit floats
* added top-k sampler to improve performance
* changed sigma to float
* fixed string format to float
* Update src/llama-sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update common/sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update src/llama-sampling.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* added llama_sampler_init
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2025-02-13 08:45:57 +02:00
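The top-nσ rule added by this commit can be sketched as follows. This is an illustrative sketch of the idea only, not the llama.cpp implementation; the function name and shapes are assumptions. Keep only tokens whose logit lies within n standard deviations of the maximum logit, then softmax over the survivors:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sketch of top-n-sigma sampling: tokens with logit below
// max_logit - n * sigma are masked out before the softmax.
// Illustrative only -- not the llama.cpp implementation.
std::vector<float> top_nsigma_probs(const std::vector<float> & logits, float n) {
    const float max_logit = *std::max_element(logits.begin(), logits.end());

    // mean and standard deviation of the logits
    float mean = 0.0f;
    for (float l : logits) mean += l;
    mean /= logits.size();
    float var = 0.0f;
    for (float l : logits) var += (l - mean) * (l - mean);
    const float sigma = std::sqrt(var / logits.size());

    // mask out tokens below the threshold, softmax over the rest
    const float thresh = max_logit - n * sigma;
    std::vector<float> probs(logits.size(), 0.0f);
    float sum = 0.0f;
    for (size_t i = 0; i < logits.size(); ++i) {
        if (logits[i] >= thresh) {
            probs[i] = std::exp(logits[i] - max_logit);
            sum += probs[i];
        }
    }
    for (float & p : probs) p /= sum;
    return probs;
}
```

The commit log also mentions applying a top-k pass first for performance; that pre-filter is omitted here for brevity.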
Oleksandr Kuvshynov
e4376270d9
llama.cpp: fix warning message ( #11839 )
...
There was a typo-like error which would print the same number twice if a
request was received with n_predict greater than the server-side configuration.
Before the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 4096 exceeds server configuration, setting to 4096
```
After the fix:
```
slot launch_slot_: id 0 | task 0 | n_predict = 8192 exceeds server configuration, setting to 4096
```
2025-02-13 08:25:34 +02:00
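The bug pattern behind this fix can be sketched as follows: the requested value was clamped first, and the log line then printed the already-clamped variable in both positions. This is a hedged sketch; the function and variable names are illustrative, not the actual server code:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Sketch of the fixed log line: the first placeholder must receive the
// *requested* n_predict, the second the server-side maximum. Before the
// fix, both arguments were the maximum, so the message printed the same
// number twice. Names are illustrative.
std::string warn_n_predict(int n_predict_requested, int n_predict_max) {
    char buf[128];
    snprintf(buf, sizeof(buf),
             "n_predict = %d exceeds server configuration, setting to %d",
             n_predict_requested, n_predict_max);
    return buf;
}
```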
Daniel Bevenius
3e69319772
llama : update llama_decode_internal ref [no ci] ( #11840 )
...
This commit updates the comment in llama_kv_cache.h to reflect the
change of the function name from llama_decode_internal to
llama_decode_impl.
2025-02-13 08:07:51 +02:00
Diego Devesa
a394039db0
ggml-cpu : add chunking support to mul_mat_id ( #11666 )
...
* ggml-cpu : add chunking support to mul_mat_id
* allocate chunk counter in wdata
parallelize src1 quantization by column to allow parallelization even when there is only one row
* disable for arm
* cleanup
* better way to disable for arm
* fix uninitialized counter when using 1 thread only
* revert test-backend-ops changes
2025-02-13 01:02:38 +01:00
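The chunking scheme described above can be sketched like this: instead of a fixed static split of the work, threads repeatedly grab the next chunk index from a shared counter, which balances the load when rows are distributed unevenly across experts. A minimal sketch under assumed names; the real code stores the counter in ggml's wdata and performs a matmul rather than the stand-in loop shown here:

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// Sketch of dynamic chunking: a shared atomic counter hands out chunk
// indices so faster threads pick up more chunks. Illustrative only.
void sum_chunked(const std::vector<float> & src, std::vector<float> & dst,
                 int n_threads, int chunk_size) {
    std::atomic<int> next_chunk{0};
    const int n_chunks = (int)((src.size() + chunk_size - 1) / chunk_size);

    auto worker = [&]() {
        for (;;) {
            const int c = next_chunk.fetch_add(1); // claim the next chunk
            if (c >= n_chunks) break;
            const size_t lo = (size_t)c * chunk_size;
            const size_t hi = std::min(src.size(), lo + (size_t)chunk_size);
            for (size_t i = lo; i < hi; ++i) {
                dst[i] = src[i] * 2.0f; // stand-in for the per-chunk matmul work
            }
        }
    };

    std::vector<std::thread> threads;
    for (int t = 0; t < n_threads; ++t) threads.emplace_back(worker);
    for (auto & th : threads) th.join();
}
```

Note the fix in the commit log for the single-thread case: the counter must still be initialized even when only one thread runs.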
Xuan-Son Nguyen
be3bbd6215
ggml : x2 speed for WASM by optimizing SIMD ( #11453 )
...
* ggml : x2 speed for WASM by optimizing SIMD
* fix bad merging
* rm trailing spaces
* rm redundant clamp
* better quantize_row_q8_K
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
* remove memset that causes buffer overflow
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
---------
Co-authored-by: camel-cdr <camel-cdr@protonmail.com>
2025-02-13 00:33:45 +01:00
Woof Dog
31afcbee0e
server : (webui) Give copy button back to all message bubbles ( #11814 )
...
* All messages get the copy button
* Update index.html.gz
2025-02-12 23:47:11 +01:00
uvos
5c4284d57b
HIP: Remove GCN from list of devices that avoid MMQ ( #11831 )
2025-02-12 22:25:28 +01:00
JC
bfd11a2344
Fix: Compile failure due to Microsoft STL breaking change ( #11836 )
2025-02-12 21:36:11 +01:00
Georgi Gerganov
0fb77f821f
sync : ggml
2025-02-12 21:46:02 +02:00
uvos
e598697d63
HIP: Switch to std::vector in rocblas version check ( #11820 )
2025-02-12 17:25:03 +01:00
Georgi Gerganov
fbe6a07256
context : rename to llama_context_kv_self
2025-02-12 17:16:44 +02:00
Georgi Gerganov
6ee86e5e0f
graph : restore ubatch in build_cb
...
ggml-ci
2025-02-12 16:29:15 +02:00
bandoti
fef0cbeadf
cleanup: fix compile warnings associated with gnu_printf ( #11811 )
2025-02-12 10:06:53 -04:00
Richard
748ee9fe93
ggml : fix multi-threaded clamp_f32 ( #11824 )
...
* Bug fix for clamp_f32
When using tensors larger than 1D, the clamp operation does not work due to the restriction of returning early when ith is not 0.
* Bug fix for clamp_f32
* Bug fix for clamp_f32
2025-02-12 15:57:33 +02:00
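The corrected pattern can be sketched as follows: rather than having every thread but the first return early (which left part of a multi-row tensor unclamped), each thread clamps its own strided share of the rows. This is an illustrative sketch with assumed names, not the ggml code:

```cpp
#include <algorithm>
#include <cassert>
#include <thread>
#include <vector>

// Sketch of multi-threaded clamp: thread ith handles rows
// ith, ith + nth, ith + 2*nth, ... so all rows are covered.
// Illustrative only -- not the ggml implementation.
void clamp_rows(std::vector<std::vector<float>> & rows,
                float lo, float hi, int nth) {
    auto worker = [&](int ith) {
        for (size_t r = ith; r < rows.size(); r += nth) {
            for (float & x : rows[r]) {
                x = std::min(std::max(x, lo), hi);
            }
        }
    };
    std::vector<std::thread> threads;
    for (int t = 0; t < nth; ++t) threads.emplace_back(worker, t);
    for (auto & th : threads) th.join();
}
```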
Georgi Gerganov
f63aeecce6
llama : models now build their graphs using llama_graph_i
...
ggml-ci
2025-02-12 15:08:40 +02:00
Weizhao Ouyang
198b1ec611
ggml-cpu: Fix duplicate MATMUL_INT8 ( #11817 )
...
Signed-off-by: Weizhao Ouyang <o451686892@gmail.com>
2025-02-12 13:22:58 +01:00
Johannes Gäßler
c3d6af7cd2
CUDA: fix CUDART_VERSION checks ( #11821 )
2025-02-12 13:16:39 +01:00
Georgi Gerganov
0ab50f1bbb
context : prepare llama_model graph build
...
ggml-ci
2025-02-12 14:09:55 +02:00
Georgi Gerganov
e633dc171a
context : introduce llama_graph_i
...
ggml-ci
2025-02-12 13:49:44 +02:00
Georgi Gerganov
5eae8e5183
context : move build_rope_factors to base class
...
ggml-ci
2025-02-12 13:32:02 +02:00
Georgi Gerganov
d146a14f77
context : minor naming fix
2025-02-12 12:41:36 +02:00
Georgi Gerganov
8da7f612b7
context : improve llama_context encapsulation
...
ggml-ci
2025-02-12 12:15:04 +02:00
Georgi Gerganov
b52b79b048
context : move encode/decode to llama-context.cpp
2025-02-12 11:23:38 +02:00
Daniel Bevenius
369be5598a
llama : fix typo in llama-grammar.h [no ci] ( #11816 )
2025-02-12 09:40:01 +02:00
lhez
4078c77f98
docs: add OpenCL ( #11697 )
2025-02-11 15:04:13 -07:00
Georgi Gerganov
02ef4be975
context : initial abstraction
...
ggml-ci
2025-02-11 22:27:21 +02:00
Sheldon Robinson
90e4dba461
Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx ( #11803 )
...
* Fix #11802 : Compile bug - RegQueryValueExA changed to RegQueryValueEx
* Fix #11802 : PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Daniel Bevenius
a18f481f99
server : use common_token_to_piece instead of common_detokenize ( #11740 )
...
* server : use common_token_to_piece instead of common_detokenize
This commit replaces the call to common_detokenize with
common_token_to_piece in populate_token_probs.
The motivation for this change is to avoid an issue where
common_detokenize would remove the word-boundary character for tokens,
which caused a regression in the server-generated token probabilities.
Resolves: https://github.com/ggerganov/llama.cpp/issues/11728
* squash! server : use common_token_to_piece instead of common_detokenize
Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
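The regression described above can be illustrated with a toy model of the two calls (this is not the llama.cpp API; all names here are hypothetical). A detokenizer that trims the leading word-boundary marker is fine on a whole sequence, but applied to a single token it silently drops the boundary, whereas converting a token to its raw piece keeps the marker intact:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Toy stand-in for token_to_piece: render the sentencepiece-style
// boundary marker '_' as a leading space, keeping the boundary visible.
static std::string toy_token_to_piece(const std::string & piece) {
    std::string out = piece;
    if (!out.empty() && out[0] == '_') out[0] = ' ';
    return out;
}

// Toy stand-in for detokenize: join pieces, then trim the leading space
// at the *sequence* start. Correct on a full sequence, but called on a
// single mid-sentence token it discards the word boundary.
static std::string toy_detokenize(const std::vector<std::string> & pieces) {
    std::string out;
    for (const auto & p : pieces) out += toy_token_to_piece(p);
    if (!out.empty() && out[0] == ' ') out.erase(0, 1);
    return out;
}
```

In this toy model, detokenizing the single token "_world" yields "world" with the boundary lost, while the per-token piece conversion preserves the leading space, which is what per-token probability output needs.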
Johannes Gäßler
b9ab0a4d0b
CUDA: use arch list for compatibility check ( #11775 )
...
* CUDA: use arch list for feature availability check
---------
Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush
7b891bdc86
fix: typos in documentation files ( #11791 )
...
* Update ggml.c
* Update arg.cpp
* Update speculative.h
2025-02-10 23:21:31 +01:00
jason_w
81732619fd
docs: utilize the forward slash (/) as the path separator for Unix-like systems ( #11770 )
2025-02-10 23:17:48 +01:00
Xuan-Son Nguyen
507f9174fe
server : (webui) introduce conversation branching + idb storage ( #11792 )
...
* server : (webui) introduce conversation branching + idb storage
* mark old conv as "migrated" instead of deleting them
* improve migration
* add more comments
* more clarification
2025-02-10 21:23:17 +01:00
Wilken Gottwalt
19b392d58d
llama-mmap: fix missing include ( #11796 )
...
Technically the fixed-width types come only from the iostream and
cstdint/stdint.h headers; the memory and vector headers should not provide
them. In GCC 15 the headers were cleaned up, so the proper header, cstdint,
is now required.
src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
26 | uint32_t read_u32() const;
| ^~~~~~~~
2025-02-10 20:58:18 +02:00
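The fix amounts to including the right header explicitly rather than relying on transitive includes. A minimal sketch (the struct name is an illustrative stand-in, not the llama-mmap code):

```cpp
// With GCC 15's cleaned-up libstdc++ headers, fixed-width integer types
// must be pulled in explicitly from <cstdint>; relying on <memory> or
// <vector> to provide uint32_t no longer compiles.
#include <cassert>
#include <cstdint>

struct reader {                 // illustrative stand-in for llama-mmap's reader
    uint32_t value = 0;
    uint32_t read_u32() const { return value; }
};
```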
Xuan-Son Nguyen
0893e0114e
server : correct signal handler ( #11795 )
2025-02-10 18:03:28 +01:00
Georgi Gerganov
2cd8a903c8
context : make output functions members
...
ggml-ci
2025-02-10 17:01:27 +02:00
Georgi Gerganov
d1d8d53008
bman : remove ubatch member
...
ggml-ci
2025-02-10 16:50:14 +02:00
Georgi Gerganov
ef358ee78f
context : add decode/encode
...
ggml-ci
2025-02-10 16:14:13 +02:00
Georgi Gerganov
879ba82777
server : increase context size for the tests
...
ggml-ci
2025-02-10 15:00:02 +02:00
Georgi Gerganov
f9971ef2e1
llama : dedup reserve code
2025-02-10 14:59:51 +02:00
Georgi Gerganov
972f91c7d7
Merge branch 'master' into gg/llama-kv-cache
...
ggml-ci
2025-02-10 14:45:54 +02:00
Olivier Chafik
d7b31a9d84
sync: minja ( a72057e519) ( #11774 )
2025-02-10 09:34:09 +00:00
pascal-lc
9ac3457b39
Update README.md [no ci] ( #11781 )
...
typo: `\` -> `/`
Change the UNIX path separator to `/`.
2025-02-10 09:05:57 +01:00
Danny Milosavljevic
c2a67efe38
vulkan: Make Vulkan optional at runtime ( #11493 ). ( #11494 )
...
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna
b044a0fe3c
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation ( #11592 )
2025-02-10 07:08:22 +01:00
Eric Curtin
19d3c8293b
There's a better way of clearing lines ( #11756 )
...
Use the ANSI escape code for clearing a line.
Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
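The technique mentioned above can be sketched in one helper (a minimal illustration, not the actual code from the PR): a carriage return moves the cursor to column 0, and the ANSI "erase in line" sequence ESC[K clears from the cursor to the end of the line, so a progress line can be redrawn without padding it with spaces.

```cpp
#include <cassert>
#include <string>

// "\r" returns the cursor to column 0; "\x1b[K" (CSI K, erase in line)
// clears from the cursor to the end of the line.
inline std::string clear_line() {
    return "\r\x1b[K";
}
```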
Jeff Bolz
98f6b0fd1e
vulkan: account for lookup tables when checking shared memory size ( #11502 )
2025-02-09 08:43:51 +01:00
Xuan-Son Nguyen
55ac8c7791
server : (webui) revamp Settings dialog, add Pyodide interpreter ( #11759 )
...
* redo Settings modal UI
* add python code interpreter
* fix auto scroll
* build
* fix overflow for long output lines
* bring back sticky copy button
* adapt layout on mobile view
* fix multiple lines output and color scheme
* handle python exception
* better state management
* add webworker
* add headers
* format code
* speed up by loading pyodide on page load
* (small tweak) add a small animation to make it feel like claude
2025-02-08 21:54:50 +01:00
Woof Dog
e6e6583199
server : (webui) increase edit textarea size ( #11763 )
2025-02-08 20:09:55 +01:00
Georgi Gerganov
aaa5505307
server : minor log updates ( #11760 )
...
ggml-ci
2025-02-08 18:08:43 +02:00
Georgi Gerganov
bdcf8b6a56
cont : fix mmap flag print ( #11699 )
2025-02-08 16:49:38 +02:00