Commit Graph

4834 Commits

Author SHA1 Message Date
Johannes Gäßler c3d6af7cd2
CUDA: fix CUDART_VERSION checks (#11821) 2025-02-12 13:16:39 +01:00
Georgi Gerganov 0ab50f1bbb
context : prepare llama_model graph build
ggml-ci
2025-02-12 14:09:55 +02:00
Georgi Gerganov e633dc171a
context : introduce llama_graph_i
ggml-ci
2025-02-12 13:49:44 +02:00
Georgi Gerganov 5eae8e5183
context : move build_rope_factors to base class
ggml-ci
2025-02-12 13:32:02 +02:00
Georgi Gerganov d146a14f77
context : minor naming fix 2025-02-12 12:41:36 +02:00
Georgi Gerganov 8da7f612b7
context : improve llama_context encapsulation
ggml-ci
2025-02-12 12:15:04 +02:00
Georgi Gerganov b52b79b048
context : move encode/decode to llama-context.cpp 2025-02-12 11:23:38 +02:00
Daniel Bevenius 369be5598a
llama : fix typo in llama-grammar.h [no ci] (#11816) 2025-02-12 09:40:01 +02:00
lhez 4078c77f98
docs: add OpenCL (#11697) 2025-02-11 15:04:13 -07:00
Georgi Gerganov 02ef4be975
context : initial abstraction
ggml-ci
2025-02-11 22:27:21 +02:00
Sheldon Robinson 90e4dba461
Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx (#11803)
* Fix #11802: Compile bug - RegQueryValueExA changed to RegQueryValueEx

* Fix #11802: PR #11803 - keep RegQueryValueExA, remove TEXT macro, description needs to be ANSI string
2025-02-11 16:55:45 +01:00
Daniel Bevenius a18f481f99
server : use common_token_to_piece instead of common_detokenize (#11740)
* server : use common_token_to_piece instead of common_detokenize

This commit replaces the call to common_detokenize with
common_token_to_piece in the populate_token_probs.

The motivation for this change is to avoid an issue where
common_detokenize would remove the word boundary character for tokens,
which caused a regression in the server generated token probabilities.

Resolves: https://github.com/ggerganov/llama.cpp/issues/11728

* squash! server : use common_token_to_piece instead of common_detokenize

Use common_token_to_piece for post_sampling_probs as well.
2025-02-11 14:06:45 +01:00
Johannes Gäßler b9ab0a4d0b
CUDA: use arch list for compatibility check (#11775)
* CUDA: use arch list for feature availability check

---------

Co-authored-by: Diego Devesa <slarengh@gmail.com>
2025-02-11 00:17:22 +01:00
Maxim Evtush 7b891bdc86
fix: typos in documentation files (#11791)
* Update ggml.c

* Update arg.cpp

* Update speculative.h
2025-02-10 23:21:31 +01:00
jason_w 81732619fd
docs: utilize the forward slash (/) as the path separator for Unix-like systems (#11770) 2025-02-10 23:17:48 +01:00
Xuan-Son Nguyen 507f9174fe
server : (webui) introduce conversation branching + idb storage (#11792)
* server : (webui) introduce conversation branching + idb storage

* mark old conv as "migrated" instead deleting them

* improve migration

* add more comments

* more clarification
2025-02-10 21:23:17 +01:00
Wilken Gottwalt 19b392d58d
llama-mmap: fix missing include (#11796)
Technically the fixed width types come only from iostream and
cstdint/stdint.h headers. memory and vector headers should not provide
these. In GCC 15 the headers are cleaned up and you require the proper
header cstdint.

src/llama-mmap.h:26:5: error: ‘uint32_t’ does not name a type
   26 |     uint32_t read_u32() const;
      |     ^~~~~~~~
2025-02-10 20:58:18 +02:00
Xuan-Son Nguyen 0893e0114e
server : correct signal handler (#11795) 2025-02-10 18:03:28 +01:00
Georgi Gerganov 2cd8a903c8
context : make output functions members
ggml-ci
2025-02-10 17:01:27 +02:00
Georgi Gerganov d1d8d53008
bman : remove ubatch member
ggml-ci
2025-02-10 16:50:14 +02:00
Georgi Gerganov ef358ee78f
context : add decode/encode
ggml-ci
2025-02-10 16:14:13 +02:00
Georgi Gerganov 879ba82777
server : increase context size for the tests
ggml-ci
2025-02-10 15:00:02 +02:00
Georgi Gerganov f9971ef2e1
llama : dedup reserve code 2025-02-10 14:59:51 +02:00
Georgi Gerganov 972f91c7d7
Merge branch 'master' into gg/llama-kv-cache
ggml-ci
2025-02-10 14:45:54 +02:00
Olivier Chafik d7b31a9d84
sync: minja (a72057e519) (#11774) 2025-02-10 09:34:09 +00:00
pascal-lc 9ac3457b39
Update README.md [no ci] (#11781)
typo: `\` -> `/`
Change the UNIX path separator to` \`.
2025-02-10 09:05:57 +01:00
Danny Milosavljevic c2a67efe38
vulkan: Make Vulkan optional at runtime (#11493). (#11494)
Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
2025-02-10 07:17:21 +01:00
Wagner Bruna b044a0fe3c
vulkan: add environment variable GGML_VK_PREFER_HOST_MEMORY to avoid VRAM allocation (#11592) 2025-02-10 07:08:22 +01:00
Eric Curtin 19d3c8293b
There's a better way of clearing lines (#11756)
Use the ANSI escape code for clearing a line.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-09 10:34:49 +00:00
Jeff Bolz 98f6b0fd1e
vulkan: account for lookup tables when checking shared memory size (#11502) 2025-02-09 08:43:51 +01:00
Xuan-Son Nguyen 55ac8c7791
server : (webui) revamp Settings dialog, add Pyodide interpreter (#11759)
* redo Settings modal UI

* add python code interpreter

* fix auto scroll

* build

* fix overflow for long output lines

* bring back sticky copy button

* adapt layout on mobile view

* fix multiple lines output and color scheme

* handle python exception

* better state management

* add webworker

* add headers

* format code

* speed up by loading pyodide on page load

* (small tweak) add small animation to make it feels like claude
2025-02-08 21:54:50 +01:00
Woof Dog e6e6583199
server : (webui) increase edit textarea size (#11763) 2025-02-08 20:09:55 +01:00
Georgi Gerganov aaa5505307
server : minor log updates (#11760)
ggml-ci
2025-02-08 18:08:43 +02:00
Georgi Gerganov bdcf8b6a56
cont : fix mmap flag print (#11699) 2025-02-08 16:49:38 +02:00
Karol Kontny 4d3465c5ae
ggml: Fix data race in ggml threadpool (#11736)
After the barrier in last iteration is executed, still the loop termination
condition will be executed. However main thread can destroy the cgraph object
and its nodes already, then another thread will access it, but the thing is already gone.
Also trouble can happen when n_nodes == 0 or abort is called, but I'm not sure if the
prior situation is possible.

Last syncronization should be done after the loop to ensure the cgraph/cplan won't be
accessed after the main thread exits from the function.
2025-02-08 15:30:53 +01:00
Johannes Gäßler d80be897ac
CUDA: fix min. version for movmatrix (#11751) 2025-02-08 10:46:07 +01:00
Nikolaos Pothitos 3ab410f55f
readme : update front-end framework (#11753)
After the migration to React with #11688
2025-02-08 10:43:04 +01:00
Xuan-Son Nguyen 0cf867160c
server : (webui) fix numeric settings being saved as string (#11739)
* server : (webui) fix numeric settings being saved as string

* add some more comments
2025-02-08 10:42:34 +01:00
Eric Curtin d2fe216fb2
Make logging more verbose (#11714)
Debugged an issue with a user who was on a read-only filesystem.

Signed-off-by: Eric Curtin <ecurtin@redhat.com>
2025-02-07 14:42:46 +00:00
Georgi Gerganov ed926d8833
llama : fix defrag logic (#11707)
* llama : fix defrag logic

ggml-ci

* cont : better logic

ggml-ci

* cont : clamp fragmentation to 0.0

ggml-ci
2025-02-07 16:05:34 +02:00
Christian Fillion 2d219b389e
vocab : ignore invalid UTF-8 input in the BPE tokenizer (#11729)
Silently insert U+FFFD(s) (Unicode replacement character) instead until the
next valid codepoint can be found.

This fixes `llama_tokenize` throwing an exception across the C API boundary
or libllama's module boundary (the caller's runtime might be incompatible!)

Returing a proper error code might be desirable, however the signature
of `llama_tokenize` doesn't allow it as all return values already have
existing meaning.
2025-02-07 15:55:47 +02:00
magicse 333820d749
llama : fix progress dots (#11730)
* Update llama.cpp

For display progress dots in terminal.
Without this it didn't display dots progress during loading model from file.

* Update llama.cpp

removed trailing spaces
2025-02-07 15:48:47 +02:00
Jeff Bolz c026ba3c23
vulkan: print shared memory size (#11719) 2025-02-07 11:26:03 +01:00
Christian Fillion 7ee953a64a
llama : add llama_sampler_init for safe usage of llama_sampler_free (#11727)
The C API in llama.h claims users can implement `llama_sampler_i` to
create custom `llama_sampler`. The sampler chain takes ownership and
calls `llama_sampler_free` on them. However, `llama_sampler_free` is
hard-coded to use `delete`. This is undefined behavior if the object
wasn't also allocated via `new` from libllama's C++ runtime. Callers
in C and C-compatible languages do not use C++'s `new` operator. C++
callers may not be sharing the same heap as libllama.
2025-02-07 11:33:27 +02:00
Akarshan Biswas ec3bc8270b
SYCL: remove XMX info from print devices (#11712) 2025-02-07 09:27:53 +00:00
Daniel Bevenius b7552cfcbc
common : add default embeddings presets (#11677)
* common : add default embeddings presets

This commit adds default embeddings presets for the following models:
- bge-small-en-v1.5
- e5-small-v2
- gte-small

These can be used with llama-embedding and llama-server.

For example, with llama-embedding:
```console
./build/bin/llama-embedding --embd-gte-small-default -p "Hello, how are you?"
```

And with llama-server:
```console
./build/bin/llama-server --embd-gte-small-default
```
And the embeddings endpoint can then be called with a POST request:
```console
curl --request POST \
    --url http://localhost:8080/embeddings \
    --header "Content-Type: application/json" \
    --data '{"input": "Hello, how are you?"}'
```

I'm not sure if these are the most common embedding models but hopefully
this can be a good starting point for discussion and further
improvements.

Refs: https://github.com/ggerganov/llama.cpp/issues/10932
2025-02-07 09:15:22 +01:00
Jinyang He 225bbbfa39
ggml : optimize and build warning fix for LoongArch (#11709)
* ggml : optimize convert f32<->f16 for loongarch_asx

* ggml : optimize loongarch_asx extend i16,i8,u8 to i32,i16

* ggml : Fix warnings when run cpu CI locally on LoongArch
2025-02-07 09:38:31 +02:00
tv1wnd 855cd0734a
llama : fix old glm4 models (#11670) 2025-02-06 22:48:51 +01:00
Georgi Gerganov 8a59053f63
sync : ggml 2025-02-06 21:23:03 +02:00
Patrick Peng 1d20e53c40
rpc: fix known RCE in rpc-server (ggml/1103)
Add bounds checking in `rpc_server::copy_tensor` to prevent out-of-bounds writes
+ Check if  `(uint8_t *)dst->data + ggml_nbytes(src)` remains within the destination buffer’s allocated region.
2025-02-06 21:22:54 +02:00