From 27228447d98a56e799a0cc91ed39b736ad0fcf6e Mon Sep 17 00:00:00 2001
From: ytian218
Date: Wed, 10 Dec 2025 18:32:53 -0500
Subject: [PATCH] server: fix crash when batch > ubatch with embeddings (#12836)

Fixes #12836, where the server crashes with a GGML_ASSERT failure when
running with embeddings enabled and n_batch > n_ubatch.

Root cause: embeddings use non-causal attention, which requires all
tokens to be processed within a single ubatch. When n_batch > n_ubatch,
the server attempts to split processing across ubatches, triggering the
assertion failure.

Solution:
- Add parameter validation in main() after common_params_parse()
- When embeddings are enabled and n_batch > n_ubatch:
  * Log warnings explaining the issue
  * Automatically set n_batch = n_ubatch
  * Prevent the server crash

This follows the approach suggested by @ggerganov in issue #12836.

Note: this supersedes stalled PR #12940, which attempted a runtime fix
in the old examples/server/server.cpp location. This implementation
validates at startup in tools/server/server.cpp (the current location).

Testing:
- Build: compiles successfully
- Validation triggers: warns when -b > -ub with --embedding
- Auto-correction works: adjusts n_batch = n_ubatch
- No false positives: valid params don't trigger warnings
- Verified on macOS M3 Pro with an embedding model
---
 tools/server/server.cpp | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/server/server.cpp b/tools/server/server.cpp
index 0f39def379..4f9b78b48a 100644
--- a/tools/server/server.cpp
+++ b/tools/server/server.cpp
@@ -3657,6 +3657,15 @@ int main(int argc, char ** argv) {
         return 1;
     }
 
+    // validate batch size for embeddings
+    // embeddings require all tokens to be processed in a single ubatch
+    // see https://github.com/ggml-org/llama.cpp/issues/12836
+    if (params.embedding && params.n_batch > params.n_ubatch) {
+        LOG_WRN("%s: embeddings enabled with n_batch (%d) > n_ubatch (%d)\n", __func__, params.n_batch, params.n_ubatch);
+        LOG_WRN("%s: setting n_batch = n_ubatch = %d to avoid assertion failure\n", __func__, params.n_ubatch);
+        params.n_batch = params.n_ubatch;
+    }
+
     // TODO: should we have a separate n_parallel parameter for the server?
     // https://github.com/ggml-org/llama.cpp/pull/16736#discussion_r2483763177
     // TODO: this is a common configuration that is suitable for most local use cases
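
For illustration only, the warn-and-clamp logic added above can be exercised in
isolation with the small standalone sketch below. The names params_t and
clamp_batch_for_embeddings are hypothetical, chosen just for this sketch; the
server itself operates on common_params directly inside main(), exactly as in
the diff.

    // Standalone sketch of the warn-and-clamp behaviour added by this patch.
    // params_t and clamp_batch_for_embeddings are illustrative names, not server code.
    #include <cstdio>

    struct params_t {
        bool embedding = false;   // --embedding
        int  n_batch   = 2048;    // -b
        int  n_ubatch  = 512;     // -ub
    };

    // Returns true if n_batch had to be reduced to n_ubatch.
    static bool clamp_batch_for_embeddings(params_t & params) {
        if (params.embedding && params.n_batch > params.n_ubatch) {
            fprintf(stderr, "warning: embeddings enabled with n_batch (%d) > n_ubatch (%d); "
                            "setting n_batch = n_ubatch\n", params.n_batch, params.n_ubatch);
            params.n_batch = params.n_ubatch;
            return true;
        }
        return false;
    }

    int main() {
        params_t params;
        params.embedding = true;    // e.g. started with --embedding
        params.n_batch   = 4096;    // -b 4096
        params.n_ubatch  = 512;     // -ub 512
        clamp_batch_for_embeddings(params);   // clamps n_batch down to 512
        printf("n_batch = %d, n_ubatch = %d\n", params.n_batch, params.n_ubatch);
        return 0;
    }

Warning and clamping, rather than erroring out, keeps an existing command line
such as `llama-server -m model.gguf --embedding -b 4096 -ub 512` usable: the
server starts with n_batch reduced to 512 instead of crashing on the assertion.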