This commit updates the backend sampling implementation to support
intermixed usage of backend and CPU samplers within the same batch.
The initial implementation was developed as an all-or-nothing solution:
either perform backend sampling for the entire batch, or perform CPU
sampling for the entire batch.
The motivation for this change is to support batches with mixed
sequences. For example, we may have a backend sampler configured for
sequence 0, while sequence 1 in the same batch uses CPU sampling. This
was not supported in the initial implementation.
This issue manifested in llama-server with the webui: decoding with
backend samplers would work initially, but after changing to CPU
sampling, a slot (sequence) could still be using a backend sampler.
This meant that logits in output_reserve would not be allocated,
resulting in an error.
The solution in this commit inspects the batch to determine which
sampling modes are needed and allocates buffers accordingly. However,
there is a known inefficiency: when we have intermixed backend/CPU
samplers in the same batch, we currently copy all logits to the host,
even for sequences using backend samplers.
Added test_backend_cpu_mixed_batch to verify correct behavior with
mixed backend/CPU samplers in a single batch, including dynamic
sampler switching between decode calls.