Add support for encoder-decoder models in llama-server, matching the
behavior of llama-cli. This enables translation models like MADLAD
and other T5-based models to work with the server.
Changes:
- Add a has_encoder flag, set via llama_model_has_encoder() at model load time (see the sketch after this list)
- Call llama_encode() to run the encoder pass over the prompt
- Seed the decoder with the model's decoder start token after encoding, falling back to BOS when the model does not define one
- Clear decoder KV cache before each new request (no prefix caching)
- Disable features that are incompatible with encoder-decoder models:
  - Context shift (the encoder output is computed once for the full input and cannot be shifted)
  - Speculative decoding (not supported for these models)
  - Prompt caching (encoder outputs depend on the entire input, so a cached prefix is never reusable)
  - Slot selection by longest-common-prefix similarity (prefix overlap is meaningless when the whole prompt is re-encoded)
- Add edge-case handling for tokens that yield empty text
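
A minimal sketch of the load-time detection and feature gating, assuming the llama.cpp C API: llama_model_has_encoder() is the real API call, while the server_state struct and its flag names here are hypothetical stand-ins for the server's actual state:

```cpp
#include "llama.h"

// Sketch only: server_state and its fields are illustrative;
// llama_model_has_encoder() is the actual llama.cpp API call.
struct server_state {
    llama_model * model        = nullptr;
    bool          has_encoder  = false; // set once at model load time
    bool          ctx_shift    = true;
    bool          speculative  = true;
    bool          cache_prompt = true;
};

static void detect_encoder_decoder(server_state & st) {
    st.has_encoder = llama_model_has_encoder(st.model);
    if (st.has_encoder) {
        // The encoder output is fixed for a given input, so decoder-side
        // features that assume a resumable, shiftable context are disabled.
        st.ctx_shift    = false;
        st.speculative  = false;
        st.cache_prompt = false;
    }
}
```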
The encoder processes the full prompt in a single pass; the decoder then
generates output token by token, cross-attending to the encoder's hidden
states (see the sketch below).
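
A minimal sketch of that per-request flow, assuming the llama.cpp C API: llama_encode(), llama_decode(), llama_model_decoder_start_token(), llama_batch_get_one(), and llama_kv_cache_clear() are real calls, though exact names and signatures differ between llama.cpp versions; the function name, greedy sampling, and omitted error handling are illustrative:

```cpp
#include "llama.h"
#include <vector>

// Sketch of one encoder-decoder request. Error handling and sampling are
// simplified, and some API signatures vary across llama.cpp versions.
static void run_enc_dec_request(llama_context * ctx, const llama_model * model,
                                std::vector<llama_token> prompt, int n_predict) {
    // No prefix caching for enc-dec: start each request from a clean cache.
    llama_kv_cache_clear(ctx);

    // 1. Encoder pass over the full prompt; this produces the hidden
    //    states the decoder cross-attends to.
    llama_encode(ctx, llama_batch_get_one(prompt.data(), (int32_t) prompt.size()));

    // 2. Seed the decoder with the model's decoder start token,
    //    falling back to BOS if the model does not define one.
    llama_token cur = llama_model_decoder_start_token(model);
    if (cur == -1) { // LLAMA_TOKEN_NULL
        cur = llama_token_bos(model);
    }

    // 3. Decode loop: feed one token, sample the next (greedy here for
    //    illustration; the server uses its full sampler chain).
    for (int i = 0; i < n_predict; i++) {
        llama_decode(ctx, llama_batch_get_one(&cur, 1));

        const float * logits  = llama_get_logits_ith(ctx, 0);
        const int     n_vocab = llama_n_vocab(model);

        llama_token best = 0;
        for (llama_token v = 1; v < n_vocab; v++) {
            if (logits[v] > logits[best]) {
                best = v;
            }
        }
        cur = best;

        if (llama_token_is_eog(model, cur)) {
            break;
        }
        // (the server would stream the detokenized piece here)
    }
}
```

This mirrors the flow llama-cli already uses for encoder-decoder models: one encoder pass per request, then a plain decoder loop, which is why the cache is cleared up front rather than reused across requests.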