# llama.cpp/examples/server/tests/features/truncation.feature

# run with: ./tests.sh --no-skipped --tags truncation
@truncation
@slow
Feature: Chat truncation

  Background: Server startup
    Given a server listening on localhost:8080
    And   a model file mistral-7b-v0.2-iq3_s-imat.gguf from HF repo ggml-org/models
    And   prompt caching is enabled
    And   a list of stop strings ["\n"]
    And   82 tokens to keep
    And   256 KV cache size
    And   32 server max tokens to predict
    Then  the server is starting
    Then  the server is healthy

  Scenario: Correctly truncate the prompt when the prompt exceeds the context size
    Given a prompt:
    """
    Continue the chat below.
    Me: Hey there, how's it going?
    You: I'm doing well, thanks for asking! How are you?
    Me: I'm doing good, just trying to get some work done. How's your day?
    You: My day has been pretty productive so far. I've been working on a new project.
    Me: That's great to hear! What's the new project you're working on?
    You: It's a web application that's designed to help people manage their personal finances. I'm really excited about it.
    Me: That sounds really useful, I'd be interested to hear more about it. Do you have a timeframe for when you expect to have it ready to launch?
    You: I'm aiming to have the initial version ready within the next few months. I want to ensure it's robust before launching it.
    Me: That's really nice, are you happy with the progress so far?

    """
    And   an ongoing completion request
    Then  -1 tokens are predicted matching You:
    Given an ongoing prompt:
    """

    Me: I have one more question for you my friend. What's the most value thing you learned during your development journey?

    """
    And   52 tokens to truncate
    And   a completion request with no api error
    Then  -1 tokens are predicted matching You:
    # 28 because the '\n' stop string is not pushed to the context
    And   28 prompt tokens are processed