    ollamarunner: Ensure batch size limits are not exceeded · 5d097277
    Jesse Gross authored
    With the llama runner, we can generate up to NUM_PARALLEL batches
    at once, which then get broken up into individual batches to be
    executed by llama.cpp (i.e. we add up to 2048 tokens and this gets
    split into 4 batches of 512 tokens at default settings).
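
    As a rough sketch of that splitting (hypothetical Go, not the
    runner's actual code; splitBatch and batchSize are illustrative
    names): the queued tokens are carved into chunks of at most
    batchSize tokens each, so 2048 tokens become 4 chunks of 512.

        // splitBatch carves queued tokens into chunks of at most batchSize.
        func splitBatch(tokens []int32, batchSize int) [][]int32 {
            var chunks [][]int32
            for len(tokens) > 0 {
                n := batchSize
                if len(tokens) < n {
                    n = len(tokens)
                }
                chunks = append(chunks, tokens[:n])
                tokens = tokens[n:]
            }
            return chunks
        }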
    
    This splitting can improve parallelism on multi-GPU systems because
    the individual batches can move through the pipeline without blocking
    on the first one to fully complete. However, we don't yet support
    this in the Ollama runner, partially because it makes it hard to
    enforce model-specified batch constraints, which didn't exist
    previously.
    
    The result is that we try to execute the full, unsplit batch, which
    can lead to out-of-memory or insufficient KV cache space errors.
    
    This commit triggers batch breaking when the total number of inputs
    from all sequences exceeds the batch size, rather than per sequence.
    To ensure fairness, it also reintroduces round-robining across
    sequences so that one busy sequence doesn't starve the others.
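
    As a hedged sketch of that idea (hypothetical types and names, not
    the actual runner.go code): inputs are pulled from the active
    sequences in round-robin order, and the batch is cut off once the
    combined count from all sequences reaches the batch size.

        type input int32 // stand-in for one token (or image) input

        type sequence struct {
            pending []input // inputs not yet scheduled into a batch
        }

        // fillBatch takes one input per sequence per pass until either
        // the total across all sequences reaches batchSize or every
        // sequence is drained, so a busy sequence cannot starve the rest.
        func fillBatch(seqs []*sequence, batchSize int) []input {
            batch := make([]input, 0, batchSize)
            for {
                progress := false
                for _, s := range seqs {
                    if len(batch) >= batchSize {
                        return batch // total limit across all sequences reached
                    }
                    if len(s.pending) == 0 {
                        continue
                    }
                    batch = append(batch, s.pending[0])
                    s.pending = s.pending[1:]
                    progress = true
                }
                if !progress {
                    return batch // every sequence is drained
                }
            }
        }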
runner.go 21.9 KB