    ollamarunner: Worst case batch for token generation · 26465fb8
    Jesse Gross authored
    We currently allocate the worst case batch only at the maximum
    batch size, which corresponds to prompt processing. However,
    in some cases the generated graph differs between small and
    large batches. To ensure that we don't need to allocate memory
    later, after layout has taken place, we should run the worst
    case batch both ways and take the larger amount of memory.
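
    The approach described above can be sketched as follows. This is an
    illustrative Go sketch, not the actual runner.go code: the names
    `measure`, `worstCase`, and `graphSize`, and the byte values, are all
    hypothetical stand-ins for measuring the compute graph at both batch
    sizes and reserving the larger result.

    ```go
    package main

    import "fmt"

    // graphSize stands in for the memory (in bytes) required by a compute
    // graph; in the real runner this would come from building the graph.
    type graphSize uint64

    // measure is a hypothetical helper: it would build the graph for a
    // batch with the given number of tokens and return its memory
    // requirement. Stubbed here with illustrative values to show that a
    // single-token batch can produce a different (even larger) graph.
    func measure(batchTokens int) graphSize {
    	if batchTokens > 1 {
    		return 512 // large (prompt-processing) batch
    	}
    	return 640 // single-token (generation) batch uses a different graph
    }

    // worstCase runs the worst case batch both ways -- a max sized
    // prompt-processing batch and a single-token generation batch -- and
    // reserves the larger of the two, so no further allocation is needed
    // after memory layout has taken place.
    func worstCase(maxBatch int) graphSize {
    	prompt := measure(maxBatch)
    	generate := measure(1)
    	if generate > prompt {
    		return generate
    	}
    	return prompt
    }

    func main() {
    	fmt.Println(worstCase(512)) // reserves the larger of the two graphs
    }
    ```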
    
    This does not noticeably affect loading speed, as the most expensive
    part of this logic is image processing, which does not occur during
    token generation.