    ollamarunner: Preallocate worst case graph at startup · dbb149e6
    Jesse Gross authored
    Currently, the KV cache and graph are lazily allocated as needed.
    The cache is fully allocated on first use of the corresponding
    layer, whereas the graph grows with the size of the context.
    
    This can be an issue if another application allocates more VRAM
    after we do our calculations: Ollama will crash in the middle of
    inference. If we instead allocate the maximum needed memory at
    runner startup, we will either succeed or fail at that point
    rather than at some surprising time in the future.
    
    Currently, this only generates a worst-case batch for text, which
    means that vision models may get a partial allocation and continue
    to lazily allocate the rest.
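
    A minimal sketch of the idea, not Ollama's actual code: the Model,
    worstCaseBatch, and Reserve names below are hypothetical stand-ins.
    It only shows the shape of the fix: build the largest batch the
    runner could ever see and force the full allocation once, at
    startup, so an out-of-memory failure surfaces there rather than at
    some surprising point mid-inference.

        package main

        import (
            "fmt"
            "log"
        )

        // Model is a hypothetical stand-in for a loaded runner.
        type Model struct {
            contextLen int
            batchSize  int
        }

        // worstCaseBatch builds the largest text batch the runner could
        // see. Token values don't matter for sizing; only the shape does.
        // Note this covers text only; vision inputs would need their own
        // worst case, matching the limitation described above.
        func (m *Model) worstCaseBatch() []int32 {
            return make([]int32, m.batchSize)
        }

        // Reserve forces the KV cache and compute graph to their maximum
        // size up front, instead of letting them grow lazily per layer and
        // per context step. Any out-of-memory error surfaces here.
        func (m *Model) Reserve() error {
            batch := m.worstCaseBatch()
            // Hypothetical: a real runner would execute or size the full
            // graph over this batch at maximum context length here.
            if len(batch) == 0 {
                return fmt.Errorf("empty worst-case batch")
            }
            return nil
        }

        func main() {
            m := &Model{contextLen: 4096, batchSize: 512}
            if err := m.Reserve(); err != nil {
                // Fail fast at startup rather than crash during inference.
                log.Fatalf("preallocation failed: %v", err)
            }
            fmt.Println("worst-case graph reserved")
        }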
wrapper.go 2.46 KB