1. 14 Apr, 2025 2 commits
  2. 11 Apr, 2025 4 commits
    • ggml: Fix memory leak on input tensors · f50d6912
      Jesse Gross authored
      For every forward pass through the model, we need to allocate input
      tensors: tokens, images, positions, outputs and masks. These get
      allocated in system memory.
      
      However, when we close the context that the tensors were allocated
      through, the metadata gets freed but the actual backend memory does
      not. This results in a significant memory leak.
      
      This change frees all of the memory allocated through a context
      when the context is closed.
      
      Fixes #10040
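
      The pattern, roughly (a minimal Go sketch with a hypothetical
      backendBuffer type and a simplified Context, not the actual ggml
      bindings): the context records every backend buffer it allocates so
      that Close releases the backing memory along with the metadata.

        package main

        import "fmt"

        // backendBuffer stands in for a backend allocation (hypothetical type).
        type backendBuffer struct{ name string }

        func (b *backendBuffer) free() { fmt.Println("freed", b.name) }

        // Context tracks every buffer allocated through it.
        type Context struct {
            buffers []*backendBuffer
        }

        // alloc records the buffer so Close can free it later.
        func (c *Context) alloc(name string) *backendBuffer {
            b := &backendBuffer{name: name}
            c.buffers = append(c.buffers, b)
            return b
        }

        // Close frees the backend memory, not just the Go-side metadata.
        func (c *Context) Close() {
            for _, b := range c.buffers {
                b.free()
            }
            c.buffers = nil
        }

        func main() {
            ctx := &Context{}
            ctx.alloc("tokens")
            ctx.alloc("positions")
            ctx.Close() // both input tensors are released here
        }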
    • ggml: Don't allocate CPU buffers as CUDA Host buffers · 34c3b68f
      Jesse Gross authored
      Allocating (and in particular, freeing) memory from CUDA host buffers
      is expensive and can cause a significant performance hit if we do
      it for every token. Using normal system memory avoids this issue
      and also gives the OS more flexibility to manage it.
      
      This patch has no direct performance impact (either positive or
      negative), but it makes a difference once we start freeing memory
      correctly.
    • ggml: Use pointer receivers for Context · f33ccd5d
      Jesse Gross authored
      Context currently mixes pointer and value receivers. Change these
      to all be pointer receivers so we don't have to reason about whether
      the fields we update in the struct will be retained.
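
      For illustration (a generic Go sketch, not the actual Context type):
      a method with a value receiver mutates a copy, so the update is
      silently dropped, while a pointer receiver mutates the original.

        package main

        import "fmt"

        type counter struct{ n int }

        // incByValue receives a copy of counter; the increment is lost.
        func (c counter) incByValue() { c.n++ }

        // incByPointer receives a pointer; the increment is retained.
        func (c *counter) incByPointer() { c.n++ }

        func main() {
            c := counter{}
            c.incByValue()
            fmt.Println(c.n) // 0: only the copy was updated
            c.incByPointer()
            fmt.Println(c.n) // 1: the update is retained
        }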
    • ggml: Log filesystem errors · bc108b9a
      Jesse Gross authored
      Sometimes loading the GGUF file fails with:
      panic: context canceled
      
      This is probably a filesystem error, but the panic doesn't provide
      any information about what actually happened.
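
      The general shape of the fix (a sketch with a hypothetical loadGGUF
      helper, not the runner's actual code): wrap the underlying error
      before surfacing it so the log says which file and operation failed
      instead of only "context canceled".

        package main

        import (
            "fmt"
            "log/slog"
            "os"
        )

        // loadGGUF is a hypothetical loader; the point is the error wrapping.
        func loadGGUF(path string) error {
            f, err := os.Open(path)
            if err != nil {
                return fmt.Errorf("opening GGUF %q: %w", path, err)
            }
            defer f.Close()
            // ... parse the file ...
            return nil
        }

        func main() {
            if err := loadGGUF("model.gguf"); err != nil {
                // Log the full error chain rather than panicking with a
                // bare message.
                slog.Error("failed to load GGUF", "error", err)
            }
        }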
  3. 10 Apr, 2025 1 commit
  4. 09 Apr, 2025 2 commits
  5. 08 Apr, 2025 7 commits
  6. 07 Apr, 2025 6 commits
  7. 05 Apr, 2025 1 commit
  8. 03 Apr, 2025 4 commits
  9. 02 Apr, 2025 5 commits
  10. 01 Apr, 2025 4 commits
  11. 31 Mar, 2025 4 commits
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear the KV cache when the shift operation is not supported by the
      model. Added a KvCacheCanShift() check to handle models that can't
      perform cache shifts, falling back to a full cache clear while
      preserving the logical token history to maintain the expected
      behavior when the context window fills up.
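
      Roughly the shape of the fallback (an illustrative sketch with a
      hypothetical cache type; KvCacheCanShift() is the check mentioned
      above, the rest is made up for the example): if the model can't
      shift, clear the cache and rebuild it from the retained token
      history.

        package main

        import "fmt"

        // kvCache is a hypothetical view of a sequence's cache state.
        type kvCache struct {
            canShift bool
            tokens   []int // logical token history
        }

        func (c *kvCache) KvCacheCanShift() bool { return c.canShift }

        // discard frees space when the context window fills up.
        func (c *kvCache) discard(n int) {
            if c.KvCacheCanShift() {
                fmt.Printf("shift: drop the oldest %d cache entries in place\n", n)
                return
            }
            // No shift support: clear the whole cache. The logical token
            // history is kept so it can be re-processed into the fresh cache.
            fmt.Println("clear: rebuilding from", len(c.tokens), "retained tokens")
        }

        func main() {
            c := &kvCache{canShift: false, tokens: []int{1, 2, 3, 4, 5, 6, 7, 8}}
            c.discard(2)
        }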
    • server/internal/client/ollama: cache completed chunks (#9933) · ef27d52e
      Blake Mizerany authored
      This change adds tracking of download chunks during the pull process so
      that subsequent pulls can skip downloading already completed chunks.
      This works across restarts of ollama.
      
      Currently, download state will be lost if a prune is triggered during a
      pull (e.g. restart or remove). This issue should be addressed in a
      follow-up PR.
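
      One way to picture the bookkeeping (an illustrative sketch, not the
      client's actual on-disk format): record each completed chunk's byte
      range in a small manifest next to the download and consult it before
      requesting that range again.

        package main

        import (
            "encoding/json"
            "fmt"
            "os"
        )

        // chunk is a hypothetical record of a completed byte range.
        type chunk struct {
            Start, End int64
        }

        // markComplete appends a finished chunk to the manifest on disk so
        // the state survives a restart of the pull.
        func markComplete(manifest string, c chunk) error {
            chunks, _ := loadManifest(manifest)
            chunks = append(chunks, c)
            data, err := json.Marshal(chunks)
            if err != nil {
                return err
            }
            return os.WriteFile(manifest, data, 0o644)
        }

        // loadManifest reads the chunks recorded by earlier pulls.
        func loadManifest(manifest string) ([]chunk, error) {
            data, err := os.ReadFile(manifest)
            if err != nil {
                return nil, err
            }
            var chunks []chunk
            if err := json.Unmarshal(data, &chunks); err != nil {
                return nil, err
            }
            return chunks, nil
        }

        func main() {
            _ = markComplete("blob.chunks.json", chunk{Start: 0, End: 1 << 20})
            done, _ := loadManifest("blob.chunks.json")
            fmt.Println("chunks already downloaded:", len(done))
        }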
    • runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
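
      The general shape of the fix (a sketch using golang.org/x/sync/semaphore;
      the findSlot helper is illustrative, not the runner's real API): every
      error path between acquiring the semaphore and finding a slot must
      release it.

        package main

        import (
            "context"
            "errors"
            "fmt"

            "golang.org/x/sync/semaphore"
        )

        var errNoSlot = errors.New("no free slot")

        // findSlot stands in for the runner's slot lookup.
        func findSlot() (int, error) { return 0, errNoSlot }

        func handleRequest(ctx context.Context, sem *semaphore.Weighted) error {
            if err := sem.Acquire(ctx, 1); err != nil {
                return err
            }

            slot, err := findSlot()
            if err != nil {
                // Without this release, a failure here permanently reduces the
                // number of parallel sequences and eventually deadlocks.
                sem.Release(1)
                return fmt.Errorf("failed to find a slot for the new sequence: %w", err)
            }
            defer sem.Release(1)

            fmt.Println("processing in slot", slot)
            return nil
        }

        func main() {
            sem := semaphore.NewWeighted(4)
            if err := handleRequest(context.Background(), sem); err != nil {
                fmt.Println("error:", err)
            }
        }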
    • ollamarunner: Ensure batch size limits are not exceeded · 5d097277
      Jesse Gross authored
      With the llama runner, we can generate up to NUM_PARALLEL batches
      at once, which then get broken up into individual batches that are
      executed by llama.cpp (i.e. we add up to 2048 tokens and this gets
      split into 4 batches of 512 tokens at default settings).
      
      This splitting can improve parallelism on multi-GPU systems because
      the individual batches can move through the pipeline without blocking
      on the first one to fully complete. However, we don't yet support
      this in the Ollama runner, partially because it makes it hard to
      enforce model-specified batch constraints, which didn't exist
      previously.
      
      The result is that we will try to execute the full, unsplit batch.
      This could result in out of memory or insufficient KV cache space
      errors.
      
      This change triggers batch breaking when the total number of inputs
      from all sequences exceeds the batch size, rather than checking per
      sequence. To ensure fairness, it also reintroduces round-robin
      scheduling across sequences so that one busy sequence can't starve
      the others.
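
      An illustrative sketch of the scheduling idea (hypothetical types, not
      the ollamarunner code): fill a batch by round-robinning across
      sequences and stop once the batch size would be exceeded, leaving the
      remainder for the next batch.

        package main

        import "fmt"

        const batchSize = 4

        // seq is a hypothetical pending sequence with inputs left to process.
        type seq struct {
            id      int
            pending []string
        }

        // fillBatch round-robins across sequences, stopping when the batch is
        // full so no batch exceeds the limit and no busy sequence starves the
        // others.
        func fillBatch(seqs []*seq) []string {
            var batch []string
            for added := true; added && len(batch) < batchSize; {
                added = false
                for _, s := range seqs {
                    if len(batch) >= batchSize {
                        break
                    }
                    if len(s.pending) == 0 {
                        continue
                    }
                    batch = append(batch, s.pending[0])
                    s.pending = s.pending[1:]
                    added = true
                }
            }
            return batch
        }

        func main() {
            seqs := []*seq{
                {id: 0, pending: []string{"a0", "a1", "a2", "a3"}},
                {id: 1, pending: []string{"b0"}},
            }
            fmt.Println(fillBatch(seqs)) // [a0 b0 a1 a2]: capped at batchSize
            fmt.Println(fillBatch(seqs)) // [a3]: remainder goes in the next batch
        }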