1. 11 Apr, 2025 2 commits
    • ggml: Use pointer receivers for Context · f33ccd5d
      Jesse Gross authored
      Context is currently a mix of pointer and value receivers. Change it to
      use pointer receivers throughout so we don't have to reason about whether
      updates to the struct's fields will be retained.
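      The motivation is the usual Go pitfall: a value receiver operates on a
      copy, so field updates silently disappear. A minimal sketch of the
      difference (the struct and field names here are made up, not the actual
      ggml Context):

        package main

        import "fmt"

        // ctx is a stand-in struct, not the real Context; the field is hypothetical.
        type ctx struct {
            nodes int
        }

        // addByValue receives a copy, so the increment is lost when it returns.
        func (c ctx) addByValue() { c.nodes++ }

        // addByPointer mutates the caller's struct, so the update is retained.
        func (c *ctx) addByPointer() { c.nodes++ }

        func main() {
            c := ctx{}
            c.addByValue()
            c.addByPointer()
            fmt.Println(c.nodes) // prints 1: only the pointer-receiver update survived
        }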
    • ggml: Log filesystem errors · bc108b9a
      Jesse Gross authored
      Sometimes loading the GGUF file fails with:
      panic: context canceled
      
      This is probably a filesystem error, but the panic gives no information
      about what actually happened, so log the underlying error.
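      The fix is in the spirit of wrapping the underlying error with enough
      context to identify the file; a rough sketch under assumed names
      (loadGGUF is illustrative, not the actual loader):

        package model

        import (
            "fmt"
            "os"
        )

        // loadGGUF is illustrative only; the real loader differs.
        func loadGGUF(path string) error {
            f, err := os.Open(path)
            if err != nil {
                // Wrap the error with the path so the log says what actually
                // failed, instead of surfacing only "panic: context canceled".
                return fmt.Errorf("open gguf %q: %w", path, err)
            }
            defer f.Close()
            // ... read and validate the GGUF header here ...
            return nil
        }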
  2. 10 Apr, 2025 1 commit
  3. 09 Apr, 2025 2 commits
  4. 08 Apr, 2025 7 commits
  5. 07 Apr, 2025 6 commits
  6. 05 Apr, 2025 1 commit
  7. 03 Apr, 2025 4 commits
  8. 02 Apr, 2025 5 commits
  9. 01 Apr, 2025 4 commits
  10. 31 Mar, 2025 5 commits
    • runner: clear cache when shift is not possible (#9433) · 66b25392
      Bruce MacDonald authored
      Clear the KV cache when the shift operation is not supported by the model.
      Adds a KvCacheCanShift() check to handle models that can't perform cache
      shifts, falling back to a full cache clear while preserving the logical
      token history so that behavior stays as expected when the context window
      fills up.
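      A simplified sketch of the fallback described above; KvCacheCanShift()
      comes from the commit, while the surrounding types and fields are
      assumptions made for illustration:

        package runner

        // Minimal stand-ins for the real runner types.
        type kvCache struct{ entries []int }

        func (c *kvCache) Shift(n int) { c.entries = c.entries[n:] }
        func (c *kvCache) Clear()      { c.entries = nil }

        type model struct{ canShift bool }

        // KvCacheCanShift mirrors the check added by the commit (signature assumed).
        func (m model) KvCacheCanShift() bool { return m.canShift }

        type sequence struct {
            model  model
            cache  kvCache
            tokens []int
        }

        // handleContextFull shifts the cache when the model supports it, otherwise
        // clears it entirely while keeping the logical token history so the
        // retained tokens can be re-processed when the next batch is built.
        func (s *sequence) handleContextFull(discard int) {
            s.tokens = s.tokens[discard:]
            if s.model.KvCacheCanShift() {
                s.cache.Shift(discard)
                return
            }
            s.cache.Clear()
        }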
    • server/internal/client/ollama: cache completed chunks (#9933) · ef27d52e
      Blake Mizerany authored
      This change adds tracking of download chunks during the pull process so
      that subsequent pulls can skip downloading already completed chunks.
      This works across restarts of ollama.
      
      Currently, download state will be lost if a prune is triggered during a
      pull (e.g. restart or remove). This issue should be addressed in a
      follow-up PR.
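      A hedged sketch of chunk-completion tracking that survives restarts; the
      file layout, names, and JSON fields are assumptions, not the actual
      implementation:

        package ollama

        import (
            "encoding/json"
            "os"
        )

        // chunkState records which byte ranges of a blob have already been
        // downloaded, so a later pull can skip completed chunks.
        type chunkState struct {
            Digest    string     `json:"digest"`
            Completed [][2]int64 `json:"completed"` // [offset, length] pairs
        }

        func saveState(path string, st chunkState) error {
            b, err := json.Marshal(st)
            if err != nil {
                return err
            }
            return os.WriteFile(path, b, 0o644)
        }

        func loadState(path string) (chunkState, error) {
            var st chunkState
            b, err := os.ReadFile(path)
            if err != nil {
                return st, err // no prior state: download everything
            }
            err = json.Unmarshal(b, &st)
            return st, err
        }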
    • runner: Release semaphore and improve error messages on failures · b2a46529
      Jesse Gross authored
      If we have an error after creating a new sequence but before
      finding a slot for it, we return without releasing the semaphore.
      This reduces our parallel sequences and eventually leads to deadlock.
      
      In practice this should never happen because once we have acquired
      the semaphore, we should always be able to find a slot. However, the
      code is clearly not correct.
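      The shape of the fix, as a sketch: release the slot semaphore on every
      early-return error path. The function, callback, and error text are
      assumptions for illustration:

        package runner

        import (
            "context"
            "fmt"

            "golang.org/x/sync/semaphore"
        )

        // scheduleSequence acquires a parallel-sequence slot and makes sure it is
        // released again if anything fails before the sequence is placed.
        func scheduleSequence(ctx context.Context, sem *semaphore.Weighted, findSlot func() (int, error)) (int, error) {
            if err := sem.Acquire(ctx, 1); err != nil {
                return -1, err
            }

            slot, err := findSlot()
            if err != nil {
                // Without this release, each failure permanently consumes a
                // parallel-sequence slot and the server eventually deadlocks.
                sem.Release(1)
                return -1, fmt.Errorf("could not find a slot for the new sequence: %w", err)
            }
            return slot, nil
        }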
    • ollamarunner: Ensure batch size limits are not exceeded · 5d097277
      Jesse Gross authored
      With the llama runner, we can generate up to NUM_PARALLEL batches
      at once, which will then get broken up into individual batches
      to get executed by llama.cpp (i.e. we add up to 2048 tokens and
      this gets split into 4 batches of 512 tokens at default settings).
      
      This splitting can improve parallelism on multi-GPU systems because
      the individual batches can move through the pipeline without blocking
      on the first one to fully complete. However, we don't yet support
      this in the Ollama runner, partially because it makes it hard to
      enforce model-specified batch constraints, which didn't exist
      previously.
      
      The result is that we will try to execute the full, unsplit batch.
      This could result in out of memory or insufficient KV cache space
      errors.
      
      This change triggers batch breaking when the total number of inputs from
      all sequences exceeds the batch size, rather than per sequence. In order
      to ensure
      fairness, it also reintroduces round-robinning around sequences so
      that we don't let one busy sequence starve the others.
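      A simplified sketch of batch breaking on the total input count, with
      round-robin over sequences so one busy sequence can't starve the others
      (the types and the buildBatch helper are assumptions):

        package runner

        // input stands in for a pending token or embedding for one sequence.
        type input struct{ seq, token int }

        // buildBatch drains pending inputs round-robin across sequences, stopping
        // once the whole batch reaches batchSize so no single forward pass exceeds
        // the model's batch limit.
        func buildBatch(pending [][]input, batchSize int) []input {
            batch := make([]input, 0, batchSize)
            for added := true; added && len(batch) < batchSize; {
                added = false
                for i := range pending {
                    if len(pending[i]) == 0 {
                        continue
                    }
                    batch = append(batch, pending[i][0])
                    pending[i] = pending[i][1:]
                    added = true
                    if len(batch) == batchSize {
                        return batch
                    }
                }
            }
            return batch
        }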
  11. 28 Mar, 2025 1 commit
  12. 27 Mar, 2025 2 commits
    • ml: Remove Output from Context interface · 01aa7887
      Jesse Gross authored
      Model implementations should use Input for all of the tensors they supply
      to the model. This includes tensors that relate to the outputs, which is
      confusing since there is also an Output function.
      
      Since Output is only used internally in GGML and not used by any
      model implementations, we can remove it from the interface to
      reduce confusion.
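      A rough before/after flavor of the change, with invented signatures (the
      actual ml package differs); the point is that Output drops to the concrete
      GGML type and out of the interface that model code sees:

        package ml

        // Tensor and the method signatures below are illustrative only.
        type Tensor interface{}

        // Context keeps Input, which model implementations use for every tensor
        // they supply to the model.
        type Context interface {
            Input() Tensor
        }

        // ggmlContext stands in for the GGML-backed implementation. Output is
        // still defined here, but only as a method on the concrete type, so it
        // no longer appears in the interface that models program against.
        type ggmlContext struct{}

        func (c *ggmlContext) Input() Tensor  { return nil }
        func (c *ggmlContext) Output() Tensor { return nil }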
    • Add gfx1200 & gfx1201 support on linux (#9878) · ead27aa9
      saman-amd authored