1. 21 Mar, 2025 8 commits
    • Blake Mizerany
    • Parth Sareen · 00ebda8c
    • Parth Sareen · d14ce75b
    • kvcache: Optimize sliding window attention · 2d6eac90
      Jesse Gross authored
      Currently sliding window attention allocates and uses the full
      context size and just masks out any tokens that are outside of the
      window. However, we really only need (roughly) the sliding window
      size.
      
      At large context sizes this improves two things:
       - Memory allocated - since the full context size was previously
         allocated up front, memory requirements drop substantially. On
         Gemma3:4b with a 32k context window, total memory usage (including
         weights and non-sliding layers) drops from ~20GB to ~8GB.
       - Computation - ranges that are completely outside of the sliding
         window are now removed from the tensors that are returned from the
         cache rather than simply being masked out. This results in more
         efficient processing, scaling with the size of the context that
         has actually been used.
      
      Notably, this does not update the scheduler for any model to be aware of
      the smaller memory requirements. This is difficult for Gemma3 because
      the layers are heterogeneous between sliding and non-sliding attention.
      As a result, while actual memory consumption will be reduced, the
      scheduler will over-estimate the requirements of the model. This means
      that splitting between GPUs or GPUs and CPUs will still be suboptimal.
      
      Bug #9730
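      A minimal Go sketch of the sizing arithmetic this change enables; the
      function and parameter names are illustrative, not the actual kvcache
      API:

      ```go
      // cacheCapacity returns how many KV slots to reserve for a layer.
      // Rather than the full context, a sliding-window layer needs roughly
      // one window plus room for the largest batch appended at once.
      func cacheCapacity(contextLen, windowSize, maxBatch int) int {
          if windowSize <= 0 || windowSize >= contextLen {
              // Non-sliding layer (or the window covers everything).
              return contextLen
          }
          return windowSize + maxBatch
      }
      ```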
    • kvcache: Pass granular cache size into implementations · 3ed7ad3a
      Jesse Gross authored
      Currently the runner computes the KV cache size needed and creates a
      cache of that size. This is the context size times the number of
      parallel sequences.
      
      Cache implementations can make better decisions about their memory
      usage, so instead pass in the required capacity, number of sequences
      and maximum batch size. For now, the causal cache just uses this to
      compute the size in the same way as before.
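      A hedged sketch of what such an interface could look like; the method
      name and parameters are assumptions based on the description, not the
      actual ollama kvcache API:

      ```go
      // Cache implementations receive the raw sizing inputs and decide
      // their own memory use.
      type Cache interface {
          // Init receives the per-sequence capacity, the number of
          // parallel sequences, and the maximum batch size.
          Init(capacity, numSeqs, maxBatch int)
      }

      type causalCache struct{ slots int }

      // Init reproduces the old sizing: capacity times the number of
      // parallel sequences.
      func (c *causalCache) Init(capacity, numSeqs, maxBatch int) {
          c.slots = capacity * numSeqs
      }
      ```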
    • Patrick Devine
    • ollamarunner: Provide mechanism for backends to report loading progress · 0ff28758
      Jesse Gross authored
      This enables the runner to report progress back to the Ollama server,
      both for showing status to the user and also to prevent the server
      from killing the runner if it thinks things have stalled.
      
      Most of the infrastructure was already there; this extends it to
      be available to the backends.
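      One plausible shape for such a mechanism, as a sketch; the ProgressFn
      type and the Progress field are assumptions, not the actual backend
      API:

      ```go
      // ProgressFn is called periodically while a backend loads model
      // weights, with progress in the range [0, 1].
      type ProgressFn func(progress float32)

      type BackendParams struct {
          // Progress, if non-nil, lets the backend report loading progress
          // so the server can show status and knows the runner is alive.
          Progress ProgressFn
      }

      func loadWeights(params BackendParams, numTensors int) {
          for i := 0; i < numTensors; i++ {
              // ... read and upload tensor i ...
              if params.Progress != nil {
                  params.Progress(float32(i+1) / float32(numTensors))
              }
          }
      }
      ```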
    • kvcache: Account for source tensors in defrag operation count · d3e9ca3e
      Jesse Gross authored
      Defragging the KV cache can generate a lot of operations, so we
      need to be careful that we don't overflow the number that the graph
      can support. We currently account for all of the nodes that we add
      to the graph for each move, but we also need to include the original
      cache tensors.
      
      Fixes #9904
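      A sketch of the accounting, with invented constants; the real
      per-move costs live in the kvcache defrag code:

      ```go
      // Assumed graph cost of one defrag move: the view/copy nodes added
      // for the K and V tensors, plus the original (source) cache tensors
      // the move reads from, which count against the node budget too.
      const (
          opsPerMove     = 6 // nodes added per move (assumption)
          sourcesPerMove = 2 // source K and V cache tensors per move
      )

      // maxMoves caps how many moves fit in one graph so defrag never
      // overflows the backend's maximum graph size.
      func maxMoves(maxGraphNodes int) int {
          return maxGraphNodes / (opsPerMove + sourcesPerMove)
      }
      ```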
  2. 20 Mar, 2025 6 commits
  3. 19 Mar, 2025 2 commits
  4. 18 Mar, 2025 2 commits
  5. 17 Mar, 2025 9 commits
  6. 15 Mar, 2025 3 commits
    • fix: correctly save in interactive mode (#9788) · 2c8b4846
      Patrick Devine authored
      This fixes the case where a FROM line in a previous Modelfile points
      to a file which may or may not be present in a different ollama
      instance. We shouldn't be relying on the filename; instead, we check
      whether the FROM line is a valid model name and point to that.
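      A sketch of the decision, under stated assumptions: both function
      names below are hypothetical stand-ins for the actual save path and
      name validation:

      ```go
      // fromLineFor decides what to write into the saved Modelfile's FROM
      // line. isValidModelName stands in for ollama's real name check.
      func fromLineFor(modelName, prevFrom string, isValidModelName func(string) bool) string {
          if isValidModelName(modelName) {
              // Refer to the model by name so the saved Modelfile does not
              // depend on a blob path that only exists on this instance.
              return modelName
          }
          // Otherwise keep whatever the previous Modelfile pointed at.
          return prevFrom
      }
      ```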
    • server/internal/client/ollama: set User-Agent for registry client (#9775) · 82946761
      Blake Mizerany authored
      This sets the agent header in DefaultRegistry to include the version of
      the client, OS, and architecture in the previous format, with a minor
      twist.
      
      Note: The version is obtained from the build info instead of
      version.Version, which should no longer be necessary, but we can
      remove it in a future commit. Using the build info is more accurate
      and also provides extra build information if the build is not tagged
      or is "dirty". Previously, the version was just "0.0.0" with no other
      helpful information. The ollama.com registry and others handle this
      swimmingly.
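      A minimal, runnable sketch of deriving such a header from Go build
      info; the exact User-Agent layout here is an assumption, not the one
      the registry client uses:

      ```go
      package main

      import (
          "fmt"
          "runtime"
          "runtime/debug"
      )

      // userAgent assembles "ollama/<version> (<arch> <os>) Go/<go version>".
      func userAgent() string {
          version := "0.0.0"
          if bi, ok := debug.ReadBuildInfo(); ok && bi.Main.Version != "" {
              // The module version covers tagged builds, pseudo-versions
              // for untagged builds, and dirty-build suffixes.
              version = bi.Main.Version
          }
          return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
              version, runtime.GOARCH, runtime.GOOS, runtime.Version())
      }

      func main() { fmt.Println(userAgent()) }
      ```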
    • gemma3 quantization (#9776) · ef378ad6
      Patrick Devine authored
  7. 14 Mar, 2025 7 commits
    • Align versions for local builds (#9635) · 2d2247e5
      Daniel Hiltgen authored
      Darwin was using a different pattern for the version string
      than Linux or Windows.
    • gemma3: Allow multiple images in a single input · 7bf793a6
      Jesse Gross authored
      Previously, processing multiple images in a batch would trigger
      segfaults, so sending images together was disabled as a mitigation.
      The trigger was processing one image on the CPU and one on the GPU.
      
      This can no longer happen:
       - The vision encoder is now on the GPU so both images would be
         processed on the GPU.
       - We require images to be fully contained in a batch, and each
         image, including its special tokens, is over half the batch size.
         As a result, we will never get two images in the same batch.
      
      Fixes #9731
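      The guarantee rests on simple arithmetic: two items that each occupy
      more than half the batch cannot fit together. A sketch with
      illustrative names:

      ```go
      // fitsInBatch reports whether an image (its tokens plus special
      // tokens) still fits into the current batch.
      func fitsInBatch(used, imageTokens, batchSize int) bool {
          return used+imageTokens <= batchSize
      }

      // If every image has imageTokens > batchSize/2, then after one image
      // is placed we have used > batchSize/2, so a second image can never
      // satisfy used+imageTokens <= batchSize: at most one image per batch.
      ```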
    • ollamarunner: Use a separate context per multimodal input · 282bfaaa
      Jesse Gross authored
      Currently there is a single context per sequence, shared by all
      multimodal inputs. Since we build a vision encoder graph per
      image, with a large number of inputs we can eventually hit the
      maximum number of graph nodes per context.
      
      This changes to use a separate context for each image, ensuring
      that available resource limits are consistent.
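      A hedged sketch of the pattern; the Backend, Context, and Tensor
      types below are stand-ins for the runner's actual ml types:

      ```go
      // Minimal stand-in types; the real ones live in ollama's ml package.
      type Tensor interface{}
      type Context interface{ Close() }
      type Backend interface{ NewContext() Context }

      // encodeImage is a placeholder for building one vision-encoder graph.
      func encodeImage(ctx Context, img []byte) (Tensor, error) { return nil, nil }

      // encodeImages gives every image its own context, so the per-context
      // graph-node limit applies per image rather than per sequence. Each
      // context must stay alive until its output tensor has been consumed.
      func encodeImages(b Backend, images [][]byte) ([]Tensor, error) {
          outs := make([]Tensor, 0, len(images))
          for _, img := range images {
              ctx := b.NewContext() // fresh context: fresh graph-node budget
              t, err := encodeImage(ctx, img)
              if err != nil {
                  ctx.Close()
                  return nil, err
              }
              outs = append(outs, t)
          }
          return outs, nil
      }
      ```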
    • ml: Allow models to constrain inputs to a single batch · 9679f401
      Jesse Gross authored
      Models may require that a set of inputs all be processed as part
      of the same batch. For example, if an image has multiple patches
      with fully connected attention between them, we should not split
      the batch in the middle of an image.
      
      Fixes #9697
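      A sketch of how such a constraint could drive batch splitting; the
      Input struct and SameBatch field are modeled on the description, not
      copied from the actual ml package:

      ```go
      // Input is one element of the prompt stream. SameBatch > 0 means
      // this input and the next SameBatch inputs must land in one batch,
      // e.g. all patches of one image with fully connected attention.
      type Input struct {
          Token     int
          SameBatch int
      }

      // splitPoint returns how many inputs fit in the next batch without
      // cutting through a same-batch group. A return of 0 means the first
      // group alone exceeds the batch size and the caller must handle it.
      func splitPoint(inputs []Input, batchSize int) int {
          n := 0
          for i := 0; i < len(inputs) && i < batchSize; i++ {
              if end := i + inputs[i].SameBatch; end >= batchSize {
                  break // group would straddle the boundary; defer it whole
              }
              n = i + 1
          }
          return n
      }
      ```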
    • llm: remove internal subprocess req and resp types (#9324) · 3892c3a7
      Bruce MacDonald authored
      This commit refactors the LLM subsystem by removing internal subprocess
      request and response types. It consolidates duplicate type definitions
      across the codebase, moving them to centralized locations. The change also
      standardizes interfaces between components, simplifies the ServerStatusResp
      struct, and moves the ParseDurationMs function to a common package. This
      cleanup reduces code duplication between different runner implementations
      (llamarunner and ollamarunner).
    • Blake Mizerany · 4e320b8b
    • server/internal/client: use chunksums for concurrent blob verification (#9746) · eb2b22b0
      Blake Mizerany authored
      Replace large-chunk blob downloads with parallel small-chunk
      verification to solve timeout and performance issues. Registry users
      experienced progressively slowing download speeds as large-chunk
      transfers aged, often timing out completely.
      
      The previous approach downloaded blobs in a few large chunks but
      required a separate, single-threaded pass to read the entire blob back
      from disk for verification after download completion.
      
      This change uses the new chunksums API to fetch many smaller
      chunk+digest pairs, allowing concurrent downloads and immediate
      verification as each chunk arrives. Chunks are written directly to their
      final positions, eliminating the entire separate verification pass.
      
      The result is more reliable downloads that maintain speed throughout the
      transfer process and significantly faster overall completion, especially
      over unstable connections or with large blobs.
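      A simplified sketch of the chunk+digest approach; the Chunk type, the
      fetch helper, and the concurrency limit are illustrative, not the
      actual chunksums API:

      ```go
      import (
          "crypto/sha256"
          "encoding/hex"
          "fmt"
          "os"

          "golang.org/x/sync/errgroup"
      )

      // Chunk pairs a byte range of the blob with its expected digest.
      type Chunk struct {
          Offset, Size int64
          Digest       string // expected hex-encoded SHA-256 of this chunk
      }

      // downloadBlob fetches chunks concurrently, verifies each one as it
      // arrives, and writes it straight to its final offset, eliminating
      // the separate whole-blob verification pass.
      func downloadBlob(dst *os.File, chunks []Chunk, fetch func(Chunk) ([]byte, error)) error {
          g := new(errgroup.Group)
          g.SetLimit(8) // illustrative concurrency limit
          for _, c := range chunks {
              c := c // capture loop variable
              g.Go(func() error {
                  data, err := fetch(c)
                  if err != nil {
                      return err
                  }
                  if sum := sha256.Sum256(data); hex.EncodeToString(sum[:]) != c.Digest {
                      return fmt.Errorf("chunk at offset %d: digest mismatch", c.Offset)
                  }
                  _, err = dst.WriteAt(data, c.Offset) // final position on disk
                  return err
              })
          }
          return g.Wait()
      }
      ```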
  8. 13 Mar, 2025 3 commits