1. 30 Oct, 2025 1 commit
    • ggml: Enable op_offload to improve partial offload performance · afaf7ce8
      Jesse Gross authored
      When a model is partially offloaded to system RAM, we can either
      do the calculations on the CPU or we can temporarily transfer the
      data to the GPU to do the calculations there. Small batches tend
      to be better on the CPU, large batches on the GPU.
      
      The llamarunner used the GPU in most cases and the ollamarunner
      used the CPU. Although the ollamarunner saw an improvement in
      token generation performance, there was a large performance hit
      in prompt processing (3-10x).
      
      There is an existing heuristic to dynamically switch between these
      two modes, but in practice it doesn't have enough information to
      make that decision accurately. This change feeds it authoritative
      data so the check works as intended, giving the best of both worlds.
      
      Fixes #12037
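      As a rough illustration of the trade-off above (not the actual ggml heuristic, and with a made-up threshold), the placement decision amounts to something like this:

      ```go
      // Hypothetical sketch, not the actual ggml heuristic: decide where to run
      // an op whose weights live in system RAM, based on batch size.
      package main

      import "fmt"

      // offloadThreshold is an assumed cutoff: below it, copying data to the GPU
      // costs more than the faster math saves; above it, the GPU wins.
      const offloadThreshold = 32

      // placeForBatch returns the device a partially offloaded op should run on.
      func placeForBatch(batchTokens int) string {
          if batchTokens >= offloadThreshold {
              return "GPU" // large batches (prompt processing) amortize the transfer
          }
          return "CPU" // small batches (token generation) stay local
      }

      func main() {
          for _, n := range []int{1, 8, 512} {
              fmt.Printf("batch of %4d tokens -> %s\n", n, placeForBatch(n))
          }
      }
      ```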
  2. 28 Oct, 2025 1 commit
  3. 22 May, 2025 2 commits
    • ml: Panic rather than return error on tensor allocation failure · 1f371ea9
      Jesse Gross authored
      FromFloatSlice and FromIntSlice return an error if the shape doesn't
      match the passed data or if memory can't be allocated. Since these
      are inputs, the memory being allocated is system memory rather than VRAM.
      
      In many cases, the caller can't really handle the error and panics.
      
      Empty and Zeros directly panic if they can't allocate memory.
      
      This change makes things consistent by panicking in the first two
      functions as well, removing a fair amount of error handling code. This is
      also consistent with how Go typically handles these situations.
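      A minimal sketch of the panic-on-failure pattern, using an illustrative function rather than the real ml package signature:

      ```go
      // Illustrative sketch only; names and signatures are assumptions, not the
      // actual ml package API.
      package main

      import "fmt"

      type Tensor struct {
          data  []float32
          shape []int
      }

      // fromFloatSlice builds an input tensor in system memory. A shape mismatch
      // is a programming error, so it panics instead of returning an error, as Go
      // typically does for bugs rather than recoverable conditions.
      func fromFloatSlice(s []float32, shape ...int) Tensor {
          n := 1
          for _, d := range shape {
              n *= d
          }
          if n != len(s) {
              panic(fmt.Sprintf("fromFloatSlice: shape %v does not match %d elements", shape, len(s)))
          }
          return Tensor{data: s, shape: shape}
      }

      func main() {
          t := fromFloatSlice([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
          fmt.Println(t.shape) // [2 3]
      }
      ```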
    • ollamarunner: Memory usage reporting · 73d6a82c
      Jesse Gross authored
      This provides granular information about the backend memory allocations
      required by the runner:
       - Per backend
       - Per layer
       - Weights, cache and graph
       - Allocation status
      
      This can be used for debugging and validating memory estimates.
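      A hypothetical sketch of what such a per-backend, per-layer breakdown could look like; the type names are assumptions, not the runner's actual types:

      ```go
      // Illustrative report structure: per backend, per layer, split into
      // weights, cache, and graph, plus allocation status.
      package main

      import "fmt"

      type LayerMemory struct {
          Weights uint64 // bytes of model weights placed on this layer
          Cache   uint64 // bytes of KV cache for this layer
          Graph   uint64 // bytes of compute-graph scratch for this layer
      }

      type BackendMemory struct {
          Name      string        // e.g. "CUDA0" or "CPU"
          Allocated bool          // whether the allocations actually succeeded
          Layers    []LayerMemory // one entry per model layer on this backend
      }

      func (b BackendMemory) total() uint64 {
          var sum uint64
          for _, l := range b.Layers {
              sum += l.Weights + l.Cache + l.Graph
          }
          return sum
      }

      func main() {
          report := []BackendMemory{
              {Name: "CUDA0", Allocated: true, Layers: []LayerMemory{
                  {Weights: 1 << 30, Cache: 64 << 20, Graph: 32 << 20},
              }},
              {Name: "CPU", Allocated: true, Layers: []LayerMemory{
                  {Weights: 512 << 20, Cache: 32 << 20, Graph: 16 << 20},
              }},
          }
          for _, b := range report {
              fmt.Printf("%-5s allocated=%v total=%d bytes\n", b.Name, b.Allocated, b.total())
          }
      }
      ```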
  4. 15 May, 2025 2 commits
    • ollamarunner: Multi-modal worst case graph · fe623c2c
      Jesse Gross authored
      We currently preallocate compute graph memory for the worst case
      batch of text tokens. This adds support for doing the same for
      images.
      
      Note that image models are more complicated than text models in
      how they process their inputs, so there may be cases where this
      approach isn't completely generic for all models. It does cover all
      currently supported models, though.
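      A hedged sketch of the reservation idea, with made-up names and sizing: build the graph once with the largest inputs the model will accept, so later batches never exceed that allocation:

      ```go
      // Illustrative only: pre-build the compute graph with worst-case inputs
      // (a full text batch plus a maximum-size image) to reserve peak memory.
      package main

      import "fmt"

      // buildGraph stands in for constructing the model's compute graph for a
      // given batch; it returns the memory that graph needed.
      func buildGraph(textTokens, imagePixels int) (bytes int) {
          return textTokens*4096 + imagePixels*16 // crude stand-in for real sizing
      }

      // reserveWorstCase pre-builds the graph with the maximum inputs we accept,
      // so no later batch can require more graph memory than was reserved here.
      func reserveWorstCase(maxBatchTokens, maxImagePixels int) int {
          return buildGraph(maxBatchTokens, maxImagePixels)
      }

      func main() {
          reserved := reserveWorstCase(2048, 896*896)
          fmt.Printf("reserved %d bytes of graph memory\n", reserved)
      }
      ```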
    • ollamarunner: Separate text and multimodal graphs · 3c14461d
      Jesse Gross authored
      For some multimodal models (such as gemma3), we create a single
      graph that generates the image embedding and then use this in the
      text model. The embedding tensor is completely opaque to the runner.
      
      However, this doesn't work if we need to use the embedding across
      multiple batches, which can happen when the embedding is larger than the
      batch size. In these cases (as with llama4), we would like to create
      views that are more appropriately sized. But if we do that, the original
      source tensor ends up being used in multiple graphs, which isn't allowed.
      To avoid that problem, models with this pattern compute the embedding
      tensor on first use and recreate the individual views from it. There is
      no longer a single combined vision and text graph.
      
      This codifies the pattern of separating vision and text graphs. The
      logic of computing tensors on demand is moved to the runner, so models
      no longer have to worry about this. It also gives the runner visibility
      into the multimodal tensors, which is important for memory management.
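      A rough sketch of the compute-on-first-use pattern described above, with purely illustrative types:

      ```go
      // Illustrative only: the vision embedding is produced by its own graph the
      // first time a batch needs it, then sliced into batch-sized views.
      package main

      import "fmt"

      type tensor struct{ rows, cols int }

      // view returns a window of n rows starting at start (no copy in a real backend).
      func (t tensor) view(start, n int) tensor { return tensor{rows: n, cols: t.cols} }

      type multimodalInput struct {
          computeEmbedding func() tensor // runs the vision graph
          embedding        *tensor       // filled in on first use
      }

      // embeddingViews returns batch-sized views of the embedding, computing the
      // embedding itself only once, on the first call.
      func (m *multimodalInput) embeddingViews(batchSize int) []tensor {
          if m.embedding == nil {
              e := m.computeEmbedding()
              m.embedding = &e
          }
          var views []tensor
          for start := 0; start < m.embedding.rows; start += batchSize {
              n := batchSize
              if start+n > m.embedding.rows {
                  n = m.embedding.rows - start
              }
              views = append(views, m.embedding.view(start, n))
          }
          return views
      }

      func main() {
          in := &multimodalInput{computeEmbedding: func() tensor { return tensor{rows: 1024, cols: 4096} }}
          fmt.Println(len(in.embeddingViews(512))) // 2 views; the vision graph ran once
      }
      ```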