1. 11 Apr, 2025 3 commits
    • ggml: Don't allocate CPU buffers as CUDA Host buffers · 34c3b68f
      Jesse Gross authored
      Allocating (and in particular, freeing) memory from CUDA host buffers
      is expensive and can cause a significant performance hit if we do
      it for every token. Using normal system memory avoids this issue
      and also gives the OS more flexibility to manage it.
      
      There is no performance impact from this patch directly (either
      positive or negative) but it makes a difference once we start
      freeing memory correctly.
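      A minimal sketch of the allocation policy described above, with
      hypothetical names (allocSystem, allocPinned, newScratch) standing in
      for the real ggml/Ollama code: short-lived per-token buffers come from
      ordinary system memory, while pinned (CUDA host) memory is left for
      long-lived staging buffers.

        package buffers

        // allocSystem returns an ordinary Go-managed buffer; the OS is free
        // to page or reuse this memory as it sees fit.
        func allocSystem(n int) []byte { return make([]byte, n) }

        // allocPinned stands in for a cudaHostAlloc-style allocation. It is
        // a placeholder here; a real implementation would call into CUDA via
        // cgo, and allocating or freeing it is comparatively expensive.
        func allocPinned(n int) []byte { return make([]byte, n) }

        // newScratch picks an allocator for a CPU-side buffer. Buffers that
        // are created and freed on every token use plain system memory.
        func newScratch(n int, perToken bool) []byte {
            if perToken {
                return allocSystem(n)
            }
            return allocPinned(n)
        }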
    • ggml: Use pointer receivers for Context · f33ccd5d
      Jesse Gross authored
      Context is currently mixed between pointer and value receivers. Change
      this to be all pointer receivers so we don't have to reason about whether
      the things we are updating in the struct will be retained.
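      The hazard with mixed receivers can be seen in a minimal, illustrative
      Go example (the ctx type below is a stand-in, not Ollama's Context):
      a mutation made through a value receiver is applied to a copy and
      silently lost.

        package main

        import "fmt"

        // ctx is an illustrative stand-in for a context-like struct.
        type ctx struct {
            nodes int
        }

        // addValue uses a value receiver: it mutates a copy, so the caller
        // never sees the update.
        func (c ctx) addValue() { c.nodes++ }

        // addPointer uses a pointer receiver: the caller's struct is updated.
        func (c *ctx) addPointer() { c.nodes++ }

        func main() {
            var c ctx
            c.addValue()
            fmt.Println(c.nodes) // 0: the update was lost
            c.addPointer()
            fmt.Println(c.nodes) // 1
        }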
    • ggml: Log filesystem errors · bc108b9a
      Jesse Gross authored
      Sometimes loading the GGUF file fails with:
      panic: context canceled
      
      This is probably a filesystem error but it doesn't provide any
      information about what happened.
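      A sketch of the idea in Go: log and wrap the underlying filesystem
      error so a failure reports more than "context canceled". The openGGUF
      function below is an illustrative stub, not the actual loader.

        package main

        import (
            "fmt"
            "log/slog"
            "os"
        )

        // openGGUF logs and wraps the filesystem error instead of letting a
        // later failure surface only as a canceled context.
        func openGGUF(path string) (*os.File, error) {
            f, err := os.Open(path)
            if err != nil {
                slog.Error("failed to open GGUF file", "path", path, "error", err)
                return nil, fmt.Errorf("open %s: %w", path, err)
            }
            return f, nil
        }

        func main() {
            if _, err := openGGUF("model.gguf"); err != nil {
                fmt.Println(err)
            }
        }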
  2. 08 Apr, 2025 2 commits
    • ollamarunner: Preallocate worst case graph at startup · dbb149e6
      Jesse Gross authored
      Currently, the KV cache and graph are lazily allocated as needed.
      The cache is fully allocated on first use of the corresponding
      layer whereas the graph grows with the size of the context.
      
      This can be an issue if another application allocates more VRAM
      after we do our calculations - Ollama will crash in the middle of
      inference. If we instead allocate the maximum needed memory at
      startup of the runner, we will either succeed or fail at that point
      rather than at some surprising time in the future.
      
      Currently, this only generates a worst case batch for text, which
      means that vision models may get a partial allocation and continue
      to lazily allocate the rest.
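      A sketch of the reserve-at-startup idea; reserveWorstCase and its
      parameters are illustrative, not the runner's actual API. The point is
      that the worst-case allocation is attempted once, before serving any
      requests, so an out-of-memory condition is reported at startup.

        package main

        import (
            "errors"
            "fmt"
        )

        // reserveWorstCase builds a worst-case request (maximum batch size
        // at the maximum context length) and tries to allocate for it once.
        func reserveWorstCase(maxBatch, maxContext int, alloc func(batch, ctxLen int) error) error {
            if err := alloc(maxBatch, maxContext); err != nil {
                return fmt.Errorf("reserving worst case graph: %w", err)
            }
            return nil
        }

        func main() {
            // Fake allocator that always fails, to show the startup-time error.
            alloc := func(batch, ctxLen int) error { return errors.New("out of device memory") }
            if err := reserveWorstCase(512, 8192, alloc); err != nil {
                fmt.Println("startup failed:", err)
            }
        }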
    • ggml: Check for OOM and return as Go errors · a807985e
      Jesse Gross authored
      If there is a CUDA OOM, we currently don't check the return value
      and will eventually segfault. This checks for the problem and generates
      a Go error. At the moment, this will still result in a panic but having
      the error is the first step to being able to handle it more gracefully.
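      The pattern is the usual one for wrapping a C allocation: check the
      status instead of pressing on with an invalid pointer, and turn a
      failure into a Go error. The sketch below uses a plain callback in
      place of the real cgo call; the names are illustrative.

        package main

        import (
            "errors"
            "fmt"
        )

        // errNoMem stands in for the condition being checked: the backend
        // could not allocate, for example because of a CUDA out-of-memory.
        var errNoMem = errors.New("unable to allocate backend buffer")

        // allocGraph converts an allocation failure into a Go error instead
        // of continuing with an invalid buffer and segfaulting later.
        func allocGraph(allocate func() (uintptr, bool)) (uintptr, error) {
            buf, ok := allocate()
            if !ok {
                return 0, fmt.Errorf("graph allocation failed: %w", errNoMem)
            }
            return buf, nil
        }

        func main() {
            _, err := allocGraph(func() (uintptr, bool) { return 0, false })
            fmt.Println(err)
        }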
  3. 05 Apr, 2025 1 commit
  4. 03 Apr, 2025 2 commits
  5. 27 Mar, 2025 1 commit
    • ml: Remove Output from Context interface · 01aa7887
      Jesse Gross authored
      Model implementations should use Input for all of their tensors
      supplied to the model. This includes tensors that relate to the
      outputs, which is confusing since there is also an Output function.
      
      Since Output is only used internally in GGML and not used by any
      model implementations, we can remove it from the interface to
      reduce confusion.
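      A simplified sketch of the interface change; the real Context interface
      has many more methods, so this only illustrates the shape of the
      reduction.

        package ml

        // Before the change, the interface exposed both Input and Output
        // even though model implementations only ever used Input:
        //
        //     type Context interface {
        //         Input() Context
        //         Output() Context // only used internally by GGML
        //     }
        //
        // After the change, Output is dropped from the interface and kept as
        // an internal detail of the GGML backend.
        type Context interface {
            Input() Context
        }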
  6. 21 Mar, 2025 1 commit
  7. 18 Mar, 2025 1 commit
  8. 17 Mar, 2025 2 commits
  9. 11 Mar, 2025 8 commits
  10. 08 Mar, 2025 2 commits
  11. 07 Mar, 2025 12 commits
  12. 04 Mar, 2025 1 commit
    • ml/backend/ggml: consolidate system info logging · 05a01fde
      Michael Yang authored
      - output backend system info when initializing the backend. this ensures
        this information is always present without needing to be called
        explicitly
      - convert to structured logging
      - enumerate devices rather than backends since devices are ordered
      - track device indices grouped by device name
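      A small sketch of the structured-logging side of this change using
      Go's log/slog; the deviceInfo record and field names are illustrative,
      not the backend's actual output.

        package main

        import "log/slog"

        // deviceInfo is an illustrative record of what gets logged once at
        // backend initialization.
        type deviceInfo struct {
            name  string
            index int
        }

        func main() {
            // Devices are enumerated (they have a stable order), rather than
            // backends, and each is logged as structured key/value pairs.
            devices := []deviceInfo{
                {name: "CUDA0", index: 0},
                {name: "CUDA1", index: 1},
            }
            for _, d := range devices {
                slog.Info("system info", "device", d.name, "index", d.index)
            }
        }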
  13. 02 Mar, 2025 4 commits
    • ml: Enable support for flash attention · 21aa666a
      Jesse Gross authored
      The GGML flash attention kernel has specific requirements for
      padding and permutation. This adds support to the KV cache
      for conforming to these requirements so that flash attention
      can be enabled.
      
      Flash attention can be used in the same situations as the llama
      engine and is enabled by the user in the same way.
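      The padding requirement amounts to rounding the in-use cache length up
      to a multiple the kernel expects. The sketch below shows the round-up;
      the specific multiples (256 with flash attention, 32 without) are
      assumptions for illustration, not quoted from the code.

        package kvcache

        // pad rounds length up to the next multiple of n.
        func pad(length, n int) int {
            return ((length + n - 1) / n) * n
        }

        // cacheLen returns the padded number of cache cells that attention
        // will see. The flash attention kernel needs a larger alignment than
        // the regular attention path, so the cache reserves and masks a
        // padded length. The constants here are illustrative assumptions.
        func cacheLen(used int, flashAttention bool) int {
            if flashAttention {
                return pad(used, 256)
            }
            return pad(used, 32)
        }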
    • ml: Empty tensor constructor for tensors · ee141cc8
      Jesse Gross authored
      In cases where we allocate a tensor and then fully overwrite it with
      copied data, it is wasteful to first zero out the memory.
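      A conceptual sketch of the difference; the Zeros/Empty names echo the
      commit title but are not quoted from the API, and since a plain make
      in Go always zeroes, Empty only stands in for a backend-level
      uninitialized allocation.

        package ml

        // Zeros allocates and clears the buffer; use it when the tensor may
        // be read before it is fully written.
        func Zeros(n int) []float32 {
            return make([]float32, n) // zeroed by the runtime
        }

        // Empty stands in for an allocation that skips the clear, which is
        // safe only when every element will be overwritten, for example by
        // immediately copying data into it.
        func Empty(n int) []float32 {
            return make([]float32, n) // placeholder: a real backend would skip zeroing
        }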
    • ggml-backend: Store parent backend as part of tensor · 55e5776c
      Jesse Gross authored
      It can be important for a tensor to know what backend it came from -
      for example, to know if flash attention is enabled.
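      A minimal sketch of the idea: each tensor keeps a reference to the
      backend that created it so later code can ask capability questions.
      The field and method names are stand-ins, not the actual types.

        package ggml

        // backend holds per-backend configuration, such as whether flash
        // attention is enabled.
        type backend struct {
            flashAttention bool
        }

        // tensor records its parent backend at creation time.
        type tensor struct {
            b *backend
            // ... data, shape, etc.
        }

        // flashAttentionEnabled answers a capability question by consulting
        // the tensor's parent backend.
        func (t *tensor) flashAttentionEnabled() bool {
            return t.b.flashAttention
        }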
    • attention: Remove unnecessary contiguous operations · 854a9195
      Jesse Gross authored
      Prior to performing attention, we need to permute query, key
      and value. Currently we call Contiguous after each of these
      permutations, which is correct but expensive. Avoiding the
      3 calls to Contiguous increases performance by over 20%.
      
      The permutations of query and key do not violate the contiguity
      rules for mulmat, so the Contiguous call can simply be removed.
      
      Value requires a different permutation and does require Contiguous.
      However, we can use the copy into the cache as a way to perform this
      without further overhead.
      
      To support this and avoid unexpected tensor shapes being seen by
      models, we need tighter integration between attention, cache
      and backend. Future optimizations will also likely need this structure;
      for example, flash attention has special padding requirements in
      the cache, and other backends may have their own needs.
      
      This further contains the operations that go into attention so that
      these and other optimizations can be handled transparently. Models
      that have special requirements for attention can still implement
      their own version of it.
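      A sketch of the resulting flow, using stand-in Tensor and Cache
      interfaces; the method names (Permute, Contiguous, Put, Get) and the
      permutation order are illustrative, not quoted from the ml package.

        package attention

        // Tensor is a minimal stand-in interface for this sketch.
        type Tensor interface {
            Permute(dims ...int) Tensor
            Contiguous() Tensor
        }

        // Cache stands in for the KV cache. Storing into it copies the data,
        // and that copy also lays the value tensor out contiguously.
        type Cache interface {
            Put(key, value Tensor)
            Get() (key, value Tensor)
        }

        // attend shows the shape of the optimization: query and key are
        // permuted but not made contiguous (those permutations are still
        // acceptable to mulmat), and value picks up contiguity as a side
        // effect of being copied into the cache.
        func attend(q, k, v Tensor, cache Cache) (query, key, value Tensor) {
            cache.Put(k, v)
            key, value = cache.Get()

            query = q.Permute(0, 2, 1, 3) // no Contiguous call needed
            key = key.Permute(0, 2, 1, 3) // no Contiguous call needed
            return query, key, value
        }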