1. 04 Aug, 2025 2 commits
  2. 26 Jun, 2025 3 commits
  3. 20 Jun, 2025 1 commit
  4. 18 Jun, 2025 1 commit
  5. 16 Jun, 2025 1 commit
  6. 12 Jun, 2025 1 commit
  7. 19 May, 2025 1 commit
    • ggml: Separate tensor load from backend creation · 94ab428e
      Jesse Gross authored
      Currently, tensors are loaded at the same time the backend is
      created, which is a slow operation. This separates the work into
      two steps:
       - Create the backend, including enumerating tensors and allocating memory
       - Load the tensor data
      
      This allows more flexibility in managing model loading.
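The two-step split described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `tensorInfo`, `backend`, `newBackend`, and `load` are hypothetical names, and the reader callback stands in for whatever actually reads tensor data from the model file.

```go
package main

import (
	"errors"
	"fmt"
)

// tensorInfo is a hypothetical record of a tensor found during enumeration.
type tensorInfo struct {
	name string
	size int // size in bytes
}

// backend is a hypothetical stand-in for a model backend.
type backend struct {
	tensors []tensorInfo
	buffers map[string][]byte
	loaded  bool
}

// newBackend performs step 1: it enumerates tensors and allocates memory
// for them, but reads no tensor data.
func newBackend(infos []tensorInfo) *backend {
	b := &backend{tensors: infos, buffers: make(map[string][]byte, len(infos))}
	for _, t := range infos {
		b.buffers[t.name] = make([]byte, t.size)
	}
	return b
}

// load performs step 2: it fills the already-allocated buffers using the
// supplied read callback. It can be called later than newBackend.
func (b *backend) load(read func(name string, dst []byte) error) error {
	if b.loaded {
		return errors.New("tensors already loaded")
	}
	for _, t := range b.tensors {
		if err := read(t.name, b.buffers[t.name]); err != nil {
			return fmt.Errorf("loading %s: %w", t.name, err)
		}
	}
	b.loaded = true
	return nil
}

func main() {
	b := newBackend([]tensorInfo{{"blk.0.attn_q.weight", 4}, {"output.weight", 2}})
	fmt.Println("buffers allocated:", len(b.buffers), "loaded:", b.loaded)

	// The slow part runs only when the caller decides to load.
	err := b.load(func(name string, dst []byte) error {
		for i := range dst {
			dst[i] = 1 // placeholder for real file reads
		}
		return nil
	})
	fmt.Println("loaded:", b.loaded, "err:", err)
}
```

Keeping creation cheap lets a caller inspect tensor sizes or decide on placement before paying the cost of reading the data.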
  8. 14 May, 2025 3 commits
  9. 12 May, 2025 1 commit
    • Follow up to #10363 (#10647) · 9d6df908
      Daniel Hiltgen authored
      The quantization PR didn't block all unsupported file types;
      this PR fixes that. It also updates the API docs to reflect
      the now-reduced set of supported types.
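Blocking unsupported types usually comes down to an allowlist check before any work starts. The sketch below is illustrative only: `supportedQuantizations` and `validateQuantization` are hypothetical names, and the three entries are assumed examples rather than the set this PR enforces (the API docs are the authority on that).

```go
package main

import (
	"fmt"
	"strings"
)

// supportedQuantizations is an assumed, illustrative allowlist.
// The real set is whatever the project's API docs list.
var supportedQuantizations = map[string]bool{
	"Q4_K_M": true,
	"Q4_K_S": true,
	"Q8_0":   true,
}

// validateQuantization rejects any type not on the allowlist, so that
// unsupported requests fail early with a clear error.
func validateQuantization(name string) error {
	key := strings.ToUpper(strings.TrimSpace(name))
	if !supportedQuantizations[key] {
		return fmt.Errorf("unsupported quantization type %q", name)
	}
	return nil
}

func main() {
	for _, q := range []string{"q4_K_M", "Q2_K"} {
		fmt.Printf("%s: %v\n", q, validateQuantization(q))
	}
}
```

An allowlist fails closed: a type added upstream later stays rejected until someone explicitly enables it.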
  10. 07 May, 2025 1 commit
  11. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
  12. 05 May, 2025 1 commit
  13. 01 May, 2025 1 commit
  14. 27 Apr, 2025 1 commit
    • ggml: fix crash for array head counts · 6ed88985
      Devon Rifkin authored
      If a head count is an array, use the max value in the array.
      
      If array values for head counts become more popular, we can consider a
      more invasive change like #10225 to calculate more accurate estimates.
      
      Fixes: #9984
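The behavior described above, accepting either a scalar or an array and falling back to the array's maximum, might look like the following sketch. `headCount` is a hypothetical helper, and the set of accepted Go types is an assumption about how the metadata value is decoded.

```go
package main

import "fmt"

// headCount returns a single head count from a metadata value that may be
// a scalar or an array. For an array it returns the largest element, which
// gives a conservative (upper-bound) figure for memory estimation.
func headCount(v any) (uint64, error) {
	switch x := v.(type) {
	case uint32:
		return uint64(x), nil
	case uint64:
		return x, nil
	case []uint32:
		if len(x) == 0 {
			return 0, fmt.Errorf("empty head count array")
		}
		m := x[0]
		for _, n := range x[1:] {
			if n > m {
				m = n
			}
		}
		return uint64(m), nil
	default:
		return 0, fmt.Errorf("unsupported head count type %T", v)
	}
}

func main() {
	n, err := headCount([]uint32{8, 32, 16})
	fmt.Println(n, err)
}
```

Returning an error for unexpected types, rather than asserting a single type, is what avoids the crash when a model stores per-layer values.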
  15. 25 Apr, 2025 9 commits
  16. 16 Apr, 2025 1 commit
  17. 03 Apr, 2025 2 commits
  18. 26 Mar, 2025 1 commit
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding windows for its context on 5/6 layers, significantly
      reducing memory usage but leading to uneven usage across layers,
      which makes allocation to the correct GPU difficult. We currently
      estimate very conservatively by assuming all layers are consistent
      at the max size.
      
      Llama3.2-vision is also inconsistent between self attention and cross
      attention layers - at the moment, we calculate the correct total size
      and then average this across layers. In some cases, this may lead
      to crashes if a large layer is placed on a GPU sized by the average.
      
      This allows memory estimation to calculate per-layer KV cache size
      and take this into account when placing layers onto GPUs. We already do
      this for weights that vary per-tensor, so this is a logical extension.
      
      Fixes #9730
      Fixes #9890
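To make the per-layer idea concrete, here is a small sketch under stated assumptions. It is not the estimation code from this commit: `kvLayerBytes`, `perLayerKV`, and `placeLayers` are hypothetical names, the sizing formula is the usual keys-plus-values approximation, and the window size, head counts, and GPU capacities in `main` are made-up numbers. Only the 5-of-6 sliding-window pattern comes from the description above.

```go
package main

import "fmt"

// kvLayerBytes approximates one layer's KV cache: keys and values (2x),
// each cached token holding kvHeads*headDim elements.
func kvLayerBytes(cachedTokens, kvHeads, headDim, bytesPerElem uint64) uint64 {
	return 2 * cachedTokens * kvHeads * headDim * bytesPerElem
}

// perLayerKV returns a size per layer for a model in which every
// globalEvery-th layer caches the full context and the rest cache only a
// sliding window.
func perLayerKV(layers int, context, window, kvHeads, headDim, bytesPerElem uint64, globalEvery int) []uint64 {
	sizes := make([]uint64, layers)
	for i := range sizes {
		tokens := window
		if tokens > context {
			tokens = context
		}
		if (i+1)%globalEvery == 0 {
			tokens = context
		}
		sizes[i] = kvLayerBytes(tokens, kvHeads, headDim, bytesPerElem)
	}
	return sizes
}

// placeLayers assigns layers in order to GPUs, moving to the next GPU when
// the current one cannot hold the next layer. A layer that fits nowhere
// gets -1. Using real per-layer sizes avoids placing a large layer on a GPU
// that only has room for the average.
func placeLayers(sizes, free []uint64) []int {
	remaining := append([]uint64(nil), free...)
	placement := make([]int, len(sizes))
	gpu := 0
	for i, s := range sizes {
		for gpu < len(remaining) && remaining[gpu] < s {
			gpu++
		}
		if gpu == len(remaining) {
			placement[i] = -1
			continue
		}
		remaining[gpu] -= s
		placement[i] = gpu
	}
	return placement
}

func main() {
	// Illustrative numbers only: 6 layers, 8192-token context,
	// 1024-token window, 4 KV heads of dimension 256, 2-byte elements.
	sizes := perLayerKV(6, 8192, 1024, 4, 256, 2, 6)
	fmt.Println("per-layer bytes:", sizes)
	fmt.Println("placement:", placeLayers(sizes, []uint64{24 << 20, 40 << 20}))
}
```

With these numbers the full-context layer is eight times the size of a windowed one, so an average would understate it and overstate the others.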
  19. 13 Mar, 2025 6 commits
  20. 11 Mar, 2025 2 commits