1. 28 Oct, 2025 1 commit
  2. 27 Oct, 2025 1 commit
    • server: Consolidate embedding truncation in runner (#12730) · 5d347f6d
      nicole pardal authored
      Currently, checking that embedding prompts fit in the context window
      (and truncating them if necessary) happens in two places - the Ollama
      server and the runner. This can lead to inconsistencies both in the
      checks themselves and in the reported number of tokens processed.
      Since we have to do this processing in the runner anyway, this
      consolidates all of the logic there; a minimal sketch of the
      consolidated check follows.
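      A minimal sketch of a runner-side truncation check, assuming a
      hypothetical truncateToContext helper (names and types are
      illustrative, not Ollama's actual runner API):

      ```go
      package main

      import "fmt"

      // truncateToContext trims a tokenized prompt to the context window and
      // returns the tokens kept plus the count actually processed, so the
      // reported token count always matches what the model saw.
      func truncateToContext(tokens []int32, numCtx int) ([]int32, int) {
          if len(tokens) > numCtx {
              tokens = tokens[:numCtx]
          }
          return tokens, len(tokens)
      }

      func main() {
          prompt := make([]int32, 5000) // stand-in for a tokenized embedding input
          _, n := truncateToContext(prompt, 4096)
          fmt.Printf("processed %d tokens (context window 4096)\n", n)
      }
      ```

      Doing the check in one place means the reported count can never
      disagree with what the model actually processed.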
  3. 25 Oct, 2025 1 commit
  4. 23 Oct, 2025 4 commits
    • llm: Change memory allocation backoff from exponential to incremental · ad6f6a1d
      Jesse Gross authored
      If we create a memory layout that should fit based on reported free
      VRAM but allocation still fails, we start applying a backoff. This
      reduces the assumed free VRAM by an exponentially growing percentage
      (1%, 2%, 4%...). However, the points chosen tend to be too dense at
      the beginning and too sparse at the end. Therefore, this switches to
      an incremental backoff (10%, 20%, 30%...); the sketch below compares
      the two schedules.
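      A small comparison of the two schedules as described in the message;
      the function names are made up for illustration:

      ```go
      package main

      import "fmt"

      // exponentialPct is the old schedule: reduce assumed-free VRAM by
      // 1%, 2%, 4%, ... on successive allocation attempts.
      func exponentialPct(attempt int) int { return 1 << attempt }

      // incrementalPct is the new schedule: reduce by 10%, 20%, 30%, ...
      func incrementalPct(attempt int) int { return 10 * (attempt + 1) }

      func main() {
          for i := 0; i < 5; i++ {
              fmt.Printf("attempt %d: exponential -%d%%, incremental -%d%%\n",
                  i, exponentialPct(i), incrementalPct(i))
          }
      }
      ```

      By the fifth attempt the exponential schedule is still only trying a
      16% reduction, while the incremental one has reached 50%, spreading
      the probe points more evenly across the range.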
    • DRY out the runner lifecycle code (#12540) · 3258a89b
      Daniel Hiltgen authored
      * DRY out the runner lifecycle code
      
      Now that discovery uses the runners as well, this unifies the
      runner-spawning code in a single place. It also unifies the GPU
      discovery types with the newer ml.DeviceInfo.
      
      * win: make incremental builds better
      
      Place build artifacts in discrete directories so incremental builds don't have to start fresh
      
      * Adjust sort order to consider iGPUs
      
      * handle cpu inference oom scenarios
      
      * review comments
    • kvcache: Remove special case for reservation mask · 1c093e97
      Jesse Gross authored
      We currently short-circuit generation of the cache mask and just
      generate an empty tensor of the correct size. However, in some cases
      this can also skip a cast operation, which can leave the worst-case
      graph not fully worst case.
      
      We don't actually need the fast path for mask generation, so it's
      better to just use the normal code path (see the sketch below).
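      A compilable sketch of the idea, using hypothetical Tensor/Context
      stand-ins rather than Ollama's actual ml package:

      ```go
      package kvsketch

      // Tensor and Context are hypothetical stand-ins for the real ml types.
      type Tensor interface {
          Cast(dtype string) Tensor
      }

      type Context interface {
          Zeros(dtype string, shape ...int) Tensor
      }

      // buildMask always runs the full construction path. The removed fast
      // path returned the zero tensor directly during graph reservation,
      // skipping the Cast below, so the reserved "worst case" graph could
      // be missing that op.
      func buildMask(ctx Context, seqLen int, dtype string) Tensor {
          mask := ctx.Zeros("float32", seqLen, seqLen) // same shape as real inference
          return mask.Cast(dtype)                      // now recorded during reservation too
      }
      ```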
  5. 22 Oct, 2025 4 commits
  6. 20 Oct, 2025 5 commits
  7. 18 Oct, 2025 2 commits
  8. 17 Oct, 2025 1 commit
    • test: harden scheduler tests (#12662) · 68e04c7f
      Daniel Hiltgen authored
      * test: harden scheduler tests
      
      This removes reschedDelay, which was stale code, and adds a new
      configurable timeout for waitForVRAMRecovery so tests can set the
      timeout to be very short, avoiding the scheduler getting stuck and
      hitting a test timeout (a sketch of the pattern follows this message).
      
      * test: tune tests for partial loads
      
      Give stress tests more time when the model is split between CPU/GPU
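      A minimal sketch of the testability pattern, assuming an illustrative
      Scheduler shape (field and function names are not the real
      scheduler's API):

      ```go
      package main

      import (
          "fmt"
          "time"
      )

      // Scheduler carries the wait bound as a field so tests can override it.
      type Scheduler struct {
          vramRecoveryTimeout time.Duration // generous default in production, tiny in tests
      }

      // waitForVRAMRecovery blocks until freed VRAM is observed or the
      // configurable timeout expires, instead of waiting unboundedly.
      func (s *Scheduler) waitForVRAMRecovery(recovered <-chan struct{}) bool {
          select {
          case <-recovered:
              return true
          case <-time.After(s.vramRecoveryTimeout):
              return false
          }
      }

      func main() {
          s := &Scheduler{vramRecoveryTimeout: 10 * time.Millisecond} // test-style setting
          fmt.Println("recovered:", s.waitForVRAMRecovery(make(chan struct{})))
      }
      ```

      With the timeout set to milliseconds, a stuck wait fails fast instead
      of hanging until the suite-level test timeout fires.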
  9. 16 Oct, 2025 11 commits
  10. 15 Oct, 2025 7 commits
  11. 14 Oct, 2025 3 commits