1. 17 Oct, 2025 1 commit
    • test: harden scheduler tests (#12662) · 68e04c7f
      Daniel Hiltgen authored
      * test: harden scheduler tests
      
      This removes reschedDelay, which was stale code, and adds a
      new configurable timeout for waitForVRAMRecovery so tests can
      set the timeout very short, keeping the scheduler from getting
      stuck and hitting a test timeout (see the sketch after this
      entry).
      
      * test: tune tests for partial loads
      
      Give stress tests more time when the model is split between CPU and GPU.
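      A minimal sketch of the configurable-timeout pattern this commit
      describes; the Scheduler struct, vramRecoveryTimeout field, and
      waitForVRAMRecovery signature are illustrative assumptions, not
      the actual ollama scheduler code:

      ```go
      package sched

      import "time"

      // Illustrative only: names here are assumptions, not ollama's
      // internals.
      type Scheduler struct {
          // How long waitForVRAMRecovery polls before giving up.
          // Production code keeps a generous default; tests shrink it.
          vramRecoveryTimeout time.Duration
      }

      func NewScheduler() *Scheduler {
          return &Scheduler{vramRecoveryTimeout: 5 * time.Second}
      }

      // waitForVRAMRecovery polls until enough VRAM is free or the
      // configured timeout expires.
      func (s *Scheduler) waitForVRAMRecovery(freeVRAM func() uint64, needed uint64) bool {
          deadline := time.Now().Add(s.vramRecoveryTimeout)
          for time.Now().Before(deadline) {
              if freeVRAM() >= needed {
                  return true
              }
              time.Sleep(10 * time.Millisecond)
          }
          return false
      }
      ```

      A test can then set vramRecoveryTimeout to a few milliseconds so
      a stuck recovery fails fast instead of tripping the suite's
      overall timeout.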
  2. 02 Oct, 2025 1 commit
    • Update GGML to b6646 (#12245) · c68f367e
      Daniel Hiltgen authored
      Notable EOLs with this change:
      - macOS v12 and v13 are no longer supported (v14+ required)
      - AMD gfx900 and gfx906 are no longer supported
  3. 09 Sep, 2025 1 commit
    • llm: Clamp batch size to context size · e119783e
      Jesse Gross authored
      The context must always be able to store the current batch, so
      if the user requests a small context we should also shrink the
      batch to match (a minimal sketch follows this entry). This also
      fixes the TestLongInputContext test on the new engine. (The old
      engine already has this behavior.)
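      A minimal sketch of the clamp, using illustrative names
      (effectiveBatchSize, numCtx) rather than the engine's actual
      ones:

      ```go
      package llm

      // effectiveBatchSize caps the batch at the context size: the
      // context must be able to hold an entire batch, so a small
      // context also shrinks the batch.
      func effectiveBatchSize(requestedBatch, numCtx int) int {
          if requestedBatch > numCtx {
              return numCtx
          }
          return requestedBatch
      }
      ```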
  4. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform
      the GPU-intensive tasks (Compute + Floats) in a goroutine so the
      next batch can be prepared in parallel, reducing the time the GPU
      stalls waiting for its next batch of work (see the sketch after
      this entry).
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
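      A hedged sketch of that overlap, with placeholder types and
      callbacks (batch, nextBatch, compute) standing in for the
      runner's real API: GPU work for batch N runs in a goroutine
      while the loop prepares batch N+1.

      ```go
      package runner

      type batch struct{ tokens []int }

      // runLoop overlaps CPU-side batch preparation with GPU compute:
      // while the goroutine runs compute for batch N, the loop body
      // builds batch N+1.
      func runLoop(nextBatch func() (*batch, bool), compute func(*batch)) {
          done := make(chan struct{}, 1)
          done <- struct{}{} // no GPU work in flight yet

          for {
              b, ok := nextBatch() // prepare the next batch (CPU work)
              if !ok {
                  break
              }
              <-done // wait for the previous batch's GPU work
              go func(b *batch) {
                  compute(b) // GPU-intensive Compute + Floats
                  done <- struct{}{}
              }(b)
          }
          <-done // drain the last in-flight batch
      }
      ```

      The one-slot buffered channel keeps exactly one batch on the GPU
      while the next is being prepared, and the final receive ensures
      the loop does not return with work still in flight.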
  5. 15 Aug, 2025 1 commit
    • test: improve scheduler/concurrency stress tests (#11906) · d6f7233a
      Daniel Hiltgen authored
      * test: improve scheduler/concurrency stress tests
      
      The scheduler test used to use approximate memory figures and
      would often overshoot or undershoot a system's capacity, leading
      to flaky test results. This should improve the reliability of
      this scenario by leveraging ps output to determine exactly how
      many models it takes to trigger thrashing (see the sketch after
      this entry).
      
      The concurrency test is also refined to target num_parallel + 1 and handle
      timeouts better.
      
      With these refinements, TestMultiModelConcurrency was redundant.
      
      * test: add parallel generate with history
      
      TestGenerateWithHistory will help verify that caching and
      context are properly handled while making requests.
      
      * test: focus embed tests on embedding models
      
      Remove non-embedding models from the embedding tests.
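      One way to get that ps-style ground truth is the documented
      GET /api/ps endpoint, which reports the models currently loaded.
      A small sketch, assuming a server at the default localhost:11434
      and decoding only the fields it needs:

      ```go
      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
      )

      // psResponse mirrors just the parts of the /api/ps reply this
      // example uses.
      type psResponse struct {
          Models []struct {
              Name     string `json:"name"`
              SizeVRAM int64  `json:"size_vram"`
          } `json:"models"`
      }

      // loadedModels asks a running Ollama server how many models are
      // currently resident.
      func loadedModels(host string) (int, error) {
          resp, err := http.Get(host + "/api/ps")
          if err != nil {
              return 0, err
          }
          defer resp.Body.Close()

          var ps psResponse
          if err := json.NewDecoder(resp.Body).Decode(&ps); err != nil {
              return 0, err
          }
          return len(ps.Models), nil
      }

      func main() {
          n, err := loadedModels("http://localhost:11434")
          if err != nil {
              fmt.Println("server not reachable:", err)
              return
          }
          fmt.Printf("%d model(s) currently loaded\n", n)
      }
      ```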
  6. 20 Nov, 2024 1 commit
    • runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag; however, this heuristic isn't 100%
      accurate, so decoding can sometimes fail unexpectedly.
      
      For those cases, if decode indicates that there is no KV cache
      space, we should defrag and then try again (see the sketch after
      this entry).
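      A minimal sketch of the retry, assuming hypothetical decode and
      defrag hooks and a sentinel error; the real runner's error
      handling and defrag entry point differ:

      ```go
      package runner

      import "errors"

      // errNoKVSlot is a stand-in for the "no KV cache space" failure
      // the commit message describes.
      var errNoKVSlot = errors.New("no kv cache slot available")

      // decodeWithRetry defrags the KV cache and retries once when
      // decode reports that no cache space is left.
      func decodeWithRetry(decode func() error, defrag func()) error {
          err := decode()
          if errors.Is(err, errNoKVSlot) {
              // The defrag heuristic missed this case: compact the
              // cache and try the same batch one more time.
              defrag()
              err = decode()
          }
          return err
      }
      ```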
  7. 23 Apr, 2024 1 commit
    • Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as
      well as loading multiple models by spawning multiple runners. The
      defaults are currently 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS (see the sketch
      after this entry).
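      The environment variable names are real; the parsing helper
      below is an illustrative sketch of how such knobs are typically
      read, not ollama's actual envconfig code:

      ```go
      package envconfig

      import (
          "os"
          "strconv"
      )

      // intFromEnv returns the env var as a positive integer, or the
      // default when it is unset or malformed.
      func intFromEnv(key string, def int) int {
          if v := os.Getenv(key); v != "" {
              if n, err := strconv.Atoi(v); err == nil && n > 0 {
                  return n
              }
          }
          return def
      }

      var (
          NumParallel     = intFromEnv("OLLAMA_NUM_PARALLEL", 1)      // requests per model
          MaxLoadedModels = intFromEnv("OLLAMA_MAX_LOADED_MODELS", 1) // models resident at once
      )
      ```

      For example, running the server with OLLAMA_MAX_LOADED_MODELS=2
      and OLLAMA_NUM_PARALLEL=4 allows two resident models serving
      four concurrent requests each.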