1. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform the main
      GPU-intensive tasks (Compute+Floats) in a goroutine so that the next batch
      can be prepared in parallel, reducing the time the GPU stalls waiting for
      the next batch of work.
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
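
The overlap described in the perf change above can be sketched roughly as follows. This is a minimal illustration, not the runner's actual code: `batch`, `buildGraph`, `compute`, and `runPipelined` are hypothetical names, and the "GPU" step is simulated on the CPU.

```go
package main

import (
	"fmt"
	"sync"
)

// batch is a hypothetical stand-in for a prepared compute graph.
type batch struct{ id int }

// buildGraph simulates the CPU-side preparation of a batch.
func buildGraph(id int) batch { return batch{id: id} }

// compute simulates the GPU-intensive step and returns its result.
func compute(b batch) string { return fmt.Sprintf("result-%d", b.id) }

// runPipelined computes batch i in a goroutine while batch i+1 is built
// on the calling goroutine, then waits before moving on.
func runPipelined(n int) []string {
	results := make([]string, 0, n)
	if n <= 0 {
		return results
	}
	next := buildGraph(0)
	for i := 0; i < n; i++ {
		cur := next
		var out string
		var wg sync.WaitGroup
		wg.Add(1)
		go func() {
			defer wg.Done()
			out = compute(cur) // the long-running step
		}()
		if i+1 < n {
			next = buildGraph(i + 1) // overlaps with compute above
		}
		wg.Wait() // out is safe to read only after this point
		results = append(results, out)
	}
	return results
}

func main() {
	fmt.Println(runPipelined(3))
}
```

Results are still collected in batch order; only the preparation of the following batch is moved off the critical path.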
  2. 15 Aug, 2025 1 commit
    • test: improve scheduler/concurrency stress tests (#11906) · d6f7233a
      Daniel Hiltgen authored
      * test: improve scheduler/concurrency stress tests
      
      The scheduler test used approximate memory figures and would often
      overshoot or undershoot a system's capacity, leading to flaky test results.
      This should improve the reliability of this scenario by leveraging
      ps output to determine exactly how many models it takes to
      trigger thrashing.
      
      The concurrency test is also refined to target num_parallel + 1 and handle
      timeouts better.
      
      With these refinements, TestMultiModelConcurrency was redundant.
      
      * test: add parallel generate with history
      
      TestGenerateWithHistory will help verify that caching and context
      are properly handled while making requests.
      
      * test: focus embed tests on embedding models
      
      Remove non-embedding models from the embedding tests.
  3. 02 Apr, 2025 1 commit
  4. 20 Nov, 2024 1 commit
    • runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag. However, this heuristic isn't 100%
      accurate, so decoding can sometimes fail by surprise.
      
      For these cases, if decode indicates that there is no KV cache space,
      we should defrag and then try again.
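
A minimal sketch of the defrag-and-retry idea described above, under stated assumptions: `errNoKVSlot`, the `kvContext` interface, and `fakeContext` are hypothetical and do not reflect the actual runner.go types or error values.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoKVSlot is a hypothetical sentinel meaning "no KV cache space".
var errNoKVSlot = errors.New("no KV cache slot available")

// kvContext is a hypothetical stand-in for the runner's decode context.
type kvContext interface {
	Decode() error
	Defrag()
}

// decodeWithRetry decodes once; if the cache reports no space, it
// defragments and tries exactly one more time.
func decodeWithRetry(ctx kvContext) error {
	err := ctx.Decode()
	if errors.Is(err, errNoKVSlot) {
		ctx.Defrag()
		err = ctx.Decode()
	}
	return err
}

// fakeContext fails with errNoKVSlot until Defrag has been called.
type fakeContext struct {
	fragmented bool
	defrags    int
}

func (f *fakeContext) Decode() error {
	if f.fragmented {
		return errNoKVSlot
	}
	return nil
}

func (f *fakeContext) Defrag() {
	f.fragmented = false
	f.defrags++
}

func main() {
	ctx := &fakeContext{fragmented: true}
	err := decodeWithRetry(ctx)
	fmt.Println("error:", err, "defrags:", ctx.defrags)
}
```

Retrying only once keeps a genuinely full cache from looping forever; any other error is returned unchanged.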
  5. 09 Jul, 2024 1 commit
  6. 14 Jun, 2024 3 commits
  7. 23 Apr, 2024 1 commit
    • Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The default
      settings are currently set at 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
  8. 04 Apr, 2024 1 commit
  9. 01 Apr, 2024 1 commit
  10. 26 Mar, 2024 1 commit
  11. 25 Mar, 2024 1 commit
  12. 23 Mar, 2024 1 commit