1. 15 Aug, 2025 1 commit
    • Daniel Hiltgen's avatar
      test: improve scheduler/concurrency stress tests (#11906) · d6f7233a
      Daniel Hiltgen authored
      * test: improve scheduler/concurrency stress tests
      
      The scheduler test used to use approximate memory figures and would often
      over or under shoot a systems capcity leading to flaky test results.
      This should improve the reliability of this scenario by leveraging
      ps output to determinie exactly how many models it takes to
      trigger thrashing.
      
      The concurrency test is also refined to target num_parallel + 1 and handle
      timeouts better.
      
      With these refinements, TestMultiModelConcurrency was redundant
      
      * test: add parallel generate with history
      
      TestGenerateWithHistory will help verify caching and context
      are properly handled while making requests
      
      * test: focus embed tests on embedding models
      
      remove non-embedding models from the embedding tests
      d6f7233a
  2. 02 Apr, 2025 1 commit
  3. 20 Nov, 2024 1 commit
    • Jesse Gross's avatar
      runner.go: Retry decoding after defragmentation if needed · 7121dfa3
      Jesse Gross authored
      Fragmentation of the KV cache can occur due to cache shifting or
      different sequences getting processed. Decode uses a heuristic to
      decide if it should defrag. However, this heuristic isn't 100%
      accurate, so decoding can sometimes fail by surprise.
      
      For these cases, if decode indicates that there is no KV cache space,
      we should defrag and then try again.
      7121dfa3
  4. 09 Jul, 2024 1 commit
  5. 14 Jun, 2024 3 commits
  6. 23 Apr, 2024 1 commit
    • Daniel Hiltgen's avatar
      Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The default
      settings are currently set at 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
      34b9db5a
  7. 04 Apr, 2024 1 commit
  8. 01 Apr, 2024 1 commit
  9. 26 Mar, 2024 1 commit
  10. 25 Mar, 2024 1 commit
  11. 23 Mar, 2024 1 commit