1. 14 Jun, 2024 1 commit
      Improve multi-gpu handling at the limit · 6fd04ca9
      Daniel Hiltgen authored
      Still not complete; our prediction needs further refinement to
      understand each discrete GPU's available space so we can see how many
      layers fit in each one. Since we can't split a single layer across
      multiple GPUs, we can't treat free space as one logical block (see the
      sketch below).
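      The prediction described above has to size each discrete GPU on its
      own. A minimal Go sketch of that per-device fitting follows; the names
      (gpu, layersPerGPU) are illustrative assumptions, not Ollama's actual
      code.

      ```go
      package main

      import "fmt"

      // gpu describes one discrete device; freeBytes is its usable VRAM.
      type gpu struct {
          id        int
          freeBytes uint64
      }

      // layersPerGPU assigns whole layers greedily. Each layer must fit
      // entirely within a single GPU's remaining space, so free memory
      // across devices cannot be pooled into one logical block.
      func layersPerGPU(gpus []gpu, layerBytes uint64, totalLayers int) map[int]int {
          fit := make(map[int]int)
          remaining := totalLayers
          for _, g := range gpus {
              n := int(g.freeBytes / layerBytes) // whole layers only
              if n > remaining {
                  n = remaining
              }
              fit[g.id] = n
              remaining -= n
          }
          return fit
      }

      func main() {
          // Two GPUs with 5 GiB free each and 3 GiB layers: pooling the
          // 10 GiB would predict 3 layers fit, but only 1 fits per device.
          gpus := []gpu{{0, 5 << 30}, {1, 5 << 30}}
          fmt.Println(layersPerGPU(gpus, 3<<30, 4)) // map[0:1 1:1]
      }
      ```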
  2. 09 Jun, 2024 1 commit
  3. 04 Jun, 2024 2 commits
  4. 01 Jun, 2024 1 commit
  5. 30 May, 2024 1 commit
  6. 29 May, 2024 1 commit
  7. 28 May, 2024 2 commits
  8. 25 May, 2024 1 commit
  9. 24 May, 2024 1 commit
  10. 23 May, 2024 2 commits
  11. 20 May, 2024 1 commit
      feat: add support for flash_attn (#4120) · e15307fd
      Sam authored
      * feat: enable flash attention if supported

      * feat: add flash_attn support
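      As a hedged sketch of "enable flash attention if supported": the runner
      could append a flash-attention flag only when the user opts in and the
      backend supports it. The OLLAMA_FLASH_ATTENTION variable and
      --flash-attn flag below follow Ollama/llama.cpp conventions but are
      assumptions here, not confirmed by this log.

      ```go
      package main

      import (
          "fmt"
          "os"
          "strconv"
      )

      // flashAttnEnabled combines the user's opt-in with a capability check.
      // supported would come from probing the GPU backend; hypothetical here.
      func flashAttnEnabled(supported bool) bool {
          requested, _ := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION"))
          return requested && supported
      }

      func main() {
          args := []string{"--model", "model.gguf"}
          if flashAttnEnabled(true) {
              args = append(args, "--flash-attn") // assumed runner flag
          }
          fmt.Println(args)
      }
      ```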
  12. 15 May, 2024 2 commits
  13. 14 May, 2024 1 commit
  14. 11 May, 2024 1 commit
  15. 10 May, 2024 2 commits
  16. 09 May, 2024 5 commits
  17. 08 May, 2024 1 commit
  18. 07 May, 2024 1 commit
  19. 06 May, 2024 3 commits
  20. 05 May, 2024 1 commit
      Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all of the env var reading into one central module
      and logs the loaded config once at startup, which should help
      when troubleshooting from users' server logs. The pattern is
      sketched below.
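      A minimal sketch of the centralization idea: read every env var in one
      Load function and log the resolved config once at startup. The struct
      fields and defaults are illustrative; only the env var names
      (OLLAMA_HOST, OLLAMA_NUM_PARALLEL, OLLAMA_DEBUG) mirror Ollama's
      documented settings.

      ```go
      package main

      import (
          "log/slog"
          "os"
          "strconv"
      )

      // Config gathers the server's env-driven settings in one place. This is
      // an illustrative subset, not the full set Ollama reads.
      type Config struct {
          Host        string
          NumParallel int
          Debug       bool
      }

      // Load reads each environment variable exactly once and applies
      // defaults, so no other module touches os.Getenv directly.
      func Load() Config {
          cfg := Config{Host: "127.0.0.1:11434", NumParallel: 1}
          if v := os.Getenv("OLLAMA_HOST"); v != "" {
              cfg.Host = v
          }
          if n, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil && n > 0 {
              cfg.NumParallel = n
          }
          cfg.Debug, _ = strconv.ParseBool(os.Getenv("OLLAMA_DEBUG"))
          return cfg
      }

      func main() {
          cfg := Load()
          // Logging the resolved config once at startup makes user logs
          // self-describing when troubleshooting.
          slog.Info("server config", "host", cfg.Host, "parallel", cfg.NumParallel, "debug", cfg.Debug)
      }
      ```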
  21. 01 May, 2024 4 commits
  22. 29 Apr, 2024 1 commit
  23. 26 Apr, 2024 1 commit
  24. 25 Apr, 2024 1 commit
  25. 23 Apr, 2024 2 commits
      Detect and recover if runner removed · 58888a74
      Daniel Hiltgen authored
      Tmp cleaners can nuke the runner file out from underneath us. This
      detects the missing runner and re-initializes the payloads, as
      sketched below.
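      A sketch of the detect-and-recover pattern, assuming a hypothetical
      reinitPayloads helper that stands in for re-extracting the bundled
      runner binary:

      ```go
      package main

      import (
          "errors"
          "fmt"
          "os"
          "path/filepath"
      )

      // reinitPayloads stands in for re-extracting the bundled runner
      // binaries; the name and behavior are hypothetical.
      func reinitPayloads(path string) error {
          return os.WriteFile(path, []byte("runner"), 0o755)
      }

      // ensureRunner re-creates the runner if a tmp cleaner deleted it out
      // from under us, instead of failing the next model load.
      func ensureRunner(path string) error {
          _, err := os.Stat(path)
          if errors.Is(err, os.ErrNotExist) {
              fmt.Println("runner missing, re-initializing payloads")
              return reinitPayloads(path)
          }
          return err
      }

      func main() {
          path := filepath.Join(os.TempDir(), "ollama-runner")
          if err := ensureRunner(path); err != nil {
              fmt.Println("recovery failed:", err)
          }
      }
      ```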
      Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The defaults are
      currently 1 concurrent request per model and only 1 loaded model at a
      time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
      OLLAMA_MAX_LOADED_MODELS. A simplified scheduler sketch follows below.
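      A simplified sketch of how such a scheduler can cap loaded models and
      per-model parallelism with a mutex-guarded map and buffered-channel
      semaphores. It illustrates the two knobs described above; it is not
      Ollama's actual scheduler, and a real one would also evict idle models.

      ```go
      package main

      import (
          "fmt"
          "sync"
      )

      // scheduler caps loaded models (OLLAMA_MAX_LOADED_MODELS) and
      // per-model in-flight requests (OLLAMA_NUM_PARALLEL).
      type scheduler struct {
          mu          sync.Mutex
          maxLoaded   int
          numParallel int
          runners     map[string]chan struct{} // per-model request slots
      }

      func newScheduler(maxLoaded, numParallel int) *scheduler {
          return &scheduler{
              maxLoaded:   maxLoaded,
              numParallel: numParallel,
              runners:     make(map[string]chan struct{}),
          }
      }

      // acquire reserves a request slot, spawning a runner on first use,
      // and returns a release func for when the request completes.
      func (s *scheduler) acquire(model string) (release func(), err error) {
          s.mu.Lock()
          slots, ok := s.runners[model]
          if !ok {
              if len(s.runners) >= s.maxLoaded {
                  s.mu.Unlock()
                  return nil, fmt.Errorf("max loaded models reached")
              }
              slots = make(chan struct{}, s.numParallel)
              s.runners[model] = slots // a real system spawns a runner here
          }
          s.mu.Unlock()
          slots <- struct{}{} // blocks once numParallel requests are in flight
          return func() { <-slots }, nil
      }

      func main() {
          s := newScheduler(1, 1) // the defaults described in the commit
          release, err := s.acquire("llama3")
          if err == nil {
              defer release()
              fmt.Println("handling request")
          }
      }
      ```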