  1. 25 Jun, 2024 1 commit
      llm: speed up gguf decoding by a lot (#5246) · cb42e607
      Blake Mizerany authored
      Previously, some costly things were causing the loading of GGUF files
      and their metadata and tensor information to be VERY slow:
      
        * Too many allocations when decoding strings
        * Hitting disk for each read of each key and value, resulting in an
          excessive number of syscalls and disk I/O operations.
      
      The show API now takes 33ms, down from 800ms+, for llama3 on a
      MacBook Pro M3.
      
      This commit also makes it possible to skip collecting large arrays of
      values when decoding GGUFs. When a key with such an array is
      encountered, its value is null and is encoded as null in JSON.
      
      Also, this fixes a broken test that was not encoding valid GGUF.
  2. 14 Jun, 2024 6 commits
  3. 04 Jun, 2024 3 commits
  4. 24 May, 2024 1 commit
  5. 21 May, 2024 1 commit
  6. 14 May, 2024 2 commits
  7. 10 May, 2024 2 commits
  8. 09 May, 2024 1 commit
      Wait for GPU free memory reporting to converge · 354ad925
      Daniel Hiltgen authored
      The GPU drivers take a while to update their free memory reporting, so we need
      to wait until the values converge with what we're expecting before proceeding
      to start another runner in order to get an accurate picture.
  9. 06 May, 2024 2 commits
  10. 05 May, 2024 3 commits
  11. 01 May, 2024 2 commits
  12. 28 Apr, 2024 1 commit
      Fix concurrency for CPU mode · d6e3b645
      Daniel Hiltgen authored
      Prior refactoring passes accidentally removed the logic to bypass VRAM
      checks for CPU loads.  This adds that back, along with test coverage.
      
      This also moves loaded map access in the unit test behind the mutex. The
      unguarded access was likely the cause of various flakes in the tests.
  13. 25 Apr, 2024 2 commits
  14. 24 Apr, 2024 2 commits
  15. 23 Apr, 2024 1 commit
      Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The default
      settings are currently set at 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.