1. 02 Aug, 2024 1 commit
  2. 31 Jul, 2024 2 commits
  3. 30 Jul, 2024 2 commits
    • Add Metrics to `api/embed` response (#5709) · 1b44d873
      royjhan authored
      * add prompt tokens to embed response
      
      * rm slog
      
      * metrics
      
      * types
      
      * prompt n
      
      * clean up
      
      * reset submodule
      
      * update tests
      
      * test name
      
      * list metrics
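As a rough sketch of what this change adds, the embed response now carries token and timing metrics alongside the embeddings themselves. The field names below follow Ollama's other response types (total_duration, load_duration, prompt_eval_count), but treat the exact shape as an assumption rather than the actual definition:

```go
// A minimal sketch, assuming the response shape suggested by the commit
// trail above; durations serialize as nanosecond integers, matching
// Ollama's other API responses.
package api

import "time"

// EmbedResponse mirrors what /api/embed might return after this change.
type EmbedResponse struct {
	Model      string      `json:"model"`
	Embeddings [][]float32 `json:"embeddings"`

	// Metrics added by this commit (assumed names).
	TotalDuration   time.Duration `json:"total_duration,omitempty"`
	LoadDuration    time.Duration `json:"load_duration,omitempty"`
	PromptEvalCount int           `json:"prompt_eval_count,omitempty"`
}
```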
    • Prevent partial loading on mixed GPU brands · 34542099
      Daniel Hiltgen authored
      In multi-brand GPU setups, if we couldn't fully load the model we
      would fall through the scheduler and mistakenly try to load across
      a mix of brands. This change makes sure we find the set of GPU(s)
      that best fits the partial load.
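A hypothetical illustration of the fitting idea (not the actual scheduler code): group the available GPUs by brand and pick the single-brand set offering the most free memory, so a partial load never spans brands.

```go
// Sketch only: brand/free-memory bookkeeping is simplified to two fields.
package main

import "fmt"

type GPU struct {
	Brand string // e.g. "cuda" or "rocm"
	Free  uint64 // free VRAM in bytes
}

// bestSingleBrand groups GPUs by brand and returns the group with the
// largest total free memory, so a partial load stays on one brand.
func bestSingleBrand(gpus []GPU) []GPU {
	groups := map[string][]GPU{}
	for _, g := range gpus {
		groups[g.Brand] = append(groups[g.Brand], g)
	}
	var best []GPU
	var bestFree uint64
	for _, group := range groups {
		var free uint64
		for _, g := range group {
			free += g.Free
		}
		if free > bestFree {
			bestFree, best = free, group
		}
	}
	return best
}

func main() {
	gpus := []GPU{{"cuda", 8 << 30}, {"rocm", 12 << 30}, {"cuda", 6 << 30}}
	fmt.Println(bestSingleBrand(gpus)) // the two cuda GPUs: 14 GiB total
}
```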
  4. 22 Jul, 2024 1 commit
  5. 21 Jul, 2024 1 commit
  6. 15 Jul, 2024 1 commit
    • Introduce `/api/embed` endpoint supporting batch embedding (#5127) · b9f5e16c
      royjhan authored
      * Initial Batch Embedding
      
      * Revert "Initial Batch Embedding"
      
      This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.
      
      * Initial Draft
      
      * mock up notes
      
      * api/embed draft
      
      * add server function
      
      * check normalization
      
      * clean up
      
      * normalization
      
      * playing around with truncate stuff
      
      * Truncation
      
      * Truncation
      
      * move normalization to go
      
      * Integration Test Template
      
      * Truncation Integration Tests
      
      * Clean up
      
      * use float32
      
      * move normalize
      
      * move normalize test
      
      * refactoring
      
      * integration float32
      
      * input handling and handler testing
      
      * Refactoring of legacy and new
      
      * clear comments
      
      * merge conflicts
      
      * touches
      
      * embedding type 64
      
      * merge conflicts
      
      * fix hanging on single string
      
      * refactoring
      
      * test values
      
      * set context length
      
      * clean up
      
      * testing clean up
      
      * testing clean up
      
      * remove function closure
      
      * Revert "remove function closure"
      
      This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.
      
      * remove function closure
      
      * remove redundant error check
      
      * clean up
      
      * more clean up
      
      * clean up
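Two details from the trail above ("use float32", "move normalization to go") suggest a small, self-contained piece worth sketching: L2-normalizing a float32 embedding in Go. This is a minimal sketch, not the server's actual implementation:

```go
// A minimal sketch of embedding normalization; the real code may differ
// in naming and placement.
package main

import (
	"fmt"
	"math"
)

// normalize scales a float32 embedding to unit length; a zero vector is
// returned unchanged to avoid dividing by zero.
func normalize(vec []float32) []float32 {
	var sum float64
	for _, v := range vec {
		sum += float64(v) * float64(v)
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return vec
	}
	out := make([]float32, len(vec))
	for i, v := range vec {
		out[i] = float32(float64(v) / norm)
	}
	return out
}

func main() {
	fmt.Println(normalize([]float32{3, 4})) // [0.6 0.8]
}
```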
  7. 09 Jul, 2024 1 commit
  8. 03 Jul, 2024 2 commits
    • Only set default keep_alive on initial model load · 955f2a4e
      Daniel Hiltgen authored
      This change fixes the handling of keep_alive so that if the client
      request omits the setting, we only apply the default on the initial
      load. Once the model is loaded, if new requests leave this unset,
      we keep whatever keep_alive was already in place.
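A hypothetical sketch of that rule, with illustrative names: the default applies only on the initial load, an explicit client value always wins, and an omitted value on a loaded model changes nothing.

```go
// Sketch only; the runner type and defaults are assumptions.
package server

import "time"

const defaultKeepAlive = 5 * time.Minute

type runner struct {
	keepAlive time.Duration
	loaded    bool
}

// effectiveKeepAlive returns the keep_alive to use for a request, where a
// nil requested value means "client omitted the setting".
func (r *runner) effectiveKeepAlive(requested *time.Duration) time.Duration {
	switch {
	case requested != nil:
		r.keepAlive = *requested // explicit value always wins
	case !r.loaded:
		r.keepAlive = defaultKeepAlive // default only on initial load
	}
	// otherwise: already loaded and unset -> keep the existing value
	r.loaded = true
	return r.keepAlive
}
```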
    • Prevent loading models larger than total memory · 3c75113e
      Daniel Hiltgen authored
      Users may not realize that the shiny new model they're trying to
      load fits on their disk but can't load into system+GPU memory.
      Today we crash; with this fix, we give them a better error message
      before even trying to load it.
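A minimal sketch of such a pre-flight check, assuming a simple size-versus-available comparison (the real footprint estimate is more involved):

```go
// Sketch only: refuse up front rather than crashing mid-load.
package server

import "fmt"

// checkMemory returns a friendly error when the model cannot possibly fit
// in combined system and GPU memory.
func checkMemory(modelSize, systemFree, gpuFree uint64) error {
	if total := systemFree + gpuFree; modelSize > total {
		return fmt.Errorf("model requires %d bytes but only %d bytes of system+GPU memory are available", modelSize, total)
	}
	return nil
}
```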
  9. 25 Jun, 2024 1 commit
    • llm: speed up gguf decoding by a lot (#5246) · cb42e607
      Blake Mizerany authored
      Previously, some costly things were causing the loading of GGUF files
      and their metadata and tensor information to be VERY slow:
      
        * Too many allocations when decoding strings
        * Hitting disk for each read of each key and value, resulting in
          an excessive amount of syscalls/disk I/O.
      
      The show API is now down to 33ms from 800ms+ for llama3 on a MacBook
      Pro M3.
      
      This commit also adds the option to skip collecting large arrays of
      values when decoding GGUFs. When such keys are encountered, their
      values are left null and encoded as such in JSON.
      
      Also, this fixes a broken test that was not encoding valid GGUF.
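A minimal sketch of the two fixes named above, assuming a straightforward decoder layout: buffer the file reads so each key/value no longer hits disk, and reuse one scratch buffer when decoding length-prefixed GGUF strings to cut allocations.

```go
// Sketch only; not the actual decoder. GGUF strings are uint64
// length-prefixed, which is what readString assumes.
package gguf

import (
	"bufio"
	"encoding/binary"
	"io"
	"os"
)

type decoder struct {
	r       *bufio.Reader
	scratch []byte // reused across string reads to avoid reallocating
}

func newDecoder(f *os.File) *decoder {
	return &decoder{r: bufio.NewReaderSize(f, 1<<20)} // 1 MiB read buffer
}

// readString reads a length-prefixed GGUF string using the shared buffer.
func (d *decoder) readString() (string, error) {
	var n uint64
	if err := binary.Read(d.r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	if uint64(cap(d.scratch)) < n {
		d.scratch = make([]byte, n)
	}
	buf := d.scratch[:n]
	if _, err := io.ReadFull(d.r, buf); err != nil {
		return "", err
	}
	return string(buf), nil
}
```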
  10. 21 Jun, 2024 1 commit
    • Enable concurrency by default · 17b7186c
      Daniel Hiltgen authored
      This adjusts our default settings to enable multiple models and parallel
      requests to a single model. Users can still override these via the same
      env var settings as before. Parallel has a direct impact on num_ctx,
      which in turn can have a significant impact on small-VRAM GPUs, so this
      change also refines the algorithm: when parallel is not explicitly set
      by the user, we try to find a reasonable default that fits the model on
      their GPU(s). As before, multiple models will only load concurrently if
      they fully fit in VRAM.
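A hypothetical sketch of the default-finding refinement: when OLLAMA_NUM_PARALLEL is unset, walk candidate parallel values downward until the memory estimate (which grows with parallel, since each slot needs its own share of num_ctx) fits in free VRAM. The starting candidate of 4 is an assumption.

```go
// Sketch only; the real algorithm and its estimate function differ.
package scheduler

// pickParallel returns the largest parallel value whose memory estimate
// fits in free VRAM, falling back to 1.
func pickParallel(freeVRAM uint64, estimate func(numParallel int) uint64) int {
	for p := 4; p > 1; p-- {
		if estimate(p) <= freeVRAM {
			return p
		}
	}
	return 1
}
```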
  11. 14 Jun, 2024 4 commits
  12. 04 Jun, 2024 1 commit
  13. 24 May, 2024 1 commit
  14. 23 May, 2024 1 commit
  15. 14 May, 2024 1 commit
  16. 06 May, 2024 2 commits
  17. 05 May, 2024 1 commit
    • Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all the env var reading into one central module and logs
      the loaded config once at startup, which should help in
      troubleshooting user server logs.
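A minimal sketch of the pattern, with illustrative names rather than the real module: one function owns all env reads and logs the resulting config once.

```go
// Sketch only; the set of fields is trimmed to two env vars that appear
// elsewhere in this log.
package envconfig

import (
	"log/slog"
	"os"
	"strconv"
)

type Config struct {
	NumParallel     int
	MaxLoadedModels int
}

// intEnv reads an integer env var, falling back to def when unset or invalid.
func intEnv(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil {
			return n
		}
	}
	return def
}

// Load reads every setting in one place and logs the result once, so a
// user's server log shows exactly what config the server is running with.
func Load() Config {
	c := Config{
		NumParallel:     intEnv("OLLAMA_NUM_PARALLEL", 1),
		MaxLoadedModels: intEnv("OLLAMA_MAX_LOADED_MODELS", 1),
	}
	slog.Info("server config",
		"num_parallel", c.NumParallel,
		"max_loaded_models", c.MaxLoadedModels)
	return c
}
```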
  18. 03 May, 2024 1 commit
  19. 28 Apr, 2024 1 commit
    • Fix concurrency for CPU mode · d6e3b645
      Daniel Hiltgen authored
      Prior refactoring passes accidentally removed the logic to bypass VRAM
      checks for CPU loads. This adds that back, along with test coverage.
      
      This also fixes loaded-map access in the unit test to be behind the
      mutex, which was likely the cause of various flakes in the tests.
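A hypothetical sketch of both points: CPU-only loads skip the VRAM fit check entirely, and the test reads the loaded map only while holding the mutex.

```go
// Sketch only; the struct and field names are illustrative.
package scheduler

import "sync"

type scheduler struct {
	mu     sync.Mutex
	loaded map[string]bool // model name -> currently loaded
}

// canLoad bypasses VRAM fitting entirely for CPU-only loads.
func canLoad(cpuOnly bool, modelVRAM, freeVRAM uint64) bool {
	if cpuOnly {
		return true // system RAM checks happen elsewhere
	}
	return modelVRAM <= freeVRAM
}

// loadedCount is how a test should read the map: behind the mutex,
// avoiding the data race that likely caused the flakes.
func (s *scheduler) loadedCount() int {
	s.mu.Lock()
	defer s.mu.Unlock()
	return len(s.loaded)
}
```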
  20. 25 Apr, 2024 1 commit
  21. 24 Apr, 2024 3 commits
  22. 23 Apr, 2024 2 commits
    • Harden sched TestLoad · d8851cb7
      Daniel Hiltgen authored
      Give the goroutine a moment to deliver the expired event.
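A minimal sketch of that hardening pattern in a test helper, assuming the expired event arrives on a channel: wait with a bounded timeout instead of asserting immediately.

```go
// Sketch only; the channel and its element type are assumptions.
package scheduler

import (
	"testing"
	"time"
)

// waitForExpired blocks until the expired event arrives or a deadline
// passes, so a slow goroutine no longer flakes the test.
func waitForExpired(t *testing.T, expired <-chan string) {
	t.Helper()
	select {
	case name := <-expired:
		t.Logf("got expired event for %s", name)
	case <-time.After(time.Second): // a moment, not forever
		t.Fatal("timed out waiting for expired event")
	}
}
```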
    • Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The default
      settings are currently set at 1 concurrent request per model and only 1
      loaded model at a time, but these can be adjusted by setting
      OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
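A hypothetical sketch of the runner/slot idea with illustrative names: each loaded model gets its own runner, and a buffered channel of OLLAMA_NUM_PARALLEL tokens bounds concurrent requests per runner.

```go
// Sketch only; the real scheduler's types and mechanics differ.
package scheduler

type runnerRef struct {
	model string
	slots chan struct{} // one token per allowed parallel request
}

// newRunner creates a runner with numParallel request slots available.
func newRunner(model string, numParallel int) *runnerRef {
	r := &runnerRef{model: model, slots: make(chan struct{}, numParallel)}
	for i := 0; i < numParallel; i++ {
		r.slots <- struct{}{}
	}
	return r
}

// acquire blocks until a request slot is free; release returns it.
func (r *runnerRef) acquire() { <-r.slots }
func (r *runnerRef) release() { r.slots <- struct{}{} }
```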