1. 01 Jul, 2024 9 commits
  2. 29 Jun, 2024 2 commits
  3. 28 Jun, 2024 4 commits
  4. 27 Jun, 2024 5 commits
  5. 25 Jun, 2024 2 commits
    • llm: speed up gguf decoding by a lot (#5246) · cb42e607
      Blake Mizerany authored
      Previously, some costly things were causing the loading of GGUF files
      and their metadata and tensor information to be VERY slow:
      
        * Too many allocations when decoding strings
        * Hitting disk for each read of each key and value, resulting in an
          excessive number of syscalls and disk I/O.
      
      The show API is now down to 33ms from 800ms+ for llama3 on a MacBook
      Pro M3.
      
      This commit also allows skipping the collection of large arrays of
      values when decoding GGUFs, if desired. When such keys are encountered,
      their values are set to null and encoded as such in JSON.
      
      Also, this fixes a broken test that was not encoding valid GGUF.
    • cmd: defer stating model info until necessary (#5248) · 2aa91a93
      Blake Mizerany authored
      This commit changes the 'ollama run' command to defer fetching model
      information until it actually needs it, that is, when running in
      interactive mode.
      
      It also removes a case where the model information was fetched in
      duplicate: once just before calling generateInteractive and then
      again, first thing, inside generateInteractive.
      
      This positively impacts the performance of the command:
      
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
          ; time ./before run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
      
          ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
          ; time ./after run llama3 'hi'
          Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
      
          ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
  6. 21 Jun, 2024 8 commits
    • Merge pull request #5205 from dhiltgen/modelfile_use_mmap · ccef9431
      Daniel Hiltgen authored
      Fix use_mmap parsing for modelfiles
    • Sort the ps output · 642cee13
      Daniel Hiltgen authored
      Provide consistent ordering for the ps command - longest duration listed first
    • Docs (#5149) · 9a9e7d83
      royjhan authored
    • Disable concurrency for AMD + Windows · 9929751c
      Daniel Hiltgen authored
      Until ROCm v6.2 ships, we won't be able to get accurate free-memory
      reporting on Windows, which makes automatic concurrency too risky.
      Users can still opt in, but will need to pay attention to model
      sizes; otherwise they may thrash/page VRAM or cause OOM crashes.
      All other platforms and GPUs have accurate VRAM reporting wired
      up now, so we can turn on concurrency by default.
    • Enable concurrency by default · 17b7186c
      Daniel Hiltgen authored
      This adjusts our default settings to enable multiple models and
      parallel requests to a single model.  Users can still override these
      with the same env var settings as before.  Parallelism has a direct
      impact on num_ctx, which in turn can have a significant impact on
      small-VRAM GPUs, so this change also refines the algorithm: when
      parallel is not explicitly set by the user, we try to find a
      reasonable default that fits the model on their GPU(s).  As before,
      multiple models will only load concurrently if they fully fit in
      VRAM.
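The selection idea can be sketched as follows. Everything here (function name, candidate range, memory model) is a hypothetical simplification of the behavior described above, not the actual scheduler code: an explicit user setting always wins; otherwise we try candidate parallel counts from high to low and keep the largest one whose total footprint (model weights plus a per-slot KV-cache cost that scales with num_ctx) still fits in free VRAM.

```go
package main

import "fmt"

// pickParallel returns a parallel request count. userSet > 0 means the
// user set it explicitly via env var; 0 means "choose a default".
func pickParallel(userSet int, freeVRAM, modelMem, perSlotKV uint64) int {
	if userSet > 0 {
		return userSet // explicit user setting always wins
	}
	// Try larger counts first; each slot adds its own KV-cache cost.
	for p := 4; p > 1; p-- {
		if modelMem+uint64(p)*perSlotKV <= freeVRAM {
			return p
		}
	}
	return 1 // small-VRAM fallback: no parallelism
}

func main() {
	// 8 GiB free, 5 GiB of weights, 512 MiB of KV cache per slot:
	// 5 GiB + 4*512 MiB = 7 GiB fits, so all 4 candidate slots are kept.
	fmt.Println(pickParallel(0, 8<<30, 5<<30, 512<<20)) // 4
}
```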
    • Merge pull request #5206 from ollama/mxyng/quantize · 189a43ca
      Michael Yang authored
      fix: quantization with template
    • fix: quantization with template · e835ef18
      Michael Yang authored
    • Fix use_mmap parsing for modelfiles · 7e774922
      Daniel Hiltgen authored
      Add the new tristate parsing logic to the modelfile code path, along
      with a unit test.
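The tristate idea can be sketched like this (`parseTristate` is a hypothetical helper, not the actual ollama function): a parameter such as `use_mmap` can be explicitly true, explicitly false, or unset, and "unset" must survive parsing as `nil` so the runtime can apply its own default rather than a spurious `false`.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTristate maps a modelfile parameter value to *bool:
// nil = unset (let the runtime decide), &true / &false = explicit.
func parseTristate(val string) (*bool, error) {
	if val == "" {
		return nil, nil // unset: preserve the "no opinion" state
	}
	b, err := strconv.ParseBool(val)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean %q: %w", val, err)
	}
	return &b, nil
}

func main() {
	for _, v := range []string{"true", "false", ""} {
		p, _ := parseTristate(v)
		if p == nil {
			fmt.Println("use_mmap: unset")
		} else {
			fmt.Println("use_mmap:", *p)
		}
	}
}
```

Using `*bool` instead of `bool` is what makes the third state representable; a plain `bool` would silently collapse "unset" into `false`.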
  7. 20 Jun, 2024 9 commits
  8. 19 Jun, 2024 1 commit
    • Extend api/show and ollama show to return more model info (#4881) · fedf7163
      royjhan authored
      * API Show Extended
      
      * Initial Draft of Information
      Co-Authored-By: Patrick Devine <pdevine@sonic.net>
      
      * Clean Up
      
      * Descriptive arg error messages and other fixes
      
      * Second Draft of Show with Projectors Included
      
      * Remove Chat Template
      
      * Touches
      
      * Prevent wrapping from files
      
      * Verbose functionality
      
      * Docs
      
      * Address Feedback
      
      * Lint
      
      * Resolve Conflicts
      
      * Function Name
      
      * Tests for api/show model info
      
      * Show Test File
      
      * Add Projector Test
      
      * Clean routes
      
      * Projector Check
      
      * Move Show Test
      
      * Touches
      
      * Doc update
      
      ---------
      Co-authored-by: Patrick Devine <pdevine@sonic.net>