1. 06 Jan, 2026 1 commit
    • preserve tool definition and call JSON ordering (#13525) · e51dead6
      Devon Rifkin authored
      * preserve tool definition and call JSON ordering
      
      This is another iteration of
      <https://github.com/ollama/ollama/pull/12518>, but this time we've
      simplified things by relaxing the competing requirements of being
      compatible AND order-preserving with templates (vs. renderers). We
      maintain backwards compatibility at the cost of not guaranteeing order
      for templates. We plan on moving more and more models to renderers,
      which have been updated to use these new data types. We could also
      add an opt-in way for templates to receive an order-preserved list
      (e.g., via sibling template vars).
      
      * orderedmap_test: remove testify
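      A minimal sketch of the idea behind an order-preserving JSON
      object (illustrative only; this is not Ollama's actual orderedmap
      implementation): keys are recorded in the order they are decoded
      and re-emitted in that same order.

        package orderedmap

        import (
            "bytes"
            "encoding/json"
        )

        // Map is a JSON object that remembers key order.
        type Map struct {
            keys   []string
            values map[string]json.RawMessage
        }

        func (m *Map) UnmarshalJSON(data []byte) error {
            dec := json.NewDecoder(bytes.NewReader(data))
            if _, err := dec.Token(); err != nil { // consume '{'
                return err
            }
            m.values = map[string]json.RawMessage{}
            for dec.More() {
                tok, err := dec.Token() // object keys are always strings
                if err != nil {
                    return err
                }
                key := tok.(string)
                var val json.RawMessage // keep the value verbatim
                if err := dec.Decode(&val); err != nil {
                    return err
                }
                m.keys = append(m.keys, key)
                m.values[key] = val
            }
            _, err := dec.Token() // consume '}'
            return err
        }

        func (m Map) MarshalJSON() ([]byte, error) {
            var buf bytes.Buffer
            buf.WriteByte('{')
            for i, k := range m.keys { // emit keys in decoded order
                if i > 0 {
                    buf.WriteByte(',')
                }
                kb, _ := json.Marshal(k)
                buf.Write(kb)
                buf.WriteByte(':')
                buf.Write(m.values[k])
            }
            buf.WriteByte('}')
            return buf.Bytes(), nil
        }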
  2. 11 Dec, 2025 1 commit
    • embeddings: modified batch size (#13429) · 3475d915
      nicole pardal authored

      This PR detects embedding models and sets batch_size = context_size so the full input fits in a single batch.
      Previously, if batch size was smaller than the input, tokens could be split across batches and cause a SIGTRAP crash.
      This change ensures all tokens stay in one batch and prevents crashes.
      Fixes: #12938 #13054
      Co-authored-by: Jesse Gross <jesse@ollama.com>
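      A hedged sketch of the fix (field and function names are
      illustrative, not Ollama's actual options): for embedding models
      the batch size is raised to the context size so a prompt that fits
      in the context is never split across batches.

        package sketch

        // adjustBatchSize sets batch_size = context_size for embedding
        // models, per the commit message above.
        func adjustBatchSize(isEmbedding bool, numCtx, batchSize int) int {
            if isEmbedding && batchSize < numCtx {
                return numCtx
            }
            return batchSize
        }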
  3. 11 Nov, 2025 1 commit
    • server: add logprobs and top_logprobs support to Ollama's API (#12899) · 59241c5b
      Baptiste Jamin authored

      Adds logprobs support to Ollama's API, including Ollama's
      OpenAI-compatible API. When the new 'logprobs' boolean parameter is
      set, Ollama returns the log probability of each generated token. An
      integer 'top_logprobs' parameter (up to 20) can also be specified;
      when set, the API additionally returns that many of the most likely
      tokens at each token position.
      Co-authored-by: Baptiste Jamin <baptiste@crisp.chat>
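      An illustrative request shape for the new parameters (a sketch
      following the commit message, not the definitive API surface):

        package sketch

        type GenerateRequest struct {
            Model  string `json:"model"`
            Prompt string `json:"prompt"`
            // Logprobs requests the log probability of each generated token.
            Logprobs bool `json:"logprobs,omitempty"`
            // TopLogprobs (up to 20) additionally requests that many of the
            // most likely tokens at each token position.
            TopLogprobs int `json:"top_logprobs,omitempty"`
        }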
  4. 27 Oct, 2025 1 commit
    • server: Consolidate embedding truncation in runner (#12730) · 5d347f6d
      nicole pardal authored
      Checking that embedding prompts fit in the context window (and
      truncating them when necessary) currently happens in two places:
      the Ollama server and the runner. This can lead to inconsistencies
      in both the checks and the reported number of tokens processed.
      Since this processing has to happen in the runner anyway, this
      consolidates all of the logic there.
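      A minimal sketch of the consolidated check (illustrative names,
      not the runner's actual code): a single place decides whether a
      tokenized prompt fits the context window, truncating or erroring.

        package sketch

        import "fmt"

        func fitToContext(tokens []int, numCtx int, truncate bool) ([]int, error) {
            if len(tokens) <= numCtx {
                return tokens, nil
            }
            if !truncate {
                return nil, fmt.Errorf("input length %d exceeds context length %d", len(tokens), numCtx)
            }
            return tokens[:numCtx], nil // keep only what fits
        }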
  5. 17 Oct, 2025 1 commit
    • test: harden scheduler tests (#12662) · 68e04c7f
      Daniel Hiltgen authored
      * test: harden scheduler tests
      
      This removes reschedDelay, which was stale code, and adds a
      configurable timeout for waitForVRAMRecovery so tests can set a
      very short timeout and avoid the scheduler getting stuck and
      hitting a test timeout.
      
      * test: tune tests for partial loads
      
      Give stress tests more time when the model is split between CPU/GPU
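      A sketch of the configurable-timeout idea (hypothetical signature;
      the real waitForVRAMRecovery is internal to the scheduler): tests
      pass a very short timeout so a stuck wait fails fast instead of
      hanging until the suite-level timeout.

        package sketch

        import "time"

        // waitOrTimeout returns true if done fires before the timeout.
        func waitOrTimeout(done <-chan struct{}, timeout time.Duration) bool {
            select {
            case <-done:
                return true
            case <-time.After(timeout):
                return false
            }
        }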
  6. 02 Oct, 2025 1 commit
    • Update GGML to b6646 (#12245) · c68f367e
      Daniel Hiltgen authored
      Notable EOLs with this change:
      - macOS v12 and v13 are no longer supported (v14+ required)
      - AMD gfx900 and gfx906 are no longer supported
  7. 01 Oct, 2025 1 commit
    • Use runners for GPU discovery (#12090) · bc8909fb
      Daniel Hiltgen authored
      This revamps how we discover GPUs in the system by leveraging the
      Ollama runner. This should eliminate inconsistency between our GPU
      discovery and the runner's capabilities at runtime, particularly
      for cases where we try to filter out unsupported GPUs. Now the
      runner does that implicitly based on the actual device list. In
      some cases free VRAM reporting can be unreliable, which can lead to
      scheduling mistakes, so this also includes a patch to leverage more
      reliable VRAM reporting libraries if available.
      
      Automatic workarounds have been removed, as only one GPU relied on
      them; that workaround is now documented. This GPU will soon fall
      off the support matrix with the next ROCm bump.
      
      Additional cleanup of the scheduler and discovery packages can be done in the
      future once we have switched on the new memory management code, and removed
      support for the llama runner.
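      A hedged sketch of the shape of this design (hypothetical types,
      not Ollama's actual discovery API): rather than filtering GPUs up
      front, the scheduler asks each runner which devices it can
      actually drive.

        package sketch

        // DeviceInfo describes one GPU as reported by a runner.
        type DeviceInfo struct {
            ID       string
            FreeVRAM uint64 // bytes; ideally from a reliable vendor library
        }

        // Runner reports its own usable devices; unsupported GPUs simply
        // never appear in the list, so no separate filtering is needed.
        type Runner interface {
            Devices() ([]DeviceInfo, error)
        }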
  8. 09 Sep, 2025 3 commits
    • 20b53eaa
      Parth Sareen authored
    • tests: reduce stress on CPU to 2 models (#12161) · 67451828
      Daniel Hiltgen authored
      * tests: reduce stress on CPU to 2 models
      
      This should avoid flakes due to systems getting overloaded with 3 (or more) models running concurrently.
      
      * tests: allow slow systems to pass on timeout
      
      If a slow system is still streaming a response, and the response
      will pass validation, don't fail just because the system is slow.
      
      * test: unload embedding models more quickly
    • llm: Clamp batch size to context size · e119783e
      Jesse Gross authored
      The context must always be able to store the current batch, so
      if the user requests a small context then we should also shrink
      the batch to match. This also fixes the TestLongInputContext
      test on the new engine. (The old engine already has this behavior.)
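      The inverse of the embedding batch-size fix above, as a sketch
      (illustrative names): a small requested context also shrinks the
      batch, since the context must always hold the current batch.

        package sketch

        func clampBatch(batchSize, numCtx int) int {
            if batchSize > numCtx {
                return numCtx // the context must fit the whole batch
            }
            return batchSize
        }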
  9. 29 Aug, 2025 1 commit
    • perf: build graph for next batch async to keep GPU busy (#11863) · 517807cd
      Daniel Hiltgen authored
      * perf: build graph for next batch in parallel to keep GPU busy
      
      This refactors the main run loop of the ollama runner to perform
      the main GPU-intensive tasks (Compute+Floats) in a goroutine so we
      can prepare the next batch in parallel, reducing the amount of time
      the GPU stalls waiting for the next batch of work.
      
      * tests: tune integration tests for ollama engine
      
      This tunes the integration tests to focus more on models supported
      by the new engine.
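      A sketch of the overlap pattern (illustrative types; not the
      runner's actual run loop): GPU compute for the current batch runs
      in a goroutine while the loop prepares the next batch.

        package sketch

        type Batch struct{ /* tokens, graph, ... */ }

        func run(batches []Batch, prepare func(Batch) Batch, compute func(Batch)) {
            done := make(chan struct{})
            close(done) // nothing in flight yet
            for _, b := range batches {
                next := prepare(b) // CPU-side prep overlaps prior GPU work
                <-done             // wait for the previous batch to finish
                done = make(chan struct{})
                go func(b Batch, fin chan struct{}) {
                    compute(b) // GPU-intensive work
                    close(fin)
                }(next, done)
            }
            <-done // drain the final batch
        }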
  10. 15 Aug, 2025 1 commit
    • test: improve scheduler/concurrency stress tests (#11906) · d6f7233a
      Daniel Hiltgen authored
      * test: improve scheduler/concurrency stress tests
      
      The scheduler test used to use approximate memory figures and
      would often overshoot or undershoot a system's capacity, leading to
      flaky test results. This should improve the reliability of this
      scenario by leveraging ps output to determine exactly how many
      models it takes to trigger thrashing.
      
      The concurrency test is also refined to target num_parallel + 1 and handle
      timeouts better.
      
      With these refinements, TestMultiModelConcurrency became redundant.
      
      * test: add parallel generate with history
      
      TestGenerateWithHistory will help verify that caching and context
      are properly handled while making requests.
      
      * test: focus embed tests on embedding models
      
      remove non-embedding models from the embedding tests
  11. 05 Jul, 2025 1 commit
    • int: add performance integration tests (#11173) · 4f473e22
      Daniel Hiltgen authored
      usage example:
        go test --tags=integration,perf -count 1 ./integration -v -timeout 1h -run TestModelsPerf 2>&1 | tee int.log
        cat int.log | grep MODEL_PERF_HEADER | cut -f2- -d: > perf.csv
        cat int.log | grep MODEL_PERF_DATA | cut -f2- -d: >> perf.csv
  12. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.