1. 06 May, 2025 1 commit
    • Move quantization to new backend (#10363) · 42481045
      Daniel Hiltgen authored
      * Move quantization logic to GGML via new backend
      
      This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
      
      * Remove "add model quantizations"
      
      This is no longer needed now that quantization is implemented in Go+GGML code directly.
  2. 05 May, 2025 1 commit
  3. 03 May, 2025 1 commit
    • sched: logging improvements (#10550) · 76ea735a
      Daniel Hiltgen authored
      This enhances logging in the scheduler. The initial "waiting for server" log no
      longer claims an error state; it now reports "not responding", which better
      reflects the actual state. Runners now have slog wiring to report more details
      about the runner, including its PID.
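      As a rough sketch of slog wiring of this kind (the runner name and fields here
      are hypothetical, not ollama's actual attributes):

        package main

        import (
            "log/slog"
            "os/exec"
        )

        func main() {
            cmd := exec.Command("sleep", "60")
            if err := cmd.Start(); err != nil {
                slog.Error("failed to start runner", "error", err)
                return
            }
            // Derive a logger that stamps every message with runner details so
            // scheduler logs can be correlated with a specific process.
            runnerLog := slog.With("runner", "llama", "pid", cmd.Process.Pid)
            runnerLog.Info("runner started")
            runnerLog.Info("waiting for server to respond") // no longer claims an error
            _ = cmd.Wait()
        }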
  4. 01 May, 2025 1 commit
  5. 30 Apr, 2025 2 commits
    • strip out thinking tags in message history for qwen3 & r1 (#10490) · ad3c7c9b
      Devon Rifkin authored
      * strip out thinking tags in message history for qwen3 & r1
      
      This is in advance of "proper" support, where we'll make reasoning configurable
      and parse out thinking/reasoning tags to provide them to the caller. These
      models expect no thinking tags in the message history, so stripping them should
      improve quality.
      
      * parse model names instead of hacky prefix check
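      A minimal regexp-based sketch of the stripping step; the real change likely
      hooks into the model's own parsing, and the tag name varies by model:

        package main

        import (
            "fmt"
            "regexp"
        )

        // thinkBlock matches a <think>...</think> span plus trailing whitespace.
        // (?s) lets '.' match newlines inside the block.
        var thinkBlock = regexp.MustCompile(`(?s)<think>.*?</think>\s*`)

        // stripThinking removes reasoning tags from a historical message before it
        // is fed back to models (e.g. qwen3, deepseek-r1) that expect a
        // thinking-free history.
        func stripThinking(content string) string {
            return thinkBlock.ReplaceAllString(content, "")
        }

        func main() {
            msg := "<think>chain of thought...</think>The answer is 42."
            fmt.Println(stripThinking(msg)) // "The answer is 42."
        }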
    • Fix "Stopping..." scheduler hang (#10487) · 415c8fcc
      Daniel Hiltgen authored
      * Adjust initial scheduler refCount
      
      Ensure we only set the refCount on success
      
      * sched: fix lock order inversion deadlock
      
      Under certain race conditions, there was a scenario where the scheduler would
      get into a deadlock while trying to update free space information while a model
      was trying to unload.
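      The shape of the bug, reduced to a sketch (these are not the scheduler's actual
      locks): one goroutine takes the scheduler lock then wants the runner lock while
      another takes them in the opposite order, and both block forever. The fix is a
      single, consistent acquisition order:

        package main

        import "sync"

        var (
            schedMu  sync.Mutex // protects scheduler state (e.g. free space info)
            runnerMu sync.Mutex // protects a loaded runner during unload
        )

        // updateFreeSpace previously took schedMu then runnerMu while unloadModel
        // took runnerMu then schedMu -- a classic lock order inversion. Both now
        // acquire schedMu before runnerMu, so they can no longer deadlock.
        func updateFreeSpace() {
            schedMu.Lock()
            defer schedMu.Unlock()
            runnerMu.Lock()
            defer runnerMu.Unlock()
            // ... recompute free VRAM ...
        }

        func unloadModel() {
            schedMu.Lock()
            defer schedMu.Unlock()
            runnerMu.Lock()
            defer runnerMu.Unlock()
            // ... release the runner ...
        }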
  6. 29 Apr, 2025 1 commit
    • lower default num parallel to 2 · fe5b9bb2
      Devon Rifkin authored
      This is in part to "pay" for #10452, which doubled the default context length.
      The combination isn't fully neutral, though: the old 4x2k limit and the new
      2x4k limit are memory-equivalent (both total 8192 tokens), but the 1x fallback
      is larger with 4k.
  7. 28 Apr, 2025 1 commit
  8. 25 Apr, 2025 2 commits
    • explicitly decode maxarraysize 1024 · 340448d2
      Michael Yang authored
    • fix superfluous call to WriteHeader · 214a7678
      Michael Yang authored
      The first call to http.ResponseWriter.Write implicitly calls WriteHeader with
      http.StatusOK if it hasn't already been called. Once WriteHeader has been
      called, subsequent calls have no effect. Write is called when JSON-encoding
      progressUpdateJSON{}, so any call to http.ResponseWriter.WriteHeader after the
      first encode is useless and produces a warning:
      
      http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)
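      In miniature, the pattern the fix removes looks like this (a hypothetical
      handler, not the registry's actual code):

        package main

        import (
            "encoding/json"
            "net/http"
        )

        type progressUpdateJSON struct {
            Total     int64 `json:"total"`
            Completed int64 `json:"completed"`
        }

        func handler(w http.ResponseWriter, r *http.Request) {
            enc := json.NewEncoder(w)
            // The first Encode calls w.Write, which implicitly sends
            // WriteHeader(http.StatusOK).
            enc.Encode(progressUpdateJSON{Total: 100, Completed: 10})

            // Superfluous: headers were already sent, so this does nothing except
            // log "http: superfluous response.WriteHeader call".
            // w.WriteHeader(http.StatusOK)

            enc.Encode(progressUpdateJSON{Total: 100, Completed: 100})
        }

        func main() {
            http.HandleFunc("/pull", handler)
            http.ListenAndServe(":8080", nil)
        }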
  9. 22 Apr, 2025 1 commit
    • increase default context length to 4096 (#10364) · 424f6486
      Devon Rifkin authored
      * increase default context length to 4096
      
      We lower the default numParallel from 4 to 2 and use these "savings" to
      double the default context length from 2048 to 4096.
      
      We're memory neutral in cases where we previously would've used
      numParallel == 4, but we add the following mitigation for some cases where we
      would have previously fallen back to 1x2048 due to low VRAM: we decide between
      2048 and 4096 with a runtime check, choosing 2048 on a one-GPU system with
      total VRAM of <= 4 GB. We purposefully don't check the available VRAM because
      we don't want the context window size to change unexpectedly based on the
      available VRAM.
      
      We plan on making the default even larger, but this is a relatively
      low-risk change we can make to quickly double it.
      
      * fix tests
      
      Add an explicit context length so they don't get truncated; the code that
      converts -1 (the signal for doing a runtime check) isn't running as part of
      these tests.
      
      * tweak small gpu message
      
      * clarify context length default
      
      also make it actually show up in `ollama serve --help`
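      A hypothetical sketch of the described heuristic; the real function name and
      GPU accounting in ollama differ:

        package main

        import "fmt"

        type gpuInfo struct {
            TotalVRAM uint64 // bytes; total, not available, so the default is stable
        }

        func defaultContextLength(gpus []gpuInfo) int {
            // On a single-GPU system with <= 4 GiB of total VRAM, fall back to
            // 2048 to avoid pushing small systems over their memory budget.
            if len(gpus) == 1 && gpus[0].TotalVRAM <= 4<<30 {
                return 2048
            }
            return 4096
        }

        func main() {
            small := []gpuInfo{{TotalVRAM: 4 << 30}}
            big := []gpuInfo{{TotalVRAM: 24 << 30}}
            fmt.Println(defaultContextLength(small), defaultContextLength(big)) // 2048 4096
        }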
  10. 19 Apr, 2025 2 commits
  11. 17 Apr, 2025 1 commit
  12. 16 Apr, 2025 4 commits
    • server/internal/registry: remove superfluous progress bar flush (#10303) · 369de832
      Blake Mizerany authored
      This removes the extra flushProgress() at the end of handlePull. It is
      unnecessary because final progress updates are flushed in all cases of
      the main select loop.
    • server/internal/client/ollama: cleanup use of multiple counters (#10304) · 3457a315
      Blake Mizerany authored
      The completed and received counters must work in tandem and the code
      should better reflect that. Previously, the act of updating them was 2-3
      lines of code duplicated in multiple places. This consolidates them into
      a single update closure for easy reading and maintenance.
      
      This also simplifies error handling in places where we can use a return
      parameter and defer to handle the error case for updates.
      
      Also, remove the old Layer field from the trackingReader struct.
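      The consolidation pattern, roughly (a sketch, not the actual trackingReader
      code; the counter semantics here are simplified):

        package main

        import (
            "fmt"
            "sync/atomic"
        )

        var completed, received atomic.Int64

        // update is the one place both counters change, replacing the 2-3
        // duplicated lines that previously appeared at every call site.
        func update(n int64) {
            received.Add(n)
            completed.Add(n)
        }

        // downloadChunk shows the return-parameter + defer error pattern: the
        // counters only advance when the chunk succeeded.
        func downloadChunk(size int64) (err error) {
            defer func() {
                if err == nil {
                    update(size)
                }
            }()
            // ... fetch and write the chunk ...
            return nil
        }

        func main() {
            _ = downloadChunk(512)
            _ = downloadChunk(1024)
            fmt.Println(completed.Load(), received.Load()) // 1536 1536
        }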
    • Give tests more time to run (#10306) · 56dc316a
      Daniel Hiltgen authored
      Fixes flaky test failures on Windows.
    • cmd: add retry/backoff (#10069) · 1e7f62cb
      Blake Mizerany authored
      This commit adds retry/backoff to the registry client for pull requests.
      
      Also, revert progress indication to match original client's until we can
      "get it right."
      
      Also, make WithTrace wrap existing traces instead of clobbering them.
      This allows clients to compose traces.
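      A generic retry-with-exponential-backoff sketch of the kind the commit adds
      for pulls; ollama's actual attempt counts and jitter differ:

        package main

        import (
            "context"
            "errors"
            "fmt"
            "math/rand"
            "time"
        )

        func retry(ctx context.Context, attempts int, op func() error) error {
            var err error
            for i := 0; i < attempts; i++ {
                if err = op(); err == nil {
                    return nil
                }
                // Exponential backoff with jitter: ~100ms, ~200ms, ~400ms, ...
                backoff := time.Duration(100<<i) * time.Millisecond
                backoff += time.Duration(rand.Int63n(int64(backoff)))
                select {
                case <-time.After(backoff):
                case <-ctx.Done():
                    return ctx.Err()
                }
            }
            return fmt.Errorf("after %d attempts: %w", attempts, err)
        }

        func main() {
            i := 0
            err := retry(context.Background(), 5, func() error {
                i++
                if i < 3 {
                    return errors.New("transient registry error")
                }
                return nil
            })
            fmt.Println(err, "attempts:", i) // <nil> attempts: 3
        }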
  13. 14 Apr, 2025 1 commit
  14. 10 Apr, 2025 1 commit
  15. 09 Apr, 2025 1 commit
  16. 08 Apr, 2025 1 commit
  17. 07 Apr, 2025 1 commit
  18. 03 Apr, 2025 1 commit
    • llm: set done reason at server level (#9830) · e53b3cbd
      Bruce MacDonald authored
      No functional change. Many different done reasons can be set at the runner
      level, so rather than obscuring them we should return them to the server
      process and let it choose what to do with the done reason. This separates
      the API concerns from the runner.
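      One way to picture the separation (illustrative types only, not the actual
      runner protocol): the runner reports why generation stopped and the server
      decides what callers see.

        package main

        import "fmt"

        // DoneReason as set at the runner level.
        type DoneReason string

        const (
            DoneStop       DoneReason = "stop"              // hit a stop token/sequence
            DoneLength     DoneReason = "length"            // hit the token limit
            DoneConnClosed DoneReason = "connection closed" // client went away
        )

        type runnerResponse struct {
            Done       bool
            DoneReason DoneReason
        }

        // apiDoneReason is server-level policy: the runner no longer decides what
        // to expose, it just reports the facts.
        func apiDoneReason(r runnerResponse) string {
            switch r.DoneReason {
            case DoneStop, DoneLength:
                return string(r.DoneReason)
            default:
                return "" // internal reasons aren't surfaced in the API
            }
        }

        func main() {
            fmt.Println(apiDoneReason(runnerResponse{Done: true, DoneReason: DoneLength}))
        }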
  19. 02 Apr, 2025 1 commit
  20. 01 Apr, 2025 1 commit
  21. 31 Mar, 2025 1 commit
    • server/internal/client/ollama: cache completed chunks (#9933) · ef27d52e
      Blake Mizerany authored
      This change adds tracking of download chunks during the pull process so
      that subsequent pulls can skip downloading already completed chunks.
      This works across restarts of ollama.
      
      Currently, download state will be lost if a prune is triggered during a
      pull (e.g. restart or remove). This issue should be addressed in a
      follow-up PR.
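      One simple shape such tracking can take (an assumption for illustration; the
      real client's on-disk format differs): persist completed chunk offsets beside
      the blob and consult them on the next pull.

        package main

        import (
            "encoding/json"
            "fmt"
            "os"
        )

        type chunkState struct {
            Completed map[int64]int64 `json:"completed"` // offset -> length
        }

        func load(path string) chunkState {
            st := chunkState{Completed: map[int64]int64{}}
            if b, err := os.ReadFile(path); err == nil {
                json.Unmarshal(b, &st)
            }
            return st
        }

        func (s chunkState) save(path string) error {
            b, _ := json.Marshal(s)
            return os.WriteFile(path, b, 0o644)
        }

        func main() {
            const state = "blob.chunks.json"
            st := load(state) // survives restarts of the process
            if _, done := st.Completed[0]; !done {
                // ... download and verify the chunk at offset 0 ...
                st.Completed[0] = 1 << 20
                st.save(state)
            }
            fmt.Println("chunks done:", len(st.Completed))
        }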
  22. 28 Mar, 2025 1 commit
  23. 26 Mar, 2025 1 commit
    • ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3
      Jesse Gross authored
      Gemma3 uses sliding windows for its context on 5/6 layers, significantly
      reducing memory usage but leading to uneven usage across layers,
      which makes allocation to the correct GPU difficult. We currently
      estimate very conservatively by assuming all layers are consistent
      at the max size.
      
      Llama3.2-vision is also inconsistent between self-attention and cross-attention
      layers: at the moment, we calculate the correct total size and then average it
      across layers. In some cases, this may lead to crashes if a large layer is
      placed on a GPU sized by the average.
      
      This allows memory estimation to calculate per-layer KV cache size and take it
      into account when placing layers onto GPUs. We already do this for weights that
      vary per-tensor, so this is a logical extension.
      
      Fixes #9730
      Fixes #9890
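      A sketch of the estimation change, with illustrative sizes (not ollama's real
      accounting): sum per-layer KV sizes instead of assuming every layer needs the
      maximum.

        package main

        import "fmt"

        // kvBytesPerToken stands in for bytes of K+V per token per layer.
        const kvBytesPerToken = 4096

        // layerKV returns the KV cache size for one layer: sliding-window layers
        // (5 of 6 in Gemma 3) only cache up to the window, not the full context.
        func layerKV(window, ctx uint64) uint64 {
            if window > 0 && window < ctx {
                return window * kvBytesPerToken
            }
            return ctx * kvBytesPerToken
        }

        func main() {
            const ctx = 8192
            windows := []uint64{1024, 1024, 1024, 1024, 1024, 0} // 0 = full attention

            var perLayer, maxBased uint64
            for _, w := range windows {
                perLayer += layerKV(w, ctx)
                maxBased += layerKV(0, ctx) // old estimate: all layers at max size
            }
            fmt.Printf("per-layer: %d MiB, max-based: %d MiB\n",
                perLayer>>20, maxBased>>20) // 52 MiB vs 192 MiB
        }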
  24. 21 Mar, 2025 2 commits
  25. 20 Mar, 2025 1 commit
  26. 19 Mar, 2025 1 commit
    • server/internal/client/ollama: confirm all chunksums were received (#9893) · 2ddacd75
      Blake Mizerany authored
      If the chunksums response is missing a chunk, the client should fail
      the download. This changes the client to check that all bytes are
      accounted for in the chunksums response.
      
      It is possible there are overlaps or gaps in the chunksums response and
      so the size is not the only thing left to check, but this provides
      enough coverage for now. We may want to check that chunks are contiguous
      later.
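      The check reduces to summing the advertised chunk sizes against the blob size
      (a minimal sketch; contiguity and overlap checks are, as noted, left for
      later):

        package main

        import (
            "errors"
            "fmt"
        )

        type chunk struct{ Offset, Size int64 }

        // verifyCoverage fails the download unless every byte of the blob is
        // accounted for by the chunksums response.
        func verifyCoverage(blobSize int64, chunks []chunk) error {
            var total int64
            for _, c := range chunks {
                total += c.Size
            }
            if total != blobSize {
                return errors.New("chunksums response does not cover the blob")
            }
            return nil
        }

        func main() {
            chunks := []chunk{{0, 1 << 20}, {1 << 20, 1 << 20}}
            fmt.Println(verifyCoverage(2<<20, chunks)) // <nil>
            fmt.Println(verifyCoverage(3<<20, chunks)) // error: a chunk is missing
        }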
  27. 15 Mar, 2025 1 commit
    • server/internal/client/ollama: set User-Agent for registry client (#9775) · 82946761
      Blake Mizerany authored
      This sets the agent header in DefaultRegistry to include the version of
      the client, OS, and architecture in the previous format, with a minor
      twist.
      
      Note: The version is obtained from the build info instead of version.Version,
      which should no longer be necessary; we can remove it in a future commit. Using
      the build info is more accurate and also provides extra build information when
      a build is untagged or "dirty". Previously, the version was just "0.0.0" with
      no other helpful information. The ollama.com registry and others handle this
      swimmingly.
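      Roughly, building a User-Agent from the build info looks like this (the exact
      format ollama sends may differ):

        package main

        import (
            "fmt"
            "runtime"
            "runtime/debug"
        )

        func userAgent() string {
            version := "0.0.0"
            if info, ok := debug.ReadBuildInfo(); ok && info.Main.Version != "" {
                // For untagged or dirty builds this carries a pseudo-version
                // instead of a bare "0.0.0".
                version = info.Main.Version
            }
            return fmt.Sprintf("ollama/%s (%s %s) Go/%s",
                version, runtime.GOARCH, runtime.GOOS, runtime.Version())
        }

        func main() {
            fmt.Println(userAgent())
        }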
  28. 14 Mar, 2025 3 commits
    • gemma3: Allow multiple images in a single input · 7bf793a6
      Jesse Gross authored
      Previously, processing multiple images in a batch would trigger segfaults, so
      sending images together was disabled to mitigate this. The trigger was
      processing one image on the CPU and one on the GPU.
      
      This can no longer happen:
       - The vision encoder is now on the GPU so both images would be
         processed on the GPU.
       - We require images to be fully contained in a batch and each
         image including its special tokens is over half the batch size.
         As a result, we will never get two images in the same batch.
      
      Fixes #9731
    • Blake Mizerany authored · 4e320b8b
    • server/internal/client: use chunksums for concurrent blob verification (#9746) · eb2b22b0
      Blake Mizerany authored
      Replace large-chunk blob downloads with parallel small-chunk
      verification to solve timeout and performance issues. Registry users
      experienced progressively slowing download speeds as large-chunk
      transfers aged, often timing out completely.
      
      The previous approach downloaded blobs in a few large chunks but
      required a separate, single-threaded pass to read the entire blob back
      from disk for verification after download completion.
      
      This change uses the new chunksums API to fetch many smaller
      chunk+digest pairs, allowing concurrent downloads and immediate
      verification as each chunk arrives. Chunks are written directly to their
      final positions, eliminating the entire separate verification pass.
      
      The result is more reliable downloads that maintain speed throughout the
      transfer process and significantly faster overall completion, especially
      over unstable connections or with large blobs.
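      A sketch of the approach (illustrative only; the real client fetches chunks
      over HTTP range requests and its error handling is more involved): download
      small chunks concurrently, verify each digest as it arrives, and write it
      straight to its final offset so no separate verification pass is needed.

        package main

        import (
            "bytes"
            "crypto/sha256"
            "encoding/hex"
            "fmt"
            "os"
            "sync"
        )

        type chunk struct {
            Offset int64
            Data   []byte // stands in for the HTTP range response body
            Digest string // expected sha256 from the chunksums API
        }

        func sha256hex(b []byte) string {
            sum := sha256.Sum256(b)
            return hex.EncodeToString(sum[:])
        }

        func main() {
            f, _ := os.CreateTemp("", "blob")
            defer os.Remove(f.Name())

            chunks := []chunk{
                {0, []byte("hello "), sha256hex([]byte("hello "))},
                {6, []byte("world"), sha256hex([]byte("world"))},
            }

            var wg sync.WaitGroup
            for _, c := range chunks {
                wg.Add(1)
                go func(c chunk) {
                    defer wg.Done()
                    // Verify immediately on arrival...
                    if sha256hex(c.Data) != c.Digest {
                        fmt.Println("bad chunk at", c.Offset)
                        return
                    }
                    // ...and write directly to the chunk's final position.
                    f.WriteAt(c.Data, c.Offset)
                }(c)
            }
            wg.Wait()

            got, _ := os.ReadFile(f.Name())
            fmt.Println(bytes.Equal(got, []byte("hello world"))) // true
        }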
  29. 13 Mar, 2025 2 commits
  30. 11 Mar, 2025 1 commit