1. 12 Dec, 2025 6 commits
    • Eva H · 95fdd8d6
    • docs: add docs for v1/responses and rework openai compat section (#13416) · 9f782285
      Devon Rifkin authored
      
      
      * docs: add docs for v1/responses and rework openai compat section
      
      I reworked the examples to be separated by topic and to be fully
      runnable (i.e., they now log output instead of just suggesting how a
      call might be made).
      
      We now use `<CodeGroup>`s so that each example has a dropdown on the
      docs site for users to choose, which makes the examples a lot more
      digestible (since you only see approx 1/3 of the code you used to).
      
      I also added a new tool to extract code examples into files so that it's
      easier to actually run them and check that they work.
      
      ## Example
      
      ```shell
      go run docs/tools/extract-examples/main.go docs/api/openai-compatibility.mdx
      ```
      
      Output:
      
      ```
      Extracting code examples to: /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
      
        - 01_basic.py
        - 01_basic.js
        - 01_basic.sh
        - 02_responses.py
        - 02_responses.js
        - 02_responses.sh
        - 03_vision.py
        - 03_vision.js
        - 03_vision.sh
      
      Extracted 9 file(s) to /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
      
      To run examples:
      
        cd /var/folders/vq/wfm2g6k917d3ldzpjdxc8ph00000gn/T/mdx-examples-3271754368
        npm install   # for JS examples
      
      then run individual files with `node file.js`, `python file.py`, `bash file.sh`
      ```
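
      For reference, a minimal sketch of how such an extractor might work: scan the
      MDX file for fenced code blocks and write each one out as a standalone file.
      The fence handling, language-to-extension mapping, and file naming below are
      simplified assumptions, not the behavior of the actual tool.

      ```go
      // Sketch: pull fenced code blocks out of an MDX file and write each one to a
      // standalone, runnable file. Fence parsing and file naming are simplified
      // assumptions, not the actual tool's behavior.
      package main

      import (
          "bufio"
          "fmt"
          "os"
          "path/filepath"
          "strings"
      )

      // fence is the triple-backtick marker, built with Repeat so this sketch can
      // itself sit inside a fenced code block.
      var fence = strings.Repeat("`", 3)

      // extensions maps a fence language tag to an output file extension.
      var extensions = map[string]string{"python": ".py", "javascript": ".js", "shell": ".sh"}

      func main() {
          f, err := os.Open(os.Args[1])
          if err != nil {
              panic(err)
          }
          defer f.Close()

          outDir, err := os.MkdirTemp("", "mdx-examples-")
          if err != nil {
              panic(err)
          }
          fmt.Println("Extracting code examples to:", outDir)

          var inBlock bool
          var lang string
          var body []string
          n := 0

          scanner := bufio.NewScanner(f)
          for scanner.Scan() {
              line := scanner.Text()
              switch {
              case !inBlock && strings.HasPrefix(line, fence):
                  // opening fence: remember the language tag and start collecting
                  inBlock, lang, body = true, strings.TrimPrefix(line, fence), nil
              case inBlock && strings.HasPrefix(line, fence):
                  // closing fence: write the collected block if we know the language
                  inBlock = false
                  if ext, ok := extensions[lang]; ok {
                      n++
                      name := fmt.Sprintf("%02d_example%s", n, ext)
                      os.WriteFile(filepath.Join(outDir, name), []byte(strings.Join(body, "\n")+"\n"), 0o644)
                      fmt.Println("  -", name)
                  }
              case inBlock:
                  body = append(body, line)
              }
          }
          fmt.Printf("Extracted %d file(s) to %s\n", n, outDir)
      }
      ```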
      
      In the future we should consider actually running the examples in CI and
      having some sort of acceptance test so we can automatically detect when
      our examples break. So this is just a start in that direction.
      
      * Update docs/api/openai-compatibility.mdx
      Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
      
      * Update docs/api/openai-compatibility.mdx
      Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
      
      ---------
      Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
    • openai: add tool call appending to previous assistant message (#13434) · 9b2035d1
      Parth Sareen authored
      * openai: add tool call appending to previous assistant message (see the sketch below)
      
      * add tests for thinking appending
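
      A minimal sketch of the merging idea: an assistant message that only carries
      tool calls is folded into the preceding assistant message instead of being sent
      as a separate turn. The `Message` and `ToolCall` types below are illustrative
      stand-ins, not the actual types in Ollama's openai package.

      ```go
      // Sketch: append an assistant message's tool calls onto the previous assistant
      // message. Types are illustrative, not Ollama's actual openai package types.
      package main

      import "fmt"

      type ToolCall struct{ Name string }

      type Message struct {
          Role      string
          Content   string
          ToolCalls []ToolCall
      }

      // mergeToolCalls folds tool-call-only assistant messages into the preceding
      // assistant message so the pair is relayed as a single turn.
      func mergeToolCalls(msgs []Message) []Message {
          var out []Message
          for _, m := range msgs {
              if len(out) > 0 && m.Role == "assistant" && m.Content == "" &&
                  len(m.ToolCalls) > 0 && out[len(out)-1].Role == "assistant" {
                  prev := &out[len(out)-1]
                  prev.ToolCalls = append(prev.ToolCalls, m.ToolCalls...)
                  continue
              }
              out = append(out, m)
          }
          return out
      }

      func main() {
          msgs := []Message{
              {Role: "assistant", Content: "Let me check the weather."},
              {Role: "assistant", ToolCalls: []ToolCall{{Name: "get_weather"}}},
          }
          fmt.Printf("%+v\n", mergeToolCalls(msgs))
      }
      ```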
    • docs: fix link to modelfile.mdx (#13220) · 93d45d7a
      Alexander Gusak authored
    • Update README.md (#13373) · 709f8424
      JJ authored
      Correct Markdown syntax for Swollama GitHub and DocC documentation links
    • Jeffrey Morgan · 2dfb7441
  2. 11 Dec, 2025 5 commits
  3. 10 Dec, 2025 6 commits
  4. 09 Dec, 2025 5 commits
  5. 08 Dec, 2025 5 commits
  6. 06 Dec, 2025 1 commit
  7. 05 Dec, 2025 1 commit
  8. 04 Dec, 2025 7 commits
    • Jesse Gross · 9191dfaf
    • ggml: Enable flash attention for vision encoders · 1108d8b3
      Jesse Gross authored
      Although the vision component of multimodal models typically already
      calls the optimized nn.Attention, it is converted into non-fused
      operations. That is because the backend-specific fused kernels may
      have requirements, such as padding, which is performed by the cache,
      and vision encoders don't use the cache.
      
      This implements a fallback path in the backend, softening the
      requirements into optimizations. In turn, this allows flash attention
      to be used for vision encoders, saving a significant amount of VRAM
      and improving performance.
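
      A rough sketch of the fallback idea, with illustrative names rather than the
      real backend API: the fused flash-attention kernel is used when its requirements
      are met, and inputs that don't meet them (such as vision encoders, which bypass
      the padded cache) fall back to the non-fused path instead of failing.

      ```go
      // Sketch: prefer the backend's fused flash-attention kernel when its
      // requirements are met, otherwise build the non-fused path. Types and method
      // names are illustrative only, not ggml's or Ollama's actual backend API.
      package main

      import "fmt"

      type Backend interface {
          FlashAttentionSupported(seqLen int) bool // e.g. checks padding/alignment
          FlashAttention(seqLen int) string        // fused kernel
          Unfused(seqLen int) string               // matmul + softmax + matmul
      }

      // attention treats the fused kernel's requirements as an optimization rather
      // than a hard constraint: unaligned inputs fall back instead of failing.
      func attention(b Backend, seqLen int) string {
          if b.FlashAttentionSupported(seqLen) {
              return b.FlashAttention(seqLen)
          }
          return b.Unfused(seqLen)
      }

      // fakeBackend requires sequence lengths padded to a multiple of 256.
      type fakeBackend struct{}

      func (fakeBackend) FlashAttentionSupported(seqLen int) bool { return seqLen%256 == 0 }
      func (fakeBackend) FlashAttention(seqLen int) string        { return "fused flash attention" }
      func (fakeBackend) Unfused(seqLen int) string               { return "unfused fallback" }

      func main() {
          b := fakeBackend{}
          for _, seqLen := range []int{4096, 577} { // 577 ≈ an unpadded ViT patch sequence
              fmt.Printf("seq len %4d -> %s\n", seqLen, attention(b, seqLen))
          }
      }
      ```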
    • ggml: Always set cache padding to 256 · 7837a5bc
      Jesse Gross authored
      We currently use cache padding of 32 without flash attention and 256
      with flash attention, values based on the historic alignment
      requirements of those kernels. The restrictions have since been
      loosened, but padding still has performance benefits, such as better
      CUDA graph reuse.
      
      Since the requirement is no longer kernel-specific, set the padding
      uniformly to 256, as llama.cpp has.
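
      The round-up itself is plain integer alignment; a small sketch, assuming a
      256-element padding constant:

      ```go
      // Sketch: round a KV cache allocation up to the uniform padding of 256,
      // the same round-up that previously used 32 without flash attention.
      package main

      import "fmt"

      const cachePadding = 256

      // padCache rounds n up to the next multiple of cachePadding.
      func padCache(n int) int {
          return (n + cachePadding - 1) / cachePadding * cachePadding
      }

      func main() {
          for _, n := range []int{100, 256, 1000} {
              fmt.Printf("%4d -> %d\n", n, padCache(n)) // 256, 256, 1024
          }
      }
      ```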
    • convert: add deepseek converter (#12980) · 0a844f8e
      Patrick Devine authored
      This change adds the ability for `ollama create` to convert models that use
      the DeepSeek2 architecture (specifically DeepSeekV3 and DeepSeek-R1).
    • cmd/bench: support writing benchmark output to file (#13263) · a03223b8
      Eloi Torrents authored
      
      
      * cmd/bench: support writing benchmark output to file
      
      This allows the bench command to write benchmark results to a
      user-specified output file instead of stdout when the --output flag is
      provided.
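
      The mechanics are the usual flag-driven choice of output writer; a minimal
      sketch of the pattern, not the actual cmd/bench code (the result line below is
      a placeholder):

      ```go
      // Sketch: write benchmark results to stdout, or to a file named by an
      // --output flag when one is given.
      package main

      import (
          "flag"
          "fmt"
          "io"
          "os"
      )

      func main() {
          output := flag.String("output", "", "write results to this file instead of stdout")
          flag.Parse()

          var w io.Writer = os.Stdout
          if *output != "" {
              f, err := os.Create(*output)
              if err != nil {
                  fmt.Fprintln(os.Stderr, err)
                  os.Exit(1)
              }
              defer f.Close()
              w = f
          }

          fmt.Fprintln(w, "model: llama3.2  tokens/s: 123.4") // placeholder result line
      }
      ```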
      
      ---------
      Co-authored-by: Patrick Devine <patrick@infrahq.com>
    • ggml update to b7108 (#12992) · 0cf7794b
      Daniel Hiltgen authored
      * Revert "vulkan: temporary cary of vulkan fixes (#12971)"
      
      This reverts commit 3a9e8e9f.
      
      * ggml update to b7087
      
      * fix argsort on metal
      
      * update to b7108
      
      * fix bakllava regression
      
      This model lacks the metadata for the projector type.
      
      * update to b7209
      
      * fix TopK perf
      
      * only build arm code on arm
    • Jeffrey Morgan · 854d40ed
  9. 03 Dec, 2025 2 commits
    • app: relay thinking false to server (#13319) · 84a2cedf
      Bruce MacDonald authored
      This fixes a bug where disabling thinking on deepseek-v3.1 did not stop the model from thinking.
      
      When thinking is not defined, it should not be sent to the server, since sending it causes error responses in some cases where the model does not support thinking. However, if it is explicitly defined as false, it should still be sent.
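
      In Go, this unset-versus-explicitly-false distinction is the usual
      pointer-with-omitempty pattern; a small sketch with an illustrative request
      struct (the field names are assumptions, not the app's actual payload):

      ```go
      // Sketch: relay a tri-state "thinking" option. Omit the field entirely when
      // it was never set, but send it when the user explicitly set it to false.
      package main

      import (
          "encoding/json"
          "fmt"
      )

      type ChatRequest struct {
          Model string `json:"model"`
          // A *bool distinguishes "not specified" (nil, field omitted) from an
          // explicit false (pointer to false, field sent).
          Think *bool `json:"think,omitempty"`
      }

      func main() {
          f := false
          for _, req := range []ChatRequest{
              {Model: "deepseek-v3.1"},            // unset: "think" is omitted
              {Model: "deepseek-v3.1", Think: &f}, // explicit false: "think":false is sent
          } {
              b, _ := json.Marshal(req)
              fmt.Println(string(b))
          }
      }
      ```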
    • CUDA: filter devices on secondary discovery (#13317) · 3f308367
      Daniel Hiltgen authored
      We now do a deeper probe of CUDA devices to verify that the library version has
      the correct compute capability coverage for the device. Because ROCm also
      interprets the CUDA env var to filter AMD devices, we try to avoid setting it,
      since doing so leads to problems in mixed-vendor systems. However, without
      setting it for this deeper probe, each CUDA library subprocess discovers all
      CUDA GPUs, and on systems with lots of GPUs this can lead to hitting timeouts.
      The fix is to set the CUDA visibility env var just for this deeper-probe use case.
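
      A minimal sketch of scoping the env var to just the probe subprocess, leaving
      the parent environment untouched; the probe binary path and flags here are
      placeholders:

      ```go
      // Sketch: make only one CUDA device visible to the deeper-probe subprocess,
      // rather than setting the env var process-wide where ROCm would also read it.
      package main

      import (
          "fmt"
          "os"
          "os/exec"
      )

      // probeDevice runs the probe for a single CUDA device; the parent process's
      // environment is left unchanged.
      func probeDevice(probePath, deviceID string) error {
          cmd := exec.Command(probePath, "--device", deviceID) // placeholder flags
          cmd.Env = append(os.Environ(), "CUDA_VISIBLE_DEVICES="+deviceID)
          cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
          return cmd.Run()
      }

      func main() {
          if err := probeDevice("./cuda-probe", "0"); err != nil {
              fmt.Fprintln(os.Stderr, "probe failed:", err)
          }
      }
      ```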
  10. 02 Dec, 2025 2 commits