Commits · 9f8a18ec050ef67fca11d4f9bea0508eece93a68 · OpenDAS / ollama

12 Jun, 2025 2 commits

tools: loosen tool parsing to allow for more formats (#11030) · 9f8a18ec

Jeffrey Morgan authored Jun 12, 2025

9f8a18ec

feat: incremental gguf parser (#10822) · 6b04cad7

Michael Yang authored Jun 12, 2025



* incremental gguf parser
* gguf: update test to not rely on gguf on disk
* re-use existing create gguf
* read capabilities from gguf kv
* kv exists
* update tests
* s/doneFunc/successFunc/g
* new buffered reader

---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

6b04cad7

11 Jun, 2025 3 commits

feat: uneven splits (#11048) · 45f56355

Michael Yang authored Jun 11, 2025

The current splitDim function only operates on tensors that are split evenly, which isn't always the case, e.g. a QKV tensor. This change allows the function to be used for arbitrary splits.
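
As a rough illustration of what an uneven split means, the sketch below cuts a flat slice into consecutive chunks of caller-supplied sizes. `splitUneven` is a hypothetical helper written for this example; it is not the `splitDim` function from the commit and makes no claim about its signature.

```go
package main

import "fmt"

// splitUneven cuts data into consecutive chunks whose lengths are given by
// sizes. Hypothetical helper for illustration only, not the real splitDim.
func splitUneven(data []float32, sizes ...int) ([][]float32, error) {
	total := 0
	for _, s := range sizes {
		if s < 0 {
			return nil, fmt.Errorf("negative split size %d", s)
		}
		total += s
	}
	if total != len(data) {
		return nil, fmt.Errorf("split sizes sum to %d, want %d", total, len(data))
	}

	parts := make([][]float32, 0, len(sizes))
	offset := 0
	for _, s := range sizes {
		// Each part is a view into data; no copying is done.
		parts = append(parts, data[offset:offset+s])
		offset += s
	}
	return parts, nil
}

func main() {
	// A toy fused QKV row: 4 query values, then 2 key and 2 value values.
	qkv := []float32{0, 1, 2, 3, 4, 5, 6, 7}
	parts, err := splitUneven(qkv, 4, 2, 2)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(parts[0]), len(parts[1]), len(parts[2]))
}
```

An even split is the special case where every entry in `sizes` is the same.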

45f56355

skip tokenizer.model if possible (#11050) · 0dabb4ef

Michael Yang authored Jun 11, 2025

If tokenizer.json is already copied, skip tokenizer.model.

0dabb4ef

use nn.Linear in place of ml.Tensor (#11049) · 2e77aa1a

Michael Yang authored Jun 11, 2025

While nn.Linear.Forward isn't applicable for sparse MLP, it's still
a nice container for the tensors.

2e77aa1a

10 Jun, 2025 3 commits
- readme: add ollama-multirun to community integrations (#11038) · deaabe29
  Attogram Project authored Jun 10, 2025
  
  deaabe29
- readme: update quickstart link text to Gemma 3 · af21a5ac
  Jeffrey Morgan authored Jun 10, 2025
  
  af21a5ac
- readme: update quickstart example to Gemma 3 · f63d7f68
  Jeffrey Morgan authored Jun 10, 2025
  
  f63d7f68
09 Jun, 2025 1 commit

mac: handle "keep" named apps (#11031) · 82ad1dbc

Daniel Hiltgen authored Jun 09, 2025

When a user elects to keep the existing app, the
new Ollama is named `Ollama 2.app`.
This fixes the app startup flow to handle this naming pattern.

82ad1dbc

08 Jun, 2025 1 commit
- spawn desktop quickly (#11011) · feeabdad
  Daniel Hiltgen authored Jun 08, 2025
  
  Give the desktop app a hint to start fast.
  
  feeabdad
07 Jun, 2025 2 commits
- docs: update link to AMD drivers in linux.md (#10973) · fc030961
  Krzysztof Jeziorny authored Jun 07, 2025
  
  fc030961
- Revert "server: add model capabilities to the list endpoint (#10174)" (#11004) · 09d308d6
  Jeffrey Morgan authored Jun 06, 2025
  
  This reverts commit 09430011.
  
  09d308d6
06 Jun, 2025 4 commits
- launch app hidden (#10962) · a8ed68bd
  Daniel Hiltgen authored Jun 06, 2025
  
  When starting the app in the background, start it hidden.
  
  a8ed68bd
- win: handle more than 2048 processes (#10997) · 2ae65ae4
  Daniel Hiltgen authored Jun 06, 2025
  
  Fix an array out of bounds crash.
  
  2ae65ae4
- move thinking logic into its own package (#10990) · a3b6886b
  Devon Rifkin authored Jun 06, 2025
  
  move thinking logic into its own package
  
  a3b6886b
- docs: fix typo in development.md (#10998) · c6a6d729
  Hunter Wittenborn authored Jun 06, 2025
  
  c6a6d729
05 Jun, 2025 2 commits
- Merge pull request #10987 from ollama/drifkin/export-thinking-parser · 2cf007c9
  Devon Rifkin authored Jun 05, 2025
  
  export ThinkingParser
  
  2cf007c9
- export ThinkingParser · 0683efa6
  Devon Rifkin authored Jun 05, 2025
  
  0683efa6
04 Jun, 2025 1 commit
- server: add model capabilities to the list endpoint (#10174) · 09430011
  JasonHonKL authored Jun 05, 2025
  
  09430011
31 May, 2025 1 commit
- readme: add SimpleOllamaUnity to community integrations (#10817) · 5c42800f
  HardCodeDev authored May 31, 2025
  
  5c42800f
30 May, 2025 1 commit
- tools: resiliency upgrade to name and arg extraction from template (#10917) · 65f10c28
  Parth Sareen authored May 30, 2025
  
  65f10c28
29 May, 2025 3 commits

ggml: Export GPU UUIDs · aaa78180

Jesse Gross authored Apr 24, 2025

This enables matching up devices and information reported by the backend
with system management libraries such as nvml to get accurate free
memory reporting.

aaa78180

llm: Make "POST predict" error message more informative · f15ffc43

Jesse Gross authored May 13, 2025

"POST predict" basically means that the runner has crashed, which
can have many reasons. However, many people think this is a specific
error and either report only this message or group together unrelated
bugs. This replaces it with a more friendly and helpful message.

f15ffc43

add thinking support to the api and cli (#10584) · 5f57b0ef

Devon Rifkin authored May 28, 2025

- Both `/api/generate` and `/api/chat` now accept a `"think"`
  option that allows specifying whether thinking mode should be on or
  not
- Templates get passed this new option so, e.g., qwen3's template can
  put `/think` or `/no_think` in the system prompt depending on the
  value of the setting
- Models' thinking support is inferred by inspecting model templates.
  The prefix and suffix the parser uses to identify thinking support are
  also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
  existing API consumers. If the `"think"` option is not specified, the
  behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
  in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
  or `--think=false` to control thinking, or during an interactive
  session they can use the commands `/set think` or `/set nothink`
- A `--hidethinking` option has also been added to the CLI. This makes
  it easy to use thinking in scripting scenarios like
  `ollama run qwen3 --think --hidethinking "my question here"` where you
  just want to see the answer but still want the benefits of thinking
  models
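
A minimal sketch of how a client might build a request body that opts in to thinking, based only on the description above. The `generateRequest` struct and `buildBody` helper are hypothetical names for this example, and only three fields are modelled; the real request types accept more. A pointer is used for `think` so that leaving it unset omits the field entirely, which corresponds to the unchanged legacy behavior.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// generateRequest models only the fields needed for this illustration.
// It is a hypothetical type, not the one defined by ollama.
type generateRequest struct {
	Model  string `json:"model"`
	Prompt string `json:"prompt"`
	// Think is a pointer so that nil (unspecified) is distinct from false.
	Think *bool `json:"think,omitempty"`
}

// buildBody returns the JSON body for a request to /api/generate.
func buildBody(model, prompt string, think *bool) (string, error) {
	b, err := json.Marshal(generateRequest{Model: model, Prompt: prompt, Think: think})
	return string(b), err
}

func main() {
	on := true

	// Explicit opt-in: the body carries "think": true.
	withThink, err := buildBody("qwen3", "my question here", &on)
	if err != nil {
		panic(err)
	}
	fmt.Println(withThink)

	// Option left unspecified: the field is omitted from the body.
	legacy, err := buildBody("qwen3", "my question here", nil)
	if err != nil {
		panic(err)
	}
	fmt.Println(legacy)
}
```

Passing a pointer to `false` still emits `"think":false`, since `omitempty` only drops a nil pointer.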

5f57b0ef

27 May, 2025 5 commits

client: add request signing to the client (#10881) · aa25aff1

Patrick Devine authored May 27, 2025

If OLLAMA_AUTH is set, sign each request with a timestamp and pass the signature in the token header.

aa25aff1

kvcache: Skip computing causal mask for worst case graph reservation · ea790031

Jesse Gross authored May 27, 2025

Computing an attention mask for a large context and max batch is
expensive - over 100ms. Models like Gemma3 that have multiple types
of caches and custom attention masks need to do this 4 times, so this
adds approximately 500ms to startup time when using 128k context.

When we are reserving the worst case graph, we don't need the mask,
only its shape, so we can skip this.
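
The idea can be sketched as follows, assuming a flat row-major mask and a batch that occupies the last positions of the context. `causalMask` and its `reserve` flag are hypothetical names for this example and do not reflect the kvcache package's actual API.

```go
package main

import (
	"fmt"
	"math"
)

// causalMask returns a batch x context mask in row-major order, where
// -Inf marks positions a token may not attend to. When reserve is true,
// only a buffer of the right shape is returned and the fill is skipped.
// Assumes batch <= context. Hypothetical helper for illustration only.
func causalMask(batch, context int, reserve bool) []float32 {
	mask := make([]float32, batch*context)
	if reserve {
		// Worst-case graph reservation: only the shape matters.
		return mask
	}

	// Assumption: the batch holds the last `batch` positions of the context.
	start := context - batch
	negInf := float32(math.Inf(-1))
	for i := 0; i < batch; i++ {
		// Token i sits at position start+i and may not see later positions.
		for j := start + i + 1; j < context; j++ {
			mask[i*context+j] = negInf
		}
	}
	return mask
}

func main() {
	full := causalMask(2, 4, false)
	reserved := causalMask(2, 4, true)
	fmt.Println(len(full) == len(reserved), full[3], reserved[3])
}
```

Both calls return a buffer of the same length, so anything sized from the mask's shape is unaffected by skipping the fill.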

ea790031

server: abort download on empty digest · 9239a254

Kyle Steere authored May 27, 2025

Signed-off-by: Kyle Steere <kyle.steere@chainguard.dev>

9239a254

tools: relax JSON parse constraints for tool calling (#10872) · 066d0f47

Parth Sareen authored May 26, 2025

066d0f47

tools: remove newline stripping (#10869) · aea6fb9b

Parth Sareen authored May 26, 2025

aea6fb9b

26 May, 2025 1 commit
- readme: add AWS Strands Agents SDK example to community integrations (#10865) · 012cf653
  RAPID ARCHITECT authored May 26, 2025
  
  012cf653
24 May, 2025 5 commits

readme: Add macLlama to community integrations (#10790) · a45231af

Min Yoo authored May 25, 2025

This commit updates the README to include macLlama within the community integrations section.

macLlama is a native macOS application built for lightweight and efficient LLM interaction.  Key features include:

*   **Lightweight & Native:** Designed to be resource-friendly and perform optimally on macOS.
*   **Chat-like Interface:** Provides a user-friendly, conversational interface.
*   **Multiple Window Support:** Allows users to manage multiple conversations simultaneously.

The primary goal of macLlama is to offer a simple and easy-to-run LLM experience on macOS.

a45231af

tests: drop llama3.2-vision embedding tests (#10837) · 2307fc2b

Daniel Hiltgen authored May 24, 2025

2307fc2b

docs: remove unsupported quantizations (#10842) · 66238981

frob authored May 24, 2025

66238981

server: add hint to the error message when model path access fails (#10843) · eda472df

frob authored May 24, 2025

eda472df

ml: Improve slog formatting for BackendMemory · f18e0cb5

Jesse Gross authored May 23, 2025

f18e0cb5

23 May, 2025 2 commits
- tools: refactor tool call parsing and enable streaming (#10415) · e8b981fa
  Parth Sareen authored May 23, 2025
  
  e8b981fa
- llama: add minimum memory for grammar (#10820) · 884d2609
  Parth Sareen authored May 22, 2025
  
  884d2609
22 May, 2025 3 commits

ml: Panic rather than return error on tensor allocation failure · 1f371ea9

Jesse Gross authored May 19, 2025

FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicking for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
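
A small sketch of the pattern, using a stand-in `tensor` type and a hypothetical `fromFloatSlice` that only mirrors the behavior described above; the real `FromFloatSlice` lives on a backend context and has a different signature.

```go
package main

import "fmt"

// tensor is a stand-in type for this example, not the ml.Tensor interface.
type tensor struct {
	shape []int
	data  []float32
}

// fromFloatSlice panics instead of returning an error when the shape does
// not match the data, so callers need no error handling of their own.
// Hypothetical analogue of FromFloatSlice, for illustration only.
func fromFloatSlice(data []float32, shape ...int) tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Sprintf("shape %v needs %d elements, got %d", shape, n, len(data)))
	}
	return tensor{shape: shape, data: data}
}

func main() {
	// The call site is a plain expression: no `if err != nil` branch.
	ten := fromFloatSlice([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	fmt.Println(ten.shape, len(ten.data))
}
```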

1f371ea9

ollamarunner: Memory usage reporting · 73d6a82c

Jesse Gross authored Apr 17, 2025

This provides granular information about the backend memory allocations
required by the runner:
 - Per backend
 - Per layer
 - Weights, cache and graph
 - Allocation status

This can be used for debugging and validating memory estimates.

73d6a82c

ggml: Report graph memory for failed allocations · 6db8a377

Jesse Gross authored May 16, 2025

GGML has a function to report the allocated size of a backend buffer.
However, this returns 0 if we tried to allocate a buffer and it failed.
For memory management purposes, it's important to know how much we were
trying to allocate. This extends the API to report attempted sizes for
all buffers and whether each allocation succeeded.
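
To illustrate why the attempted size is useful, the sketch below records both the requested size and the outcome for each buffer, so a caller can tell how much memory was missing. The `bufferStatus` type and the `allocate` and `unmet` helpers are hypothetical and simulate allocation against a fixed budget; they are not the GGML API.

```go
package main

import "fmt"

// bufferStatus records what was requested for one buffer and whether the
// request was satisfied. Hypothetical type for illustration only.
type bufferStatus struct {
	Name      string
	Size      uint64 // attempted size in bytes, kept even on failure
	Allocated bool
}

// allocate simulates an allocation against a budget of free bytes.
func allocate(name string, size uint64, free *uint64) bufferStatus {
	if size > *free {
		// Report the attempted size rather than 0 when the request fails.
		return bufferStatus{Name: name, Size: size, Allocated: false}
	}
	*free -= size
	return bufferStatus{Name: name, Size: size, Allocated: true}
}

// unmet sums the sizes of all allocations that failed.
func unmet(statuses []bufferStatus) uint64 {
	var total uint64
	for _, s := range statuses {
		if !s.Allocated {
			total += s.Size
		}
	}
	return total
}

func main() {
	free := uint64(1024)
	statuses := []bufferStatus{
		allocate("weights", 768, &free),
		allocate("graph", 512, &free),
	}
	fmt.Println(statuses[1].Allocated, unmet(statuses))
}
```

If the failed buffer reported a size of 0, `unmet` would have nothing to sum and the caller could not tell how far short the allocation fell.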

6db8a377