- 02 May, 2025 2 commits
-
Harsh Nevse authored
-
Jeffrey Morgan authored
-
- 01 May, 2025 6 commits
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Jesse Gross authored
In some cases, we can't find a cache slot when using sliding window attention. It would be helpful in this (and other) cases to know what the batch size is. Bug #10127
-
Jesse Gross authored
The context (and therefore its associated input tensors) was not being properly closed when images were being processed. We were trying to close them, but in reality we were closing over an empty list, preventing anything from actually being freed. Fixes #10434
-
AliAhmedNada authored
-
aritra saha authored
Update the list in the README
-
Michael Yang authored
* add gguf_test
* fix padding

  padding was being added to offset but not to the running count
-
- 30 Apr, 2025 4 commits
-
Devon Rifkin authored
* strip out thinking tags in message history for qwen3 & r1

  This is in advance of "proper" support where we'll make reasoning configurable and we'll parse out thinking/reasoning tags and provide them to the caller. These models expect there to be no thinking tags in the message history, so this should improve quality.

* parse model names instead of hacky prefix check
-
Daniel Hiltgen authored
* Adjust initial scheduler refCount

  Ensure we only set the refCount on success.

* sched: fix lock order inversion deadlock

  Under certain race conditions, the scheduler could deadlock while trying to update free space information at the same time a model was trying to unload.
-
Daniel Hiltgen authored
Users may have other incompatible GGML installs on their systems. This will prevent us from trying to load them from the path.
-
Shahin R authored
-
- 29 Apr, 2025 9 commits
-
batuhankadioglu authored
-
Daniel Hiltgen authored
The cleanup routine from InitServerconnection should run in the test case's defer to properly detect failures and report the server logs.
-
Jesse Gross authored
When we later have a large batch running purely on a CPU, this results in the error: GGML_ASSERT(talloc->buffer_id >= 0). Disabling this means that we will incrementally reallocate memory as the graph grows. Fixes #10410
-
crStiv authored
-
Devon Rifkin authored
-
Devon Rifkin authored
This is in part to "pay" for #10452, which doubled the default context length. The combination isn't fully neutral, though: even though the old 4x2k limit and the new 2x4k limit are memory-equivalent, the 1x fallback is larger with 4k.
-
Devon Rifkin authored
config: update default context length to 4096
-
Devon Rifkin authored
-
Devon Rifkin authored
Revert "increase default context length to 4096"
-
- 28 Apr, 2025 1 commit
-
Devon Rifkin authored
This reverts commit 424f6486.
-
- 26 Apr, 2025 1 commit
-
Michael Yang authored
-
- 25 Apr, 2025 17 commits
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Co-authored-by: Patrick Devine <patrick@infrahq.com>
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Using a default of 1024 when asking for zero is confusing, since most callers seem to assume 0 means do not read any data.
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
The first call to http.ResponseWriter.Write implicitly calls WriteHeader with http.StatusOK if it hasn't already been called. Once WriteHeader has been called, subsequent calls have no effect. Write is called when JSON-encoding progressUpdateJSON{}, so calling http.ResponseWriter.WriteHeader after the first encode is useless and produces a warning:

http: superfluous response.WriteHeader call from github.com/ollama/ollama/server/internal/registry.(*statusCodeRecorder).WriteHeader (server.go:77)

-
Michael Yang authored
-