- 18 May, 2025 1 commit
-
-
Ronald Wilson authored
This PR adds Tiny Notepad, a lightweight, notepad-like interface for chatting with local LLMs via Ollama. It is designed as a simple, distraction-free alternative. The app supports basic note-taking, timestamped logs, and model parameter controls. Built with Tkinter, it runs entirely offline and is available via PyPI. It aims to provide a lightweight interface for Ollama that is easy to install and run.
-
- 16 May, 2025 1 commit
-
-
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
-
- 15 May, 2025 7 commits
-
-
Daniel Hiltgen authored
-
Bruce MacDonald authored
When a piece of information has been truncated in the show output, display an ellipsis to indicate that not all of the data is shown.
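A minimal sketch of the behavior described, assuming a simple string truncation helper (the name and width here are invented, not the actual show-command code):

```go
package main

import "fmt"

// truncateWithEllipsis illustrates the change: when a value is cut off for
// display, append an ellipsis so the user knows more data exists than is shown.
func truncateWithEllipsis(s string, max int) string {
	if len(s) <= max {
		return s
	}
	return s[:max] + "..."
}

func main() {
	fmt.Println(truncateWithEllipsis("a very long system prompt that does not fit", 20))
	// prints: a very long system p...
}
```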
-
Jesse Gross authored
We currently preallocate compute graph memory for the worst-case batch of text tokens. This adds support for doing the same for images. Note that image models are more complicated than text models in how they process their inputs, so there may be cases where this approach isn't completely generic for all models. It covers all currently supported models, though.
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple batches, which can happen when the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this, then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph.

This change codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about it. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
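A rough Go sketch of the pattern described, with all type and method names invented for illustration (not the actual runner API): the multimodal tensor is computed lazily on first use and sliced into per-batch views, so the vision graph runs once and is no longer fused with the text graph.

```go
package runner

// Tensor stands in for a backend tensor; only what the sketch needs.
type Tensor interface {
	View(offset, size int) Tensor
}

// multimodalEntry caches an image embedding after its first computation so it
// can be sliced into views across multiple text batches.
type multimodalEntry struct {
	compute func() Tensor // runs the vision graph; provided by the model
	tensor  Tensor        // nil until first use
}

// views returns appropriately sized slices of the embedding for one batch.
// The vision graph is executed at most once, on first use, which avoids
// reusing the original source tensor across multiple graphs.
func (m *multimodalEntry) views(offsets, sizes []int) []Tensor {
	if m.tensor == nil {
		m.tensor = m.compute()
	}
	out := make([]Tensor, len(offsets))
	for i := range offsets {
		out[i] = m.tensor.View(offsets[i], sizes[i])
	}
	return out
}
```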
-
Jesse Gross authored
When we restore a sequence from the cache, we split the prompt into the already used tokens (stored in the cache) and new tokens that need to be processed. Currently, the references to the used tokens come from the stored previous sequence. However, even though we know that the used tokens are semantically equivalent to the prefix of the prompt, tokens can contain pointers which are no longer valid. As a result, it is better to get the used tokens from the prompt, whose pointers are currently valid. This doesn't have any impact yet because it isn't possible to reuse the pointers (which are tensors) anyway. However, it becomes an issue once we can.
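A sketch of the idea with invented names: when restoring a sequence, the already-processed inputs are taken from the prefix of the incoming prompt (whose pointers are valid) rather than from the previously stored sequence.

```go
package runner

// input stands in for the runner's per-token input; fields are illustrative.
type input struct {
	token      int32
	multimodal any // e.g. an image embedding tensor
}

// splitPrompt mirrors the described fix. numCached is how many leading tokens
// of the prompt are already in the cache; the reused tokens come from the
// prompt itself, not from the cached sequence with possibly stale pointers.
func splitPrompt(prompt []input, numCached int) (used, remaining []input) {
	return prompt[:numCached], prompt[numCached:]
}
```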
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
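A hedged sketch of the guard (names and layout are assumptions): padding only handles inputs with up to three dimensions, so a 4-D input fails loudly instead of producing silently wrong pixel values.

```go
package imageproc

import "fmt"

// pad is an illustrative stand-in for the pixel-value padding helper; shape
// holds one length per dimension.
func pad(data []float32, shape []int, padTo int) []float32 {
	if len(shape) >= 4 {
		panic(fmt.Sprintf("pad: %d-dimensional input not supported", len(shape)))
	}
	// ... padding of the trailing dimension would go here ...
	return data
}
```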
-
Michael Yang authored
cross-attention Q and K projections need to have their heads swapped, similar to the non-cross-attention Q and K tensors
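A sketch of the head repacking this refers to, under the assumption that the projection is stored row-major as [heads * headDim, cols] and each head's rows are regrouped from (2, headDim/2) to (headDim/2, 2), the same transform applied to the regular Q/K tensors during conversion; the exact layout here is a guess for illustration.

```go
package convert

// permuteQK reorders the rows of a Q or K projection so each head's rope
// layout matches what the runtime expects.
func permuteQK(w []float32, rows, cols, heads int) []float32 {
	headDim := rows / heads
	half := headDim / 2
	out := make([]float32, len(w))
	for h := 0; h < heads; h++ {
		for i := 0; i < half; i++ {
			for j := 0; j < 2; j++ {
				src := (h*headDim + j*half + i) * cols
				dst := (h*headDim + i*2 + j) * cols
				copy(out[dst:dst+cols], w[src:src+cols])
			}
		}
	}
	return out
}
```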
-
- 14 May, 2025 4 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
Older clients assumed the digest was at least 19 characters long, so increase the size of the dummy digest to avoid array out-of-bounds crashes.
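A minimal sketch of the fix, assuming the placeholder digest is simply padded to a safe minimum length (the helper name is invented):

```go
package server

// dummyDigest pads a placeholder digest so that older clients, which assume
// at least 19 characters, never slice past the end of the string.
func dummyDigest(minLen int) string {
	d := "sha256:"
	for len(d) < minLen {
		d += "0"
	}
	return d
}
```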
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 7 commits
-
-
tej authored
Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmail.com>
-
Parth Sareen authored
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 12 May, 2025 5 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.
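A sketch of the kind of guard described, with the accepted set and names assumed for illustration: reject source file types that quantization does not support before any work is done.

```go
package server

import "fmt"

// checkQuantizable rejects source tensor types that cannot be quantized.
func checkQuantizable(fileType string) error {
	switch fileType {
	case "F16", "F32", "BF16":
		return nil
	default:
		return fmt.Errorf("quantization from %s is not supported", fileType)
	}
}
```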
-
Jeffrey Morgan authored
-
Bruce MacDonald authored
When creating a quantized model from safetensors, we need the array KV values to be loaded. Changing this value to -1 loads the full array KV values on the returned layer so they can be used and saved during quantization.
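A sketch of the change, assuming a decoder whose maxArraySize parameter caps how many elements of array-valued KV entries are read (the function and type names here are stand-ins):

```go
package server

// ggmlLayer and decodeGGML are illustrative stand-ins for the real decoder.
type ggmlLayer struct {
	kv map[string]any
}

func decodeGGML(path string, maxArraySize int) (*ggmlLayer, error) {
	// ... real decoding elided; maxArraySize = -1 means "load arrays in full" ...
	return &ggmlLayer{kv: map[string]any{}}, nil
}

// loadForQuantize loads a layer with maxArraySize = -1 so array KV values
// (e.g. token lists) are fully present and can be written into the quantized model.
func loadForQuantize(path string) (*ggmlLayer, error) {
	return decodeGGML(path, -1)
}
```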
-
Michael Yang authored
reduce prompt log to trace level
-
- 11 May, 2025 2 commits
-
-
HardCodeDev authored
-
HardCodeDev authored
-
- 10 May, 2025 5 commits
-
-
frob authored
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Michael Yang authored
ml.Dump will preserve default values if not specified
-
AliAhmedNada authored
-
Bruce MacDonald authored
-
- 08 May, 2025 5 commits
-
-
Michael Yang authored
The stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")`, which isn't strictly correct since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
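A sketch of the fix's shape (types invented): drain the stream to completion rather than returning on the first success status.

```go
package client

// progress is a stand-in for api.ProgressResponse.
type progress struct {
	Status string
}

// accumulate keeps reading until the stream closes, since some requests
// (e.g. /api/create pulling a source model first) report "success" more than once.
func accumulate(ch <-chan progress) progress {
	var last progress
	for p := range ch {
		last = p
		// before: if p.Status == "success" { return p } // exited too early
	}
	return last
}
```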
-
Michael Yang authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
The correct constant to remove all entries to the end of the sequence for the Ollama engine is math.MaxInt32; -1 is used by the old engine. The impact of this is currently minimal because it would only occur in situations that are not supported by the implemented models, or with rarely used options.
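A sketch of the difference, with an assumed Remove signature:

```go
package runner

import "math"

// Cache is illustrative; Remove deletes cached entries for seq in [beg, end).
type Cache interface {
	Remove(seq int, beg, end int32) error
}

// clearToEnd removes everything from beg to the end of the sequence. The
// Ollama engine expects math.MaxInt32 as the "to the end" sentinel; -1 is
// only understood by the old engine.
func clearToEnd(c Cache, seq int, beg int32) error {
	return c.Remove(seq, beg, math.MaxInt32)
}
```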
-
- 07 May, 2025 3 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
If a model is loading and the request context is canceled during the load by a client closing the connection, and another request arrives for the same model with a different configuration (context size, etc.) that requires a reload, two unload events can be in flight. The first shuts down the original model load, but the second causes the loss of the new reloading runner reference, triggering the leak. The primary fix is to detect the duplicate unload and ignore the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.
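A hedged sketch of the two guards described, with invented scheduler types: a stale unload is ignored when its runner is no longer the registered one, and loading warns before replacing an already present runner.

```go
package server

import "log/slog"

// runnerRef stands in for the scheduler's handle to a loaded runner.
type runnerRef struct{ model string }

type scheduler struct {
	loaded map[string]*runnerRef
}

// unload ignores duplicate events: if the runner being unloaded is no longer
// the one registered for the model, a reload already replaced it, and
// unloading again would drop the new runner's reference.
func (s *scheduler) unload(model string, r *runnerRef) {
	if s.loaded[model] != r {
		slog.Debug("ignoring duplicate unload", "model", model)
		return
	}
	delete(s.loaded, model)
}

// load detects that it is about to clobber an existing runner, warns, and
// unloads it instead of silently leaking it.
func (s *scheduler) load(model string, r *runnerRef) {
	if old, ok := s.loaded[model]; ok {
		slog.Warn("replacing existing runner", "model", model)
		s.unload(model, old)
	}
	s.loaded[model] = r
}
```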
-
Jeffrey Morgan authored
-