- 18 May, 2025 1 commit
-
-
Ronald Wilson authored
This PR adds Tiny Notepad, a lightweight, notepad-like interface for chatting with local LLMs via Ollama. It is designed as a simple, distraction-free alternative. The app supports basic note-taking, timestamped logs, and model parameter controls. Built with Tkinter, it runs entirely offline and is available via PyPI. It aims to provide a lightweight interface for Ollama that is easy to install and run.
-
- 16 May, 2025 1 commit
-
-
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
-
- 15 May, 2025 7 commits
-
-
Daniel Hiltgen authored
-
Bruce MacDonald authored
When a piece of information has been truncated in the show output, display an ellipsis to indicate that not all of the data is shown.
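A minimal sketch of the behavior described, assuming a simple string truncation helper (the name and width here are invented, not the actual show-command code):

```go
package main

import "fmt"

// truncateWithEllipsis illustrates the change: when a value is cut off for
// display, append an ellipsis so the user knows more data exists than is shown.
func truncateWithEllipsis(s string, max int) string {
	if len(s) <= max {
		return s
	}
	return s[:max] + "..."
}

func main() {
	fmt.Println(truncateWithEllipsis("a very long system prompt that does not fit", 20))
	// prints: a very long system p...
}
```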
-
Jesse Gross authored
We currently preallocate compute graph memory for the worst-case batch of text tokens. This adds support for doing the same for images. Note that image models are more complicated than text models in how they process their inputs, so there may be cases where this approach isn't completely generic for all models. It covers all currently supported models, though.
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple batches, which can happen when the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this, then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph.

This change codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about it. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
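A rough Go sketch of the pattern described, with all type and method names invented for illustration (not the actual runner API): the multimodal tensor is computed lazily on first use and sliced into per-batch views, so the vision graph runs once and is no longer fused with the text graph.

```go
package runner

// Tensor stands in for a backend tensor; only what the sketch needs.
type Tensor interface {
	View(offset, size int) Tensor
}

// multimodalEntry caches an image embedding after its first computation so it
// can be sliced into views across multiple text batches.
type multimodalEntry struct {
	compute func() Tensor // runs the vision graph; provided by the model
	tensor  Tensor        // nil until first use
}

// views returns appropriately sized slices of the embedding for one batch.
// The vision graph is executed at most once, on first use, which avoids
// reusing the original source tensor across multiple graphs.
func (m *multimodalEntry) views(offsets, sizes []int) []Tensor {
	if m.tensor == nil {
		m.tensor = m.compute()
	}
	out := make([]Tensor, len(offsets))
	for i := range offsets {
		out[i] = m.tensor.View(offsets[i], sizes[i])
	}
	return out
}
```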
-
Jesse Gross authored
When we restore a sequence from the cache, we split the prompt into the already used tokens (stored in the cache) and new tokens that need to be processed. Currently, the references to the used tokens come from the stored previous sequence. However, even though we know that the used tokens are semantically equivalent to the prefix of the prompt, tokens can contain pointers which are no longer valid. As a result, it is better to get the used tokens from the prompt, whose pointers are currently valid. This doesn't have any impact yet because it isn't possible to reuse the pointers (which are tensors) anyway. However, it becomes an issue once we can.
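A sketch of the idea with invented names: when restoring a sequence, the already-processed inputs are taken from the prefix of the incoming prompt (whose pointers are valid) rather than from the previously stored sequence.

```go
package runner

// input stands in for the runner's per-token input; fields are illustrative.
type input struct {
	token      int32
	multimodal any // e.g. an image embedding tensor
}

// splitPrompt mirrors the described fix. numCached is how many leading tokens
// of the prompt are already in the cache; the reused tokens come from the
// prompt itself, not from the cached sequence with possibly stale pointers.
func splitPrompt(prompt []input, numCached int) (used, remaining []input) {
	return prompt[:numCached], prompt[numCached:]
}
```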
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
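A hedged sketch of the guard (names and layout are assumptions): padding only handles inputs with up to three dimensions, so a 4-D input fails loudly instead of producing silently wrong pixel values.

```go
package imageproc

import "fmt"

// pad is an illustrative stand-in for the pixel-value padding helper; shape
// holds one length per dimension.
func pad(data []float32, shape []int, padTo int) []float32 {
	if len(shape) >= 4 {
		panic(fmt.Sprintf("pad: %d-dimensional input not supported", len(shape)))
	}
	// ... padding of the trailing dimension would go here ...
	return data
}
```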
-
Michael Yang authored
cross-attention Q and K projections need to have their heads swapped, similar to the non-cross-attention Q and K tensors
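A sketch of the head repacking this refers to, under the assumption that the projection is stored row-major as [heads * headDim, cols] and each head's rows are regrouped from (2, headDim/2) to (headDim/2, 2), the same transform applied to the regular Q/K tensors during conversion; the exact layout here is a guess for illustration.

```go
package convert

// permuteQK reorders the rows of a Q or K projection so each head's rope
// layout matches what the runtime expects.
func permuteQK(w []float32, rows, cols, heads int) []float32 {
	headDim := rows / heads
	half := headDim / 2
	out := make([]float32, len(w))
	for h := 0; h < heads; h++ {
		for i := 0; i < half; i++ {
			for j := 0; j < 2; j++ {
				src := (h*headDim + j*half + i) * cols
				dst := (h*headDim + i*2 + j) * cols
				copy(out[dst:dst+cols], w[src:src+cols])
			}
		}
	}
	return out
}
```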
-
- 14 May, 2025 4 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
Older clients assumed the digest was at least 19 characters long, so increase the size of the dummy digest to avoid array out-of-bounds crashes.
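A minimal sketch of the fix, assuming the placeholder digest is simply padded to a safe minimum length (the helper name is invented):

```go
package server

// dummyDigest pads a placeholder digest so that older clients, which assume
// at least 19 characters, never slice past the end of the string.
func dummyDigest(minLen int) string {
	d := "sha256:"
	for len(d) < minLen {
		d += "0"
	}
	return d
}
```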
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 7 commits
-
-
tej authored
Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmail.com>
-
Parth Sareen authored
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 12 May, 2025 5 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now reduced set of supported types.
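A sketch of the kind of guard described, with the accepted set and names assumed for illustration: reject source file types that quantization does not support before any work is done.

```go
package server

import "fmt"

// checkQuantizable rejects source tensor types that cannot be quantized.
func checkQuantizable(fileType string) error {
	switch fileType {
	case "F16", "F32", "BF16":
		return nil
	default:
		return fmt.Errorf("quantization from %s is not supported", fileType)
	}
}
```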
-
Jeffrey Morgan authored
-
Bruce MacDonald authored
When creating a quantized model from safetensors, we need the array KV values to be loaded. Changing this value to -1 loads the full array KV values on the returned layer so they can be used and saved during quantization.
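A sketch of the change, assuming a decoder whose maxArraySize parameter caps how many elements of array-valued KV entries are read (the function and type names here are stand-ins):

```go
package server

// ggmlLayer and decodeGGML are illustrative stand-ins for the real decoder.
type ggmlLayer struct {
	kv map[string]any
}

func decodeGGML(path string, maxArraySize int) (*ggmlLayer, error) {
	// ... real decoding elided; maxArraySize = -1 means "load arrays in full" ...
	return &ggmlLayer{kv: map[string]any{}}, nil
}

// loadForQuantize loads a layer with maxArraySize = -1 so array KV values
// (e.g. token lists) are fully present and can be written into the quantized model.
func loadForQuantize(path string) (*ggmlLayer, error) {
	return decodeGGML(path, -1)
}
```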
-
Michael Yang authored
reduce prompt log to trace level
-
- 11 May, 2025 2 commits
-
-
HardCodeDev authored
-
HardCodeDev authored
-
- 10 May, 2025 5 commits
-
-
frob authored
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Michael Yang authored
ml.Dump will preserve default values if not specified
-
AliAhmedNada authored
-
Bruce MacDonald authored
-
- 08 May, 2025 5 commits
-
-
Michael Yang authored
The stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")`, which isn't strictly correct since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
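A sketch of the fix's shape (types invented): drain the stream to completion rather than returning on the first success status.

```go
package client

// progress is a stand-in for api.ProgressResponse.
type progress struct {
	Status string
}

// accumulate keeps reading until the stream closes, since some requests
// (e.g. /api/create pulling a source model first) report "success" more than once.
func accumulate(ch <-chan progress) progress {
	var last progress
	for p := range ch {
		last = p
		// before: if p.Status == "success" { return p } // exited too early
	}
	return last
}
```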
-
Michael Yang authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
The correct constant to remove all entries to the end of the sequence for the Ollama engine is math.MaxInt32; -1 is used by the old engine. The impact of this is currently minimal because it would only occur in situations that are not supported by the implemented models, or with rarely used options.
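A sketch of the difference, with an assumed Remove signature:

```go
package runner

import "math"

// Cache is illustrative; Remove deletes cached entries for seq in [beg, end).
type Cache interface {
	Remove(seq int, beg, end int32) error
}

// clearToEnd removes everything from beg to the end of the sequence. The
// Ollama engine expects math.MaxInt32 as the "to the end" sentinel; -1 is
// only understood by the old engine.
func clearToEnd(c Cache, seq int, beg int32) error {
	return c.Remove(seq, beg, math.MaxInt32)
}
```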
-
- 07 May, 2025 3 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
If a model is loading and the request context is canceled during the load by a client closing the connection, and another request arrives for the same model with a different configuration (context size, etc.) that requires a reload, two unload events can be in flight. The first shuts down the original model load, but the second causes the loss of the new reloading runner reference, triggering the leak. The primary fix is to detect the duplicate unload and ignore the second instance. The load routine is also hardened to ensure we detect clobbering an already present runner and unload it with a warning.
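A hedged sketch of the two guards described, with invented scheduler types: a stale unload is ignored when its runner is no longer the registered one, and loading warns before replacing an already present runner.

```go
package server

import "log/slog"

// runnerRef stands in for the scheduler's handle to a loaded runner.
type runnerRef struct{ model string }

type scheduler struct {
	loaded map[string]*runnerRef
}

// unload ignores duplicate events: if the runner being unloaded is no longer
// the one registered for the model, a reload already replaced it, and
// unloading again would drop the new runner's reference.
func (s *scheduler) unload(model string, r *runnerRef) {
	if s.loaded[model] != r {
		slog.Debug("ignoring duplicate unload", "model", model)
		return
	}
	delete(s.loaded, model)
}

// load detects that it is about to clobber an existing runner, warns, and
// unloads it instead of silently leaking it.
func (s *scheduler) load(model string, r *runnerRef) {
	if old, ok := s.loaded[model]; ok {
		slog.Warn("replacing existing runner", "model", model)
		s.unload(model, old)
	}
	s.loaded[model] = r
}
```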
-
Jeffrey Morgan authored
-