- 15 May, 2025 3 commits
-
-
Jesse Gross authored
When we restore a sequence from the cache, we split the prompt into the already-used tokens (stored in the cache) and the new tokens that need to be processed. Currently, the references to the used tokens come from the stored previous sequence. However, even though the used tokens are semantically equivalent to the prefix of the prompt, tokens can contain pointers which are no longer valid. As a result, it is better to get the used tokens from the prompt, which has currently valid pointers. This doesn't have any impact yet because it isn't possible to reuse the pointers (which are tensors) anyway, but it becomes an issue once we can.
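A minimal sketch of the idea, with a hypothetical `input` type standing in for the engine's token structure:

```go
package cache

// input is a stand-in for the engine's token type, which may carry
// pointers (e.g. to multimodal tensors) alongside the token ID.
type input struct {
	token      int32
	multimodal any // hypothetical pointer-carrying field
}

// splitPrompt returns the already-cached prefix and the new suffix.
// Taking both halves from the live prompt, rather than reusing the
// stored previous sequence for the prefix, keeps every pointer valid.
func splitPrompt(prompt []input, cachedLen int) (used, remaining []input) {
	return prompt[:cachedLen], prompt[cachedLen:]
}
```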
-
Michael Yang authored
* panic if trying to pad 4d (guard sketched below)
* fix pixel values padding
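A hedged sketch of one plausible reading of the guard (padding along the fourth dimension is rejected outright); the real tensor API differs:

```go
package ml

import "fmt"

// padShape computes the padded shape, refusing a request to pad the
// 4th dimension instead of silently producing a wrong result.
// Shape handling here is illustrative only.
func padShape(shape, padding [4]int) [4]int {
	if padding[3] != 0 {
		panic(fmt.Sprintf("padding the 4th dimension is unsupported: %v", padding))
	}
	var out [4]int
	for i := range shape {
		out[i] = shape[i] + padding[i]
	}
	return out
}
```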
-
Michael Yang authored
The cross-attention Q and K projections need to have their heads swapped, similar to the non-cross-attention Q and K tensors.
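A hedged sketch of the sort of per-head permutation involved, assuming the interleaved rotary layout used by llama-style converters; the function name, layout, and row-major `[rows][cols]` float32 assumption are all illustrative:

```go
package convert

// permuteQK reorders the rows of a Q or K projection so the two rotary
// halves of each head are interleaved. The fix applies the same swap
// already used for self-attention Q/K to the cross-attention tensors.
func permuteQK(w []float32, rows, cols, nHeads int) []float32 {
	headDim := rows / nHeads
	half := headDim / 2
	out := make([]float32, len(w))
	row := func(buf []float32, r int) []float32 { return buf[r*cols : (r+1)*cols] }
	for h := 0; h < nHeads; h++ {
		base := h * headDim
		for i := 0; i < half; i++ {
			// rows i and i+half within a head become the
			// interleaved pair 2i and 2i+1 in the output.
			copy(row(out, base+2*i), row(w, base+i))
			copy(row(out, base+2*i+1), row(w, base+half+i))
		}
	}
	return out
}
```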
-
- 14 May, 2025 4 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
Older clients assumed the digest was at least 19 characters long, so increase the size of the dummy digest to avoid array out-of-bounds crashes.
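An illustrative sketch; the function name and digest format are assumptions, and the point is only that the placeholder comfortably exceeds the 19 characters older clients index into:

```go
package server

import "strings"

// placeholderDigest returns a dummy digest padded well past 19
// characters so old clients slicing into it cannot go out of bounds.
func placeholderDigest() string {
	return "sha256:" + strings.Repeat("0", 64)
}
```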
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 7 commits
-
-
tej authored
Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmailcom>
-
Parth Sareen authored
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
Bring back v11 until we can better warn users that their driver is too old. This reverts commit fa393554.
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Michael Yang authored
-
- 12 May, 2025 5 commits
-
-
Bruce MacDonald authored
-
Daniel Hiltgen authored
The quantization PR didn't block all unsupported file types; this PR fixes that. It also updates the API docs to reflect the now-reduced set of supported types.
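A hedged sketch of the kind of allowlist check implied; the set of types shown is illustrative, not the repo's actual list:

```go
package server

import "fmt"

// supportedSources lists tensor data types accepted as quantization
// input; illustrative values only.
var supportedSources = map[string]bool{
	"F16":  true,
	"F32":  true,
	"BF16": true,
}

// checkQuantizeSource rejects any file type outside the allowlist
// instead of letting unsupported inputs through.
func checkQuantizeSource(kind string) error {
	if !supportedSources[kind] {
		return fmt.Errorf("quantization from %s is not supported", kind)
	}
	return nil
}
```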
-
Jeffrey Morgan authored
-
Bruce MacDonald authored
When creating a quantized model from safetensors, we need the array KV values to be loaded. Changing this value to -1 loads the KV values on the returned layer so they can be used and saved during quantization.
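A sketch of the call pattern, with `decodeGGUF` as a stand-in for the repo's GGUF decoder:

```go
package quantize

import "io"

// decodeGGUF is a stand-in for the repo's GGUF decoder: maxArraySize
// caps how many elements of array-typed KV metadata are read, and -1
// means "load them all".
func decodeGGUF(rs io.ReadSeeker, maxArraySize int) (map[string]any, error) {
	panic("illustrative stand-in, not a real decoder")
}

// loadForQuantize decodes with maxArraySize = -1 so array KV values
// (e.g. tokenizer arrays) are fully present on the returned layer and
// can be written back out with the quantized model.
func loadForQuantize(rs io.ReadSeeker) (map[string]any, error) {
	return decodeGGUF(rs, -1)
}
```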
-
Michael Yang authored
reduce prompt log to trace level
-
- 11 May, 2025 2 commits
-
-
HardCodeDev authored
-
HardCodeDev authored
-
- 10 May, 2025 5 commits
-
-
frob authored
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Michael Yang authored
ml.Dump will preserve default values for options that are not specified
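A hedged sketch of the functional-options pattern this alludes to, with assumed option names; defaults are set first and only overridden by options the caller actually passes:

```go
package ml

// dumpOptions holds Dump's settings; the fields are illustrative.
type dumpOptions struct {
	Items     int
	Precision int
}

type DumpOption func(*dumpOptions)

func WithItems(n int) DumpOption     { return func(o *dumpOptions) { o.Items = n } }
func WithPrecision(n int) DumpOption { return func(o *dumpOptions) { o.Precision = n } }

// applyDumpOptions starts from defaults, so anything the caller does
// not specify keeps its default value.
func applyDumpOptions(opts ...DumpOption) dumpOptions {
	o := dumpOptions{Items: 3, Precision: 4} // assumed defaults
	for _, opt := range opts {
		opt(&o)
	}
	return o
}
```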
-
AliAhmedNada authored
-
Bruce MacDonald authored
-
- 08 May, 2025 5 commits
-
-
Michael Yang authored
The stream accumulator exits as soon as it sees `api.ProgressResponse(status="success")`, which isn't strictly correct since some requests may have multiple successes, e.g. `/api/create` when the source model needs to be pulled.
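One way to structure the fix, with a channel standing in for the response stream and a local `progress` type standing in for `api.ProgressResponse`:

```go
package main

import "fmt"

// progress is a stand-in for api.ProgressResponse.
type progress struct{ Status string }

// accumulate keeps reading until the stream itself closes, instead of
// returning at the first Status == "success"; /api/create can emit
// several success statuses (after an implicit pull, then after create).
func accumulate(ch <-chan progress) []progress {
	var all []progress
	for p := range ch {
		all = append(all, p)
	}
	return all
}

func main() {
	ch := make(chan progress, 3)
	ch <- progress{Status: "success"} // pull finished
	ch <- progress{Status: "creating"}
	ch <- progress{Status: "success"} // create finished
	close(ch)
	fmt.Println(len(accumulate(ch))) // 3: nothing dropped after the first success
}
```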
-
Michael Yang authored
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
For the Ollama engine, the correct constant to remove all entries to the end of the sequence is math.MaxInt32; -1 is the old engine's convention. The impact of this is currently minimal because it would only occur in situations that are not supported by the implemented models or with rarely used options.
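A sketch with a stand-in `Cache` interface (the engine's real interface may differ in detail):

```go
package engine

import "math"

// Cache is a stand-in for the engine's KV cache interface.
type Cache interface {
	// Remove drops cached entries for seq in [beginIndex, endIndex).
	Remove(seq int, beginIndex, endIndex int32) error
}

// truncateFrom removes everything from pos to the end of the sequence.
// The Ollama engine expects math.MaxInt32 as the "to the end" sentinel;
// -1 is the old engine's convention and is not valid here.
func truncateFrom(c Cache, seq int, pos int32) error {
	return c.Remove(seq, pos, math.MaxInt32)
}
```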
-
- 07 May, 2025 5 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
If a model is loading, and the request context is canceled during the load (by a client closing the connection), and another request is inbound for the same model with a different configuration (context size, etc.) that requires a reload, two unload events can be in flight. The first shuts down the original model load, but the second caused the loss of the new reloading runner reference, triggering the leak. The primary fix is detecting the duplicate unload and ignoring the second instance. The load routine is also hardened to detect clobbering an already-present runner and unload it with a warning.
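A much-simplified sketch of the two guards, with hypothetical scheduler types:

```go
package server

import "log/slog"

// runnerRef is a stand-in for the scheduler's loaded-runner handle.
type runnerRef struct{ unloaded bool }

func (r *runnerRef) unload() { r.unloaded = true }

// loaded maps model path -> active runner (illustrative; the real
// scheduler also synchronizes access).
var loaded = map[string]*runnerRef{}

// unloadModel ignores a duplicate unload so the second in-flight event
// cannot drop the reference to a runner that is already being reloaded.
func unloadModel(model string) {
	r, ok := loaded[model]
	if !ok || r.unloaded {
		slog.Warn("duplicate unload ignored", "model", model)
		return
	}
	r.unload()
	delete(loaded, model)
}

// registerRunner detects clobbering an already-present runner and
// unloads it with a warning instead of leaking it.
func registerRunner(model string, r *runnerRef) {
	if prev, ok := loaded[model]; ok {
		slog.Warn("clobbering existing runner", "model", model)
		prev.unload()
	}
	loaded[model] = r
}
```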
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
err in the goroutine should not be shared with the outer scope
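A minimal illustration of the bug class and the fix:

```go
package main

import (
	"errors"
	"fmt"
)

// Assigning to an err variable from the enclosing scope inside a
// goroutine races with the outer function's error handling. The fix is
// a goroutine-local err whose value is handed back explicitly.
func run(work func() error) error {
	done := make(chan error, 1)
	go func() {
		err := work() // local to the goroutine, not the outer scope
		done <- err
	}()
	return <-done // the outer scope receives the error explicitly
}

func main() {
	fmt.Println(run(func() error { return errors.New("boom") }))
}
```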
-
Daniel Hiltgen authored
This reduces the size of our Windows installer payloads by ~256M by dropping support for NVIDIA drivers older than Feb 2023. Hardware support is unchanged. Linux default bundle sizes are reduced by ~600M to 1G.
-
- 06 May, 2025 4 commits
-
-
Aharon Bensadoun authored
-
Devon Rifkin authored
Fixes: #5483
-
Michael Yang authored
-
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend (sketched below)
  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
  This is no longer needed now that quantization is implemented in Go+GGML code directly.
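A hedged sketch of the shape of the new flow; the tensor type, the selection rule, and the `ggmlQuantize` hook are all stand-ins, not the repo's actual API:

```go
package quantize

// tensor is a stand-in for a model tensor awaiting quantization.
type tensor struct {
	Name string
	Data []float32
}

// targetType is the model-aware part now living in Go: pick a
// quantization type per tensor. The rule shown is purely illustrative.
func targetType(name, defaultType string) string {
	if name == "output.weight" {
		return "Q6_K" // e.g. keep sensitive tensors at higher precision
	}
	return defaultType
}

// quantizeAll walks the tensors and hands each one to GGML's
// quantization routine, represented here by the ggmlQuantize callback.
func quantizeAll(ts []tensor, defaultType string, ggmlQuantize func([]float32, string) []byte) map[string][]byte {
	out := make(map[string][]byte, len(ts))
	for _, t := range ts {
		out[t.Name] = ggmlQuantize(t.Data, targetType(t.Name, defaultType))
	}
	return out
}
```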
-