Commits · 1f371ea92f7ebe4edd208b6732753473b2c4d0cd · OpenDAS / ollama

22 May, 2025 2 commits

ml: Panic rather than return error on tensor allocation failure · 1f371ea9

Jesse Gross authored May 19, 2025

FromFloatSlice and FromIntSlice return an error if the shape doesn't
match the passed data or if memory can't be allocated. Since these
are inputs, the memory being allocated is system memory rather than VRAM.

In many cases, the caller can't really handle the error and panics.

Empty and Zeros directly panic if they can't allocate memory.

This makes things consistent by panicking for the first two cases,
removing a fair amount of error handling code. This is also consistent
with how Go typically handles these situations.
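
A minimal sketch of the panic-on-mismatch style described above. The
tensor type and the fromFloatSlice helper are hypothetical stand-ins,
not the actual ml package API:

```go
package main

import "fmt"

// tensor is a hypothetical stand-in for a backend tensor.
type tensor struct {
	data  []float32
	shape []int
}

// fromFloatSlice is a hypothetical constructor that panics, rather than
// returning an error, when the shape does not match the data.
func fromFloatSlice(data []float32, shape ...int) *tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Sprintf("invalid shape %v for %d elements", shape, len(data)))
	}
	return &tensor{data: data, shape: shape}
}

func main() {
	// There is no error value to check at the call site.
	ten := fromFloatSlice([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	fmt.Println(ten.shape) // [2 3]
}
```

Callers that previously checked an error and then panicked anyway can
drop that branch under this style.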

1f371ea9

ollamarunner: Memory usage reporting · 73d6a82c

Jesse Gross authored Apr 17, 2025

This provides granular information about the backend memory allocations
required by the runner:
 - Per backend
 - Per layer
 - Weights, cache and graph
 - Allocation status

This can be used for debugging and validating memory estimates.

73d6a82c

19 May, 2025 1 commit

ggml: Separate tensor load from backend creation · 94ab428e

Jesse Gross authored Apr 17, 2025

Currently, when the backend is created, the tensors are loaded at the
same time, which is a slow operation. This separates them to be two
steps:
 - Create backend, including enumerating tensors and memory allocation
 - Loading tensor data

This allows more flexibility in managing model loading.
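
The two steps could be sketched roughly as follows. The backend type,
newBackend, and load are hypothetical names chosen for illustration and
do not reflect the real ggml backend interface:

```go
package main

import "fmt"

// backend is a hypothetical model backend. Creation only enumerates tensors
// and allocates buffers; tensor data is filled in by a later load step.
type backend struct {
	buffers map[string][]byte
	loaded  bool
}

// newBackend performs step one: enumerate tensors and allocate memory.
func newBackend(sizes map[string]int) *backend {
	b := &backend{buffers: make(map[string][]byte, len(sizes))}
	for name, n := range sizes {
		b.buffers[name] = make([]byte, n)
	}
	return b
}

// load performs step two: copy tensor data into the preallocated buffers.
// read is a stand-in for reading one tensor from the model file.
func (b *backend) load(read func(name string, dst []byte) error) error {
	for name, buf := range b.buffers {
		if err := read(name, buf); err != nil {
			return fmt.Errorf("load %s: %w", name, err)
		}
	}
	b.loaded = true
	return nil
}

func main() {
	b := newBackend(map[string]int{"tok_embd": 8, "output": 4})
	fmt.Println(len(b.buffers), b.loaded) // 2 false

	err := b.load(func(_ string, dst []byte) error {
		for i := range dst {
			dst[i] = 1
		}
		return nil
	})
	fmt.Println(err, b.loaded) // <nil> true
}
```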

94ab428e

15 May, 2025 3 commits

ollamarunner: Multi-modal worst case graph · fe623c2c

Jesse Gross authored Apr 07, 2025

We currently preallocate compute graph memory for the worst case
batch of text tokens. This adds support for doing the same for
images.

Note that image models are more complicated than text models in
how they process their inputs so there may be cases where this
approach isn't completely generic for all models. It covers all
currently supported models though.

fe623c2c

ollamarunner: Separate text and multimodal graphs · 3c14461d

Jesse Gross authored May 05, 2025

For some multimodal models (such as gemma3), we create a single
graph that generates the image embedding and then use this in the
text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple
batches. This can arise if the embedding is larger than the batch size.
In these cases (as with llama4), we would like to create views that
are more appropriately sized. However, if we do this then the original
source tensor is used in multiple graphs, which isn't allowed. To
avoid that problem, models with this pattern compute the embedding
tensor on first use and recreate the individual views. There is no
longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The
logic of computing tensors on demand is moved to the runner, so models
no longer have to worry about this. It also gives the runner visibility
into the multimodal tensors, which is important for memory management.

3c14461d

ollamarunner: Base cached tokens on current prompt · 499ae731

Jesse Gross authored May 09, 2025

When we restore a sequence from the cache, we split the prompt into
the already used tokens (stored in the cache) and new tokens that
need to be processed. Currently, the references to the used tokens
are coming from the stored previous sequence.

However, even though we know that the used tokens are semantically
equivalent to the prefix of the prompt, tokens can contain pointers
which are no longer valid. As a result, it is better to get the
used tokens from the prompt, which has currently valid pointers.

This doesn't currently have any impact because it isn't possible
to reuse the pointers (which are tensors) anyways. However, it
becomes an issue once we can.
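
A small sketch of the idea, assuming a simplified input type whose
Multimodal pointer stands in for a tensor reference. The type and
function names are hypothetical:

```go
package main

import "fmt"

// input is a hypothetical token; Multimodal stands in for a pointer-like
// payload (such as a tensor) that may go stale between requests.
type input struct {
	Token      int32
	Multimodal *string
}

// commonPrefix counts leading inputs with equal token values.
func commonPrefix(cached, prompt []input) int {
	n := 0
	for n < len(cached) && n < len(prompt) && cached[n].Token == prompt[n].Token {
		n++
	}
	return n
}

// splitPrompt returns the already-cached prefix and the remaining inputs,
// both sliced from the current prompt rather than from the stored sequence.
func splitPrompt(cached, prompt []input) (used, pending []input) {
	n := commonPrefix(cached, prompt)
	return prompt[:n], prompt[n:]
}

func main() {
	stale, fresh := "stale", "fresh"
	cached := []input{{Token: 1, Multimodal: &stale}, {Token: 2}}
	prompt := []input{{Token: 1, Multimodal: &fresh}, {Token: 2}, {Token: 3}}

	used, pending := splitPrompt(cached, prompt)
	fmt.Println(len(used), len(pending), *used[0].Multimodal) // 2 1 fresh
}
```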

499ae731

14 May, 2025 1 commit

chore: update mllama to use ollama engine (#10637) · 23125648

Michael Yang authored May 13, 2025

23125648

12 May, 2025 1 commit

feat: add trace log level (#10650) · f95a1f2b

Michael Yang authored May 12, 2025

reduce prompt log to trace level

f95a1f2b

08 May, 2025 2 commits

api: remove unused sampling parameters (#10581) · fa9973cd

Jeffrey Morgan authored May 08, 2025

fa9973cd

ollamarunner: Use correct constant to remove cache entries · 3d9498a4

Jesse Gross authored May 07, 2025

The correct constant to remove all entries to the end of the sequence
for the Ollama engine is math.MaxInt32. -1 is used by the old engine.

The impact of this is currently minimal because it would only occur
in situations that are not supported by the implemented models or
rarely used options.
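
To illustrate why the sentinel matters, here is a sketch of a
half-open range removal. removeRange is a hypothetical helper, not the
real cache method; it only shows that -1 has no "to the end" meaning
under this convention:

```go
package main

import (
	"fmt"
	"math"
)

// removeRange is a hypothetical helper that deletes cache positions in
// [begin, end). Passing math.MaxInt32 as end means "through the end of the
// sequence"; this sketch gives -1 no special meaning.
func removeRange(positions []int32, begin, end int32) []int32 {
	kept := positions[:0:0] // zero-capacity slice so the input is left untouched
	for _, p := range positions {
		if p < begin || p >= end {
			kept = append(kept, p)
		}
	}
	return kept
}

func main() {
	cache := []int32{0, 1, 2, 3, 4, 5}
	fmt.Println(removeRange(cache, 3, math.MaxInt32)) // [0 1 2]
	fmt.Println(removeRange(cache, 3, -1))            // [0 1 2 3 4 5]
}
```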

3d9498a4

05 May, 2025 1 commit

api: remove unused or unsupported api options (#10574) · 3b2d2c83

Jeffrey Morgan authored May 05, 2025

Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options

3b2d2c83

02 May, 2025 1 commit

ollamarunner: Re-enable worst case graph preallocation. · c2f5d666

Jesse Gross authored May 02, 2025

Worst case graph preallocation was disabled by a27462b7
"ollamarunner: Temporarily disable worst case graph preallocation"
since it caused crashes with large batches when not using the GPU.

This backports upstream llama.cpp commit f057808
"ggml: Don't assert fail when tensor data changes (#13222)", which
fixes the underlying bug and allows reverting the previous workaround.

c2f5d666

01 May, 2025 1 commit

ollamarunner: Fix memory leak when processing images · 8e8f2c6d

Jesse Gross authored May 01, 2025

The context (and therefore associated input tensors) was not being
properly closed when images were being processed. We were trying to
close them but in reality we were closing over an empty list, preventing
anything from actually being freed.
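
One way this class of bug can arise in Go is sketched below. This is
an illustration under assumptions, not the actual runner code: Go
evaluates the arguments of a deferred call when the defer statement
executes, so a slice that is still empty at that point stays empty for
the deferred call. The resource type and function names are
hypothetical:

```go
package main

import "fmt"

// resource is a hypothetical stand-in for a context holding input tensors.
type resource struct{ closed bool }

func closeAll(rs []*resource) {
	for _, r := range rs {
		r.closed = true
	}
}

// leaky defers closeAll with the slice as an argument. The argument is
// evaluated at the defer statement, so the still-empty slice is captured.
func leaky() []*resource {
	var rs []*resource
	defer closeAll(rs)
	rs = append(rs, &resource{}, &resource{})
	return rs
}

// fixed defers a closure, which reads rs when the function returns.
func fixed() []*resource {
	var rs []*resource
	defer func() { closeAll(rs) }()
	rs = append(rs, &resource{}, &resource{})
	return rs
}

func main() {
	fmt.Println(leaky()[0].closed, fixed()[0].closed) // false true
}
```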

Fixes #10434

8e8f2c6d

29 Apr, 2025 1 commit

ollamarunner: Temporarily disable worst case graph preallocation · a27462b7

Jesse Gross authored Apr 29, 2025

When we later have a large batch running purely on a CPU, this
results in the error:
GGML_ASSERT(talloc->buffer_id >= 0)

Disabling this means that we will incrementally reallocate memory
as the graph grows.

Fixes #10410

a27462b7

24 Apr, 2025 1 commit

llama: remove model loading for grammar (#10096) · a53d744b

Parth Sareen authored Apr 24, 2025

a53d744b

08 Apr, 2025 1 commit

ollamarunner: Preallocate worst case graph at startup · dbb149e6

Jesse Gross authored Apr 03, 2025

Currently, the KV cache and graph are lazily allocated as needed.
The cache is fully allocated on first use of the corresponding
layer whereas the graph grows with the size of the context.

This can be an issue if another application allocates more VRAM
after we do our calculations - Ollama will crash in the middle of
inference. If we instead allocate the maximum needed memory at
startup of the runner, we will either succeed or fail at that point
rather than at some surprising time in the future.

Currently, this only generates a worst case batch for text, which
means that vision models may get a partial allocation and continue
to lazily allocate the rest.

dbb149e6

03 Apr, 2025 1 commit

llm: set done reason at server level (#9830) · e53b3cbd

Bruce MacDonald authored Apr 03, 2025

No functional change. Many different done reasons can be set at the runner
level, so rather than obscuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.

e53b3cbd

02 Apr, 2025 2 commits

kvcache: Add check for values that fall out of sliding window cache · b4297006

jmorganca authored Mar 30, 2025

The sliding window cache trims entries that are outside the window for
the latest token. This works when we are extending the cache, such as
when the conversation continues. However, if we have a partial overlap
in conversation (including the BOS tokens), then we resume from a past
point in the conversation and the needed tokens are no longer stored
in memory. This verifies that the new window overlaps with the old one
before reusing the cache.

Co-authored-by: Jesse Gross <jesse@ollama.com>
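
A simplified, conservative version of such a check is sketched below.
It assumes the cache keeps only the last `window` positions and that a
resumed position needs the full window before it. The function names
are hypothetical and the real check may be looser:

```go
package main

import "fmt"

// windowStart returns the lowest position kept for a sequence of the given
// length when only the last `window` positions are retained.
func windowStart(length, window int) int {
	if length > window {
		return length - window
	}
	return 0
}

// canReuse reports whether resuming at prefix length n is safe when the cache
// last held cachedLen inputs: every position the resumed window needs must
// still be stored.
func canReuse(cachedLen, n, window int) bool {
	if n > cachedLen {
		return false
	}
	return windowStart(n, window) >= windowStart(cachedLen, window)
}

func main() {
	fmt.Println(canReuse(100, 100, 32)) // true: conversation simply continues
	fmt.Println(canReuse(100, 40, 32))  // false: positions 8..39 were trimmed
	fmt.Println(canReuse(20, 10, 32))   // true: nothing has been trimmed yet
}
```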

b4297006

ollamarunner: Don't truncate a SameBatch · 493385eb

Jesse Gross authored Apr 01, 2025

When truncating inputs to the context window at the beginning of
a sequence, we remove the minimum amount possible. However, this
may cause us to truncate to the middle of a set of inputs that
the model specified should not be split up. To avoid this, we
need to remove the rest of the partial batch.
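
A sketch of the adjusted truncation, under the assumption that a
SameBatch value on an input means that many following inputs must stay
in the same batch as it. The input type and truncate function are
hypothetical simplifications:

```go
package main

import "fmt"

// input is a hypothetical model input. SameBatch on an input is assumed to
// mean "this many following inputs must stay in the same batch as this one".
type input struct {
	Token     int32
	SameBatch int
}

// truncate drops at least discard inputs from the front. If the cut would
// land inside a SameBatch group, the cut moves to the end of that group.
func truncate(inputs []input, discard int) []input {
	cut := discard
	if cut < 0 {
		cut = 0
	}
	for i := 0; i < cut && i < len(inputs); i++ {
		if end := i + inputs[i].SameBatch + 1; end > cut {
			cut = end
		}
	}
	if cut > len(inputs) {
		cut = len(inputs)
	}
	return inputs[cut:]
}

func main() {
	ins := []input{
		{Token: 1}, {Token: 2},
		{Token: 3, SameBatch: 2}, {Token: 4}, {Token: 5}, // one group
		{Token: 6},
	}
	// Discarding 3 would split the group, so the whole group is dropped.
	fmt.Println(len(truncate(ins, 3))) // 1
	// Discarding 2 lands on a group boundary, so nothing extra is dropped.
	fmt.Println(len(truncate(ins, 2))) // 4
}
```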

493385eb

31 Mar, 2025 3 commits

runner: clear cache when shift is not possible (#9433) · 66b25392

Bruce MacDonald authored Mar 31, 2025

Clear KV cache when shift operation is not supported by model.
Added KvCacheCanShift() check to handle models that can't perform cache shifts,
falling back to full cache clear while preserving logical token history to
maintain expected behavior when context window fills up.

66b25392

runner: Release semaphore and improve error messages on failures · b2a46529

Jesse Gross authored Mar 14, 2025

If we have an error after creating a new sequence but before
finding a slot for it, we return without releasing the semaphore.
This reduces our parallel sequences and eventually leads to deadlock.

In practice this should never happen because once we have acquired
the semaphore, we should always be able to find a slot. However, the
code is clearly not correct.
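
The corrected pattern can be sketched with a buffered channel acting
as a counting semaphore. The server type and admit method are
hypothetical, and the example deliberately builds the "should never
happen" state (capacity available but no free slot) to exercise the
error path:

```go
package main

import (
	"errors"
	"fmt"
)

// server is a hypothetical runner with a fixed number of parallel slots,
// guarded by a buffered channel acting as a counting semaphore.
type server struct {
	sem   chan struct{}
	slots []bool // true when a slot is occupied
}

var errNoSlot = errors.New("no free slot")

// admit acquires the semaphore and then looks for a slot. On failure it
// releases the semaphore before returning so capacity is not lost.
func (s *server) admit() (int, error) {
	s.sem <- struct{}{}
	for i, busy := range s.slots {
		if !busy {
			s.slots[i] = true
			return i, nil
		}
	}
	<-s.sem // release on the error path
	return -1, fmt.Errorf("admit: %w", errNoSlot)
}

func main() {
	s := &server{sem: make(chan struct{}, 2), slots: []bool{true}}
	_, err := s.admit()
	fmt.Println(err, len(s.sem)) // admit: no free slot 0
}
```

Without the release on the error path, each failure would permanently
consume one unit of the semaphore until no new sequence could start.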

b2a46529

ollamarunner: Ensure batch size limits are not exceeded · 5d097277

Jesse Gross authored Mar 27, 2025

With the llama runner, we can generate up to NUM_PARALLEL batches
at once, which will then get broken up into individual batches
to get executed by llama.cpp (i.e. we add up to 2048 tokens and
this gets split into 4 batches of 512 tokens at default settings).

This splitting can improve parallelism on multi-GPU systems because
the individual batches can move through the pipeline without blocking
on the first one to fully complete. However, we don't yet support
this in the Ollama runner, partially because it makes it hard to
enforce model-specified batch constraints, which didn't exist
previously.

The result is that we will try to execute the full, unsplit batch.
This could result in out of memory or insufficient KV cache space
errors.

This triggers batch breaking when the total inputs from all sequences
exceeds the batch size, rather than per-sequence. In order to ensure
fairness, it also reintroduces round-robinning around sequences so
that we don't let one busy sequence starve the others.
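
A rough sketch of a total batch limit combined with a rotating start
index follows. nextBatch and its rotation rule (start the next batch
at the first sequence that received nothing) are assumptions made for
illustration, not a description of the actual scheduler:

```go
package main

import "fmt"

// nextBatch is a hypothetical scheduler step. It fills one batch with at most
// batchSize inputs in total across all sequences, visiting sequences in order
// beginning at index start (assumed to be in range). It returns how many
// inputs were taken from each sequence and where the next batch should start.
func nextBatch(pending []int, batchSize, start int) (taken []int, next int) {
	n := len(pending)
	if n == 0 {
		return nil, 0
	}
	taken = make([]int, n)
	next = (start + 1) % n
	room := batchSize
	starved := false
	for k := 0; k < n; k++ {
		i := (start + k) % n
		take := pending[i]
		if take > room {
			take = room
		}
		taken[i] = take
		room -= take
		// The first sequence that had work but got nothing goes first next time.
		if take == 0 && pending[i] > 0 && !starved {
			next = i
			starved = true
		}
	}
	return taken, next
}

func main() {
	taken, next := nextBatch([]int{2000, 10}, 512, 0)
	fmt.Println(taken, next) // [512 0] 1

	taken, next = nextBatch([]int{1488, 10}, 512, next)
	fmt.Println(taken, next) // [502 10] 0
}
```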

5d097277

21 Mar, 2025 3 commits

ml/backend/ggml: load tensors in 32KiB chunks · 74bd0965

Michael Yang authored Mar 19, 2025

74bd0965

kvcache: Pass granular cache size into implementations · 3ed7ad3a

Jesse Gross authored Mar 18, 2025

Currently the runner computes the kv size needed and creates a
cache of that size. This is the context size times number of
parallel sequences.

Cache implementations can make better decisions about their memory
usage, so instead pass in the required capacity, number of sequences
and maximum batch size. For now, the causal cache just uses this to
compute the size in the same way as before.
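
As a worked example of the sizing rule, assuming that "capacity" is
the per-sequence context length. The struct, its field names, and
causalCells are hypothetical:

```go
package main

import "fmt"

// cacheParams groups the three values the commit says are passed to cache
// implementations. The struct and field names are hypothetical.
type cacheParams struct {
	capacity     int // assumed: context length per sequence
	maxSequences int // number of parallel sequences
	maxBatch     int // maximum batch size
}

// causalCells sizes a causal cache the same way as before: context size
// times the number of parallel sequences. maxBatch is not used here.
func causalCells(p cacheParams) int {
	return p.capacity * p.maxSequences
}

func main() {
	p := cacheParams{capacity: 4096, maxSequences: 4, maxBatch: 512}
	fmt.Println(causalCells(p)) // 16384
}
```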

3ed7ad3a

ollamarunner: Provide mechanism for backends to report loading progress · 0ff28758

Jesse Gross authored Mar 20, 2025

This enables the runner to report progress back to the Ollama server,
both for showing status to the user and also to prevent the server
from killing the runner if it thinks things have stalled.

Most of the infrastructure was already there; this extends it to
be available to the backends.
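
A minimal sketch of a progress callback of this kind. loadTensors and
its signature are hypothetical, and the loop only simulates reading
tensor data:

```go
package main

import "fmt"

// loadTensors is a hypothetical backend loader that reports fractional
// progress through a callback after each tensor is loaded.
func loadTensors(sizes []int, progress func(float32)) {
	total := 0
	for _, s := range sizes {
		total += s
	}
	done := 0
	for _, s := range sizes {
		done += s // stand-in for reading s bytes of tensor data
		if progress != nil && total > 0 {
			progress(float32(done) / float32(total))
		}
	}
}

func main() {
	loadTensors([]int{256, 256, 512}, func(p float32) {
		fmt.Printf("loaded %.0f%%\n", p*100)
	})
}
```

A server watching these updates can tell a slow load apart from a
stalled one, which is the second purpose the commit message mentions.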

0ff28758

20 Mar, 2025 2 commits

model: Pass input tensor instead of raw data to models · 0fbfcf3c

Jesse Gross authored Mar 19, 2025

Rather than directly giving the input data to models, we can
pass a tensor instead. In the short term, this saves some duplicated
code.

Longer term, we will want to overlap setting up the next batch with
processing of the current one. In this case, we will only have the
shape of tensor but it will not be loaded with data at the time of
graph generation. By passing only a tensor to models now, we set up
this possibility and prevent them from relying on data that they won't
have in the future.

Although the same could be done for Positions and Outputs, in some
cases we either need the raw input data or don't use them at all.
Therefore, for now we leave them as they are and allow models to
convert them to tensors as needed.

0fbfcf3c

input: Rename Options to Batch · 0c220935

Jesse Gross authored Mar 19, 2025

Options is no longer very descriptive of this struct.

0c220935

17 Mar, 2025 2 commits

ollamarunner: Check for minBatch of context space when shifting · bf24498b

Jesse Gross authored Mar 17, 2025

Models can specify that a group of inputs needs to be handled in a single
batch. However, context shifting didn't respect this and could trigger
a break anyways. In this case, we should instead trigger a context
shift earlier so that it occurs before the grouped batch.

Note that there are still some corner cases:
 - A long prompt that exceeds the context window can get truncated
   in the middle of an image. With the current models, this will
   result in the model not recognizing the image at all, which is
   pretty much the expected result with truncation.
 - The context window is set less than the minimum batch size. The
   only solution to this is to refuse to load the model with these
   settings. However, this can never occur with current models and
   default settings.

Since users are unlikely to run into these scenarios, fixing them is
left as a follow up.

bf24498b

runner: remove cache prompt flag from ollama runner (#9826) · 95e271d9

Bruce MacDonald authored Mar 17, 2025

We do not need to bypass the prompt caching in the ollama runner yet, as
only embedding models needed to bypass the prompt caching. When embedding
models are implemented they can skip initializing this cache completely.

95e271d9

14 Mar, 2025 3 commits

ollamarunner: Use a separate context per multimodal input · 282bfaaa

Jesse Gross authored Mar 13, 2025

Currently there is a single context per sequence, shared by
all multimodal inputs. Since we build a vision encoder graph per
image, with a large number of inputs we can eventually hit the
maximum number of graph nodes per context.

This changes to use a separate context for each image, ensuring
that available resource limits are consistent.

282bfaaa

ml: Allow models to constrain inputs to a single batch · 9679f401

Jesse Gross authored Mar 12, 2025

Models may require that a set of inputs all be processed as part
of the same batch. For example, if an image has multiple patches
with fully connected attention between them, we should not split
the batch in the middle of an image.

Fixes #9697

9679f401

llm: remove internal subprocess req and resp types (#9324) · 3892c3a7

Bruce MacDonald authored Mar 14, 2025

This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).

3892c3a7

13 Mar, 2025 1 commit

engine: error on embeddings; not currently implemented · ec46f328

Michael Yang authored Mar 13, 2025

ec46f328

11 Mar, 2025 2 commits

Revert "Allow models to force a new batch" · 65b0f329

jmorganca authored Mar 11, 2025

This reverts commit c7eae586b899083acebcd9b3847b89ea78c2850c.

65b0f329

Allow models to force a new batch · 06007c0a

Jesse Gross authored Mar 10, 2025

This is useful for a few things:
 - Work around bugs, such as having 2 images in one batch
 - Keep the image in a single batch for fully connected attention
 - Improve performance by not evaluating embeddings multiple times

06007c0a

10 Mar, 2025 2 commits

sample: temporarily use grammars for constrained generation in new engine (#9586) · e093db92

Jeffrey Morgan authored Mar 10, 2025

e093db92

model: Update encoder cache to use multimodal input processing handler · a1cda80b

Jesse Gross authored Mar 08, 2025

The encoder cache needs to know the position of images in the input
stream so that it knows when to delete them. Previously images didn't
have a position, so we implied one by breaking batches before an
image and then assuming the image was in the first position. However,
multimodal objects are now given explicit positions in the input
stream, so we can use that instead.

Breaking batches was also a way to simulate a cross attention mask
for mllama. However, given that it only supports a single sequence
and a single image, this mask doesn't serve any real purpose.
Removing the batch break does not appear to affect the quality of
the output.

Most of this is simply moving the input data structures to a new
package to avoid import cycles.

a1cda80b

09 Mar, 2025 1 commit

ollamarunner: Don't panic for unimplemented features at runtime. · 4614fafa

Jesse Gross authored Mar 08, 2025

It's ok to fail on startup but we shouldn't panic during runtime
based on user input. Downgrade the panic to a warning.

4614fafa

08 Mar, 2025 2 commits

ml: Add support for quantized KV cache · 4100ed7b

Jesse Gross authored Feb 21, 2025

Similar to the llama engine, quantizing the KV cache requires
flash attention to be enabled through the Ollama server.

4100ed7b

ollamarunner: Quiet debug logging and panic on unimplemented features · 0daaaef8

Jesse Gross authored Mar 07, 2025

Debug logging of every token has previously caused test timeouts
on slower machines.

0daaaef8