"tests/vscode:/vscode.git/clone" did not exist on "a72a057d62d0adb2743b20968c72ae9cb5e5d62b"
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
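To make the dependency concrete, here is a minimal Go sketch of the kind of guard this implies. The function name and fallback behavior are assumptions for illustration, with q8_0/q4_0 standing in for quantized cache types:

```go
package sketch

import "log/slog"

// kvCacheType is a hypothetical helper: only honor a quantized KV cache
// type when flash attention is enabled, otherwise fall back to f16.
func kvCacheType(requested string, flashAttention bool) string {
	quantized := requested == "q8_0" || requested == "q4_0"
	if quantized && !flashAttention {
		slog.Warn("quantized KV cache requires flash attention; falling back to f16",
			"requested", requested)
		return "f16"
	}
	return requested
}
```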
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
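A minimal sketch of the usual fix, assuming a simple round-up helper rather than the backend's actual API:

```go
// roundUp is an illustrative helper: round a requested buffer size up to the
// backend's alignment so the allocation meets its requirements.
func roundUp(size, alignment uint64) uint64 {
	if alignment == 0 {
		return size
	}
	return (size + alignment - 1) / alignment * alignment
}
```

For example, a 1000-byte request against a 256-byte alignment would be padded to 1024 bytes.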
-
- 07 Mar, 2025 13 commits
-
-
Jesse Gross authored
-
Michael Yang authored
This ensures the tensor is created on the right buffer type for backends such as the CPU.
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
temporary until tensor loading can accurately account for vision models
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Some tensors should be created on specific backends to reduce the number of copies and improve performance.
-
Michael Yang authored
each cache layer creates and maintains its own context instead of using a large context for all layers
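A rough sketch of the resulting shape, using stand-in types rather than the real cache structures:

```go
// Illustrative stand-ins only: each layer of the cache owns its own small
// context (created and freed independently) instead of sharing one large
// context across all layers.
type Context interface{ Close() }
type Tensor interface{}

type cacheLayer struct {
	ctx        Context // per-layer context
	keys, vals Tensor
}
```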
-
Michael Yang authored
some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends
-
Michael Yang authored
Use a strategy similar to llama.cpp's for deciding where tensors should be allocated. This will be improved later to be aware of usable memory before assigning the tensor.
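As a rough illustration of that kind of strategy (made-up function and parameters, and ignoring memory accounting for now):

```go
// assignLayers is an illustrative sketch: offload the last gpuLayers layers,
// spreading them evenly across the available GPUs, and leave the rest on the
// CPU (index -1). A real strategy would also consider usable memory.
func assignLayers(numLayers, gpuLayers, numGPUs int) []int {
	assignment := make([]int, numLayers)
	for i := range assignment {
		assignment[i] = -1 // CPU by default
	}
	if gpuLayers <= 0 || numGPUs <= 0 {
		return assignment
	}
	start := numLayers - gpuLayers
	if start < 0 {
		start = 0
	}
	for i := start; i < numLayers; i++ {
		assignment[i] = (i - start) * numGPUs / gpuLayers
	}
	return assignment
}
```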
-
Jeffrey Morgan authored
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- Output backend system info when initializing the backend; this ensures the information is always present without needing to be requested explicitly
- Convert to structured logging
- Enumerate devices rather than backends, since devices are ordered
- Track device indices grouped by device name
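A small sketch of what the structured, per-device logging could look like, using log/slog and a hypothetical device record:

```go
package sketch

import "log/slog"

// device is a stand-in for whatever the backend reports per device.
type device struct {
	name        string
	description string
	totalMemory uint64
}

// logSystemInfo emits one structured record per device, in enumeration order.
func logSystemInfo(devices []device) {
	for i, d := range devices {
		slog.Info("system",
			"device", i,
			"name", d.name,
			"description", d.description,
			"total_memory_bytes", d.totalMemory)
	}
}
```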
-
- 03 Mar, 2025 1 commit
-
-
Michael Yang authored
Expand backend loading error handling to catch more problems and log them instead of panicking.
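A sketch of the intended handling, with placeholder types, returning and logging the error rather than panicking:

```go
package sketch

import (
	"fmt"
	"log/slog"
)

// Backend and newBackend are placeholders; the point is the shape of the
// handling: report and return the error instead of panicking.
type Backend struct{}

func newBackend(path string) (*Backend, error) { return &Backend{}, nil }

func loadBackend(path string) (*Backend, error) {
	b, err := newBackend(path)
	if err != nil {
		slog.Error("unable to load backend", "path", path, "error", err)
		return nil, fmt.Errorf("loading backend %s: %w", path, err)
	}
	return b, nil
}
```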
-
- 02 Mar, 2025 4 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
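A small sketch of how such requirements might be surfaced to the KV cache; the type and field names are assumptions, and the padding multiple is whatever the kernel reports:

```go
// CacheConfig is an illustrative stand-in for the requirements a backend
// could advertise to the KV cache when flash attention is enabled.
type CacheConfig struct {
	CachePadding int  // pad the visible cache length to this multiple
	PermutedV    bool // store values pre-permuted for the kernel
}

// paddedLength rounds the number of cache entries up to the kernel's padding.
func paddedLength(n int, cfg CacheConfig) int {
	if cfg.CachePadding <= 1 {
		return n
	}
	return (n + cfg.CachePadding - 1) / cfg.CachePadding * cfg.CachePadding
}
```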
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
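An interface-level sketch of the distinction, with stand-in names rather than the real ml API:

```go
// Illustrative interfaces only: Empty skips the initial memset for tensors
// the caller is about to overwrite entirely, while Zeros keeps the old
// allocate-and-clear behavior.
type Tensor interface{}

type Context interface {
	Zeros(shape ...int) Tensor // allocate and zero-fill
	Empty(shape ...int) Tensor // allocate only; caller must fully overwrite
}
```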
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity rules for mulmat, so those Contiguous calls can simply be removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead.

To support this and avoid unexpected tensor shapes being seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
- 27 Feb, 2025 5 commits
-
-
Michael Yang authored
- Update Context.Forward to accept multiple tensors to match the Context.Compute signature
- Update Context.Forward to return Context such that it can be chained with Context.Compute
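A sketch of the resulting shape, using stand-in interfaces rather than the actual ml types:

```go
// Illustrative stand-ins showing the chained form this change enables.
type Tensor interface{}

type Context interface {
	Forward(...Tensor) Context // now variadic and returns the Context
	Compute(...Tensor)
}

func forwardAndCompute(ctx Context, outputs ...Tensor) {
	ctx.Forward(outputs...).Compute(outputs...)
}
```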
-
Michael Yang authored
-
Michael Yang authored
-
Jeffrey Morgan authored
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
-
Jeffrey Morgan authored
-
- 25 Feb, 2025 1 commit
-
-
Blake Mizerany authored
During work on our new registry client, I ran into frustrations with CI where a misspelling in a comment caused the linter to fail, which caused the tests to not run, which caused the build to not be cached, which caused the next run to be slow, which caused me to be sad. This commit addresses these issues and pulls in some helpful changes we've had in CI on ollama.com for some time now. They are:

* Always run tests, even if the other checks fail. Tests are the most important part of CI, and should always run. Failures in tests can be correlated with failures in other checks, and can help surface the root cause of the failure sooner. This is especially important when the failure is platform specific and the tests are not platform independent.

* Check that `go generate` is clean. This prevents 'go generate' abuse regressions. This codebase used to use it to generate platform-specific binary build artifacts. Let's make sure that does not happen again, that this powerful tool is used correctly, and that the generated code is checked in. Also, while adding the `go generate` check, it was revealed that the generated Metal code was putting dates in the comments, resulting in non-deterministic builds. This is a bad practice, and this commit fixes that. Git tells us the most important date: the commit date, along with other associated changes.

* Check that `go mod tidy` is clean. A new job checks that `go mod tidy` is clean, to prevent easily avoidable merge conflicts or go.mod changes being deferred to a future PR that is unrelated to the change that caused go.mod to change.

* More robust caching. We now cache the go build cache and the go mod download cache independently. This is because the download cache contains zips that can be unpacked in parallel faster than they can be fetched and extracted by tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with longer build times due to small failures like misspellings.
-
- 24 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 21 Feb, 2025 2 commits
-
-
Jesse Gross authored
There are two benefits to doing this:
- Provides a library function that models can use, reducing code for each model implementation
- Enables a single place to drop in optimized implementations of attention based on the backend or other factors

One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
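As an illustration of the generic path such a shared attention helper can fall back to when no fused kernel is available, here is a schematic sketch; the tensor interface is a stand-in, not the real ml API:

```go
// Stand-in tensor interface for illustration only.
type Tensor interface {
	MulmatT(other Tensor) Tensor // multiply by the transpose of other
	Scale(s float64) Tensor
	Softmax() Tensor
	Mulmat(other Tensor) Tensor
}

// attention computes softmax(Q·K^T · scale)·V, the generic scaled
// dot-product path a shared helper can use when no fused kernel exists.
func attention(query, key, value Tensor, scale float64) Tensor {
	scores := query.MulmatT(key).Scale(scale).Softmax()
	return scores.Mulmat(value)
}
```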
-
Michael Yang authored
-
- 20 Feb, 2025 2 commits
-
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
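A sketch of the passthrough; only the field names are taken from the list above, the struct itself is an assumption:

```go
// BackendParams is an illustrative container for the runner parameters that
// now get handed to the backend.
type BackendParams struct {
	NumThreads   int       // threads
	NumGPULayers int       // numGPULayers
	MainGPU      int       // mainGPU
	TensorSplit  []float32 // tensorSplit
}
```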
-
- 19 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 18 Feb, 2025 1 commit
-
-
Michael Yang authored
Sapphire Rapids has AMX support, but it ends up having a negative performance impact. Emerald Rapids also has AMX support, with a positive performance impact; however, there's no reasonable way in GGML to differentiate between the two. The impact is small (~6%), so disable AMX entirely for simplicity.
-
- 14 Feb, 2025 6 commits
-
-
Daniel Hiltgen authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
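A sketch of that pattern with stand-in names: synchronize first, then copy in a single call so the destination buffer stays valid for its duration:

```go
// Illustrative stand-in, not the real tensor API.
type Tensor interface {
	Sync()                // block until pending async computation has completed
	CopyTo(dst []float32) // synchronous device-to-host copy
}

func readFloats(t Tensor, n int) []float32 {
	t.Sync()
	out := make([]float32, n)
	t.CopyTo(out) // one call; the Go buffer stays valid for its duration
	return out
}
```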
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
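A sketch of the before/after calling pattern, with hypothetical newTensor/CopyFrom names standing in for the real binding:

```go
// Illustrative stand-ins only.
type Tensor interface {
	CopyFrom(src []float32) // synchronous host-to-device copy
}

func newTensor(data []float32, n int) Tensor { panic("illustrative stub") }

func fromGoSlice(data []float32) Tensor {
	// Unsafe: newTensor(data, len(data)) would hand the library a Go pointer
	// that could be freed or moved while the context is still open.
	t := newTensor(nil, len(data)) // nil pointer + size: memory is allocated on the C side
	t.CopyFrom(data)               // then populate it with a synchronous copy
	return t
}
```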
-