Commits · 5b446cc8150db0986c4c6afa9dbcee3cefcdc27f · OpenDAS / ollama

06 Feb, 2025 1 commit
- chore: update gitattributes (#8860) · 5b446cc8
  Michael Yang authored Feb 05, 2025
```
* chore: update gitattributes
* chore: add build info source
```
  5b446cc8
05 Feb, 2025 1 commit
- llama: use dynamic backend loading for mllama and clip (#8835) · cd3fbf1c
  Jeffrey Morgan authored Feb 05, 2025
  
  cd3fbf1c
31 Jan, 2025 1 commit
- Revert "cgo: use O3" · 548a9f56
  Michael Yang authored Jan 30, 2025
```
This reverts commit bea1f1fa.
```
  548a9f56
30 Jan, 2025 1 commit
- cgo: use O3 · bea1f1fa
  Michael Yang authored Jan 30, 2025
  
  bea1f1fa
29 Jan, 2025 1 commit

Michael Yang authored Jan 29, 2025



* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unnecessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

dcfb7a10

14 Jan, 2025 1 commit
- llama: move grammar tests to llama_test.go (#8411) · 61676fb5
  Jeffrey Morgan authored Jan 14, 2025
  
  61676fb5
08 Jan, 2025 1 commit
- llama: update vendored code to commit 46e3556 (#8308) · 1deafd82
  Jeffrey Morgan authored Jan 08, 2025
  
  1deafd82
04 Jan, 2025 1 commit
- llama: fix runner api example url in README.md (#8307) · 3919f4ba
  Ubaldo Porcheddu authored Jan 04, 2025
  
  3919f4ba
19 Dec, 2024 1 commit

llama: test key order preservation in schema_to_grammar (#8078) · 290cf204

Parth Sareen authored Dec 18, 2024

This change adds a test to catch a regression in schema_to_grammar where
the order of keys in the JSON schema is not preserved in the generated
grammar, which is critical for step-by-step reasoning.

290cf204

17 Dec, 2024 1 commit

llama: Ensure KV cache is fully defragmented. · 08a832b4

Jesse Gross authored Dec 12, 2024

Sometimes the KV cache requires defragmentation even without
triggering the threshold heuristic. In this case, decoding
will not be able to find a KV cache slot. This is particularly
difficult for the caller to handle if it happens in between
ubatches. To avoid this, we should immediately trigger a defrag.

In addition, a heavily fragmented cache can require more than
max_moves to defragment. Currently, we stop when we hit the limit
but this can leave a cache that still does not have adequate space
even after defragmentation is triggered. Instead, we should do
multiple batches of processing until everything is complete.

Fixes #7949

08a832b4

14 Dec, 2024 1 commit
- llama: update vendor code to commit ba1cb19c (#8101) · 7a81daf0
  Jeffrey Morgan authored Dec 14, 2024
  
  7a81daf0
13 Dec, 2024 1 commit
- runner: switch logging back to stderr (#8091) · 60f75560
  Daniel Hiltgen authored Dec 13, 2024
```
This puts the low-level runner logging back on stderr for consistency with prior releases
```
  60f75560
12 Dec, 2024 2 commits
- llama: parse JSON schema using nlohmann::ordered_json to maintain ordering (#8071) · c2168505
  Pascal Patry authored Dec 12, 2024
  
  c2168505
- llama: enable JSON schema key ordering for generating grammars (#8055) · 18f6a98b
  Parth Sareen authored Dec 11, 2024
  
  18f6a98b
11 Dec, 2024 2 commits

llama: preserve field order in user-defined JSON schemas (#8002) · 9039c821

Blake Mizerany authored Dec 11, 2024

Previously we decoded and re-encoded JSON schemas during validation,
which served no purpose since json.RawMessage already validates JSON
syntax. Worse, the re-encoding lost field ordering from the original
schema, which affects inference quality during step-by-step reasoning.

While fixing this ordering issue by using json.RawMessage directly,
testing revealed that schema_to_grammar (from llama.cpp) also fails to
preserve field order during grammar generation. This appears to be the
root cause of inference degradation.

This change prevents us from mangling the user's original schema order,
but we still need to address the ordering issue in schema_to_grammar.
That will be a separate change.

Updates #7978

9039c821
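
The ordering problem described in the commit message above (9039c821) can be reproduced with the standard library alone. The sketch below is illustrative, not the code from the change: `reencode` and `passthrough` are hypothetical helper names, and the schema is made up. It relies on the documented behavior of `encoding/json`, which writes map keys in sorted order.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// reencode mimics a decode-then-encode round trip. Decoding into a map
// discards the original key order, and json.Marshal emits map keys sorted.
func reencode(schema []byte) (string, error) {
	var m map[string]any
	if err := json.Unmarshal(schema, &m); err != nil {
		return "", err
	}
	out, err := json.Marshal(m)
	return string(out), err
}

// passthrough keeps the caller's bytes untouched and only checks syntax.
func passthrough(schema json.RawMessage) (string, error) {
	if !json.Valid(schema) {
		return "", errors.New("invalid JSON schema")
	}
	return string(schema), nil
}

func main() {
	// Hypothetical schema: "reasoning" is meant to be generated before "answer".
	schema := []byte(`{"reasoning":{"type":"string"},"answer":{"type":"string"}}`)

	a, _ := reencode(schema)
	b, _ := passthrough(schema)
	fmt.Println(a) // "answer" now comes first
	fmt.Println(b) // original order preserved
}
```

Keeping the schema as raw bytes means the order the user wrote is the order every later stage sees.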

llama: update vendored code to commit 40c6d79f (#7875) · 527cc978
Jeffrey Morgan authored Dec 10, 2024

527cc978

10 Dec, 2024 2 commits

Remove unused runner CpuFeatures (#8032) · b9ccb374

Daniel Hiltgen authored Dec 10, 2024

The final implementation of #7499 removed dynamic vector requirements
in favor of a simpler filename-based model, and this was leftover logic that
is no longer needed.

b9ccb374

build: Make target improvements (#7499) · 4879a234

Daniel Hiltgen authored Dec 10, 2024

* llama: wire up builtin runner

This adds a new entrypoint into the ollama CLI to run the cgo built runner.
On Mac arm64, this will have GPU support, but on all other platforms it will
be the lowest common denominator CPU build.  After we fully transition
to the new Go runners more tech-debt can be removed and we can stop building
the "default" runner via make and rely on the builtin always.

* build: Make target improvements

Add a few new targets and help for building locally.
This also adjusts the runner lookup to favor local builds, then
runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

This implements a simplified custom CPU flags pattern for the runners.
When built without overrides, the runner name contains the vector flag
we check for (AVX) to ensure we don't try to run on unsupported systems
and crash.  If the user builds a customized set, we omit the naming
scheme and don't check for compatibility.  This avoids checking
requirements at runtime, so that logic has been removed as well.  This
can be used to build GPU runners with no vector flags, or CPU/GPU
runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

If the user checks out the repo in a path that contains spaces, make gets
really confused so use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

This removes support for v0.3.6 and older versions (before the tar bundle)
and ensures we clean up prior libraries before extracting the bundle(s).
Without this change, runners and dependent libraries could leak when we
update and lead to subtle runtime errors.

4879a234

05 Dec, 2024 1 commit

api: structured outputs - chat endpoint (#7900) · 630e7dc6

Parth Sareen authored Dec 04, 2024



Adds structured outputs to chat endpoint

---------
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>

630e7dc6

03 Dec, 2024 1 commit
- llm: introduce k/v context quantization (vRAM improvements) (#6279) · 1bdab9fd
  Sam authored Dec 04, 2024
  
  1bdab9fd
29 Nov, 2024 1 commit
- llama: fix typo and formatting in readme (#7876) · 39e29ae5
  Jeffrey Morgan authored Nov 28, 2024
  
  39e29ae5
27 Nov, 2024 1 commit
- Support Multiple LoRa Adapters (#7667) · e3936d4f
  ItzCrazyKns authored Nov 28, 2024
```
Closes #7627
```
  e3936d4f
26 Nov, 2024 2 commits

runner.go: Don't try to extract image tags for text models · 71e6a0d0

Jesse Gross authored Nov 20, 2024

When processing a prompt, we look for image tags of the form
[img-0], which are inserted by the Ollama server process.
However, this can cause errors if the original prompt has these
tags - typically an image not found error is returned.

This changes tag searching behavior to be similar to the 0.3.x
series, which will largely avoid these problems. However, they can
still happen when input text with these tags is used with image
models. The correct solution is to escape the tags but this is a
larger issue with special sequences in general so this is an
incremental fix that should avoid the problem for the majority
of cases.

71e6a0d0

runner.go: Add unit tests for context shifting · 2cd11ae3

Jesse Gross authored Nov 25, 2024

This also makes it easier to truncate long inputs the same as
shifting but does not actually implement it. This type of
truncation has a trade off between quality and time to first
token.

2cd11ae3

23 Nov, 2024 1 commit

runner.go: Fix deadlock with many concurrent requests · 3478b2cf

Jesse Gross authored Nov 22, 2024

If there are no available slots for new sequences then a request
will not be added to the processing queue but will continue on
to wait for a response that never comes. Besides never giving a
response to the request, this prevents the model from being
unloaded due to the outstanding request.

To prevent this, there are semaphores that prevent more requests
from being processed than there are slots - one in the Ollama
server and one in the runner.
 - The Ollama server one works but it is not designed to protect
   the runner's internal data structures and the runner can return a
   final response before clearing its data structures.
 - The internal runner semaphore has similar behavior where it
   can release the semaphore when it issues a response. This is
   wrong - it should only release the semaphore after it has
   cleared the data structure.

In addition, we should return an error if a slot is not found
rather than deadlocking in the event we ever get to this spot.

Fixes #7779

3478b2cf
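
The two rules stated in the commit message above (3478b2cf) are: release the semaphore only after the slot has been cleared, and return an error rather than wait when no slot is found. A minimal sketch follows. The `runner` type, the channel-based semaphore, and all names are assumptions for illustration and do not mirror the actual runner code.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

var errNoSlot = errors.New("no free slot")

// runner is a hypothetical stand-in with one semaphore token per slot.
type runner struct {
	sem   chan struct{} // capacity equals the number of slots
	mu    sync.Mutex
	slots []bool // true while a slot is in use
}

func newRunner(n int) *runner {
	return &runner{sem: make(chan struct{}, n), slots: make([]bool, n)}
}

// acquire takes a semaphore token, then claims a slot. If no slot is
// free it gives the token back and returns an error instead of waiting
// for a response that will never come.
func (r *runner) acquire() (int, error) {
	r.sem <- struct{}{}
	r.mu.Lock()
	defer r.mu.Unlock()
	for i, used := range r.slots {
		if !used {
			r.slots[i] = true
			return i, nil
		}
	}
	<-r.sem
	return -1, errNoSlot
}

// release clears the slot first and only then returns the token, so a
// waiting request can never observe a token without a free slot.
func (r *runner) release(i int) {
	r.mu.Lock()
	r.slots[i] = false
	r.mu.Unlock()
	<-r.sem
}

func main() {
	r := newRunner(2)
	a, _ := r.acquire()
	b, _ := r.acquire()
	fmt.Println(a, b)
	r.release(a)
	c, err := r.acquire()
	fmt.Println(c, err)
}
```

The ordering inside `release` is the point of the sketch: swapping the two steps recreates the window in which a new request holds a token but finds every slot still marked as used.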

22 Nov, 2024 1 commit

logs: explain client aborts better (#7783) · b85520bf

Daniel Hiltgen authored Nov 22, 2024

Users get confused by "Failed to acquire semaphore" error="context canceled"
messages in the logs, which are actually clients giving up. While there could be
a legitimate hang bug in the system, sometimes this is just short client timeouts
with an overloaded system, so this should help users understand what's going on
better.

b85520bf

21 Nov, 2024 1 commit
- readme: update AMD ROCm links (#7213) · 1a742f54
  boessu authored Nov 21, 2024
  
  1a742f54
20 Nov, 2024 5 commits

runner.go: Truncate inputs that exceed context rather than shifting · c4b34f2a

Jesse Gross authored Nov 20, 2024

Previous versions of the runner would truncate inputs to the context
window before beginning processing. The main processing loop relied
on this behavior if the context needed to be shifted later (due to
token generation). If truncation did not occur then invariants
would be broken, causing crashes or infinite loops.

Later versions attempted to fix these bugs and make the logic less
subtle so that all inputs could be handled. Truncation was removed
to make things consistent.

However, truncation is much faster than processing and shifting, so
removing it caused performance problems when the input vastly exceeded
the context size. This restores the input truncation as a performance
optimization while keeping the more robust processing logic.

Fixes #7762

c4b34f2a

runner.go: Don't add inputs to cache view until actually processed · c3ff9164

Jesse Gross authored Nov 19, 2024

We need to track which tokens are in the cache ourselves. We currently
add tokens to the cache tracker when we add them to batch but they are
not actually in the cache until we call Decode. This can cause
confusion when we are shifting the cache.

Avoids "could not find a KV slot for the batch" issues.

Bug #7545

c3ff9164

runner.go: Hard fail on errors rather than potentially infinite looping · 3fc1dc0e

Jesse Gross authored Nov 19, 2024

We try to recover from errors by dropping the tokens that caused the
problem and re-trying. However, dropping the tokens is not correct
and continuing often leads to infinite loops. To avoid this, we
end the sequence if such a condition is detected, which is also
surprising.

At this point, it is better to just report the error. This will make
it easier to find problems and the alternatives are perhaps even more
surprising to users.

This is not a very satisfactory solution either - we should isolate
the error and return it to the user without killing the whole process.
However, this is an incremental step and consistent with most other
failures (which either manifest as abort() or panic).

3fc1dc0e

runner.go: Retry decoding after defragmentation if needed · 7121dfa3

Jesse Gross authored Nov 19, 2024

Fragmentation of the KV cache can occur due to cache shifting or
different sequences getting processed. Decode uses a heuristic to
decide if it should defrag. However, this heuristic isn't 100%
accurate, so decoding can sometimes fail by surprise.

For these cases, if decode indicates that there is no KV cache space,
we should defrag and then try again.

7121dfa3
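
The retry described in the commit message above (7121dfa3) amounts to: decode, and if decoding reports that there is no KV cache space, defragment once and decode again. A minimal sketch follows. The `decoder` interface, the `errNoKVSlot` sentinel, and the fake cache are assumptions for illustration, not the llama bindings.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoKVSlot stands in for the "no KV cache space" result of a decode.
var errNoKVSlot = errors.New("could not find a KV slot for the batch")

// decoder is a hypothetical interface over the cache operations used here.
type decoder interface {
	Decode() error
	Defrag()
}

// decodeWithRetry decodes once and, if the cache had no room, defragments
// and tries exactly one more time. Other errors are returned as-is.
func decodeWithRetry(d decoder) error {
	err := d.Decode()
	if errors.Is(err, errNoKVSlot) {
		d.Defrag()
		err = d.Decode()
	}
	return err
}

// fakeCache fails to decode while it is fragmented.
type fakeCache struct {
	fragmented       bool
	decodes, defrags int
}

func (f *fakeCache) Decode() error {
	f.decodes++
	if f.fragmented {
		return errNoKVSlot
	}
	return nil
}

func (f *fakeCache) Defrag() {
	f.defrags++
	f.fragmented = false
}

func main() {
	c := &fakeCache{fragmented: true}
	fmt.Println(decodeWithRetry(c), c.decodes, c.defrags)
}
```

Retrying only once keeps the failure visible if defragmentation does not free enough space, instead of looping.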

runner.go: Use correct index when retrieving embedding results · 5f68fcab

Jesse Gross authored Nov 19, 2024

This doesn't have any impact currently because NUM_PARALLEL is forced
to 1 for embeddings, so both indices will always be 0.

5f68fcab

19 Nov, 2024 1 commit

fix(runner): Set logits to 0 if false on Batch.Add · 807ace5b

Gabe Goodhart authored Nov 19, 2024

https://github.com/ollama/ollama/issues/7656


Branch: Granite3StoppingBug-7656
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

807ace5b

15 Nov, 2024 2 commits

runner.go: Propagate panics back to the user. · d875e99e

Jesse Gross authored Nov 15, 2024

This is a partial revert of 8a35bb92
"runner.go: Increase survivability of main processing loop", removing
the panic handler.

Although we want to avoid errors taking down the runner, we also
should make the user aware of problems when they happen. In the
future, we can restructure things so both parts are true.

d875e99e

runner.go: Increase survivability of main processing loop · 8a35bb92

Jesse Gross authored Nov 14, 2024

Currently, if an error occurs during the prep stages (such as
tokenizing) of a single request, it will only affect that request.
However, if an error happens during decoding, it can take down the
entire runner.

Instead, it's better to drop the tokens that triggered the error and try to
keep going. However, we also need to stop when we run out of tokens,
otherwise, this just causes an infinite loop. This is likely the cause
of at least some of the hanging issues that have been reported.

Bug #7573

8a35bb92

14 Nov, 2024 3 commits

runner.go: Don't trim whitespace from inputs · c25ffde9

Jesse Gross authored Nov 13, 2024

It's possible to get prompts that consist entirely of whitespace -
this is most likely to happen when generating embeddings. Currently,
we will trim this away, leaving an empty prompt, which will then
generate an error.

Generating embeddings from whitespace should not trigger an error,
as this may break pipelines. It's better to just leave the whitespace
in place and process what we are given. This is consistent with
past versions of Ollama.

Bug #7578

c25ffde9

runner.go: Enforce NUM_PARALLEL directly in the runner · 17b386a8

Jesse Gross authored Nov 12, 2024

NUM_PARALLEL is currently enforced by the Ollama server process - it
will only issue requests to the runner if the maximum number of
concurrent requests has not been exceeded. Although this should
be sufficient, it is good for the runner to protect its own data
structures. Currently, if too many requests get through to the
runner, they will just get stuck and never return.

This may help with reports of Ollama hanging, though it is unclear
how it would actually occur.

Bug #7573

17b386a8

fix(mllama): sync backend between batches · 5b3393b6
Michael Yang authored Nov 13, 2024

5b3393b6

12 Nov, 2024 2 commits
- runner.go: Fix off-by-one for num predicted · d7eb05b9
  Jesse Gross authored Nov 12, 2024
  
  d7eb05b9
- Jetpack support for Go server (#7217) · df011054
  Daniel Hiltgen authored Nov 12, 2024
```
This adds support for the Jetson JetPack variants into the Go runner
```
  df011054