Commits · a2cc8571c5b2b8f77a8a5e2f65cb7aaa56482dc4 · OpenDAS / ollama

19 May, 2025 1 commit

llm: Consistently track unassigned model data · a2cc8571

Jesse Gross authored May 13, 2025

In some cases, if we fail to assign a piece of the model to a GPU then
we lose track of this data. Although it doesn't change the memory
allocation, it does affect the total size of the model reported by
tools such as ollama ps (and also the percent offloaded).

This makes it look like setting num_gpu isn't reflected in ollama ps,
which isn't true but the offloading percent may appear to not change.

Spreading the model across more GPUs will continue to impact the
reported total size of the model.

a2cc8571

14 May, 2025 1 commit
- chore: update mllama to use ollama engine (#10637) · 23125648
  Michael Yang authored May 13, 2025
  
  23125648
13 May, 2025 2 commits

Fixed over vram allcation dure to small initial layer sizes. · 0478d440

tej authored May 13, 2025


Co-authored-by: Tej Kiran <kiran.tej@amd.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Tej Kiran <itej89@gmailcom>

0478d440

Revert "remove cuda v11 (#10569)" (#10692) · c6bcdc42

Daniel Hiltgen authored May 13, 2025

Bring back v11 until we can better warn users that their driver
is too old.

This reverts commit fa393554.

c6bcdc42

12 May, 2025 1 commit
- feat: add trace log level (#10650) · f95a1f2b
  Michael Yang authored May 12, 2025
```
reduce prompt log to trace level
```
  f95a1f2b
08 May, 2025 1 commit
- lint: enable usetesting, disable tenv (#10594) · 6e9a7a25
  Michael Yang authored May 08, 2025
  
  6e9a7a25
07 May, 2025 2 commits

sched: fix race leading to orphaned runners (#10599) · 5e380c3b

Daniel Hiltgen authored May 07, 2025

If a model is loading, and the request context is canceled during the load
by a client closing the connection, and another request is inbound for the
same model with a different configuration (context size, etc.) thus requiring
a reload, two unload events can be in flight. The first shuts down the
original model load, but the second one caused the loss of the new
reloading runner reference, thus triggering the leak.

The primary fix is detecting the duplicate unload and ignoring the second
instance. The load routine is also hardened to ensure we detect
clobbering an already present runner and unload it with a warning.

5e380c3b

remove cuda v11 (#10569) · fa393554

Daniel Hiltgen authored May 06, 2025

This reduces the size of our Windows installer payloads by ~256M by dropping
support for nvidia drivers older than Feb 2023. Hardware support is unchanged.

Linux default bundle sizes are reduced by ~600M to 1G.

fa393554

06 May, 2025 1 commit

Move quantization to new backend (#10363) · 42481045

Daniel Hiltgen authored May 06, 2025

* Move quantization logic to GGML via new backend

This moves the model aware logic to Go code and calls GGMLs quantization code for model creation.

* Remove "add model quantizations"

This is no longer needed now that quantization is implemented in Go+GGML code directly.

42481045

05 May, 2025 3 commits

api: remove unused or unsupported api options (#10574) · 3b2d2c83

Jeffrey Morgan authored May 05, 2025

Some options listed in api/types.go are not supported in
newer models, or have been deprecated in the past. This is
the first of a series of PRs to clean up the API options

3b2d2c83

win: lint fix (#10571) · b1c40138
Daniel Hiltgen authored May 05, 2025

b1c40138

Hide empty terminal window (#8668) · 17466217

Ashok Gelal authored May 05, 2025

This hides the LlamaServer blank window when chatting outside of the terminal (say like with an app like Msty). This has no other side effects when invoking it the regular way.

17466217

03 May, 2025 2 commits

win: ensure ollama paths come first (#10549) · 6a74bba7

Daniel Hiltgen authored May 03, 2025

For all search path env vars make sure our dirs are first
to avoid potentially finding other incompatible libraries
on the users system.

Also fixes a minor build script glitch for windows rocm

6a74bba7

sched: logging improvements (#10550) · 76ea735a

Daniel Hiltgen authored May 03, 2025

This enhances our logging in the scheduler. The initial "waiting for server" log
no longer claims an initial error state (now "not responding" which better reflects
the actual state). Runners now have slog wiring to report more details about the
runner, including PID.

76ea735a

30 Apr, 2025 1 commit

Narrow set of paths we load GGML from (#10485) · 718eda1b

Daniel Hiltgen authored Apr 30, 2025

Users may have other incompatible GGML installs on their systems.
This will prevent us from trying to load them from the path.

718eda1b

25 Apr, 2025 1 commit
- explicitly decode maxarraysize 1024 · 340448d2
  Michael Yang authored Apr 25, 2025
  
  340448d2
24 Apr, 2025 1 commit
- server: improve spacing for JSON grammar (#10131) · 11dde418
  Parth Sareen authored Apr 24, 2025
  
  11dde418
03 Apr, 2025 1 commit

llm: set done reason at server level (#9830) · e53b3cbd

Bruce MacDonald authored Apr 03, 2025

No functional change. Many different done reasons can be set at the runner
level, so rather than obsuring them we should return them to the server
process and let it choose what to do with the done reason. This separates
the API concerns from the runner.

e53b3cbd

26 Mar, 2025 2 commits

ggml: Support heterogeneous KV cache layer sizes in memory estimation · f66216e3

Jesse Gross authored Mar 24, 2025

Gemma3 uses sliding windows for its context on 5/6 layers, significantly
reducing memory usage but leading to uneven usage across layers,
which makes allocation to the correct GPU difficult. We currently
estimate very conservatively by assuming all layers are consistent
at the max size.

Llama3.2-vision is also inconsistent between self attention and cross
attention layers - at moment, we calculate the correct total size
and then average this across layers. In some cases, this may lead
to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size
and take this account when placing layers onto GPUs. We already do
this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890

f66216e3

llm: Fix debug logging for memory estimates · f4f0992b
Jesse Gross authored Mar 25, 2025

f4f0992b

14 Mar, 2025 1 commit

llm: remove internal subprocess req and resp types (#9324) · 3892c3a7

Bruce MacDonald authored Mar 14, 2025

This commit refactors the LLM subsystem by removing internal subprocess
request and response types. It consolidates duplicate type definitions
across the codebase, moving them to centralized locations. The change also
standardizes interfaces between components, simplifies the ServerStatusResp
struct, and moves the ParseDurationMs function to a common package. This
cleanup reduces code duplication between different runner implementations
(llamarunner and ollamarunner).

3892c3a7

13 Mar, 2025 1 commit
- count gemma3 vision tensors · 033cec23
  Michael Yang authored Mar 12, 2025
  
  033cec23
11 Mar, 2025 1 commit
- llm: auto detect models that require Ollama Engine (#1 ) · ab39e08e
  Daniel Hiltgen authored Mar 11, 2025
  
  ab39e08e
10 Mar, 2025 1 commit
- sample: temporarily use grammars for constrained generation in new engine (#9586) · e093db92
  Jeffrey Morgan authored Mar 10, 2025
  
  e093db92
07 Mar, 2025 1 commit

model: Don't unconditionally add special tokens · b70fc4d5

Jesse Gross authored Mar 05, 2025

We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.

b70fc4d5

04 Mar, 2025 1 commit

New engine: vision models and auto-fallback (#9113) · 1fdb351c

Daniel Hiltgen authored Mar 04, 2025

* Include unified vision layers in memory prediction

For newer vision models with a single gguf, include
the projection estimates.

* Adjust CLI to handle both styles of vision model metadata

* Wire up new tokenizers for new engine

If we're loading the new engine, utilize the new model
text processor instead of calling into cgo wrappers for
llama.cpp.  This also cleans up some tech debt from the
older tokenization flow for the C++ server which was
no longer used.

This also adjusts the grammar handling logic to pass
through to the new engine instead of utilizing the cgo
schema to grammar call.

* Lay foundation for auto selection of new engine

1fdb351c

24 Feb, 2025 1 commit
- config: allow setting context length through env var (#8938) · 314573bf
  Parth Sareen authored Feb 24, 2025
```
* envconfig: allow setting context length through env var
```
  314573bf
14 Feb, 2025 4 commits

llm: attempt to evaluate symlinks, but do not fail (#9089) · 5296f487

Jeffrey Morgan authored Feb 13, 2025

provides a better approach to #9088 that will attempt to
evaluate symlinks (important for macOS where 'ollama' is
often a symlink), but use the result of os.Executable()
as a fallback in scenarios where filepath.EvalSymlinks
fails due to permission erorrs or other issues

5296f487

llm: do not evaluate symlink for exe path lookup (#9088) · f05774b0

Jeffrey Morgan authored Feb 13, 2025

In some cases, the directories in the executable path read by
filepath.EvalSymlinks are not accessible, resulting in permission
errors which results in an error when running models. It also
doesn't work well on long paths on windows, also resulting in
errors. This change removes filepath.EvalSymlinks when accessing
os.Executable() altogether

f05774b0

Runner for Ollama engine · ed443a03

Jesse Gross authored Dec 17, 2024

This provides integration with the new Ollama engine
(58245413 next ollama runner (#7913)) and the rest of the Ollama
infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to
support requirements of how Ollama runs models such as:
 - Parallel processing
 - Memory management for defragmentation and shifting
 - Multi-modal modals

Both old and new engines continue to be supported. By default, only
the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1

ed443a03

next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025



feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

58245413

04 Feb, 2025 1 commit
- ml/backend/ggml: fix library loading on macOS amd64 (#8827) · 4759ecae
  Jeffrey Morgan authored Feb 04, 2025
  
  4759ecae
03 Feb, 2025 1 commit
- llm: do not error if LibOllamaPath does not exist (#8801) · 50566113
  Jeffrey Morgan authored Feb 03, 2025
  
  50566113
29 Jan, 2025 1 commit

next build (#8539) · dcfb7a10

Michael Yang authored Jan 29, 2025



* add build to .dockerignore

* test: only build one arch

* add build to .gitignore

* fix ccache path

* filter amdgpu targets

* only filter if autodetecting

* Don't clobber gpu list for default runner

This ensures the GPU specific environment variables are set properly

* explicitly set CXX compiler for HIP

* Update build_windows.ps1

This isn't complete, but is close.  Dependencies are missing, and it only builds the "default" preset.

* build: add ollama subdir

* add .git to .dockerignore

* docs: update development.md

* update build_darwin.sh

* remove unused scripts

* llm: add cwd and build/lib/ollama to library paths

* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS

* add additional cmake output vars for msvc

* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12

* remove unncessary filepath.Dir, cleanup

* add hardware-specific directory to path

* use absolute server path

* build: linux arm

* cmake install targets

* remove unused files

* ml: visit each library path once

* build: skip cpu variants on arm

* build: install cpu targets

* build: fix workflow

* shorter names

* fix rocblas install

* docs: clean up development.md

* consistent build dir removal in development.md

* silence -Wimplicit-function-declaration build warnings in ggml-cpu

* update readme

* update development readme

* llm: update library lookup logic now that there is one runner (#8587)

* tweak development.md

* update docs

* add windows cuda/rocm tests

---------
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

dcfb7a10

08 Jan, 2025 1 commit
- llama: update vendored code to commit 46e3556 (#8308) · 1deafd82
  Jeffrey Morgan authored Jan 08, 2025
  
  1deafd82
17 Dec, 2024 3 commits

llm: do not error on "null" format (#8139) · 2ddc32d5
Blake Mizerany authored Dec 17, 2024
```
This fixes another regression in the previous commit that fixed other
known bugs.
```
2ddc32d5

llm: do not silently fail for supplied, but invalid formats (#8130) · 87f0a49f

Blake Mizerany authored Dec 16, 2024

Changes in #8002 introduced fixes for bugs with mangling JSON Schemas.
It also fixed a bug where the server would silently fail when clients
requested invalid formats. It also, unfortunately, introduced a bug
where the server would reject requests with an empty format, which
should be allowed.

The change in #8127 updated the code to allow the empty format, but also
reintroduced the regression where the server would silently fail when
the format was set, but invalid.

This commit fixes both regressions. The server does not reject the empty
format, but it does reject invalid formats. It also adds tests to help
us catch regressions in the future.

Also, the updated code provides a more detailed error message when a
client sends a non-empty, but invalid format, echoing the invalid format
in the response.

This commits also takes the opportunity to remove superfluous linter
checks.

87f0a49f

llm: loosen format check to default to no format (#8127) · 0f06a6da
Jeffrey Morgan authored Dec 16, 2024

0f06a6da

11 Dec, 2024 2 commits

llama: preserve field order in user-defined JSON schemas (#8002) · 9039c821

Blake Mizerany authored Dec 11, 2024

Previously we decoded and re-encoded JSON schemas during validation,
which served no purpose since json.RawMessage already validates JSON
syntax. Worse, the re-encoding lost field ordering from the original
schema, which affects inference quality during step-by-step reasoning.

While fixing this ordering issue by using json.RawMessage directly,
testing revealed that schema_to_grammar (from llama.cpp) also fails to
preserve field order during grammar generation. This appears to be the
root cause of inference degradation.

This change prevents us from mangling the user's original schema order,
but we still need to address the ordering issue in schema_to_grammar.
That will be a separate change.

Updates #7978

9039c821

llama: update vendored code to commit 40c6d79f (#7875) · 527cc978
Jeffrey Morgan authored Dec 10, 2024

527cc978