- 01 Apr, 2024 3 commits
-
-
Daniel Hiltgen authored
This should resolve a number of memory-leak and stability defects by allowing us to isolate llama.cpp in a separate process, shut it down when idle, and gracefully restart it if it has problems. This also serves as a first step toward running multiple copies to support multiple models concurrently.
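As a rough illustration of the isolation pattern described above (a separate runner process that is reaped when idle and restarted on demand), here is a minimal Go sketch; the runner type, binary name, and timeout are placeholders rather than ollama's actual code.

```go
package main

import (
	"log"
	"os/exec"
	"time"
)

// runnerProcess wraps an externally managed llama.cpp-style server process.
// All names here are illustrative, not ollama's real types.
type runnerProcess struct {
	cmd      *exec.Cmd
	lastUsed time.Time
}

// start launches the runner as a separate OS process so a crash or leak
// inside it cannot take down the parent server.
func start(binary string, args ...string) (*runnerProcess, error) {
	cmd := exec.Command(binary, args...)
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	return &runnerProcess{cmd: cmd, lastUsed: time.Now()}, nil
}

// reapIfIdle stops the subprocess after it has been unused for idleAfter,
// freeing its memory; callers restart it on the next request.
func (r *runnerProcess) reapIfIdle(idleAfter time.Duration) {
	if time.Since(r.lastUsed) < idleAfter {
		return
	}
	if err := r.cmd.Process.Kill(); err != nil {
		log.Printf("failed to stop idle runner: %v", err)
	}
	_ = r.cmd.Wait()
}

func main() {
	p, err := start("sleep", "60") // placeholder binary standing in for the runner
	if err != nil {
		log.Fatal(err)
	}
	time.Sleep(2 * time.Second)
	p.reapIfIdle(1 * time.Second)
}
```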
-
Michael Yang authored
count each layer independently when deciding gpu offloading
-
Michael Yang authored
-
- 26 Mar, 2024 1 commit
-
-
Patrick Devine authored
-
- 09 Mar, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 07 Mar, 2024 1 commit
-
-
Daniel Hiltgen authored
This refines where we extract the LLM libraries to by adding a new OLLAMA_HOME env var, which defaults to `~/.ollama`. The logic was already idempotent, so this should speed up startups after the first time a new release is deployed. It also cleans up after itself.

We now build only a single ROCm version (latest major) on both windows and linux. Given the large size of ROCm's tensor files, we split the dependency out: it's bundled into the installer on windows, and a separate download on linux. The linux install script is now smart and detects the presence of AMD GPUs, checks whether ROCm v6 is already present, and if not, downloads our dependency tar file.

For Linux discovery, we now use sysfs and check each GPU against what ROCm supports so we can degrade to CPU gracefully instead of having llama.cpp+rocm assert/crash on us. For Windows, we now use Go's Windows dynamic library loading logic to access the amdhip64.dll APIs to query GPU information.
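A minimal sketch of the env-var lookup this introduces; only the `OLLAMA_HOME` variable and the `~/.ollama` default come from the commit, the helper name is invented.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ollamaHome returns the directory used for extracted LLM libraries:
// OLLAMA_HOME if set, otherwise ~/.ollama.
func ollamaHome() (string, error) {
	if dir := os.Getenv("OLLAMA_HOME"); dir != "" {
		return dir, nil
	}
	home, err := os.UserHomeDir()
	if err != nil {
		return "", err
	}
	return filepath.Join(home, ".ollama"), nil
}

func main() {
	dir, err := ollamaHome()
	if err != nil {
		panic(err)
	}
	fmt.Println("library dir:", dir)
}
```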
-
- 08 Feb, 2024 1 commit
-
-
Daniel Hiltgen authored
When we store our libraries in a temp dir, a reaper might clean them when we are idle, so make sure to check for them before we reload.
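A small sketch of the kind of pre-flight check described here, assuming a hypothetical list of expected library file names; if any are missing they would be re-extracted before loading.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// librariesPresent reports whether every expected shared library still
// exists under dir; a tmp reaper may have removed them while we were idle.
func librariesPresent(dir string, names []string) bool {
	for _, name := range names {
		if _, err := os.Stat(filepath.Join(dir, name)); errors.Is(err, fs.ErrNotExist) {
			return false
		}
	}
	return true
}

func main() {
	// Illustrative directory and file name, not the project's real payload layout.
	names := []string{"libext_server.so"}
	fmt.Println(librariesPresent("/tmp/ollama-libs", names))
}
```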
-
- 23 Jan, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 18 Jan, 2024 1 commit
-
-
Daniel Hiltgen authored
A few obvious levels were adjusted, but generally everything was mapped to the "info" level.
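For context, Go's standard `log/slog` package provides the levels involved; the messages below are purely illustrative, not the project's actual log lines.

```go
package main

import "log/slog"

func main() {
	// Most messages land at the default "info" level; only a few obviously
	// more or less important ones use other levels.
	slog.Info("starting server", "port", 11434)
	slog.Debug("library payload already extracted")
	slog.Warn("no GPU detected, falling back to CPU")
	slog.Error("failed to load model")
}
```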
-
- 12 Jan, 2024 1 commit
-
-
Michael Yang authored
-
- 11 Jan, 2024 7 commits
-
-
Daniel Hiltgen authored
The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.
-
Michael Yang authored
-
Daniel Hiltgen authored
This switches darwin to dynamic loading and refactors the code now that static linking of the library is no longer used on any platform.
-
Daniel Hiltgen authored
This reduces the built-in linux version to not use any vector extensions, which enables the resulting builds to run under Rosetta on macOS in Docker. At runtime it then checks for the actual CPU vector extensions and loads the best CPU library available.
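A sketch of runtime CPU-feature detection, assuming `golang.org/x/sys/cpu` (or something equivalent) is used; the variant names are placeholders.

```go
package main

import (
	"fmt"

	"golang.org/x/sys/cpu"
)

// bestCPUVariant picks the most capable CPU library variant supported by
// the host, falling back to a no-vector-extension build.
func bestCPUVariant() string {
	switch {
	case cpu.X86.HasAVX2:
		return "cpu_avx2"
	case cpu.X86.HasAVX:
		return "cpu_avx"
	default:
		return "cpu" // safe baseline, e.g. under Rosetta emulation
	}
}

func main() {
	fmt.Println("selected variant:", bestCPUVariant())
}
```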
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
In some cases we may want multiple variants for a given GPU type or CPU. This adds logic to have an optional Variant which we can use to select an optimal library, but also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.
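A hedged sketch of the try-each-variant-in-order idea; `loadLibrary` stands in for the real platform-specific dynamic loading code.

```go
package main

import (
	"errors"
	"fmt"
)

// loadLibrary is a hypothetical stand-in for the platform-specific
// dynamic loading code (dlopen / LoadLibrary).
func loadLibrary(path string) error {
	return errors.New("not implemented in this sketch")
}

// loadFirstWorkingVariant tries each candidate library in preference order
// and returns the first one that loads, so e.g. a ROCm v6 build can be
// attempted before falling back to ROCm v5 or CPU.
func loadFirstWorkingVariant(candidates []string) (string, error) {
	for _, path := range candidates {
		if err := loadLibrary(path); err == nil {
			return path, nil
		}
		fmt.Printf("variant %s failed to load, trying next\n", path)
	}
	return "", errors.New("no usable library variant found")
}

func main() {
	if lib, err := loadFirstWorkingVariant([]string{"rocm_v6", "rocm_v5", "cpu"}); err == nil {
		fmt.Println("loaded", lib)
	} else {
		fmt.Println(err)
	}
}
```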
-
Jeffrey Morgan authored
* increase minimum cuda overhead and fix minimum overhead for multi-gpu
* fix multi gpu overhead
* limit overhead to 10% of all gpus (see the sketch below)
* better wording
* allocate fixed amount before layers
* fixed only includes graph alloc
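A rough sketch of the overhead cap mentioned above ("limit overhead to 10% of all gpus"); the exact formula and the per-GPU minimum are assumptions, not the real values.

```go
package main

import "fmt"

// cudaOverhead reserves a fixed minimum amount of VRAM per GPU for CUDA
// bookkeeping, but caps the total reservation at 10% of the combined VRAM
// across all GPUs. The 384 MiB minimum is an illustrative value.
func cudaOverhead(gpuVRAM []uint64) uint64 {
	const minPerGPU = 384 * 1024 * 1024

	var total uint64
	for _, v := range gpuVRAM {
		total += v
	}

	overhead := minPerGPU * uint64(len(gpuVRAM))
	if limit := total / 10; overhead > limit {
		overhead = limit
	}
	return overhead
}

func main() {
	// Two 8 GiB GPUs: overhead is min(2*384 MiB, 1.6 GiB) = 768 MiB.
	fmt.Println(cudaOverhead([]uint64{8 << 30, 8 << 30}))
}
```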
-
- 09 Jan, 2024 4 commits
-
-
Michael Yang authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 08 Jan, 2024 1 commit
-
-
Jeffrey Morgan authored
* select layers based on estimated model memory usage (see the sketch after this list)
* always account for scratch vram
* don't load +1 layers
* better estimation for graph alloc
* Update gpu/gpu_darwin.go
  Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
  Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* Update llm/llm.go
* add overhead for cuda memory
* Update llm/llm.go
  Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
* fix build error on linux
* address comments

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
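A simplified sketch of the layer-selection idea from the first bullet: derive a per-layer size from the model size, reserve a fixed amount for graph/scratch allocations, then count how many layers fit. The numbers and helper are illustrative only, not ollama's actual estimator.

```go
package main

import "fmt"

// layersToOffload estimates how many transformer layers fit on the GPU:
// the per-layer size is derived from the total model size, and a fixed
// amount (graph/scratch allocation plus driver overhead) is set aside
// before counting layers.
func layersToOffload(freeVRAM, modelSize, graphAlloc uint64, layerCount int) int {
	perLayer := modelSize / uint64(layerCount)
	if freeVRAM <= graphAlloc || perLayer == 0 {
		return 0
	}
	usable := freeVRAM - graphAlloc
	n := int(usable / perLayer)
	if n > layerCount {
		n = layerCount
	}
	return n
}

func main() {
	// 8 GiB free, 13 GiB model with 40 layers, 1 GiB reserved up front:
	// roughly (8-1) GiB / 0.325 GiB per layer ≈ 21 layers offloaded.
	fmt.Println(layersToOffload(8<<30, 13<<30, 1<<30, 40))
}
```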
-
- 04 Jan, 2024 4 commits
-
-
Daniel Hiltgen authored
On linux, we link the CPU library into the Go app and fall back to it when no GPU match is found. On windows we do not link in the CPU library so that we can better control our dependencies for the CLI. This fixes the logic so we correctly fall back to the dynamic CPU library on windows.
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 20 Dec, 2023 1 commit
-
-
Daniel Hiltgen authored
This switches the default llama.cpp to be CPU-based and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6, given 5.7 builds don't work on the latest ROCm library that just shipped.
-
- 19 Dec, 2023 4 commits
-
-
Daniel Hiltgen authored
This allows the CPU-only builds to work on systems with Radeon cards.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.
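A minimal cgo example showing the general call pattern this relies on (Go calling into compiled C/C++ in-process); it is not the actual server.cpp binding.

```go
package main

/*
// In the real change this would be the llama.cpp server code; here it is
// just a trivial C function to show the cgo call pattern.
int add(int a, int b) { return a + b; }
*/
import "C"

import "fmt"

func main() {
	// Go code calls directly into the compiled C (or C++) code, so the
	// LLM abstractions on the Go side can stay unchanged.
	fmt.Println(int(C.add(2, 3)))
}
```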
-
Bruce MacDonald authored
- remove ggml runner
- automatically pull gguf models when ggml detected
- tell users to update to gguf in case the automatic pull fails

Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
-
- 05 Dec, 2023 3 commits
-
-
Michael Yang authored
-
Bruce MacDonald authored
-
Jeffrey Morgan authored
This reverts commit 7a0899d6.
-
- 04 Dec, 2023 1 commit
-
-
Bruce MacDonald authored
- update chat docs
- add messages chat endpoint
- remove deprecated context and template generate parameters from docs
  - context and template are still supported for the time being and will continue to work as expected
- add partial response to chat history
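A small Go sketch of calling the messages-based chat endpoint added here; the endpoint path reflects how the API is commonly exposed, and the host, model name, and prompt are placeholders.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Minimal request against the messages-based chat endpoint.
	body, _ := json.Marshal(map[string]any{
		"model": "llama2",
		"messages": []map[string]string{
			{"role": "user", "content": "Why is the sky blue?"},
		},
	})

	resp, err := http.Post("http://localhost:11434/api/chat", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```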
-
- 20 Nov, 2023 1 commit
-
-
Michael Yang authored
-
- 10 Nov, 2023 1 commit
-
-
Jeffrey Morgan authored
* add `"format": "json"` as an API parameter

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
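A short sketch of passing the new parameter to the generate endpoint; the host, model, prompt, and `stream` setting are placeholders.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net/http"
)

func main() {
	// Ask the generate endpoint to constrain its output to valid JSON by
	// passing the "format": "json" parameter.
	body := []byte(`{"model":"llama2","prompt":"List three colors as JSON","format":"json","stream":false}`)

	resp, err := http.Post("http://localhost:11434/api/generate", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	out, _ := io.ReadAll(resp.Body)
	fmt.Println(string(out))
}
```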
-
- 02 Nov, 2023 1 commit
-
-
Jeffrey Morgan authored
-
- 19 Oct, 2023 2 commits
-
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
add error for falcon and starcoder vocab compatibility

---------

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-