Commits · f4bf1d514f537af9166f72fa00feda04556fc3d5 · OpenDAS / ollama

14 Jan, 2024 1 commit
- Let gpu.go and gen_linux.sh also find CUDA on Arch Linux · f4bf1d51
  Alexander F. Rødseth authored Jan 14, 2024
  
  f4bf1d51
11 Jan, 2024 3 commits

Build multiple CPU variants and pick the best · d88c527b

Daniel Hiltgen authored Jan 07, 2024

This reduces the built-in linux version to not use any vector extensions
which enables the resulting builds to run under Rosetta on MacOS in
Docker. Then at runtime it checks for the actual CPU vector
extensions and loads the best CPU library available

d88c527b

Support multiple variants for a given llm lib type · 8da7bef0

Daniel Hiltgen authored Jan 05, 2024

In some cases we may want multiple variants for a given GPU type or CPU.
This adds logic to have an optional Variant which we can use to select
an optimal library, but also allows us to try multiple variants in case
some fail to load.

This can be useful for scenarios such as ROCm v5 vs v6 incompatibility
or potentially CPU features.

8da7bef0

Increase minimum CUDA memory allocation overhead and fix minimum overhead for multi-gpu (#1896) · b24e8d17

Jeffrey Morgan authored Jan 10, 2024

* increase minimum cuda overhead and fix minimum overhead for multi-gpu

* fix multi gpu overhead

* limit overhead to 10% of all gpus

* better wording

* allocate fixed amount before layers

* fixed only includes graph alloc

b24e8d17

10 Jan, 2024 1 commit

Harden GPU mgmt library lookup · 3c49c3ab

Daniel Hiltgen authored Jan 10, 2024

When there are multiple management libraries installed on a system
not every one will be compatible with the current driver. This change
improves our management library algorithm to build up a set of discovered
libraries based on glob patterns, and then try all of them until we're able to
load one without error.

3c49c3ab

09 Jan, 2024 10 commits
- calculate overhead based number of gpu devices (#1875) · c336693f
  Jeffrey Morgan authored Jan 09, 2024
  
  c336693f
- Set corret CUDA minimum compute capability version · 1961a81f
  Daniel Hiltgen authored Jan 09, 2024
```
If you attempt to run the current CUDA build on compute capability 5.2
cards, you'll hit the following failure:
cuBLAS error 15 at ggml-cuda.cu:7956: the requested functionality is not supported
```
  1961a81f
- update rough cuda overhead estimate to 15% + 384MiB · 6df83e6d
  Jeffrey Morgan authored Jan 09, 2024
  
  6df83e6d
- revert cuda overhead to 20% · 6164f378
  Jeffrey Morgan authored Jan 09, 2024
  
  6164f378
- add `TODO` for cuda overhead · 6566387a
  Jeffrey Morgan authored Jan 09, 2024
  
  6566387a
- update cuda overhead to 20% to fix crashes when switching between models and large context sizes · 37708931
  Jeffrey Morgan authored Jan 09, 2024
  
  37708931
- update cuda overhead to 15% or 400MiB · f6cb0a55
  Jeffrey Morgan authored Jan 08, 2024
  
  f6cb0a55
- fix build on linux · 2680078c
  Jeffrey Morgan authored Jan 08, 2024
  
  2680078c
- update overhead to 15% · f1b7e5f5
  Jeffrey Morgan authored Jan 08, 2024
  
  f1b7e5f5
- use 10% vram overhead for cuda · cb534e6a
  Jeffrey Morgan authored Jan 08, 2024
  
  cb534e6a
08 Jan, 2024 1 commit

Offload layers to GPU based on new model size estimates (#1850) · 08f1e189

Jeffrey Morgan authored Jan 08, 2024



* select layers based on estimated model memory usage

* always account for scratch vram

* dont load +1 layers

* better estmation for graph alloc

* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

* add overhead for cuda memory

* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* fix build error on linux

* address comments

---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

08f1e189

07 Jan, 2024 1 commit

Detect very old CUDA GPUs and fall back to CPU · d74ce6bd

Daniel Hiltgen authored Jan 06, 2024

If we try to load the CUDA library on an old GPU, it panics and crashes
the server. This checks the compute capability before we load the
library so we can gracefully fall back to CPU mode.

d74ce6bd

03 Jan, 2024 1 commit

Fix windows system memory lookup · a2ad9524

Daniel Hiltgen authored Dec 22, 2023

This refines the gpu package error handling and fixes a bug with the
system memory lookup on windows.

a2ad9524

02 Jan, 2024 1 commit

Switch windows build to fully dynamic · d966b730

Daniel Hiltgen authored Dec 23, 2023

Refactor where we store build outputs, and support a fully dynamic loading
model on windows so the base executable has no special dependencies thus
doesn't require a special PATH.

d966b730

20 Dec, 2023 1 commit

Revamp the dynamic library shim · 7555ea44

Daniel Hiltgen authored Dec 20, 2023

This switches the default llama.cpp to be CPU based, and builds the GPU variants
as dynamically loaded libraries which we can select at runtime.

This also bumps the ROCm library to version 6 given 5.7 builds don't work
on the latest ROCm library that just shipped.

7555ea44

19 Dec, 2023 2 commits
- Refine build to support CPU only · 1b991d0b
  Daniel Hiltgen authored Dec 13, 2023
```
If someone checks out the ollama repo and doesn't install the CUDA
library, this will ensure they can build a CPU only version
```
  1b991d0b
- Adapted rocm support to cgo based llama.cpp · 35934b2e
  Daniel Hiltgen authored Nov 29, 2023
  
  35934b2e