Commits · 6fd04ca922e5da7ef8c52d86118fc58b798a7e4a · OpenDAS / ollama

14 Jun, 2024 3 commits

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete: our prediction needs some refinement to account for the
free space on each discrete GPU, so we can see how many layers fit on each
one. Since we can't split one layer across multiple GPUs, we can't treat free
space as one logical block.

6fd04ca9
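
The constraint described in this commit message can be illustrated with a small Go sketch. It is not the code from the commit: `layersPerGPU`, the uniform layer size, and the sample numbers are all assumptions made for illustration. It only shows why pooled free space overestimates what fits when a layer cannot span devices.

```go
package main

import "fmt"

// layersPerGPU is a hypothetical helper (not from the ollama source).
// Given the free bytes on each GPU and an assumed uniform layer size, it
// returns how many whole layers can be placed on each device.
func layersPerGPU(freeBytes []uint64, layerSize uint64, totalLayers int) []int {
	counts := make([]int, len(freeBytes))
	if layerSize == 0 {
		return counts
	}
	remaining := totalLayers
	for i, free := range freeBytes {
		// Integer division keeps whole layers only: a layer cannot span GPUs.
		fit := int(free / layerSize)
		if fit > remaining {
			fit = remaining
		}
		counts[i] = fit
		remaining -= fit
	}
	return counts
}

func main() {
	// Two GPUs with 5 and 7 units free, and layers of 4 units each.
	// Pooling the free space (12 units) would suggest 3 layers,
	// but only 1 whole layer fits on each GPU.
	fmt.Println(layersPerGPU([]uint64{5, 7}, 4, 10))
}
```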

Refine GPU discovery to bootstrap once · 43ed358f

Daniel Hiltgen authored May 15, 2024

Now that we call the GPU discovery routines many times to
update memory, this splits initial discovery from free memory
updating.

43ed358f
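
One common way to split one-time discovery from repeated free-memory updates in Go is `sync.Once`. The sketch below assumes that pattern; the commit message does not say how the split is implemented, and `gpuInfo`, `bootstrap`, `queryFree`, and `getGPUs` are hypothetical names with placeholder values.

```go
package main

import (
	"fmt"
	"sync"
)

// gpuInfo is a hypothetical record of what discovery learns about a device.
type gpuInfo struct {
	ID         string
	TotalBytes uint64
	FreeBytes  uint64
}

var (
	bootstrapOnce sync.Once
	gpus          []gpuInfo
	bootstrapRuns int // counts how often the expensive path actually ran
)

// bootstrap stands in for the expensive initial discovery (library loading,
// device enumeration). The values here are placeholders.
func bootstrap() {
	bootstrapRuns++
	gpus = []gpuInfo{{ID: "gpu0", TotalBytes: 8 << 30}}
}

// queryFree stands in for a cheap per-device free-memory query.
func queryFree(id string) uint64 { return 6 << 30 }

// getGPUs runs discovery at most once, then refreshes free memory on every
// call. The refresh loop itself is not goroutine-safe in this sketch.
func getGPUs() []gpuInfo {
	bootstrapOnce.Do(bootstrap)
	for i := range gpus {
		gpus[i].FreeBytes = queryFree(gpus[i].ID)
	}
	return gpus
}

func main() {
	getGPUs()
	getGPUs()
	fmt.Println("bootstrap runs:", bootstrapRuns)
}
```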

Revert "Limit GPU lib search for now (#4777)" · efac4886
Daniel Hiltgen authored Jun 03, 2024
```
This reverts commit 476fb8e8.
```
efac4886

13 Jun, 2024 1 commit
- Actually skip PhysX on windows · aac36763
  Daniel Hiltgen authored Jun 13, 2024
  
  aac36763
04 Jun, 2024 1 commit
- lint linux · bf7edb0d
  Michael Yang authored May 22, 2024
  
  bf7edb0d
02 Jun, 2024 1 commit
- Limit GPU lib search for now (#4777) · 476fb8e8
  Jeffrey Morgan authored Jun 01, 2024
```
* fix oneapi errors on windows 10
```
  476fb8e8
24 May, 2024 2 commits
- Move envconfig and consolidate env vars (#4608) · 4cc3be30
  Patrick Devine authored May 24, 2024
  
  4cc3be30
- support ollama run on Intel GPUs · fd5971be
  Wang,Zhe authored May 24, 2024
  
  fd5971be
10 May, 2024 1 commit

Bump VRAM buffer back up · 30a7d709

Daniel Hiltgen authored May 10, 2024

Under stress scenarios we're seeing OOMs so this should help stabilize
the allocations under heavy concurrency stress.

30a7d709

09 May, 2024 1 commit

Record more GPU information · 8727a9c1

Daniel Hiltgen authored May 07, 2024

This cleans up the logging for GPU discovery a bit, and can
serve as a foundation to report GPU information in a future UX.

8727a9c1

07 May, 2024 1 commit
- llm: add minimum based on layer size · 4736391b
  Michael Yang authored May 06, 2024
  
  4736391b
06 May, 2024 1 commit

Use our libraries first · 380378cc

Daniel Hiltgen authored May 05, 2024

Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.

380378cc

05 May, 2024 1 commit

Centralize server config handling · f56aa200

Daniel Hiltgen authored May 04, 2024

This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help in troubleshooting user server logs.

f56aa200

03 May, 2024 1 commit
- Skip PhysX cudart library · b1ad3a43
  Daniel Hiltgen authored May 03, 2024
```
For some reason this library gives incorrect GPU information, so skip it
```
  b1ad3a43
01 May, 2024 1 commit

Add CUDA Driver API for GPU discovery · 089daaea

Daniel Hiltgen authored Apr 30, 2024

We're seeing some corner cases with cudart which might be resolved by
switching to the driver API which comes bundled with the driver package

089daaea

23 Apr, 2024 1 commit

Request and model concurrency · 34b9db5a

Daniel Hiltgen authored Mar 30, 2024

This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.

34b9db5a
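
A minimal sketch of reading these two settings with a default of 1, as the message describes. Only the variable names `OLLAMA_NUM_PARALLEL` and `OLLAMA_MAX_LOADED_MODELS` and the default of 1 come from the commit message; `parsePositive` and the choice to fall back to the default on invalid input are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// parsePositive is a hypothetical helper: it returns the value of raw as a
// positive integer, or def when raw is empty, malformed, or not positive.
func parsePositive(raw string, def int) int {
	v, err := strconv.Atoi(raw)
	if err != nil || v <= 0 {
		return def
	}
	return v
}

func main() {
	// Both settings default to 1, matching the commit message.
	numParallel := parsePositive(os.Getenv("OLLAMA_NUM_PARALLEL"), 1)
	maxLoaded := parsePositive(os.Getenv("OLLAMA_MAX_LOADED_MODELS"), 1)
	fmt.Printf("parallel requests per model: %d, max loaded models: %d\n",
		numParallel, maxLoaded)
}
```

Running the program with `OLLAMA_NUM_PARALLEL=4` set in the environment would report 4 parallel requests per model while leaving the loaded-model limit at 1.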

10 Apr, 2024 1 commit
- partial offloading · 7e33a017
  Michael Yang authored Apr 05, 2024
  
  7e33a017
01 Apr, 2024 3 commits
- Refined min memory from testing · 1f11b525
  Daniel Hiltgen authored Apr 01, 2024
  
  1f11b525
- Release gpu discovery library after use · 526d4eb2
  Daniel Hiltgen authored Mar 30, 2024
```
Leaving the cudart library loaded kept ~30m of memory
pinned in the GPU in the main process.  This change ensures
we don't hold GPU resources when idle.
```
  526d4eb2
- update memory calculations · 91b3e4d2
  Michael Yang authored Mar 18, 2024
```
count each layer independently when deciding gpu offloading
```
  91b3e4d2
25 Mar, 2024 1 commit
- add support for libcudart.so for CUDA devices (adds Jetson support) · dfc6721b
  Jeremy authored Mar 25, 2024
  
  dfc6721b
07 Mar, 2024 2 commits

Revamp ROCm support · 6c5ccb11

Daniel Hiltgen authored Feb 15, 2024

This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var that defaults to `~/.ollama`. The logic was already
idempotent, so this should speed up startups after the first time a
new release is deployed. It also cleans up after itself.

We now build only a single ROCm version (latest major) on both windows
and linux. Given the large size of ROCm's tensor files, we split the
dependency out. It's bundled into the installer on windows, and a
separate download on linux. The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.

For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.

6c5ccb11

Allow setting max vram for workarounds · be330174

Daniel Hiltgen authored Mar 06, 2024

Until we get all the memory calculations correct, this can provide
an escape valve for users to work around out of memory crashes.

be330174

17 Feb, 2024 1 commit
- Harden AMD driver lookup logic · 9754c6d9
  Daniel Hiltgen authored Feb 16, 2024
```
It looks like the version file doesn't exist on older(?) drivers
```
  9754c6d9
12 Feb, 2024 1 commit

Detect AMD GPU info via sysfs and block old cards · 6d84f075

Daniel Hiltgen authored Feb 11, 2024

This wires up some new logic to start using sysfs to discover AMD GPU
information and detects old cards we can't yet support so we can fall back to CPU mode.

6d84f075

28 Jan, 2024 2 commits

Don't disable GPUs on arm without AVX · 15562e88
Daniel Hiltgen authored Jan 28, 2024
```
AVX is an x86 feature, so ARM should be excluded from
the check.
```
15562e88

Harden for zero detected GPUs · f07f8b7a

Daniel Hiltgen authored Jan 28, 2024

At least with the ROCm libraries, it's possible to have the library
present with zero GPUs. This fix avoids a divide by zero bug in llm.go
when we try to calculate GPU memory with zero GPUs.

f07f8b7a
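
The kind of guard this fix describes can be sketched as follows. This is not the code from llm.go: `perGPUMemory` and its even-split calculation are assumptions chosen to show the zero-device check.

```go
package main

import "fmt"

// perGPUMemory is a hypothetical helper that averages total memory across
// the detected devices. It returns 0 when no GPUs were found instead of
// dividing by zero, which would panic at runtime in Go.
func perGPUMemory(totalBytes uint64, gpuCount int) uint64 {
	if gpuCount <= 0 {
		return 0
	}
	return totalBytes / uint64(gpuCount)
}

func main() {
	// Library present but no devices detected: report zero rather than crash.
	fmt.Println(perGPUMemory(0, 0))
	// Two devices sharing 16 GiB in total.
	fmt.Println(perGPUMemory(16<<30, 2))
}
```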

26 Jan, 2024 2 commits

Detect lack of AVX and fall back to CPU mode · 667a2ba1

Daniel Hiltgen authored Jan 26, 2024

We build the GPU libraries with AVX enabled to ensure that if not all
layers fit on the GPU we get better performance in a mixed mode.
If the user is using a virtualization/emulation system that lacks AVX
this used to result in an illegal instruction error and crash before this
fix. Now we will report a warning in the server log, and just use
CPU mode to ensure we don't crash.

667a2ba1
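
The decision described here, together with the ARM exclusion from 15562e88 listed above, can be sketched as a small predicate. `gpuAllowed` is a hypothetical name, and the `hasAVX` value in `main` is a placeholder: real detection would query the CPU's feature flags, which this sketch does not do.

```go
package main

import (
	"fmt"
	"runtime"
)

// gpuAllowed is a hypothetical predicate: GPU libraries built with AVX are
// only usable on x86 when the CPU reports AVX. AVX is an x86 feature, so
// other architectures (such as arm64) skip the check entirely.
func gpuAllowed(arch string, hasAVX bool) bool {
	if arch != "amd64" && arch != "386" {
		return true
	}
	return hasAVX
}

func main() {
	hasAVX := false // placeholder; assume detection reported no AVX
	if !gpuAllowed(runtime.GOARCH, hasAVX) {
		fmt.Println("warning: CPU lacks AVX, falling back to CPU mode")
		return
	}
	fmt.Println("GPU mode permitted")
}
```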

Ignore AMD integrated GPUs · 9d7b5d6c
Daniel Hiltgen authored Jan 25, 2024
```
Detect and ignore integrated GPUs reported by rocm.
```
9d7b5d6c

24 Jan, 2024 1 commit

More logging for gpu management · 013fd071

Daniel Hiltgen authored Jan 24, 2024

Fix an ordering glitch of dlerr/dlclose and add more logging to help
root cause some crashes users are hitting. This also refines the
function pointer names to use the underlying function names instead
of simplified names for readability.

013fd071

23 Jan, 2024 1 commit

Report more information about GPUs in verbose mode · 987c16b2

Daniel Hiltgen authored Jan 22, 2024

This adds additional calls to both CUDA and ROCm management libraries to
discover additional attributes about the GPU(s) detected in the system, and
wires up runtime verbosity selection. When users hit problems with GPUs we can
ask them to run with `OLLAMA_DEBUG=1 ollama serve` and share the results.

987c16b2

20 Jan, 2024 3 commits
- Add compute capability 5.0, 7.5, and 8.0 · a447a083
  Daniel Hiltgen authored Jan 20, 2024
  
  a447a083
- increase minimum overhead to 1024MiB (#2114) · f32ea81b
  Jeffrey Morgan authored Jan 20, 2024
  
  f32ea81b
- Add support for CUDA 5.2 cards · 681a9149
  Daniel Hiltgen authored Jan 20, 2024
  
  681a9149
19 Jan, 2024 2 commits

More WSL paths · 552db98b
Daniel Hiltgen authored Jan 19, 2024

552db98b

Fix CPU-only build under Android Termux environment. · eb76f3e3

Self Denial authored Jan 15, 2024

Update gpu.go initGPUHandles() to declare gpuHandles variable before
reading it. This resolves an "invalid memory address or nil pointer
dereference" error.

Update dyn_ext_server.c to avoid setting the RTLD_DEEPBIND flag under
__TERMUX__ (Android).

eb76f3e3

18 Jan, 2024 1 commit
- Mechanical switch from log to slog · fedd705a
  Daniel Hiltgen authored Jan 18, 2024
```
A few obvious levels were adjusted, but generally everything mapped to "info" level.
```
  fedd705a
14 Jan, 2024 1 commit
- Let gpu.go and gen_linux.sh also find CUDA on Arch Linux · f4bf1d51
  Alexander F. Rødseth authored Jan 14, 2024
  
  f4bf1d51
11 Jan, 2024 2 commits

Build multiple CPU variants and pick the best · d88c527b

Daniel Hiltgen authored Jan 07, 2024

This reduces the built-in linux version to not use any vector extensions,
which enables the resulting builds to run under Rosetta on macOS in
Docker. Then at runtime it checks for the actual CPU vector
extensions and loads the best CPU library available.

d88c527b
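
A sketch of the runtime selection step, assuming three builds ordered from most to least capable. The variant names `cpu_avx2`, `cpu_avx`, and `cpu` are illustrative and not taken from the commit, and the feature flags are passed in as plain booleans rather than detected.

```go
package main

import "fmt"

// bestCPUVariant is a hypothetical selector: it prefers the build using the
// widest vector extension the CPU supports and falls back to the baseline
// build that uses no vector extensions.
func bestCPUVariant(hasAVX, hasAVX2 bool) string {
	switch {
	case hasAVX2:
		return "cpu_avx2"
	case hasAVX:
		return "cpu_avx"
	default:
		return "cpu"
	}
}

func main() {
	// Under emulation with no vector extensions, the baseline build is chosen.
	fmt.Println(bestCPUVariant(false, false))
	// On a CPU with AVX but not AVX2, the AVX build is chosen.
	fmt.Println(bestCPUVariant(true, false))
}
```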

Support multiple variants for a given llm lib type · 8da7bef0

Daniel Hiltgen authored Jan 05, 2024

In some cases we may want multiple variants for a given GPU type or CPU.
This adds logic to have an optional Variant which we can use to select
an optimal library, but also allows us to try multiple variants in case
some fail to load.

This can be useful for scenarios such as ROCm v5 vs v6 incompatibility
or potentially CPU features.

8da7bef0
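
The "try multiple variants in case some fail to load" behaviour can be sketched as an ordered fallback loop. `loadFirst`, `errLoadFailed`, and the variant names are hypothetical; the loader is passed in as a function so the sketch stays independent of any real library-loading API.

```go
package main

import (
	"errors"
	"fmt"
)

// errLoadFailed is a hypothetical sentinel used by the fake loader below.
var errLoadFailed = errors.New("library failed to load")

// loadFirst tries each variant in preference order and returns the name of
// the first one that loads. If none load, it returns the last error seen.
func loadFirst(variants []string, load func(string) error) (string, error) {
	lastErr := errors.New("no variants provided")
	for _, v := range variants {
		if err := load(v); err != nil {
			lastErr = fmt.Errorf("%s: %w", v, err)
			continue
		}
		return v, nil
	}
	return "", lastErr
}

func main() {
	// Simulate a ROCm v6 library that is incompatible with the installed
	// driver, so the next variant in the list is used instead.
	fakeLoad := func(variant string) error {
		if variant == "rocm_v6" {
			return errLoadFailed
		}
		return nil
	}
	name, err := loadFirst([]string{"rocm_v6", "rocm_v5", "cpu"}, fakeLoad)
	fmt.Println(name, err)
}
```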