Commits · 34b9db5afc43b352c5ef04fe6ef52684bfdd57b5 · OpenDAS / ollama

23 Apr, 2024 1 commit

Request and model concurrency · 34b9db5a

Daniel Hiltgen authored Mar 30, 2024

This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.

34b9db5a

16 Apr, 2024 2 commits
- scale graph based on gpu count · 26df6747
  Michael Yang authored Apr 16, 2024
  
  26df6747
- darwin: no partial offloading if required memory greater than system · 41a272de
  Michael Yang authored Apr 16, 2024
  
  41a272de
10 Apr, 2024 1 commit
- partial offloading · 7e33a017
  Michael Yang authored Apr 05, 2024
  
  7e33a017
01 Apr, 2024 6 commits

Refined min memory from testing · 1f11b525
Daniel Hiltgen authored Apr 01, 2024

1f11b525

Release gpu discovery library after use · 526d4eb2

Daniel Hiltgen authored Mar 30, 2024

Leaving the cudart library loaded kept ~30m of memory
pinned in the GPU in the main process.  This change ensures
we don't hold GPU resources when idle.

526d4eb2

Safeguard for noexec · 0a74cb31

Daniel Hiltgen authored Mar 28, 2024

We may have users that run into problems with our current
payload model, so this gives us an escape valve.

0a74cb31

Detect too-old cuda driver · 10ed1b62
Daniel Hiltgen authored Mar 28, 2024

"cudart init failure: 35" isn't particularly helpful in the logs.

10ed1b62

Switch back to subprocessing for llama.cpp · 58d95cc9

Daniel Hiltgen authored Mar 14, 2024

This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.

58d95cc9

update memory calculations · 91b3e4d2
Michael Yang authored Mar 18, 2024

count each layer independently when deciding gpu offloading

91b3e4d2
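
One way to picture counting each layer independently is to charge every layer its own size against the available GPU memory, instead of dividing by an average layer size. The function name `layersThatFit` and the byte figures below are hypothetical; this is a minimal sketch, not the calculation the commit actually implements.

```go
package main

import "fmt"

// layersThatFit walks the layers in order and counts how many can be placed
// in availableBytes, charging each layer its own size rather than an average.
// (Hypothetical helper for illustration only.)
func layersThatFit(layerSizes []uint64, availableBytes uint64) int {
	var used uint64
	count := 0
	for _, size := range layerSizes {
		// used never exceeds availableBytes, so the subtraction cannot wrap.
		if size > availableBytes-used {
			break
		}
		used += size
		count++
	}
	return count
}

func main() {
	// Made-up layer sizes: the third layer is larger than the first two.
	layers := []uint64{100, 200, 400}
	fmt.Println("layers offloaded:", layersThatFit(layers, 350))
}
```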

28 Mar, 2024 1 commit
- Update troubleshooting link · f31f2bed
  Michael Yang authored Mar 28, 2024
  
  f31f2bed
25 Mar, 2024 1 commit
- add support for libcudart.so for CUDA devices (adds Jetson support) · dfc6721b
  Jeremy authored Mar 25, 2024
  
  dfc6721b
20 Mar, 2024 1 commit

Better tmpdir cleanup · 74788b48

Daniel Hiltgen authored Mar 13, 2024

If expanding the runners fails, don't leave a corrupt/incomplete payloads dir.
We now write a pid file out to the tmpdir, which allows us to scan for stale tmpdirs
and remove them as long as their process is no longer running.

74788b48

12 Mar, 2024 2 commits

Fix iGPU detection for linux · 82b0c7c2

Daniel Hiltgen authored Mar 12, 2024

This fixes a few bugs in the new sysfs discovery logic. iGPUs are now
correctly identified by reporting less than 1G of VRAM. The sysfs IDs are off
by one compared to what HIP wants because the CPU is reported
in amdgpu, but HIP only cares about GPUs.

82b0c7c2
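
The two fixes above can be pictured with a small sketch: CPU entries are skipped so that GPU indices count GPUs only, and any GPU reporting less than 1 GiB of VRAM is flagged as integrated. The `node` and `gpu` types, the `classify` function, and the assumption that the HIP index is simply the position among non-CPU entries are all hypothetical; the actual discovery code may differ.

```go
package main

import "fmt"

const gib = 1 << 30

// node is a hypothetical view of one sysfs entry.
type node struct {
	SysfsID   int
	IsCPU     bool
	VRAMBytes uint64
}

// gpu is the hypothetical result after filtering and re-indexing.
type gpu struct {
	SysfsID    int
	HIPID      int  // assumed: position among GPU-only entries
	Integrated bool // assumed: less than 1 GiB of VRAM means iGPU
}

// classify skips CPU entries, numbers the remaining GPUs from zero, and
// flags GPUs with less than 1 GiB of VRAM as integrated.
func classify(nodes []node) []gpu {
	var gpus []gpu
	for _, n := range nodes {
		if n.IsCPU {
			continue
		}
		gpus = append(gpus, gpu{
			SysfsID:    n.SysfsID,
			HIPID:      len(gpus),
			Integrated: n.VRAMBytes < gib,
		})
	}
	return gpus
}

func main() {
	// Made-up topology: one CPU entry, one iGPU, one discrete GPU.
	nodes := []node{
		{SysfsID: 0, IsCPU: true},
		{SysfsID: 1, VRAMBytes: 512 << 20},
		{SysfsID: 2, VRAMBytes: 16 * gib},
	}
	for _, g := range classify(nodes) {
		fmt.Printf("sysfs=%d hip=%d integrated=%v\n", g.SysfsID, g.HIPID, g.Integrated)
	}
}
```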

fix gpu_info_cuda.c compile warning (#3077) · 51578d85
mofanke authored Mar 13, 2024

51578d85

11 Mar, 2024 1 commit

Avoid rocm runner and dependency clash · bc13da2b

Daniel Hiltgen authored Mar 11, 2024

Putting the rocm symlink next to the runners is risky.  This moves
the payloads into a subdir to avoid potential clashes.

bc13da2b

10 Mar, 2024 1 commit

Add ollama executable peer dir for rocm · 00ec2693

Daniel Hiltgen authored Mar 10, 2024

This allows people who package up ollama on their own to place
the rocm dependencies in a peer directory to the ollama executable
much like our windows install flow.

00ec2693

09 Mar, 2024 2 commits

tidy cleanup logs · 0bd0f4a2
Jeffrey Morgan authored Mar 09, 2024

0bd0f4a2

Finish unwinding idempotent payload logic · 4a5c9b80

Daniel Hiltgen authored Mar 08, 2024

The recent ROCm change partially removed idempotent
payloads, but the ggml-metal.metal file for mac was still
idempotent.  This finishes switching to always extract
the payloads, and now that idempotency is gone, the
version directory is no longer useful.

4a5c9b80

07 Mar, 2024 2 commits

Revamp ROCm support · 6c5ccb11

Daniel Hiltgen authored Feb 15, 2024

This refines where we extract the LLM libraries to by adding a new
OLLAMA_HOME env var, that defaults to `~/.ollama`. The logic was already
idempotent, so this should speed up startups after the first time a
new release is deployed. It also cleans up after itself.

We now build only a single ROCm version (latest major) on both windows
and linux. Given the large size of ROCm's tensor files, we split the
dependency out. It's bundled into the installer on windows, and a
separate download on linux. The linux install script is now smart and
detects the presence of AMD GPUs and looks to see if rocm v6 is already
present, and if not, then downloads our dependency tar file.

For Linux discovery, we now use sysfs and check each GPU against what
ROCm supports so we can degrade to CPU gracefully instead of having
llama.cpp+rocm assert/crash on us. For Windows, we now use go's windows
dynamic library loading logic to access the amdhip64.dll APIs to query
the GPU information.

6c5ccb11

Allow setting max vram for workarounds · be330174

Daniel Hiltgen authored Mar 06, 2024

Until we get all the memory calculations correct, this can provide
an escape valve for users to work around out-of-memory crashes.

be330174

29 Feb, 2024 1 commit
- fix: print usedMemory size right (#2827) · fa2f2b35
  tylinux authored Mar 01, 2024
  
  fa2f2b35
25 Feb, 2024 1 commit

Determine max VRAM on macOS using `recommendedMaxWorkingSetSize` (#2354) · a189810d

peanut256 authored Feb 26, 2024

* read iogpu.wired_limit_mb on macOS

Fix for https://github.com/ollama/ollama/issues/1826

* improved determination of available vram on macOS

read the recommended maximal vram on macOS via Metal API

* Removed macOS-specific logging

* Remove logging from gpu_darwin.go

* release Core Foundation object

fixes a possible memory leak

a189810d

17 Feb, 2024 1 commit
- Harden AMD driver lookup logic · 9754c6d9
  Daniel Hiltgen authored Feb 16, 2024
  
  It looks like the version file doesn't exist on older(?) drivers
  
  9754c6d9
12 Feb, 2024 1 commit

Detect AMD GPU info via sysfs and block old cards · 6d84f075

Daniel Hiltgen authored Feb 11, 2024

This wires up some new logic to start using sysfs to discover AMD GPU
information and detects old cards we can't yet support so we can fall back to CPU mode.

6d84f075

28 Jan, 2024 2 commits

Don't disable GPUs on arm without AVX · 15562e88
Daniel Hiltgen authored Jan 28, 2024

AVX is an x86 feature, so ARM should be excluded from
the check.

15562e88

Harden for zero detected GPUs · f07f8b7a

Daniel Hiltgen authored Jan 28, 2024

At least with the ROCm libraries, it's possible to have the library
present with zero GPUs. This fix avoids a divide by zero bug in llm.go
when we try to calculate GPU memory with zero GPUs.

f07f8b7a
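
The kind of guard described above can be sketched in a few lines. The function name `perGPUMemory` and its signature are hypothetical and only illustrate refusing to divide when no GPUs were detected.

```go
package main

import "fmt"

// perGPUMemory splits totalBytes evenly across gpuCount devices. It returns
// false instead of dividing when no GPUs were detected.
// (Hypothetical helper for illustration only.)
func perGPUMemory(totalBytes uint64, gpuCount int) (uint64, bool) {
	if gpuCount <= 0 {
		return 0, false
	}
	return totalBytes / uint64(gpuCount), true
}

func main() {
	if mem, ok := perGPUMemory(16<<30, 0); !ok {
		fmt.Println("no GPUs detected, falling back to CPU")
	} else {
		fmt.Println("bytes per GPU:", mem)
	}
}
```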

27 Jan, 2024 1 commit
- Update gpu_info_rocm.c · 59d87127
  Jagadish Krishnamoorthy authored Jan 26, 2024
  
  59d87127
26 Jan, 2024 3 commits

Detect lack of AVX and fallback to CPU mode · 667a2ba1

Daniel Hiltgen authored Jan 26, 2024

We build the GPU libraries with AVX enabled to ensure that if not all
layers fit on the GPU we get better performance in a mixed mode.
If the user is using a virtualization/emulation system that lacks AVX
this used to result in an illegal instruction error and crash before this
fix. Now we will report a warning in the server log, and just use
CPU mode to ensure we don't crash.

667a2ba1
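
The decision described above reduces to a small predicate. In the sketch below, the function name `gpuUsable` is hypothetical and the AVX flag is passed in rather than probed, since the real CPU feature detection is outside the scope of this example. Because AVX is an x86 feature, other architectures skip the check.

```go
package main

import (
	"fmt"
	"runtime"
)

// gpuUsable reports whether GPU libraries built with AVX can be loaded
// safely. AVX only exists on x86, so other architectures skip the check.
// (Hypothetical helper; hasAVX would come from a real CPU feature probe.)
func gpuUsable(goarch string, hasAVX bool) bool {
	if goarch != "amd64" && goarch != "386" {
		return true
	}
	return hasAVX
}

func main() {
	// Pretend the probe found no AVX, as in some virtualized environments.
	hasAVX := false
	if !gpuUsable(runtime.GOARCH, hasAVX) {
		fmt.Println("warning: CPU lacks AVX, using CPU mode")
		return
	}
	fmt.Println("GPU mode allowed")
}
```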

Ignore AMD integrated GPUs · 9d7b5d6c
Daniel Hiltgen authored Jan 25, 2024

Detect and ignore integrated GPUs reported by rocm.

9d7b5d6c

Fix crash on cuda ml init failure · 5d9c4a5f
Daniel Hiltgen authored Jan 26, 2024

The new driver lookup code was triggering after init failure due to a missing return

5d9c4a5f

24 Jan, 2024 1 commit

More logging for gpu management · 013fd071

Daniel Hiltgen authored Jan 24, 2024

Fix an ordering glitch of dlerr/dlclose and add more logging to help
root cause some crashes users are hitting. This also refines the
function pointer names to use the underlying function names instead
of simplified names for readability.

013fd071

23 Jan, 2024 1 commit

Report more information about GPUs in verbose mode · 987c16b2

Daniel Hiltgen authored Jan 22, 2024

This adds additional calls to both CUDA and ROCm management libraries to
discover additional attributes about the GPU(s) detected in the system, and
wires up runtime verbosity selection. When users hit problems with GPUs we can
ask them to run with `OLLAMA_DEBUG=1 ollama serve` and share the results.

987c16b2

20 Jan, 2024 3 commits
- Add compute capability 5.0, 7.5, and 8.0 · a447a083
  Daniel Hiltgen authored Jan 20, 2024
  
  a447a083
- increase minimum overhead to 1024MiB (#2114) · f32ea81b
  Jeffrey Morgan authored Jan 20, 2024
  
  f32ea81b
- Add support for CUDA 5.2 cards · 681a9149
  Daniel Hiltgen authored Jan 20, 2024
  
  681a9149
19 Jan, 2024 2 commits

More WSL paths · 552db98b
Daniel Hiltgen authored Jan 19, 2024

552db98b

Fix CPU-only build under Android Termux environment. · eb76f3e3

Self Denial authored Jan 15, 2024

Update gpu.go initGPUHandles() to declare gpuHandles variable before
reading it. This resolves an "invalid memory address or nil pointer
dereference" error.

Update dyn_ext_server.c to avoid setting the RTLD_DEEPBIND flag under
__TERMUX__ (Android).

eb76f3e3

18 Jan, 2024 1 commit
- Mechanical switch from log to slog · fedd705a
  Daniel Hiltgen authored Jan 18, 2024
  
  A few obvious levels were adjusted, but generally everything mapped to "info" level.
  
  fedd705a
14 Jan, 2024 1 commit
- Let gpu.go and gen_linux.sh also find CUDA on Arch Linux · f4bf1d51
  Alexander F. Rødseth authored Jan 14, 2024
  
  f4bf1d51