- 10 Dec, 2024 1 commit
Daniel Hiltgen authored
* llama: wire up builtin runner

  This adds a new entrypoint into the ollama CLI to run the cgo-built runner. On Mac arm64 this will have GPU support, but on all other platforms it will be the lowest-common-denominator CPU build. After we fully transition to the new Go runners, more tech debt can be removed and we can stop building the "default" runner via make, relying on the builtin runner always.

* build: Make target improvements

  Add a few new targets and help for building locally. This also adjusts the runner lookup to favor local builds, then runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

  This implements a simplified custom CPU flags pattern for the runners. When built without overrides, the runner name contains the vector flag we check for (AVX) to ensure we don't try to run on unsupported systems and crash. If the user builds a customized set, we omit the naming scheme and don't check for compatibility. This avoids checking requirements at runtime, so that logic has been removed as well. This can be used to build GPU runners with no vector flags, or CPU/GPU runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

  If the user checks out the repo in a path that contains spaces, make gets confused, so use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

  This removes support for v0.3.6 and older versions (before the tar bundle) and ensures we clean up prior libraries before extracting the bundle(s). Without this change, runners and dependent libraries could leak across updates and lead to subtle runtime errors.
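The naming-based compatibility check is the interesting mechanism here. Below is a minimal Go sketch of the idea, assuming `golang.org/x/sys/cpu` for feature detection; the runner names and suffix convention are illustrative, not ollama's actual scheme.

```go
// Hypothetical sketch: pick a runner whose name encodes its required
// vector extensions, and skip it when the host CPU lacks them.
package main

import (
	"fmt"
	"strings"

	"golang.org/x/sys/cpu"
)

// runnerCompatible reports whether a runner build named like
// "cpu_avx" or "cuda_v12_avx" can execute on this host. Runners
// built with custom flags carry no feature suffix and are not checked.
func runnerCompatible(name string) bool {
	if strings.Contains(name, "avx2") {
		return cpu.X86.HasAVX2
	}
	if strings.Contains(name, "avx") {
		return cpu.X86.HasAVX
	}
	return true // no vector suffix: assume a custom or baseline build
}

func main() {
	for _, r := range []string{"cpu", "cpu_avx", "cpu_avx2"} {
		fmt.Printf("%s compatible: %v\n", r, runnerCompatible(r))
	}
}
```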
- 03 Dec, 2024 1 commit
Sam authored
- 12 Nov, 2024 1 commit
Daniel Hiltgen authored
This adds support for the Jetson JetPack variants to the Go runner.
- 30 Oct, 2024 1 commit
Daniel Hiltgen authored
Until we have full NUMA support, this adjusts the default thread selection algorithm to count up the number of performance cores across all sockets.
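A rough illustration of counting physical cores across all sockets on Linux by parsing /proc/cpuinfo. This is a sketch of the counting idea only; distinguishing performance cores from efficiency cores is platform specific and not shown.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// physicalCores counts unique (socket, core) pairs so hyperthreads
// aren't double-counted. Linux-only sketch.
func physicalCores() (int, error) {
	f, err := os.Open("/proc/cpuinfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	seen := map[string]bool{} // "physical id:core id" pairs
	var socket, core string
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		k, v, ok := strings.Cut(sc.Text(), ":")
		if !ok {
			continue
		}
		switch strings.TrimSpace(k) {
		case "physical id":
			socket = strings.TrimSpace(v)
		case "core id":
			core = strings.TrimSpace(v)
			seen[socket+":"+core] = true
		}
	}
	return len(seen), sc.Err()
}

func main() {
	n, err := physicalCores()
	if err != nil {
		fmt.Println("fallback to runtime.NumCPU():", err)
		return
	}
	fmt.Println("physical cores across sockets:", n)
}
```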
- 17 Oct, 2024 1 commit
Daniel Hiltgen authored
Cleaning up Go package naming
- 15 Oct, 2024 1 commit
Daniel Hiltgen authored
On Windows, detect large multi-socket systems and reduce the default thread count to the number of cores in one socket for best performance.
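A tiny sketch of that capping policy: given per-socket core counts, limit the default thread count to a single socket. How the socket topology is detected on Windows is not shown here, and the helper is hypothetical.

```go
package main

import "fmt"

// defaultThreads caps the default worker count at the core count of
// the largest single socket; crossing sockets hurts inference
// throughput. (Sketch only; the real topology detection is OS-specific.)
func defaultThreads(coresPerSocket []int) int {
	best := 1
	for _, c := range coresPerSocket {
		if c > best {
			best = c
		}
	}
	return best
}

func main() {
	fmt.Println(defaultThreads([]int{32, 32})) // dual-socket 32-core: 32
}
```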
- 14 Oct, 2024 1 commit
Daniel Hiltgen authored
* Expose GPU discovery failure information
* Remove exposed API for now
- 23 Aug, 2024 1 commit
Daniel Hiltgen authored
The recent CUDA variant changes uncovered a bug in ByLibrary, which failed to group GPU types by their common variant.
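To make the grouping behavior concrete, here is an illustrative sketch of grouping discovered GPUs by library, so cards carrying different variants of the same library still land in one group; the struct fields are stand-ins, not the actual ollama types.

```go
package main

import "fmt"

// GpuInfo is a pared-down stand-in for the discovery type.
type GpuInfo struct {
	Library string // e.g. "cuda", "rocm"
	Variant string // e.g. "v11", "v12"
	ID      string
}

// byLibrary groups GPUs by library alone, keeping variants of the
// same library together so one runner can serve them.
func byLibrary(gpus []GpuInfo) map[string][]GpuInfo {
	out := map[string][]GpuInfo{}
	for _, g := range gpus {
		out[g.Library] = append(out[g.Library], g)
	}
	return out
}

func main() {
	gpus := []GpuInfo{
		{Library: "cuda", Variant: "v12", ID: "0"},
		{Library: "cuda", Variant: "v11", ID: "1"},
	}
	fmt.Println(byLibrary(gpus)) // both CUDA cards land in one group
}
```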
- 19 Aug, 2024 3 commits
Daniel Hiltgen authored
Based on compute capability and driver version, pick the v12 or v11 CUDA variant.
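A hedged sketch of that selection rule; the thresholds below are illustrative placeholders, not the real cutoffs.

```go
package main

import "fmt"

// pickCudaVariant chooses between the bundled CUDA builds: newer
// compute capabilities and drivers get v12, older ones fall back
// to v11. Threshold values here are assumptions.
func pickCudaVariant(ccMajor, driverMajor int) string {
	if ccMajor >= 6 && driverMajor >= 12 {
		return "cuda_v12"
	}
	return "cuda_v11"
}

func main() {
	fmt.Println(pickCudaVariant(8, 12)) // cuda_v12
	fmt.Println(pickCudaVariant(5, 11)) // cuda_v11
}
```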
Daniel Hiltgen authored
Daniel Hiltgen authored
This adds new arm64 variants specific to Jetson platforms.
- 11 Jul, 2024 1 commit
Jeffrey Morgan authored
* llm: avoid loading model if system memory is too small
* update log
* Instrument swap free space

  On Linux and Windows, expose how much swap space is available so we can take that into consideration when scheduling models.

* use `systemSwapFreeMemory` in check

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
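A minimal Linux-only sketch of the idea: read available RAM and free swap from /proc/meminfo, and refuse to load when the estimated model footprint exceeds their sum. The `meminfo` helper and the 8 GiB estimate are illustrative, not ollama's actual code.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// meminfo returns MemAvailable and SwapFree in bytes (Linux only);
// the real lookup is implemented per platform.
func meminfo() (avail, swapFree uint64, err error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 2 {
			continue
		}
		kb, _ := strconv.ParseUint(fields[1], 10, 64)
		switch fields[0] {
		case "MemAvailable:":
			avail = kb * 1024
		case "SwapFree:":
			swapFree = kb * 1024
		}
	}
	return avail, swapFree, sc.Err()
}

func main() {
	var estimate uint64 = 8 << 30 // hypothetical model footprint
	avail, swap, err := meminfo()
	if err != nil {
		fmt.Println(err)
		return
	}
	if estimate > avail+swap {
		fmt.Println("refusing to load: model larger than RAM+swap")
	}
}
```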
- 09 Jul, 2024 1 commit
Daniel Hiltgen authored
This adds logic to detect skew between the driver and the management library, which can be attributed to OS overhead, and records it so we can adjust subsequent management library free VRAM updates and avoid OOM scenarios.
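A sketch of the skew-correction idea: record the startup difference between the two free-VRAM readings, then discount it from later management-library updates. Types and names are illustrative.

```go
package main

import "fmt"

// vramTracker records the one-time skew between what the driver and
// the management library report as free VRAM. Sketch types only.
type vramTracker struct {
	overhead uint64 // how much the mgmt library over-reports vs the driver
}

func newVramTracker(driverFree, mgmtFree uint64) vramTracker {
	var skew uint64
	if mgmtFree > driverFree {
		skew = mgmtFree - driverFree
	}
	return vramTracker{overhead: skew}
}

// adjustedFree discounts the recorded OS overhead from a fresh
// management-library reading to avoid over-committing VRAM.
func (v vramTracker) adjustedFree(mgmtFree uint64) uint64 {
	if mgmtFree < v.overhead {
		return 0
	}
	return mgmtFree - v.overhead
}

func main() {
	t := newVramTracker(7<<30, 8<<30)  // mgmt lib reports 1 GiB more than driver
	fmt.Println(t.adjustedFree(6 << 30)) // later readings corrected down by 1 GiB
}
```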
- 21 Jun, 2024 1 commit
Daniel Hiltgen authored
Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs now have accurate VRAM reporting wired up, so we can turn on concurrency by default.
- 14 Jun, 2024 5 commits
Daniel Hiltgen authored
Implement support for GPU env var workarounds, and leverage this for the Vega RX 56, which needs HSA_ENABLE_SDMA=0 set to work properly.
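Conceptually this is a lookup table from GPU names to required env vars. A sketch follows; only the Vega RX 56 entry comes from the commit text, and the table shape is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuWorkarounds maps GPU name substrings to env vars that must be
// set for stable operation on that hardware.
var gpuWorkarounds = map[string][]string{
	"Vega RX 56": {"HSA_ENABLE_SDMA=0"},
}

// envForGPU returns the extra environment a runner needs for the
// named GPU, or nil when no workaround applies.
func envForGPU(name string) []string {
	for sub, env := range gpuWorkarounds {
		if strings.Contains(name, sub) {
			return env
		}
	}
	return nil
}

func main() {
	fmt.Println(envForGPU("Radeon Vega RX 56")) // [HSA_ENABLE_SDMA=0]
}
```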
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Still not complete; our prediction needs refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
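A simplified sketch of per-GPU placement: fill each card's free space with whole layers rather than pooling free space across devices. Uniform layer size is a simplifying assumption for the sketch.

```go
package main

import "fmt"

// layersPerGPU places whole layers GPU by GPU: a layer cannot be
// split across devices, so each GPU's free space is filled
// independently. Layers that fit nowhere stay on the CPU.
func layersPerGPU(freeBytes []uint64, layerBytes uint64, total int) []int {
	fit := make([]int, len(freeBytes))
	remaining := total
	for i, free := range freeBytes {
		n := int(free / layerBytes)
		if n > remaining {
			n = remaining
		}
		fit[i] = n
		remaining -= n
	}
	return fit
}

func main() {
	// 33 layers of ~500 MiB across an 8 GiB and a 4 GiB card: [16 8]
	fmt.Println(layersPerGPU([]uint64{8 << 30, 4 << 30}, 500<<20, 33))
}
```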
Daniel Hiltgen authored
Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free memory updating.
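A sketch of the split: enumerate devices once, then refresh only the volatile free-memory figure on each scheduling pass. `queryFree` is a hypothetical stand-in for the management-library call.

```go
package main

import "fmt"

// gpuDevice separates static facts discovered once from the
// volatile free-VRAM figure refreshed often. Illustrative types.
type gpuDevice struct {
	ID        string
	TotalVRAM uint64 // static, discovered once
	FreeVRAM  uint64 // volatile, refreshed per scheduling pass
}

func discover() []gpuDevice { // one-time, slow path
	return []gpuDevice{{ID: "GPU-0", TotalVRAM: 8 << 30}}
}

func refreshFree(devs []gpuDevice) { // cheap, called repeatedly
	for i := range devs {
		devs[i].FreeVRAM = queryFree(devs[i].ID)
	}
}

func queryFree(id string) uint64 { return 6 << 30 } // mgmt-library stand-in

func main() {
	devs := discover()
	refreshFree(devs)
	fmt.Printf("%+v\n", devs)
}
```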
- 09 May, 2024 1 commit
Daniel Hiltgen authored
This cleans up the logging for GPU discovery a bit, and can serve as a foundation to report GPU information in a future UX.
- 23 Apr, 2024 1 commit
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
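For example, both knobs can be read as positive integers defaulting to 1, matching the defaults described above; the helper below is a sketch, not ollama's actual parsing.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer setting with a default, ignoring values
// that are unset, malformed, or non-positive.
func envInt(key string, def int) int {
	if v := os.Getenv(key); v != "" {
		if n, err := strconv.Atoi(v); err == nil && n > 0 {
			return n
		}
	}
	return def
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1)
	fmt.Println(numParallel, maxLoaded)
}
```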
- 10 Apr, 2024 1 commit
Michael Yang authored
- 01 Apr, 2024 1 commit
Michael Yang authored
Count each layer independently when deciding GPU offloading.
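A sketch of the per-layer accounting: walk the actual layer sizes and stop when VRAM runs out, rather than dividing free memory by an average layer size. The sizes and budget below are made up.

```go
package main

import "fmt"

// offloadCount offloads layers in order until the VRAM budget is
// exhausted, accounting for each layer's own size.
func offloadCount(layerBytes []uint64, freeVRAM uint64) int {
	n := 0
	for _, sz := range layerBytes {
		if sz > freeVRAM {
			break
		}
		freeVRAM -= sz
		n++
	}
	return n
}

func main() {
	layers := []uint64{600 << 20, 500 << 20, 500 << 20, 900 << 20}
	fmt.Println(offloadCount(layers, 2<<30)) // 3 layers fit in 2 GiB
}
```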
- 12 Feb, 2024 1 commit
Daniel Hiltgen authored
This wires up new logic to use sysfs to discover AMD GPU information and detect old cards we can't yet support, so we can fall back to CPU mode.
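An illustrative Linux sketch of sysfs-based detection: enumerate DRM devices and match AMD's PCI vendor ID (0x1002). Which additional files the real code consults (e.g. for the GFX version) is not shown, and the exact paths here are an assumption.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// amdCards lists sysfs device directories whose PCI vendor ID is
// AMD's (0x1002). Linux-only sketch.
func amdCards() ([]string, error) {
	matches, err := filepath.Glob("/sys/class/drm/card*/device/vendor")
	if err != nil {
		return nil, err
	}
	var cards []string
	for _, m := range matches {
		data, err := os.ReadFile(m)
		if err != nil {
			continue
		}
		if strings.TrimSpace(string(data)) == "0x1002" {
			cards = append(cards, filepath.Dir(m))
		}
	}
	return cards, nil
}

func main() {
	cards, _ := amdCards()
	fmt.Println("AMD GPUs:", cards)
}
```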
- 11 Jan, 2024 1 commit
Daniel Hiltgen authored
In some cases we may want multiple variants for a given GPU type or CPU. This adds an optional Variant that we can use to select an optimal library, and also allows us to try multiple variants in case some fail to load. This can be useful for scenarios such as ROCm v5 vs v6 incompatibility or potentially CPU features.
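A sketch of the ordered fallback across variants: try each in preference order and keep the first that loads. `loadLibrary` is a hypothetical stand-in for the dynamic-loading step.

```go
package main

import (
	"errors"
	"fmt"
)

// loadLibrary simulates the dynamic-loading step; here the v6
// variant fails so the caller falls back.
func loadLibrary(variant string) error {
	if variant == "rocm_v6" {
		return errors.New("rocm_v6: incompatible driver") // simulated failure
	}
	return nil
}

// selectVariant tries variants in preference order and returns the
// first one that loads successfully.
func selectVariant(variants []string) (string, error) {
	for _, v := range variants {
		if err := loadLibrary(v); err != nil {
			fmt.Println("skipping:", err)
			continue
		}
		return v, nil
	}
	return "", errors.New("no usable library variant")
}

func main() {
	v, err := selectVariant([]string{"rocm_v6", "rocm_v5", "cpu"})
	fmt.Println(v, err) // falls back to rocm_v5
}
```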
- 09 Jan, 2024 1 commit
Jeffrey Morgan authored
- 03 Jan, 2024 1 commit
Daniel Hiltgen authored
This refines the gpu package error handling and fixes a bug with the system memory lookup on Windows.
- 02 Jan, 2024 1 commit
Daniel Hiltgen authored
Refactor where we store build outputs, and support a fully dynamic loading model on Windows so the base executable has no special dependencies and thus doesn't require a special PATH.
- 20 Dec, 2023 1 commit
Daniel Hiltgen authored
This switches the default llama.cpp to be CPU-based, and builds the GPU variants as dynamically loaded libraries that we can select at runtime. This also bumps the ROCm library to version 6, given that 5.7 builds don't work on the latest ROCm library that just shipped.
- 19 Dec, 2023 1 commit
Daniel Hiltgen authored