- 19 Aug, 2024 1 commit

Daniel Hiltgen authored
This adds new variants for arm64 specific to Jetson platforms.

- 02 Aug, 2024 1 commit

Michael Yang authored

- 11 Jul, 2024 1 commit

Jeffrey Morgan authored
* llm: avoid loading model if system memory is too small
* update log
* Instrument swap free space: on Linux and Windows, expose how much swap space is available so we can take that into consideration when scheduling models
* use `systemSwapFreeMemory` in check

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>

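The swap instrumentation above can be sketched on Linux via sysinfo(2). This is an illustrative, Linux-only approximation of what a `systemSwapFreeMemory`-style helper might look like, not ollama's actual implementation; the field scaling by `Unit` is how the kernel reports sizes:

```go
package main

import (
	"fmt"
	"syscall"
)

// systemSwapFreeMemory reports free swap in bytes on Linux.
// Sketch only: ollama's real helper also covers Windows.
func systemSwapFreeMemory() (uint64, error) {
	var info syscall.Sysinfo_t
	if err := syscall.Sysinfo(&info); err != nil {
		return 0, err
	}
	// Freeswap is expressed in units of info.Unit bytes.
	return uint64(info.Freeswap) * uint64(info.Unit), nil
}

func main() {
	free, err := systemSwapFreeMemory()
	if err != nil {
		fmt.Println("sysinfo failed:", err)
		return
	}
	fmt.Printf("free swap: %d bytes\n", free)
}
```

A scheduler can add this value to free RAM when deciding whether a model fits, as the commit describes.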
- 06 Jul, 2024 1 commit

Jeffrey Morgan authored

- 14 Jun, 2024 2 commits

Daniel Hiltgen authored

Daniel Hiltgen authored

- 10 May, 2024 1 commit

Daniel Hiltgen authored
Under stress scenarios we're seeing OOMs, so this should help stabilize the allocations under heavy concurrency stress.

- 07 May, 2024 1 commit

Michael Yang authored

- 01 May, 2024 1 commit

Jeffrey Morgan authored

- 23 Apr, 2024 1 commit

Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.

- 16 Apr, 2024 2 commits

Michael Yang authored

Michael Yang authored

- 10 Apr, 2024 1 commit

Michael Yang authored

- 07 Mar, 2024 1 commit

Daniel Hiltgen authored
Until we get all the memory calculations correct, this can provide an escape valve for users to work around out-of-memory crashes.

- 25 Feb, 2024 1 commit

peanut256 authored
* read iogpu.wired_limit_mb on macOS (fix for https://github.com/ollama/ollama/issues/1826)
* improved determination of available vram on macOS: read the recommended maximal vram on macOS via Metal API
* removed macOS-specific logging
* remove logging from gpu_darwin.go
* release Core Foundation object: fixes a possible memory leak

- 11 Jan, 2024 3 commits

Daniel Hiltgen authored
The memory changes and multi-variant change had some merge glitches I missed. This fixes them so we actually get the cpu llm lib and best variant for the given system.

Daniel Hiltgen authored
This switches darwin to dynamic loading, and refactors the code now that no static linking of the library is used on any platform.

Daniel Hiltgen authored
This reduces the built-in linux version to not use any vector extensions, which enables the resulting builds to run under Rosetta on macOS in Docker. At runtime it then checks for the actual CPU vector extensions and loads the best CPU library available.

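The runtime check in that last commit boils down to detecting CPU vector extensions and mapping them to a library variant. A hedged, Linux-only sketch (the variant names and the /proc/cpuinfo approach are assumptions for illustration, not ollama's exact detection code):

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// bestCPUVariant picks a CPU llama.cpp build based on flags in
// /proc/cpuinfo. Illustrative sketch only; real feature detection
// would typically use CPUID rather than string matching.
func bestCPUVariant() string {
	data, err := os.ReadFile("/proc/cpuinfo")
	if err != nil {
		return "cpu" // fall back to the no-vector-extension build
	}
	flags := strings.ToLower(string(data))
	switch {
	case strings.Contains(flags, "avx2"):
		return "cpu_avx2"
	case strings.Contains(flags, " avx"):
		return "cpu_avx"
	default:
		return "cpu"
	}
}

func main() {
	fmt.Println("selected variant:", bestCPUVariant())
}
```

The default build carries no vector extensions (so it runs under Rosetta), and the loader upgrades to an AVX or AVX2 library only when the host actually supports it.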
- 09 Jan, 2024 1 commit

Jeffrey Morgan authored

- 08 Jan, 2024 1 commit

Jeffrey Morgan authored
* select layers based on estimated model memory usage
* always account for scratch vram
* don't load +1 layers
* better estimation for graph alloc
* Update gpu/gpu_darwin.go
* Update llm/llm.go
* add overhead for cuda memory
* fix build error on linux
* address comments

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

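The layer-selection idea above reduces to simple arithmetic: reserve scratch/graph overhead, then see how many layers fit in what remains. The helper below is a hypothetical illustration (the name `layersToOffload` and the formula are assumptions, not ollama's actual estimator):

```go
package main

import "fmt"

// layersToOffload estimates how many transformer layers fit in free
// VRAM after reserving scratch/graph overhead. Illustrative only.
func layersToOffload(freeVRAM, overhead, bytesPerLayer uint64, totalLayers int) int {
	if freeVRAM <= overhead || bytesPerLayer == 0 {
		return 0 // no headroom: keep everything on the CPU
	}
	n := int((freeVRAM - overhead) / bytesPerLayer)
	if n > totalLayers {
		n = totalLayers // never offload more layers than the model has
	}
	return n
}

func main() {
	// e.g. 8 GiB free, 1 GiB scratch/graph overhead, 200 MiB per layer,
	// a 33-layer model: the estimate caps at the model's layer count.
	fmt.Println(layersToOffload(8<<30, 1<<30, 200<<20, 33))
}
```

Accounting for scratch buffers and CUDA overhead up front (rather than loading "+1 layers" optimistically) is what keeps the estimate from overshooting available VRAM.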
- 03 Jan, 2024 2 commits

Jeffrey Morgan authored

Daniel Hiltgen authored
This refines the gpu package error handling and fixes a bug with the system memory lookup on windows.

- 02 Jan, 2024 1 commit

Daniel Hiltgen authored
Refactor where we store build outputs, and support a fully dynamic loading model on windows so the base executable has no special dependencies and thus doesn't require a special PATH.

- 20 Dec, 2023 1 commit

Daniel Hiltgen authored
This switches the default llama.cpp to be CPU based, and builds the GPU variants as dynamically loaded libraries which we can select at runtime. This also bumps the ROCm library to version 6, given 5.7 builds don't work on the latest ROCm library that just shipped.

- 19 Dec, 2023 3 commits

Daniel Hiltgen authored

Daniel Hiltgen authored

Daniel Hiltgen authored
Run the server.cpp directly inside the Go runtime via cgo while retaining the LLM Go abstractions.