Commits · 17b7186cd759337fa98b626e82de150f3789b040 · OpenDAS / ollama

21 Jun, 2024 1 commit

Enable concurrency by default · 17b7186c

Daniel Hiltgen authored May 06, 2024

This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these with the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on GPUs with little
VRAM, so this change also refines the algorithm: when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.

17b7186c
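
A minimal sketch in Go of how a default like the one described in "Enable concurrency by default" might be chosen. It is not the scheduler's actual code: the function name `pickParallel`, the candidate list, and the linear cost model (model weights plus a per-token KV cache cost multiplied by num_ctx and by the parallel count) are all illustrative assumptions.

```go
package main

import "fmt"

// pickParallel returns the parallel setting to use.
// Hypothetical helper: the name, the candidate list, and the cost model
// are assumptions made for illustration only.
func pickParallel(userParallel int, candidates []int, numCtx int,
	weightsBytes, kvBytesPerToken, freeVRAM uint64) int {
	if userParallel > 0 {
		return userParallel // an explicit user setting always wins
	}
	// Try candidates in the order given (largest first) and keep the
	// first one whose estimated footprint fits in free VRAM.
	for _, p := range candidates {
		need := weightsBytes + kvBytesPerToken*uint64(numCtx)*uint64(p)
		if need <= freeVRAM {
			return p
		}
	}
	return 1 // nothing fits: fall back to a single request slot
}

func main() {
	const (
		MiB = 1 << 20
		GiB = 1 << 30
	)
	// 4 GiB of weights, 128 KiB of KV cache per token, 4.5 GiB free.
	p := pickParallel(0, []int{4, 2, 1}, 2048, 4*GiB, 128*1024, 4*GiB+512*MiB)
	fmt.Println("parallel:", p)
}
```

With these made-up numbers each request slot costs 256 MiB of KV cache, so four slots would need 5 GiB and do not fit, while two slots need exactly 4.5 GiB and do; the program prints `parallel: 2`.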

20 Jun, 2024 1 commit

Refine mmap default logic on linux · 5bf5aeec

Daniel Hiltgen authored Jun 20, 2024

If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.

5bf5aeec

18 Jun, 2024 1 commit

Tighten up memory prediction logging · 7784ca33

Daniel Hiltgen authored Jun 17, 2024

Prior to this change, we logged the memory prediction multiple times
as the scheduler iterates to find a suitable configuration, which can be
confusing since only the last log before the server starts is actually valid.
This now logs once just before starting the server on the final configuration.
It also reports which library is in use instead of always saying
"offloading to gpu", even when running on CPU.

7784ca33

17 Jun, 2024 2 commits

Adjust mmap logic for cuda windows for faster model load · 17179679

Daniel Hiltgen authored Jun 17, 2024

On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can distinguish a user-provided value of true or
false from an unspecified value.

17179679
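
One common way to express such a tri-state in Go is a `*bool`, where `nil` means "unspecified". The sketch below illustrates that idea only; the function name `resolveMmap` and the notion of a single `platformDefault` argument are assumptions, not the project's actual code.

```go
package main

import "fmt"

// resolveMmap decides whether to use mmap.
// userSetting is a tri-state: nil means the user did not specify a value,
// otherwise it points at the explicit true/false the user provided.
// platformDefault is a hypothetical stand-in for whatever default the
// platform logic computes (for example, off on Windows with CUDA).
func resolveMmap(userSetting *bool, platformDefault bool) bool {
	if userSetting != nil {
		return *userSetting // explicit user choice overrides the default
	}
	return platformDefault
}

func main() {
	on := true
	fmt.Println(resolveMmap(nil, false)) // unspecified: platform default
	fmt.Println(resolveMmap(&on, false)) // user explicitly asked for mmap
}
```

A plain `bool` cannot carry this distinction, because its zero value `false` is indistinguishable from an explicit `false`.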

Move libraries out of users path · b2799f11

Daniel Hiltgen authored Jun 15, 2024

We update the PATH on windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.

b2799f11

14 Jun, 2024 4 commits

Workaround gfx900 SDMA bugs · da3bf233

Daniel Hiltgen authored May 31, 2024

Implement support for GPU env var workarounds, and leverage
this for the Vega RX 56, which needs
HSA_ENABLE_SDMA=0 set to work properly.

da3bf233
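
A per-GPU env var workaround of this kind can be pictured as a lookup table applied when building the subprocess environment. The Go sketch below is hypothetical: the table, the `withWorkarounds` helper, and the choice not to override a value the user already set are assumptions; only the gfx900 / `HSA_ENABLE_SDMA=0` pairing comes from the commit message.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuEnvWorkarounds maps a GPU identifier to extra environment variables.
// The table shape and lookup key are hypothetical.
var gpuEnvWorkarounds = map[string][]string{
	"gfx900": {"HSA_ENABLE_SDMA=0"},
}

// withWorkarounds returns a copy of base plus any workaround variables
// registered for gpu. As an assumed design choice, a variable that is
// already present in base is left untouched.
func withWorkarounds(base []string, gpu string) []string {
	out := append([]string(nil), base...)
	for _, kv := range gpuEnvWorkarounds[gpu] {
		name, _, _ := strings.Cut(kv, "=")
		present := false
		for _, existing := range base {
			if strings.HasPrefix(existing, name+"=") {
				present = true
				break
			}
		}
		if !present {
			out = append(out, kv)
		}
	}
	return out
}

func main() {
	env := withWorkarounds([]string{"PATH=/usr/bin"}, "gfx900")
	fmt.Println(env)
}
```

The resulting slice could then be assigned to the `Env` field of an `exec.Cmd` before starting the runner subprocess.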

review comments and coverage · 6f351bf5

Daniel Hiltgen authored Jun 05, 2024

6f351bf5

Refine CPU load behavior with system memory visibility · fc37c192

Daniel Hiltgen authored Jun 03, 2024

fc37c192

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete. Our prediction needs some refinement to understand the
available space on each discrete GPU so we can see how many layers fit in
each one. Since we can't split one layer across multiple GPUs, we can't
treat free space as one logical block.

6fd04ca9
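
The point about not treating free space as one logical block can be shown with a small Go sketch. It assumes, purely for illustration, that every layer has the same size; `layersPerGPU` is a hypothetical helper, not the project's prediction code.

```go
package main

import "fmt"

// layersPerGPU assigns whole layers to each GPU in order and returns the
// per-GPU counts plus the total number of layers placed. A layer is never
// split across GPUs, so each GPU contributes floor(free/layerBytes).
// Uniform layer size is a simplifying assumption.
func layersPerGPU(freeBytes []uint64, layerBytes uint64, totalLayers int) ([]int, int) {
	counts := make([]int, len(freeBytes))
	placed := 0
	if layerBytes == 0 {
		return counts, placed // avoid dividing by zero
	}
	for i, free := range freeBytes {
		n := int(free / layerBytes)
		if n > totalLayers-placed {
			n = totalLayers - placed // do not place more layers than exist
		}
		counts[i] = n
		placed += n
	}
	return counts, placed
}

func main() {
	const MiB = 1 << 20
	free := []uint64{2500 * MiB, 2500 * MiB}
	counts, placed := layersPerGPU(free, 1000*MiB, 32)
	fmt.Println(counts, placed)
}
```

With two GPUs that each have 2500 MiB free and 1000 MiB layers, pooling the free space would suggest five layers fit, but only two fit on each GPU, so the sketch places four.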

09 Jun, 2024 1 commit

Critical fix from llama.cpp JSON grammar to forbid un-escaped escape... · b84aea16

Craig Hughes authored Jun 09, 2024

Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782)

b84aea16

04 Jun, 2024 2 commits
- lint · e40145a3
  Michael Yang authored May 21, 2024
  
  e40145a3
- some gocritic · c895a7d1
  Michael Yang authored May 21, 2024
  
  c895a7d1
01 Jun, 2024 1 commit

revert tokenize ffi (#4761) · 829ff87b

Michael Yang authored May 31, 2024

* Revert "use `int32_t` for call to tokenize (#4738)"

This reverts commit 763bb65d.

* Revert "vocab only"

This reverts commit bf54c845.

* Revert "use ffi for tokenizing/detokenizing"

This reverts commit 26a00a04.

829ff87b

30 May, 2024 1 commit
- partial offloading: allow flash attention and disable mmap (#4734) · a50a87a7
  Jeffrey Morgan authored May 30, 2024
```
* partial offloading: allow flash attention and disable mmap

* allow mmap with num_gpu=0
```
  a50a87a7
29 May, 2024 1 commit
- use ffi for tokenizing/detokenizing · 26a00a04
  Michael Yang authored May 11, 2024
  
  26a00a04
28 May, 2024 2 commits

Give the final model loading more time · 92c81e81

Daniel Hiltgen authored May 28, 2024

On some systems, 1 minute isn't sufficient to finish the load after it
hits 100%. This creates 2 distinct timers, although they're both set to
the same value for now, so we can refine the timeouts further.

92c81e81

llm/server.go: Fix 2 minor typos (#4661) · 7487229c
Lei Jitang authored May 28, 2024
```
Signed-off-by: Lei Jitang <leijitang@outlook.com>
```
7487229c

25 May, 2024 1 commit

Report better warning on client closed abort of load · c4209d6d

Daniel Hiltgen authored May 25, 2024

If the client closes the connection before we finish loading the model,
we abort, so let's make the log message say why, to help users
understand this failure mode.

c4209d6d

24 May, 2024 1 commit
- Move envconfig and consolidate env vars (#4608) · 4cc3be30
  Patrick Devine authored May 24, 2024
  
  4cc3be30
23 May, 2024 2 commits

Wire up load progress · b37b496a

Daniel Hiltgen authored May 20, 2024

This doesn't expose a UX yet, but wires the initial server portion
of progress reporting during load

b37b496a

Use flash attention flag for now (#4580) · 38255d2a

Jeffrey Morgan authored May 22, 2024

* put flash attention behind flag for now

* add test

* remove print

* up timeout for scheduler tests

38255d2a

20 May, 2024 1 commit

feat: add support for flash_attn (#4120) · e15307fd

Sam authored May 21, 2024

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: enable flash attention if supported

* feat: add flash_attn support

e15307fd

15 May, 2024 2 commits
- fix the cpu estimatedTotal memory + get the expiry time for loading models (#4461) · d1692fd3
  Patrick Devine authored May 15, 2024
  
  d1692fd3
- Sanitize the env var debug log · 853ae490
  Daniel Hiltgen authored May 15, 2024
```
Only dump env vars we care about in the logs
```
  853ae490
14 May, 2024 1 commit
- Ollama `ps` command for showing currently loaded models (#4327) · 68459888
  Patrick Devine authored May 13, 2024
  
  68459888
11 May, 2024 1 commit
- Revert "only forward some env vars" · 92ca2cca
  jmorganca authored May 10, 2024
```
This reverts commit ce3b212d.
```
  92ca2cca
10 May, 2024 2 commits
- Fall back to CPU runner with zero layers · c4014e73
  Daniel Hiltgen authored May 10, 2024
  
  c4014e73
- Don't clamp ctx size in `PredictServerFit` (#4317) · bb6fd022
  Jeffrey Morgan authored May 10, 2024
```
* don't clamp ctx size in `PredictServerFit`

* minimum 4 context

* remove context warning
```
  bb6fd022
09 May, 2024 5 commits
- fix typo · cf442cd5
  Michael Yang authored May 09, 2024
  
  cf442cd5
- only forward some env vars · ce3b212d
  Michael Yang authored May 09, 2024
  
  ce3b212d
- log clean up · 58876091
  Michael Yang authored May 09, 2024
  
  58876091
- add done_reason to the api (#4235) · cfa84b84
  Bruce MacDonald authored May 09, 2024
  
  cfa84b84
- Refine subprocess reaping · 84ac7ce1
  Daniel Hiltgen authored May 09, 2024
  
  84ac7ce1
08 May, 2024 1 commit
- Record GPU usage information · bee2f4a3
  Daniel Hiltgen authored May 04, 2024
```
This records more GPU usage information for eventual UX inclusion.
```
  bee2f4a3
07 May, 2024 1 commit

Detect noexec and report a better error · 72700279

Daniel Hiltgen authored May 07, 2024

This will bubble up a much more informative error message if noexec
is preventing us from running the subprocess

72700279

06 May, 2024 3 commits
- Use our libraries first · 380378cc
  Daniel Hiltgen authored May 05, 2024
```
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
```
  380378cc
- Fix `no slots available` error with concurrent requests (#4160) · ed740a25
  Jeffrey Morgan authored May 06, 2024
  
  ed740a25
- Fix llava models not working after first request (#4164) · 1b0e6c9c
  Jeffrey Morgan authored May 05, 2024
```
* fix llava models not working after first request

* individual requests only for llava models
```
  1b0e6c9c
05 May, 2024 1 commit

Centralize server config handling · f56aa200

Daniel Hiltgen authored May 04, 2024

This moves all the env var reading into one central module
and logs the loaded config once at startup, which should
help in troubleshooting user server logs.

f56aa200

01 May, 2024 1 commit
- Removing go routine calling .wait from load. · 321d57e1
  Mark Ward authored May 01, 2024
  
  321d57e1