1. 14 Jun, 2024 1 commit
      Improve multi-gpu handling at the limit · 6fd04ca9
      Daniel Hiltgen authored
      Still not complete; our prediction needs further refinement to
      understand each discrete GPU's available space so we can see how many
      layers fit in each one. Since we can't split a single layer across
      multiple GPUs, we can't treat free space as one logical block (see the
      sketch below).
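      The prediction described above has to size each discrete GPU on its
      own. A minimal Go sketch of that per-device fitting follows; the names
      (gpu, layersPerGPU) are illustrative assumptions, not Ollama's actual
      code.

      ```go
      package main

      import "fmt"

      // gpu describes one discrete device; freeBytes is its usable VRAM.
      type gpu struct {
          id        int
          freeBytes uint64
      }

      // layersPerGPU assigns whole layers greedily. Each layer must fit
      // entirely within a single GPU's remaining space, so free memory
      // across devices cannot be pooled into one logical block.
      func layersPerGPU(gpus []gpu, layerBytes uint64, totalLayers int) map[int]int {
          fit := make(map[int]int)
          remaining := totalLayers
          for _, g := range gpus {
              n := int(g.freeBytes / layerBytes) // whole layers only
              if n > remaining {
                  n = remaining
              }
              fit[g.id] = n
              remaining -= n
          }
          return fit
      }

      func main() {
          // Two GPUs with 5 GiB free each and 3 GiB layers: pooling the
          // 10 GiB would predict 3 layers fit, but only 1 fits per device.
          gpus := []gpu{{0, 5 << 30}, {1, 5 << 30}}
          fmt.Println(layersPerGPU(gpus, 3<<30, 4)) // map[0:1 1:1]
      }
      ```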
  2. 09 Jun, 2024 1 commit
  3. 04 Jun, 2024 2 commits
  4. 01 Jun, 2024 1 commit
  5. 30 May, 2024 1 commit
  6. 29 May, 2024 1 commit
  7. 28 May, 2024 2 commits
  8. 25 May, 2024 1 commit
  9. 24 May, 2024 1 commit
  10. 23 May, 2024 2 commits
  11. 20 May, 2024 1 commit
      feat: add support for flash_attn (#4120) · e15307fd
      Sam authored
      * feat: enable flash attention if supported

      * feat: add flash_attn support
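      As a hedged sketch of "enable flash attention if supported": the runner
      could append a flash-attention flag only when the user opts in and the
      backend supports it. The OLLAMA_FLASH_ATTENTION variable and
      --flash-attn flag below follow Ollama/llama.cpp conventions but are
      assumptions here, not confirmed by this log.

      ```go
      package main

      import (
          "fmt"
          "os"
          "strconv"
      )

      // flashAttnEnabled combines the user's opt-in with a capability check.
      // supported would come from probing the GPU backend; hypothetical here.
      func flashAttnEnabled(supported bool) bool {
          requested, _ := strconv.ParseBool(os.Getenv("OLLAMA_FLASH_ATTENTION"))
          return requested && supported
      }

      func main() {
          args := []string{"--model", "model.gguf"}
          if flashAttnEnabled(true) {
              args = append(args, "--flash-attn") // assumed runner flag
          }
          fmt.Println(args)
      }
      ```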
  12. 15 May, 2024 2 commits
  13. 14 May, 2024 1 commit
  14. 11 May, 2024 1 commit
  15. 10 May, 2024 2 commits
  16. 09 May, 2024 5 commits
  17. 08 May, 2024 1 commit
  18. 07 May, 2024 1 commit
  19. 06 May, 2024 3 commits
  20. 05 May, 2024 1 commit
      Centralize server config handling · f56aa200
      Daniel Hiltgen authored
      This moves all of the env var reading into one central module
      and logs the loaded config once at startup, which should help
      when troubleshooting from users' server logs. The pattern is
      sketched below.
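      A minimal sketch of the centralization idea: read every env var in one
      Load function and log the resolved config once at startup. The struct
      fields and defaults are illustrative; only the env var names
      (OLLAMA_HOST, OLLAMA_NUM_PARALLEL, OLLAMA_DEBUG) mirror Ollama's
      documented settings.

      ```go
      package main

      import (
          "log/slog"
          "os"
          "strconv"
      )

      // Config gathers the server's env-driven settings in one place. This is
      // an illustrative subset, not the full set Ollama reads.
      type Config struct {
          Host        string
          NumParallel int
          Debug       bool
      }

      // Load reads each environment variable exactly once and applies
      // defaults, so no other module touches os.Getenv directly.
      func Load() Config {
          cfg := Config{Host: "127.0.0.1:11434", NumParallel: 1}
          if v := os.Getenv("OLLAMA_HOST"); v != "" {
              cfg.Host = v
          }
          if n, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil && n > 0 {
              cfg.NumParallel = n
          }
          cfg.Debug, _ = strconv.ParseBool(os.Getenv("OLLAMA_DEBUG"))
          return cfg
      }

      func main() {
          cfg := Load()
          // Logging the resolved config once at startup makes user logs
          // self-describing when troubleshooting.
          slog.Info("server config", "host", cfg.Host, "parallel", cfg.NumParallel, "debug", cfg.Debug)
      }
      ```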
  21. 01 May, 2024 4 commits
  22. 29 Apr, 2024 1 commit
  23. 26 Apr, 2024 1 commit
  24. 25 Apr, 2024 1 commit
  25. 23 Apr, 2024 2 commits
      Detect and recover if runner removed · 58888a74
      Daniel Hiltgen authored
      Tmp cleaners can nuke the runner file out from underneath us. This
      detects the missing runner and re-initializes the payloads, as
      sketched below.
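      A sketch of the detect-and-recover pattern, assuming a hypothetical
      reinitPayloads helper that stands in for re-extracting the bundled
      runner binary:

      ```go
      package main

      import (
          "errors"
          "fmt"
          "os"
          "path/filepath"
      )

      // reinitPayloads stands in for re-extracting the bundled runner
      // binaries; the name and behavior are hypothetical.
      func reinitPayloads(path string) error {
          return os.WriteFile(path, []byte("runner"), 0o755)
      }

      // ensureRunner re-creates the runner if a tmp cleaner deleted it out
      // from under us, instead of failing the next model load.
      func ensureRunner(path string) error {
          _, err := os.Stat(path)
          if errors.Is(err, os.ErrNotExist) {
              fmt.Println("runner missing, re-initializing payloads")
              return reinitPayloads(path)
          }
          return err
      }

      func main() {
          path := filepath.Join(os.TempDir(), "ollama-runner")
          if err := ensureRunner(path); err != nil {
              fmt.Println("recovery failed:", err)
          }
      }
      ```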
      Request and model concurrency · 34b9db5a
      Daniel Hiltgen authored
      This change adds support for multiple concurrent requests, as well as
      loading multiple models by spawning multiple runners. The defaults are
      currently 1 concurrent request per model and only 1 loaded model at a
      time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and
      OLLAMA_MAX_LOADED_MODELS. A simplified scheduler sketch follows below.
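      A simplified sketch of how such a scheduler can cap loaded models and
      per-model parallelism with a mutex-guarded map and buffered-channel
      semaphores. It illustrates the two knobs described above; it is not
      Ollama's actual scheduler, and a real one would also evict idle models.

      ```go
      package main

      import (
          "fmt"
          "sync"
      )

      // scheduler caps loaded models (OLLAMA_MAX_LOADED_MODELS) and
      // per-model in-flight requests (OLLAMA_NUM_PARALLEL).
      type scheduler struct {
          mu          sync.Mutex
          maxLoaded   int
          numParallel int
          runners     map[string]chan struct{} // per-model request slots
      }

      func newScheduler(maxLoaded, numParallel int) *scheduler {
          return &scheduler{
              maxLoaded:   maxLoaded,
              numParallel: numParallel,
              runners:     make(map[string]chan struct{}),
          }
      }

      // acquire reserves a request slot, spawning a runner on first use,
      // and returns a release func for when the request completes.
      func (s *scheduler) acquire(model string) (release func(), err error) {
          s.mu.Lock()
          slots, ok := s.runners[model]
          if !ok {
              if len(s.runners) >= s.maxLoaded {
                  s.mu.Unlock()
                  return nil, fmt.Errorf("max loaded models reached")
              }
              slots = make(chan struct{}, s.numParallel)
              s.runners[model] = slots // a real system spawns a runner here
          }
          s.mu.Unlock()
          slots <- struct{}{} // blocks once numParallel requests are in flight
          return func() { <-slots }, nil
      }

      func main() {
          s := newScheduler(1, 1) // the defaults described in the commit
          release, err := s.acquire("llama3")
          if err == nil {
              defer release()
              fmt.Println("handling request")
          }
      }
      ```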