- 24 May, 2024 1 commit
  - Patrick Devine authored
- 23 May, 2024 4 commits
  - Michael Yang authored
  - Daniel Hiltgen authored
    This doesn't expose a UX yet, but it wires up the initial server portion of progress reporting during load.
  - Bruce MacDonald authored
    Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
  - Jeffrey Morgan authored
    * put flash attention behind a flag for now
    * add test
    * remove print
    * up timeout for scheduler tests
- 21 May, 2024 1 commit
  - Michael Yang authored
- 20 May, 2024 6 commits
  - Michael Yang authored
  - Michael Yang authored
  - Patrick Devine authored
  - jmorganca authored
  - Josh Yan authored
  - Sam authored
    * feat: enable flash attention if supported
    * feat: add flash_attn support
- 16 May, 2024 1 commit
  - Jeffrey Morgan authored
- 15 May, 2024 3 commits
  - Daniel Hiltgen authored
    Windows already implements these; carry them over to Linux.
  - Patrick Devine authored
  - Daniel Hiltgen authored
    Only dump the env vars we care about in the logs.
- 14 May, 2024 1 commit
  - Patrick Devine authored
- 13 May, 2024 2 commits
  - Michael Yang authored
  - Michael Yang authored
- 11 May, 2024 1 commit
- 10 May, 2024 3 commits
  - Daniel Hiltgen authored
  - Michael Yang authored
  - Jeffrey Morgan authored
    * don't clamp ctx size in `PredictServerFit`
    * minimum 4 context
    * remove context warning
- 09 May, 2024 5 commits
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Bruce MacDonald authored
  - Daniel Hiltgen authored
- 08 May, 2024 2 commits
  - Daniel Hiltgen authored
    This records more GPU usage information for eventual UX inclusion.
  - Michael Yang authored
- 07 May, 2024 2 commits
  - Daniel Hiltgen authored
    This will bubble up a much more informative error message if noexec is preventing us from running the subprocess.
  - Michael Yang authored
- 06 May, 2024 5 commits
  - Michael Yang authored
  - Michael Yang authored
    - FROM /path/to/{safetensors,pytorch}
    - FROM /path/to/fp{16,32}.bin
    - FROM model:fp{16,32}
  - Daniel Hiltgen authored
    Trying to live off the land for CUDA libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly.
  - Jeffrey Morgan authored
  - Jeffrey Morgan authored
    * fix llava models not working after first request
    * individual requests only for llava models
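  The FROM forms listed in the commit above can appear in a Modelfile. A minimal sketch, assuming one of those forms; the path is a placeholder, not taken from the source:

  ```
  # Hypothetical Modelfile using one of the FROM forms above.
  # The path is a placeholder for a local fp16 weights file.
  FROM /path/to/fp16.bin
  ```

  Such a file would typically be passed to `ollama create <name> -f Modelfile` to build a local model from the referenced weights.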
- 05 May, 2024 1 commit
  - Daniel Hiltgen authored
    This moves all the env var reading into one central module and logs the loaded config once at startup, which should help in troubleshooting user server logs.
- 04 May, 2024 1 commit
  - Michael Yang authored
- 01 May, 2024 1 commit
  - Mark Ward authored