Commits · 5690e5ce995d54d31505ecc092660d74f7445d6f · OpenDAS / ollama

23 Apr, 2024 2 commits

Request and model concurrency · 34b9db5a

Daniel Hiltgen authored Mar 30, 2024

This change adds support for multiple concurrent requests, as well as
loading multiple models by spawning multiple runners. The default
settings are currently set at 1 concurrent request per model and only 1
loaded model at a time, but these can be adjusted by setting
OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.

34b9db5a

Trim spaces and quotes from llm lib override · aa72281e
Daniel Hiltgen authored Apr 22, 2024

aa72281e
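The commit title only says that spaces and quotes are trimmed from the override value. One way to do that in Go is `strings.Trim` with a cutset, sketched below. The function name `cleanOverride` and the sample values are made up for illustration, and the actual change may be implemented differently.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanOverride is a hypothetical helper that strips leading and trailing
// spaces, tabs, and single or double quotes from an override value.
// Characters in the middle of the value are left untouched.
func cleanOverride(raw string) string {
	return strings.Trim(raw, "\"' \t")
}

func main() {
	// Sample inputs are invented; they only show the trimming behavior.
	for _, raw := range []string{` "example_lib" `, `'example_lib'`, `example_lib`} {
		fmt.Printf("%q -> %q\n", raw, cleanOverride(raw))
	}
}
```

All three sample inputs reduce to the same bare value, which is the point of trimming before the value is compared or looked up.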

21 Apr, 2024 1 commit
- chore: use errors.New to replace fmt.Errorf will much better (#3789) · 62be2050
  Cheng authored Apr 21, 2024
  
  62be2050
17 Apr, 2024 4 commits
- add stablelm graph calculation · 3cf483fe
  Michael Yang authored Apr 17, 2024
  
  3cf483fe
- rearranged conditional logic for static build, dockerfile updated · 8aec92fa
  Jeremy authored Apr 17, 2024
  
  8aec92fa
- account for all non-repeating layers · a8b9b930
  Michael Yang authored Apr 17, 2024
  
  a8b9b930
- move static build to its own flag · 70261b9b
  Jeremy authored Apr 17, 2024
  
  70261b9b
16 Apr, 2024 6 commits
- fix padding to only return padding · e74163af
  Michael Yang authored Apr 15, 2024
  
  e74163af
- scale graph based on gpu count · 26df6747
  Michael Yang authored Apr 16, 2024
  
  26df6747
- Support unicode characters in model path (#3681) · 7c9792a6
  Jeffrey Morgan authored Apr 16, 2024
```
* parse wide argv characters on windows

* cleanup

* move cleanup to end of `main`
```
  7c9792a6
- darwin: no partial offloading if required memory greater than system · 41a272de
  Michael Yang authored Apr 16, 2024
  
  41a272de
- update llama.cpp submodule to `7593639` (#3665) · f3357222
  Jeffrey Morgan authored Apr 15, 2024
  
  f3357222
- fix padding in decode · 969238b1
  Michael Yang authored Apr 15, 2024
```
TODO: update padding() to _only_ returning the padding
```
  969238b1
15 Apr, 2024 2 commits
- Add llama2 / torch models for `ollama create` (#3607) · 9f8691c6
  Patrick Devine authored Apr 15, 2024
  
  9f8691c6
- Terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading (#3653) · a0b8a32e
  Jeffrey Morgan authored Apr 15, 2024
```
* terminate subprocess if receiving `SIGINT` or `SIGTERM` signals while model is loading

* use `unload` in signal handler
```
  a0b8a32e
13 Apr, 2024 1 commit
- update llama.cpp submodule to `4bd0f93` (#3627) · 309aef7f
  Jeffrey Morgan authored Apr 13, 2024
  
  309aef7f
11 Apr, 2024 1 commit
- mixtral mem · 3397eff0
  Michael Yang authored Apr 11, 2024
  
  3397eff0
10 Apr, 2024 2 commits
- partial offloading · 7e33a017
  Michael Yang authored Apr 05, 2024
  
  7e33a017
- refactor tensor query · 8b2c1006
  Michael Yang authored Apr 03, 2024
  
  8b2c1006
09 Apr, 2024 4 commits

Handle very slow model loads · c5ff443b
Daniel Hiltgen authored Apr 09, 2024
```
During testing, we're seeing some models take over 3 minutes.
```
c5ff443b

Revert "build.go: introduce a friendlier way to build Ollama (#3548)" (#3564) · 1524f323
Blake Mizerany authored Apr 09, 2024

1524f323

build.go: introduce a friendlier way to build Ollama (#3548) · fccf3eec

Blake Mizerany authored Apr 09, 2024

This commit introduces a more friendly way to build Ollama dependencies
and the binary without abusing `go generate` and removing the
unnecessary extra steps it brings with it.

This script also provides nicer feedback to the user about what is
happening during the build process.

At the end, it prints a helpful message to the user about what to do
next (e.g. run the new local Ollama).

fccf3eec

update llama.cpp submodule to `1b67731` (#3561) · 5ec12cec
Jeffrey Morgan authored Apr 09, 2024

5ec12cec

08 Apr, 2024 1 commit
- cgo quantize · 9502e566
  Michael Yang authored Apr 05, 2024
  
  9502e566
07 Apr, 2024 1 commit
- update generate scripts with new `LLAMA_CUDA` variable, set `HIP_PLATFORM` to... · 63efa075
  Jeffrey Morgan authored Apr 07, 2024
```
update generate scripts with new `LLAMA_CUDA` variable, set `HIP_PLATFORM` to avoid compiler errors (#3528)
```
  63efa075
06 Apr, 2024 1 commit
- no rope parameters · be517e49
  Michael Yang authored Apr 05, 2024
  
  be517e49
04 Apr, 2024 3 commits
- add command-r graph estimate · 01f77ae2
  Michael Yang authored Apr 04, 2024
  
  01f77ae2
- Fail fast if mingw missing on windows · 36bd9677
  Daniel Hiltgen authored Apr 04, 2024
  
  36bd9677
- fix dll compress in windows building · 4de01267
  mofanke authored Apr 04, 2024
  
  4de01267
03 Apr, 2024 3 commits
- Fix CI release glitches · e4a7e5b2
  Daniel Hiltgen authored Apr 03, 2024
```
The subprocess change moved the build directory
arm64 builds weren't setting cross-compilation flags when building on x86
```
  e4a7e5b2
- update graph size estimate · 12e923e1
  Michael Yang authored Apr 02, 2024
  
  12e923e1
- Fix macOS builds on older SDKs (#3467) · cd135317
  Jeffrey Morgan authored Apr 03, 2024
  
  cd135317
02 Apr, 2024 4 commits
- Revert options as a ref in the server · 6589eb8a
  Daniel Hiltgen authored Apr 02, 2024
  
  6589eb8a
- default head_kv to 1 · 90f071c6
  Michael Yang authored Apr 02, 2024
  
  90f071c6
- fix metal gpu · 80163ebc
  Michael Yang authored Apr 02, 2024
  
  80163ebc
- Bump to b2581 · 0035e31a
  Daniel Hiltgen authored Mar 25, 2024
  
  0035e31a
01 Apr, 2024 4 commits

Apply 01-cache.diff · 0a0e9f3e
Daniel Hiltgen authored Mar 19, 2024

0a0e9f3e

Switch back to subprocessing for llama.cpp · 58d95cc9

Daniel Hiltgen authored Mar 14, 2024

This should resolve a number of memory leak and stability defects by allowing
us to isolate llama.cpp in a separate process and shutdown when idle, and
gracefully restart if it has problems. This also serves as a first step to be
able to run multiple copies to support multiple models concurrently.

58d95cc9

update memory calculations · 91b3e4d2
Michael Yang authored Mar 18, 2024
```
count each layer independently when deciding gpu offloading
```
91b3e4d2
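Counting each layer independently, as the message puts it, can be read as walking the layers in order and adding their individual sizes until the memory budget is exhausted. The sketch below follows that reading. `layersThatFit`, the byte values, and the assumption that layers are offloaded in order are all hypothetical and may not match the actual calculation.

```go
package main

import "fmt"

// layersThatFit returns how many leading layers fit within available bytes
// when every layer's own size is counted, rather than assuming one uniform
// per-layer size.
func layersThatFit(layerSizes []uint64, available uint64) int {
	var used uint64
	for i, size := range layerSizes {
		// used never exceeds available, so the subtraction cannot underflow.
		if size > available-used {
			return i
		}
		used += size
	}
	return len(layerSizes)
}

func main() {
	// Invented sizes: the third layer is much larger than the others.
	sizes := []uint64{400, 400, 900, 400}
	fmt.Println(layersThatFit(sizes, 1300)) // prints 2
}
```

With these invented numbers the first two layers use 800 of 1300 bytes, and the 900-byte third layer no longer fits, so two layers would be offloaded.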
refactor model parsing · d338d704
Michael Yang authored Mar 13, 2024

d338d704