- 04 Jun, 2024 (2 commits)

Michael Yang authored

Michael Yang authored

- 01 Jun, 2024 (1 commit)

Michael Yang authored

* Revert "use `int32_t` for call to tokenize (#4738)". This reverts commit 763bb65d.
* Revert "vocab only". This reverts commit bf54c845.
* Revert "use ffi for tokenizing/detokenizing". This reverts commit 26a00a04.

- 31 May, 2024 (2 commits)

Jeffrey Morgan authored

* use `int32_t` for call to tokenize
* variable naming
* cleanup
* fix crash

Jeffrey Morgan authored

- 30 May, 2024 (3 commits)

Jeffrey Morgan authored

* partial offloading: allow flash attention and disable mmap (see the sketch below)
* allow mmap with num_gpu=0
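A minimal Go sketch of the offloading rule described in the bullets above, assuming hypothetical struct and function names; only the behavior (partial offload keeps flash attention but disables mmap, while num_gpu=0 keeps mmap) comes from the commit message.

```go
package main

import "fmt"

// serverParams loosely mirrors the kind of options passed to the model runner;
// the field names here are illustrative, not Ollama's actual structs.
type serverParams struct {
	NumGPU    int  // layers to offload; 0 means CPU only
	FlashAttn bool // request flash attention
	UseMmap   bool // memory-map the model file
}

// applyOffloadRules sketches the policy from the commit: with partial
// offloading, flash attention stays allowed but mmap is disabled; a
// num_gpu=0 (pure CPU) load keeps mmap as requested.
func applyOffloadRules(p serverParams, totalLayers int) serverParams {
	partial := p.NumGPU > 0 && p.NumGPU < totalLayers
	if partial {
		p.UseMmap = false // avoid paging churn when only some layers fit on the GPU
	}
	return p
}

func main() {
	p := serverParams{NumGPU: 20, FlashAttn: true, UseMmap: true}
	fmt.Printf("%+v\n", applyOffloadRules(p, 32))
}
```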
Michael Yang authored

Jeffrey Morgan authored

* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch

- 29 May, 2024 (3 commits)

Michael Yang authored

Michael Yang authored

Michael Yang authored

- 28 May, 2024 (2 commits)

Daniel Hiltgen authored

On some systems, one minute isn't sufficient to finish the load after it hits 100%. This creates two distinct timers; both are set to the same value for now so the timeouts can be refined further.
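A rough Go sketch of the two-timer idea, with a hypothetical waitForLoad helper and made-up timeout values; the commit only says there are two distinct timers, currently set to the same value.

```go
package main

import (
	"fmt"
	"time"
)

// Illustrative constants: one deadline for the load overall, a second for
// a stall after progress reaches 100%. Values are placeholders.
const (
	loadTimeout  = 5 * time.Minute
	stallTimeout = 5 * time.Minute
)

func waitForLoad(progress <-chan float64) error {
	overall := time.NewTimer(loadTimeout)
	defer overall.Stop()
	stall := time.NewTimer(stallTimeout)
	defer stall.Stop()

	for {
		select {
		case p, ok := <-progress:
			if !ok {
				return nil // load finished
			}
			// reset the stall timer on every progress update
			if !stall.Stop() {
				<-stall.C
			}
			stall.Reset(stallTimeout)
			_ = p
		case <-overall.C:
			return fmt.Errorf("model load exceeded %v", loadTimeout)
		case <-stall.C:
			return fmt.Errorf("model load stalled for %v", stallTimeout)
		}
	}
}

func main() {
	ch := make(chan float64)
	close(ch) // pretend the load completed immediately
	fmt.Println(waitForLoad(ch))
}
```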
Lei Jitang authored

Signed-off-by: Lei Jitang <leijitang@outlook.com>

- 25 May, 2024 (1 commit)

Daniel Hiltgen authored

If the client closes the connection before we finish loading the model, we abort the load; make the log message state this clearly so users can understand the failure mode.
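A hedged Go sketch of what a clearer abort message might look like; the function name, log text, and structure are assumptions, not Ollama's actual scheduler code.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"log/slog"
	"time"
)

// loadModel stands in for the load path. The point of the commit is to say
// explicitly that the load was aborted because the client went away.
func loadModel(ctx context.Context, name string) error {
	select {
	case <-time.After(2 * time.Second): // pretend the load takes a while
		return nil
	case <-ctx.Done():
		if errors.Is(ctx.Err(), context.Canceled) {
			slog.Warn("client disconnected before the model finished loading, aborting load", "model", name)
		}
		return ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	go func() { time.Sleep(100 * time.Millisecond); cancel() }() // client hangs up
	fmt.Println(loadModel(ctx, "example-model"))
}
```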
- 24 May, 2024 (4 commits)

Michael Yang authored

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

Michael Yang authored

Patrick Devine authored

Wang,Zhe authored

- 23 May, 2024 (4 commits)

Michael Yang authored

Daniel Hiltgen authored

This doesn't expose any UX yet, but it wires up the initial server-side portion of progress reporting during load.
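A small Go sketch of the idea: the loader accepts a progress callback that the server can forward to waiting clients. The LoadProgress type and loadWithProgress function are illustrative, not the actual wiring.

```go
package main

import "fmt"

// LoadProgress is a callback the server passes to the loader so it can
// forward status updates while a model is loading.
type LoadProgress func(fraction float64)

func loadWithProgress(layers int, report LoadProgress) {
	for i := 1; i <= layers; i++ {
		// ... load layer i ...
		if report != nil {
			report(float64(i) / float64(layers))
		}
	}
}

func main() {
	loadWithProgress(4, func(f float64) {
		fmt.Printf("loading: %.0f%%\n", f*100)
	})
}
```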
Bruce MacDonald authored

Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>

Jeffrey Morgan authored

* put flash attention behind a flag for now (see the sketch below)
* add test
* remove print
* raise timeout for scheduler tests
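A sketch of gating flash attention behind a flag. The environment variable name OLLAMA_FLASH_ATTENTION matches the setting Ollama documents for this feature, but the helper itself is an illustrative assumption.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// flashAttnEnabled reads the opt-in flag; the feature stays off by default
// while it is experimental.
func flashAttnEnabled() bool {
	v := os.Getenv("OLLAMA_FLASH_ATTENTION")
	if v == "" {
		return false
	}
	on, err := strconv.ParseBool(v)
	return err == nil && on
}

func main() {
	fmt.Println("flash attention:", flashAttnEnabled())
}
```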
- 21 May, 2024 (1 commit)

Michael Yang authored

- 20 May, 2024 (6 commits)

Michael Yang authored

Michael Yang authored

Patrick Devine authored

jmorganca authored

Josh Yan authored

Sam authored

* feat: enable flash attention if supported (see the sketch below)
* feat: add flash_attn support
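A Go sketch of "enable flash attention if supported": check the detected GPU backends before turning the feature on. The gpuInfo struct and the list of supported backends are assumptions for illustration only.

```go
package main

import "fmt"

// gpuInfo is a stand-in for the discovery data gathered about each GPU;
// the field name is illustrative.
type gpuInfo struct {
	Library string // e.g. "cuda", "rocm", "metal", "cpu"
}

// supportsFlashAttn enables the feature only when every detected backend is
// on the assumed-supported list.
func supportsFlashAttn(gpus []gpuInfo) bool {
	if len(gpus) == 0 {
		return false
	}
	for _, g := range gpus {
		switch g.Library {
		case "cuda", "metal":
			// assumed supported in this sketch
		default:
			return false
		}
	}
	return true
}

func main() {
	fmt.Println(supportsFlashAttn([]gpuInfo{{Library: "cuda"}}))
	fmt.Println(supportsFlashAttn([]gpuInfo{{Library: "cpu"}}))
}
```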
- 16 May, 2024 (1 commit)

Jeffrey Morgan authored

- 15 May, 2024 (3 commits)

Daniel Hiltgen authored

Windows already implements these; carry them over to Linux.

Patrick Devine authored

Daniel Hiltgen authored

Only dump the env vars we care about in the logs.
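A Go sketch of logging only an allowlist of environment variables instead of the whole environment; the variable list here is illustrative, not the exact set the server logs.

```go
package main

import (
	"log/slog"
	"os"
)

// logRelevantEnv records only variables relevant to the server, skipping
// anything else in the process environment.
func logRelevantEnv() {
	relevant := []string{
		"OLLAMA_HOST",
		"OLLAMA_MODELS",
		"OLLAMA_NUM_PARALLEL",
		"OLLAMA_MAX_LOADED_MODELS",
		"CUDA_VISIBLE_DEVICES",
	}
	attrs := make([]any, 0, len(relevant)*2)
	for _, k := range relevant {
		if v, ok := os.LookupEnv(k); ok {
			attrs = append(attrs, k, v)
		}
	}
	slog.Info("server environment", attrs...)
}

func main() {
	logRelevantEnv()
}
```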
- 14 May, 2024 (1 commit)

Patrick Devine authored

- 13 May, 2024 (2 commits)

Michael Yang authored

Michael Yang authored

- 11 May, 2024 (1 commit)

- 10 May, 2024 (3 commits)

Daniel Hiltgen authored

Michael Yang authored

Jeffrey Morgan authored

* don't clamp ctx size in `PredictServerFit` (see the sketch below)
* minimum 4 context
* remove context warning
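A tiny Go sketch of enforcing the minimum context size mentioned above; the function name is hypothetical, and only the floor of 4 comes from the commit message.

```go
package main

import "fmt"

// effectiveNumCtx applies a small floor when options are resolved, instead of
// clamping the context size inside the fit estimate.
func effectiveNumCtx(requested int) int {
	const minCtx = 4
	if requested < minCtx {
		return minCtx
	}
	return requested
}

func main() {
	fmt.Println(effectiveNumCtx(0))    // 4
	fmt.Println(effectiveNumCtx(2048)) // 2048
}
```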