- 14 Jun, 2024 5 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Still not complete; our prediction needs refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split a single layer across multiple GPUs, we can't treat free space as one logical block.
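The constraint described above (whole layers only, per-GPU free space counted separately) could be sketched roughly as follows. Function and variable names are illustrative assumptions, not Ollama's actual scheduler code, and a uniform layer size is assumed for simplicity:

```go
package main

import "fmt"

// layersPerGPU sketches the prediction described above: since a single
// layer cannot be split across GPUs, free space cannot be pooled into
// one logical block. Each GPU fits only as many whole layers as its
// own free memory allows.
func layersPerGPU(freeBytes []uint64, layerBytes uint64) []int {
	fit := make([]int, len(freeBytes))
	for i, free := range freeBytes {
		fit[i] = int(free / layerBytes) // whole layers only, no splitting
	}
	return fit
}

func main() {
	// Hypothetical example: two GPUs with 7 GiB and 3 GiB free,
	// 512 MiB per layer.
	free := []uint64{7 << 30, 3 << 30}
	fmt.Println(layersPerGPU(free, 512<<20)) // [14 6]
}
```

Pooling the free space (10 GiB → 20 layers) would overestimate capacity whenever the split leaves a partial layer's worth of memory stranded on a device.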
-
Daniel Hiltgen authored
-
- 11 Jun, 2024 2 commits
-
-
Michael Yang authored
This reverts commit f5f245cc, reversing changes made to 94d37fdc. That change broke GGUF v2, which was incorrectly detected as big endian.
-
Jeffrey Morgan authored
-
- 09 Jun, 2024 2 commits
-
-
Craig Hughes authored
Critical fix from the llama.cpp JSON grammar to forbid unescaped escape characters inside strings, which break parsing. (#3782)
-
Jeffrey Morgan authored
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
---------
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
-
- 08 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 07 Jun, 2024 3 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
This follows the same pattern as CUDA and ROCm, allowing the build to be disabled even when we detect the dependent libraries.
-
Jeffrey Morgan authored
-
- 06 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 04 Jun, 2024 4 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 01 Jun, 2024 1 commit
-
-
Michael Yang authored
* Revert "use `int32_t` for call to tokenize (#4738)"
  This reverts commit 763bb65d.
* Revert "vocab only"
  This reverts commit bf54c845.
* Revert "use ffi for tokenizing/detokenizing"
  This reverts commit 26a00a04.
-
- 31 May, 2024 2 commits
-
-
Jeffrey Morgan authored
* use `int32_t` for call to tokenize
* variable naming
* cleanup
* fix crash
-
Jeffrey Morgan authored
-
- 30 May, 2024 3 commits
-
-
Jeffrey Morgan authored
* partial offloading: allow flash attention and disable mmap
* allow mmap with num_gpu=0
-
Michael Yang authored
-
Jeffrey Morgan authored
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch
-
- 29 May, 2024 3 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 28 May, 2024 2 commits
-
-
Daniel Hiltgen authored
On some systems, one minute isn't sufficient to finish the load after it hits 100%. This creates two distinct timers; both are set to the same value for now, so we can refine the timeouts further later.
-
Lei Jitang authored
Signed-off-by: Lei Jitang <leijitang@outlook.com>
-
- 25 May, 2024 1 commit
-
-
Daniel Hiltgen authored
If the client closes the connection before we finish loading the model, we abort, so let's make the log message clearer to help users understand this failure mode.
-
- 24 May, 2024 4 commits
-
-
Michael Yang authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
Michael Yang authored
-
Patrick Devine authored
-
Wang,Zhe authored
-
- 23 May, 2024 4 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
This doesn't expose a UX yet, but wires up the initial server portion of progress reporting during load.
-
Bruce MacDonald authored
Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>
-
Jeffrey Morgan authored
* put flash attention behind flag for now
* add test
* remove print
* up timeout for scheduler tests
-
- 21 May, 2024 1 commit
-
-
Michael Yang authored
-
- 20 May, 2024 1 commit
-
-
Michael Yang authored
-