- 11 Jun, 2024 1 commit

Jeffrey Morgan authored

- 09 Jun, 2024 2 commits

Craig Hughes authored
Critical fix from llama.cpp to the JSON grammar to forbid unescaped escape characters inside strings, which break parsing. (#3782)

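For context, a minimal sketch (not from the commit itself) of why a stray, unescaped escape character inside a JSON string breaks parsing, demonstrated here with Python's `json` module rather than the llama.cpp grammar:

```python
import json

# JSON allows a backslash only as the start of a valid escape sequence
# (\" \\ \/ \b \f \n \r \t \uXXXX). A stray, unescaped backslash inside
# a string is exactly the class of input the grammar fix forbids.
bad = '{"text": "bad \\escape"}'      # JSON sees "bad \escape" -> invalid \e
try:
    json.loads(bad)
    bad_parsed = True
except json.JSONDecodeError:
    bad_parsed = False

# Doubling the backslash produces a valid escape and parses cleanly.
good = '{"text": "bad \\\\escape"}'   # JSON sees "bad \\escape"
result = json.loads(good)

print(bad_parsed)       # False: the stray backslash is rejected
print(result["text"])   # bad \escape
```

A grammar that permits the first form generates output that downstream JSON parsers reject, which is why the fix was marked critical.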
Jeffrey Morgan authored
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>

- 08 Jun, 2024 1 commit

Michael Yang authored

- 07 Jun, 2024 3 commits

Michael Yang authored

Daniel Hiltgen authored
This follows the same pattern as for cuda and rocm, allowing the build to be disabled even when we detect the dependent libraries.

Jeffrey Morgan authored

- 06 Jun, 2024 1 commit

Michael Yang authored

- 04 Jun, 2024 4 commits

Michael Yang authored

Michael Yang authored

Michael Yang authored

Michael Yang authored

- 01 Jun, 2024 1 commit

Michael Yang authored
* Revert "use `int32_t` for call to tokenize (#4738)"
  This reverts commit 763bb65d.
* Revert "vocab only"
  This reverts commit bf54c845.
* Revert "use ffi for tokenizing/detokenizing"
  This reverts commit 26a00a04.

- 31 May, 2024 2 commits

Jeffrey Morgan authored
* use `int32_t` for call to tokenize
* variable naming
* cleanup
* fix crash

Jeffrey Morgan authored

- 30 May, 2024 3 commits

Jeffrey Morgan authored
* partial offloading: allow flash attention and disable mmap
* allow mmap with num_gpu=0

Michael Yang authored

Jeffrey Morgan authored
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch

- 29 May, 2024 3 commits

Michael Yang authored

Michael Yang authored

Michael Yang authored

- 28 May, 2024 2 commits

Daniel Hiltgen authored
On some systems, 1 minute isn't sufficient to finish the load after it hits 100%. This creates 2 distinct timers; they're both set to the same value for now, so we can refine the timeouts further.

Lei Jitang authored
Signed-off-by: Lei Jitang <leijitang@outlook.com>

- 25 May, 2024 1 commit

Daniel Hiltgen authored
If the client closes the connection before we finish loading the model, we abort, so let's make the log message clearer about why, to help users understand this failure mode.

- 24 May, 2024 4 commits

Michael Yang authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

Michael Yang authored

Patrick Devine authored

Wang,Zhe authored

- 23 May, 2024 4 commits

Michael Yang authored

Daniel Hiltgen authored
This doesn't expose a UX yet, but wires up the initial server portion of progress reporting during load.

Bruce MacDonald authored
Co-authored-by: ManniX-ITA <20623405+mann1x@users.noreply.github.com>

Jeffrey Morgan authored
* put flash attention behind a flag for now
* add test
* remove print
* up timeout for scheduler tests

- 21 May, 2024 1 commit

Michael Yang authored

- 20 May, 2024 6 commits

Michael Yang authored

Michael Yang authored

Patrick Devine authored

jmorganca authored

Josh Yan authored

Sam authored
* feat: enable flash attention if supported
* feat: add flash_attn support

- 16 May, 2024 1 commit

Jeffrey Morgan authored