- 27 Jun, 2024 2 commits
-
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 25 Jun, 2024 1 commit
-
-
Blake Mizerany authored
Previously, some costly things were causing the loading of GGUF files and their metadata and tensor information to be VERY slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in a not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a macbook pro m3.

This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
-
- 20 Jun, 2024 2 commits
-
-
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.
-
Michael Yang authored
-
- 19 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 18 Jun, 2024 3 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
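A small Go sketch shows why the uniform-size assumption can mislead. The function names and layer sizes are made up for illustration and do not reflect deepseek-coder-v2 or the actual prediction code.

```go
package main

import "fmt"

// layersThatFit returns how many layers, taken in order, fit in free bytes
// when each layer's own size is counted.
func layersThatFit(layerSizes []uint64, free uint64) int {
	var used uint64
	for i, sz := range layerSizes {
		if used+sz > free {
			return i
		}
		used += sz
	}
	return len(layerSizes)
}

// uniformEstimate mimics the faulty assumption: every layer is taken to be
// as large as the first one.
func uniformEstimate(layerSizes []uint64, free uint64) int {
	if len(layerSizes) == 0 || layerSizes[0] == 0 {
		return 0
	}
	n := int(free / layerSizes[0])
	if n > len(layerSizes) {
		n = len(layerSizes)
	}
	return n
}

func main() {
	sizes := []uint64{100, 400, 400, 400} // illustrative sizes only
	free := uint64(600)
	fmt.Println("uniform estimate:", uniformEstimate(sizes, free)) // 4
	fmt.Println("per-layer count: ", layersThatFit(sizes, free))   // 2
}
```

With these numbers the uniform estimate claims all four layers fit, while summing the real sizes shows only two do.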
-
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterates to find a suitable configuration, which can be confusing since only the last log before the server starts is actually valid. This now logs once just before starting the server on the final configuration. It also reports which library is in use instead of always saying "offloading to gpu", even when using CPU.
-
- 17 Jun, 2024 5 commits
-
-
Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so default to off. This also implements a tri-state for use_mmap so we can distinguish between a user-provided value of true or false and an unspecified value.
-
Daniel Hiltgen authored
nvcc supports parallelism (threads) and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8
-
Daniel Hiltgen authored
This reverts commit 0577af98.
-
Daniel Hiltgen authored
We update the PATH on Windows to get the CLI mapped, but this has the unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.
-
Jeffrey Morgan authored
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`
-
- 15 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
Make the build faster
-
- 14 Jun, 2024 6 commits
-
-
Daniel Hiltgen authored
Implement support for GPU env var workarounds, and leverage this for the Vega RX 56, which needs HSA_ENABLE_SDMA=0 set to work properly.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Still not complete. Our prediction needs some refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
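The constraint that a layer cannot be split across GPUs can be illustrated with a short Go sketch. The greedy in-order placement and the names `assignLayers` and `pooledEstimate` are assumptions for illustration, not the scheduler's actual algorithm.

```go
package main

import "fmt"

// assignLayers places layers in order onto GPUs, filling one GPU before
// moving to the next. A layer is never split, so leftover space on a GPU
// that is too small for the next layer goes unused. It returns the number
// of layers placed on each GPU.
func assignLayers(layerSizes, gpuFree []uint64) []int {
	counts := make([]int, len(gpuFree))
	g := 0
	var used uint64
	for _, sz := range layerSizes {
		// Advance to the next GPU that can hold this whole layer.
		for g < len(gpuFree) && used+sz > gpuFree[g] {
			g++
			used = 0
		}
		if g == len(gpuFree) {
			break // no GPU left with room
		}
		used += sz
		counts[g]++
	}
	return counts
}

// pooledEstimate treats all free space as one logical block, which
// overestimates when layers do not pack evenly.
func pooledEstimate(layerSize uint64, gpuFree []uint64) int {
	var total uint64
	for _, f := range gpuFree {
		total += f
	}
	return int(total / layerSize)
}

func main() {
	layers := []uint64{300, 300, 300, 300} // illustrative sizes only
	gpus := []uint64{500, 500}
	fmt.Println("pooled estimate:", pooledEstimate(300, gpus)) // 3
	fmt.Println("per-GPU layers: ", assignLayers(layers, gpus)) // [1 1]
}
```

With two GPUs of 500 units each and 300-unit layers, the pooled view predicts three layers, but only one whole layer fits on each device.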
-
Daniel Hiltgen authored
-
- 11 Jun, 2024 2 commits
-
-
Michael Yang authored
This reverts commit f5f245cc, reversing changes made to 94d37fdc. The reverted change broke GGUF v2, which was incorrectly detected as big endian.
-
Jeffrey Morgan authored
-
- 09 Jun, 2024 2 commits
-
-
Craig Hughes authored
Critical fix from llama.cpp JSON grammar to forbid un-escaped escape characters inside strings, which breaks parsing. (#3782)
-
Jeffrey Morgan authored
* fix embedding by adding fixes from llama.cpp upstream
* remove assert

Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
-
- 08 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 07 Jun, 2024 3 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
This follows the same pattern for cuda and rocm to allow disabling the build even when we detect the dependent libraries
-
Jeffrey Morgan authored
-
- 06 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 04 Jun, 2024 4 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 01 Jun, 2024 1 commit
-
-
Michael Yang authored
* Revert "use `int32_t` for call to tokenize (#4738)"
  This reverts commit 763bb65d.
* Revert "vocab only"
  This reverts commit bf54c845.
* Revert "use ffi for tokenizing/detokenizing"
  This reverts commit 26a00a04.
-
- 31 May, 2024 2 commits
-
-
Jeffrey Morgan authored
* use `int32_t` for call to tokenize
* variable naming
* cleanup
* fix crash
-
Jeffrey Morgan authored
-
- 30 May, 2024 3 commits
-
-
Jeffrey Morgan authored
* partial offloading: allow flash attention and disable mmap
* allow mmap with num_gpu=0
-
Michael Yang authored
-
Jeffrey Morgan authored
* update llama.cpp submodule to `5921b8f089d3b7bda86aac5a66825df6a6c10603`
* add patch
-