- 05 Jul, 2024 1 commit
Jeffrey Morgan authored
* Fix assert on small embedding inputs
* Update llm/patches/09-pooling.diff
- 04 Jul, 2024 1 commit
Jeffrey Morgan authored
- 03 Jul, 2024 3 commits
royjhan authored
* openai compatibility
* Revert "openai compatibility"; this reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c
* remove erroneous subtraction of prompt cache
Daniel Hiltgen authored
When Ollama has been running for a long time, tmp cleaners can remove the runners. This tightens up a few corner cases on ARM Macs where we previously failed with "server cpu not listed in available servers map[]".
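A minimal sketch of the recovery idea: before launching, check that the extracted runner binary still exists and restore it if a tmp cleaner deleted it. The function names and extraction callback here are illustrative assumptions, not Ollama's actual API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureRunner checks that the runner binary extracted to tmpDir is still
// present, and calls extract to restore it if a tmp cleaner deleted it.
func ensureRunner(tmpDir, name string, extract func(dest string) error) (string, error) {
	path := filepath.Join(tmpDir, name)
	if _, err := os.Stat(path); os.IsNotExist(err) {
		// The tmp cleaner removed our payload; re-extract before launching.
		if err := extract(path); err != nil {
			return "", fmt.Errorf("re-extracting runner %s: %w", name, err)
		}
	}
	return path, nil
}

func main() {
	path, err := ensureRunner(os.TempDir(), "ollama-runner", func(dest string) error {
		return os.WriteFile(dest, []byte("#!/bin/sh\n"), 0o755) // stand-in for real extraction
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("runner available at", path)
}
```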
Daniel Hiltgen authored
On Windows, if the model directory contained Unicode characters, CLIP models would fail to load. This fixes the file name handling in clip.cpp to support UTF-16 on Windows.
- 01 Jul, 2024 2 commits
Josh Yan authored
Daniel Hiltgen authored
This uses nil as undefined for a cleaner implementation.
- 29 Jun, 2024 1 commit
Jeffrey Morgan authored
* Do not shift context for sliding window models
* Truncate prompts longer than 2/3 of the context window
* Only target gemma2
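A minimal sketch of the truncation idea, under assumed names: for sliding-window models such as gemma2, skip context shifting and instead keep only the most recent 2/3 of a context window's worth of prompt tokens.

```go
package main

import "fmt"

// truncatePrompt keeps the most recent tokens when the prompt exceeds
// two-thirds of the context window, rather than shifting the KV cache.
func truncatePrompt(tokens []int, numCtx int, slidingWindow bool) []int {
	if !slidingWindow {
		return tokens
	}
	limit := numCtx * 2 / 3
	if len(tokens) > limit {
		return tokens[len(tokens)-limit:]
	}
	return tokens
}

func main() {
	tokens := make([]int, 5000)
	for i := range tokens {
		tokens[i] = i
	}
	out := truncatePrompt(tokens, 4096, true)
	fmt.Printf("kept %d of %d tokens\n", len(out), len(tokens)) // kept 2730 of 5000
}
```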
- 27 Jun, 2024 2 commits
Michael Yang authored
Jeffrey Morgan authored
- 25 Jun, 2024 1 commit
Blake Mizerany authored
Previously, some costly operations were making the loading of GGUF files and their metadata and tensor information very slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and too much disk I/O
The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3. This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs; when such keys are encountered, their values are null and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.
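A sketch of both fixes under assumed types, not Ollama's actual GGUF decoder: wrap the file in a bufio.Reader so small key/value reads don't each hit disk, and decode length-prefixed strings through a reusable scratch buffer so only the final string allocates.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type decoder struct {
	r   *bufio.Reader // buffered: one syscall services many small reads
	buf []byte        // scratch buffer reused across string reads
}

// readString reads a GGUF-style length-prefixed string (uint64 length,
// little endian) through the shared scratch buffer.
func (d *decoder) readString() (string, error) {
	var n uint64
	if err := binary.Read(d.r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	if uint64(cap(d.buf)) < n {
		d.buf = make([]byte, n)
	}
	d.buf = d.buf[:n]
	if _, err := io.ReadFull(d.r, d.buf); err != nil {
		return "", err
	}
	return string(d.buf), nil // only this final conversion allocates
}

func main() {
	// Simulate a file containing one length-prefixed string.
	var raw bytes.Buffer
	binary.Write(&raw, binary.LittleEndian, uint64(5))
	raw.WriteString("llama")

	d := &decoder{r: bufio.NewReader(&raw)}
	s, err := d.readString()
	fmt.Println(s, err) // llama <nil>
}
```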
- 21 Jun, 2024 1 commit
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on GPUs with small VRAM, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
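A simplified sketch of that default selection: the KV cache grows with parallel * num_ctx, so when the user hasn't set parallelism we can walk down from a preferred value until the estimate fits in free VRAM. All sizes and names here are illustrative assumptions, not Ollama's real estimator.

```go
package main

import "fmt"

// fits estimates whether model weights plus a KV cache sized for
// parallel*numCtx tokens fit in freeVRAM (sizes in bytes).
func fits(weights, perTokenKV uint64, numCtx, parallel int, freeVRAM uint64) bool {
	kv := perTokenKV * uint64(numCtx) * uint64(parallel)
	return weights+kv <= freeVRAM
}

// defaultParallel picks the largest parallelism <= preferred that fits,
// falling back to 1 when even modest parallelism is too big.
func defaultParallel(weights, perTokenKV uint64, numCtx, preferred int, freeVRAM uint64) int {
	for p := preferred; p > 1; p-- {
		if fits(weights, perTokenKV, numCtx, p, freeVRAM) {
			return p
		}
	}
	return 1
}

func main() {
	const GiB = 1 << 30
	// 5 GiB of weights, 512 KiB of KV per token, 2048 ctx, 8 GiB free VRAM.
	p := defaultParallel(5*GiB, 512*1024, 2048, 4, 8*GiB)
	fmt.Println("chosen parallel:", p) // 3: 4 slots would need 9 GiB
}
```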
- 20 Jun, 2024 2 commits
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
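A minimal sketch of that heuristic: compare the model file's size against free memory and skip mmap when it won't fit. The free-memory value here is a stand-in for a real probe.

```go
package main

import (
	"fmt"
	"os"
)

// useMmap returns false when the model file exceeds free memory, since
// repeatedly faulting pages in from disk is slower than a straight read.
func useMmap(modelPath string, freeMemory uint64) (bool, error) {
	fi, err := os.Stat(modelPath)
	if err != nil {
		return false, err
	}
	return uint64(fi.Size()) <= freeMemory, nil
}

func main() {
	ok, err := useMmap("/tmp/model.gguf", 16<<30) // pretend 16 GiB free
	fmt.Println(ok, err)
}
```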
Michael Yang authored
- 19 Jun, 2024 1 commit
Michael Yang authored
- 18 Jun, 2024 3 commits
Michael Yang authored
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers were the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
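A sketch of the corrected arithmetic: sum each layer's actual size instead of multiplying one layer's size by the layer count. The sizes below are made up to show how uneven layers skew the old uniform estimate.

```go
package main

import "fmt"

// vramForLayers returns the memory needed to offload the first n layers,
// using per-layer sizes rather than assuming uniform layers.
func vramForLayers(layerSizes []uint64, n int) uint64 {
	var total uint64
	for _, s := range layerSizes[:n] {
		total += s
	}
	return total
}

func main() {
	layers := []uint64{900 << 20, 300 << 20, 300 << 20, 1200 << 20} // uneven sizes
	uniform := layers[0] * uint64(len(layers))                      // old assumption
	actual := vramForLayers(layers, len(layers))
	fmt.Printf("uniform estimate: %d MiB, actual: %d MiB\n", uniform>>20, actual>>20)
}
```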
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server started was actually valid. We now log once, just before starting the server on the final configuration. The log also reports which library is in use instead of always saying "offloading to gpu" when running on CPU.
- 17 Jun, 2024 5 commits
Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so it now defaults to off. This also implements a tri-state for use_mmap so we can distinguish between a user-provided value of true or false and an unspecified value.
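A sketch of the tri-state pattern the commit describes: a *bool distinguishes "user said true", "user said false", and "unspecified", so the Windows default of off only applies when the user said nothing. (This is also the "nil as undefined" idea from the 01 Jul commit above.)

```go
package main

import "fmt"

// resolveMmap applies the platform default only when userValue is nil.
func resolveMmap(userValue *bool, windowsDefaultOff bool) bool {
	if userValue != nil {
		return *userValue // an explicit user choice always wins
	}
	return !windowsDefaultOff // unspecified: fall back to the platform default
}

func main() {
	t, f := true, false
	fmt.Println(resolveMmap(&t, true))  // true: user forced mmap on
	fmt.Println(resolveMmap(&f, true))  // false: user forced mmap off
	fmt.Println(resolveMmap(nil, true)) // false: Windows default is off
}
```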
Daniel Hiltgen authored
nvcc supports parallelism (threads), and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8.
Daniel Hiltgen authored
This reverts commit 0577af98.
Daniel Hiltgen authored
We update the PATH on Windows to get the CLI mapped, but this has an unintended side effect: other apps that use our bundled DLLs can get terminated when we upgrade.
Jeffrey Morgan authored
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`
- 15 Jun, 2024 1 commit
Daniel Hiltgen authored
Make the build faster
- 14 Jun, 2024 6 commits
Daniel Hiltgen authored
Implement support for GPU env var workarounds, and leverage this for the Vega RX 56, which needs HSA_ENABLE_SDMA=0 set to work properly.
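An illustrative sketch of per-GPU env var workarounds: match the detected GPU name against a small table and apply the required variables before starting the runner. The table entry mirrors the Vega RX 56 case from the commit; the matching scheme is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

var gpuWorkarounds = map[string]map[string]string{
	"vega": {"HSA_ENABLE_SDMA": "0"}, // Vega RX 56 needs SDMA disabled
}

// applyWorkarounds sets any env vars required for the detected GPU.
func applyWorkarounds(gpuName string) {
	for substr, vars := range gpuWorkarounds {
		if strings.Contains(strings.ToLower(gpuName), substr) {
			for k, v := range vars {
				os.Setenv(k, v)
				fmt.Printf("workaround: %s=%s for %s\n", k, v, gpuName)
			}
		}
	}
}

func main() {
	applyWorkarounds("AMD Radeon RX Vega 56")
}
```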
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Still not complete; our prediction needs refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split a single layer across multiple GPUs, we can't treat free space as one logical block.
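A sketch of why free space can't be pooled: each layer must land wholly on one GPU, so layers are packed per device rather than comparing the total against the sum of free VRAM. Sizes here are illustrative.

```go
package main

import "fmt"

// layersThatFit assigns whole layers to GPUs in order and returns how many
// fit. A layer larger than every remaining per-GPU space stops the packing,
// even if the combined free space would have been enough.
func layersThatFit(layerSizes, gpuFree []uint64) int {
	fitted := 0
	gpu := 0
	for _, size := range layerSizes {
		for gpu < len(gpuFree) && gpuFree[gpu] < size {
			gpu++ // this GPU can't take another layer of this size; try the next
		}
		if gpu == len(gpuFree) {
			break
		}
		gpuFree[gpu] -= size
		fitted++
	}
	return fitted
}

func main() {
	layers := []uint64{3, 3, 3} // GiB per layer
	gpus := []uint64{5, 5}      // two GPUs with 5 GiB free each
	// Pooled free space (10 GiB) suggests all 3 layers fit, but only 2 do:
	// after one 3 GiB layer per GPU, each has 2 GiB left, too small for a layer.
	fmt.Println("layers fitted:", layersThatFit(layers, gpus))
}
```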
Daniel Hiltgen authored
- 11 Jun, 2024 2 commits
Michael Yang authored
This reverts commit f5f245cc, reversing changes made to 94d37fdc. That change broke GGUF v2, which was incorrectly detected as big endian.
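A hedged sketch of where such misdetection can go wrong: the uint32 version field follows the 4-byte GGUF magic, so a plausibility check on the little-endian reading can distinguish byte orders. This illustrates the pitfall only; it is not the exact check the reverted commit got wrong.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// detectByteOrder reads the GGUF version field and picks the byte order under
// which it is plausible (GGUF versions are small integers like 2 or 3).
func detectByteOrder(header []byte) (binary.ByteOrder, uint32, error) {
	if len(header) < 8 || !bytes.Equal(header[:4], []byte("GGUF")) {
		return nil, 0, fmt.Errorf("not a GGUF header")
	}
	le := binary.LittleEndian.Uint32(header[4:8])
	if le > 0 && le < 1000 {
		return binary.LittleEndian, le, nil
	}
	be := binary.BigEndian.Uint32(header[4:8])
	if be > 0 && be < 1000 {
		return binary.BigEndian, be, nil
	}
	return nil, 0, fmt.Errorf("unrecognized GGUF version field")
}

func main() {
	// A little-endian GGUF v2 header: "GGUF" then 02 00 00 00.
	order, version, err := detectByteOrder([]byte{'G', 'G', 'U', 'F', 2, 0, 0, 0})
	fmt.Println(order, version, err) // LittleEndian 2 <nil>
}
```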
Jeffrey Morgan authored
- 09 Jun, 2024 2 commits
Craig Hughes authored
Critical fix from the llama.cpp JSON grammar to forbid unescaped escape characters inside strings, which break parsing. (#3782)
Jeffrey Morgan authored
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
- 08 Jun, 2024 1 commit
Michael Yang authored
- 07 Jun, 2024 3 commits
Michael Yang authored
Daniel Hiltgen authored
This follows the same pattern as cuda and rocm, allowing the build to be disabled even when we detect the dependent libraries.
Jeffrey Morgan authored
- 06 Jun, 2024 1 commit
Michael Yang authored
- 04 Jun, 2024 1 commit
Michael Yang authored