- 07 Jul, 2024 (3 commits)

Jeffrey Morgan authored

Jeffrey Morgan authored

Jeffrey Morgan authored
- 06 Jul, 2024 (8 commits)

Jeffrey Morgan authored

jmorganca authored

jmorganca authored

jmorganca authored

jmorganca authored

Jeffrey Morgan authored

Jeffrey Morgan authored
* Revert "fix cmake build (#5505)". This reverts commit 4fd5f352.
* llm: fix missing dylibs by restoring old build behavior
* crlf -> lf
- 05 Jul, 2024 (6 commits)

Jeffrey Morgan authored
* llm: put back old include dir
* llm: update link paths for old submodule commits

Jeffrey Morgan authored

Jeffrey Morgan authored

Jeffrey Morgan authored

Jeffrey Morgan authored
* Use common prefix to select slot
* actually report `longest`
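
A minimal Go sketch of the idea (hypothetical names, not the actual server code): among the available slots, pick the one whose cached tokens share the longest common prefix with the incoming prompt, and report that longest value rather than the first match found.

```go
package main

import "fmt"

// commonPrefix returns how many leading tokens two sequences share.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// bestSlot picks the slot whose cache shares the longest common token
// prefix with the incoming prompt, and reports that longest match.
func bestSlot(slots [][]int, prompt []int) (best, longest int) {
	for i, cached := range slots {
		if n := commonPrefix(cached, prompt); n > longest {
			best, longest = i, n
		}
	}
	return best, longest
}

func main() {
	slots := [][]int{{1, 2, 3}, {1, 2, 9, 9}, {7}}
	best, longest := bestSlot(slots, []int{1, 2, 9, 5})
	fmt.Println(best, longest) // 1 3: slot 1 shares the longest prefix
}
```
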
Jeffrey Morgan authored
* Fix assert on small embedding inputs
* Update llm/patches/09-pooling.diff
- 04 Jul, 2024 (1 commit)

Jeffrey Morgan authored
- 03 Jul, 2024 (3 commits)

royjhan authored
* openai compatibility
* Revert "openai compatibility". This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.
* remove erroneous subtraction of prompt cache

Daniel Hiltgen authored
When Ollama has been running for a long time, tmp cleaners can remove its extracted runners. This tightens up a few corner cases on ARM Macs where we failed with "server cpu not listed in available servers map[]".
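
A minimal Go sketch of the defensive check this implies (names are hypothetical; `extractRunners` is a stand-in for the real payload extraction): verify the runner binary still exists before launching it, and re-extract if a tmp cleaner removed it.

```go
package main

import (
	"errors"
	"fmt"
	"os"
	"path/filepath"
)

// extractRunners is a hypothetical stand-in for re-extracting the
// bundled runner payloads into dir.
func extractRunners(dir string) error {
	return os.MkdirAll(dir, 0o755)
}

// runnerPath returns the path to a runner binary, re-extracting the
// payloads if a tmp cleaner removed them since startup.
func runnerPath(dir, name string) (string, error) {
	p := filepath.Join(dir, name)
	if _, err := os.Stat(p); errors.Is(err, os.ErrNotExist) {
		if err := extractRunners(dir); err != nil {
			return "", err
		}
	}
	return p, nil
}

func main() {
	p, err := runnerPath(os.TempDir(), "ollama_runner")
	fmt.Println(p, err)
}
```
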
Daniel Hiltgen authored
On Windows, if the model directory contained Unicode characters, clip models would fail to load. This fixes the file name handling in clip.cpp to support UTF-16 on Windows.
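
The root cause is that narrow (`char*`) file APIs on Windows cannot represent such paths, so names must be converted to UTF-16 and opened with the wide-character APIs. The actual fix lives in C++ in clip.cpp; the Go snippet below only illustrates the conversion step, assuming the same idea.

```go
package main

import (
	"fmt"
	"unicode/utf16"
)

func main() {
	// A model path containing non-ASCII characters.
	path := "C:\\Users\\模型\\llava.gguf"

	// Encode the UTF-8 string as UTF-16 code units, the form the
	// Windows wide-character file APIs (e.g. _wfopen) expect.
	wide := utf16.Encode([]rune(path))
	fmt.Println(len(wide), "UTF-16 code units")
}
```
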
- 01 Jul, 2024 (2 commits)

Josh Yan authored

Daniel Hiltgen authored
This uses nil as undefined for a cleaner implementation.
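
A minimal sketch of the pattern (the `Options` struct and field are assumed for illustration): with a pointer-typed field, an omitted key decodes to nil and means "undefined", which is distinct from an explicit false.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Options uses *bool so an omitted field decodes to nil ("undefined")
// rather than collapsing to false.
type Options struct {
	UseMMap *bool `json:"use_mmap,omitempty"`
}

func main() {
	for _, body := range []string{`{}`, `{"use_mmap":false}`, `{"use_mmap":true}`} {
		var o Options
		if err := json.Unmarshal([]byte(body), &o); err != nil {
			panic(err)
		}
		switch {
		case o.UseMMap == nil:
			fmt.Println(body, "=> unset: apply the default heuristic")
		case *o.UseMMap:
			fmt.Println(body, "=> user forced mmap on")
		default:
			fmt.Println(body, "=> user forced mmap off")
		}
	}
}
```

Compared with a dedicated tri-state type, the pointer form needs no custom JSON marshaling and keeps the zero value meaningful, which is presumably the "cleaner implementation" referred to above.
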
- 29 Jun, 2024 (1 commit)

Jeffrey Morgan authored
* Do not shift context for sliding window models
* truncate prompt > 2/3 tokens
* only target gemma2
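
A minimal Go sketch of the truncation rule (hypothetical names, not the actual runner code): for a sliding-window model such as gemma2, presumably because context shifting interacts badly with sliding-window attention, keep only the most recent tokens when the prompt exceeds 2/3 of the context instead of shifting later.

```go
package main

import "fmt"

// truncatePrompt keeps the most recent tokens when a prompt for a
// sliding-window model exceeds 2/3 of the context window, rather than
// relying on context shifting during generation.
func truncatePrompt(tokens []int, numCtx int, slidingWindow bool) []int {
	limit := numCtx * 2 / 3
	if !slidingWindow || len(tokens) <= limit {
		return tokens
	}
	return tokens[len(tokens)-limit:]
}

func main() {
	tokens := make([]int, 100)
	for i := range tokens {
		tokens[i] = i
	}
	out := truncatePrompt(tokens, 90, true)
	fmt.Println(len(out), out[0]) // 60 40: only the newest 60 tokens remain
}
```
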
- 27 Jun, 2024 (2 commits)

Michael Yang authored

Jeffrey Morgan authored

- 25 Jun, 2024 (1 commit)

Blake Mizerany authored
Previously, some costly operations made loading GGUF files and their metadata and tensor information VERY slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an unreasonable amount of syscalls/disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also optionally prevents collecting large arrays of values when decoding GGUFs. When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
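
A minimal sketch of the buffering half of the fix (not the actual decoder): wrap the file in a bufio.Reader so the thousands of small key/value and tensor-info reads are served from memory rather than each issuing its own syscall.

```go
package main

import (
	"bufio"
	"encoding/binary"
	"fmt"
	"os"
)

func main() {
	f, err := os.Open("model.gguf")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// One large buffered reader amortizes thousands of tiny reads into
	// a few big ones; previously each key/value read could hit the disk.
	br := bufio.NewReaderSize(f, 1<<20)

	var magic uint32
	if err := binary.Read(br, binary.LittleEndian, &magic); err != nil {
		panic(err)
	}
	fmt.Printf("magic: %#x\n", magic) // 0x46554747 ("GGUF") for valid files
}
```
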
- 21 Jun, 2024 (1 commit)

Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so when parallelism is not explicitly set by the user, this change also refines the algorithm to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
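
A minimal sketch of that selection (candidate values and names are assumptions, not the actual algorithm): when the user has not set parallelism, try decreasing values until the memory estimate, which grows with parallelism via num_ctx, fits the available VRAM.

```go
package main

import "fmt"

// pickParallel returns the largest candidate parallelism whose VRAM
// estimate fits, falling back to 1. The estimate grows with parallelism
// because num_ctx is scaled by the number of parallel requests.
func pickParallel(estimate func(parallel int) uint64, freeVRAM uint64) int {
	for _, p := range []int{4, 2, 1} {
		if estimate(p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	// Toy estimate: 4 GiB of weights plus 512 MiB of KV cache per slot.
	estimate := func(p int) uint64 { return 4<<30 + uint64(p)*(512<<20) }
	fmt.Println(pickParallel(estimate, 5<<30)) // prints 2
}
```
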
- 20 Jun, 2024 (2 commits)

Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
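
A minimal sketch of the resulting default (hypothetical names): an explicit user setting always wins, and otherwise mmap is chosen only when the model fits in currently free memory.

```go
package main

import "fmt"

// useMMap decides the mmap default: an explicit user choice wins, and
// otherwise mmap is used only if the model fits in free system memory,
// since mmap-ing a model larger than free memory loads more slowly.
func useMMap(modelSize, freeMemory uint64, userSet *bool) bool {
	if userSet != nil {
		return *userSet
	}
	return modelSize <= freeMemory
}

func main() {
	forceOn := true
	fmt.Println(useMMap(8<<30, 4<<30, nil))      // false: model > free RAM
	fmt.Println(useMMap(8<<30, 4<<30, &forceOn)) // true: user override
}
```
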
Michael Yang authored

- 19 Jun, 2024 (1 commit)

Michael Yang authored

- 18 Jun, 2024 (3 commits)

Michael Yang authored

Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
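
A minimal sketch of the corrected accounting (hypothetical names): sum the actual size of each offloaded layer rather than multiplying one layer's size by the layer count.

```go
package main

import "fmt"

// vramForLayers sums the real size of each layer chosen for offload.
// Assuming uniform layers breaks on models like deepseek-coder-v2,
// where per-layer sizes differ significantly.
func vramForLayers(layerSizes []uint64, offload int) uint64 {
	var total uint64
	for _, s := range layerSizes[:offload] {
		total += s
	}
	return total
}

func main() {
	layers := []uint64{300 << 20, 300 << 20, 900 << 20} // non-uniform sizes
	fmt.Println(vramForLayers(layers, 3) >> 20)         // 1500 MiB, not 3*300
}
```
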
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. This now logs once, just before starting the server, on the final configuration. It also reports which library is in use instead of always saying "offloading to gpu" when running on CPU.

- 17 Jun, 2024 (5 commits)

Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so default it to off. This also implements a tri-state for use_mmap so we can detect the difference between a user-provided true/false and an unspecified setting.

Daniel Hiltgen authored
nvcc supports parallelism (threads), and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8.

Daniel Hiltgen authored
This reverts commit 0577af98.

Daniel Hiltgen authored
We update the PATH on Windows to get the CLI mapped, but this has an unintended side effect: other apps that use our bundled DLLs can be terminated when we upgrade.

Jeffrey Morgan authored
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`

- 15 Jun, 2024 (1 commit)

Daniel Hiltgen authored
Make the build faster.