- 10 Jul, 2024 3 commits
-
-
Jeffrey Morgan authored
-
Daniel Hiltgen authored
This also adjusts our algorithm to favor our bundled ROCm. I've confirmed VRAM reporting still doesn't work properly so we can't yet enable concurrency by default.
-
Daniel Hiltgen authored
-
- 09 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
This makes sure we statically link the C++ and thread libraries on Windows to avoid unnecessary runtime dependencies on non-standard DLLs.
-
- 08 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
Enable the build flag for llama.cpp to use CPU copy for multi-GPU scenarios.
-
- 07 Jul, 2024 4 commits
-
-
Jeffrey Morgan authored
llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535)
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 06 Jul, 2024 8 commits
-
-
Jeffrey Morgan authored
-
jmorganca authored
-
jmorganca authored
-
jmorganca authored
-
jmorganca authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
* Revert "fix cmake build (#5505)" This reverts commit 4fd5f352.
* llm: fix missing dylibs by restoring old build behavior
* crlf -> lf
-
- 05 Jul, 2024 7 commits
-
-
Jeffrey Morgan authored
* llm: put back old include dir
* llm: update link paths for old submodule commits
-
Jeffrey Morgan authored
-
Michael Yang authored
ensure runtime model changes (template, system prompt, messages, options) are captured on model updates without needing to reload the server
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
* Use common prefix to select slot
* actually report `longest`
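The slot selection named in this commit can be illustrated with a minimal Python sketch. It is not the project's code: `common_prefix_len`, `select_slot`, and the `slots` mapping are hypothetical names, and the sketch assumes each slot caches a list of token ids and that the slot sharing the longest prefix with the new prompt is preferred.

```python
def common_prefix_len(a, b):
    """Number of leading tokens shared by sequences a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def select_slot(slots, prompt):
    """Pick the slot whose cached tokens share the longest prefix with prompt.

    slots: dict mapping a slot id to its cached token list (assumed shape).
    Returns (slot_id, longest); (None, -1) when there are no slots.
    Ties go to the first slot encountered.
    """
    best_id, longest = None, -1
    for slot_id, cached in slots.items():
        n = common_prefix_len(cached, prompt)
        if n > longest:
            best_id, longest = slot_id, n
    return best_id, longest
```

Returning `longest` alongside the slot id lets the caller log how much of the cache is reusable, which is the value the second bullet says is now reported.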
-
Jeffrey Morgan authored
* Fix assert on small embedding inputs
* Update llm/patches/09-pooling.diff
-
- 04 Jul, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 03 Jul, 2024 3 commits
-
-
royjhan authored
* openai compatibility
* Revert "openai compatibility" This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.
* remove erroneous subtraction of prompt cache
-
Daniel Hiltgen authored
When ollama has been running for a long time, tmp cleaners can remove the runners. This tightens up a few corner cases on ARM Macs where we failed with "server cpu not listed in available servers map[]".
-
Daniel Hiltgen authored
On Windows, if the model dir contained Unicode characters, CLIP models would fail to load. This fixes the file name handling in clip.cpp to support UTF-16 on Windows.
-
- 01 Jul, 2024 2 commits
-
-
Josh Yan authored
-
Daniel Hiltgen authored
This uses nil as undefined for a cleaner implementation.
-
- 29 Jun, 2024 1 commit
-
-
Jeffrey Morgan authored
* Do not shift context for sliding window models
* truncate prompt > 2/3 tokens
* only target gemma2
-
- 27 Jun, 2024 2 commits
-
-
Michael Yang authored
-
Jeffrey Morgan authored
-
- 25 Jun, 2024 1 commit
-
-
Blake Mizerany authored
Previously, two costly behaviors made loading GGUF files and their metadata and tensor information very slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
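The two ideas in this message, buffered reads and skipping oversized arrays, can be sketched in Python as follows. This is an illustration only, not the project's decoder: it assumes a simplified layout (a little-endian uint64 count followed by that many uint32 values) rather than the full GGUF encoding, and `read_u32_array` and `max_items` are hypothetical names.

```python
import io
import struct


def read_u32_array(reader, max_items):
    """Read a count-prefixed array of little-endian uint32 values.

    When the array holds more than max_items elements, the payload is
    skipped with a relative seek and None is returned, so no large list
    is ever built. A negative max_items disables the limit.
    """
    (count,) = struct.unpack("<Q", reader.read(8))
    if 0 <= max_items < count:
        reader.seek(count * 4, io.SEEK_CUR)  # skip without decoding
        return None
    return list(struct.unpack(f"<{count}I", reader.read(count * 4)))
```

Wrapping the underlying file in `io.BufferedReader` means many small reads are served from memory instead of each one reaching the disk. A skipped array comes back as `None`, which `json.dumps` encodes as `null`.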
-
- 21 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
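One way to picture the "find a reasonable default that fits" step is the Python sketch below. Everything in it is an assumption made for illustration: the candidate values, the linear memory estimate, the fallback to 1, and the names `pick_parallel`, `weights_bytes`, and `kv_bytes_per_token`. The scheduler's real estimate is more involved.

```python
def pick_parallel(free_vram, weights_bytes, kv_bytes_per_token, num_ctx,
                  candidates=(4, 2, 1)):
    """Return the largest candidate parallel value whose estimate fits in VRAM.

    Assumes the context is scaled by the parallel count, so memory grows as
    weights + kv_bytes_per_token * num_ctx * parallel (a simplified model).
    Falls back to 1 when no candidate fits.
    """
    for parallel in candidates:
        needed = weights_bytes + kv_bytes_per_token * num_ctx * parallel
        if needed <= free_vram:
            return parallel
    return 1
```

Trying candidates from largest to smallest keeps the default as concurrent as the GPU allows while backing off on small-VRAM cards.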
-
- 20 Jun, 2024 2 commits
-
-
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.
-
Michael Yang authored
-
- 19 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 18 Jun, 2024 2 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
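A small Python sketch shows why a same-size assumption can mislead. It is hypothetical and far simpler than the project's estimator: `layers_that_fit` sums actual per-layer sizes in order, while `layers_that_fit_uniform` assumes every layer has the average size.

```python
def layers_that_fit(layer_sizes, free_vram):
    """Count leading layers that fit, using each layer's actual size."""
    used = 0
    count = 0
    for size in layer_sizes:
        if used + size > free_vram:
            break
        used += size
        count += 1
    return count


def layers_that_fit_uniform(layer_sizes, free_vram):
    """Same count under the (incorrect) assumption of equal-sized layers."""
    avg = sum(layer_sizes) // len(layer_sizes)
    return min(len(layer_sizes), free_vram // avg)
```

With sizes `[400, 400, 100, 100]` and 500 units free, the per-layer count is 1 while the uniform estimate is 2, so the uniform model would try to place a layer that does not fit.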
-