- 21 Sep, 2024 1 commit
Daniel Hiltgen authored
When running the subprocess as a background service, Windows may throttle it, which can lead to thrashing and a very poor token rate.
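
A minimal sketch of one possible mitigation, assuming the runner is started via os/exec on Windows and that raising the child's priority class is the chosen approach; `startRunner` and its arguments are illustrative names, and the actual change may use a different mechanism:

```go
//go:build windows

package runner

import (
	"os/exec"
	"syscall"

	"golang.org/x/sys/windows"
)

// startRunner launches the subprocess with an explicit priority class so
// Windows does not treat it as a throttleable background task.
func startRunner(exe string, args ...string) (*exec.Cmd, error) {
	cmd := exec.Command(exe, args...)
	cmd.SysProcAttr = &syscall.SysProcAttr{
		// Hypothetical mitigation: an elevated priority class keeps the
		// runner out of the background/efficiency bucket.
		CreationFlags: windows.ABOVE_NORMAL_PRIORITY_CLASS,
	}
	return cmd, cmd.Start()
}
```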
- 20 Sep, 2024 1 commit
Daniel Hiltgen authored
* Unified arm/x86 windows installer

  This adjusts the installer payloads to be architecture aware so we can carry both amd64 and arm64 binaries in the installer, and install only the applicable architecture at install time.

* Include arm64 in official windows build

* Harden schedule test for slow windows timers

  This test seems to be a bit flaky on windows, so give it more time to converge.
- 18 Sep, 2024 1 commit
Michael Yang authored
- 17 Sep, 2024 1 commit
Michael Yang authored
Raw diffs can be applied using `git apply` but not with `git am`. Git patches, e.g. those produced by `git format-patch`, are both apply-able and am-able.
- 13 Sep, 2024 1 commit
Daniel Hiltgen authored
scripts: fix incremental builds on linux or similar
- 12 Sep, 2024 2 commits
Daniel Hiltgen authored
Corrects x86_64 vs amd64 discrepancy
Daniel Hiltgen authored
* Optimize container images for startup

  This change adjusts how runner payloads are handled to support container builds where we keep them extracted in the filesystem. This makes it easier to optimize the cpu/cuda vs cpu/rocm images for size, and should result in faster startup times for container images.

* Refactor payload logic and add buildx support for faster builds

* Move payloads around

* Review comments

* Converge to buildx based helper scripts

* Use docker buildx action for release
- 11 Sep, 2024 1 commit
Jesse Gross authored
If there are any pending responses (such as text held back while checking for potential stop tokens), we should send them back before ending the sequence. Otherwise, tokens at the end of a response can be lost. Fixes #6707
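
A minimal sketch of the idea, assuming the runner buffers text while checking for stop tokens; the `sequence` type and its fields are illustrative stand-ins, not the actual runner code:

```go
package runner

import "strings"

// sequence is an illustrative stand-in for a generation in progress.
type sequence struct {
	pendingResponses []string    // text withheld while checking for stop tokens
	responses        chan string // text streamed back to the client
}

// finish flushes anything still pending before the stream is closed, so the
// tail of the response is not dropped.
func (s *sequence) finish() {
	if len(s.pendingResponses) > 0 {
		s.responses <- strings.Join(s.pendingResponses, "")
		s.pendingResponses = s.pendingResponses[:0]
	}
	close(s.responses)
}
```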
- 10 Sep, 2024 1 commit
Daniel Hiltgen authored
* Quiet down Docker's new lint warnings

  Docker has recently added lint warnings to builds. This cleans up those warnings.

* Fix go lint regression
- 06 Sep, 2024 1 commit
Daniel Hiltgen authored
When we determine a GPU is too small for any layers, it's not always clear why. This will help troubleshoot those scenarios.
- 05 Sep, 2024 2 commits
Daniel Hiltgen authored
With the new very large parameter models, some users are willing to wait a very long time for models to load.
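
A sketch of how such a timeout override might look, assuming it is exposed through an environment variable; the variable name `OLLAMA_LOAD_TIMEOUT` and the default below are assumptions for illustration, not confirmed by this log:

```go
package config

import (
	"os"
	"time"
)

// loadTimeout returns how long to wait for a model to load, letting users
// with very large models extend the default.
func loadTimeout() time.Duration {
	const fallback = 5 * time.Minute // assumed default, illustrative only
	if v := os.Getenv("OLLAMA_LOAD_TIMEOUT"); v != "" {
		if d, err := time.ParseDuration(v); err == nil && d > 0 {
			return d
		}
	}
	return fallback
}
```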
Daniel Hiltgen authored
Provide a mechanism for users to set aside an amount of VRAM on each GPU to make room for other applications they want to start after Ollama, or to work around memory prediction bugs.
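
In terms of the accounting this implies, the scheduler would subtract a user-configured reservation from each GPU's free VRAM before placing layers. A simplified sketch; the function and parameter names are illustrative, and the actual knob and units may differ:

```go
package sched

// usableVRAM returns the VRAM the scheduler may plan against after setting
// aside a user-requested reservation for other applications.
func usableVRAM(freeVRAM, reserved uint64) uint64 {
	if reserved >= freeVRAM {
		return 0 // nothing left for model layers on this GPU
	}
	return freeVRAM - reserved
}
```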
- 04 Sep, 2024 2 commits
Pascal Patry authored
Jeffrey Morgan authored
- 03 Sep, 2024 2 commits
Daniel Hiltgen authored
On systems with low system memory, we can hit allocation failures that are difficult to diagnose without debug logs. This will make it easier to spot.
FellowTraveler authored
/Users/au/src/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.
- 29 Aug, 2024 1 commit
Michael Yang authored
- 27 Aug, 2024 1 commit
Sean Khatiri authored
- 25 Aug, 2024 1 commit
Daniel Hiltgen authored
The numa flag may have a performance impact on multi-socket systems with GPU loads.
- 23 Aug, 2024 2 commits
Patrick Devine authored
Daniel Hiltgen authored
The define changed recently, and this usage slipped through the cracks with the old name.
- 22 Aug, 2024 1 commit
Daniel Hiltgen authored
* Fix embeddings memory corruption

  The patch was leading to a buffer overrun corruption. Once removed, though, parallelism in server.cpp led to hitting an assert due to slot/seq IDs being >= token count. To work around this, only use slot 0 for embeddings.

* Fix embed integration test assumption

  The token eval count has changed with recent llama.cpp bumps (0.3.5+).
- 21 Aug, 2024 1 commit
Michael Yang authored
- 20 Aug, 2024 1 commit
Daniel Hiltgen authored
We're over budget for github's maximum release artifact size with rocm + 2 cuda versions. This splits rocm back out as a discrete artifact, but keeps the layout so it can be extracted into the same location as the main bundle.
- 19 Aug, 2024 6 commits
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
This adds new arm64 variants specific to Jetson platforms.
Daniel Hiltgen authored
This should help speed things up a little
Daniel Hiltgen authored
This adjusts Linux to follow a similar model to Windows, with a discrete archive (zip/tgz) to carry the primary executable and dependent libraries. Runners are still carried as payloads inside the main binary. Darwin retains the payload model where the Go binary is fully self-contained.
- 12 Aug, 2024 1 commit
Michael Yang authored
- 11 Aug, 2024 2 commits
Jeffrey Morgan authored
For simplicity, parallelize embedding requests in the API handler instead of offloading this to the subprocess runner. This keeps the scheduling story simpler, as it builds on existing parallel requests, similar to the existing text completion functionality.
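
A sketch of what fanning out in the handler could look like, using errgroup; `embedOne` is a placeholder for whatever single-input embedding call the runner already exposes, not an actual API:

```go
package api

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// embedAll runs one embedding request per input concurrently in the API
// layer, reusing the existing single-input path (embedOne) underneath.
func embedAll(ctx context.Context, inputs []string,
	embedOne func(context.Context, string) ([]float32, error)) ([][]float32, error) {
	results := make([][]float32, len(inputs))
	g, ctx := errgroup.WithContext(ctx)
	for i, in := range inputs {
		i, in := i, in // capture loop variables for the goroutine
		g.Go(func() error {
			emb, err := embedOne(ctx, in)
			if err != nil {
				return err
			}
			results[i] = emb
			return nil
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	return results, nil
}
```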
Daniel Hiltgen authored
Don't allow loading models that would lead to memory exhaustion (across vram, system memory and disk paging). This check was already applied on Linux and should be applied on Windows as well.
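
In spirit, the check compares the model's estimated footprint against memory the machine can actually back without paging. A deliberately simplified sketch; the real estimate also accounts for per-GPU splits, KV cache, and graph overhead:

```go
package sched

// fitsInMemory reports whether an estimated model footprint can be backed by
// available VRAM plus free system memory, without relying on disk paging.
func fitsInMemory(estimatedBytes, freeVRAM, freeSystem uint64) bool {
	return estimatedBytes <= freeVRAM+freeSystem
}
```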
- 08 Aug, 2024 1 commit
Michael Yang authored
- 07 Aug, 2024 1 commit
Jeffrey Morgan authored
- 06 Aug, 2024 1 commit
Jeffrey Morgan authored
- 05 Aug, 2024 4 commits
royjhan authored
Daniel Hiltgen authored
If the system has multiple numa nodes, enable numa support in llama.cpp. If we detect numactl in the path, use that; otherwise use the basic "distribute" mode.
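
A sketch of the detection described above, assuming the chosen mode is passed through to llama.cpp's `--numa` option (which accepts values such as `distribute`, `isolate`, and `numactl`); how the real code wires this up may differ:

```go
package sched

import "os/exec"

// numaMode picks the numactl-aware mode when the numactl binary is on PATH,
// and falls back to the basic "distribute" mode otherwise.
func numaMode() string {
	if _, err := exec.LookPath("numactl"); err == nil {
		return "numactl"
	}
	return "distribute"
}

// A caller might then append: args = append(args, "--numa", numaMode())
```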
Daniel Hiltgen authored
Michael Yang authored