- 02 Aug, 2024 1 commit
-
-
Michael Yang authored
-
- 31 Jul, 2024 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 30 Jul, 2024 2 commits
-
-
royjhan authored
* add prompt tokens to embed response
* rm slog
* metrics
* types
* prompt n
* clean up
* reset submodule
* update tests
* test name
* list metrics
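As a rough illustration of the first item above, an embed response can carry a prompt token count alongside the embeddings. The field and type names below are assumptions for the sketch, not necessarily the project's exact API types.

```go
// Hypothetical shape of an embed response that reports prompt token metrics.
package api

type EmbedResponse struct {
	Model           string      `json:"model"`
	Embeddings      [][]float32 `json:"embeddings"`
	TotalDuration   int64       `json:"total_duration,omitempty"`
	LoadDuration    int64       `json:"load_duration,omitempty"`
	PromptEvalCount int         `json:"prompt_eval_count,omitempty"` // total prompt tokens across inputs
}
```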
-
Daniel Hiltgen authored
In multi-brand GPU setups, if we couldn't fully load the model we would fall through the scheduler and mistakenly try to load across a mix of brands. This makes sure we find the set of GPU(s) that best fits the partial load.
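A minimal sketch of the brand-grouping idea described above, with made-up type and function names; the real scheduler's selection logic is more involved.

```go
package main

import "fmt"

// GPU is an illustrative stand-in for the scheduler's GPU info.
type GPU struct {
	Library string // e.g. "cuda" or "rocm"
	FreeMiB uint64
}

// bestSingleBrand groups GPUs by library/brand and returns the group with the
// most free memory, so a partial load never spans a mix of brands.
func bestSingleBrand(gpus []GPU) []GPU {
	groups := map[string][]GPU{}
	for _, g := range gpus {
		groups[g.Library] = append(groups[g.Library], g)
	}
	var best []GPU
	var bestFree uint64
	for _, group := range groups {
		var free uint64
		for _, g := range group {
			free += g.FreeMiB
		}
		if free > bestFree {
			bestFree, best = free, group
		}
	}
	return best
}

func main() {
	gpus := []GPU{{"cuda", 8192}, {"rocm", 12288}, {"cuda", 8192}}
	fmt.Println(bestSingleBrand(gpus)) // the two CUDA GPUs (16384 MiB total)
}
```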
-
- 22 Jul, 2024 1 commit
-
-
Michael Yang authored
-
- 21 Jul, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 15 Jul, 2024 1 commit
-
-
royjhan authored
* Initial Batch Embedding
* Revert "Initial Batch Embedding"
  This reverts commit c22d54895a280b54c727279d85a5fc94defb5a29.
* Initial Draft
* mock up notes
* api/embed draft
* add server function
* check normalization
* clean up
* normalization
* playing around with truncate stuff
* Truncation
* Truncation
* move normalization to go
* Integration Test Template
* Truncation Integration Tests
* Clean up
* use float32
* move normalize
* move normalize test
* refactoring
* integration float32
* input handling and handler testing
* Refactoring of legacy and new
* clear comments
* merge conflicts
* touches
* embedding type 64
* merge conflicts
* fix hanging on single string
* refactoring
* test values
* set context length
* clean up
* testing clean up
* testing clean up
* remove function closure
* Revert "remove function closure"
  This reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787.
* remove function closure
* remove redundant error check
* clean up
* more clean up
* clean up
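A small sketch of the kind of L2 normalization the items above refer to ("move normalization to go", "use float32"); this is illustrative, not the exact server implementation.

```go
package main

import (
	"fmt"
	"math"
)

// normalize returns v scaled to unit length (L2 norm), accumulating in float64
// for a bit of extra precision before converting back to float32.
func normalize(v []float32) []float32 {
	var sum float64
	for _, x := range v {
		sum += float64(x) * float64(x)
	}
	norm := math.Sqrt(sum)
	if norm == 0 {
		return v // avoid dividing by zero for an all-zero vector
	}
	out := make([]float32, len(v))
	for i, x := range v {
		out[i] = float32(float64(x) / norm)
	}
	return out
}

func main() {
	fmt.Println(normalize([]float32{3, 4})) // [0.6 0.8]
}
```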
-
- 09 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
This breaks up some of the test scenarios to create a more reliable set of tests, as well as adding a little more coverage.
-
- 03 Jul, 2024 2 commits
-
-
Daniel Hiltgen authored
This change fixes the handling of keep_alive so that if a client request omits the setting, we only apply it on the initial load. Once the model is loaded, if new requests leave keep_alive unset, we keep whatever keep_alive was already in effect.
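A simplified sketch of that rule with invented names, assuming a 5-minute default for illustration: an explicit value from the request always wins; otherwise the default applies only on the initial load, and later requests inherit the current value.

```go
package main

import (
	"fmt"
	"time"
)

const defaultKeepAlive = 5 * time.Minute // illustrative default

// resolveKeepAlive returns the keep-alive to apply for a request. requested is
// nil when the client omitted the setting; current is the value already applied
// to a loaded model (nil when the model is not loaded yet).
func resolveKeepAlive(requested, current *time.Duration) time.Duration {
	if requested != nil {
		return *requested // an explicit request always wins
	}
	if current != nil {
		return *current // model already loaded: keep whatever was there
	}
	return defaultKeepAlive // initial load with nothing specified
}

func main() {
	fmt.Println(resolveKeepAlive(nil, nil)) // 5m0s: default applied only on initial load
}
```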
-
Daniel Hiltgen authored
Users may not realize that the shiny new model they're trying to load fits on their disk but can't fit into system+GPU memory. Today we crash; with this fix, we'll give them a better error message before even trying to load it.
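An illustrative pre-check matching that idea (names and numbers are made up, not the project's code): compare the model's estimated memory need against system RAM plus total GPU VRAM and return a clear error instead of attempting a load that will fail.

```go
package main

import "fmt"

// checkFits returns an error when the model cannot fit into the combined
// system and GPU memory, so the caller can report it before loading.
func checkFits(modelBytes, systemBytes, vramBytes uint64) error {
	if need, have := modelBytes, systemBytes+vramBytes; need > have {
		return fmt.Errorf("model requires %d MiB but only %d MiB of system+GPU memory is available",
			need>>20, have>>20)
	}
	return nil
}

func main() {
	if err := checkFits(64<<30, 16<<30, 24<<30); err != nil {
		fmt.Println(err) // reported before any load is attempted
	}
}
```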
-
- 25 Jun, 2024 1 commit
-
-
Blake Mizerany authored
Previously, some costly things were making the loading of GGUF files and their metadata and tensor information VERY slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also prevents collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
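A rough outline of the two ideas above, using invented helpers: wrap the file in a buffered reader so each key/value read doesn't hit the disk, and skip the contents of very large arrays, recording nil so they encode as JSON null. This is a sketch of the approach, not the real decoder.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
)

const maxArrayItems = 1024 // beyond this, don't materialize the values

// decodeMetadata is an illustrative outline of a faster metadata pass.
func decodeMetadata(path string) (map[string]any, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	defer f.Close()

	r := bufio.NewReaderSize(f, 1<<20) // buffered: not one syscall per key/value

	kv := map[string]any{}
	// ... read keys and values from r; when an array of n items is found:
	//
	//   if n > maxArrayItems {
	//       skipArray(r, n)   // advance past the data without allocating
	//       kv[key] = nil     // marshals as JSON null
	//   } else {
	//       kv[key] = readArray(r, n)
	//   }
	_ = r
	return kv, nil
}

func main() {
	kv, err := decodeMetadata("model.gguf") // hypothetical path
	fmt.Println(kv, err)
}
```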
-
- 21 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
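A toy illustration of that selection, with made-up numbers and a stand-in estimator; the real prediction accounts for far more (KV cache layout, graph size, per-GPU overhead).

```go
package main

import "fmt"

// estimate is a stand-in for the real memory estimator: each parallel slot
// multiplies the effective context, and the KV cache grows with it.
func estimate(modelBytes uint64, numCtx, parallel int) uint64 {
	const bytesPerCtxToken = 512 * 1024 // made-up KV-cache cost per token
	return modelBytes + uint64(numCtx*parallel)*bytesPerCtxToken
}

// pickParallel walks down from a preferred value until the estimated
// footprint fits in available VRAM, unless the user set parallel explicitly.
func pickParallel(modelBytes, freeVRAM uint64, numCtx, userParallel int) int {
	if userParallel > 0 {
		return userParallel // explicit setting always wins
	}
	for p := 4; p > 1; p-- {
		if estimate(modelBytes, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	fmt.Println(pickParallel(5<<30, 8<<30, 2048, 0)) // 3: the largest value that fits here
}
```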
-
- 14 Jun, 2024 4 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Still not complete; this needs some refinement to our prediction so it understands each discrete GPU's available space and how many layers fit in each one. Since we can't split one layer across multiple GPUs, we can't treat free space as one logical block.
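A minimal sketch of the per-GPU accounting this is pointing toward (illustrative names and numbers, not the real estimator): whole layers have to be counted per device rather than against a pooled free-space total.

```go
package main

import "fmt"

// layersPerGPU counts whole layers per GPU, because a single layer
// can't be split across devices.
func layersPerGPU(freeBytes []uint64, layerBytes uint64) []int {
	fits := make([]int, len(freeBytes))
	for i, free := range freeBytes {
		fits[i] = int(free / layerBytes) // whole layers only
	}
	return fits
}

func main() {
	// Two GPUs with 3.5 GiB free and 1 GiB layers: 3 + 3 layers fit,
	// not the 7 you'd get by treating the 7 GiB total as one logical block.
	free := []uint64{3<<30 + 512<<20, 3<<30 + 512<<20}
	fmt.Println(layersPerGPU(free, 1<<30)) // [3 3]
}
```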
-
Jeffrey Morgan authored
-
- 04 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 24 May, 2024 1 commit
-
-
Patrick Devine authored
-
- 23 May, 2024 1 commit
-
-
Jeffrey Morgan authored
* put flash attention behind flag for now
* add test
* remove print
* up timeout for scheduler tests
-
- 14 May, 2024 1 commit
-
-
Patrick Devine authored
-
- 06 May, 2024 2 commits
-
-
Daniel Hiltgen authored
The model processing was recently changed to be deferred, but this test scenario hadn't been adjusted for that change in behavior.
-
Jeffrey Morgan authored
-
- 05 May, 2024 1 commit
-
-
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup, which should help when troubleshooting user server logs.
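A minimal sketch of that pattern: the Config fields, default values, and helper below are assumptions for illustration, not the module's actual contents.

```go
package main

import (
	"log/slog"
	"os"
)

// Config holds a couple of illustrative settings; the real module covers many more.
type Config struct {
	Host        string
	NumParallel string
}

// loadConfig centralizes env var reads with fallbacks to defaults.
func loadConfig() Config {
	get := func(key, def string) string {
		if v := os.Getenv(key); v != "" {
			return v
		}
		return def
	}
	return Config{
		Host:        get("OLLAMA_HOST", "127.0.0.1:11434"),
		NumParallel: get("OLLAMA_NUM_PARALLEL", "1"),
	}
}

func main() {
	cfg := loadConfig()
	// Log the resolved configuration once at startup so it shows up in server logs.
	slog.Info("server config", "OLLAMA_HOST", cfg.Host, "OLLAMA_NUM_PARALLEL", cfg.NumParallel)
}
```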
-
- 03 May, 2024 1 commit
-
-
Daniel Hiltgen authored
This gives us more headroom on the scheduler tests to tamp down some flakes.
-
- 28 Apr, 2024 1 commit
-
-
Daniel Hiltgen authored
Prior refactoring passes accidentally removed the logic to bypass VRAM checks for CPU loads. This adds that back, along with test coverage. This also fixes the loaded-map access in the unit test to be behind the mutex, which was likely the cause of various flakes in the tests.
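An illustrative pattern (not the project's exact test) for the flake fix noted above: reads of a map that another goroutine mutates must happen under the same mutex, otherwise the race detector and intermittent failures show up.

```go
package main

import (
	"fmt"
	"sync"
)

// scheduler is a stand-in type for this sketch.
type scheduler struct {
	mu     sync.Mutex
	loaded map[string]bool
}

func (s *scheduler) loadedCount() int {
	s.mu.Lock()
	defer s.mu.Unlock() // read under the same lock the writer uses
	return len(s.loaded)
}

func main() {
	s := &scheduler{loaded: map[string]bool{}}
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.mu.Lock()
		s.loaded["llama3"] = true
		s.mu.Unlock()
	}()
	wg.Wait()
	fmt.Println(s.loadedCount()) // 1
}
```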
-
- 25 Apr, 2024 1 commit
-
-
Jeffrey Morgan authored
* reload model if `num_gpu` changes
* don't reload on -1
* fix tests
-
- 24 Apr, 2024 3 commits
-
-
Bryce Reitano authored
-
Bryce Reitano authored
-
Bryce Reitano authored
-
- 23 Apr, 2024 2 commits
-
-
Daniel Hiltgen authored
Give the goroutine a moment to deliver the expired event.
-
Daniel Hiltgen authored
This change adds support for multiple concurrent requests, as well as loading multiple models by spawning multiple runners. The default settings are currently set at 1 concurrent request per model and only 1 loaded model at a time, but these can be adjusted by setting OLLAMA_NUM_PARALLEL and OLLAMA_MAX_LOADED_MODELS.
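For illustration, a sketch of reading those overrides with the stated defaults; the parsing helper and clamping here are assumptions, not the project's exact code.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// envInt reads an integer env var, falling back to def when unset or invalid.
func envInt(key string, def int) int {
	v, err := strconv.Atoi(os.Getenv(key))
	if err != nil || v < 1 {
		return def
	}
	return v
}

func main() {
	numParallel := envInt("OLLAMA_NUM_PARALLEL", 1)    // concurrent requests per model
	maxLoaded := envInt("OLLAMA_MAX_LOADED_MODELS", 1) // models kept loaded at once
	fmt.Println(numParallel, maxLoaded)
}
```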
-