- 01 Jul, 2024 2 commits
-
-
Michael Yang authored
-
Daniel Hiltgen authored
-
- 27 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 25 Jun, 2024 1 commit
-
-
Blake Mizerany authored
Previously, two costly behaviors made loading GGUF files, along with their metadata and tensor information, VERY slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O
The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3. This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs (if desired). When such keys are encountered, their values are null and are encoded as such in JSON. It also fixes a broken test that was not encoding valid GGUF.
-
- 21 Jun, 2024 4 commits
-
-
Daniel Hiltgen authored
Provide consistent ordering for the ps command - longest duration listed first
-
Daniel Hiltgen authored
Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.
-
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same environment variables as before. Parallel has a direct impact on num_ctx, which in turn can have a significant impact on GPUs with little VRAM, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
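One way to picture "a reasonable default that fits" is the sketch below. It is an illustration under stated assumptions, not the scheduler's actual algorithm: `estimate` and `pickParallel` are hypothetical names, the candidate list is made up, and the memory model (fixed weights plus a KV cache that grows linearly with num_ctx times parallel) is a simplification.

```go
package main

import "fmt"

// estimate is a hypothetical memory model: fixed weights plus a KV cache
// that grows linearly with the total context (numCtx * parallel).
func estimate(weights, kvPerToken uint64, numCtx, parallel int) uint64 {
	return weights + kvPerToken*uint64(numCtx)*uint64(parallel)
}

// pickParallel honors an explicit user setting (userParallel > 0).
// Otherwise it returns the first candidate whose estimate fits in
// freeVRAM, falling back to 1. Candidates are expected largest first.
func pickParallel(userParallel int, candidates []int, weights, kvPerToken, freeVRAM uint64, numCtx int) int {
	if userParallel > 0 {
		return userParallel
	}
	for _, p := range candidates {
		if estimate(weights, kvPerToken, numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	const MiB = uint64(1) << 20
	weights := 4096 * MiB          // assumed model weight size
	kvPerToken := uint64(128 << 10) // assumed KV cache bytes per token
	candidates := []int{4, 2, 1}

	for _, free := range []uint64{8192 * MiB, 4700 * MiB} {
		p := pickParallel(0, candidates, weights, kvPerToken, free, 2048)
		fmt.Printf("free=%d MiB parallel=%d\n", free/MiB, p)
	}
}
```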
-
Michael Yang authored
-
- 19 Jun, 2024 1 commit
-
-
royjhan authored
* API Show Extended
* Initial Draft of Information
  Co-Authored-By: Patrick Devine <pdevine@sonic.net>
* Clean Up
* Descriptive arg error messages and other fixes
* Second Draft of Show with Projectors Included
* Remove Chat Template
* Touches
* Prevent wrapping from files
* Verbose functionality
* Docs
* Address Feedback
* Lint
* Resolve Conflicts
* Function Name
* Tests for api/show model info
* Show Test File
* Add Projector Test
* Clean routes
* Projector Check
* Move Show Test
* Touches
* Doc update
---------
Co-authored-by: Patrick Devine <pdevine@sonic.net>
-
- 16 Jun, 2024 1 commit
-
-
royjhan authored
* Add Mod Time to Show
* Error Handling
-
- 14 Jun, 2024 8 commits
-
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
While models are loading, the VRAM metrics are dynamic, so try to load on a GPU that doesn't have a model actively loading, or wait to avoid races that lead to OOMs
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This library will give us the most reliable free VRAM reporting on Windows to enable concurrent model scheduling.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Our default behavior today is to try to fit into a single GPU if possible. Some users would prefer the old behavior of always spreading across multiple GPUs even if the model can fit into one. This exposes that tunable behavior.
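The choice between the two behaviors can be sketched as follows. This is a hypothetical illustration: `gpu`, `pickGPUs`, and the `spread` parameter are names invented here, and picking the single GPU with the most free memory is an assumption of the sketch rather than something the commit states.

```go
package main

import "fmt"

// gpu is a hypothetical, minimal view of a device for this sketch.
type gpu struct {
	ID   string
	Free uint64 // free VRAM in bytes
}

// pickGPUs returns the devices a model should be loaded onto.
// With spread disabled it prefers one GPU that can hold the whole model
// (here: the fitting GPU with the most free memory). With spread enabled,
// or when no single GPU fits, it returns all GPUs.
func pickGPUs(gpus []gpu, need uint64, spread bool) []gpu {
	if !spread {
		best := -1
		for i, g := range gpus {
			if g.Free >= need && (best < 0 || g.Free > gpus[best].Free) {
				best = i
			}
		}
		if best >= 0 {
			return []gpu{gpus[best]}
		}
	}
	return gpus
}

func main() {
	const GiB = uint64(1) << 30
	gpus := []gpu{{"gpu0", 8 * GiB}, {"gpu1", 12 * GiB}}

	fmt.Println(len(pickGPUs(gpus, 10*GiB, false))) // model fits on gpu1 alone
	fmt.Println(len(pickGPUs(gpus, 10*GiB, true)))  // user asked to spread
}
```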
-
Daniel Hiltgen authored
Still not complete. Our prediction needs refinement to account for the space available on each discrete GPU, so we can see how many layers fit on each one. Because a single layer can't be split across multiple GPUs, we can't treat free space as one logical block.
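A small worked example shows why pooled free space overestimates what fits. The sketch assumes, for simplicity, that every layer has the same size; `layersPerGPU` is a hypothetical helper, not a function from this repository.

```go
package main

import "fmt"

// layersPerGPU assigns whole layers to each GPU in order. A GPU with
// free bytes can hold free/layerSize layers (integer division), because
// a layer cannot be split across devices. It returns the per-GPU counts
// and the total number of layers placed, capped at numLayers.
func layersPerGPU(free []uint64, layerSize uint64, numLayers int) ([]int, int) {
	perGPU := make([]int, len(free))
	if layerSize == 0 {
		return perGPU, 0
	}
	placed := 0
	for i, f := range free {
		fit := int(f / layerSize)
		if fit > numLayers-placed {
			fit = numLayers - placed
		}
		perGPU[i] = fit
		placed += fit
	}
	return perGPU, placed
}

func main() {
	free := []uint64{5, 5} // two GPUs, 5 units free each
	layerSize := uint64(3)

	pooled := (free[0] + free[1]) / layerSize // treats 10 units as one block
	perGPU, placed := layersPerGPU(free, layerSize, 32)

	fmt.Println("pooled estimate:", pooled)
	fmt.Println("per-GPU layers:", perGPU, "total:", placed)
}
```

With 5 units free on each of two GPUs and 3-unit layers, the pooled view suggests 3 layers fit, while only 1 layer fits on each device.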
-
Jeffrey Morgan authored
-
- 13 Jun, 2024 2 commits
-
-
Patrick Devine authored
-
Jeffrey Morgan authored
-
- 12 Jun, 2024 1 commit
-
-
Michael Yang authored
Multiple templates may appear in a model if it is created from another model that 1) has an autodetected template and 2) defines a custom template.
-
- 10 Jun, 2024 2 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
- 07 Jun, 2024 1 commit
-
-
Michael Yang authored
-
- 06 Jun, 2024 3 commits
-
-
Michael Yang authored
-
royjhan authored
* API app/browser access
* Add tauri (resolves #2291, #4791, #3799, #4388)
-
royjhan authored
* Remove false time fields
* Struct Separation for List and Process
* Remove Marshaler
-
- 05 Jun, 2024 1 commit
-
-
Blake Mizerany authored
-
- 04 Jun, 2024 7 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 24 May, 2024 2 commits
-
-
Patrick Devine authored
-
Tim Scheuermann authored
-
- 23 May, 2024 1 commit
-
-
Jeffrey Morgan authored
* put flash attention behind flag for now
* add test
* remove print
* up timeout for scheduler tests
-
- 21 May, 2024 1 commit
-
-
Sang Park authored
Corrected the spelling of "request", previously misspelled as "requeset", in the error log message.
-
- 20 May, 2024 1 commit
-
-
Michael Yang authored
-