- 03 Jul, 2024 1 commit
-
-
Daniel Hiltgen authored
Refine the way we log GPU discovery to improve the non-debug output, and report more actionable log messages when possible to help users troubleshoot on their own.
-
- 21 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
Until ROCm v6.2 ships, we wont be able to get accurate free memory reporting on windows, which makes automatic concurrency too risky. Users can still opt-in but will need to pay attention to model sizes otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs have accurate VRAM reporting wired up now, so we can turn on concurrency by default.
-
- 20 Jun, 2024 3 commits
- 19 Jun, 2024 4 commits
-
-
Daniel Hiltgen authored
This reverts commit 755b4e4f.
-
Daniel Hiltgen authored
pointer deref's weren't correct on a few libraries, which explains some crashes on older systems or miswired symlinks for discovery libraries.
-
Wang,Zhe authored
-
- 17 Jun, 2024 3 commits
-
-
Daniel Hiltgen authored
We update the PATH on windows to get the CLI mapped, but this has an unintended side effect of causing other apps that may use our bundled DLLs to get terminated when we upgrade.
-
Lei Jitang authored
Signed-off-by:Lei Jitang <leijitang@outlook.com>
-
Jeffrey Morgan authored
* gpu: add env var for detecting intel oneapi gpus * fix build error
-
- 16 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
Also removes an unused overall count variable
-
- 15 Jun, 2024 1 commit
-
-
Lei Jitang authored
Signed-off-by:Lei Jitang <leijitang@outlook.com>
-
- 14 Jun, 2024 10 commits
-
-
Daniel Hiltgen authored
This should aid in troubleshooting by capturing and reporting the GPU settings at startup in the logs along with all the other server settings.
-
Daniel Hiltgen authored
Implement support for GPU env var workarounds, and leverage this for the Vega RX 56 which needs HSA_ENABLE_SDMA=0 set to work properly
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
This library will give us the most reliable free VRAM reporting on windows to enable concurrent model scheduling.
-
Daniel Hiltgen authored
-
Daniel Hiltgen authored
Still not complete, needs some refinement to our prediction to understand the discrete GPUs available space so we can see how many layers fit in each one since we can't split one layer across multiple GPUs we can't treat free space as one logical block
-
Daniel Hiltgen authored
Now that we call the GPU discovery routines many times to update memory, this splits initial discovery from free memory updating.
-
Daniel Hiltgen authored
The amdgpu drivers free VRAM reporting omits some other apps, so leverage the upstream DRM driver which keeps better tabs on things
-
Daniel Hiltgen authored
This reverts commit 476fb8e8.
-
- 13 Jun, 2024 1 commit
-
-
Daniel Hiltgen authored
-
- 04 Jun, 2024 3 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
- 02 Jun, 2024 1 commit
-
-
Jeffrey Morgan authored
* fix oneapi errors on windows 10
-
- 24 May, 2024 2 commits
-
-
Patrick Devine authored
-
Wang,Zhe authored
-
- 10 May, 2024 1 commit
-
-
Daniel Hiltgen authored
Under stress scenarios we're seeing OOMs so this should help stabilize the allocations under heavy concurrency stress.
-
- 09 May, 2024 2 commits
-
-
Daniel Hiltgen authored
The GPU drivers take a while to update their free memory reporting, so we need to wait until the values converge with what we're expecting before proceeding to start another runner in order to get an accurate picture.
-
Daniel Hiltgen authored
This cleans up the logging for GPU discovery a bit, and can serve as a foundation to report GPU information in a future UX.
-
- 07 May, 2024 1 commit
-
-
Michael Yang authored
-
- 06 May, 2024 1 commit
-
-
Daniel Hiltgen authored
Trying to live off the land for cuda libraries was not the right strategy. We need to use the version we compiled against to ensure things work properly
-
- 05 May, 2024 1 commit
-
-
Daniel Hiltgen authored
This moves all the env var reading into one central module and logs the loaded config once at startup which should help in troubleshooting user server logs
-
- 03 May, 2024 1 commit
-
-
Daniel Hiltgen authored
For some reason this library gives incorrect GPU information, so skip it
-
- 01 May, 2024 2 commits
-
-
Daniel Hiltgen authored
-
Jeffrey Morgan authored
-