Commits · 7784ca33ce4cc78629160719f4a84981437071b7 · OpenDAS / ollama

18 Jun, 2024 2 commits

- Tighten up memory prediction logging · 7784ca33
  Daniel Hiltgen authored Jun 17, 2024

  Prior to this change, we logged the memory prediction multiple times
  as the scheduler iterates to find a suitable configuration, which can be
  confusing since only the last log before the server starts is actually valid.
  This now logs once just before starting the server on the final configuration.
  It also reports which library is in use instead of always saying "offloading to gpu",
  even when using CPU.

  7784ca33

- Merge pull request #5105 from dhiltgen/cuda_mmap · c9c8c98b
  Daniel Hiltgen authored Jun 17, 2024
```
Adjust mmap logic for cuda windows for faster model load
```
  c9c8c98b
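
The logging change in 7784ca33 above can be illustrated with a minimal Go sketch: iterate over candidate configurations silently, then log only the one that will actually be used. The `estimate` type and the `pickEstimate` and `describe` helpers are hypothetical names made up for this illustration, not the project's actual code.

```go
package main

import (
	"fmt"
	"log"
)

// estimate is a hypothetical stand-in for one memory prediction.
type estimate struct {
	Library string // e.g. "cuda", "rocm", or "cpu"
	Layers  int    // layers predicted to be offloaded
	VRAM    uint64 // predicted memory use in bytes
}

// pickEstimate walks the candidates in order and returns the first one
// that fits in the available memory. It deliberately logs nothing, so
// intermediate (ultimately unused) predictions never reach the log.
func pickEstimate(candidates []estimate, free uint64) (estimate, bool) {
	for _, c := range candidates {
		if c.VRAM <= free {
			return c, true
		}
	}
	return estimate{}, false
}

// describe formats the final prediction and names the library in use,
// rather than assuming a GPU.
func describe(e estimate) string {
	return fmt.Sprintf("offload to %s: layers=%d memory=%d", e.Library, e.Layers, e.VRAM)
}

func main() {
	candidates := []estimate{
		{Library: "cuda", Layers: 33, VRAM: 8 << 30},
		{Library: "cuda", Layers: 20, VRAM: 5 << 30},
		{Library: "cpu", Layers: 0, VRAM: 0},
	}
	final, ok := pickEstimate(candidates, 6<<30)
	if !ok {
		log.Println("no candidate configuration fits")
		return
	}
	// Single log line, emitted just before the (hypothetical) server start.
	log.Println(describe(final))
}
```

Keeping the selection loop free of logging is the design point here: only the caller, which knows the final choice, writes to the log.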

17 Jun, 2024 9 commits
- Adjust mmap logic for cuda windows for faster model load · 17179679
  Daniel Hiltgen authored Jun 17, 2024
```
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off.  This also implements a tri-state for
use_mmap so we can detect the difference between a user provided
value of true/false, or unspecified.
```
  17179679
- Update import.md · 176d0f70
  Jeffrey Morgan authored Jun 17, 2024
  
  176d0f70
- Merge pull request #5103 from dhiltgen/faster_win_build · 8ed51cac
  Daniel Hiltgen authored Jun 17, 2024
```
Revert powershell jobs, but keep nvcc and cmake parallelism
```
  8ed51cac
- Merge pull request #5069 from dhiltgen/ci_release · c9e6f054
  Daniel Hiltgen authored Jun 17, 2024
```
Implement custom github release action
```
  c9e6f054
- Add back lower level parallel flags · b0930626
  Daniel Hiltgen authored Jun 17, 2024
```
nvcc supports parallelism (threads) and cmake + make can use -j,
while msbuild requires /p:CL_MPcount=8
```
  b0930626
- Revert "More parallelism on windows generate" · e890be48
  Daniel Hiltgen authored Jun 17, 2024
```
This reverts commit 0577af98.
```
  e890be48
- llm: update llama.cpp commit to `7c26775` (#4896) · 152fc202
  Jeffrey Morgan authored Jun 17, 2024
```
* llm: update llama.cpp submodule to `7c26775`

* disable `LLAMA_BLAS` for now

* `-DLLAMA_OPENMP=off`
```
  152fc202
- Fix a build warning (#5096) · 4ad0d4d6
  Lei Jitang authored Jun 18, 2024
```
Signed-off-by: Lei Jitang <leijitang@outlook.com>
```
  4ad0d4d6
- gpu: add env var for detecting Intel oneapi gpus (#5076) · 163cd3e7
  Jeffrey Morgan authored Jun 16, 2024
```
* gpu: add env var for detecting intel oneapi gpus

* fix build error
```
  163cd3e7
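
The tri-state `use_mmap` described in 17179679 above can be sketched in Go with a pointer to `bool`, where `nil` means the user did not specify a value. This is a minimal illustration under that assumption; the `options` struct and the `parse` and `resolveMMap` helpers are hypothetical and may differ from how the project represents the setting.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// options is a hypothetical request payload. A nil UseMMap means
// "unspecified", which is distinct from an explicit true or false.
type options struct {
	UseMMap *bool `json:"use_mmap,omitempty"`
}

// parse decodes a JSON options object.
func parse(raw string) (options, error) {
	var o options
	err := json.Unmarshal([]byte(raw), &o)
	return o, err
}

// resolveMMap returns the effective setting: an explicit user value wins,
// otherwise the platform default applies.
func resolveMMap(user *bool, platformDefault bool) bool {
	if user != nil {
		return *user
	}
	return platformDefault
}

func main() {
	// Assumed default for illustration: mmap off, as the commit
	// message describes for Windows.
	const platformDefault = false
	for _, raw := range []string{`{}`, `{"use_mmap": true}`, `{"use_mmap": false}`} {
		o, err := parse(raw)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%s -> use mmap: %v\n", raw, resolveMMap(o.UseMMap, platformDefault))
	}
}
```

A plain `bool` could not tell an omitted field apart from an explicit `false`, which is why the third state matters when the default differs by platform.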

16 Jun, 2024 4 commits
- Merge pull request #5080 from dhiltgen/debug_intel_crash · 4c2c8f93
  Daniel Hiltgen authored Jun 16, 2024
```
Add some more debugging logs for intel discovery
```
  4c2c8f93
- Add some more debugging logs for intel discovery · fd1e6e05
  Daniel Hiltgen authored Jun 16, 2024
```
Also removes an unused overall count variable
```
  fd1e6e05
- Add ModifiedAt Field to /api/show (#5033) · 89c79bec
  royjhan authored Jun 15, 2024
```
* Add Mod Time to Show

* Error Handling
```
  89c79bec
- docs: add missing powershell package to windows development instructions (#5075) · c7b77004
  Jeffrey Morgan authored Jun 15, 2024
```
* docs: add missing instruction for powershell build

The powershell script for building Ollama on Windows now requires the `ThreadJob` module. Add this to the instructions and dependency list.

* Update development.md
```
  c7b77004

15 Jun, 2024 9 commits
- Merge pull request #5058 from coolljt0725/fix_build_warning · 07d143f4
  Daniel Hiltgen authored Jun 15, 2024
```
gpu: Fix build warning
```
  07d143f4
- Implement custom github release action · a12283e2
  Daniel Hiltgen authored Jun 15, 2024
```
This implements the release logic we want via gh cli
to support updating releases with rc tags in place and retain
release notes and other community reactions.
```
  a12283e2
- Merge pull request #5037 from dhiltgen/faster_win_build · 4b0050cf
  Daniel Hiltgen authored Jun 15, 2024
```
More parallelism on windows generate
```
  4b0050cf
- More parallelism on windows generate · 0577af98
  Daniel Hiltgen authored Jun 13, 2024
```
Make the build faster
```
  0577af98
- Merge pull request #4875 from dhiltgen/rocm_gfx900_workaround · 17ce203a
  Daniel Hiltgen authored Jun 15, 2024
```
Rocm gfx900 workaround
```
  17ce203a
- Merge pull request #4874 from dhiltgen/rocm_v6_bump · d76555ff
  Daniel Hiltgen authored Jun 15, 2024
```
Rocm v6 bump
```
  d76555ff
- Merge pull request #4264 from dhiltgen/show_gpu_visible_settings · 2786dff5
  Daniel Hiltgen authored Jun 15, 2024
```
Centralize GPU configuration vars
```
  2786dff5
- gpu: Fix build warning · 225f0d12
  Lei Jitang authored Jun 15, 2024
```
Signed-off-by: Lei Jitang <leijitang@outlook.com>
```
  225f0d12
- Merge pull request #4972 from jayson-cloude/main · 532db583
  Daniel Hiltgen authored Jun 14, 2024
```
fix: "Skip searching for network devices"
```
  532db583

14 Jun, 2024 16 commits
- Centralize GPU configuration vars · 6be309e1
  Daniel Hiltgen authored May 08, 2024
```
This should aid in troubleshooting by capturing and reporting the GPU
settings at startup in the logs along with all the other server settings.
```
  6be309e1
- Workaround gfx900 SDMA bugs · da3bf233
  Daniel Hiltgen authored May 31, 2024
```
Implement support for GPU env var workarounds, and leverage
this for the Vega RX 56 which needs
HSA_ENABLE_SDMA=0 set to work properly
```
  da3bf233
- Bump ROCm linux to 6.1.1 · 26ab6773
  Daniel Hiltgen authored Jun 06, 2024
  
  26ab6773
- Merge pull request #4517 from dhiltgen/gpu_incremental · 45cacbaf
  Daniel Hiltgen authored Jun 14, 2024
```
Enhanced GPU discovery and multi-gpu support with concurrency
```
  45cacbaf
- Remove mmap related output calc logic · 17df6520
  Daniel Hiltgen authored Jun 13, 2024
  
  17df6520
- review comments and coverage · 6f351bf5
  Daniel Hiltgen authored Jun 05, 2024
  
  6f351bf5
- Prevent multiple concurrent loads on the same gpus · ff4f0cbd
  Daniel Hiltgen authored Jun 04, 2024
```
While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait
to avoid races that lead to OOMs
```
  ff4f0cbd
- Refine CPU load behavior with system memory visibility · fc37c192
  Daniel Hiltgen authored Jun 03, 2024
  
  fc37c192
- Reintroduce nvidia nvml library for windows · 434dfe30
  Daniel Hiltgen authored Jun 03, 2024
```
This library will give us the most reliable free VRAM reporting on windows
to enable concurrent model scheduling.
```
  434dfe30
- Refactor intel gpu discovery · 4e2b7e18
  Daniel Hiltgen authored May 29, 2024
  
  4e2b7e18
- Harden unload for empty runners · 48702dd1
  Daniel Hiltgen authored May 30, 2024
  
  48702dd1
- refined test timing · 68dfc623
  Daniel Hiltgen authored May 31, 2024
```
Adjust timing on some tests so they don't time out on small/slow GPUs
```
  68dfc623
- Support forced spreading for multi GPU · 5e8ff556
  Daniel Hiltgen authored May 08, 2024
```
Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one.  This exposes that
tunable behavior.
```
  5e8ff556
- Improve multi-gpu handling at the limit · 6fd04ca9
  Daniel Hiltgen authored May 18, 2024
```
Still not complete. Our prediction needs some refinement to understand each
discrete GPU's available space, so we can see how many layers fit in each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block.
```
  6fd04ca9
- Fix concurrency integration test to work locally · 206797bd
  Daniel Hiltgen authored May 23, 2024
```
This worked remotely but wound up trying to spawn multiple servers
locally which doesn't work
```
  206797bd
- Refine GPU discovery to bootstrap once · 43ed358f
  Daniel Hiltgen authored May 15, 2024
```
Now that we call the GPU discovery routines many times to
update memory, this splits initial discovery from free memory
updating.
```
  43ed358f
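
The constraint noted in 6fd04ca9 above (a layer cannot be split across GPUs, so free space cannot be pooled) can be shown with a small Go sketch. The function names and the example sizes are made up for illustration and are not taken from the project.

```go
package main

import "fmt"

// layersPerGPU returns how many whole layers of layerSize fit on each GPU,
// given each GPU's free memory. Leftover space on one GPU cannot be
// combined with leftover space on another.
func layersPerGPU(free []uint64, layerSize uint64) []int {
	counts := make([]int, len(free))
	if layerSize == 0 {
		return counts
	}
	for i, f := range free {
		counts[i] = int(f / layerSize)
	}
	return counts
}

// total sums the per-GPU layer counts.
func total(counts []int) int {
	n := 0
	for _, c := range counts {
		n += c
	}
	return n
}

// naiveLayers treats all free memory as one logical block, which can
// overestimate how many layers actually fit.
func naiveLayers(free []uint64, layerSize uint64) int {
	if layerSize == 0 {
		return 0
	}
	var pool uint64
	for _, f := range free {
		pool += f
	}
	return int(pool / layerSize)
}

func main() {
	// Two GPUs with 5 units free each, and layers of 3 units (arbitrary units).
	free := []uint64{5, 5}
	const layerSize = 3
	per := layersPerGPU(free, layerSize)
	fmt.Println("per-GPU layers:", per, "total:", total(per))
	fmt.Println("pooled estimate:", naiveLayers(free, layerSize))
}
```

With these made-up numbers, the pooled estimate claims three layers fit while only one fits on each GPU, which is the kind of overestimate the commit message warns about.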