Commits · 17b7186cd759337fa98b626e82de150f3789b040 · OpenDAS / ollama

21 Jun, 2024 1 commit

- Enable concurrency by default · 17b7186c
  Daniel Hiltgen authored May 06, 2024

  This adjusts our default settings to enable multiple models and parallel
  requests to a single model. Users can still override these with the same
  environment variables as before. Parallel has a direct impact on num_ctx,
  which in turn can have a significant impact on GPUs with little VRAM, so
  this change also refines the algorithm: when parallel is not explicitly set
  by the user, we try to find a reasonable default that fits the model on
  their GPU(s). As before, multiple models will only load concurrently if they
  fully fit in VRAM.
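The default-selection step described in this commit can be pictured with a small Go sketch. Everything here is an assumption for illustration: `pickParallel`, `fitsInVRAM`, the candidate list, and the linear KV-cache cost model are hypothetical and not taken from the ollama source. The sketch assumes the effective context grows as num_ctx * parallel and picks the first candidate (largest first) that still fits in free VRAM.

```go
package main

import "fmt"

// fitsInVRAM reports whether the model weights plus a KV cache sized for
// ctxTokens tokens fit into freeVRAM bytes. Hypothetical linear cost model.
func fitsInVRAM(freeVRAM, weightBytes, kvBytesPerToken, ctxTokens uint64) bool {
	return weightBytes+kvBytesPerToken*ctxTokens <= freeVRAM
}

// pickParallel honors an explicit user setting (userParallel > 0). Otherwise it
// tries the candidates in the order given (largest first in the example) and
// returns the first whose scaled context (numCtx * parallel) fits; if none
// fits it falls back to 1.
func pickParallel(userParallel int, candidates []int, freeVRAM, weightBytes, kvBytesPerToken, numCtx uint64) int {
	if userParallel > 0 {
		return userParallel
	}
	for _, p := range candidates {
		if fitsInVRAM(freeVRAM, weightBytes, kvBytesPerToken, numCtx*uint64(p)) {
			return p
		}
	}
	return 1
}

func main() {
	const gib = 1 << 30
	// Hypothetical numbers: 8 GiB free VRAM, 4 GiB of weights,
	// 128 KiB of KV cache per token, num_ctx = 2048.
	p := pickParallel(0, []int{4, 1}, 8*gib, 4*gib, 128<<10, 2048)
	fmt.Println("parallel:", p) // parallel: 4
}
```

Trying larger values first favors throughput when VRAM allows and degrades to a single slot on small GPUs, while an explicit user setting always wins.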

20 Jun, 2024 9 commits
- Merge pull request #5194 from dhiltgen/linux_mmap_auto · c7c2f3bc
  Daniel Hiltgen authored Jun 20, 2024
```
Refine mmap default logic on linux
```
- Merge pull request #5125 from dhiltgen/fedora39 · 54a79d6a
  Daniel Hiltgen authored Jun 20, 2024
```
Bump latest fedora cuda repo to 39
```
- Refine mmap default logic on linux · 5bf5aeec
  Daniel Hiltgen authored Jun 20, 2024
```
If we try to use mmap when the model is larger than the free system memory, loading is slower than the no-mmap approach.
```
- Merge pull request #5192 from ollama/mxyng/kv · e01e535c
  Michael Yang authored Jun 20, 2024
```
handle asymmetric embedding KVs
```
- Merge pull request #5188 from ollama/jyan/tmpdir2 · 0195d6a2
  Josh authored Jun 20, 2024
```
fix: skip os.RemoveAll() if PID does not exist
```
- handle asymmetric embedding KVs · 8e0641a9
  Michael Yang authored Jun 20, 2024
- err!=nil check · 662568d4
  Josh Yan authored Jun 20, 2024
- reformat error check · 4ebb66c6
  Josh Yan authored Jun 20, 2024
- skip os.RemoveAll() if PID does not exist · 23e899f3
  Josh Yan authored Jun 20, 2024
19 Jun, 2024 15 commits
- Extend api/show and ollama show to return more model info (#4881) · fedf7163
  royjhan authored Jun 19, 2024
```
* API Show Extended

* Initial Draft of Information
Co-Authored-By: Patrick Devine <pdevine@sonic.net>

* Clean Up

* Descriptive arg error messages and other fixes

* Second Draft of Show with Projectors Included

* Remove Chat Template

* Touches

* Prevent wrapping from files

* Verbose functionality

* Docs

* Address Feedback

* Lint

* Resolve Conflicts

* Function Name

* Tests for api/show model info

* Show Test File

* Add Projector Test

* Clean routes

* Projector Check

* Move Show Test

* Touches

* Doc update

---------
Co-authored-by: Patrick Devine <pdevine@sonic.net>
```
- Merge pull request #5074 from dhiltgen/app_log_rotation · 97c59be6
  Daniel Hiltgen authored Jun 19, 2024
```
Implement log rotation for tray app
```
- Implement log rotation for tray app · 9d8a4988
  Daniel Hiltgen authored Jun 15, 2024
- Merge pull request #5147 from ollama/mxyng/cleanup · 1ae0750a
  Michael Yang authored Jun 19, 2024
```
remove confusing log message
```
- remove confusing log message · 9d91e5e5
  Michael Yang authored Jun 19, 2024
- Merge pull request #5072 from dhiltgen/windows_path · 96624aa4
  Daniel Hiltgen authored Jun 19, 2024
```
Move libraries out of users path
```
- Merge pull request #5146 from dhiltgen/backout · 10f33b85
  Daniel Hiltgen authored Jun 19, 2024
```
Put back temporary intel GPU env var
```
- Merge pull request #5145 from dhiltgen/bad_loads · 4a633cc2
  Daniel Hiltgen authored Jun 19, 2024
```
Fix bad symbol load detection
```
- Revert "Revert "gpu: add env var for detecting Intel oneapi gpus (#5076)"" · d34d88e4
  Daniel Hiltgen authored Jun 19, 2024
```
This reverts commit 755b4e4f.
```
- Fix bad symbol load detection · 52ce350b
  Daniel Hiltgen authored Jun 19, 2024
```
pointer dereferences weren't correct on a few libraries, which explains
some crashes on older systems or with miswired symlinks for discovery libraries.
```
- Merge pull request #5128 from zhewang1-intc/fix_levelzero_empty_symbol_detect · 2abebb2c
  Daniel Hiltgen authored Jun 19, 2024
```
Fix levelzero empty symbol detect
```
- types/model: remove Digest · 380e06e5
  Blake Mizerany authored Jun 18, 2024
```
The Digest type in its current form is awkward to work with and presents
challenges with regard to how it serializes via String using the '-'
prefix.

We currently only use this in ollama.com, so we'll move our specific
needs around digest parsing and validation there.
```
- get real func ptr. · badf975e
  Wang,Zhe authored Jun 19, 2024
- Revert "gpu: add env var for detecting Intel oneapi gpus (#5076)" · 755b4e4f
  Wang,Zhe authored Jun 19, 2024
```
This reverts commit 163cd3e7.
```
- Bump latest fedora cuda repo to 39 · 1a1c99e3
  Daniel Hiltgen authored Jun 18, 2024
18 Jun, 2024 7 commits

- Merge pull request #5121 from ollama/mxyng/deepseekv2 · 21adf8b6
  Michael Yang authored Jun 18, 2024
```
deepseek v2 graph
```
- deepseek v2 graph · e873841c
  Michael Yang authored Jun 18, 2024
- Merge pull request #5117 from dhiltgen/fix_prediction · 26d0bf92
  Daniel Hiltgen authored Jun 18, 2024
```
Handle models with divergent layer sizes
```
- Handle models with divergent layer sizes · 359b15a5
  Daniel Hiltgen authored Jun 18, 2024

  The recent refactoring of the memory prediction assumed all layers
  are the same size, but for some models (like deepseek-coder-v2) this
  is not the case, so our predictions were significantly off.
- Merge pull request #5106 from dhiltgen/clean_logs · b55958a5
  Daniel Hiltgen authored Jun 18, 2024
```
Tighten up memory prediction logging
```
- Tighten up memory prediction logging · 7784ca33
  Daniel Hiltgen authored Jun 17, 2024

  Prior to this change, we logged the memory prediction multiple times
  as the scheduler iterates to find a suitable configuration, which can be
  confusing since only the last log before the server starts is actually valid.
  This now logs once, just before starting the server with the final configuration.
  It also reports which library is used, instead of always saying "offloading to gpu"
  even when running on CPU.
- Merge pull request #5105 from dhiltgen/cuda_mmap · c9c8c98b
  Daniel Hiltgen authored Jun 17, 2024
```
Adjust mmap logic for cuda windows for faster model load
```
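To see why an equal-size assumption mispredicts, the Go sketch below compares counting offloadable layers by their actual sizes with a uniform estimate based on the first layer's size. The helper names, the in-order offload strategy, and the layer sizes are hypothetical; this is a simplified illustration of the problem described in "Handle models with divergent layer sizes", not ollama's memory-prediction code.

```go
package main

import "fmt"

// layersThatFit counts how many layers, taken in order, fit into budget bytes
// using each layer's own size.
func layersThatFit(layerSizes []uint64, budget uint64) int {
	var used uint64
	for i, size := range layerSizes {
		if used+size > budget {
			return i
		}
		used += size
	}
	return len(layerSizes)
}

// layersThatFitUniform models the flawed assumption: every layer is treated
// as if it had the same size as the first one.
func layersThatFitUniform(layerSizes []uint64, budget uint64) int {
	if len(layerSizes) == 0 || layerSizes[0] == 0 {
		return len(layerSizes)
	}
	n := int(budget / layerSizes[0])
	if n > len(layerSizes) {
		n = len(layerSizes)
	}
	return n
}

func main() {
	// Hypothetical model: a small first layer followed by larger layers.
	sizes := []uint64{100, 400, 400, 400}
	fmt.Println(layersThatFit(sizes, 1000), layersThatFitUniform(sizes, 1000)) // 3 4
}
```

With these sizes the uniform estimate claims all four layers fit in a 1000-byte budget, while summing the real sizes shows only three do, which is the kind of overestimate that leads to out-of-memory loads.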

17 Jun, 2024 8 commits
- Adjust mmap logic for cuda windows for faster model load · 17179679
  Daniel Hiltgen authored Jun 17, 2024
```
On Windows, recent llama.cpp changes make mmap slower in most
cases, so default to off. This also implements a tri-state for
use_mmap so we can distinguish a user-provided value of true or
false from an unspecified one.
```
- Update import.md · 176d0f70
  Jeffrey Morgan authored Jun 17, 2024
- Merge pull request #5103 from dhiltgen/faster_win_build · 8ed51cac
  Daniel Hiltgen authored Jun 17, 2024
```
Revert powershell jobs, but keep nvcc and cmake parallelism
```
- Merge pull request #5069 from dhiltgen/ci_release · c9e6f054
  Daniel Hiltgen authored Jun 17, 2024
```
Implement custom github release action
```
- Add back lower level parallel flags · b0930626
  Daniel Hiltgen authored Jun 17, 2024
```
nvcc supports parallelism (threads) and cmake + make can use -j,
while msbuild requires /p:CL_MPcount=8
```
- Revert "More parallelism on windows generate" · e890be48
  Daniel Hiltgen authored Jun 17, 2024
```
This reverts commit 0577af98.
```
- Move libraries out of users path · b2799f11
  Daniel Hiltgen authored Jun 15, 2024
```
We update the PATH on Windows to get the CLI mapped, but this has
an unintended side effect of causing other apps that may use our bundled
DLLs to get terminated when we upgrade.
```
- llm: update llama.cpp commit to `7c26775` (#4896) · 152fc202
  Jeffrey Morgan authored Jun 17, 2024
```
* llm: update llama.cpp submodule to `7c26775`

* disable `LLAMA_BLAS` for now

* `-DLLAMA_OPENMP=off`
```