Commits · 4e262eb2a8aaee31e228febc216c2a83a9a7e4d8 · OpenDAS / ollama

10 Jul, 2024 3 commits
- remove `GGML_CUDA_FORCE_MMQ=on` from build (#5588) · 4e262eb2
  Jeffrey Morgan authored Jul 10, 2024
  
  4e262eb2
- Bump ROCm on windows to 6.1.2 · 1f50356e
  Daniel Hiltgen authored Jul 10, 2024
```
This also adjusts our algorithm to favor our bundled ROCm.
I've confirmed VRAM reporting still doesn't work properly so we
can't yet enable concurrency by default.
```
  1f50356e
- Remove duplicate merge glitch · 22c81f62
  Daniel Hiltgen authored Jul 10, 2024
  
  22c81f62
09 Jul, 2024 1 commit

Statically link c++ and thread lib · b51e3b63

Daniel Hiltgen authored Jul 09, 2024

This makes sure we statically link the C++ and thread libraries on Windows
to avoid unnecessary runtime dependencies on non-standard DLLs.

b51e3b63

08 Jul, 2024 1 commit
- Workaround broken ROCm p2p copy · 0bacb300
  Daniel Hiltgen authored Jul 05, 2024
```
Enable the build flag for llama.cpp to use CPU copy for multi-GPU scenarios.
```
  0bacb300
07 Jul, 2024 4 commits
- llm: remove ambiguous comment when putting upper limit on predictions to avoid... · 53da2c69
  Jeffrey Morgan authored Jul 07, 2024
```
llm: remove ambiguous comment when putting upper limit on predictions to avoid infinite generation (#5535)
```
  53da2c69
- llm: allow gemma 2 to context shift (#5534) · d8def1ff
  Jeffrey Morgan authored Jul 07, 2024
  
  d8def1ff
- Update llama.cpp submodule to `a8db2a9c` (#5530) · 571dc619
  Jeffrey Morgan authored Jul 07, 2024
  
  571dc619
- llm: print caching notices in debug only (#5533) · 0e09c380
  Jeffrey Morgan authored Jul 07, 2024
  
  0e09c380
06 Jul, 2024 8 commits
- llm: add `-DBUILD_SHARED_LIBS=off` to common cpu cmake flags (#5520) · 4607c706
  Jeffrey Morgan authored Jul 06, 2024
  
  4607c706
- release: remove unwanted mingw dll.a files · a08f20d9
  jmorganca authored Jul 06, 2024
  
  a08f20d9
- Revert "llm: only statically link libstdc++" · 6cea0360
  jmorganca authored Jul 06, 2024
```
This reverts commit 5796bfc4.
```
  6cea0360
- llm: only statically link libstdc++ · 5796bfc4
  jmorganca authored Jul 06, 2024
  
  5796bfc4
- llm: statically link pthread and stdc++ dependencies in windows build · f1a379aa
  jmorganca authored Jul 06, 2024
  
  f1a379aa
- llm: add `GGML_STATIC` flag to windows static lib · 9ae14699
  jmorganca authored Jul 06, 2024
  
  9ae14699
- llm: add `COMMON_DARWIN_DEFS` to arm static build (#5513) · e0348d3f
  Jeffrey Morgan authored Jul 05, 2024
  
  e0348d3f
- llm: fix missing dylibs by restoring old build behavior on Linux and macOS (#5511) · 2cc854f8
  Jeffrey Morgan authored Jul 05, 2024
```
* Revert "fix cmake build (#5505)"

This reverts commit 4fd5f352.

* llm: fix missing dylibs by restoring old build behavior

* crlf -> lf
```
  2cc854f8
05 Jul, 2024 7 commits
- llm: put back old include dir (#5507) · 5304b765
  Jeffrey Morgan authored Jul 05, 2024
```
* llm: put back old include dir

* llm: update link paths for old submodule commits
```
  5304b765
- fix cmake build (#5505) · 4fd5f352
  Jeffrey Morgan authored Jul 05, 2024
  
  4fd5f352
- fix model reloading · ac7a842e
  Michael Yang authored Jul 03, 2024
```
ensure runtime model changes (template, system prompt, messages,
options) are captured on model updates without needing to reload the
server
```
  ac7a842e
- fix typo in cgo directives in `llm.go` (#5501) · 78fb33dd
  Jeffrey Morgan authored Jul 05, 2024
  
  78fb33dd
- update llama.cpp submodule to `d7fd29f` (#5475) · 8f8e736b
  Jeffrey Morgan authored Jul 05, 2024
  
  8f8e736b
- Use slot with cached prompt instead of least recently used (#5492) · d89454de
  Jeffrey Morgan authored Jul 05, 2024
```
* Use common prefix to select slot

* actually report `longest`
```
  d89454de
- Fix assert on small embedding inputs (#5491) · e9188e97
  Jeffrey Morgan authored Jul 05, 2024
```
* Fix assert on small embedding inputs

* Update llm/patches/09-pooling.diff
```
  e9188e97
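
For the slot selection change above (d89454de), here is a minimal Go sketch of the idea as the commit message describes it: prefer the slot whose cached prompt shares the longest common prefix with the new prompt. The `slot` type, its fields, `pickSlot`, and the fallback to the least recently used slot are assumptions made for illustration, not the code from the commit.

```go
package main

import "fmt"

// slot is a hypothetical stand-in for a server slot that keeps a cached prompt.
type slot struct {
	cache    []int // token IDs of the cached prompt
	lastUsed int64 // larger means more recently used
}

// commonPrefix returns how many leading tokens a and b share.
func commonPrefix(a, b []int) int {
	n := 0
	for n < len(a) && n < len(b) && a[n] == b[n] {
		n++
	}
	return n
}

// pickSlot returns the index of the slot with the longest shared prefix.
// If no slot shares any prefix it falls back to the least recently used slot
// (an assumption of this sketch). It returns -1 when there are no slots.
func pickSlot(slots []slot, prompt []int) int {
	best, longest, lru := -1, 0, -1
	for i, s := range slots {
		if n := commonPrefix(s.cache, prompt); n > longest {
			best, longest = i, n
		}
		if lru == -1 || s.lastUsed < slots[lru].lastUsed {
			lru = i
		}
	}
	if best >= 0 {
		return best
	}
	return lru
}

func main() {
	slots := []slot{
		{cache: []int{1, 2, 9}, lastUsed: 10},
		{cache: []int{1, 2, 3, 4}, lastUsed: 20},
	}
	// Slot 1 shares three leading tokens with the prompt, slot 0 only two.
	fmt.Println(pickSlot(slots, []int{1, 2, 3, 5}))
}
```

Reusing the slot with the longest shared prefix means fewer prompt tokens need to be evaluated again, which is presumably why the change moved away from a pure least-recently-used choice.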
04 Jul, 2024 1 commit
- fix error detection by limiting model loading error parsing (#5472) · 4d71c559
  Jeffrey Morgan authored Jul 03, 2024
  
  4d71c559
03 Jul, 2024 3 commits

Return Correct Prompt Eval Count Regardless of Cache Prompt (#5371) · 3b5a4a77

royjhan authored Jul 03, 2024

* openai compatibility

* Revert "openai compatibility"

This reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c.

* remove erroneous subtraction of prompt cache

3b5a4a77

Fix corner cases on tmp cleaner on mac · 0e982bc1

Daniel Hiltgen authored Jul 03, 2024

When ollama has been running for a long time, tmp cleaners can remove the
runners. This tightens up a few corner cases on ARM Macs where
we failed with "server cpu not listed in available servers map[]".

0e982bc1

Fix clip model loading with unicode paths · 6298f498

Daniel Hiltgen authored Jul 03, 2024

On Windows, if the model directory contained Unicode characters,
clip models would fail to load. This fixes the file name
handling in clip.cpp to support UTF-16 on Windows.

6298f498

01 Jul, 2024 2 commits
- error · 33a65e3b
  Josh Yan authored Jul 01, 2024
  
  33a65e3b
- Switch use_mmap to a pointer type · 97c9e117
  Daniel Hiltgen authored Jun 28, 2024
```
This uses nil as undefined for a cleaner implementation.
```
  97c9e117
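
The "nil as undefined" idea in the `use_mmap` change above (97c9e117) can be sketched in Go as follows. A `*bool` distinguishes three states: unset, explicitly true, and explicitly false. The `options` struct, the JSON tag, and the helper names are hypothetical and chosen only for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// options is a hypothetical options struct. A nil UseMMap means the caller
// did not set the field at all.
type options struct {
	UseMMap *bool `json:"use_mmap,omitempty"`
}

// resolveMMap returns the caller's explicit choice, or fallback when unset.
func resolveMMap(o options, fallback bool) bool {
	if o.UseMMap == nil {
		return fallback
	}
	return *o.UseMMap
}

// parse decodes a JSON document into options.
func parse(s string) (options, error) {
	var o options
	err := json.Unmarshal([]byte(s), &o)
	return o, err
}

func main() {
	for _, in := range []string{`{}`, `{"use_mmap": false}`, `{"use_mmap": true}`} {
		o, err := parse(in)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-22s set=%v effective=%v\n", in, o.UseMMap != nil, resolveMMap(o, true))
	}
}
```

With a plain `bool`, an omitted field and an explicit `false` would both decode to `false`, so a default of `true` could never be told apart from a user turning the option off.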
29 Jun, 2024 1 commit
- Do not shift context for sliding window models (#5368) · 717f7229
  Jeffrey Morgan authored Jun 28, 2024
```
* Do not shift context for sliding window models

* truncate prompt > 2/3 tokens

* only target gemma2
```
  717f7229
27 Jun, 2024 2 commits
- gemma2 graph · de2163da
  Michael Yang authored Jun 27, 2024
  
  de2163da
- llm: architecture patch (#5316) · 4d311eb7
  Jeffrey Morgan authored Jun 26, 2024
  
  4d311eb7
25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in a
    not-okay amount of syscalls/disk I/O.

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

21 Jun, 2024 1 commit

Enable concurrency by default · 17b7186c

Daniel Hiltgen authored May 06, 2024

This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these with the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small-VRAM GPUs,
so this change also refines the algorithm: when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.

17b7186c
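
One way to read "find a reasonable default that fits" is a search from the highest parallel value downward until the memory estimate fits. The Go sketch below shows that reading. The memory model (fixed weights plus a per-token cost times `numCtx` times the parallel count), the `maxParallel` cap, and all function names are assumptions for illustration; the commit's actual estimate is not shown in this log.

```go
package main

import "fmt"

// estimate is a simplified, assumed memory model: fixed weights plus a
// per-token cache cost multiplied by the context each parallel request needs.
func estimate(weights, perToken uint64, numCtx, parallel int) uint64 {
	return weights + perToken*uint64(numCtx)*uint64(parallel)
}

// pickParallel honors an explicit user setting (userParallel > 0). Otherwise
// it returns the largest value in [1, maxParallel] whose estimate fits in
// free memory, and 1 if none fits.
func pickParallel(userParallel, maxParallel int, free, weights, perToken uint64, numCtx int) int {
	if userParallel > 0 {
		return userParallel
	}
	for p := maxParallel; p > 1; p-- {
		if estimate(weights, perToken, numCtx, p) <= free {
			return p
		}
	}
	return 1
}

func main() {
	// Made-up units: 4000 for weights, 1 per token, context of 2048 tokens.
	for _, free := range []uint64{1000, 9000, 20000} {
		fmt.Println(free, pickParallel(0, 4, free, 4000, 1, 2048))
	}
}
```

Because the per-request context multiplies with the parallel count in this model, a GPU with little VRAM ends up with a lower default than a large one, which matches the concern the commit message raises.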

20 Jun, 2024 2 commits
- Refine mmap default logic on linux · 5bf5aeec
  Daniel Hiltgen authored Jun 20, 2024
```
If we try to use mmap when the model is larger than the system free space, loading is slower than the no-mmap approach.
```
  5bf5aeec
- handle asymmetric embedding KVs · 8e0641a9
  Michael Yang authored Jun 20, 2024
  
  8e0641a9
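
The mmap default described in the change above (5bf5aeec) comes down to a size comparison. A minimal Go sketch, assuming the decision uses the model size and the free amount reported by the system, and assuming an explicit user choice always wins; `useMMap` and its parameters are hypothetical names.

```go
package main

import "fmt"

// useMMap decides whether to memory-map the model file. An explicit user
// choice (non-nil) always wins in this sketch; otherwise mmap is the default
// only when the model is no larger than what the system reports as free.
func useMMap(user *bool, modelSize, systemFree uint64) bool {
	if user != nil {
		return *user
	}
	return modelSize <= systemFree
}

func main() {
	const gib = 1 << 30
	fmt.Println(useMMap(nil, 4*gib, 16*gib))  // model fits
	fmt.Println(useMMap(nil, 40*gib, 16*gib)) // model is larger than free
}
```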
19 Jun, 2024 1 commit
- remove confusing log message · 9d91e5e5
  Michael Yang authored Jun 19, 2024
  
  9d91e5e5
18 Jun, 2024 2 commits

deepseek v2 graph · e873841c

Michael Yang authored Jun 18, 2024

e873841c

Handle models with divergent layer sizes · 359b15a5

Daniel Hiltgen authored Jun 18, 2024

The recent refactoring of the memory prediction assumed all layers
are the same size, but for some models (like deepseek-coder-v2) this
is not the case, so our predictions were significantly off.

359b15a5
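
To see why a uniform-size assumption skews the prediction, the Go sketch below compares two ways of counting how many layers fit in a memory budget: one assumes every layer is as large as the first, the other sums the real sizes. The function names and the numbers are made up for illustration and are not taken from the memory prediction code.

```go
package main

import "fmt"

// fitUniform assumes every layer is as large as the first one.
func fitUniform(layers []uint64, free uint64) int {
	if len(layers) == 0 || layers[0] == 0 {
		return 0
	}
	n := int(free / layers[0])
	if n > len(layers) {
		n = len(layers)
	}
	return n
}

// fitActual walks the layers in order and adds up their real sizes.
func fitActual(layers []uint64, free uint64) int {
	var used uint64
	n := 0
	for _, size := range layers {
		if used+size > free {
			break
		}
		used += size
		n++
	}
	return n
}

func main() {
	// The first layer is much smaller than the rest.
	layers := []uint64{100, 300, 300, 300}
	fmt.Println(fitUniform(layers, 700), fitActual(layers, 700))
}
```

With these numbers the uniform estimate claims all four layers fit, while only three actually do, so the uniform version would overcommit the device.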