- 05 Jul, 2024 1 commit
Jeffrey Morgan authored
* Fix assert on small embedding inputs
* Update llm/patches/09-pooling.diff
- 04 Jul, 2024 1 commit
Jeffrey Morgan authored
- 03 Jul, 2024 3 commits
royjhan authored
* openai compatibility
* Revert "openai compatibility"; this reverts commit d3f98a811e00fc497d889c8c45b0cfec5b64690c
* remove erroneous subtraction of prompt cache
Daniel Hiltgen authored
When Ollama has been running for a long time, tmp cleaners can remove the runners. This tightens up a few corner cases on ARM Macs where we previously failed with "server cpu not listed in available servers map[]".
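A minimal sketch of the recovery idea: before launching, check that the extracted runner binary still exists and restore it if a tmp cleaner deleted it. The function names and extraction callback here are illustrative assumptions, not Ollama's actual API.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// ensureRunner checks that the runner binary extracted to tmpDir is still
// present, and calls extract to restore it if a tmp cleaner deleted it.
func ensureRunner(tmpDir, name string, extract func(dest string) error) (string, error) {
	path := filepath.Join(tmpDir, name)
	if _, err := os.Stat(path); os.IsNotExist(err) {
		// The tmp cleaner removed our payload; re-extract before launching.
		if err := extract(path); err != nil {
			return "", fmt.Errorf("re-extracting runner %s: %w", name, err)
		}
	}
	return path, nil
}

func main() {
	path, err := ensureRunner(os.TempDir(), "ollama-runner", func(dest string) error {
		return os.WriteFile(dest, []byte("#!/bin/sh\n"), 0o755) // stand-in for real extraction
	})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("runner available at", path)
}
```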
Daniel Hiltgen authored
On Windows, if the model directory contained Unicode characters, CLIP models would fail to load. This fixes the file name handling in clip.cpp to support UTF-16 on Windows.
- 01 Jul, 2024 2 commits
Josh Yan authored
Daniel Hiltgen authored
This uses nil as undefined for a cleaner implementation.
- 29 Jun, 2024 1 commit
Jeffrey Morgan authored
* Do not shift context for sliding window models
* Truncate prompts longer than 2/3 of the context window
* Only target gemma2
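A minimal sketch of the truncation idea, under assumed names: for sliding-window models such as gemma2, skip context shifting and instead keep only the most recent 2/3 of a context window's worth of prompt tokens.

```go
package main

import "fmt"

// truncatePrompt keeps the most recent tokens when the prompt exceeds
// two-thirds of the context window, rather than shifting the KV cache.
func truncatePrompt(tokens []int, numCtx int, slidingWindow bool) []int {
	if !slidingWindow {
		return tokens
	}
	limit := numCtx * 2 / 3
	if len(tokens) > limit {
		return tokens[len(tokens)-limit:]
	}
	return tokens
}

func main() {
	tokens := make([]int, 5000)
	for i := range tokens {
		tokens[i] = i
	}
	out := truncatePrompt(tokens, 4096, true)
	fmt.Printf("kept %d of %d tokens\n", len(out), len(tokens)) // kept 2730 of 5000
}
```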
- 27 Jun, 2024 2 commits
Michael Yang authored
Jeffrey Morgan authored
- 25 Jun, 2024 1 commit
Blake Mizerany authored
Previously, some costly operations were making the loading of GGUF files and their metadata and tensor information very slow:
* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and too much disk I/O
The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3. This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs; when such keys are encountered, their values are null and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.
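A sketch of both fixes under assumed types, not Ollama's actual GGUF decoder: wrap the file in a bufio.Reader so small key/value reads don't each hit disk, and decode length-prefixed strings through a reusable scratch buffer so only the final string allocates.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

type decoder struct {
	r   *bufio.Reader // buffered: one syscall services many small reads
	buf []byte        // scratch buffer reused across string reads
}

// readString reads a GGUF-style length-prefixed string (uint64 length,
// little endian) through the shared scratch buffer.
func (d *decoder) readString() (string, error) {
	var n uint64
	if err := binary.Read(d.r, binary.LittleEndian, &n); err != nil {
		return "", err
	}
	if uint64(cap(d.buf)) < n {
		d.buf = make([]byte, n)
	}
	d.buf = d.buf[:n]
	if _, err := io.ReadFull(d.r, d.buf); err != nil {
		return "", err
	}
	return string(d.buf), nil // only this final conversion allocates
}

func main() {
	// Simulate a file containing one length-prefixed string.
	var raw bytes.Buffer
	binary.Write(&raw, binary.LittleEndian, uint64(5))
	raw.WriteString("llama")

	d := &decoder{r: bufio.NewReader(&raw)}
	s, err := d.readString()
	fmt.Println(s, err) // llama <nil>
}
```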
- 21 Jun, 2024 1 commit
Daniel Hiltgen authored
This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on GPUs with small VRAM, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
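A simplified sketch of that default selection: the KV cache grows with parallel * num_ctx, so when the user hasn't set parallelism we can walk down from a preferred value until the estimate fits in free VRAM. All sizes and names here are illustrative assumptions, not Ollama's real estimator.

```go
package main

import "fmt"

// fits estimates whether model weights plus a KV cache sized for
// parallel*numCtx tokens fit in freeVRAM (sizes in bytes).
func fits(weights, perTokenKV uint64, numCtx, parallel int, freeVRAM uint64) bool {
	kv := perTokenKV * uint64(numCtx) * uint64(parallel)
	return weights+kv <= freeVRAM
}

// defaultParallel picks the largest parallelism <= preferred that fits,
// falling back to 1 when even modest parallelism is too big.
func defaultParallel(weights, perTokenKV uint64, numCtx, preferred int, freeVRAM uint64) int {
	for p := preferred; p > 1; p-- {
		if fits(weights, perTokenKV, numCtx, p, freeVRAM) {
			return p
		}
	}
	return 1
}

func main() {
	const GiB = 1 << 30
	// 5 GiB of weights, 512 KiB of KV per token, 2048 ctx, 8 GiB free VRAM.
	p := defaultParallel(5*GiB, 512*1024, 2048, 4, 8*GiB)
	fmt.Println("chosen parallel:", p) // 3: 4 slots would need 9 GiB
}
```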
- 20 Jun, 2024 2 commits
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
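A minimal sketch of that heuristic: compare the model file's size against free memory and skip mmap when it won't fit. The free-memory value here is a stand-in for a real probe.

```go
package main

import (
	"fmt"
	"os"
)

// useMmap returns false when the model file exceeds free memory, since
// repeatedly faulting pages in from disk is slower than a straight read.
func useMmap(modelPath string, freeMemory uint64) (bool, error) {
	fi, err := os.Stat(modelPath)
	if err != nil {
		return false, err
	}
	return uint64(fi.Size()) <= freeMemory, nil
}

func main() {
	ok, err := useMmap("/tmp/model.gguf", 16<<30) // pretend 16 GiB free
	fmt.Println(ok, err)
}
```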
Michael Yang authored
- 19 Jun, 2024 1 commit
Michael Yang authored
- 18 Jun, 2024 3 commits
Michael Yang authored
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers were the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
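A sketch of the corrected arithmetic: sum each layer's actual size instead of multiplying one layer's size by the layer count. The sizes below are made up to show how uneven layers skew the old uniform estimate.

```go
package main

import "fmt"

// vramForLayers returns the memory needed to offload the first n layers,
// using per-layer sizes rather than assuming uniform layers.
func vramForLayers(layerSizes []uint64, n int) uint64 {
	var total uint64
	for _, s := range layerSizes[:n] {
		total += s
	}
	return total
}

func main() {
	layers := []uint64{900 << 20, 300 << 20, 300 << 20, 1200 << 20} // uneven sizes
	uniform := layers[0] * uint64(len(layers))                      // old assumption
	actual := vramForLayers(layers, len(layers))
	fmt.Printf("uniform estimate: %d MiB, actual: %d MiB\n", uniform>>20, actual>>20)
}
```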
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server started was actually valid. We now log once, just before starting the server on the final configuration. The log also reports which library is in use instead of always saying "offloading to gpu" when running on CPU.
- 17 Jun, 2024 5 commits
Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so it now defaults to off. This also implements a tri-state for use_mmap so we can distinguish between a user-provided value of true or false and an unspecified value.
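A sketch of the tri-state pattern the commit describes: a *bool distinguishes "user said true", "user said false", and "unspecified", so the Windows default of off only applies when the user said nothing. (This is also the "nil as undefined" idea from the 01 Jul commit above.)

```go
package main

import "fmt"

// resolveMmap applies the platform default only when userValue is nil.
func resolveMmap(userValue *bool, windowsDefaultOff bool) bool {
	if userValue != nil {
		return *userValue // an explicit user choice always wins
	}
	return !windowsDefaultOff // unspecified: fall back to the platform default
}

func main() {
	t, f := true, false
	fmt.Println(resolveMmap(&t, true))  // true: user forced mmap on
	fmt.Println(resolveMmap(&f, true))  // false: user forced mmap off
	fmt.Println(resolveMmap(nil, true)) // false: Windows default is off
}
```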
Daniel Hiltgen authored
nvcc supports parallelism (threads), and cmake + make can use -j, while msbuild requires /p:CL_MPcount=8.
Daniel Hiltgen authored
This reverts commit 0577af98.
Daniel Hiltgen authored
We update the PATH on Windows to get the CLI mapped, but this has an unintended side effect: other apps that use our bundled DLLs can get terminated when we upgrade.
Jeffrey Morgan authored
* llm: update llama.cpp submodule to `7c26775`
* disable `LLAMA_BLAS` for now
* `-DLLAMA_OPENMP=off`
- 15 Jun, 2024 1 commit
Daniel Hiltgen authored
Make the build faster
- 14 Jun, 2024 6 commits
Daniel Hiltgen authored
Implement support for GPU env var workarounds, and leverage this for the Vega RX 56, which needs HSA_ENABLE_SDMA=0 set to work properly.
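An illustrative sketch of per-GPU env var workarounds: match the detected GPU name against a small table and apply the required variables before starting the runner. The table entry mirrors the Vega RX 56 case from the commit; the matching scheme is an assumption.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

var gpuWorkarounds = map[string]map[string]string{
	"vega": {"HSA_ENABLE_SDMA": "0"}, // Vega RX 56 needs SDMA disabled
}

// applyWorkarounds sets any env vars required for the detected GPU.
func applyWorkarounds(gpuName string) {
	for substr, vars := range gpuWorkarounds {
		if strings.Contains(strings.ToLower(gpuName), substr) {
			for k, v := range vars {
				os.Setenv(k, v)
				fmt.Printf("workaround: %s=%s for %s\n", k, v, gpuName)
			}
		}
	}
}

func main() {
	applyWorkarounds("AMD Radeon RX Vega 56")
}
```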
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Daniel Hiltgen authored
Still not complete; our prediction needs refinement to understand each discrete GPU's available space so we can see how many layers fit in each one. Since we can't split a single layer across multiple GPUs, we can't treat free space as one logical block.
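A sketch of why free space can't be pooled: each layer must land wholly on one GPU, so layers are packed per device rather than comparing the total against the sum of free VRAM. Sizes here are illustrative.

```go
package main

import "fmt"

// layersThatFit assigns whole layers to GPUs in order and returns how many
// fit. A layer larger than every remaining per-GPU space stops the packing,
// even if the combined free space would have been enough.
func layersThatFit(layerSizes, gpuFree []uint64) int {
	fitted := 0
	gpu := 0
	for _, size := range layerSizes {
		for gpu < len(gpuFree) && gpuFree[gpu] < size {
			gpu++ // this GPU can't take another layer of this size; try the next
		}
		if gpu == len(gpuFree) {
			break
		}
		gpuFree[gpu] -= size
		fitted++
	}
	return fitted
}

func main() {
	layers := []uint64{3, 3, 3} // GiB per layer
	gpus := []uint64{5, 5}      // two GPUs with 5 GiB free each
	// Pooled free space (10 GiB) suggests all 3 layers fit, but only 2 do:
	// after one 3 GiB layer per GPU, each has 2 GiB left, too small for a layer.
	fmt.Println("layers fitted:", layersThatFit(layers, gpus))
}
```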
Daniel Hiltgen authored
- 11 Jun, 2024 2 commits
Michael Yang authored
This reverts commit f5f245cc, reversing changes made to 94d37fdc. That change broke GGUF v2, which was incorrectly detected as big endian.
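A hedged sketch of where such misdetection can go wrong: the uint32 version field follows the 4-byte GGUF magic, so a plausibility check on the little-endian reading can distinguish byte orders. This illustrates the pitfall only; it is not the exact check the reverted commit got wrong.

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
)

// detectByteOrder reads the GGUF version field and picks the byte order under
// which it is plausible (GGUF versions are small integers like 2 or 3).
func detectByteOrder(header []byte) (binary.ByteOrder, uint32, error) {
	if len(header) < 8 || !bytes.Equal(header[:4], []byte("GGUF")) {
		return nil, 0, fmt.Errorf("not a GGUF header")
	}
	le := binary.LittleEndian.Uint32(header[4:8])
	if le > 0 && le < 1000 {
		return binary.LittleEndian, le, nil
	}
	be := binary.BigEndian.Uint32(header[4:8])
	if be > 0 && be < 1000 {
		return binary.BigEndian, be, nil
	}
	return nil, 0, fmt.Errorf("unrecognized GGUF version field")
}

func main() {
	// A little-endian GGUF v2 header: "GGUF" then 02 00 00 00.
	order, version, err := detectByteOrder([]byte{'G', 'G', 'U', 'F', 2, 0, 0, 0})
	fmt.Println(order, version, err) // LittleEndian 2 <nil>
}
```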
Jeffrey Morgan authored
- 09 Jun, 2024 2 commits
Craig Hughes authored
Critical fix from the llama.cpp JSON grammar to forbid unescaped escape characters inside strings, which break parsing. (#3782)
Jeffrey Morgan authored
* fix embedding by adding fixes from llama.cpp upstream
* remove assert
Co-authored-by: Jesper Ek <deadbeef84@gmail.com>
- 08 Jun, 2024 1 commit
Michael Yang authored
- 07 Jun, 2024 3 commits
Michael Yang authored
Daniel Hiltgen authored
This follows the same pattern as cuda and rocm, allowing the build to be disabled even when we detect the dependent libraries.
Jeffrey Morgan authored
- 06 Jun, 2024 1 commit
Michael Yang authored
- 04 Jun, 2024 1 commit
Michael Yang authored