- 25 Jun, 2024 2 commits
-
-
Blake Mizerany authored
Previously, some costly operations made loading GGUF files and their metadata and tensor information very slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and a large amount of disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also allows skipping the collection of large arrays of values when decoding GGUFs, if desired. When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
Blake Mizerany authored
This commit changes the 'ollama run' command to defer fetching model information until it actually needs it, that is, when in interactive mode. It also removes one case where the model information was fetched in duplicate: just before calling generateInteractive, and then again, first thing, inside generateInteractive.

This positively impacts the performance of the command:

    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
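The deferred-fetch pattern can be sketched with `sync.Once`: the expensive call runs at most once, and only on the first code path that actually needs the result, so non-interactive runs never pay for it. A minimal illustration with hypothetical names, not the actual cmd code.

```go
package main

import (
	"fmt"
	"sync"
)

// modelInfo stands in for the data returned by the show API.
type modelInfo struct{ Name string }

// lazyInfo fetches model information at most once, and only when
// first asked for it, so a non-interactive run never triggers fetch.
type lazyInfo struct {
	once  sync.Once
	fetch func() modelInfo
	info  modelInfo
}

func (l *lazyInfo) get() modelInfo {
	l.once.Do(func() { l.info = l.fetch() })
	return l.info
}

func main() {
	calls := 0
	l := &lazyInfo{fetch: func() modelInfo {
		calls++ // the expensive network call in the real command
		return modelInfo{Name: "llama3"}
	}}

	// Interactive path: asks twice, but fetch still runs only once,
	// which also fixes the duplicate-fetch case described above.
	_ = l.get()
	_ = l.get()
	fmt.Println(l.get().Name, calls) // llama3 1
}
```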
-
- 21 Jun, 2024 5 commits
-
-
Daniel Hiltgen authored
Fix use_mmap parsing for modelfiles
-
royjhan authored
-
Michael Yang authored
fix: quantization with template
-
Michael Yang authored
-
Daniel Hiltgen authored
Add the new tristate parsing logic for the code path for modelfiles, as well as a unit test.
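A tri-state boolean is commonly represented in Go as a `*bool`: nil for unspecified, a pointer to true or false otherwise. A minimal sketch of what such parsing might look like (`parseTriState` is a hypothetical name, not ollama's actual function):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTriState distinguishes "unset" from an explicit true/false by
// returning a *bool: nil means the user never specified the option,
// so the caller is free to apply its own default.
func parseTriState(s string) (*bool, error) {
	if s == "" {
		return nil, nil // unspecified
	}
	v, err := strconv.ParseBool(s)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean %q: %w", s, err)
	}
	return &v, nil
}

func main() {
	for _, s := range []string{"", "true", "false"} {
		v, err := parseTriState(s)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case v == nil:
			fmt.Println("unspecified")
		default:
			fmt.Println(*v)
		}
	}
}
```

A unit test for this shape checks all three states plus the error case, since the whole point is that nil and false are no longer conflated.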
-
- 20 Jun, 2024 9 commits
-
-
Daniel Hiltgen authored
Refine mmap default logic on linux
-
Daniel Hiltgen authored
Bump latest fedora cuda repo to 39
-
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
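The default logic described above amounts to a comparison between model size and free memory, with an explicit user choice taking precedence. A hedged sketch; `shouldUseMmap` and its inputs are illustrative, not ollama's actual heuristic:

```go
package main

import "fmt"

// shouldUseMmap honors an explicit user choice (nil means
// unspecified), otherwise avoids mmap when the model would not fit
// in free memory, since faulting pages in on demand is slower than a
// straight read in that case.
func shouldUseMmap(userChoice *bool, modelSize, freeMemory uint64) bool {
	if userChoice != nil {
		return *userChoice
	}
	return modelSize <= freeMemory
}

func main() {
	const gib = 1 << 30
	fmt.Println(shouldUseMmap(nil, 4*gib, 16*gib))  // true: fits in free memory
	fmt.Println(shouldUseMmap(nil, 40*gib, 16*gib)) // false: would thrash
	force := true
	fmt.Println(shouldUseMmap(&force, 40*gib, 16*gib)) // true: user override
}
```

Note how this depends on the tri-state use_mmap value: only a nil (unspecified) choice falls through to the free-memory check.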
-
Michael Yang authored
handle asymmetric embedding KVs
-
Josh authored
fix: skip os.removeAll() if PID does not exist
-
Michael Yang authored
-
Josh Yan authored
-
Josh Yan authored
-
Josh Yan authored
-
- 19 Jun, 2024 15 commits
-
-
royjhan authored
* API Show Extended
* Initial Draft of Information

Co-Authored-By: Patrick Devine <pdevine@sonic.net>

* Clean Up
* Descriptive arg error messages and other fixes
* Second Draft of Show with Projectors Included
* Remove Chat Template
* Touches
* Prevent wrapping from files
* Verbose functionality
* Docs
* Address Feedback
* Lint
* Resolve Conflicts
* Function Name
* Tests for api/show model info
* Show Test File
* Add Projector Test
* Clean routes
* Projector Check
* Move Show Test
* Touches
* Doc update

---------

Co-authored-by: Patrick Devine <pdevine@sonic.net>
-
Daniel Hiltgen authored
Implement log rotation for tray app
-
Daniel Hiltgen authored
-
Michael Yang authored
remove confusing log message
-
Michael Yang authored
-
Daniel Hiltgen authored
Move libraries out of users path
-
Daniel Hiltgen authored
Put back temporary intel GPU env var
-
Daniel Hiltgen authored
Fix bad symbol load detection
-
Daniel Hiltgen authored
This reverts commit 755b4e4f.
-
Daniel Hiltgen authored
Pointer dereferences weren't correct in a few libraries, which explains some crashes on older systems or with miswired symlinks for discovery libraries.
-
Daniel Hiltgen authored
Fix levelzero empty symbol detect
-
Blake Mizerany authored
The Digest type in its current form is awkward to work with and presents challenges with regard to how it serializes via String using the '-' prefix. We currently only use this in ollama.com, so we'll move our specific needs around digest parsing and validation there.
-
Wang,Zhe authored
-
Daniel Hiltgen authored
-
- 18 Jun, 2024 7 commits
-
-
Michael Yang authored
deepseek v2 graph
-
Michael Yang authored
-
Daniel Hiltgen authored
Handle models with divergent layer sizes
-
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
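The fix implies summing actual per-layer sizes rather than multiplying one layer's size by the layer count. An illustrative sketch under that assumption; the function names are hypothetical, not ollama's:

```go
package main

import "fmt"

// estimateUniform assumes every layer is the size of the first one,
// which is the buggy assumption described above: fine for uniform
// models, significantly off for models with divergent layer sizes.
func estimateUniform(layers []uint64) uint64 {
	if len(layers) == 0 {
		return 0
	}
	return layers[0] * uint64(len(layers))
}

// estimateExact sums the real size of each layer.
func estimateExact(layers []uint64) uint64 {
	var total uint64
	for _, s := range layers {
		total += s
	}
	return total
}

func main() {
	// Divergent layer sizes, as in a model like deepseek-coder-v2.
	layers := []uint64{100, 100, 400, 400, 400}
	fmt.Println(estimateUniform(layers)) // 500: significantly off
	fmt.Println(estimateExact(layers))   // 1400: actual footprint
}
```

An underestimate here is what makes the scheduler offload more layers than actually fit, so the exact sum matters for the GPU memory prediction.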
-
Daniel Hiltgen authored
Tighten up memory prediction logging
-
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. This now logs once, just before starting the server, on the final configuration. It also reports which library is being used instead of always saying "offloading to gpu" when running on CPU.
-
Daniel Hiltgen authored
Adjust mmap logic for cuda windows for faster model load
-
- 17 Jun, 2024 2 commits
-
-
Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so default it to off. This also implements a tri-state for use_mmap so we can distinguish a user-provided value of true/false from unspecified.
-
Jeffrey Morgan authored
-