- 01 Jul, 2024 9 commits

  - Michael Yang authored
  - Michael Yang authored
  - Daniel Hiltgen authored
    Fix case for NumCtx
  - Daniel Hiltgen authored
    Document concurrent behavior and settings
  - Daniel Hiltgen authored
    This may confuse users into thinking "auto" is an acceptable string; the value must be numeric.
  - Daniel Hiltgen authored
  - Daniel Hiltgen authored
    Enable concurrency by default
  - RAPID ARCHITECT authored
    * Update README.md: add Mesop example to web & desktop
    * Update README.md
    Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
  - Eduard authored
    Runs fine on an NVIDIA GeForce GTX 1050 Ti

- 29 Jun, 2024 2 commits

  - Jeffrey Morgan authored
  - Jeffrey Morgan authored
    * Do not shift context for sliding window models
    * Truncate the prompt when it exceeds 2/3 of the context tokens
    * Only target gemma2
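
The truncation rule lends itself to a short sketch. Below is a minimal Go illustration of the "keep the most recent 2/3 of the context" idea from the bullets above; the names (truncatePrompt, numCtx) are hypothetical stand-ins, not ollama's actual runner code:

```go
package main

import "fmt"

// truncatePrompt keeps only the most recent tokens when a sliding-window
// model cannot shift its context. The 2/3 threshold mirrors the commit
// message; the function and its names are hypothetical.
func truncatePrompt(tokens []int, numCtx int) []int {
	limit := numCtx * 2 / 3
	if len(tokens) <= limit {
		return tokens
	}
	// Drop the oldest tokens and keep the tail that still fits.
	return tokens[len(tokens)-limit:]
}

func main() {
	tokens := make([]int, 10)
	for i := range tokens {
		tokens[i] = i
	}
	fmt.Println(truncatePrompt(tokens, 9)) // [4 5 6 7 8 9]
}
```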

- 28 Jun, 2024 4 commits

  - Daniel Hiltgen authored
  - royjhan authored
  - royjhan authored
    * Check projtype exists
    * Maintain ordering
  - royjhan authored

- 27 Jun, 2024 5 commits

  - Michael Yang authored
    gemma2 graph
  - Michael Yang authored
  - Michael authored
    * update readme for gemma 2
  - Michael Yang authored
  - Jeffrey Morgan authored

- 25 Jun, 2024 2 commits

  - Blake Mizerany authored
    Previously, loading GGUF files, their metadata, and their tensor information was VERY slow for two costly reasons:
    * Too many allocations when decoding strings
    * Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O
    The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3. This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs; when such keys are encountered, their values are null and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.
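
Both costs have standard Go remedies: wrap the file in a buffered reader so each small key/value read no longer costs a syscall, and reuse one scratch buffer while decoding strings. A rough sketch of that pattern under those assumptions, not the actual GGUF decoder:

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString decodes a length-prefixed, GGUF-style string. The bufio.Reader
// means each small read hits an in-memory buffer instead of issuing its own
// syscall, and the scratch buffer is reused across calls so only the final
// []byte-to-string conversion allocates.
func readString(r *bufio.Reader, scratch []byte) (string, []byte, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", scratch, err
	}
	if uint64(cap(scratch)) < n {
		scratch = make([]byte, n)
	}
	scratch = scratch[:n]
	if _, err := io.ReadFull(r, scratch); err != nil {
		return "", scratch, err
	}
	return string(scratch), scratch, nil
}

func main() {
	// Demo: one length-prefixed string held in memory.
	data := append(binary.LittleEndian.AppendUint64(nil, 5), "hello"...)
	s, _, err := readString(bufio.NewReader(bytes.NewReader(data)), nil)
	fmt.Println(s, err) // hello <nil>
}
```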

  - Blake Mizerany authored
    This commit changes the 'ollama run' command to defer fetching model information until it is actually needed, that is, in interactive mode. It also removes one case where the model information was fetched twice: just before calling generateInteractive, and then again, first thing, inside generateInteractive. This positively impacts the performance of the command:

    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
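
The deferral itself is a small pattern. A minimal sketch with a hypothetical lazyInfo type (not the actual cmd code), using sync.Once so interactive mode pays for the fetch exactly once and non-interactive runs never pay it:

```go
package main

import (
	"fmt"
	"sync"
)

// lazyInfo defers a fetch until first use and caches the result,
// so repeated callers share one round trip.
type lazyInfo struct {
	once sync.Once
	info string
	err  error
}

func (l *lazyInfo) get(fetch func() (string, error)) (string, error) {
	l.once.Do(func() { l.info, l.err = fetch() })
	return l.info, l.err
}

func main() {
	var l lazyInfo
	fetch := func() (string, error) {
		fmt.Println("fetching model info...") // printed exactly once
		return "llama3", nil
	}
	for i := 0; i < 3; i++ {
		info, _ := l.get(fetch)
		fmt.Println(info)
	}
}
```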

- 21 Jun, 2024 8 commits

  - Daniel Hiltgen authored
    Fix use_mmap parsing for modelfiles
  - Daniel Hiltgen authored
    Provide consistent ordering for the ps command: longest duration listed first
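
One plausible way to get that ordering in Go is a descending sort on each loaded model's expiry time; the record type here is a hypothetical stand-in for whatever ollama ps actually returns:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// runningModel is a hypothetical record; ollama's real type differs.
type runningModel struct {
	Name      string
	ExpiresAt time.Time
}

func main() {
	now := time.Now()
	models := []runningModel{
		{"gemma2", now.Add(2 * time.Minute)},
		{"llama3", now.Add(5 * time.Minute)},
	}
	// Longest remaining duration first, so output order is stable.
	sort.Slice(models, func(i, j int) bool {
		return models[i].ExpiresAt.After(models[j].ExpiresAt)
	})
	for _, m := range models {
		fmt.Println(m.Name, time.Until(m.ExpiresAt).Round(time.Second))
	}
}
```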

  - royjhan authored
  - Daniel Hiltgen authored
    Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but they will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs now have accurate VRAM reporting wired up, so we can turn on concurrency by default.
  - Daniel Hiltgen authored
    This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
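
The fitting heuristic could look something like the following: shrink the parallel count until the weights plus one context allocation per request fit in free VRAM. The sizes and names are hypothetical, and this is a sketch of the idea rather than the scheduler's real algorithm:

```go
package main

import "fmt"

// pickParallel sketches the idea: when the user has not set
// OLLAMA_NUM_PARALLEL, reduce parallelism until the model plus its
// per-request context fits in free VRAM. All sizes are hypothetical.
func pickParallel(freeVRAM, modelSize, perReqCtx uint64, want int) int {
	for n := want; n > 1; n-- {
		if modelSize+uint64(n)*perReqCtx <= freeVRAM {
			return n
		}
	}
	return 1
}

func main() {
	const GiB = 1 << 30
	fmt.Println(pickParallel(8*GiB, 5*GiB, 1*GiB, 4)) // 3
}
```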

  - Michael Yang authored
    fix: quantization with template
  - Michael Yang authored
  - Daniel Hiltgen authored
    Add the new tristate parsing logic to the code path for modelfiles, along with a unit test.
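
One common way to model a tristate like use_mmap in Go is a *bool, where nil means "not specified". A minimal sketch of such parsing, hypothetical rather than the actual modelfile parser:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseUseMmap returns nil when the modelfile does not set use_mmap at all,
// keeping "unset" distinguishable from an explicit true or false.
func parseUseMmap(val string, present bool) (*bool, error) {
	if !present {
		return nil, nil // tristate: not specified
	}
	b, err := strconv.ParseBool(val)
	if err != nil {
		return nil, fmt.Errorf("use_mmap must be true or false, got %q", val)
	}
	return &b, nil
}

func main() {
	cases := []struct {
		val     string
		present bool
	}{{"", false}, {"true", true}, {"banana", true}}
	for _, c := range cases {
		switch b, err := parseUseMmap(c.val, c.present); {
		case err != nil:
			fmt.Println("error:", err)
		case b == nil:
			fmt.Println("use_mmap unset")
		default:
			fmt.Println("use_mmap =", *b)
		}
	}
}
```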

- 20 Jun, 2024 9 commits

  - Daniel Hiltgen authored
    Refine mmap default logic on Linux
  - Daniel Hiltgen authored
    Bump latest Fedora CUDA repo to 39
  - Daniel Hiltgen authored
    If we try to use mmap when the model is larger than the available system memory, loading is slower than the no-mmap approach.
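
The resulting default reduces to a simple comparison. A toy sketch, assuming the caller already knows the model size and free memory:

```go
package main

import "fmt"

// shouldMmap sketches the refined default: mmap only pays off when the whole
// model can stay resident. Once the model exceeds free memory, page faults
// make mmap-based loading slower than reading the file outright.
func shouldMmap(modelSize, freeMemory uint64) bool {
	return modelSize <= freeMemory
}

func main() {
	const GiB = 1 << 30
	fmt.Println(shouldMmap(4*GiB, 16*GiB))  // true: model fits, mmap it
	fmt.Println(shouldMmap(40*GiB, 16*GiB)) // false: fall back to no-mmap
}
```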

  - Michael Yang authored
    handle asymmetric embedding KVs
  - Josh authored
    fix: skip os.RemoveAll() if PID does not exist
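
A guard like that needs a way to ask whether a PID is still alive. On Unix the usual probe is sending signal 0; this sketch shows only that probe (the commit's actual cleanup rule lives in ollama's server code):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidExists reports whether a process is still alive on Unix: sending
// signal 0 performs the existence/permission check without delivering
// anything. (Windows needs a different probe; os.FindProcess alone is
// not enough on Unix, where it always succeeds.)
func pidExists(pid int) bool {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	fmt.Println(pidExists(os.Getpid())) // true: this process exists
	fmt.Println(pidExists(99999999))    // almost certainly false
}
```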

  - Michael Yang authored
  - Josh Yan authored
  - Josh Yan authored
  - Josh Yan authored

- 19 Jun, 2024 1 commit

  - royjhan authored
    * API Show Extended
    * Initial Draft of Information
    * Clean Up
    * Descriptive arg error messages and other fixes
    * Second Draft of Show with Projectors Included
    * Remove Chat Template
    * Touches
    * Prevent wrapping from files
    * Verbose functionality
    * Docs
    * Address Feedback
    * Lint
    * Resolve Conflicts
    * Function Name
    * Tests for api/show model info
    * Show Test File
    * Add Projector Test
    * Clean routes
    * Projector Check
    * Move Show Test
    * Touches
    * Doc update
    Co-authored-by: Patrick Devine <pdevine@sonic.net>