Commits · 88bcd79bb9a4b2baa739efe2ccabcbcf3c89bdb5 · OpenDAS / ollama

01 Jul, 2024 2 commits
- err on insecure path · 88bcd79b
  Michael Yang authored Jun 30, 2024
  
  88bcd79b
- Fix case for NumCtx · cff3f44f
  Daniel Hiltgen authored Jul 01, 2024
  
  cff3f44f
27 Jun, 2024 1 commit
- zip: prevent extracting files into parent dirs (#5314) · 123a722a
  Michael Yang authored Jun 26, 2024
  
  123a722a
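
The commit title above names a familiar class of bug: an archive entry such as `../../etc/passwd` that resolves outside the extraction directory. The sketch below shows one common way to guard against it in Go. It is not the code from this commit, and `safeJoin` and the example paths are hypothetical names chosen for illustration.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// safeJoin joins an archive entry name onto root and rejects any result
// that would land outside root (hypothetical helper, for illustration only).
func safeJoin(root, name string) (string, error) {
	target := filepath.Join(root, name) // Join also cleans "a/../.." sequences
	rel, err := filepath.Rel(root, target)
	if err != nil {
		return "", err
	}
	// A relative path that is ".." or starts with "../" points above root.
	if rel == ".." || strings.HasPrefix(rel, ".."+string(filepath.Separator)) {
		return "", fmt.Errorf("insecure path %q: escapes %q", name, root)
	}
	return target, nil
}

func main() {
	for _, name := range []string{"model/weights.bin", "../../etc/passwd"} {
		p, err := safeJoin("/tmp/extract", name)
		fmt.Println(p, err)
	}
}
```

Checking the cleaned, joined path rather than the raw entry name also catches names that only escape after normalization, such as `a/../../evil`.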
25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, two costly behaviors made loading GGUF files, along with
their metadata and tensor information, VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in an
    excessive number of syscalls and disk I/O operations.

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.
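
The two problems listed above (per-string allocations and one disk read per key or value) can be illustrated with a small sketch. It assumes a simplified layout of length-prefixed strings (a little-endian `uint64` length followed by the bytes) and is not the decoder from this commit. `readString`, `decodeAll`, `encodeStrings`, and `maxStringLen` are hypothetical names.

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// maxStringLen is an arbitrary cap for this sketch, so a corrupt length
// cannot trigger a huge allocation.
const maxStringLen = 1 << 20

// readString decodes one length-prefixed string. It reuses scratch as the
// read buffer, so the only per-string allocation is the string itself.
func readString(r io.Reader, scratch []byte) (string, []byte, error) {
	var hdr [8]byte
	if _, err := io.ReadFull(r, hdr[:]); err != nil {
		return "", scratch, err
	}
	n := binary.LittleEndian.Uint64(hdr[:])
	if n > maxStringLen {
		return "", scratch, fmt.Errorf("string length %d exceeds cap", n)
	}
	if uint64(cap(scratch)) < n {
		scratch = make([]byte, n)
	}
	scratch = scratch[:n]
	if _, err := io.ReadFull(r, scratch); err != nil {
		return "", scratch, err
	}
	return string(scratch), scratch, nil
}

// encodeStrings builds test input in the same simplified layout.
func encodeStrings(ss ...string) []byte {
	var buf bytes.Buffer
	for _, s := range ss {
		var hdr [8]byte
		binary.LittleEndian.PutUint64(hdr[:], uint64(len(s)))
		buf.Write(hdr[:])
		buf.WriteString(s)
	}
	return buf.Bytes()
}

// decodeAll wraps the source in a bufio.Reader so many small reads are
// served from memory instead of each one reaching the underlying source.
func decodeAll(data []byte) ([]string, error) {
	r := bufio.NewReaderSize(bytes.NewReader(data), 32<<10)
	var (
		out     []string
		scratch []byte
	)
	for {
		s, buf, err := readString(r, scratch)
		if err == io.EOF {
			return out, nil
		}
		if err != nil {
			return out, err
		}
		scratch = buf
		out = append(out, s)
	}
}

func main() {
	got, err := decodeAll(encodeStrings("general.architecture", "llama"))
	fmt.Println(got, err)
}
```

With a real file, the `bytes.NewReader` would be replaced by an `*os.File`; the buffered reader is what turns thousands of tiny reads into a handful of larger ones.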

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.
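
A minimal sketch of the null behavior described above, assuming the decoded key/value pairs are held in a `map[string]any`. `keepArray` and `encodeKV` are hypothetical helpers, and the size limit is an illustrative parameter rather than the option this commit exposes.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// keepArray returns values unchanged when the array is within maxLen and
// nil otherwise. A negative maxLen means "no limit".
func keepArray(values []any, maxLen int) any {
	if maxLen >= 0 && len(values) > maxLen {
		return nil // skipped: the key survives, the value does not
	}
	return values
}

// encodeKV marshals the key/value map; nil values are emitted as JSON null.
func encodeKV(kv map[string]any) (string, error) {
	b, err := json.Marshal(kv)
	return string(b), err
}

func main() {
	kv := map[string]any{
		"general.architecture":  "llama",
		"tokenizer.ggml.tokens": keepArray([]any{"a", "b", "c"}, 2),
	}
	out, err := encodeKV(kv)
	fmt.Println(out, err)
}
```

Keeping the key with a null value lets a client see that the field exists without paying for a very large array in every response.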

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

21 Jun, 2024 4 commits

Sort the ps output · 642cee13

Daniel Hiltgen authored Jun 21, 2024

Provide consistent ordering for the ps command - longest duration listed first
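
One way to get the ordering described above is a stable sort on the remaining time. The sketch assumes "duration" means the time left before a model is unloaded; the `runningModel` type and its fields are hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// runningModel is a hypothetical stand-in for one row of ps output.
type runningModel struct {
	Name         string
	RemainingSec int // assumed meaning of "duration": seconds until unload
}

// sortLongestFirst orders models by remaining time, longest first.
// SliceStable keeps the input order for ties, so output is deterministic.
func sortLongestFirst(models []runningModel) {
	sort.SliceStable(models, func(i, j int) bool {
		return models[i].RemainingSec > models[j].RemainingSec
	})
}

// names extracts the model names in their current order.
func names(models []runningModel) []string {
	out := make([]string, 0, len(models))
	for _, m := range models {
		out = append(out, m.Name)
	}
	return out
}

func main() {
	models := []runningModel{
		{Name: "llama3", RemainingSec: 10},
		{Name: "mistral", RemainingSec: 300},
		{Name: "phi3", RemainingSec: 60},
	}
	sortLongestFirst(models)
	fmt.Println(names(models))
}
```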

642cee13

Disable concurrency for AMD + Windows · 9929751c

Daniel Hiltgen authored Jun 19, 2024

Until ROCm v6.2 ships, we won't be able to get accurate free memory
reporting on Windows, which makes automatic concurrency too risky.
Users can still opt in, but will need to pay attention to model sizes;
otherwise they may thrash/page VRAM or cause OOM crashes.
All other platforms and GPUs have accurate VRAM reporting wired
up now, so we can turn on concurrency by default.

9929751c

Enable concurrency by default · 17b7186c

Daniel Hiltgen authored May 06, 2024

This adjusts our default settings to enable multiple models and parallel
requests to a single model. Users can still override these with the same
env var settings as before. Parallel has a direct impact on
num_ctx, which in turn can have a significant impact on small-VRAM GPUs.
This change therefore also refines the algorithm: when parallel is not
explicitly set by the user, we try to find a reasonable default that fits
the model on their GPU(s). As before, multiple models will only load
concurrently if they fully fit in VRAM.
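
The default-selection idea above can be sketched as follows. This is an illustration under simple assumptions, not the scheduler's algorithm: memory is modeled as weights plus a per-token KV cache cost multiplied by `num_ctx` and the parallel count. `modelEstimate` and `pickParallel` are hypothetical names.

```go
package main

import "fmt"

// modelEstimate is a deliberately simple, hypothetical memory model.
type modelEstimate struct {
	WeightsBytes    uint64
	KVBytesPerToken uint64
}

// required estimates VRAM for a given context size and parallel count.
// Each parallel slot is assumed to need its own num_ctx worth of KV cache.
func (m modelEstimate) required(numCtx, parallel int) uint64 {
	return m.WeightsBytes + m.KVBytesPerToken*uint64(numCtx)*uint64(parallel)
}

// pickParallel honors an explicit user setting (userParallel > 0).
// Otherwise it walks down from defaultParallel to the largest value whose
// estimate fits in freeVRAM, falling back to 1.
func pickParallel(userParallel, defaultParallel, numCtx int, m modelEstimate, freeVRAM uint64) int {
	if userParallel > 0 {
		return userParallel
	}
	for p := defaultParallel; p > 1; p-- {
		if m.required(numCtx, p) <= freeVRAM {
			return p
		}
	}
	return 1
}

func main() {
	m := modelEstimate{WeightsBytes: 4 << 30, KVBytesPerToken: 128 << 10}
	fmt.Println(pickParallel(0, 4, 2048, m, 6<<30))
	fmt.Println(pickParallel(0, 4, 2048, m, (4<<30)+(600<<20)))
}
```

In this model each extra parallel slot costs 256 MiB at `num_ctx` 2048, which is why a GPU with little headroom ends up with a smaller default.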

17b7186c

fix: quantization with template · e835ef18
Michael Yang authored Jun 21, 2024

e835ef18

19 Jun, 2024 1 commit

Extend api/show and ollama show to return more model info (#4881) · fedf7163

royjhan authored Jun 19, 2024



* API Show Extended

* Initial Draft of Information
Co-Authored-By: Patrick Devine <pdevine@sonic.net>

* Clean Up

* Descriptive arg error messages and other fixes

* Second Draft of Show with Projectors Included

* Remove Chat Template

* Touches

* Prevent wrapping from files

* Verbose functionality

* Docs

* Address Feedback

* Lint

* Resolve Conflicts

* Function Name

* Tests for api/show model info

* Show Test File

* Add Projector Test

* Clean routes

* Projector Check

* Move Show Test

* Touches

* Doc update

---------
Co-authored-by: Patrick Devine <pdevine@sonic.net>

fedf7163

16 Jun, 2024 1 commit
- Add ModifiedAt Field to /api/show (#5033) · 89c79bec
  royjhan authored Jun 15, 2024
```
* Add Mod Time to Show

* Error Handling
```
  89c79bec
14 Jun, 2024 8 commits

review comments and coverage · 6f351bf5
Daniel Hiltgen authored Jun 05, 2024

6f351bf5

Prevent multiple concurrent loads on the same gpus · ff4f0cbd

Daniel Hiltgen authored Jun 04, 2024

While models are loading, the VRAM metrics are dynamic, so try
to load on a GPU that doesn't have a model actively loading, or wait,
to avoid races that lead to OOMs.

ff4f0cbd

Refine CPU load behavior with system memory visibility · fc37c192
Daniel Hiltgen authored Jun 03, 2024

fc37c192

Reintroduce nvidia nvml library for windows · 434dfe30

Daniel Hiltgen authored Jun 03, 2024

This library will give us the most reliable free VRAM reporting on Windows
to enable concurrent model scheduling.

434dfe30

Harden unload for empty runners · 48702dd1
Daniel Hiltgen authored May 30, 2024

48702dd1

Support forced spreading for multi GPU · 5e8ff556

Daniel Hiltgen authored May 08, 2024

Our default behavior today is to try to fit into a single GPU if possible.
Some users would prefer the old behavior of always spreading across
multiple GPUs even if the model can fit into one.  This exposes that
tunable behavior.
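
The two placement behaviors described above can be contrasted in a short sketch. The `forceSpread` flag stands in for whatever tunable this commit exposes, and `chooseGPUs` is a hypothetical function, not the project's code.

```go
package main

import "fmt"

// chooseGPUs returns the indices of the GPUs to use for a model.
// Default: use the first single GPU with enough free memory, if any.
// With forceSpread, or when no single GPU fits, use every GPU.
func chooseGPUs(free []uint64, required uint64, forceSpread bool) []int {
	if !forceSpread {
		for i, f := range free {
			if f >= required {
				return []int{i}
			}
		}
	}
	all := make([]int, len(free))
	for i := range free {
		all[i] = i
	}
	return all
}

func main() {
	free := []uint64{8 << 30, 8 << 30}
	fmt.Println(chooseGPUs(free, 6<<30, false)) // fits on one GPU
	fmt.Println(chooseGPUs(free, 6<<30, true))  // spreading forced
	fmt.Println(chooseGPUs(free, 12<<30, false))
}
```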

5e8ff556

Improve multi-gpu handling at the limit · 6fd04ca9

Daniel Hiltgen authored May 18, 2024

Still not complete: our prediction needs some refinement to understand each
discrete GPU's available space, so we can see how many layers fit in each one.
Since we can't split one layer across multiple GPUs, we can't treat free space
as one logical block.
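
A small worked example of why pooled free space overestimates capacity. It assumes uniform layer sizes, which real models do not guarantee. `layersPerGPU` and `pooledLayers` are hypothetical helpers.

```go
package main

import "fmt"

// layersPerGPU counts whole layers that fit on each GPU separately,
// since a single layer cannot be split across devices.
func layersPerGPU(free []uint64, layerSize uint64) (perGPU []int, total int) {
	if layerSize == 0 {
		return nil, 0
	}
	perGPU = make([]int, len(free))
	for i, f := range free {
		perGPU[i] = int(f / layerSize)
		total += perGPU[i]
	}
	return perGPU, total
}

// pooledLayers is the naive estimate that treats all free space as one block.
func pooledLayers(free []uint64, layerSize uint64) int {
	if layerSize == 0 {
		return 0
	}
	var sum uint64
	for _, f := range free {
		sum += f
	}
	return int(sum / layerSize)
}

func main() {
	free := []uint64{5 << 30, 5 << 30} // two GPUs with 5 GiB free each
	layer := uint64(3 << 30)           // 3 GiB per layer
	perGPU, total := layersPerGPU(free, layer)
	fmt.Println(perGPU, total, pooledLayers(free, layer))
}
```

With two GPUs that each have 5 GiB free and 3 GiB layers, the per-GPU count is one layer each (two total), while the pooled estimate claims three.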

6fd04ca9

server: longer timeout in `TestRequests` (#5046) · dd7c9ebe
Jeffrey Morgan authored Jun 14, 2024

dd7c9ebe

13 Jun, 2024 2 commits
- add OLLAMA_MODELS to envconfig (#5029) · 94618b23
  Patrick Devine authored Jun 13, 2024
  
  94618b23
- server: remove jwt decoding error (#5027) · 1fd236d1
  Jeffrey Morgan authored Jun 13, 2024
  
  1fd236d1
12 Jun, 2024 1 commit

fix: multiple templates when creating from model · c16f8af9

Michael Yang authored Jun 12, 2024

multiple templates may appear in a model if a model is created from
another model that 1) has an autodetected template and 2) defines a
custom template
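
The duplicate-template situation described above can be sketched as a layer-merging step that keeps at most one layer per media type. This illustrates the idea rather than reproducing the fix in this commit. The `layer` type, the `withTemplate` and `countType` helpers, the digests, and the media type strings are assumptions made for the example.

```go
package main

import "fmt"

// Media type strings are assumed here for illustration.
const (
	modelMediaType    = "application/vnd.ollama.image.model"
	templateMediaType = "application/vnd.ollama.image.template"
)

// layer is a hypothetical, minimal stand-in for a model layer.
type layer struct {
	MediaType string
	Digest    string
}

// withTemplate copies the inherited layers, drops any layer that shares the
// custom layer's media type, and appends the custom layer. The result holds
// exactly one layer of that type.
func withTemplate(inherited []layer, custom layer) []layer {
	out := make([]layer, 0, len(inherited)+1)
	for _, l := range inherited {
		if l.MediaType == custom.MediaType {
			continue // the inherited (for example, autodetected) template is replaced
		}
		out = append(out, l)
	}
	return append(out, custom)
}

// countType reports how many layers have the given media type.
func countType(layers []layer, mediaType string) int {
	n := 0
	for _, l := range layers {
		if l.MediaType == mediaType {
			n++
		}
	}
	return n
}

func main() {
	parent := []layer{
		{MediaType: modelMediaType, Digest: "sha256:weights"},
		{MediaType: templateMediaType, Digest: "sha256:autodetected"},
	}
	merged := withTemplate(parent, layer{MediaType: templateMediaType, Digest: "sha256:custom"})
	fmt.Println(merged, countType(merged, templateMediaType))
}
```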

c16f8af9

10 Jun, 2024 2 commits
- fix: skip removing layers that no longer exist · 515f497e
  Michael Yang authored Jun 10, 2024
  
  515f497e
- add test · b27268aa
  Michael Yang authored Jun 10, 2024
  
  b27268aa
07 Jun, 2024 1 commit
- fix create model when template detection errors · 030e765e
  Michael Yang authored Jun 07, 2024
  
  030e765e
06 Jun, 2024 3 commits
- detect chat template from KV · 9b6c2e6e
  Michael Yang authored Jun 03, 2024
  
  9b6c2e6e
- API app/browser access (#4879) · 1a29e9a8
  royjhan authored Jun 06, 2024
```
* API app/browser access

* Add tauri (resolves #2291, #4791, #3799, #4388)
```
  1a29e9a8
- Separate ListResponse and ModelResponse for api/tags vs api/ps (#4842) · 4bf1da49
  royjhan authored Jun 06, 2024
```
* Remove false time fields

* Struct Separation for List and Process

* Remove Marshaler
```
  4bf1da49
05 Jun, 2024 1 commit
- server: skip blob verification for already verified blobs · de5beb06
  Blake Mizerany authored May 24, 2024
  
  de5beb06
04 Jun, 2024 7 commits
- update create handler to use model.Name · d61ef8b9
  Michael Yang authored May 08, 2024
  
  d61ef8b9
- gofmt, goimports · 6297f856
  Michael Yang authored Jun 04, 2024
  
  6297f856
- more lint · 8ce4032e
  Michael Yang authored May 29, 2024
  
  8ce4032e
- lint · e40145a3
  Michael Yang authored May 21, 2024
  
  e40145a3
- some gocritic · c895a7d1
  Michael Yang authored May 21, 2024
  
  c895a7d1
- nolintlint · 8ffb5174
  Michael Yang authored May 21, 2024
  
  8ffb5174
- replace x/exp/slices with slices · 04f3c12b
  Michael Yang authored May 21, 2024
  
  04f3c12b
24 May, 2024 2 commits
- Move envconfig and consolidate env vars (#4608) · 4cc3be30
  Patrick Devine authored May 24, 2024
  
  4cc3be30
- Fix download retry issue · db2ffa79
  Tim Scheuermann authored May 24, 2024
  
  db2ffa79
23 May, 2024 1 commit

Use flash attention flag for now (#4580) · 38255d2a

Jeffrey Morgan authored May 22, 2024

* put flash attention behind flag for now

* add test

* remove print

* up timeout for scheduler tests

38255d2a

21 May, 2024 1 commit

Correct typo in error message (#4535) · 4434d7f4

Sang Park authored May 22, 2024

Corrected the spelling of "request", which was previously written as "requeset" in the error log message.

4434d7f4

20 May, 2024 1 commit
- fix quantize file types · 807d0927
  Michael Yang authored May 17, 2024
  
  807d0927