Commits · cb42e607c5cf4d439ad4d5a93ed13c7d6a09fc34 · OpenDAS / ollama

25 Jun, 2024 1 commit

llm: speed up gguf decoding by a lot (#5246) · cb42e607

Blake Mizerany authored Jun 24, 2024

Previously, some costly things were causing the loading of GGUF files
and their metadata and tensor information to be VERY slow:

  * Too many allocations when decoding strings
  * Hitting disk for each read of each key and value, resulting in an
    excessive number of syscalls and disk I/O operations.

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro
M3.

This commit also prevents collecting large arrays of values when
decoding GGUFs (if desired). When such keys are encountered, their
values are null, and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.

cb42e607

11 Jun, 2024 1 commit

Revert "Merge pull request #4938 from ollama/mxyng/fix-byte-order" · 7bdcd1da

Michael Yang authored Jun 11, 2024

This reverts commit f5f245cc, reversing
changes made to 94d37fdc.

this change broke gguf v2 which is incorrectly detected as big endian

7bdcd1da

08 Jun, 2024 1 commit
- fix parsing big endian gguf · 620d5c56
  Michael Yang authored Jun 08, 2024
  
  620d5c56
07 Jun, 2024 1 commit
- fix create model when template detection errors · 030e765e
  Michael Yang authored Jun 07, 2024
  
  030e765e
04 Jun, 2024 1 commit
- lint · e40145a3
  Michael Yang authored May 21, 2024
  
  e40145a3
21 May, 2024 1 commit
- simplify safetensors reading · 171eb040
  Michael Yang authored May 20, 2024
  
  171eb040
20 May, 2024 3 commits
- cleanup · bbbd9f20
  Michael Yang authored May 15, 2024
  
  bbbd9f20
- bpe pretokenizer · 547132e8
  Michael Yang authored May 15, 2024
  
  547132e8
- llama3 conversion · c8cf0d94
  Patrick Devine authored Apr 28, 2024
  
  c8cf0d94
24 Apr, 2024 1 commit
- fixes for gguf (#3863) · 14476d48
  Patrick Devine authored Apr 23, 2024
  
  14476d48
16 Apr, 2024 2 commits
- fix padding to only return padding · e74163af
  Michael Yang authored Apr 15, 2024
  
  e74163af
- fix padding in decode · 969238b1
  Michael Yang authored Apr 15, 2024
```
TODO: update padding() to _only_ return the padding
```
  969238b1
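
The two padding commits above concern a helper that should return only the number of pad bytes, leaving the caller to add it to the offset. A minimal sketch of such a helper, assuming a 32-byte alignment for the example (the function body and the alignment value are illustrative, not taken from these commits):

```go
package main

import "fmt"

// padding returns how many bytes must be skipped to advance offset to the
// next multiple of align. It returns 0 when offset is already aligned and
// never returns the new offset itself.
func padding(offset, align int64) int64 {
	return (align - offset%align) % align
}

func main() {
	offset := int64(100)
	pad := padding(offset, 32)
	fmt.Println(pad, offset+pad) // prints "28 128"
}
```

Returning only the pad keeps the helper usable both for seeking forward and for writing filler bytes.
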
15 Apr, 2024 1 commit
- Add llama2 / torch models for `ollama create` (#3607) · 9f8691c6
  Patrick Devine authored Apr 15, 2024
  
  9f8691c6
10 Apr, 2024 1 commit
- refactor tensor query · 8b2c1006
  Michael Yang authored Apr 03, 2024
  
  8b2c1006
01 Apr, 2024 1 commit
- refactor model parsing · d338d704
  Michael Yang authored Mar 13, 2024
  
  d338d704
29 Mar, 2024 1 commit
- Add gemma safetensors conversion (#3250) · 5a5efee4
  Patrick Devine authored Mar 28, 2024
```
Co-authored-by: Michael Yang <mxyng@pm.me>
```
  5a5efee4
26 Mar, 2024 1 commit
- change `github.com/jmorganca/ollama` to `github.com/ollama/ollama` (#3347) · 1b272d5b
  Patrick Devine authored Mar 26, 2024
  
  1b272d5b
15 Mar, 2024 1 commit

llm,readline: use errors.Is instead of simple == check (#3161) · 6ce37e4d

Blake Mizerany authored Mar 15, 2024



This fixes some brittle, simple equality checks to use errors.Is. Since
go1.13, errors.Is is the idiomatic way to check for errors.
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

6ce37e4d

12 Mar, 2024 1 commit
- refactor readseeker · 00852979
  Michael Yang authored Mar 09, 2024
  
  00852979
08 Mar, 2024 1 commit
- decode ggla · 76bdebba
  Michael Yang authored Mar 08, 2024
  
  76bdebba
07 Mar, 2024 1 commit
- Convert Safetensors to an Ollama model (#2824) · 2c017ca4
  Patrick Devine authored Mar 06, 2024
  
  2c017ca4
21 Feb, 2024 1 commit
- add gguf file types (#2532) · 949d7b1c
  Michael Yang authored Feb 20, 2024
  
  949d7b1c
24 Jan, 2024 1 commit
- refactor tensor read · cd22855e
  Michael Yang authored Jan 24, 2024
  
  cd22855e
12 Jan, 2024 1 commit
- add max context length check · eaed6f8c
  Michael Yang authored Jan 12, 2024
  
  eaed6f8c
08 Jan, 2024 1 commit

Offload layers to GPU based on new model size estimates (#1850) · 08f1e189

Jeffrey Morgan authored Jan 08, 2024



* select layers based on estimated model memory usage

* always account for scratch vram

* don't load +1 layers

* better estimation for graph alloc

* Update gpu/gpu_darwin.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* Update llm/llm.go

* add overhead for cuda memory

* Update llm/llm.go
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

* fix build error on linux

* address comments

---------
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
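
The bullet points above suggest an estimate of roughly this shape. The sketch below is purely hypothetical: `layersToOffload`, its parameters, and the even per-layer split are assumptions made for illustration and do not reproduce the logic in `llm/llm.go`.

```go
package main

import "fmt"

// layersToOffload estimates how many layers fit in free VRAM after reserving
// scratch space and a fixed overhead. It assumes every layer has the same
// size and never returns more than totalLayers.
func layersToOffload(freeVRAM, modelBytes, scratchBytes, overheadBytes uint64, totalLayers int) int {
	if totalLayers <= 0 || modelBytes == 0 {
		return 0
	}
	reserved := scratchBytes + overheadBytes
	if freeVRAM <= reserved {
		return 0 // nothing left once scratch and overhead are accounted for
	}
	perLayer := modelBytes / uint64(totalLayers)
	if perLayer == 0 {
		return totalLayers
	}
	n := int((freeVRAM - reserved) / perLayer)
	if n > totalLayers {
		n = totalLayers // do not request more layers than the model has
	}
	return n
}

func main() {
	const MiB = 1 << 20
	const GiB = 1 << 30
	// 4 GiB model with 32 layers, 512 MiB scratch, 512 MiB overhead.
	fmt.Println(layersToOffload(3*GiB, 4*GiB, 512*MiB, 512*MiB, 32)) // prints 16
	fmt.Println(layersToOffload(8*GiB, 4*GiB, 512*MiB, 512*MiB, 32)) // prints 32
}
```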

08f1e189

11 Dec, 2023 1 commit

remove per-model types · 56ffc302

Michael Yang authored Dec 08, 2023

mostly replaced by decoding tensors except ggml models which only
support llama

56ffc302

05 Dec, 2023 3 commits
- comments · 5a5dca13
  Michael Yang authored Nov 29, 2023
  
  5a5dca13
- seek instead of copyn · 72e7a49a
  Michael Yang authored Nov 29, 2023
  
  72e7a49a
- split from into one or more models · 2cb0fa7d
  Michael Yang authored Nov 24, 2023
  
  2cb0fa7d
22 Nov, 2023 1 commit
- fix: gguf int type · 199941cd
  Michael Yang authored Nov 22, 2023
  
  199941cd
09 Nov, 2023 1 commit

instead of static number of parameters for each model family, get the real... · c5e1bbab

Michael Yang authored Nov 08, 2023

instead of static number of parameters for each model family, get the real number from the tensors (#1022)

* parse tensor info

* refactor decoder

* return actual parameter count

* explicit rounding

* s/Human/HumanNumber/
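
A minimal sketch of the idea, assuming each tensor's element count is the product of its dimensions. The `tensorInfo` type, `parameterCount`, and this simplified `humanNumber` are hypothetical stand-ins rather than the code from this commit, and the tensor names and shapes are example data.

```go
package main

import "fmt"

// tensorInfo is a hypothetical stand-in for decoded tensor metadata.
type tensorInfo struct {
	Name  string
	Shape []uint64
}

// elements returns the number of values in the tensor: the product of its
// dimensions.
func (t tensorInfo) elements() uint64 {
	n := uint64(1)
	for _, d := range t.Shape {
		n *= d
	}
	return n
}

// parameterCount sums the element counts of all tensors.
func parameterCount(tensors []tensorInfo) uint64 {
	var total uint64
	for _, t := range tensors {
		total += t.elements()
	}
	return total
}

// humanNumber formats a count with one decimal of explicit rounding.
func humanNumber(n uint64) string {
	switch {
	case n >= 1_000_000_000:
		return fmt.Sprintf("%.1fB", float64(n)/1e9)
	case n >= 1_000_000:
		return fmt.Sprintf("%.1fM", float64(n)/1e6)
	default:
		return fmt.Sprintf("%d", n)
	}
}

func main() {
	tensors := []tensorInfo{
		{Name: "token_embd.weight", Shape: []uint64{4096, 32000}},
		{Name: "output_norm.weight", Shape: []uint64{4096}},
	}
	n := parameterCount(tensors)
	fmt.Println(n, humanNumber(n)) // prints "131076096 131.1M"
}
```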

c5e1bbab

23 Oct, 2023 1 commit

ggufv3 · 125d0a01

Michael Yang authored Oct 23, 2023

ggufv3 adds support for big endianness, mainly for s390x architecture.
while that's not currently supported for ollama, the change is simple.

loosen version check to be more forward compatible. unless specified,
gguf versions other than v1 will be decoded into v2.

125d0a01

03 Oct, 2023 1 commit
- starcoder · c02c0cd4
  Michael Yang authored Oct 02, 2023
  
  c02c0cd4
25 Sep, 2023 1 commit
- unbound max num gpu layers (#591) · 86279f4a
  Bruce MacDonald authored Sep 25, 2023
```
---------
Co-authored-by: Michael Yang <mxyng@pm.me>
```
  86279f4a
21 Sep, 2023 1 commit

remove tmp directories created by previous servers (#559) · 4cba75ef

Bruce MacDonald authored Sep 21, 2023



* remove tmp directories created by previous servers

* clean up on server stop

* Update routes.go

* Update server/routes.go
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>

* create top-level temp ollama dir

* check file exists before creating

---------
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>

4cba75ef

18 Sep, 2023 1 commit

subprocess improvements (#524) · 66003e1d

Bruce MacDonald authored Sep 18, 2023

* subprocess improvements

- increase start-up timeout
- when runner fails to start, fail rather than timing out
- try runners in order rather than choosing 1 runner
- embed metal runner in metal dir rather than gpu
- refactor logging and error messages

* Update llama.go

* Update llama.go

* simplify by using glob

66003e1d

14 Sep, 2023 1 commit

support for packaging in multiple cuda runners (#509) · 2540c918

Bruce MacDonald authored Sep 14, 2023



* enable packaging multiple cuda versions
* use nvcc cuda version if available

---------
Co-authored-by: Michael Yang <mxyng@pm.me>

2540c918

12 Sep, 2023 2 commits
- fix model type for 70b · 0c5a4543
  Michael Yang authored Sep 12, 2023
  
  0c5a4543
- fix falcon decode · 7dee25a0
  Michael Yang authored Sep 12, 2023
```
get model and file type from bin file
```
  7dee25a0
07 Sep, 2023 1 commit
- GGUF support (#441) · 09dd2aef
  Bruce MacDonald authored Sep 07, 2023
  
  09dd2aef