1. 11 Sep, 2024 1 commit
    • runner: Flush pending responses before returning · 93ac3760
      Jesse Gross authored
      If there are any pending responses (such as from potential stop
      tokens), then we should send them back before ending the sequence.
      Otherwise, tokens can be missing from the end of a response.
      
      Fixes #6707
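      A minimal sketch of the behavior described above, assuming a hypothetical Sequence type with a pendingResponses buffer (the runner's real types and field names may differ): anything buffered while checking for stop tokens is sent to the client before the sequence ends.

        package runner

        // Sequence is a pared-down stand-in for the runner's per-request state;
        // the field names here are illustrative only.
        type Sequence struct {
            pendingResponses []string      // responses held back while checking stop tokens
            quit             chan struct{} // closed when the client disconnects
        }

        // flushPending sends any buffered responses before the sequence ends,
        // so trailing tokens are not dropped.
        func flushPending(seq *Sequence, responses chan<- string) bool {
            for _, r := range seq.pendingResponses {
                select {
                case responses <- r:
                case <-seq.quit:
                    return false // client went away; drop the rest
                }
            }
            seq.pendingResponses = seq.pendingResponses[:0]
            return true
        }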
  2. 04 Sep, 2024 1 commit
  3. 03 Sep, 2024 1 commit
    • Fix sprintf to snprintf (#5664) · 94fff580
      FellowTraveler authored
      /Users/au/src/ollama/llm/ext_server/server.cpp:289:9: warning: 'sprintf' is deprecated: This function is provided for compatibility reasons only. Due to security concerns inherent in the design of sprintf(3), it is highly recommended that you use snprintf(3) instead.
  4. 22 Aug, 2024 1 commit
    • Fix embeddings memory corruption (#6467) · 90ca8417
      Daniel Hiltgen authored
      * Fix embeddings memory corruption
      
      The patch was leading to a buffer overrun corruption. Once it was removed, though,
      parallelism in server.cpp led to hitting an assert due to slot/seq IDs being >= the
      token count. To work around this, only use slot 0 for embeddings.
      
      * Fix embed integration test assumption
      
      The token eval count has changed with recent llama.cpp bumps (0.3.5+)
  5. 11 Aug, 2024 1 commit
  6. 06 Aug, 2024 1 commit
  7. 05 Aug, 2024 1 commit
  8. 30 Jul, 2024 1 commit
  9. 29 Jul, 2024 1 commit
  10. 22 Jul, 2024 1 commit
    • Enable windows error dialog for subprocess startup · e12fff88
      Daniel Hiltgen authored
      Make sure that if something goes wrong spawning the process, the user gets
      enough information to try to self-correct, or at least file a bug with details
      so we can fix it. Once the process starts, we immediately change back to the
      recommended setting to prevent the blocking dialog. This ensures that if the
      model fails to load (OOM, unsupported model type, etc.) the process exits
      quickly and we can scan the subprocess's stdout/stderr for the reason to
      report via the API.
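      A sketch of the toggle described above, using SetErrorMode from kernel32; the helper name and surrounding structure are illustrative, not the project's actual code. Clearing the error mode before spawning lets Windows show a diagnostic dialog if the subprocess cannot start; restoring the quiet mode right afterwards keeps later failures non-blocking.

        //go:build windows

        package llm

        import (
            "os/exec"

            "golang.org/x/sys/windows"
        )

        const semFailCriticalErrors = 0x0001 // SEM_FAILCRITICALERRORS

        var (
            kernel32         = windows.NewLazySystemDLL("kernel32.dll")
            procSetErrorMode = kernel32.NewProc("SetErrorMode")
        )

        // startWithErrorDialog clears the process error mode so startup failures
        // surface as a visible dialog, then restores the recommended quiet mode
        // as soon as the subprocess has been launched.
        func startWithErrorDialog(cmd *exec.Cmd) error {
            prev, _, _ := procSetErrorMode.Call(0) // 0 = allow error dialogs
            err := cmd.Start()
            procSetErrorMode.Call(prev | semFailCriticalErrors) // back to quiet mode
            return err
        }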
  11. 15 Jul, 2024 1 commit
    • Introduce `/api/embed` endpoint supporting batch embedding (#5127) · b9f5e16c
      royjhan authored
      * Initial Batch Embedding
      * Revert "Initial Batch Embedding" (reverts commit c22d54895a280b54c727279d85a5fc94defb5a29)
      * Initial Draft
      * mock up notes
      * api/embed draft
      * add server function
      * check normalization
      * clean up
      * normalization
      * playing around with truncate stuff
      * Truncation
      * Truncation
      * move normalization to go
      * Integration Test Template
      * Truncation Integration Tests
      * Clean up
      * use float32
      * move normalize
      * move normalize test
      * refactoring
      * integration float32
      * input handling and handler testing
      * Refactoring of legacy and new
      * clear comments
      * merge conflicts
      * touches
      * embedding type 64
      * merge conflicts
      * fix hanging on single string
      * refactoring
      * test values
      * set context length
      * clean up
      * testing clean up
      * testing clean up
      * remove function closure
      * Revert "remove function closure" (reverts commit 55d48c6ed17abe42e7a122e69d603ef0c1506787)
      * remove function closure
      * remove redundant error check
      * clean up
      * more clean up
      * clean up
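      One piece of this change that is easy to show in isolation is the normalization the PR moves to the Go side: each returned embedding is scaled to unit (L2) length. The function below is an illustrative sketch, not the exact code from the PR.

        package api

        import "math"

        // normalize scales an embedding to unit L2 length; a zero vector is
        // returned as all zeros to avoid dividing by zero.
        func normalize(vec []float32) []float32 {
            var sum float64
            for _, v := range vec {
                sum += float64(v) * float64(v)
            }
            out := make([]float32, len(vec))
            if sum == 0 {
                return out
            }
            inv := float32(1 / math.Sqrt(sum))
            for i, v := range vec {
                out[i] = v * inv
            }
            return out
        }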
  12. 07 Jul, 2024 2 commits
  13. 05 Jul, 2024 1 commit
  14. 03 Jul, 2024 1 commit
  15. 29 Jun, 2024 1 commit
  16. 19 Jun, 2024 1 commit
  17. 14 Jun, 2024 1 commit
  18. 11 Jun, 2024 1 commit
  19. 09 Jun, 2024 1 commit
  20. 01 Jun, 2024 1 commit
  21. 29 May, 2024 3 commits
  22. 23 May, 2024 2 commits
  23. 20 May, 2024 1 commit
    • feat: add support for flash_attn (#4120) · e15307fd
      Sam authored
      * feat: enable flash attention if supported
      * feat: enable flash attention if supported
      * feat: enable flash attention if supported
      * feat: add flash_attn support
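      A hedged sketch of how such a switch could be wired up on the Go side: append a flash-attention flag to the llama.cpp server invocation only when the caller has determined it is supported. The flag name and helper are assumptions for illustration, not the project's actual code.

        package llm

        import "os/exec"

        // buildServerCmd shows one way to opt into flash attention when spawning
        // the llama.cpp server; it assumes the binary accepts a --flash-attn flag.
        func buildServerCmd(binary, modelPath string, flashAttn bool) *exec.Cmd {
            args := []string{"--model", modelPath}
            if flashAttn { // e.g. GPU and model both support it
                args = append(args, "--flash-attn")
            }
            return exec.Command(binary, args...)
        }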
  24. 09 May, 2024 1 commit
  25. 04 May, 2024 1 commit
  26. 30 Apr, 2024 3 commits
  27. 17 Apr, 2024 1 commit
  28. 16 Apr, 2024 1 commit
  29. 01 Apr, 2024 2 commits
    • Apply 01-cache.diff · 0a0e9f3e
      Daniel Hiltgen authored
    • Switch back to subprocessing for llama.cpp · 58d95cc9
      Daniel Hiltgen authored
      This should resolve a number of memory-leak and stability defects by allowing
      us to isolate llama.cpp in a separate process, shut it down when idle, and
      gracefully restart it if it has problems. This also serves as a first step
      toward running multiple copies to support multiple models concurrently.
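      A simplified sketch of the subprocess model described above (names and structure are illustrative, not the actual implementation): the llama.cpp server runs in its own process, is killed when the caller asks it to stop, and is restarted if it exits unexpectedly, so a crash in native code cannot take down the main process.

        package llm

        import (
            "log"
            "os/exec"
            "time"
        )

        // superviseServer keeps a llama.cpp server subprocess running until stop
        // is closed, restarting it with a short backoff if it exits on its own.
        func superviseServer(binary string, args []string, stop <-chan struct{}) {
            for {
                cmd := exec.Command(binary, args...)
                if err := cmd.Start(); err != nil {
                    log.Printf("failed to start %s: %v", binary, err)
                    return
                }
                done := make(chan error, 1)
                go func() { done <- cmd.Wait() }()
                select {
                case <-stop:
                    cmd.Process.Kill() // idle or shutting down
                    <-done             // reap the child before returning
                    return
                case err := <-done:
                    log.Printf("llama.cpp server exited: %v; restarting", err)
                    time.Sleep(time.Second) // brief backoff before restart
                }
            }
        }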
  30. 26 Mar, 2024 1 commit
  31. 23 Mar, 2024 1 commit
  32. 16 Mar, 2024 1 commit
  33. 12 Mar, 2024 1 commit