- 22 Nov, 2024 1 commit
-
-
Edwin.JH.Lee authored
-
- 21 Nov, 2024 28 commits
-
-
Elias authored
OrionChat is a free, web-based chat interface that simplifies interactions with multiple AI model providers, offering a unified platform for chatting with and exploring multiple large language models (LLMs).
-
湛露先生 authored
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
-
Jeffrey Morgan authored
-
R0CKSTAR authored
Signed-off-by: Xiaodong Ye <xiaodong.ye@mthreads.com>
-
Paul Robello authored
-
毛巳煜 authored
-
xuyangbocn authored
-
emrgnt-cmplxty authored
-
Cyril Blaecke authored
-
Christian Tzolov authored
-
Philippe Charrière authored
Parakeet is a Golang SDK for Ollama. Co-authored-by: Parth Sareen <parth.sareen@ollama.com>
-
Marcin Szczygliński authored
-
Michael authored
-
Jakub Burkiewicz authored
-
Dezoito authored
-
Franco Lombardo authored
-
Aarushi authored
-
Kevin Brake authored
-
chyok authored
-
Nico authored
-
Laurent Eschenauer authored
-
Andy Gill authored
Haverscript uses classical functional programming techniques to provide a composable interface for interacting with Ollama-hosted LLMs.
-
drunkwcodes authored
-
boessu authored
-
奶茶叔叔 authored
-
Alexander F. Rødseth authored
-
Nikita Ganzikov authored
-
Daniel Hiltgen authored
-
- 20 Nov, 2024 11 commits
-
-
Jesse Gross authored
Previous versions of the runner would truncate inputs to the context window before beginning processing. The main processing loop relied on this behavior if the context needed to be shifted later (due to token generation). If truncation did not occur, invariants would be broken, causing crashes or infinite loops. Later versions attempted to fix these bugs and make the logic less subtle so that all inputs could be handled. Truncation was removed to make things consistent. However, truncation is much faster than processing and shifting, so removing it caused performance problems when the input vastly exceeded the context size. This restores input truncation as a performance optimization while keeping the more robust processing logic. Fixes #7762
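A minimal sketch of the truncation fast path described above, not Ollama's actual runner code; `truncateInput`, `numCtx`, and `numKeep` are hypothetical names. The idea is to drop the middle of an over-long prompt up front, keeping the leading tokens (e.g. a system prompt) plus the most recent tokens, so the slower shift-as-you-go path never has to absorb the whole oversized input.

```go
package main

import "fmt"

// truncateInput is a hypothetical helper: if the prompt does not fit in the
// context window, keep the first numKeep tokens plus the newest tokens and
// drop the middle, instead of shifting the cache repeatedly during processing.
func truncateInput(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens // already fits; nothing to do
	}
	newest := tokens[len(tokens)-(numCtx-numKeep):]
	out := make([]int, 0, numCtx)
	out = append(out, tokens[:numKeep]...)
	out = append(out, newest...)
	return out
}

func main() {
	prompt := make([]int, 10000)
	fmt.Println(len(truncateInput(prompt, 2048, 4))) // 2048
}
```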
-
Jesse Gross authored
We need to track which tokens are in the cache ourselves. We currently add tokens to the cache tracker when we add them to the batch, but they are not actually in the cache until we call Decode. This can cause confusion when we are shifting the cache. Avoids "could not find a KV slot for the batch" issues. Bug #7545
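A minimal sketch, with hypothetical names, of the bookkeeping fix described above: tokens staged into a batch are held as pending and only counted as cached after Decode succeeds, so the tracker never reports tokens that are not yet in the KV cache.

```go
package main

import "fmt"

// kvTracker is a hypothetical stand-in for the runner's cache bookkeeping.
type kvTracker struct {
	cached  []int // tokens actually present in the KV cache
	pending []int // tokens added to the batch but not yet decoded
}

// addToBatch stages tokens without marking them as cached.
func (t *kvTracker) addToBatch(tokens []int) {
	t.pending = append(t.pending, tokens...)
}

// decode runs the supplied decode function; only on success do the pending
// tokens move into the cached list.
func (t *kvTracker) decode(decodeFn func([]int) error) error {
	if err := decodeFn(t.pending); err != nil {
		t.pending = t.pending[:0] // decode failed; nothing entered the cache
		return err
	}
	t.cached = append(t.cached, t.pending...)
	t.pending = t.pending[:0]
	return nil
}

func main() {
	t := &kvTracker{}
	t.addToBatch([]int{1, 2, 3})
	fmt.Println(len(t.cached)) // 0: batched but not yet decoded
	_ = t.decode(func([]int) error { return nil })
	fmt.Println(len(t.cached)) // 3: committed after a successful Decode
}
```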
-
Jesse Gross authored
We try to recover from errors by dropping the tokens that caused the problem and retrying. However, dropping the tokens is not correct and continuing often leads to infinite loops. To avoid this, we end the sequence if such a condition is detected, which is also surprising. At this point, it is better to just report the error. This will make it easier to find problems, and the alternatives are perhaps even more surprising to users. This is not a very satisfactory solution either; we should isolate the error and return it to the user without killing the whole process. However, this is an incremental step and consistent with most other failures (which either manifest as abort() or panic).
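A minimal sketch, with hypothetical names, of the behaviour change described above: rather than dropping the offending tokens and retrying (or quietly ending the sequence), the processing step now propagates the decode error to the caller.

```go
package main

import (
	"errors"
	"fmt"
)

// processBatch is a hypothetical processing step: any decode failure is
// reported directly instead of being masked by dropping tokens and retrying.
func processBatch(decode func() error) error {
	if err := decode(); err != nil {
		return fmt.Errorf("failed to decode batch: %w", err)
	}
	return nil
}

func main() {
	err := processBatch(func() error { return errors.New("decode failed") })
	fmt.Println(err) // failed to decode batch: decode failed
}
```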
-
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail unexpectedly. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
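A minimal sketch, with hypothetical helpers rather than Ollama's actual API, of the retry described above: when a decode reports that no KV slot is available, defragment the cache once and attempt the batch again before giving up.

```go
package main

import (
	"errors"
	"fmt"
)

// errNoKVSlot stands in for the "could not find a KV slot for the batch" error.
var errNoKVSlot = errors.New("no KV slot available for the batch")

// decodeWithDefrag retries a failed decode once after defragmenting the cache.
func decodeWithDefrag(decode func() error, defrag func()) error {
	err := decode()
	if errors.Is(err, errNoKVSlot) {
		defrag()        // compact the fragmented KV cache
		return decode() // single retry after defragmentation
	}
	return err
}

func main() {
	attempts := 0
	err := decodeWithDefrag(
		func() error {
			attempts++
			if attempts == 1 {
				return errNoKVSlot // first attempt fails due to fragmentation
			}
			return nil
		},
		func() { fmt.Println("defragmenting KV cache") },
	)
	fmt.Println(err) // <nil>
}
```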
-
Jesse Gross authored
This doesn't have any impact currently because NUM_PARALLEL is forced to 1 for embeddings, so both indices will always be 0.
-
Emir Sahin authored
-
Marcus Ziadé authored
-
thewh1teagle authored
-
Adarsh Mishra authored
-
rohitanshu authored
change 'containg' to 'containing'
-
Gordon Kamer authored
-