- 20 Nov, 2024 14 commits
-
-
Jesse Gross authored
Previous versions of the runner would truncate inputs to the context window before beginning processing. The main processing loop relied on this behavior if the context needed to be shifted later (due to token generation). If truncation did not occur then invariants would be broken, causing crashes or infinite loops. Later versions attempted to fix these bugs and make the logic less subtle so that all inputs could be handled. Truncation was removed to make things consistent. However, truncation is much faster than processing and shifting, so removing it caused performance problems when the input vastly exceeded the context size. This restores the input truncation as a performance optimization while keeping the more robust processing logic. Fixes #7762
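As a rough illustration of the truncation described above, the following sketch keeps a prefix of the input and the most recent tokens so the result fits the context window. The names truncateInput, numCtx and numKeep are assumptions made for this example; it is not the runner's actual code.

```go
package main

import "fmt"

// truncateInput is a hypothetical helper, not the runner's real function.
// It keeps the first numKeep tokens and the most recent tokens so that the
// result has at most numCtx entries.
func truncateInput(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens // already fits, nothing to discard
	}
	if numKeep < 0 {
		numKeep = 0
	}
	if numKeep > numCtx {
		numKeep = numCtx
	}
	discard := len(tokens) - numCtx

	// Copy into a fresh slice so the caller's input is left untouched.
	out := make([]int, 0, numCtx)
	out = append(out, tokens[:numKeep]...)
	out = append(out, tokens[numKeep+discard:]...)
	return out
}

func main() {
	tokens := []int{0, 1, 2, 3, 4, 5, 6, 7, 8, 9}
	fmt.Println(truncateInput(tokens, 6, 2))
}
```

Dropping tokens up front this way is a single slice operation, whereas processing the full input and shifting the context repeatedly costs a decode per batch.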
-
Jesse Gross authored
We need to track which tokens are in the cache ourselves. We currently add tokens to the cache tracker when we add them to batch but they are not actually in the cache until we call Decode. This can cause confusion when we are shifting the cache. Avoids "could not find a KV slot for the batch" issues. Bug #7545
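One way to picture the ordering issue is the sketch below, where tokens are recorded as cached only after the decode call has succeeded. The cacheSlot type, decodeAndTrack and the decode callback are hypothetical names for illustration and do not come from the runner.

```go
package main

import (
	"errors"
	"fmt"
)

// errDecode is a stand-in error used only by this example.
var errDecode = errors.New("decode failed")

// cacheSlot is a hypothetical tracker of the tokens believed to be in the
// KV cache for one sequence.
type cacheSlot struct {
	inputs []int
}

// decodeAndTrack records the batch in the tracker only once decode has
// succeeded, so the tracker never claims tokens that are not in the cache.
func decodeAndTrack(slot *cacheSlot, batch []int, decode func([]int) error) error {
	if err := decode(batch); err != nil {
		return err // tracker is left unchanged on failure
	}
	slot.inputs = append(slot.inputs, batch...)
	return nil
}

func main() {
	slot := &cacheSlot{}
	failing := func([]int) error { return errDecode }
	working := func([]int) error { return nil }

	fmt.Println(decodeAndTrack(slot, []int{1, 2}, failing), len(slot.inputs))
	fmt.Println(decodeAndTrack(slot, []int{1, 2}, working), len(slot.inputs))
}
```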
-
Jesse Gross authored
We try to recover from errors by dropping the tokens that caused the problem and re-trying. However, dropping the tokens is not correct and continuing often leads to infinite loops. To avoid this, we end the sequence if such a condition is detected, which is also surprising. At this point, it is better to just report the error. This will make it easier to find problems and the alternatives are perhaps even more surprising to users. This is not a very satisfactory solution either - we should isolate the error and return it to the user without killing the whole process. However, this is an incremental step and consistent with most other failures (which either manifest as abort() or panic).
-
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail by surprise. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
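The defrag-and-retry behavior can be sketched roughly as follows. The kvContext interface, the errKvCacheFull sentinel and the fakeContext type are assumptions invented for this example; the real runner talks to llama.cpp and its types differ.

```go
package main

import (
	"errors"
	"fmt"
)

// errKvCacheFull is a hypothetical sentinel meaning decode found no free
// KV cache space for the batch.
var errKvCacheFull = errors.New("could not find a KV slot for the batch")

// kvContext is an assumed minimal interface for this sketch.
type kvContext interface {
	Decode(batch []int) error
	Defrag()
}

// decodeWithDefrag tries to decode once; if the cache reports no space, it
// defragments and retries exactly once before giving up.
func decodeWithDefrag(ctx kvContext, batch []int) error {
	err := ctx.Decode(batch)
	if errors.Is(err, errKvCacheFull) {
		ctx.Defrag()
		err = ctx.Decode(batch)
	}
	return err
}

// fakeContext simulates a fragmented cache that works after one defrag.
type fakeContext struct {
	fragmented bool
	defrags    int
}

func (f *fakeContext) Decode(batch []int) error {
	if f.fragmented {
		return errKvCacheFull
	}
	return nil
}

func (f *fakeContext) Defrag() {
	f.fragmented = false
	f.defrags++
}

func main() {
	ctx := &fakeContext{fragmented: true}
	err := decodeWithDefrag(ctx, []int{1, 2, 3})
	fmt.Println(err, ctx.defrags)
}
```

Retrying only once keeps the failure path bounded: if the cache is genuinely full rather than fragmented, the second error is returned to the caller.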
-
Jesse Gross authored
This doesn't have any impact currently because NUM_PARALLEL is forced to 1 for embeddings, so both indices will always be 0.
-
Emir Sahin authored
-
Marcus Ziadé authored
-
thewh1teagle authored
-
Adarsh Mishra authored
-
rohitanshu authored
change 'containg' to 'containing'
-
Gordon Kamer authored
-
Jonathan Hecl authored
-
Daniel Hiltgen authored
Many model crashes are masked behind "An existing connection was forcibly closed by the remote host". This captures that common error message and wires in any detected errors from the log. This also adds the deepseek context shift error to the known errors we capture.
-
Daniel Hiltgen authored
Avoid a round-trip asking users for logs to see what went wrong.
-
- 19 Nov, 2024 5 commits
-
-
Gabe Goodhart authored
https://github.com/ollama/ollama/issues/7656 Branch: Granite3StoppingBug-7656 Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Blake Mizerany authored
This change allows mixed-case model names to be pushed, pulled, copied, and created. This was previously disallowed because the Ollama registry was backed by a Docker registry that enforced a naming convention that disallowed mixed-case names, which is no longer the case. This does not break existing, intended behaviors. Also, make TestCase test a story of creating, updating, pulling, and copying a model with case variations, ensuring the model's manifest is updated correctly and not duplicated across different files with different case variations.
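To illustrate the "not duplicated across different files" point, here is a minimal sketch of a case-insensitive lookup that reuses an existing name when one differs only by case. The resolveName function and the flat list of names are assumptions for this example; the actual change works on manifest files and is more involved.

```go
package main

import (
	"fmt"
	"strings"
)

// resolveName is a hypothetical helper: if a stored name matches the
// requested one ignoring case, the stored spelling is returned so that the
// existing entry is updated instead of a second one being created.
func resolveName(existing []string, requested string) string {
	for _, name := range existing {
		if strings.EqualFold(name, requested) {
			return name
		}
	}
	return requested
}

func main() {
	stored := []string{"library/Llama3", "library/mistral"}
	fmt.Println(resolveName(stored, "library/llama3"))
	fmt.Println(resolveName(stored, "library/Phi3"))
}
```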
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Patrick Devine authored
-
Patrick Sy authored
-
- 18 Nov, 2024 5 commits
-
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Daniel Hiltgen authored
Enable both left and right click on the pop-up menu
-
Daniel Hiltgen authored
If the model doesn't fit any layers on Metal and we load zero layers, we would panic trying to look up the GPU size during scheduling ops.
-
Vinh Nguyen authored
-
Nicolas Bonamy authored
-
- 17 Nov, 2024 5 commits
-
-
Darius Kocar authored
-
Tushar Adhatrao authored
-
Vinh Nguyen authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 16 Nov, 2024 1 commit
-
-
Daniel Hiltgen authored
Follow up to #7217 - merge after release
-
- 15 Nov, 2024 3 commits
-
-
Jesse Gross authored
This is a partial revert of 8a35bb92 "runner.go: Increase survivability of main processing loop", removing the panic handler. Although we want to avoid errors taking down the runner, we also should make the user aware of problems when they happen. In the future, we can restructure things so both parts are true.
-
Jesse Gross authored
Currently, if an error occurs during the prep stages (such as tokenizing) of a single request, it will only affect that request. However, if an error happens during decoding, it can take down the entire runner. Instead, it's better to drop the tokens that triggered the error and try to keep going. However, we also need to stop when we run out of tokens, otherwise, this just causes an infinite loop. This is likely the cause of at least some of the hanging issues that have been reported. Bug #7573
-
Daniel Hiltgen authored
Fix a rebase glitch from the old C++ runner build model
-
- 14 Nov, 2024 7 commits
-
-
Patrick Devine authored
-
Bruce MacDonald authored
- golang.org/x/sync v0.3.0 -> v0.9.0 - golang.org/x/image v0.14.0 -> v0.22.0 - golang.org/x/text v0.15.0 -> v0.20.0
-
Jesse Gross authored
It's possible to get prompts that consist entirely of whitespace - this is most likely to happen when generating embeddings. Currently, we will trim this away, leaving an empty prompt, which will then generate an error. Generating embeddings from whitespace should not trigger an error, as this may break pipelines. It's better to just leave the whitespace in place and process what we are given. This is consistent with past versions of Ollama. Bug #7578
-
Jesse Gross authored
NUM_PARALLEL is currently enforced by the Ollama server process - it will only issue requests to the runner if the maximum number of concurrent requests has not been exceeded. Although this should be sufficient, it is good for the runner to protect its own data structures. Currently, if too many requests get through to the runner, they will just get stuck and never return. This may help with reports of Ollama hanging, though it is unclear how it would actually occur. Bug #7573
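A runner-side limit of this kind could look roughly like the sketch below, which uses a buffered channel as a counting semaphore. The limiter type and runAll function are assumptions made for illustration; the actual commit may use a different mechanism.

```go
package main

import (
	"fmt"
	"sync"
)

// limiter is a hypothetical counting semaphore built on a buffered channel.
type limiter struct {
	slots chan struct{}
}

func newLimiter(n int) *limiter { return &limiter{slots: make(chan struct{}, n)} }

// acquire blocks until one of the n slots is free.
func (l *limiter) acquire() { l.slots <- struct{}{} }

// release frees a slot taken by acquire.
func (l *limiter) release() { <-l.slots }

// runAll starts the given number of simulated requests, admits at most
// numParallel at a time, and returns the peak concurrency observed.
func runAll(numParallel, requests int) int {
	lim := newLimiter(numParallel)
	var mu sync.Mutex
	var wg sync.WaitGroup
	active, peak := 0, 0

	for i := 0; i < requests; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			lim.acquire()
			defer lim.release()

			mu.Lock()
			active++
			if active > peak {
				peak = active
			}
			mu.Unlock()

			// Request processing would happen here.

			mu.Lock()
			active--
			mu.Unlock()
		}()
	}
	wg.Wait()
	return peak
}

func main() {
	fmt.Println("peak concurrency:", runAll(2, 16))
}
```

With the limit held inside the runner, excess requests wait for a slot instead of reaching data structures sized for numParallel sequences.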
-
Michael Yang authored
fix(mllama): sync backend between batches
-
Blake Mizerany authored
-
Michael Yang authored
-