- 20 Nov, 2024 11 commits
-
-
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide whether it should defrag, but this heuristic isn't 100% accurate, so decoding can sometimes fail unexpectedly. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
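A minimal Go sketch of the retry-after-defrag idea described above; `decodeWithDefragRetry`, `errNoKVSpace`, and the callback shapes are hypothetical stand-ins, not the runner's real API.

```go
package runner

import "errors"

// errNoKVSpace is a hypothetical sentinel for "decode found no free KV cache slot".
var errNoKVSpace = errors.New("no KV cache space")

// decodeWithDefragRetry runs decode once; if it reports that the cache has no
// room (likely because it is fragmented), it defragments and retries a single
// time before giving up.
func decodeWithDefragRetry(decode func() error, defrag func()) error {
	err := decode()
	if errors.Is(err, errNoKVSpace) {
		defrag()
		err = decode()
	}
	return err
}
```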
-
Jesse Gross authored
This doesn't have any impact currently because NUM_PARALLEL is forced to 1 for embeddings, so both indices will always be 0.
-
Emir Sahin authored
-
Marcus Ziadé authored
-
thewh1teagle authored
-
Adarsh Mishra authored
-
rohitanshu authored
change 'containg' to 'containing'
-
Gordon Kamer authored
-
Jonathan Hecl authored
-
Daniel Hiltgen authored
Many model crashes are masked behind "An existing connection was forcibly closed by the remote host". This captures that common error message and wires in any detected errors from the log. This also adds the DeepSeek context shift error to the known errors we capture.
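Roughly how matching known error strings against captured runner logs can work; the package, function names, and map entries below are illustrative assumptions, not Ollama's actual error table.

```go
package logutil

import "strings"

// knownErrors maps substrings that show up in runner logs to friendlier
// explanations. The entries here are only examples.
var knownErrors = map[string]string{
	"out of memory":                      "the model ran out of memory",
	"unable to shift context":            "the model hit a context-shift limitation",
	"forcibly closed by the remote host": "the runner process crashed",
}

// explainCrash scans captured log lines for a known error and returns a
// human-readable summary, avoiding a round-trip asking the user for logs.
func explainCrash(logLines []string) (string, bool) {
	for _, line := range logLines {
		for needle, explanation := range knownErrors {
			if strings.Contains(line, needle) {
				return explanation, true
			}
		}
	}
	return "", false
}
```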
-
Daniel Hiltgen authored
Avoid a round-trip asking users for logs to see what went wrong.
-
- 19 Nov, 2024 5 commits
-
-
Gabe Goodhart authored
https://github.com/ollama/ollama/issues/7656
Branch: Granite3StoppingBug-7656
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
Blake Mizerany authored
This change allows mixed-case model names to be pushed, pulled, copied, and created. This was previously disallowed because the Ollama registry was backed by a Docker registry that enforced a naming convention forbidding mixed-case names, which is no longer the case. This does not break existing, intended behaviors. Also, make TestCase exercise a story of creating, updating, pulling, and copying a model with case variations, ensuring the model's manifest is updated correctly and not duplicated across different files with different case variations.
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Patrick Devine authored
-
Patrick Sy authored
-
- 18 Nov, 2024 5 commits
-
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Daniel Hiltgen authored
Enable both left and right click on the pop-up menu
-
Daniel Hiltgen authored
If the model doesn't fit any layers on Metal and we load zero layers, we would panic trying to look up the GPU size during scheduling ops.
-
Vinh Nguyen authored
-
Nicolas Bonamy authored
-
- 17 Nov, 2024 5 commits
-
-
Darius Kocar authored
-
Tushar Adhatrao authored
-
Vinh Nguyen authored
-
Jeffrey Morgan authored
-
Jeffrey Morgan authored
-
- 16 Nov, 2024 1 commit
-
-
Daniel Hiltgen authored
Follow up to #7217 - merge after release
-
- 15 Nov, 2024 3 commits
-
-
Jesse Gross authored
This is a partial revert of 8a35bb92 "runner.go: Increase survivability of main processing loop", removing the panic handler. Although we want to avoid errors taking down the runner, we should also make the user aware of problems when they happen. In the future, we can restructure things so both parts are true.
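For context, a sketch of the kind of recover-based panic handler this revert removes; the `withRecover` helper is hypothetical, not the runner's actual code.

```go
package runner

import "log"

// withRecover illustrates a panic handler around a processing step: it keeps
// a panic from killing the runner, but it also hides the failure from the
// user, which is why it is being removed for now.
func withRecover(step func()) {
	defer func() {
		if r := recover(); r != nil {
			log.Printf("recovered from panic in processing loop: %v", r)
		}
	}()
	step()
}
```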
-
Jesse Gross authored
Currently, if an error occurs during the prep stages (such as tokenizing) of a single request, it will only affect that request. However, if an error happens during decoding, it can take down the entire runner. Instead, it's better to drop the tokens that triggered the error and try to keep going. However, we also need to stop when we run out of tokens; otherwise, this just causes an infinite loop. This is likely the cause of at least some of the hanging issues that have been reported. Bug #7573
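A hedged sketch of the drop-and-continue loop with a hard stop once the input is consumed; `processTokens`, `decodeChunk`, and the chunking scheme are assumptions for illustration, not the runner's actual structure.

```go
package runner

import "log"

// processTokens drops a chunk of tokens when decoding it fails, instead of
// crashing the whole runner, but always consumes input so the loop cannot
// spin forever.
func processTokens(tokens []int, chunkSize int, decodeChunk func([]int) error) {
	for len(tokens) > 0 {
		n := min(chunkSize, len(tokens))
		if err := decodeChunk(tokens[:n]); err != nil {
			log.Printf("dropping %d tokens after decode error: %v", n, err)
		}
		tokens = tokens[n:] // always advance so the loop terminates
	}
}
```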
-
Daniel Hiltgen authored
Fix a rebase glitch from the old C++ runner build model
-
- 14 Nov, 2024 7 commits
-
-
Patrick Devine authored
-
Bruce MacDonald authored
- golang.org/x/sync v0.3.0 -> v0.9.0
- golang.org/x/image v0.14.0 -> v0.22.0
- golang.org/x/text v0.15.0 -> v0.20.0
-
Jesse Gross authored
It's possible to get prompts that consist entirely of whitespace - this is most likely to happen when generating embeddings. Currently, we will trim this away, leaving an empty prompt, which will then generate an error. Generating embeddings from whitespace should not trigger an error, as this may break pipelines. It's better to just leave the whitespace in place and process what we are given. This is consistent with past versions of Ollama. Bug #7578
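A small sketch of the trimming rule described above, assuming a hypothetical `normalizePrompt` helper; the real code path may differ.

```go
package runner

import "strings"

// normalizePrompt trims surrounding whitespace, but a prompt that is nothing
// except whitespace is returned unchanged so an embedding request made of
// whitespace still gets processed instead of erroring on an empty prompt.
func normalizePrompt(prompt string) string {
	trimmed := strings.TrimSpace(prompt)
	if trimmed == "" {
		return prompt
	}
	return trimmed
}
```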
-
Jesse Gross authored
NUM_PARALLEL is currently enforced by the Ollama server process - it will only issue requests to the runner if the maximum number of concurrent requests has not been exceeded. Although this should be sufficient, it is good for the runner to protect its own data structures. Currently, if too many requests get through to the runner, they will just get stuck and never return. This may help with reports of Ollama hanging, though it is unclear how it would actually occur. Bug #7573
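An illustrative sketch of a runner guarding its own slot table instead of trusting the server to enforce the limit; the `server` and `sequence` types and `addSequence` are simplified assumptions, not the actual runner types.

```go
package runner

import (
	"errors"
	"sync"
)

var errNoFreeSlot = errors.New("no free sequence slot")

// server owns a fixed number of sequence slots (one per allowed parallel
// request) and refuses extra work instead of letting requests queue up
// inside it and hang.
type server struct {
	mu   sync.Mutex
	seqs []*sequence // one entry per parallel slot; nil means the slot is free
}

type sequence struct{} // per-request state elided

// addSequence claims a free slot or returns an error so the caller can retry
// later rather than blocking forever.
func (s *server) addSequence(seq *sequence) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i, slot := range s.seqs {
		if slot == nil {
			s.seqs[i] = seq
			return nil
		}
	}
	return errNoFreeSlot
}
```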
-
Michael Yang authored
fix(mllama): sync backend between batches
-
Blake Mizerany authored
-
Michael Yang authored
-
- 12 Nov, 2024 3 commits
-
-
Jesse Gross authored
-
Daniel Hiltgen authored
It looks like 8 minutes isn't quite enough, and we're seeing sporadic timeouts.
-
Daniel Hiltgen authored
This adds support for the Jetson JetPack variants to the Go runner.
-